Tell me more about these chips
Ganesh Venkataramanan, former senior director of Autopilot hardware, presenting the D1 training tile at Tesla’s 2021 AI Day.
Image Credits: Tesla/screenshot of streamed event
Tesla shares Apple's view that hardware and software should be designed to work together. That's why Tesla is working to move away from standard GPU hardware and design its own chips to power Dojo.
Tesla unveiled its D1 chip, a silicon square the size of a palm, at AI Day in 2021. The D1 has been in production since at least May of this year. Taiwan Semiconductor Manufacturing Company (TSMC) manufactures the chips on a 7-nanometer process node. The D1 has 50 billion transistors and a large die size of 645 square millimeters, according to Tesla. All of which is to say that the D1 promises to be extremely powerful and efficient, and to handle complex tasks quickly.
“We can do compute and data transfers simultaneously, and our custom ISA, which is the instruction set architecture, is fully optimized for machine learning workloads,” said Ganesh Venkataramanan, former senior director of Autopilot hardware, at Tesla’s 2021 AI Day. “This is a pure machine learning machine.”
The D1 is still not as powerful as Nvidia's A100 chip, though, which TSMC also manufactures on a 7-nanometer process. The A100 packs 54 billion transistors onto an 826-square-millimeter die, giving it a slight performance edge over Tesla's D1.
To achieve higher bandwidth and compute power, Tesla's AI team fused 25 D1 chips into a single tile that functions as a unified computer system. Each tile has a compute power of 9 petaflops and 36 terabytes per second of bandwidth, and contains all the hardware necessary for power, cooling and data transfer. You can think of the tile as a self-sufficient computer made up of 25 smaller computers. Six of those tiles make up one rack, and two racks make up a cabinet. Ten cabinets make up an ExaPOD. At AI Day 2022, Tesla said Dojo would scale by deploying multiple ExaPODs. All of this together makes up the supercomputer.
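The scaling described above is simple multiplication. A quick sketch, using the tile figures Tesla has quoted (the per-ExaPOD totals here are derived from those figures, not separately confirmed specs), shows how the hierarchy adds up:

```python
# Dojo scaling hierarchy, per Tesla's AI Day figures.
CHIPS_PER_TILE = 25
TILES_PER_RACK = 6
RACKS_PER_CABINET = 2
CABINETS_PER_EXAPOD = 10

TILE_PETAFLOPS = 9  # compute per tile, per Tesla

tiles_per_exapod = TILES_PER_RACK * RACKS_PER_CABINET * CABINETS_PER_EXAPOD
chips_per_exapod = tiles_per_exapod * CHIPS_PER_TILE
exapod_exaflops = tiles_per_exapod * TILE_PETAFLOPS / 1000  # petaflops -> exaflops

print(tiles_per_exapod)   # 120 tiles
print(chips_per_exapod)   # 3000 D1 chips
print(exapod_exaflops)    # 1.08 exaflops per ExaPOD
```

So one ExaPOD works out to 3,000 D1 chips and roughly 1.1 exaflops of compute, which is why Tesla talks about scaling Dojo in ExaPOD increments.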
Tesla is also working on a next-gen D2 chip that aims to solve information flow bottlenecks. Instead of connecting the individual chips, the D2 would put the entire Dojo tile onto a single wafer of silicon.
Tesla hasn’t confirmed how many D1 chips it has ordered or expects to receive. The company also hasn’t provided a timeline for how long it will take to get Dojo supercomputers running on D1 chips.
In response to a June post on X that said, "Elon is building a giant GPU cooler in Texas," Musk replied that Tesla was aiming for "half Tesla AI hardware, half Nvidia/other" over the next 18 months or so. The "other" could be AMD chips, per Musk's comment in January.
What does Dojo mean for Tesla?
Tesla’s humanoid robot Optimus Prime II at WAIC in Shanghai, China, on July 7, 2024.
Image Credits: Costfoto/NurPhoto via Getty Images
Taking control of its own chip production means that Tesla might one day be able to quickly add large amounts of compute power to AI training programs at a low cost, particularly as Tesla and TSMC scale up chip production.
It also means that Tesla may not have to rely on Nvidia’s chips in the future, which are increasingly expensive and hard to secure.
During Tesla’s second-quarter earnings call, Musk said that demand for Nvidia hardware is “so high that it’s often difficult to get the GPUs.” He said he was “quite concerned about actually being able to get steady GPUs when we want them, and I think this therefore requires that we put a lot more effort on Dojo in order to ensure that we’ve got the training capability that we need.”
That said, Tesla is still buying Nvidia chips today to train its AI. In June, Musk posted on X:
“Of the roughly $10B in AI-related expenditures I said Tesla would make this year, about half is internal, primarily the Tesla-designed AI inference computer and sensors present in all of our cars, plus Dojo. For building the AI training superclusters, Nvidia hardware is about 2/3 of the cost. My current best guess for Nvidia purchases by Tesla are $3B to $4B this year.”
Inference compute refers to the AI computations performed by Tesla cars in real time, and is separate from the training compute that Dojo is responsible for.
Dojo is a risky bet, one that Musk has hedged several times by saying that Tesla might not succeed.
In the long run, Tesla could theoretically create a new business model based on its AI division. Musk has said that the first version of Dojo will be tailored for Tesla computer vision labeling and training, which is great for FSD and for training Optimus, Tesla's humanoid robot. But it wouldn't be useful for much else.
Musk has said that future versions of Dojo will be more tailored to general purpose AI training. One potential problem with that is that almost all AI software out there has been written to work with GPUs. Using Dojo to train general purpose AI models would require rewriting the software.
That is, unless Tesla rents out its compute, similar to how AWS and Azure rent out cloud computing capabilities. Musk also noted during Q2 earnings that he sees “a path to being competitive with Nvidia with Dojo.”
A September 2023 report from Morgan Stanley predicted that Dojo could add $500 billion to Tesla's market value by unlocking new revenue streams in the form of robotaxis and software services.
In short, Dojo’s chips are an insurance policy for the automaker, but one that could pay dividends.
How far along is Dojo?
Nvidia CEO Jen-Hsun Huang and Tesla CEO Elon Musk at the GPU Technology Conference in San Jose, California.
Image Credits: Kim Kulish/Corbis via Getty Images
Reuters reported last year that Tesla began production on Dojo in July 2023, but a June 2023 post from Musk suggested that Dojo had been "online and running useful tasks for a few months."
Around the same time, Tesla said it expected Dojo to be one of the top five most powerful supercomputers by February 2024 — a claim that has not been publicly confirmed, leaving us doubtful that it happened.
The company also said it expects Dojo's total compute to reach 100 exaflops in October 2024. (One exaflop is equal to 1 quintillion computer operations per second. To reach 100 exaflops, assuming that one D1 can achieve 362 teraflops, Tesla would need more than 276,000 D1s, or around 320,500 Nvidia A100 GPUs.)
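Those chip counts follow directly from the flops arithmetic. As a sanity check, here is the calculation with Tesla's 362-teraflop D1 figure; the roughly 312-teraflop figure for the A100 is our assumption, consistent with the counts quoted above:

```python
import math

TARGET_FLOPS = 100e18   # 100 exaflops = 100 quintillion ops/second
D1_FLOPS = 362e12       # 362 teraflops per D1, per Tesla
A100_FLOPS = 312e12     # ~312 teraflops per Nvidia A100 (assumed figure)

# Number of chips needed to hit the target, rounding up to whole chips.
d1_needed = math.ceil(TARGET_FLOPS / D1_FLOPS)
a100_needed = math.ceil(TARGET_FLOPS / A100_FLOPS)

print(d1_needed)    # 276244 -> "more than 276,000 D1s"
print(a100_needed)  # 320513 -> "around 320,500 A100s"
```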
Tesla also pledged in January 2024 to spend $500 million to build a Dojo supercomputer at its gigafactory in Buffalo, New York.
In May 2024, Musk noted that the rear portion of Tesla's Austin gigafactory will be reserved for a "super dense, water-cooled supercomputer cluster."
Just after Tesla's second-quarter earnings call, Musk posted on X that the automaker's AI team is using the Tesla HW4 AI computer (renamed AI4), the hardware that lives on Tesla vehicles, in the training loop alongside Nvidia GPUs. He noted that the breakdown is roughly 90,000 Nvidia H100s plus 40,000 AI4 computers.
“And Dojo 1 will have roughly 8k H100-equivalent of training online by end of year,” he continued. “Not massive, but not trivial either.”