AWS Invests in Habana AI Training Chips
Amazon Web Services (AWS) has invested in Habana Gaudi AI training chips for its cloud offering, a big win for the Intel-owned startup. During the keynote speech at AWS’ re:Invent conference, CEO Andy Jassy also announced that the company has developed its own AI training chip, Trainium.
Cloud providers have so far been cautious about investing in third-party chips with new compute architectures for AI acceleration, preferring to develop their own specialized processors instead (Google TPU, Baidu Kunlun, Alibaba Hanguang, AWS Inferentia). The exceptions are the Graphcore chips available in Microsoft’s Azure cloud (though these are prioritized for customers who are “pushing the boundaries of machine learning”) and the Groq accelerators available through cloud service provider Nimbix (for “selected customers” only). Today’s news therefore makes Habana’s Gaudi instances at AWS the biggest rollout of a brand-new third-party compute architecture in the cloud to date.
Jassy said in his keynote that AWS’ aim is to provide a better price-performance option for AI training workloads than GPUs, and that Habana Gaudi accelerators are key to delivering that cost reduction for customers. In AWS’ internal tests, Habana Gaudi-based EC2 instances delivered up to 40% better price-performance than current GPU-based EC2 instances on AI workloads, he said.
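The 40% figure is a price-performance claim (training throughput per dollar of instance cost), not a raw speed comparison. A minimal sketch of how such a comparison can be computed, using entirely hypothetical throughput and pricing numbers (none of these figures come from AWS):

```python
# Price-performance = training throughput per dollar of instance cost.
# All numbers below are hypothetical placeholders, not AWS figures.

def price_performance(images_per_sec: float, usd_per_hour: float) -> float:
    """Images trained per dollar of compute (images/sec * sec/hour / $/hour)."""
    return images_per_sec * 3600 / usd_per_hour

# Hypothetical: same hourly price, 40% more throughput on the Gaudi instance.
gpu_pp = price_performance(images_per_sec=10_000, usd_per_hour=30.0)
gaudi_pp = price_performance(images_per_sec=14_000, usd_per_hour=30.0)

improvement = (gaudi_pp - gpu_pp) / gpu_pp * 100
print(f"Gaudi price-performance advantage: {improvement:.0f}%")  # prints 40%
```

Note that the same 40% advantage could equally come from lower hourly pricing at equal throughput, or any combination of the two.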
Habana Labs, now part of Intel, is an Israeli startup based in Tel Aviv. The company’s Gaudi AI training chip, launched in 2019, has eight VLIW SIMD (very long instruction word, single instruction multiple data) vector processor cores, which it calls tensor processor cores (TPC), and 32GB of HBM2 (second-generation high bandwidth memory). It also has on-chip RoCE (remote direct memory access over Converged Ethernet) communications for scaling to very large systems.
AWS will offer Gaudi-based EC2 instances in the first half of 2021. Each 8-card Gaudi EC2 instance can process about 12,000 images per second when training ResNet-50 on TensorFlow.
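As a quick sanity check on that headline number, the quoted instance throughput works out to roughly 1,500 images per second per card, assuming perfectly linear scaling across the eight Gaudi cards (real multi-card scaling is rarely exactly linear, so this is a back-of-the-envelope figure only):

```python
# Back-of-the-envelope: split the quoted instance-level throughput across cards.
# Assumes perfectly linear scaling over the 8 Gaudi cards, which is an idealization.
instance_throughput = 12_000  # images/sec, ResNet-50 training on TensorFlow (AWS figure)
cards = 8
per_card = instance_throughput / cards
print(f"~{per_card:.0f} images/sec per Gaudi card")  # prints ~1500 images/sec per Gaudi card
```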
A next-generation 7-nm version of Gaudi is currently in the works, according to Habana.
Jassy also announced AWS’ own training chip, Trainium, in a bid to push AI training costs below even what Habana hardware can offer. While he gave away little detail, Jassy hinted that AWS plans to offer Trainium instances at the lowest cost in the cloud for training, with each instance delivering a hefty teraflops figure.
This is the second custom AI accelerator chip AWS has designed, joining Inferentia, which is designed for AI inference workloads. Inferentia is based on custom Neuron cores with substantial on-chip memory. Trainium AI training chips will use the same Neuron software development kit (SDK) as Inferentia.
Trainium instances will be available in 2021.