With Nvidia Groq 3, the Era of AI Inference Is (Probably) Here

Jensen Huang unveiled a new chip based on tech purchased from Groq

This week, over 30,000 people are descending upon San Jose, Calif., to attend Nvidia GTC, the so-called Superbowl of AI (a nickname that may or may not have been coined by Nvidia). At the main event, Jensen Huang, Nvidia CEO, took the stage to announce (among other things) a new line of next-generation Vera Rubin chips that represent a first for the GPU giant: a chip designed specifically to handle AI inference. The Nvidia Groq 3 language processing unit (LPU) incorporates intellectual property Nvidia licensed from the start-up Groq last Christmas Eve for US $20 billion.

“Finally, AI is able to do productive work, and therefore the inflection point of inference has arrived,” Huang told the crowd. “AI now has to think. In order to think, it has to inference. AI now has to do; in order to do, it has to inference.”

Training and inference tasks have distinct computational requirements. While training can be done on huge amounts of data at the same time and can take weeks, inference must be run on a user’s query when it comes in. Unlike training, inference doesn’t require running costly backpropagation. With inference, the most important thing is low latency: users expect the chatbot to answer quickly, and for thinking or reasoning models, inference runs many times before the user even sees an output.

Over the past few years, inference-specific chip start-ups have experienced a sort of Cambrian explosion, with different companies exploring distinct approaches to speed up the task. The start-ups include d-Matrix with digital in-memory compute, Etched with an ASIC for transformer inference, RainAI with neuromorphic chips, EnCharge with analog in-memory compute, Tensordyne with logarithmic math to make AI computations more efficient, FuriosaAI with hardware optimized for tensor operations rather than vector-matrix multiplication, and others.

Late last year, it looked like Nvidia had picked one of the winners among the crop of inference chips when it announced its deal with Groq. The Nvidia Groq 3 LPU reveal came a mere two and a half months later, highlighting the urgency of the growing inference market.

Memory bandwidth and data flow

Groq’s approach to accelerating inference relies on interleaving processing units with memory units on the chip. Instead of relying on high-bandwidth memory (HBM) situated next to GPUs, it leans on SRAM memory integrated within the processor itself. This design greatly simplifies the flow of data through the chip, allowing it to proceed in a streamlined, linear fashion.

“The data actually flows directly through the SRAM,” Mark Heaps said at the Supercomputing conference in 2024. Heaps was a chief technology evangelist at Groq at the time and is now director of developer marketing at Nvidia. “When you look at a multi-core GPU, a lot of the instruction commands need to be sent off the chip, to get into memory and then come back in. We don’t have that. It all passes through in a linear order.”

Using SRAM allows that linear data flow to happen exceptionally fast, leading to the low latency required for inference applications. “The LPU is optimized strictly for that extreme low latency token generation,” says Ian Buck, VP and general manager of hyperscale and high-performance computing at Nvidia.

Comparing the Rubin GPU and Groq 3 LPU side by side highlights the difference. The Rubin GPU has access to a whopping 288 gigabytes of HBM and is capable of 50 quadrillion floating-point operations per second (50 petaFLOPS) of 4-bit computation. The Groq 3 LPU contains a mere 500 megabytes of SRAM memory and is capable of 1.2 petaFLOPS of 8-bit computation. On the other hand, while the Rubin GPU has a memory bandwidth of 22 terabytes per second, at 150 TB/s the Groq 3 LPU is roughly seven times as fast. The lean, speed-focused design is what allows the LPU to excel at inference.
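The trade-off in those figures is easier to see with a little arithmetic. The short Python sketch below uses only the numbers quoted above; the `ChipSpec` class and its method names are illustrative assumptions, not anything published by Nvidia. It computes the bandwidth ratio, how long each chip would need to read its entire memory once, and how many peak operations each chip can perform per byte it moves. Note that the two peak-compute figures are quoted at different precisions (4-bit versus 8-bit), so the last comparison is only a rough indication.

```python
from dataclasses import dataclass

GB = 1e9       # bytes
TB = 1e12      # bytes
PFLOPS = 1e15  # floating-point operations per second


@dataclass(frozen=True)
class ChipSpec:
    """Hypothetical container for the headline figures quoted in the article."""
    name: str
    memory_bytes: float
    bandwidth_bytes_per_s: float
    peak_flops: float  # quoted at different precisions per chip

    def full_sweep_seconds(self) -> float:
        """Time to read the entire on-package memory once at peak bandwidth."""
        return self.memory_bytes / self.bandwidth_bytes_per_s

    def flops_per_byte(self) -> float:
        """Peak operations available per byte of memory traffic."""
        return self.peak_flops / self.bandwidth_bytes_per_s


rubin = ChipSpec("Rubin GPU", 288 * GB, 22 * TB, 50 * PFLOPS)    # 4-bit peak
groq3 = ChipSpec("Groq 3 LPU", 0.5 * GB, 150 * TB, 1.2 * PFLOPS)  # 8-bit peak

bandwidth_ratio = groq3.bandwidth_bytes_per_s / rubin.bandwidth_bytes_per_s

for chip in (rubin, groq3):
    print(
        f"{chip.name}: full memory sweep {chip.full_sweep_seconds() * 1e6:.2f} us, "
        f"{chip.flops_per_byte():.1f} peak ops per byte moved"
    )
print(f"LPU bandwidth / GPU bandwidth = {bandwidth_ratio:.2f}")
```

On these numbers the GPU offers thousands of peak operations per byte moved, while the LPU offers about eight. That fits the article's framing: the LPU is built for work that is limited by how fast data can be moved rather than by raw compute.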

The new inference chip underscores the ongoing trend of AI adoption, which shifts the computational load from just building ever bigger models to actually using those models at scale. “NVIDIA’s announcement validates the importance of SRAM-based architectures for large-scale inference, and no one has pushed SRAM density further than d-Matrix,” says d-Matrix CEO Sid Sheth. He’s betting that data center customers will want a variety of processors for inference. “The winning systems will combine different types of silicon and fit easily into existing data centers alongside GPUs.”

Inference-only chips may not be the only solution. Late last week, Amazon Web Services said that it will deploy a new kind of inferencing system in its data centers. The system is a combination of AWS’ Trainium AI accelerator and Cerebras Systems’ third-generation computer, the CS-3, which is built around the largest single chip ever made. The two-part system is meant to take advantage of a technique called inference disaggregation. It separates inference into two parts: processing the prompt, called prefill, and generating the output, called decode. Prefill is inherently parallel, computationally intensive, and doesn’t need much memory bandwidth. Decode, by contrast, is a more serial process that needs a lot of memory bandwidth. Cerebras has addressed the memory bandwidth issue by building 44 GB of SRAM onto its chip, connected by a 21 PB/s network.
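To make the prefill/decode split concrete, here is a toy Python sketch. It is a schematic under stated assumptions, not how AWS, Cerebras, or Nvidia implement disaggregation: the `KVCache` class, the function names, and the arithmetic standing in for a model are all hypothetical. What it shows is the structural difference between the phases. Prefill sees the whole prompt at once, so its work can be done in parallel. Decode must produce one token at a time, and each step re-reads everything stored so far, which is why memory bandwidth dominates.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for the state a model keeps about tokens seen so far."""
    entries: list[int] = field(default_factory=list)


def prefill(prompt_tokens: list[int]) -> KVCache:
    """Process the whole prompt in one pass.

    All prompt tokens are known up front, so real hardware can handle them
    in parallel. In a disaggregated system this phase would run on the
    compute-heavy processor.
    """
    return KVCache(entries=list(prompt_tokens))


def decode(cache: KVCache, max_new_tokens: int) -> tuple[list[int], int]:
    """Generate tokens one at a time, counting cache reads along the way.

    Each step depends on the previous one, so the loop is serial. In a
    disaggregated system this phase would run on the bandwidth-heavy processor.
    """
    output: list[int] = []
    cache_reads = 0
    for _ in range(max_new_tokens):
        cache_reads += len(cache.entries)        # every step re-reads the whole cache
        next_token = sum(cache.entries) % 1000   # placeholder for a real model
        cache.entries.append(next_token)
        output.append(next_token)
    return output, cache_reads


# The cache built by prefill is handed over to the decode phase.
cache = prefill([1, 2, 3])
tokens, reads = decode(cache, max_new_tokens=2)
print(tokens, reads)
```

The handoff of `cache` between the two calls is the point where a disaggregated system would move state from one kind of processor to the other.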

Nvidia, too, intends to take advantage of inference disaggregation in its new, combined compute tray called the Nvidia Groq 3 LPX. Each tray will house 8 Groq 3 LPUs and a Vera Rubin, which pairs Rubin GPUs with a Vera CPU. The prefill and the more computationally intensive parts of the decode are done on Vera Rubin, while the final part is done on the Groq 3 LPU, leveraging the strengths of each chip. “We’re in volume production now,” Huang said.
