Nemotron Super (and LLMs) for Dummies: Part 2 - How it's built?

The Heavy Machinery: LLM Training at Scale

A curious Software Engineer, trying to understand how stuff works!

If Part 1 was about the "Brain" (the math and logic) of LLMs, this blog is about the "Body" - the Infrastructure that enables them.

This is a layered cake, from the bare-metal boxes all the way up to the software stack. Training a 120B parameter giant like Nemotron-3-Super isn't just a coding task; it’s an orchestration of massive power, heat, and high-speed data.

So let’s grab that coffee and start peeling back the layers of the "Super" infrastructure.

The Metal - GPUs, Memory and Networks!

A GPU is like a factory worker whose job is to perform huge matrix multiplications.

A single Blackwell B200 GPU (or any other GPU) has 3 main components: the Streaming Multiprocessors (SMs), the on-chip memory local to each SM (SRAM; think of it as the L1 cache of a CPU) and the High Bandwidth Memory (HBM).

The SMs are the compute, and the HBM is the world's most expensive real estate today! That's right: GPU memory.

Let's do some math here! A 120 billion parameter model like Nemotron-3-Super in FP16 precision would require 120,000,000,000 parameters x 2 bytes per parameter = 240 GB!

Yes, you are right to spot that a single Blackwell B200 with 192 GB of HBM isn't sufficient to even load the full model for inference. And when it's time for training, the same memory has to hold the model weights, the gradients, the optimizer states and the data. So yes, it's obvious that we need more GPUs, but how many? Well, that's a tradeoff between time and ... money! The more GPUs you have, the faster you can complete the training runs. Just to set context, we are talking about time on the order of weeks to train a single LLM.
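To make the arithmetic above concrete, here is a small back-of-the-envelope sketch. The 16 bytes per parameter used for training is an assumption (a common rule of thumb for mixed-precision Adam: FP16 weights and gradients plus FP32 master weights, momentum and variance), and it ignores activations and data batches entirely, so treat the results as a lower bound rather than a real sizing.

```python
import math

PARAMS = 120_000_000_000   # 120B parameters, as in the text
HBM_PER_GPU_GB = 192       # B200 HBM capacity quoted in the text
GB = 1e9                   # decimal gigabytes, matching the 240 GB figure


def memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory needed to hold `params` values of the given width."""
    return params * bytes_per_param / GB


def min_gpus(total_gb: float, hbm_gb: float = HBM_PER_GPU_GB) -> int:
    """Smallest number of GPUs whose combined HBM can hold `total_gb`."""
    return math.ceil(total_gb / hbm_gb)


# Inference: FP16 weights only (2 bytes per parameter).
inference_gb = memory_gb(PARAMS, 2)

# Training (ASSUMPTION): ~16 bytes per parameter for mixed-precision Adam:
# 2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 master weights, momentum, variance).
training_gb = memory_gb(PARAMS, 16)

print(f"inference: {inference_gb:.0f} GB -> at least {min_gpus(inference_gb)} GPUs")
print(f"training:  {training_gb:.0f} GB -> at least {min_gpus(training_gb)} GPUs")
```

In practice the GPU count is far higher than this floor, because activations need memory too and because more GPUs shorten the run.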

For reference, here's what a B200 chip looks like:

But of course hyperscalers aren't betting on AI with a single GPU, but with thousands of them! Nemotron was trained on a 1K GPU cluster. For everyday tools like ChatGPT and Claude, serving millions of queries for millions of customers makes the GPU count skyrocket even for inference alone.

Okay, so I think we have made peace with the fact that we need a lot of GPUs and that we need to distribute our training jobs across them. Before we dive deep into how that can be achieved on a GPU cluster, we need to think about one last thing: communication between GPUs!

A typical node with a single GPU and CPU uses PCIe interconnects to move data between the CPU and the GPU (stuff like batching and preprocessing happens on the CPU, and the data is then forwarded to the GPU for computation). That's actually very slow, because PCIe bandwidth is usually in the ~32-64 GB/s range and we need to move TBs of training data between CPU and GPU, between multiple GPUs on the same node, and between GPUs on different nodes. NVIDIA has dedicated hardware to solve these bandwidth problems! Let's understand it.

a. NVLink (Intra-Node)
When you have 8 GPUs inside a single box (like a DGX B200), they shouldn't have to talk to each other through the CPU or the slow PCIe lanes. That’s like calling someone in the next room by routing the call through a satellite.

  • The Tech: NVLink is a direct, point-to-point connection between GPUs.

  • The Speed: In the Blackwell (B200) era, a single GPU has 1.8 TB/s of NVLink bandwidth.

  • The Philosophy: This is so fast that the GPUs can treat each other's HBM (memory) as if it were their own. This is known as Memory Pooling. If GPU #1 needs a piece of the model sitting on GPU #8, it just "grabs" it instantly.

b. NVSwitch (Intra-Node)
If you have 8 GPUs, you don't want to just connect them in a circle (where GPU 1 has to talk through GPU 2 and 3 to reach GPU 4). You want a dedicated "Traffic Controller."

  • The Role: NVSwitch is a physical chip inside the server that allows every GPU to talk to every other GPU simultaneously at full speed.

  • The Benefit: It prevents "traffic jams." Whether GPU 1 is talking to GPU 2 or GPU 8, the speed is the same. This is what makes a single box of 8 GPUs act like one giant, 1.5 TB-memory monster.

c. The InfiniBand Network (Inter-Node)
Now, what happens when you need to connect thousands of these boxes together across a data center? NVLink doesn't stretch that far. This is where we leave the "building" and enter the "city-wide" network. (For background, see "Chapter 1. Introduction to InfiniBand and RDMA" in the Red Hat Documentation on docs.redhat.com.)

  • The Tech: InfiniBand (IB) is the gold standard for AI networking. Unlike standard Ethernet (which is designed for reliability over the messy internet), InfiniBand is designed for extreme low latency and Remote Direct Memory Access (RDMA).

  • RDMA (The Magic Trick): Remote Direct Memory Access (RDMA) allows a GPU in Box A to read data directly from the memory of a GPU in Box B without asking the CPU for permission. It’s like teleporting ingredients from a pantry in Miami to a kitchen in New York without any paperwork.
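To get a feel for why these links matter, here is a rough sketch of how long it would take to move the 240 GB of FP16 weights over each of them. The PCIe and NVLink figures come from the text above; the InfiniBand number is an assumption (a single 400 Gb/s port, i.e. about 50 GB/s), and real transfers never hit these peak rates.

```python
PAYLOAD_GB = 240  # FP16 weights of a 120B parameter model, from the text

# Peak bandwidths in GB/s. Only the InfiniBand entry is an ASSUMPTION
# (one 400 Gb/s port ~= 50 GB/s); the others are the figures quoted above.
LINKS_GB_PER_S = {
    "PCIe (upper end of the quoted range)": 64,
    "NVLink on a B200": 1800,
    "InfiniBand (assumed 400 Gb/s port)": 50,
}


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return payload_gb / bandwidth_gb_per_s


for name, bandwidth in LINKS_GB_PER_S.items():
    seconds = transfer_seconds(PAYLOAD_GB, bandwidth)
    print(f"{name:40s} {seconds:6.2f} s")
```

A few seconds per transfer sounds harmless until you remember that training repeats these exchanges at every step, for weeks.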

With all this light-speed communication equipment, we now have an AI Factory that is ready to ship intelligence. Here's a visual of that factory!

3-D Parallelism - The division of Labour!

Alright! Now let's get back to our original problem: one model, many GPUs. How do we pull this off? There is more than one way to do it. Together these ways are called 3-D Parallelism, and in a full-scale AI Factory all of them are employed at the same time to get the most out of those GPUs. Let's dive deep into this.

  1. Data Parallelism (DP)
    This is the most intuitive and simplest way of breaking a task into smaller jobs. Imagine a pre-training run with 25 trillion tokens or an SFT run with 7 million chat samples: we divide the chapters that Nemo needs to read across different GPU groups. Let's say we have 1k GPUs split into 4 groups, each holding an entire copy of the model. We divide the data into four 25% shards and let each group calculate its weight updates (gradients) independently, after which the gradients are averaged across the groups so that every copy applies the same update to the model weights.

  2. Tensor Parallelism (TP)
    Things get interesting here. Imagine an attention computation during training or inference with a 1M-token context window (N = 1,000,000). The "Attention Matrix" (Q * K^T) is N x N: that's a trillion numbers, roughly 2 TB even in FP16, which is far too big for a single GPU. So instead of dividing the data, we split the math itself into sub-problems, such as tiles of the attention matrix (500k columns on GPU A and the other 500k columns on GPU B). The two GPUs compute attention on their tiles, after which the partial results are combined. Hmmm, did we miss something? YES! Attention has to be computed over all the input tokens, not just half of them, i.e. GPU A needs GPU B's share of 500k tokens as well, and vice versa. This is where those high-speed inter-GPU NVLinks come into the picture: GPU B sends its "Key" and "Value" tensors to GPU A, and GPU A sends its own to GPU B. It’s like two researchers writing a massive report. Researcher A reads the first half of the library, Researcher B reads the second. They then swap their notes (the "tiles") to make sure their final conclusions account for the entire library. (Strictly speaking, splitting along the token dimension like this is what Megatron-Core calls Context Parallelism; classic Tensor Parallelism slices each layer's weight matrices, for example the attention heads, across GPUs. The idea of splitting the math rather than the data is the same.)

  3. Pipeline Parallelism (PP)
    Any LLM consists of tens to hundreds of layers, such as Transformer layers or MoE layers. Pipeline Parallelism divides those layers across GPUs, similar to an assembly line in a factory. Each stage computes its layers and then sends its output over to the next stage in the pipeline.
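Two of the ideas in this list can be checked numerically on a laptop. The NumPy sketch below is a toy, not how any training library implements it: the first part shows that averaging gradients from equal data shards gives the same result as one big batch (Data Parallelism), and the second shows that attention computed over two Key/Value halves can be merged into the exact full result (the tile splitting described in item 2). All sizes and names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Data Parallelism: average the gradients of equal data shards ---
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)


def mse_grad(Xs, ys, w):
    """Gradient of the mean squared error of a linear model on one shard."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)


shard_grads = [mse_grad(Xs, ys, w) for Xs, ys in zip(np.split(X, 4), np.split(y, 4))]
dp_grad = np.mean(shard_grads, axis=0)  # what an averaging all-reduce would produce
full_grad = mse_grad(X, y, w)           # single-"GPU" reference


# --- Splitting attention: each "GPU" holds half of the keys and values ---
def full_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V


def partial_attention(Q, K_blk, V_blk):
    """Unnormalized attention over one K/V block, plus the stats needed to merge."""
    scores = Q @ K_blk.T / np.sqrt(Q.shape[1])
    m = scores.max(axis=1, keepdims=True)  # per-row max, for numerical stability
    p = np.exp(scores - m)
    return p @ V_blk, p.sum(axis=1, keepdims=True), m


def merge(parts):
    """Rescale each block to a common max, then normalize once."""
    m_all = np.max([m for _, _, m in parts], axis=0)
    numerator = sum(out * np.exp(m - m_all) for out, _, m in parts)
    denominator = sum(s * np.exp(m - m_all) for _, s, m in parts)
    return numerator / denominator


Q, K, V = rng.normal(size=(3, 6, 4))  # 6 tokens, head dimension 4
(K_a, K_b), (V_a, V_b) = np.split(K, 2), np.split(V, 2)
split_out = merge([partial_attention(Q, K_a, V_a), partial_attention(Q, K_b, V_b)])

print("DP gradient matches full batch:", np.allclose(dp_grad, full_grad))
print("Split attention matches full:  ", np.allclose(split_out, full_attention(Q, K, V)))
```

The merge step is the reason the GPUs have to exchange data: neither half can normalize its softmax without knowing what the other half saw.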

The 3-D Parallelism mechanics allow us to distribute the model and the work across our factory workers, i.e. the GPUs, but there are nuances worth noting. Tensor Parallelism requires very frequent movement of intermediate matrices across GPUs, so to avoid wasting time and compute cycles it is usually done between GPUs on a single node, where they can leverage the 1.8 TB/s bandwidth of NVLink. Pipeline Parallelism moves each stage's outputs (activations) to the next stage, and hence is usually divided among nodes under the same InfiniBand switch to reduce the latency of those transfers. The biggest division of them all, Data Parallelism, can afford some delays because its groups work independently between synchronizations, and thus it is usually spread across nodes. Here's a demonstration of 3-D Parallelism in our AI Factory.
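The placement rule in the paragraph above can be sketched as a simple rank layout. This is an illustrative mapping with hypothetical sizes (8 GPUs per node, TP=8, PP=4, DP=4), not necessarily the ordering any particular library uses by default: tensor-parallel ranks vary fastest so that each TP group lands inside one node, pipeline stages come next, and data-parallel replicas are the farthest apart.

```python
GPUS_PER_NODE = 8
TP, PP, DP = 8, 4, 4          # hypothetical sizes for illustration
WORLD_SIZE = TP * PP * DP     # 128 GPUs in this toy cluster


def coords(rank: int) -> tuple:
    """Map a global rank to (dp, pp, tp), with TP varying fastest."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


def node_of(rank: int) -> int:
    """Index of the physical node (box) that hosts this rank."""
    return rank // GPUS_PER_NODE


# Collect the ranks of every tensor-parallel group (same dp and pp coordinates).
tp_groups = {}
for rank in range(WORLD_SIZE):
    dp, pp, _ = coords(rank)
    tp_groups.setdefault((dp, pp), []).append(rank)

# Every TP group should sit on exactly one node, i.e. on NVLink.
nodes_per_tp_group = {key: {node_of(r) for r in ranks} for key, ranks in tp_groups.items()}
all_tp_groups_on_one_node = all(len(nodes) == 1 for nodes in nodes_per_tp_group.values())

print("rank 0  ->", coords(0))
print("rank 8  ->", coords(8))
print("rank 32 ->", coords(32))
print("every TP group fits in one node:", all_tp_groups_on_one_node)
```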

In theory this division of labour sounds good, but if you are breaking a sweat thinking about how you will even begin to divide stuff up when training your own models, I have some good news! YOU DON'T HAVE TO.

Popular training libraries like [torchtrain](https://github.com/Thijsvanede/torch-train) or NVIDIA's Megatron-Core do the heavy lifting for you and divide the training using the paradigms mentioned above. Here's a sample code snippet using Megatron-Core with PyTorch:

import torch
from megatron.core import parallel_state

# Define the "Shape" of your training cluster
TP_SIZE = 8   # Slice each layer across 8 GPUs (Inside one box)
PP_SIZE = 4   # Stack layers across 4 groups of GPUs (The Assembly Line)
CP_SIZE = 2   # Context Parallelism (To handle that 1M token window)

# Initialize the distributed environment
torch.distributed.init_process_group(backend="nccl")

# The "Magic" Command: builds the TP/PP/CP process groups that Megatron-Core's layers use to shard the model
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=TP_SIZE,
    pipeline_model_parallel_size=PP_SIZE,
    context_parallel_size=CP_SIZE
)

print(f"Factory Ready! Tensor Parallel: {TP_SIZE}, Pipeline Parallel: {PP_SIZE}")

A simple model.forward() abstracts away the decision-making under the hood! The library knows exactly which GPU needs which slice of the math.

But if curiosity has got the better of you, here's a little sneak peek at how this is handled at the library level, which brings CUDA kernels, NCCL, NVLinks, GPUs and PyTorch to the same table!
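One thing the snippet leaves implicit is how many data-parallel replicas you end up with. As a rough sketch (the helper name and the 1024-GPU world size are assumptions; the text only says "1K GPUs"), the number of GPUs must be divisible by TP x PP x CP, and whatever is left over becomes the data-parallel dimension:

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Replicas left for data parallelism once TP, PP and CP are fixed."""
    gpus_per_replica = tp * pp * cp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP*CP = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# ASSUMPTION: a 1024-GPU cluster with the TP/PP/CP sizes from the snippet above.
dp_size = data_parallel_size(world_size=1024, tp=8, pp=4, cp=2)
print(f"Each model replica spans {8 * 4 * 2} GPUs; data-parallel replicas: {dp_size}")
```

Scripts like this are normally started with one process per GPU (for example via torchrun), which is what lets init_process_group discover the world size in the first place.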

Let’s trace a single operation: A 120B MoE Layer Forward Pass.

1. The Handover (PyTorch to Megatron)

When you hit forward(), Megatron-Core intercepts it. It knows that for a Latent MoE layer, it can't just do a standard multiplication. It looks at its "Blueprint" and sees that the experts are sharded across 8 GPUs.

2. The Dispatch (NCCL "Teleportation")

The library calls NCCL (pronounced "Nickel"). This is the magic part. NCCL looks at the hardware and says:

"Okay, I see these 8 GPUs are connected via NVLink. I’m going to use the ncclEpDispatch primitive to teleport these tokens to the correct 'Expert' GPU without waking up the CPU."

3. The Heavy Lifting (CUDA Kernels)

Once the data arrives in the GPU's HBM, the CUDA kernels take over. They use specialized "Expert GEMM" kernels. These kernels are hard-coded to handle 4-bit math at extreme speeds, keeping the SMs (The Chefs) at 100% capacity.

4. The Sync (All-Reduce)

Once the "Expert" GPUs finish their math, they have partial answers. NCCL kicks in again for an All-Reduce operation. It uses the high-speed NVLinks to sum up all the answers across the GPUs and put the final result back into the main "Assembly Line."
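The four steps above can be mimicked on a CPU with NumPy. This is a deliberately simplified toy, not Megatron-Core's or NCCL's implementation: it assumes top-1 routing, one expert per "GPU", plain matrix multiplies instead of 4-bit kernels, and all sizes are invented. It only shows the data flow: route, dispatch, compute, then sum the partial answers.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, HIDDEN, NUM_TOKENS = 8, 16, 32  # toy sizes: one expert per "GPU"

tokens = rng.normal(size=(NUM_TOKENS, HIDDEN))
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))
expert_w = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN))

# 1. Handover: the router picks one expert per token (top-1 routing, a simplification).
expert_ids = (tokens @ router_w).argmax(axis=1)

# 2. Dispatch: group token indices by the expert (GPU) that owns them.
inboxes = {e: np.flatnonzero(expert_ids == e) for e in range(NUM_EXPERTS)}

# 3. Heavy lifting: each expert multiplies only the tokens it received.
partials = []
for e, idx in inboxes.items():
    out = np.zeros_like(tokens)          # zeros everywhere except this expert's tokens
    out[idx] = tokens[idx] @ expert_w[e]
    partials.append(out)

# 4. Sync: summing the partial outputs is what a sum all-reduce would compute.
combined = np.sum(partials, axis=0)

# Reference: process every token with its chosen expert, one at a time.
reference = np.stack([tokens[i] @ expert_w[expert_ids[i]] for i in range(NUM_TOKENS)])
print("combined output matches reference:", np.allclose(combined, reference))
```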

On paper our Factory looks perfect to start producing tokens at light speed! But hey! Something just slowed down our factory. Let's take a look at what happened.

It turns out one of the nodes with 8 GPUs has suddenly shut down. This has stopped the attention computation of layer number 16 in our model pipeline, which in turn has stopped the entire pipeline! How do we deal with this? What are we missing?

That's right! A MANAGER for our factory! So let's move ahead and hire a manager to keep things running.

Orchestration - Final Piece of the Puzzle!

Ray is the unified orchestration framework that acts as the "Manager" of our AI factory. While libraries like Megatron-Core handle the shredding of the model, Ray handles the survival of the cluster.

Let me try to explain Ray and its components in an intuitive way.

Ray uses a hierarchical design that separates the "Control Plane" (management) from the "Data Plane" (execution).

At its most basic level, a Ray cluster consists of a Head Node and multiple Worker Nodes.

The Head Node (The Brain): The head node is the central coordinator. While it can run tasks like any other node, its primary job is to manage the cluster's state using the following components.

  • Global Control Store (GCS): This is a key-value store (optionally backed by Redis for fault tolerance) that keeps track of everything happening in the cluster: where every object is located, which nodes are alive, and which tasks are running.

  • Autoscaler: It monitors the resource demands of your application. If you suddenly submit 100 tasks that each need a GPU, the autoscaler talks to your cloud provider (AWS/GCP/Azure) to spin up more worker nodes.

  • API Server: Provides an interface for the Ray Dashboard and Job submissions.

Worker Nodes (The Muscle): Worker nodes are where your actual Python code executes. Every worker node contains several key processes:

  • Raylet: This is the local "manager" of the node. It has two main parts:

      • Local Scheduler: Decides which task to run on which CPU/GPU within that specific node. If the local node is full, it talks to other Raylets to "spill over" tasks.

      • Object Manager: Handles the movement of data between nodes.

  • Workers: These are separate Python processes that actually execute your @ray.remote tasks and actors.

The Distributed Object Store (Shared Memory)
This is Ray's "secret sauce" for performance. Instead of sending huge chunks of data back and forth over the network (which is slow), Ray uses Shared Memory (Plasma).

  • Zero-Copy Reading: On a single node, multiple worker processes can read the same large data object (like a 10GB NumPy array) simultaneously without copying it into their own local memory.

  • Object Directory: The GCS keeps a map of which node holds which object. If Task A on Node 1 needs an object created by Task B on Node 2, the Object Managers handle the transfer automatically.

The Driver
The Driver is the program you actually run (e.g., your script or Jupyter Notebook). It connects to the Head Node, submits tasks, and receives the results.

The most crucial bottleneck that Ray addresses in AI training infra is the movement of TBs of data across GPUs and nodes. It does so by memory-mapping objects at addressable offsets and sharing references to those objects instead of the objects themselves. Let me simplify that!

In a traditional distributed system, sharing a 100GB dataset with 8 workers on a single node requires the system to pickle and copy that data into 8 separate memory spaces, ballooning your RAM usage to 800GB and stalling the CPU with heavy serialization.

Ray's architecture replaces this "copying" with "mapping." It stores the 100GB dataset exactly once in the Plasma Object Store and gives each worker an Object Reference (ObjectRef). Instead of receiving a local copy, the worker uses the reference to map the object's memory address into its own space (mmap; see Jimmy Lee's Medium article "How does memory-mapping (mmap) work?"), jumping directly to specific data points via offsets in the underlying Apache Arrow format. This allows 8 workers to process the same 100GB of data simultaneously while consuming only 100GB of physical RAM, effectively achieving zero-copy performance that stays constant regardless of how many workers are reading the data.
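The "mapping instead of copying" idea can be tried out with nothing but the Python standard library and NumPy. To be clear, this is an analogy and not Ray's Plasma implementation: the object is written once into a shared memory block, each "worker" receives only a tiny reference (block name, shape, dtype) and maps the same bytes instead of receiving a pickled copy. For simplicity the workers run as plain function calls here; in a real setup each would be a separate process attaching by name in the same way.

```python
import numpy as np
from multiprocessing import shared_memory

data = np.arange(1_000_000, dtype=np.int64)

# "Object store": write the object exactly once into a shared memory block.
store = shared_memory.SharedMemory(create=True, size=data.nbytes)
stored = np.ndarray(data.shape, dtype=data.dtype, buffer=store.buf)
stored[:] = data

# The "object reference" handed to workers: a few bytes, not the data itself.
object_ref = (store.name, data.shape, data.dtype.str)


def worker_sum(ref, start, stop):
    """Attach to the existing block by name and read a slice without copying it."""
    name, shape, dtype = ref
    shm = shared_memory.SharedMemory(name=name)            # maps the same memory
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # zero-copy view
    total = int(view[start:stop].sum())
    del view      # release the buffer before closing the mapping
    shm.close()
    return total


# Four "workers" each process a quarter of the same stored object.
chunk = len(data) // 4
totals = [worker_sum(object_ref, i * chunk, (i + 1) * chunk) for i in range(4)]
print("sum via shared views:", sum(totals), "| direct sum:", int(data.sum()))

# Clean up: drop our view, then close and free the shared block.
del stored
store.close()
store.unlink()
```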

Ray also takes care of scheduling training jobs based on their resource requirements and handles faults in the system by replacing bad nodes, pausing & resuming training jobs and periodically checkpointing data such as model weights to ensure recoverability in case of catastrophic failure.
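The checkpoint-and-resume part is easy to picture with a toy training loop. The sketch below is not Ray's mechanism; it is a plain-Python illustration with made-up names, where the "model" is a single number, a checkpoint is written atomically every few steps, a node failure is simulated, and the restarted job picks up from the last checkpoint instead of from step 0.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file first, then rename, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, every: int, crash_at=None) -> dict:
    """Toy loop: each step adds 0.5 to the 'weight'; checkpoints every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == crash_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        state = {"step": step, "weight": state["weight"] + 0.5}
        if step % every == 0:
            save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=10, every=3, crash_at=8)   # dies before finishing step 8
except RuntimeError as err:
    print("job failed:", err)

resumed_from = load_checkpoint(ckpt)["step"]           # last checkpoint before the crash
final = train(ckpt, total_steps=10, every=3)           # the "manager" restarts the job
print(f"resumed from step {resumed_from}, finished at step {final['step']}")
```

The work done between the last checkpoint and the crash is lost, which is why the checkpoint interval is itself a tradeoff between I/O cost and wasted compute.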

Our completed AI Factory now looks something like this!

Infrastructure for RL Training

RL increases the complexity of training considerably, as it involves model rollouts (inferencing), weight updates and reward calculations across different environments! So let's take some time to discuss how RL is scaled on a 1k GPU cluster, with NemoRL, [Nemo Gym](https://docs.nvidia.com/nemo/gym/latest), Ray and Async GRPO (yes, the same RL training algorithm we discussed in Part 1) in focus.

An important part of the RL training ecosystem is managing the lifecycle of sandbox environments. Ray and Nemo Gym make this easy for you. Each sandbox environment, such as a code repository with its necessary runtime, is launched as a Nemo Gym instance on a Ray Actor. The model interacts with these environments and performs tasks. Ray's orchestration makes housekeeping easier by replacing broken environments (maybe because the model deleted the code repo or triggered an infinite loop) with fresh spin-ups.

Async GRPO, implemented through the NemoRL library on a Ray cluster, is one of the coolest things that the Nemotron trainers pulled off to improve training efficiency. This is how it works in a nutshell.

In standard RL, the model updates its weights and then waits for all the sandboxes to finish their new rollouts (a rollout is basically taking the latest model and launching it for inference on an engine such as vLLM or TensorRT-LLM) before it can calculate the next weight update. Async GRPO (Group Relative Policy Optimization) removes the "wait" by treating these as two separate jobs.

NemoRL splits the entire Ray cluster into two groups.

A group of Ray Actors (running on a bunch of GPU nodes) works as Inference Workers, running inference engines such as vLLM or TensorRT-LLM. These expose OpenAI-compatible HTTP endpoints that the NemoGym sandboxes can call to get model responses and calculate rewards.

Once the sandboxes are ready with their results after multi-turn interactions with the model (served by the Inference Workers), they write those results into the Replay Buffer. Think of it as a queue that our actors use for communication.

The second group of actors, working as Learners, holds the complete model for weight updates using the 3-D parallelism we talked about earlier. These actors take the results (trajectories with their rewards) from the Replay Buffer and update the model weights.

But this design introduces a critical flaw in the training pipeline. Imagine a sandbox environment is slow, and while it's still using version 1 of the model to run inference, the Learners have already updated the weights to version 3. Training on data generated by such a stale policy can destabilize learning and make the model forget important things over the course of training. The Nemotron-3-Super report specifically mentions that the NemoRL lib makes sure that no inference worker is running more than 1 version behind the learners. How is that possible while the training loop is running, you might ask?

In the NeMo RL config, there is a critical parameter often called max_staleness. By default, this is set to 1.

  • The Rule: The system literally refuses to start a new training step if the data in the Replay Buffer is more than N versions old (N being the max_staleness value).

  • The Enforcement: The Ray Central Controller acts as a traffic cop. It monitors the "Version ID" of every trajectory sitting in the buffer. If the Learners reach Version 10, but the buffer is still full of Version 8 data, the Learner pauses. It waits until the Inference Workers (vLLM) catch up and deliver fresh Version 9 or 10 data.
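The rule and its enforcement can be sketched in a few lines of plain Python. This is not NeMo RL code: the class and method names are hypothetical, and discarding stale trajectories outright is an assumption (the text only says the Learner waits). The point is the gate itself: a batch is handed out only when enough fresh-enough trajectories exist.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy_version: int   # model version that generated this rollout
    reward: float


class ReplayBuffer:
    """Hypothetical buffer that only serves trajectories within `max_staleness` versions."""

    def __init__(self, max_staleness: int = 1):
        self.max_staleness = max_staleness
        self._items = deque()

    def add(self, trajectory: Trajectory) -> None:
        self._items.append(trajectory)

    def sample_fresh(self, learner_version: int, batch_size: int):
        """Return a batch, or None if the learner has to wait for fresher data."""
        # ASSUMPTION: trajectories that are too old are simply dropped.
        self._items = deque(
            t for t in self._items
            if learner_version - t.policy_version <= self.max_staleness
        )
        if len(self._items) < batch_size:
            return None
        return [self._items.popleft() for _ in range(batch_size)]


buffer = ReplayBuffer(max_staleness=1)
buffer.add(Trajectory(policy_version=8, reward=0.2))
buffer.add(Trajectory(policy_version=8, reward=0.9))

# Learner is at version 10, buffer only holds version 8 data: it must pause.
first_try = buffer.sample_fresh(learner_version=10, batch_size=2)

# Inference workers catch up and deliver version 9 rollouts: training resumes.
buffer.add(Trajectory(policy_version=9, reward=0.4))
buffer.add(Trajectory(policy_version=9, reward=0.7))
second_try = buffer.sample_fresh(learner_version=10, batch_size=2)

print("first attempt:", first_try)
print("second attempt versions:", [t.policy_version for t in second_try])
```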

Here's a visualization to clear things up a bit.

As Master Oogway once said: Failure is not an exception; it is an inevitable phenomenon! (I actually don't know if he really said it, but he was wise enough to know this).

In our high-tech factory, we can have the most beautiful code in the world, but eventually, we have to deal with the physical world. We are running thousands of GPUs and miles of InfiniBand network fabrics at 100% capacity, 24/7. These rigs generate enough heat to boil about 6,000 liters of water every hour, and when you push hardware that hard, things will break. It’s not a question of if, but when.

Whether it’s a GPU smoking out, a network switch flaking, or a software library hitting a weird edge case, our assembly line is going to face interruptions. Incorporating Fault Tolerance through the Ray controller (to make sure dead jobs are replaced) and async checkpointing to secondary storage layers like S3 is the most practical way to guard against failures.

Wrapping Up!

We have come a long way, from the intuition behind an LLM's math to how these ideas are brought to life with massive infrastructure engineering! Squeezing the last ounce of work out of every single GPU is the ultimate goal of these training and inference infrastructures.

Yes! These are complex and expensive systems to build for "just exploration", but I hope this conversation helps uncover the basics of LLM infra and makes us appreciate the amount of work done to generate a single token the next time we ask ChatGPT to count the 'r's in Strawberry!