Nemotron Super (and LLMs) for Dummies: Part 1 - How it thinks?
The "Brain" (Architecture, Logic, and Reasoning)
I have been using LLMs and AI agents every day for my day-to-day work for quite some time now, and I am sure that, just like me, most of us are hooked on this natural, conversational style of getting stuff done. But have you ever wondered how these behemoth models are actually trained to think? So did I, and voilà! NVIDIA dropped a gift to the community by open-sourcing their most capable model, nemotron-3-super. That's right: the data, the model, the training recipes and the training infrastructure, practically everything you need to make one for yourself! (If you are a GPU collector, of course!)
So I’ve spent a few days digging through the Nemotron-3-Super technical report to distill exactly how this "Super" architecture works. I'm breaking my findings into this two-part conversation:
Part 1: The Math (How it Thinks) – We’ll pull back the curtain on the logic, the hybrid "Mamba-Transformer" brain, and the 25-trillion-token diet.
Part 2: The Engineering (How it is Built) – We’ll look at the actual factory floor, from the NeMo framework to the training loops that make it all run.
If you were ever curious about LLMs, or are looking to spin up your own flavors of LLMs or if you just wanted to dive deep into this ocean of matrix multiplications, I hope this conversation helps!
PreTraining - Nemo learns to Read & Write!
The first thing any "language" model needs to do is learn to read and write, and it does so by gulping a huge corpus of text from across the planet. The model's objective in pre-training is simple: given a sequence of tokens, predict the next token. If you think about it, that is what happens behind the scenes when you chat with ChatGPT or Gemini: tokens are predicted one by one.
Let's visualize this with an example! Picture a bright sunny day: you have your morning coffee and you decide to learn filmmaking (because who doesn't?), so you throw in the query
hey Nemo!, how can I make a movie with my iPhone?
Behind the scenes, this query is broken down into a sequence of tokens that might look something like this.
`[<'hey', token_id: 1234>, <'Nemo', token_id: 345>, ...., <'iPhone', token_id: 12313>, <'?', token_id: 22>]`
What is a token ID, you might ask? There's a near-100% chance that a word you throw at the model was already seen during pre-training and is in its vocabulary (rarer words are split into smaller sub-word pieces that are); the ID is simply the index of that token in the model's dictionary. Each token ID is then mapped to an embedding (a vector of floating-point numbers) so that it becomes a viable input to the model. There are several tokenization mechanisms, but Byte Pair Encoding (BPE) is a popular choice for LLMs because of its efficiency.
Okay, back to our example! At this point Nemo has no clue about filmmaking, but it carries a memory of everything it read during pre-training; there might have been articles or Reddit threads about "filmmaking" or "filmmaking on iPhone" in the training data. Nemo draws on the patterns it absorbed from that text to predict the token that should follow the last token in the query. For every single token in its dictionary (all 256,000 of them), it assigns a probability. If the query is "How can I make a movie with my...", the word "iPhone" might get a 90% probability, while "Toaster" gets 0.0001%. It picks the winner and moves on to the next one: the original query plus the first predicted token is used to predict the second token, and the cycle continues until the model predicts a special End-Of-Sequence token.
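To make that loop concrete, here is a minimal sketch of greedy autoregressive decoding. Everything in it is an assumption for illustration: the five-word `VOCAB`, the hand-written `toy_logits` function standing in for the neural network, and the greedy "pick the winner" rule (real systems often sample instead).

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5-token vocabulary; a real model has hundreds of thousands of entries.
VOCAB = ["my", "iPhone", "toaster", "movie", "<eos>"]

def toy_logits(tokens):
    """Stand-in for the network: hand-written scores, not a trained model."""
    if tokens[-1] == "my":
        return [0.0, 6.0, -6.0, 1.0, 0.0]  # strongly prefers "iPhone"
    return [0.0, 0.0, -6.0, 0.0, 6.0]      # otherwise prefers to stop

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        next_token = VOCAB[probs.index(max(probs))]  # greedy: pick the winner
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the prediction back in as input
    return tokens

print(generate(["make", "a", "movie", "with", "my"]))
# ['make', 'a', 'movie', 'with', 'my', 'iPhone']
```

Note how each new token requires another full call to the model with the longer sequence, which is where the cost discussed next comes from.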
Yes, you are right if you are thinking that this autoregressive algorithm sounds costly! With traditional Transformer models, the cost of attention grows quadratically with sequence length, which can put a huge dent in your pocket. Techniques like KV caching were introduced to ease this limitation, but efficient inference is still an open problem in the world of LLMs.
Okay! Sorry for the detour, but let's get back to pre-training Nemo.
The training loop essentially makes the model better at predicting the next token through multiple passes over the training data. The loss function used here is the Negative Log Likelihood (NLL), also known as cross-entropy loss. Think of it as a penalty for being wrong or unsure.
A random example of a pre-training text sample from nemotron-3-super:
Doing Business As (DBA)
A DBA can be a Sole Proprietor or a General Partnership. It may also be owned by an entity that wishes to transact business under an assumed name.
A Sole Proprietor is one individual in business alone. Sole Proprietors are the most common form of business structure. This type of business is simple to form and operate, and may enjoy greater flexibility of management and fewer legal controls. However, the business owner is personally liable for all debts incurred by the business.
After completing the form, you can submit it online via the “Submit a Paper Filing” option.
You can not merge a DBA into another entity or merge another entity into a DBA.
You can not convert a DBA to another entity type or convert another entity to become a DBA. You can, however, cancel the DBA by filling out the DBA Cancellation and then register the new entity as the desired entity type. To change an entity to a DBA, you will need to cancel/dissolve/terminate/withdraw that entity and then register as a new DBA
Question: What is one advantage of a sole proprietorship?
Answer: A sole proprietorship is simple to form and operate, and may enjoy greater flexibility of management and fewer legal controls.
Question: What is a DBA and who can own it?
Answer: A DBA is a business name used by a sole proprietor or general partnership, and it can also be owned by an entity that wishes to transact business under an assumed name.
The Filmmaking Example:
Nemo sees: "How can I make a movie with my..."
The target is "iPhone".
Case A (High Certainty): Nemo gives "iPhone" a 90% chance (\(P=0.9\)).
$$L = -\log(0.9) \approx \mathbf{0.105} \text{ (Small penalty)}$$
Case B (Low Certainty): Nemo gives "iPhone" only a 10% chance (\(P=0.1\)).
$$L = -\log(0.1) \approx \mathbf{2.30} \text{ (Huge penalty)}$$
The Goal: Drive the loss toward zero. The more certain Nemo is about the correct word, the lower the loss.
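The two cases above can be checked in a couple of lines. This is just the formula with the natural logarithm; the helper name `nll` is mine, not something from the report.

```python
import math

def nll(prob_of_correct_token):
    """Negative log likelihood of the token that actually came next."""
    return -math.log(prob_of_correct_token)

print(round(nll(0.9), 3))  # 0.105 -> confident and right: small penalty
print(round(nll(0.1), 3))  # 2.303 -> unsure: large penalty
```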
This is the heartbeat of training in the NeMo framework, repeated trillions of times: Nemo guesses the next token, we compute the NLL loss against the actual token in the data, and we send that loss backward (backpropagation) to nudge the model's weights. If the loss was high, the nudge is aggressive; if it was low, the nudge is gentle.
Before we continue towards the next phases of training, let's take a quick glance at the hybrid architecture of the Nemotron model.
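Here is a toy sketch of that "guess, score, nudge" cycle. As a loud simplification, it nudges the output scores (logits) directly instead of billions of weights, and it uses plain gradient descent with a made-up learning rate. For softmax plus NLL, the gradient with respect to the logits is simply `probabilities - one_hot(target)`, which is why a badly wrong guess produces a bigger nudge.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_and_grad(logits, target_index):
    """Return the NLL loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad

def sgd_step(logits, target_index, learning_rate=1.0):
    """One nudge: move the logits against the gradient."""
    _, grad = nll_and_grad(logits, target_index)
    return [x - learning_rate * g for x, g in zip(logits, grad)]

unsure = [0.0, 0.0, 0.0]      # no idea which of 3 tokens is right
confident = [0.0, 5.0, 0.0]   # already favors the target (index 1)

loss_before, grad_unsure = nll_and_grad(unsure, 1)
loss_after, _ = nll_and_grad(sgd_step(unsure, 1), 1)
_, grad_confident = nll_and_grad(confident, 1)

print(loss_before > loss_after)                    # True: the nudge lowered the loss
print(abs(grad_unsure[1]) > abs(grad_confident[1]))  # True: high loss, bigger nudge
```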
The Mamba-Transformer Latent MoE Hybrid
Lots of sci-fi words there! Let's break them down one by one.
Transformers
Everyone has heard this one before: it's the core engine that drives these giant models through its attention mechanism. There are several deep dives on Transformers, but here are my two cents on making it intuitive.
Imagine you are reading a complex legal contract. As you read the word "it" in the third paragraph, your brain instantly "shines a spotlight" back on the first paragraph where the company name was mentioned.
The Win: Transformers are brilliant at connecting dots across long distances. They don't just read; they understand context with near-perfect fidelity.
The Catch: This "spotlight" is expensive. To understand a word, the Transformer has to look at every other word you’ve typed. If you double the text, the work quadruples (O(n^2)). This is why traditional LLMs get slow and "memory-hungry" as your conversation gets long.
Check out this awesome HF article on how Transformers work for a deeper dive.
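To see where the "work quadruples" claim comes from, here is a bare-bones sketch of single-head scaled dot-product attention in pure Python. It is an illustration only: no learned projections, no causal mask, no batching, and the `comparisons` counter is something I added to count the query-key pairs.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, comparisons = [], 0
    for q in queries:                       # every token ...
        scores = []
        for k in keys:                      # ... looks at every other token
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
            comparisons += 1
        weights = softmax(scores)           # the "spotlight" over the sequence
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs, comparisons

for n in (4, 8):
    tokens = [[1.0, 0.0]] * n
    _, comparisons = attention(tokens, tokens, tokens)
    print(n, comparisons)  # 4 16, then 8 64: double the text, four times the work
```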
Mamba
Mamba is the new challenger on the block. It’s a State Space Model (SSM), and it thinks very differently. Think of Mamba like a person watching a movie. Instead of re-watching every previous scene to understand the current one (like a Transformer), Mamba maintains a running summary in its head. As new information comes in, it updates that summary and throws away the raw details it no longer needs.
The Win: Because it only keeps a fixed-size "summary" (state), it is incredibly fast and scales linearly (O(n)). You can feed it a 1-million-token book, and the cost of processing each new token stays the same.
The Catch: Sometimes a summary isn't enough. For super complex reasoning or tiny details buried 500 pages ago, the "spotlight" of a Transformer is still superior.
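The "running summary" idea can be sketched as a simple linear recurrence. This is a deliberately tiny stand-in, not Mamba itself: the state here is a single number, and the coefficients `a`, `b` and `c` are fixed constants I picked for illustration, whereas Mamba makes them depend on the input (the "selective" part) and uses a much larger state.

```python
def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Process a sequence with a fixed-size state: one update per token, O(n) total."""
    state = 0.0
    outputs = []
    for x in inputs:
        state = a * state + b * x   # fold the new input into the running summary
        outputs.append(c * state)   # read the output off the summary
    return outputs

# A single "event" at the start fades from the summary over time.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25]
```

The fading in that output is also a picture of "The Catch": details from long ago get compressed away unless the model learned to hold on to them.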
MoE & Latent MoE
Finally, we have the Mixture of Experts (MoE). Imagine a hospital with 120 billion doctors. If every doctor tried to treat every patient, it would be chaos (and expensive). In an MoE model, the "brain" is split into specialized "experts." When you ask a math question, the model only "wakes up" the math experts.
The Standard MoE Win: You get the intelligence of a 120B model but only pay the "compute cost" of a 12B model (because only 12B parameters are active).
The "Latent" MoE: In a normal MoE, moving data between "experts" is like trying to shove a giant grand piano through a narrow hospital hallway: it’s slow and takes up too much space (bandwidth). NVIDIA’s Latent MoE adds a brilliant step. It "shrinks" the data into a tiny, dense package (the Latent Space) and sends that tiny package to the experts. Because the package is so small, Nemo can afford to consult 4x as many experts (consulting 4 instead of 1) for the exact same "price" as a normal model.
Why does this make a difference, you might ask? A Latent MoE layer is close to a normal MoE layer in terms of computation cost. But here is the twist: research suggests that, for a similar compute budget, a higher number of experts tends to improve intelligence and accuracy. Hence the metric that Latent MoE strives to improve is intelligence (accuracy) per unit of inference cost.
An MoE layer is usually a group of Feed Forward Networks (FFNs) with a router gate attached to them. The objective is to train the router gate to send each token to either a single expert or a small group of experts at a time. For example, if you ask a question about Python, the router sends those tokens to the "Coding Expert" FFNs; if you ask about filmmaking, they go to the "Creative Writing" experts (in practice the specialization is learned and rarely this clean). The loss function usually used alongside the router is called the Load Balancing Loss (or Auxiliary Loss), which prevents the router from being lazy and routing everything to a single expert, something that can not only overwhelm a GPU but also degrade the quality of generation.
Check out this HF article on MoE to understand it better.
Here's a look at the Nemotron-3-Super hybrid architecture. It uses Mamba layers to handle the massive context and speed, and interleaves Transformer (attention) layers to handle the deep, sharp reasoning. The Mamba layers scale as O(N) and only the interleaved attention layers still pay the O(N^2) tax, which makes the model much more efficient at inference. (Yes, that means more tokens for the same amount of money.)
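Here is a small sketch of the two router ideas described above: picking the top-k experts for a token, and a load balancing penalty. The penalty follows the commonly used Switch Transformer style formulation (number of experts times the sum over experts of "fraction of tokens routed there" times "average router probability"); I am assuming that form for illustration, and the report's exact auxiliary loss may differ.

```python
def top_k_experts(router_probs, k):
    """Indices of the k experts with the highest router probability for one token."""
    ranked = sorted(range(len(router_probs)), key=lambda e: router_probs[e], reverse=True)
    return ranked[:k]

def load_balancing_loss(all_router_probs):
    """all_router_probs: one list of expert probabilities per token."""
    num_tokens = len(all_router_probs)
    num_experts = len(all_router_probs[0])
    # Fraction of tokens whose top choice is each expert.
    counts = [0] * num_experts
    for probs in all_router_probs:
        counts[top_k_experts(probs, 1)[0]] += 1
    fraction = [c / num_tokens for c in counts]
    # Average probability the router assigns to each expert.
    mean_prob = [
        sum(probs[e] for probs in all_router_probs) / num_tokens
        for e in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))

balanced = [[0.9, 0.1], [0.1, 0.9]]   # tokens spread across both experts
lazy = [[0.9, 0.1], [0.9, 0.1]]       # everything goes to expert 0

print(top_k_experts([0.1, 0.6, 0.3], k=2))                          # [1, 2]
print(load_balancing_loss(balanced) < load_balancing_loss(lazy))    # True
```

The lazy router is penalized more, so minimizing this term alongside the main loss pushes the router to spread tokens across experts.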
SFT (Supervised Fine Tuning) - Nemo Graduates!
Nemo has learned to read and can now respond to queries, but there's a good chance of it returning nonsensical replies that sound more like document completion. Instead of giving you advice, a base model might return something like this:
Nemo (Base): "Hey Nemo, how can I make a movie with my iPhone? This is a question many beginners ask when starting their journey in digital cinematography. In this article, we will explore 10 tips for mobile filmmaking. Tip 1: Clean your lens..."
Supervised Fine Tuning teaches the model to use reasoning (thinking tokens), make tool calls, follow instructions and respond in specific patterns or structures. This is achieved by training the base (pre-trained) model on Q&A pairs, chat samples, etc.
SFT essentially teaches a basic language model to behave as an agent, interact with the tools available to it, and produce Chain-of-Thought traces to derive accurate responses.
Nemotron-3-super's SFT training data consists of 7 million such samples. Let's take a look at a sample built around our filmmaking request and at how the loss is calculated on such conversational samples.
[
  {
    "role": "user",
    "content": "I'm shooting a 15-minute 4K ProRes clip in Miami tomorrow at sunset. Can you check if the weather is clear and tell me how much storage I'll need?"
  },
  {
    "role": "assistant",
    "content": "",
    "reasoning_content": "<think>\n1. User needs weather for Miami tomorrow sunset.\n2. User needs storage for 15 min 4K ProRes.\n3. Plan: Call get_weather and calculate_storage tools.\n</think>",
    "tool_calls": [
      {
        "id": "call_miami_992",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"Miami, FL\", \"time\": \"sunset tomorrow\"}"
        }
      }
    ]
  },
  {
    "role": "tool",
    "content": "{\"status\": \"Clear\", \"sunset_time\": \"7:45 PM\", \"humidity\": \"80%\"}",
    "tool_call_id": "call_miami_992"
  },
  {
    "role": "assistant",
    "content": "Great news! Miami will be clear at 7:45 PM tomorrow. For a 15-minute ProRes 4K clip, you will need approximately **78.7 GB** of storage. Since you're shooting ProRes, I recommend using an external SSD."
  }
]
This JSON-formatted conversation sample is converted to a raw token sequence in a data preprocessing step called Prompt Templating, which looks something like this:
System: You are a filmmaking assistant. <policy> ... </policy>
User: I'm shooting a 15-minute 4K ProRes clip in Miami...
Assistant: <think>
1. User needs weather for Miami...
2. Plan: Call get_weather and calculate_storage tools.
</think>
Great news! Miami will be clear at 7:45 PM tomorrow...
A special tool-call token, you might wonder? Yes, that's the way a model is steered towards tool calling in the middle of a conversation, and it learns to draw inferences from the results of tool calls. The SFT training data often contains domain-specific examples like using Linux terminal commands, performing a web search and many more. This is actually the stage where a dumb language model starts becoming an agent.
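As a rough illustration of prompt templating, the sketch below flattens a list of chat messages into one string. The output layout and the `<tool_call>...</tool_call>` markers are hypothetical choices of mine; real models ship their own chat template with their own special tokens, and Nemotron's actual template is not reproduced here.

```python
import json

def render_chat(messages):
    """Flatten role-tagged messages into a single training string (toy template)."""
    lines = []
    for message in messages:
        parts = []
        if message.get("reasoning_content"):
            parts.append(message["reasoning_content"])
        for call in message.get("tool_calls", []):
            # Hypothetical marker that tells the model "a tool call starts here".
            parts.append("<tool_call>" + json.dumps(call["function"]) + "</tool_call>")
        if message.get("content"):
            parts.append(message["content"])
        lines.append(message["role"].capitalize() + ": " + "\n".join(parts))
    return "\n".join(lines)

sample = [
    {"role": "user", "content": "Is Miami clear at sunset tomorrow?"},
    {
        "role": "assistant",
        "content": "",
        "reasoning_content": "<think>Need the weather tool.</think>",
        "tool_calls": [{"type": "function",
                        "function": {"name": "get_weather", "arguments": "{}"}}],
    },
    {"role": "tool", "content": "{\"status\": \"Clear\"}"},
    {"role": "assistant", "content": "Great news! Miami will be clear."},
]
print(render_chat(sample))
```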
The Chain-of-Thought reasoning traces that you might have encountered while using tools like Claude Code are taught to the model. Take a look at a sample reasoning trace that the model can learn.
User: "If it takes 5 towels 5 hours to dry outside in the sun, how long does it take 10 towels to dry?"
The Learned Reasoning Trace (Chain-of-Thought):
<think>
1. Identify the core variables: Number of towels = 5, Time = 5 hours.
2. New variables: Number of towels = 10.
3. Determine the relationship: Does drying time scale linearly with the number of items?
4. Physical logic check: No. The sun shines on all towels simultaneously (parallel processing).
5. Constraint check: Assuming there is enough space on the clothesline for all 10 towels.
6. Conclusion: The time remains constant regardless of the number of towels.
7. Final Answer: 5 hours.
</think>
If you just twitched and thought to yourself that a model's reasoning and thinking capability is essentially a game of guessing token probabilities, you have hit the right spot! This approach has its own set of limitations, as described in this paper, but when trained on millions of such reasoning traces the model does learn to reproduce them correctly. (Sometimes I am flabbergasted by the magic a lot of numbers together can do!)
Alright! Let's drop our emotions and get back to teaching Nemo!
An important thing to note is that during SFT runs, the goal is to make the model learn how the assistant responds, so any errors in predicting user tokens can be ignored. To achieve this, the NLL loss function is used with a loss mask, which works as demonstrated below.
It’s a simple array of 0s and 1s that aligns perfectly with our token sequence:
System/User Tokens: Mask = 0 (Loss is ignored)
Assistant Tokens: Mask = 1 (Loss is calculated)
Assistant Final Tokens: Mask = 1 (Loss is calculated)
The resulting data packet for one training step:
{
"input_ids": [1502, 25, ..., 2043, ..., 3122, 25, 12, 442, ..., 13, 4532],
"labels": [1502, 25, ..., 2043, ..., 3122, 25, 12, 442, ..., 13, 4532],
"loss_mask": [0, 0, ..., 0, ..., 0, 0, 1, 1, ..., 1, 1]
}
During the forward pass, the model predicts the "Next Token" for every position. The training loop then applies the mask.
$$L_{SFT} = \frac{\sum (\text{Loss} \times \text{Loss\_Mask})}{\sum \text{Loss\_Mask}}$$
Intuition: If the model predicts "Rainy" inside the <think> block when the golden data said "Clear," the loss is calculated. But if the model predicts the wrong punctuation in the User's sentence, the mask is 0, so the error is multiplied by zero and never contributes to the weight update.
Post SFT, Nemo can now act as an assistant that can think and use tools to respond to your queries, but there's a catch: it has learned to use known tools and to think based on known reasoning traces. What if Nemo is deployed to completely unknown terrain? We don't want it to throw out some random tokens from its training set, so let's take a look at the boss level of LLM training, Reinforcement Learning, which makes Nemo ready to face the world!
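The masked average can be written in a few lines. The per-token loss values below are made up for illustration; in a real run they would be the NLL of each token in the sequence.

```python
def masked_mean_loss(token_losses, loss_mask):
    """Average the per-token losses, counting only positions where the mask is 1."""
    total = sum(loss * mask for loss, mask in zip(token_losses, loss_mask))
    counted = sum(loss_mask)
    return total / counted if counted else 0.0

# Two user tokens (badly predicted, but ignored) followed by two assistant tokens.
token_losses = [5.0, 4.0, 0.2, 0.4]
loss_mask = [0, 0, 1, 1]

print(round(masked_mean_loss(token_losses, loss_mask), 3))  # 0.3
```

The large losses on the user tokens never reach the result, so the model is only trained on how the assistant should respond.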
Reinforcement Learning - Nemo's Internship
Reinforcement Learning allows Nemo to make a lot of trials and learn from its errors, i.e. to develop a self-correction muscle through constant visits to simulated Gyms, doing a lot of sets and reps of the same task, each time with some degree of variation.
Let me simplify that: now that Nemo has trained to be a perfect intern with reasoning and tools at its disposal, it's time for Nemo to perform some tasks and take feedback on them.
The first phase of this training is RLVR (Reinforcement Learning with Verifiable Rewards). Nemo is launched in a controlled sandbox environment and asked to perform a task that can be verified by the environment itself. Let's take a classic example: fixing a software bug.
A sandbox environment is launched with a Python codebase in a Docker container, set up in such a way that Nemo can access this container.
Nemo is given a bug description, and the environment (Docker container) includes a set of unit tests and integration tests.
Since Nemo learned to write code during pre-training and learned to reason and use tools during SFT, it produces multiple (let's say 16) versions of a fix after reasoning and using tools like web search.
The environment grades each version of the solution against its set of tests and generates a score for each of them. The scoring mechanism can be anything: for example, score 0 if the code crashes, 0.2 if it fails the linter and formatter, 1 if it passes all the tests, and so on.
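The grading step above could be sketched as a plain function. The 0, 0.2 and 1 scores come from the example in the text; the function name, its arguments and the partial-credit rule for "some tests pass" are my own assumptions, not a description of any real environment.

```python
def score_attempt(crashed, lint_ok, tests_passed, tests_total):
    """Turn the sandbox's observations about one attempt into a reward in [0, 1]."""
    if crashed:
        return 0.0                      # code does not even run
    if not lint_ok:
        return 0.2                      # runs, but fails the linter/formatter
    if tests_passed == tests_total:
        return 1.0                      # everything is green
    # Assumed partial credit: scale linearly between 0.2 and 1.0.
    return 0.2 + 0.8 * tests_passed / tests_total

# Grading a (shortened) group of attempts at the same bug.
attempts = [
    dict(crashed=True, lint_ok=False, tests_passed=0, tests_total=10),
    dict(crashed=False, lint_ok=False, tests_passed=10, tests_total=10),
    dict(crashed=False, lint_ok=True, tests_passed=5, tests_total=10),
    dict(crashed=False, lint_ok=True, tests_passed=10, tests_total=10),
]
rewards = [score_attempt(**attempt) for attempt in attempts]
print([round(r, 2) for r in rewards])  # [0.0, 0.2, 0.6, 1.0]
```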
The model weights are updated based on these scores using the GRPO (Group Relative Policy Optimization) algorithm, as described by the following equation.
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^G \left( \min \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)} \hat{A}_i, \text{clip} \left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1-\epsilon, 1+\epsilon \right) \hat{A}_i \right) - \beta D_{KL}(\pi_\theta || \pi_{ref}) \right) \right]$$
Don't be scared! This sci-fi equation, which looks like it came straight out of the movie Inception, actually describes something very intuitive. It’s how the model decides which of those 16 "Gym" attempts was the winner and how much it should change its brain because of it.
Let's break down the "Philosophy" of the math:
1. The Group Average
$$(\frac{1}{G} \sum_{i=1}^G)$$
The Symbol: G is the group size (Nemo's 16 different attempts).
The Philosophy: Fairness. Instead of judging Nemo against some impossible "perfect" standard, we judge him against himself. By averaging the scores of the whole group, we can see which attempts were genuinely "above average."
2. The Ratio
$$(\frac{\pi_\theta}{\pi_{\theta_{old}}})$$
- The Philosophy: Progress Tracking. This ratio asks: "How much more likely is the New Nemo to pick this answer compared to the version of himself from 5 minutes ago?" It ensures we are moving in the right direction.
3. The Advantage
$$(\hat{A}_i)$$
The advantage of each attempt is calculated from its reward, using the mean and standard deviation of the rewards across the whole group of Nemo's outputs:
$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, r_2, ..., r_G)}{\text{std}(r_1, r_2, ..., r_G)}$$
- The Philosophy: Grading on a Curve. This is the actual "Relative Score." If Nemo’s 5th attempt scored above the average of the group, its Advantage is positive. If it was a total disaster, it’s negative.
4. The "Safety Rails"
The Symbol:
$$\text{clip}(..., 1-\epsilon, 1+\epsilon)$$
The Philosophy: Moderation. This is the most important part of stable AI training. It says: "Even if you found a brilliant answer, don't change your entire brain overnight." It limits how far the ratio can move in a single update (ε is usually 0.1 to 0.2, i.e. 10-20%). If the model changes too fast, training becomes unstable and it can "trip" and forget what it previously learned (Catastrophic Forgetting).
5. The "Leash"
- The Symbol:
$$D_{KL}(\pi_\theta || \pi_{ref})$$
The Philosophy: Identity Protection. This is the Kullback–Leibler (KL) Divergence. It measures how much the current model has drifted away from the original "High School" model.
Why it's there: If you give a model rewards for math, it might start speaking in weird, unreadable symbols just to get a higher score (Reward Hacking). The KL penalty is a leash that pulls Nemo back and says: "Be smarter, but stay human. Don't stop sounding like the helpful assistant I trained you to be."
In short: GRPO is like grading on a curve. Nemo takes 16 shots at the same goal, and the math rewards the versions that performed better than the group average while using the \(D_{KL}\) penalty to make sure he doesn't go 'rogue' and start speaking gibberish just to get a higher score.
Post RLVR, Nemo is subjected to the second phase of Reinforcement Learning, i.e. RLHF (Reinforcement Learning from Human Feedback).
There are things in the real world that a Python sandbox or a calculator simply can’t grade. How "polite" is the response? Is it being too wordy? Is it safe? Is it actually helpful, or just technically correct? These are the "human vibes" that require a different kind of training. Enter a stand-in human (i.e. another LLM): since using actual humans for feedback is expensive (let's be honest, it's not scalable), LLMs are usually trained using, yes, you guessed it right, another LLM!
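Putting the pieces together, here is a sketch of the term inside the GRPO sum for a group of attempts. It treats the probability ratio and the KL value as plain numbers handed in from outside (a real trainer computes them per token from the models), and the `epsilon` and `beta` defaults are illustrative values, not the ones used for Nemotron.

```python
def clip(value, low, high):
    return max(low, min(high, value))

def grpo_term(ratio, advantage, kl, epsilon=0.2, beta=0.01):
    """Clipped, advantage-weighted ratio minus the KL 'leash' for one attempt."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    return min(unclipped, clipped) - beta * kl

def grpo_objective(ratios, advantages, kls, epsilon=0.2, beta=0.01):
    """Average the per-attempt terms over the group (the 1/G sum)."""
    terms = [grpo_term(r, a, k, epsilon, beta) for r, a, k in zip(ratios, advantages, kls)]
    return sum(terms) / len(terms)

# A good attempt the new model already likes 50% more: the gain is capped at 1.2x.
print(round(grpo_term(ratio=1.5, advantage=1.0, kl=0.0), 3))   # 1.2
# A bad attempt the new model has moved away from: the penalty is also bounded.
print(round(grpo_term(ratio=0.5, advantage=-1.0, kl=0.0), 3))  # -0.8
```

Training maximizes this objective, so attempts with positive advantage become more likely, but never by more than the clip range allows in one step.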
A Generative Reward Model (GenRM) such as GPT-OSS or QWEN-235b is used to grade the model's responses using rubrics like HelpSteer2, which score responses on several attributes, such as:
Helpfulness: Does this actually solve the user's problem?
Correctness: Are the facts (like that 700 Mbps bitrate) accurate?
Coherence: Does the answer flow naturally, or is it a "word salad"?
Complexity: Is the depth of the answer appropriate for the question?
The grades are used to update the model using the same GRPO algorithm we discussed above.
A single RL training step involves data loading, launching sandbox environments, model inference to generate responses, gradient optimization and, most importantly, accumulating weight updates. A true nightmare for an infra guy like me, not just because of the huge resources it demands but because of its sensitivity to the failure of any one component. We will look at how this seemingly impossible task is pulled off at scale in Part 2.
But yes, back to our training course: RL has made Nemo battle-tested, and now it's ready to be introduced to the world.
Intuitively, the Post Training phase (SFT + RL) is where base models can be molded into experts. If you ever thought of adapting a base model that can run offline and carry out some very niche tasks for you, like auto-responding to an annoying colleague over Slack (yes, something lord Gilfoyle did 10 years back) or something actually worthwhile like taking care of production issues, Post Training would be the way to go! Why offline, you ask? Because why not? Mah Model Mah Rules...!
Anyway, the nemotron-3-super report has a lot more winning takes on training LLMs that are worth a mention. Let's look at them.
Some more Highlights from the Nemotron-3-Super Report
Pre-training in NVFP4 precision: Usually models are trained in higher precision (BF16 or FP32) and then quantized to fewer bits to reduce the memory footprint. This often degrades the accuracy of the model. nemotron-3-super-120b was pre-trained in lower precision to build the intelligence with a lower memory footprint in the first place.
The "Infinity" Memory: 1-Million Token Context. Most models start "forgetting" the beginning of a conversation after a few dozen pages. Thanks to that Mamba-Transformer hybrid architecture, Nemotron-3-Super can handle 1 million tokens in its active memory.
Steerable Reasoning: The "Deep Thought" Slider. NVIDIA introduced Steered Chain of Thought (SCoT). Unlike other models that either "think" too much or too little, Nemo is steerable. You can actually "tune" how much reasoning Nemo does. Need a quick, snappy answer for a simple email? Turn it down. Need a deep, multi-step architectural plan for a film studio? Turn the "thinking" knob to the max.
Native Speculative Decoding: The Built-in Turbo. Because Nemo was trained with Multi-Token Prediction (MTP), it has a built-in "drafting" brain. While standard models write one word at a time, Nemo uses its extra MTP layers to "guess" the next 4 words ahead of time. The main brain then just verifies them. This native trick gives it a 7.5x speed boost during inference without needing any extra software hacks.
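The draft-then-verify idea can be sketched with two toy "brains". This is a simplified greedy version under loud assumptions: `target_next` and `draft_next` are hand-written stand-ins rather than MTP layers, the verification is written as a loop for readability (a real model checks all drafted tokens in one parallel pass), and the fixed `TARGET` sentence is invented for the demo.

```python
TARGET = ["clean", "your", "lens", "before", "you", "shoot"]

def target_next(context):
    """Toy 'main brain': always knows the right next token."""
    return TARGET[len(context)] if len(context) < len(TARGET) else "<eos>"

def draft_next(context):
    """Toy 'drafting brain': cheap, and wrong at one position."""
    return "after" if len(context) == 3 else target_next(context)

def speculative_step(prefix, num_draft=4):
    # 1. Draft: cheaply guess several tokens ahead.
    drafted, context = [], list(prefix)
    for _ in range(num_draft):
        token = draft_next(context)
        drafted.append(token)
        context.append(token)
    # 2. Verify: keep guesses while they match what the main brain would say.
    accepted, context = [], list(prefix)
    for token in drafted:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # fix the first wrong guess, drop the rest
            break
        accepted.append(token)
        context.append(token)
    return accepted

print(speculative_step([]))  # ['clean', 'your', 'lens', 'before']
```

Even though the draft got the fourth word wrong, one verification round still produced four correct tokens, which is where the speed-up comes from.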
Wrapping Up!
I hope that you now have a basic understanding of how LLMs are trained in general. As you might have thought while reading these training recipes, LLMs are indeed very large, resource-hungry and heavy software programs that seem to be penetrating almost every domain of the tasks we do every day. Having the ability to train custom models or tune them for specific tasks in niche domains like Legal, Finance and hundreds of others might be a way to achieve more with fewer resources. Nemotron has been built with efficiency baked in (NVFP4 pre-training, Mamba layers, MTP, PivotRL) and serves as a golden reference for building open source, efficient and tailored LLMs.
If you are tired of hearing the word efficient in this conversation, here's some food for thought that might help ease the pain. We are swiftly moving away from generic chat-based systems to more agentic systems that actually perform tasks. This requires a lot of tokens, agent-to-agent interactions (OpenClaw!), storing and retrieving memories for agents, millions of available tools, and thousands of long-running background tasks (Autoresearch, for example)! The time isn't far off when even a 1M context window will seem small; hence, in my opinion, efficiency is no longer a 'nice-to-have.' Striving to lower the cost of every unit of intelligence produced is the way to ensure the ROI of autonomy actually outweighs the compute bill.
If you are interested in learning about the infrastructure that serves as the foundation stone of LLMs, do take a look at Part 2: The Engineering (How it is Built).
