202603242029-nanochat
🎯 Core Idea
nanochat is not best understood as a generic LLM framework. It is a compact, opinionated end-to-end training and serving repo for a small ChatGPT-like model. The repo tries to keep the whole path visible in one place: tokenizer training, base-model pretraining, supervised fine-tuning, reinforcement learning, evaluation, checkpoint loading, inference, and a simple web chat UI. The main thing worth studying is how little abstraction it uses to connect these stages.
The architectural center of gravity is the combination of nanochat/gpt.py and the stage entrypoints in scripts/. The reusable logic lives in the nanochat/ package, while the top-level scripts orchestrate each training or inference phase. That means the repo reads more like a research harness with a thin product shell than like a traditional application with deep service layers. If the goal is to understand the implementation, the right move is to follow the training and inference pipeline rather than treating each directory as equally important.
What makes the repo interesting is that it deliberately collapses many choices into a few core mechanisms. A single depth dial drives much of the training recipe. One GPT implementation is reused across pretraining, chat fine-tuning, RL, and serving. One tokenizer protocol defines how conversations, tool use, and evaluation data are serialized. One checkpoint-loading path reconnects all later stages back to earlier artifacts. The repo is therefore most legible when read as a pipeline with shared primitives, not as a collection of standalone tools.
🌲 Branching Questions
➡ What is the best mental model for this project?
The cleanest mental model is: nanochat is a full-stack LLM training pipeline with a chat demo attached at the end. The real structure is not frontend + backend; it is tokenizer -> base model -> chat model -> inference engine -> UI.
The scripts/ directory defines the phase entrypoints:
- scripts/tok_train.py and scripts/tok_eval.py for tokenizer work
- scripts/base_train.py and scripts/base_eval.py for base-model pretraining and evaluation
- scripts/chat_sft.py, scripts/chat_eval.py, and scripts/chat_rl.py for chat specialization
- scripts/chat_cli.py and scripts/chat_web.py for inference surfaces
The nanochat/ package contains the reusable implementation shared by those phases. In practice, the most important internal modules are:
- nanochat/gpt.py for the model definition and optimizer grouping
- nanochat/tokenizer.py for the repo's special-token chat format
- nanochat/checkpoint_manager.py for loading and saving stage outputs
- nanochat/engine.py for efficient inference and token-level tool-use handling
So the repo is best read as a staged system whose scripts are thin orchestration layers over a compact core.
➡ Which files should I read first, and why?
Start with README.md, but only long enough to get the intended workflow and the repo’s own map of the directories. The README is useful because it tells you what the author thinks the repo is for: a minimal, hackable harness for training and talking to a small chat model. But the README is not where the implementation logic lives.
The first real file to read is nanochat/gpt.py. This is the most important file in the repo. It defines:
- GPTConfig, plus the attention and MLP blocks
- the unusual architectural choices such as grouped-query attention, sliding-window attention patterns, value embeddings, residual scaling, smear, and backout
- the explicit dtype strategy through the custom Linear layer
- optimizer parameter grouping through setup_optimizer
- FLOP and parameter-count estimation used by the training scripts
The second file is scripts/base_train.py. This is where the repo’s training philosophy becomes concrete. It turns a few user-facing knobs such as --depth into a full training run: model sizing, target token budget, batch sizing, scaling-law-derived schedules, dataloaders, evaluation cadence, checkpointing, and logging. If gpt.py tells you what the model is, base_train.py tells you how the repo expects that model to be trained.
Then read nanochat/tokenizer.py and nanochat/checkpoint_manager.py. tokenizer.py is important because it defines the repo’s conversation wire format through special tokens like <|user_start|>, <|assistant_start|>, and the Python-tool markers. checkpoint_manager.py is the glue that lets the same model family move across base, sft, and rl phases without introducing a larger framework layer.
After that, read scripts/chat_sft.py and nanochat/engine.py. chat_sft.py shows how the repo turns a base language model into a chat model through conversation rendering and task mixtures. engine.py shows how inference is actually made efficient and how lightweight tool use is inserted into generation.
➡ What is the main execution flow?
The implementation flow starts with tokenization. nanochat/tokenizer.py provides tokenizer training and inference wrappers, but the crucial part for the rest of the repo is the conversation rendering logic. The function render_conversation turns structured user/assistant exchanges into token IDs plus supervision masks. That one design choice has large consequences: chat training, RL prompting, and serving all depend on the same serialized conversation format.
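The shape of that rendering step can be sketched as follows. This toy version operates on whitespace "tokens" instead of real token IDs, and the end-marker token names are assumptions (the note only confirms <|user_start|> and <|assistant_start|>); the point is the supervision mask, which enables loss only on assistant tokens.

```python
# A minimal sketch of conversation rendering with a supervision mask,
# in the spirit of render_conversation in nanochat/tokenizer.py.
# Real code emits token IDs; this toy splits on whitespace to show masking.

def render_conversation(messages):
    tokens, mask = [], []
    for msg in messages:
        role = msg["role"]  # "user" or "assistant"
        parts = [f"<|{role}_start|>"] + msg["content"].split() + [f"<|{role}_end|>"]
        supervised = 1 if role == "assistant" else 0
        for tok in parts:
            tokens.append(tok)
            mask.append(supervised)  # loss is computed only where mask == 1
    return tokens, mask

toks, mask = render_conversation([
    {"role": "user", "content": "2 + 2 ?"},
    {"role": "assistant", "content": "4"},
])
# user-turn tokens carry mask 0, assistant-turn tokens carry mask 1
```

Because SFT, RL prompting, and serving all call the same renderer, a change to this function changes the wire format everywhere at once.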
The next important flow is base-model pretraining. scripts/base_train.py parses the training arguments, initializes distributed compute, loads the tokenizer, and builds the model through GPT in nanochat/gpt.py. The key implementation idea here is that --depth is treated as the main complexity dial. Many other hyperparameters are then derived from that choice instead of being separately hand-tuned every time. After model construction, the script builds tokenized distributed loaders, runs the training loop, periodically evaluates validation bpb and CORE, and saves checkpoints through nanochat/checkpoint_manager.py.
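The depth-as-dial idea can be sketched like this. Every scaling rule below (width-per-layer ratio, head dimension, token budget, LR scaling) is an assumption for illustration; the actual derivations live in scripts/base_train.py and may differ.

```python
# A sketch of deriving a full training recipe from a single depth dial,
# rather than hand-tuning each hyperparameter. All constants are illustrative.

def recipe_from_depth(depth: int) -> dict:
    d_model = depth * 64              # assumed fixed width-per-layer aspect ratio
    n_heads = max(1, d_model // 128)  # assumed fixed head dimension of 128
    n_params = 12 * depth * d_model * d_model
    return {
        "depth": depth,
        "d_model": d_model,
        "n_heads": n_heads,
        "n_params": n_params,
        "target_tokens": 20 * n_params,   # Chinchilla-style ~20 tokens/param
        "lr": 0.02 / (d_model ** 0.5),    # width-scaled learning rate
    }

print(recipe_from_depth(20))
```

The payoff is that `--depth 26` versus `--depth 12` changes the whole recipe consistently, not just the model size.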
Once a base checkpoint exists, scripts/chat_sft.py loads it and runs supervised fine-tuning. The training data is assembled through TaskMixture in tasks/common.py, mixing SmolTalk, MMLU, GSM8K, identity conversations, and spelling tasks. The important logic here is not a new model architecture. It is the reuse of the same GPT model plus the tokenizer’s conversation serialization format. In other words, the chat model is mostly “the base model trained on a conversation protocol,” not “a separate chat stack.”
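The mixing idea can be sketched with a weighted sampler. The class name echoes TaskMixture in tasks/common.py, but the API and weights here are assumptions.

```python
# A sketch of mixing several supervised datasets by weight: pick a task
# according to its mixture weight, then a uniform example within it.
import random

class TaskMixture:
    def __init__(self, tasks, weights, seed=0):
        self.tasks = tasks        # each task: a list of rendered conversations
        self.weights = weights    # relative sampling weight per task
        self.rng = random.Random(seed)

    def sample(self):
        task = self.rng.choices(self.tasks, weights=self.weights, k=1)[0]
        return self.rng.choice(task)

mix = TaskMixture(
    tasks=[["smoltalk-ex1", "smoltalk-ex2"], ["gsm8k-ex1"], ["identity-ex1"]],
    weights=[0.8, 0.15, 0.05],
)
batch = [mix.sample() for _ in range(4)]
```

Keeping the mixture as a sampler means adding a new task is one list entry plus a weight, not a new training script.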
Then scripts/chat_rl.py optionally continues from the sft checkpoint. It loads the chat model, uses Engine.generate_batch from nanochat/engine.py to sample multiple completions per prompt, scores them with task-specific reward logic, computes simple advantages, and updates the model with a lightweight policy-gradient loop. This RL stage matters because it shows the repo’s incremental design: later stages reuse the same model, tokenizer, and inference engine instead of branching into separate infrastructure.
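The "simple advantages" step can be sketched as a group-relative baseline: score each of k completions for a prompt, then subtract the group mean so better-than-average samples get positive weight. The reward scheme below is illustrative; nanochat's exact update may differ.

```python
# A sketch of group-relative advantages for a simple policy-gradient loop.

def group_advantages(rewards):
    """Advantage of each completion relative to its prompt-group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# e.g. 4 sampled answers to one GSM8K prompt, reward 1.0 if correct else 0.0
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)   # [0.5, -0.5, -0.5, 0.5]
# the loss then weights each completion's summed log-prob by its advantage
```

Using the group mean as the baseline avoids training a separate value model, which keeps the RL stage as small as the note describes.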
Finally, scripts/chat_web.py serves the chat model. It loads the final checkpoint, creates a pool of per-GPU workers, and exposes a FastAPI server. Requests are tokenized into the same assistant/user marker format, then streamed through Engine.generate. The actual web layer is thin. Most of the interesting serving logic is still inside nanochat/engine.py, especially the KV-cache prefill/decode flow and the token-level state machine for optional calculator-like tool execution.
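The prefill/decode split behind that streaming can be sketched without any web framework. The stand-in model below is a toy, not nanochat's Engine API; the point is the two phases: one pass over the whole prompt to populate the KV cache, then one token per step during decode.

```python
# A sketch of KV-cache prefill/decode generation, yielding tokens as a stream.

def fake_forward(token, kv_cache):
    kv_cache.append(token)       # real code appends per-layer keys/values here
    return (token + 1) % 5       # toy "next token" rule standing in for the model

def generate_stream(prompt_tokens, max_new_tokens):
    kv_cache = []
    next_tok = None
    for tok in prompt_tokens:            # prefill: process the full prompt once
        next_tok = fake_forward(tok, kv_cache)
    for _ in range(max_new_tokens):      # decode: one new token per forward pass
        yield next_tok                   # a web layer streams each token out here
        next_tok = fake_forward(next_tok, kv_cache)

out = list(generate_stream([0, 1, 2], max_new_tokens=3))
```

A FastAPI endpoint only has to wrap this generator in a streaming response; the interesting state lives in the cache, not the route handler.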
➡ What are the key abstractions or design choices?
The first key abstraction is the single shared GPT implementation in nanochat/gpt.py. Instead of separate model implementations for pretraining and chat, the repo keeps one model definition and changes the surrounding training phase. This makes the repository easier to follow because model behavior is centralized.
The second key abstraction is the tokenizer-mediated conversation protocol in nanochat/tokenizer.py. This is more important than it looks. The special tokens do not just help formatting; they define how the whole repo thinks about user prompts, assistant completions, Python tool calls, and tool outputs. Once that protocol is fixed, both SFT and serving can share the same structure.
The third key abstraction is the stage-based checkpoint system in nanochat/checkpoint_manager.py. The loader treats base, sft, and rl as related stages rather than separate product surfaces. That is why scripts like chat_sft.py, chat_rl.py, and chat_web.py can stay relatively small: they mostly delegate artifact loading instead of rebuilding model state logic themselves.
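The stage-keyed layout can be sketched as follows. The directory structure and filenames are assumptions for illustration, not nanochat/checkpoint_manager.py's actual layout.

```python
# A sketch of stage-keyed checkpoint paths: each training phase writes under
# its own stage directory, and the next phase resumes from the previous one.
from pathlib import Path

STAGES = ("base", "sft", "rl")

def checkpoint_dir(root: str, stage: str, step: int) -> Path:
    assert stage in STAGES, f"unknown stage: {stage}"
    return Path(root) / stage / f"step_{step:06d}"

def latest_checkpoint(root: str, stage: str):
    """Return the newest checkpoint dir for a stage, or None if absent."""
    stage_dir = Path(root) / stage
    if not stage_dir.exists():
        return None
    steps = sorted(stage_dir.iterdir())
    return steps[-1] if steps else None

# chat_sft.py would resume from the latest "base" checkpoint,
# chat_rl.py from the latest "sft" checkpoint, and so on.
```

Because the stage name is the only coupling between phases, each script stays a thin orchestrator over load-then-train.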
The fourth key abstraction is Engine in nanochat/engine.py. Engine is not just a helper for token generation. It is the serving-time execution layer that handles batched generation, KV-cache replication, sampling, stopping conditions, and the lightweight tool-use state machine. If gpt.py is the model core, engine.py is the runtime inference core.
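The tool-use state machine can be sketched at the token level: when a tool-open marker appears, buffer tokens until the close marker, evaluate the buffered expression, and splice the result back into the stream. The marker names and the calculator-only eval are assumptions, not nanochat's exact implementation.

```python
# A sketch of token-level tool interception during generation.

TOOL_OPEN, TOOL_CLOSE = "<|python_start|>", "<|python_end|>"

def run_with_tools(token_stream):
    output, buffer, in_tool = [], [], False
    for tok in token_stream:
        if tok == TOOL_OPEN:
            in_tool, buffer = True, []           # enter tool mode, start buffering
        elif tok == TOOL_CLOSE and in_tool:
            expr = " ".join(buffer)
            result = str(eval(expr, {"__builtins__": {}}))  # calculator-only eval
            output.append(result)                # tool result re-enters the stream
            in_tool = False
        elif in_tool:
            buffer.append(tok)
        else:
            output.append(tok)
    return output

out = run_with_tools(["The", "answer", "is", TOOL_OPEN, "2", "*", "21", TOOL_CLOSE])
# → ["The", "answer", "is", "42"]
```

Keeping this logic in the engine means the model, the SFT data, and the server all agree on one tool protocol without any of them knowing about the others.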
A final design choice worth noticing is how much the repo prefers direct derivation over large config systems. scripts/base_train.py computes many quantities from a few base assumptions. This makes the code easier to audit, but it also means many decisions are embedded as code and formulas instead of externalized configuration. That is good for readability if you are studying one author’s system, but it also means you have to read the code to understand the defaults.
➡ What reading path gives the fastest understanding?
Use this order:
A. README.md
- Get the repo purpose, intended workflow, and top-level map.
B. nanochat/gpt.py
- Understand the model definition, important architectural choices, and optimizer grouping.
C. scripts/base_train.py
- Understand how the repo turns --depth into an actual pretraining run.
D. nanochat/tokenizer.py
- Understand the special-token format and how conversations are rendered.
E. nanochat/checkpoint_manager.py
- Understand how the base, sft, and rl stages connect.
F. scripts/chat_sft.py
- Understand how the base model becomes a chat model.
G. tasks/common.py
- Understand how task mixtures are assembled for training.
H. nanochat/engine.py
- Understand the real inference runtime and tool-use handling.
I. scripts/chat_web.py
- Understand how the runtime is exposed through FastAPI and the web UI.
J. scripts/chat_rl.py
- Understand how the repo extends chat behavior with RL after SFT.
This order works because it follows dependency and conceptual weight. If you start from chat_web.py, you will see the interface before you understand the machinery. If you start from gpt.py and base_train.py, the rest of the repo becomes much easier to place.
📚 References
- https://github.com/karpathy/nanochat
- https://github.com/karpathy/nanochat/discussions/1
- https://github.com/karpathy/nanochat/discussions/481