ML foundations
Imitation learning, policy gradient, and distributed RL — the primitives everything builds on
- §5 "off-policy learning" = exactly BC on expert demonstrations
- Table 2: off-policy policy passes 5/24 convergence tests despite lowest trajectory MAE — this is the BC failure mode made empirical
- The entire motivation for §5.1 on-policy IMPALA training comes from this paper's failure analysis
- §5.1 IMPALA is DAgger with the Plan Model replacing π* (the expert oracle)
- IMPALA actors = DAgger's rollout step
- World Model Plan Head = DAgger's expert query step
- Central learner = DAgger's retraining step
- IMPALA ([7]) inherits the actor-critic separation from PPO
- Comma.ai uses imitation loss instead of PPO's surrogate — but the distributed actor/learner split is identical
- The V-trace correction in IMPALA serves the same role as the clipped ratio in PPO: correcting for off-policyness
- §5.1 adopts IMPALA's actor-learner split exactly
- Each actor rolls out h^{π,wp} = {(o_t, a_t, â^wp_t)}, where â^wp_t comes from the Plan Model
- The parameter server pattern (learner broadcasts updated policy to actors) is explicit in §5.1
- Comma.ai replaces the RL reward signal with imitation loss on â^wp — so no V-trace correction is needed
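The actor/learner split with an imitation loss can be sketched in a toy setting. This is a hedged illustration, not the paper's implementation: the "Plan Model" is a hypothetical linear oracle, the world model is a toy AR(1) process, and the learner is plain least-squares gradient descent standing in for the real imitation objective on â^wp.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, ACT_DIM = 8, 2
W_plan = rng.normal(size=(ACT_DIM, OBS_DIM))  # stand-in "Plan Model" (the DAgger oracle)

def plan_model(obs):
    """Expert-like action label â^wp for an observation (hypothetical linear oracle)."""
    return W_plan @ obs

def actor_rollout(policy_W, steps=32):
    """One IMPALA actor: roll the current policy in a toy world model and
    record (o_t, a_t, â^wp_t) — DAgger's rollout + expert-query steps."""
    obs = rng.normal(size=OBS_DIM)
    traj = []
    for _ in range(steps):
        a = policy_W @ obs                                 # on-policy action
        traj.append((obs.copy(), a, plan_model(obs)))
        obs = 0.9 * obs + 0.1 * rng.normal(size=OBS_DIM)   # toy dynamics
    return traj

def learner_update(policy_W, batch, lr=0.2):
    """Central learner: imitation (regression) loss on the Plan Model labels
    replaces IMPALA's RL objective, so no V-trace correction is needed."""
    grad = np.zeros_like(policy_W)
    for obs, _, a_wp in batch:
        grad += np.outer(policy_W @ obs - a_wp, obs)
    return policy_W - lr * grad / len(batch)

policy = np.zeros((ACT_DIM, OBS_DIM))
for it in range(300):
    batch = [t for _ in range(4) for t in actor_rollout(policy)]  # 4 "actors"
    policy = learner_update(policy, batch)                        # broadcast back

print(np.abs(policy - W_plan).max())  # policy converges toward the oracle
```

Because the labels come from the current policy's own state distribution, this is DAgger's fix for the BC failure mode noted above: the learner sees exactly the states π visits.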
Localisation stack — pose p_t
The GPS + vision fusion system that produces the 6-DOF pose conditioning signal for the World Model
- §2.2 explicitly states pose p_t is from a "tightly coupled GPS/Vision MSCKF [14,17,26]"
- The pose is the transition signal for the World Model (§2.4): w maps (images, poses) → next image
- Using pose instead of raw actions means the WM is Vehicle Model-independent — you can augment VM parameters without retraining the WM
- The comma2k19 dataset provides these poses pre-computed at every frame — you don't need to run the MSCKF yourself
- p_t = (x,y,z,φ,θ,ψ) ∈ ℝ⁶ is directly from this filter's output
- In urban canyons (GPS degraded), the visual part takes over — the filter is robust by design
- §4.5: "400k segments, each 1 minute" — this is the exact comma2k19 segment format scaled up
- The 5 Hz downsampling in §4.5 matches the native 5-Hz MSCKF pose output
- Download: github.com/commaai/comma2k19
- The full internal dataset has millions of segments — comma2k19's ~2,000 one-minute segments (~33 hours) are a small public sample
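Since the WM is conditioned on pose transitions rather than raw actions, the useful quantity is the relative pose between consecutive frames. A minimal sketch of that computation from p_t = (x, y, z, φ, θ, ψ), assuming a ZYX roll-pitch-yaw convention (the paper's exact Euler convention is an assumption here):

```python
import numpy as np

def euler_to_R(phi, theta, psi):
    """Roll-pitch-yaw (φ, θ, ψ) to a rotation matrix, ZYX convention
    (a common choice — assumed, not confirmed by the paper)."""
    cf, sf = np.cos(phi), np.sin(phi)
    ct, st = np.cos(theta), np.sin(theta)
    cp, sp = np.cos(psi), np.sin(psi)
    Rz = np.array([[cp, -sp, 0], [sp, cp, 0], [0, 0, 1]])
    Ry = np.array([[ct, 0, st], [0, 1, 0], [-st, 0, ct]])
    Rx = np.array([[1, 0, 0], [0, cf, -sf], [0, sf, cf]])
    return Rz @ Ry @ Rx

def relative_pose(p_a, p_b):
    """Express pose p_b in the frame of p_a — the frame-to-frame transition
    signal a World Model can be conditioned on instead of raw actions."""
    Ra, Rb = euler_to_R(*p_a[3:]), euler_to_R(*p_b[3:])
    t_rel = Ra.T @ (p_b[:3] - p_a[:3])   # translation in a's frame
    R_rel = Ra.T @ Rb                    # rotation from a to b
    return t_rel, R_rel

p0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
p1 = np.array([1.0, 0.2, 0.0, 0.0, 0.0, 0.05])  # ~1 m forward, slight yaw
t_rel, R_rel = relative_pose(p0, p1)
print(t_rel)
```

This vehicle-frame transition is what makes the WM independent of the Vehicle Model: any VM that produces the same pose sequence produces the same WM rollout.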
Generative models — VAE + diffusion
The image encoder (§4.1), the DiT architecture (§4.2), and the Rectified Flow objective (§4.2.2) all come from here
- §4.1 uses the Stable Diffusion VAE as a fixed codec — 8× spatial downsampling into 4-channel latents
- Scale factor 0.18215 (from the LDM paper) normalises the latent distribution to ~unit variance for stable diffusion training
- The World Model never touches pixels — it operates entirely in latent space
- The policy, by contrast, operates on raw pixels — only the WM uses latents
- §4.1: "we use the pretrained Stable Diffusion image VAE [23]"
- The VAE is frozen — it is never fine-tuned on driving data
- Frames are downscaled to 128×256 before the VAE, so latents are 16×32×4 (the 8×8 figure only applies to 64×64 inputs)
- §4.2.2: "we adopt the Rectified Flow (RF) objective [16]"
- τ ~ LogitNormal(0.0, 1.0) [8] concentrates training at mid-noise levels
- §4.3: 15 Euler steps Δτ = 1/15 for sequential sampling
- The model predicts the velocity v = ε − o (noise minus data), not o directly — the velocity parameterisation, consistent with the Euler update o_τ ← o_τ − Δτ·v in §4.3
- 3D input: patch table extended to (frame × height × width) then flattened
- Causal mask: frame-wise triangular mask enables KV-caching for autoregressive sampling
- Multi-conditioning: pose + τ + world-timestep all summed before AdaLN
- Plan Head added: residual FF blocks on pooled context tokens → trajectory
- Three sizes: 250M (GPT-2 small), 500M (GPT-2 medium), 1B (GPT-2 large)
- §4.2.1: "we use the DiT architecture [19], adapted to 3D inputs"
- The scaling results in Fig.5 of the paper mirror DiT's scaling law: more params + more data → lower LPIPS
- GPT-2 model sizes [21] are used to define the three DiT variants
- §4.2.2: "we sample the noise timestep τ ~ Logit-Normal(0.0, 1.0) [8]"
- §4.4 noise augmentation: context frames noised at τ ~ LogitNormal(0.0, 0.25) — narrower distribution = less noise on context
- §4.2.1: "conditioning signals ... passed to the Adaptive Layer Norm layer (AdaLN) [32]"
- The conditioning vector c = sum(pose_embed, τ_embed, world_t_embed)
- AdaLN-Zero initialisation (DiT paper) means the model starts as a pure transformer and gradually learns to use the conditioning
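The RF pieces above fit together in a short sampling loop. The sketch below uses an oracle velocity for a known clean latent in place of the learned DiT w(o_τ, p, τ) — an assumption made so the block is self-contained; the latent shape 16×32×4, the 15 Euler steps with τ going 1→0, and the LogitNormal(0, 1) training-time τ are all from the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)
o_star = rng.normal(size=(16, 32, 4))   # a "clean latent" (toy stand-in)

def velocity(o_tau, tau):
    """Oracle rectified-flow velocity v = ε − o for straight-line paths
    o_τ = τ·ε + (1−τ)·o — a stand-in for the learned DiT w(o_τ, p, τ).
    On the straight path, o_τ − o = τ·(ε − o), hence v = (o_τ − o)/τ."""
    return (o_tau - o_star) / tau

# Training-time noise level: τ ~ LogitNormal(0, 1), concentrated at mid-noise.
tau_train = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.0)))

# Sampling: 15 Euler steps, τ going 1 → 0 with Δτ = 1/15.
n_steps = 15
d_tau = 1.0 / n_steps
o = rng.normal(size=o_star.shape)       # start from pure noise at τ = 1
for i in range(n_steps):
    tau = 1.0 - i * d_tau               # never evaluates at τ = 0
    o = o - d_tau * velocity(o, tau)

print(np.abs(o - o_star).max())
```

Because rectified-flow paths are straight lines, Euler integration is exact here; with a learned, imperfect velocity field the 15-step discretisation is an accuracy/latency trade-off.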
World models for control
The conceptual lineage of using a learned simulator to train a policy — from Ha & Schmidhuber to GAIA-1
- [10] is one of only 3 world model papers cited in the introduction (§1)
- The "dream training" concept = comma.ai's on-policy training inside the WM simulator
- Main upgrade: DiT produces photorealistic frames; MDN-RNN produces abstract latent distributions
- Future anchoring (§2.5) is not in this paper — it's comma.ai's novel contribution
- §2.5: "we can train non-causal World Models similar to [2] conditioned on future observations"
- Future anchoring is the key mechanism that makes the Plan Model work — without it, the Plan Model doesn't know what "good" looks like
- F = (f_s, f_e) defines the future horizon: f_s is when anchoring starts, f_e is when it ends
- [11] cited in the introduction as the driving world model precedent
- GAIA-1 shows that video generation on driving data is tractable at scale
- Comma.ai adds the crucial step: actually training a policy inside the WM and deploying it
- GAIA-1 doesn't use future anchoring — it can't produce recovery-pressure trajectories
- §4.2.1: conditioning signals include "vehicle poses" — directly from NWM design
- §2.4: "using the pose as the transition signal ... enables augmenting the Vehicle Model's parameters without needing to retrain the World Model"
- Bar et al. condition on future pose for planning — comma.ai adds future anchoring on top of this
- No future anchoring — cannot produce recovery-pressure trajectories
- No Plan Head — NWM is a world model only, not a plan model
- No on-policy training inside the WM
- Not deployed on real hardware — evaluation is video quality only
- §4.4: "we use a noise level augmentation technique ... A similar technique was proposed in [29]"
- Comma.ai difference: they don't discretise noise levels (GameNGen used a discrete set)
- This is what makes Fig.6 (left) in the paper show stable LPIPS across 40 simulated frames
- Aug prob = 0.3, σ = 0.25 for context frames (anchor frames are never noised)
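The noise-augmentation recipe (aug prob 0.3, σ = 0.25, anchors exempt) can be sketched as follows. Conventions are assumed: τ is drawn logit-normally and applied via the rectified-flow interpolation, and "anchor frames" are taken to be the last n frames of the window for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_context(frames, n_anchor, p_aug=0.3, sigma=0.25):
    """Noise-level augmentation on context frames (a sketch of the §4.4 idea):
    with probability p_aug, push a context frame to a continuous noise level
    τ ~ LogitNormal(0, σ²) — no discrete noise set, unlike GameNGen.
    Anchor frames (here: the last n_anchor) are never noised."""
    out = frames.copy()
    for i in range(len(frames) - n_anchor):            # anchors excluded
        if rng.random() < p_aug:
            tau = 1.0 / (1.0 + np.exp(-rng.normal(0.0, sigma)))
            eps = rng.normal(size=frames[i].shape)
            out[i] = (1.0 - tau) * frames[i] + tau * eps   # RF interpolation
    return out

frames = rng.normal(size=(8, 16, 32, 4))      # 8 context latents (16×32×4 each)
aug = augment_context(frames, n_anchor=2)
print(np.array_equal(aug[-2:], frames[-2:]))  # anchors untouched
```

Training on mildly corrupted context is what keeps autoregressive rollouts from compounding their own prediction errors over long horizons.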
Policy architecture + planning
FastViT extractor, temporal transformer, MHP loss, and the information bottleneck — the actual driving policy
- §5: "a small Transformer [31] based temporal model"
- Input: FastViT features over last 2 seconds (at 20 Hz = ~40 frames)
- Output: action logits + 5-hypothesis trajectory plan
- During on-policy training only this temporal model is updated — the feature extractor below it stays frozen (§5.1)
- §5: "a supervised feature extractor based on the FastViT architecture [30]"
- Trained jointly on lane lines, road edges, lead car, ego trajectory — all as auxiliary heads
- Frozen during on-policy training — only the temporal Transformer is updated
- The information bottleneck (§5.2) is applied to FastViT's output before the temporal model
- §4.2.2: "The Plan Head output T uses a Multi-hypothesis Planning loss (MHP) [5] with 5 hypotheses"
- §5: policy trajectory head also uses MHP with 5 hypotheses + Laplace prior
- At inference: pick hypothesis with highest log-weight; take its mean as the predicted trajectory
- During IMPALA rollout: the Plan Model's best hypothesis provides â^wp for the learner
- §1: "End-to-End (E2E) learning ... [4]" — DAVE-2 is the starting point of the E2E tradition
- Comma.ai claims [§1]: "to our knowledge, this is the first work to show how E2E training, without handcrafted features, can be used in a real-world ADAS"
- The implied argument is that DAVE-2 doesn't qualify under this claim — it was a research prototype, not a deployed production ADAS
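A winner-takes-all multi-hypothesis loss with a Laplace likelihood can be sketched as below. This is one common MHP formulation and a hedged stand-in for the paper's exact variant [5]: each of the 5 hypotheses carries a mean trajectory, per-point Laplace scales, and a log-weight; only the best-fitting hypothesis receives the regression loss, and the weights learn to predict the winner.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 10                       # 5 hypotheses, 10 trajectory points

def mhp_loss(means, log_b, logits, target):
    """Winner-takes-all MHP loss with a Laplace likelihood (assumed variant):
    regression gradient flows only to the closest hypothesis, and the mixture
    log-weights are trained to identify that winner."""
    b = np.exp(log_b)
    # Laplace NLL per hypothesis: |x − μ|/b + log(2b), summed over the trajectory
    nll = (np.abs(means - target) / b + np.log(2.0 * b)).sum(axis=1)
    winner = int(np.argmin(nll))
    # Stable log-softmax over hypothesis logits
    log_w = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
    return nll[winner] - log_w[winner], winner

means = rng.normal(size=(K, T))    # K candidate trajectories
log_b = np.zeros((K, T))           # Laplace scales (log-space)
logits = rng.normal(size=K)        # hypothesis log-weights
target = rng.normal(size=T)        # ground-truth trajectory

loss, winner = mhp_loss(means, log_b, logits, target)

# Inference: pick the hypothesis with the highest log-weight, use its mean.
best = int(np.argmax(logits))
plan = means[best]
```

The multimodality matters for driving: at a fork, averaging "go left" and "go right" into one trajectory would cut the gore point, while separate hypotheses keep both options coherent.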
The comma.ai paper itself
arXiv:2504.19077 — now every citation maps to something you've read. Read in section order.
- The distinction between state space S and observation space O (§2.1) — policy only sees images, not full state
- The Vehicle Model (§2.3) forward and inverse: forward gives next pose from action; inverse gives action from trajectory
- Future anchoring (§2.5): F = (f_s, f_e) where f_s > T — anchor is always in the future
- "Recovery pressure" (§2.5) — with future anchoring the model learns to recover from bad states
- Reprojective simulator: 24/24 convergence (good) — but this is partly because of shortcut learning
- WM simulator: 24/24 convergence too — but without the cheating
- Field results: WM policy has 52.49% engaged distance vs reprojective's 48.10%
- The WM's advantage grows with deployment time — no shortcut features to exploit
- 15 Euler steps, Δτ = 1/15 (τ goes 1→0)
- Each step: predict velocity v = w(o_τ, p, τ); update o_τ -= Δτ·v
- After sampling o_T: shift context window, append o_T, repeat for T+1
- KV-caching enabled by the frame-wise causal mask
- 250M → 500M → 1B: LPIPS improves (lower = better, baseline 0.148 from VAE compression)
- 100k → 200k → 400k segments: LPIPS improves — both scale directions matter
- 500M on 400k is the default for all experiments
- "First work to show E2E training, without handcrafted features, used in a real-world ADAS"
- "First use of a world model simulator for on-policy training of a policy deployed in the real world"
- Both claims are about real deployment, not just simulation results
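The forward/inverse Vehicle Model pair (§2.3) can be illustrated with a kinematic bicycle model — a hedged stand-in, since the paper does not specify its VM; wheelbase, speed, and the 20 Hz step are all assumed parameters.

```python
import numpy as np

L_WHEELBASE, DT, V = 2.7, 0.05, 20.0   # assumed: wheelbase (m), 20 Hz step, speed (m/s)

def vm_forward(pose, steer):
    """Forward Vehicle Model: next 2-D pose (x, y, ψ) from a steering action,
    via a kinematic bicycle model (stand-in for the paper's §2.3 VM)."""
    x, y, psi = pose
    psi_next = psi + V / L_WHEELBASE * np.tan(steer) * DT
    return np.array([x + V * np.cos(psi) * DT,
                     y + V * np.sin(psi) * DT,
                     psi_next])

def vm_inverse(pose, pose_next):
    """Inverse Vehicle Model: recover the steering action that takes
    pose → pose_next under the same bicycle model."""
    dpsi = pose_next[2] - pose[2]
    return np.arctan(dpsi * L_WHEELBASE / (V * DT))

pose = np.array([0.0, 0.0, 0.0])
nxt = vm_forward(pose, steer=0.1)
print(vm_inverse(pose, nxt))   # recovers ~0.1
```

The forward map turns policy actions into the pose-transition signal the WM consumes; the inverse map turns a planned trajectory back into actions — which is also why augmenting VM parameters never requires retraining the WM.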
Reuse guide — comma.ai components
What is open, where to find it, and how to use it in your own project
- Dataset: github.com/commaai/comma2k19
- VAE weights: stabilityai/sd-vae-ft-mse on HuggingFace — load with AutoencoderKL.from_pretrained(...)
- Driving stack: github.com/commaai/openpilot — MIT licence; the model runs in the selfdrive/modeld/ service
- Policy weights: supercombo.onnx (publicly downloadable) at selfdrive/modeld/models/supercombo.onnx — run with onnxruntime.InferenceSession
- GNSS processing: github.com/commaai/laika
- DiT reference implementation: facebookresearch/DiT (official PyTorch)
Suggested 4-week reading schedule
~10 hrs/week. Designed so each week ends with actionable understanding you can implement.