Tesla’s Full Self-Driving (FSD) Chip is a custom-designed ASIC optimized for vision-based autonomous driving. Here’s a breakdown of its microarchitecture, from silicon to software:

1. Key Specifications (HW3/HW4)

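Metric	FSD Chip (HW3)	FSD Chip (HW4)
Process Node	14nm (Samsung)	7nm (Samsung, reported)
Die Size	260 mm²	~200 mm² (est.)
Transistors	6B	10B+ (est.)
Peak TOPS	~73 TOPS per chip (INT8); 144 TOPS for the dual-chip computer	256 TOPS (INT8, est.)
Power Consumption	~36W per chip (72W for the dual-chip computer)	45W (est.)
Cameras Supported	8x 1.2MP	Up to 12x 5MP (reported)

HW4 figures are unofficial estimates or teardown reports; Tesla has not published a full HW4 specification.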

2. Block Diagram & Core Components

┌─────────────────────────────────────────────────────┐
│                   Tesla FSD Chip                    │
├───────────────────┬─────────────────┬───────────────┤
│     Dual NPUs     │       GPU       │   CPU Cores   │
│ (Neural Processor)│ (Mali G71 MP12) │   (ARM A72)   │
├───────────────────┼─────────────────┼───────────────┤
│ - 96x96 MAC array │ - ~600 GFLOPS   │ - 12x A72     │
│ - 2GHz clock      │ - 1GHz clock    │ - 2.2GHz      │
│ - 32MB SRAM each  │                 │ - 3 clusters  │
└───────────────────┴─────────────────┴───────────────┘

3. Neural Processing Unit (NPU) – The Secret Sauce

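Array Structure:
- 96x96 MAC (Multiply-Accumulate) units per NPU (x2 per HW3 chip), clocked at 2GHz.
- Optimized for 8-bit integer (INT8) multiplies with 32-bit accumulation (reportedly about 95% of Tesla’s NN workloads; this share is not officially documented).
On-Chip Memory:
- 32MB SRAM per NPU (vs. an estimated 4–8MB in competing chips like NVIDIA Xavier).
- Keeps weights and activations on-chip so most layers avoid DRAM round trips (critical for real-time inference); the often-quoted 5x latency reduction is not a published measurement.
Custom ISA:
- A small, fixed instruction set built around convolution, matrix-multiply, and DMA operations.
- Well suited to Tesla’s HydraNet multi-task networks (simultaneous detection/lane prediction), which are a software architecture rather than an ISA feature.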
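The peak-throughput numbers can be sanity-checked from the array size and clock given above. The sketch below assumes the common convention that one multiply-accumulate counts as two operations, and that the FSD computer carries two chips; neither assumption is spelled out in the section itself.

```python
# Peak INT8 throughput derived from MAC-array size and clock.
# Assumption: one multiply-accumulate counts as 2 operations.

def peak_tops(rows: int, cols: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Peak tera-operations per second for a single MAC array."""
    return rows * cols * ops_per_mac * clock_hz / 1e12

per_npu = peak_tops(96, 96, 2e9)   # one 96x96 array at 2 GHz
per_chip = 2 * per_npu             # two NPUs per chip
per_computer = 2 * per_chip        # assumption: two chips per FSD computer

print(f"per NPU:      {per_npu:.1f} TOPS")
print(f"per chip:     {per_chip:.1f} TOPS")
print(f"per computer: {per_computer:.1f} TOPS")
```

Under these assumptions a single chip lands near 74 TOPS and the two-chip total near 147 TOPS, which is in the same range as the 144 TOPS headline figure.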

4. CPU & GPU Components

CPU:
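- 12x ARM Cortex-A72 (64-bit) cores at 2.2GHz, arranged as three quad-core clusters.
- Lockstep execution is confined to a separate on-chip safety system; system-level redundancy comes from two FSD chips computing independently on the same board. Tesla has not published an ASIL rating.
- Runs general-purpose tasks (sensor polling, CAN bus communication).
GPU:
- ARM Mali G71 MP12 at 1GHz, about 600 GFLOPS (licensed IP rather than a custom design), used for post-processing rather than graphics.
- Handles non-ML tasks like image warping (for multi-camera stitching).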

5. Memory Hierarchy

┌──────────────┐       ┌──────────────┐
│  32MB SRAM   │◄─────►│   Dual NPUs  │ (On-chip)
└──────────────┘       └──────────────┘
       ▲                      
       │  ~68GB/s peak bandwidth
┌──────────────┐              
│  8GB LPDDR4  │ (Off-chip)  
└──────────────┘

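SRAM-First Design: Minimizes external memory access (power-efficient).
No HBM/GDDR on HW3: The chip pairs LPDDR4 with large on-chip SRAM instead of high-bandwidth external memory, trading raw DRAM bandwidth for lower power; HW4 reportedly moves to GDDR6.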
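To see why the SRAM-first design matters, it helps to compare peak compute with peak DRAM bandwidth. The sketch below assumes HW3's LPDDR4 runs at 4266 MT/s on a 128-bit bus and uses a per-chip peak of about 73.7 TOPS; both values are assumptions for illustration, not figures stated in this section.

```python
# Rough roofline-style check: how many operations must be performed per byte
# fetched from DRAM for the NPUs to stay busy?
# Assumptions: LPDDR4 at 4266 MT/s on a 128-bit bus, ~73.7 TOPS peak per chip.

def dram_bandwidth_gbs(transfers_per_s: float, bus_bits: int) -> float:
    """Peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return transfers_per_s * (bus_bits / 8) / 1e9

def min_ops_per_byte(peak_tops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity needed to be compute-bound rather than memory-bound."""
    return (peak_tops * 1e12) / (bandwidth_gbs * 1e9)

bandwidth = dram_bandwidth_gbs(4266e6, 128)
intensity = min_ops_per_byte(73.7, bandwidth)

print(f"peak DRAM bandwidth: {bandwidth:.1f} GB/s")
print(f"ops needed per DRAM byte: {intensity:.0f}")
```

On these numbers the NPUs would need on the order of a thousand operations per byte read from DRAM to stay fully utilized, which is why keeping weights and activations in on-chip SRAM is central to the design.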

6. Software Stack Integration

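Compiler: Custom toolchain converts trained PyTorch models into NPU instruction streams.
Real-Time OS: Lightweight Tesla OS (modified Linux); the <10μs interrupt latency sometimes quoted is not an officially published figure.
HydraNet: A shared backbone feeds many task-specific heads (e.g., traffic light, obstacle, depth estimation); Tesla has described the full Autopilot stack as roughly 48 networks.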
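The multi-task idea behind HydraNet can be illustrated with a toy model: one shared feature extractor runs once per frame, and several lightweight heads reuse its output. Everything below (the `MultiHeadNet` class, the toy backbone, the head names and thresholds) is hypothetical and only sketches the pattern; it is not Tesla's implementation.

```python
from typing import Callable, Dict, List

Features = List[float]

class MultiHeadNet:
    """Toy shared-backbone network: features are computed once, then reused."""

    def __init__(self,
                 backbone: Callable[[List[float]], Features],
                 heads: Dict[str, Callable[[Features], object]]) -> None:
        self.backbone = backbone
        self.heads = heads
        self.backbone_calls = 0  # counts how often the expensive part runs

    def __call__(self, image: List[float]) -> Dict[str, object]:
        self.backbone_calls += 1
        features = self.backbone(image)  # expensive step, done once per frame
        return {name: head(features) for name, head in self.heads.items()}

def toy_backbone(image: List[float]) -> Features:
    """Stand-in for a CNN: returns [mean, max, min] of the pixel values."""
    return [sum(image) / len(image), max(image), min(image)]

heads = {
    "traffic_light": lambda f: f[1] > 0.8,  # bright pixel present
    "obstacle": lambda f: f[0] > 0.5,       # high average intensity
    "depth": lambda f: 1.0 - f[2],          # toy depth proxy
}

net = MultiHeadNet(toy_backbone, heads)
out = net([0.25, 1.0, 0.5])
print(out, "backbone calls:", net.backbone_calls)
```

The point of the pattern is that adding a head costs little compared with running a separate full network per task, since the backbone work is shared.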

7. HW3 vs. HW4 Improvements

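Feature	HW3 (2019)	HW4 (2023)
NPUs	2x 96x96 MAC per chip	3 per chip (reported; array size unpublished)
Camera Inputs	8x 1.2MP (HDR)	Up to 12x 5MP (reported)
Safety	Dual-chip redundancy; no published ASIL rating	No published ASIL rating
Backward Compatible	N/A	Not a drop-in retrofit for HW3 vehicles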

8. Benchmark vs. Competitors

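Chip	TOPS (INT8)	Power	SRAM	Use Case
Tesla FSD HW4	256 (est.)	45W (est.)	32MB (est.)	Vision-only autonomy
NVIDIA Orin	254 (vendor peak)	60W	8MB (est.)	Multi-sensor fusion
Mobileye EyeQ6	~34 (EyeQ6H, vendor peak)	10W (est.)	16MB (est.)	L2+ ADAS

These are vendor peak or estimated values, not results measured under a common benchmark.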

Why Tesla’s NPU Wins:

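Higher TOPS/mm² than general-purpose GPUs, because the silicon is dedicated to vision NNs (the 5x figure sometimes cited is an estimate, not a published measurement).
Little or no external memory access for common ops (e.g., convolutions) when weights and activations fit in on-chip SRAM.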
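A quick way to put the comparison in perspective is to compute performance per watt from the table's own figures. The sketch below uses only the Tesla HW4 and NVIDIA Orin rows as listed (estimated or vendor-peak values), so the result is indicative at best.

```python
# Performance per watt from the comparison table's listed figures.
# Assumption: the table's TOPS and power numbers are taken at face value.

def tops_per_watt(tops: float, watts: float) -> float:
    """Peak INT8 TOPS delivered per watt of board power."""
    return tops / watts

chips = {
    "Tesla FSD HW4": (256, 45),  # (TOPS, watts) as listed, both estimates
    "NVIDIA Orin": (254, 60),
}

efficiency = {name: tops_per_watt(t, w) for name, (t, w) in chips.items()}
ratio = efficiency["Tesla FSD HW4"] / efficiency["NVIDIA Orin"]

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} TOPS/W")
print(f"ratio: {ratio:.2f}x")
```

On these figures the advantage in TOPS/W is roughly 1.3x, which suggests that larger multipliers depend on area-based metrics or on workload-specific measurements rather than on headline peak numbers.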

9. Limitations & Trade-Offs

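No LiDAR Support: No dedicated hardware accelerators for time-of-flight processing; HW4 reportedly reintroduces a radar input on some vehicles.
Fixed-Precision Only: The NPUs compute in integer precision with no FP16/FP32 path (limits future model complexity).
Thermal Constraints: The FSD computer relies on the vehicle's liquid cooling loop; the 45W HW4 figure is an estimate.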

10. The Dojo Connection

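Dojo D1 Chip: A separate training processor rather than a scaled-up FSD NPU (354 compute nodes per die, about 362 TFLOPS in BF16/CFP8).
Training-Inference Path: Models trained in floating point (on Dojo or GPUs) are quantized and compiled before they run on the INT8 FSD NPUs.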
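Because the NPUs are optimized for INT8, a model trained in floating point needs a quantization step before deployment. The sketch below shows symmetric per-tensor quantization, a generic textbook technique chosen for illustration; Tesla's actual toolchain is not public, and the function names here are hypothetical.

```python
from typing import List, Tuple

def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 weights:", q)
print(f"scale: {scale:.4f}, max round-trip error: {max_error:.6f}")
```

The round-trip error is bounded by half the scale factor, which is the precision a network gives up in exchange for cheaper integer arithmetic.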

Key Takeaways

Domain-Specific Design: Tesla’s NPUs are optimized only for camera-based autonomy.
Memory is King: 32MB of SRAM per NPU mitigates the "memory wall" that bottlenecks GPUs.
Vertical Integration: From silicon design (fabricated by Samsung) to software (HydraNet), Tesla controls the stack.

For sensor fusion (LiDAR/radar), more general-purpose platforms remain common, but for vision-only autonomy at scale, Tesla’s ASIC approach is hard to match.

Tesla FSD Chip Microarchitecture: A Deep Dive