# NVIDIA DGX Spark: When Benchmark Numbers Meet Production Reality

**A 6-Day Deep Dive into Real-World ML Performance**

<div class="callout" data-callout="success">
<div class="callout-title">Follow-Up Available</div>
<div class="callout-content">

**Update (Oct 28):** I found the root cause and solutions! Most of the issues documented here were CUDA version mismatches, not hardware problems. Read the follow-up for the 3.6x performance breakthrough and complete solutions.

</div>
</div>

**→ [[AI Systems & Architecture/dgx-spark-week-one-finding-the-right-stack|Week One Update: Finding the Right Stack]] - The solution and 3.6x performance breakthrough**

> **Update (Oct 27):** After posting this on Hacker News, I received excellent technical feedback that revealed gaps in my testing and some overclaimed conclusions. I've updated the article with corrections, additional tests, and acknowledgments. Thanks to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr for the constructive criticism. This is what good technical discussion looks like.

---

NVIDIA recently published [benchmarks showcasing the DGX Spark](https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/): 82,739 tokens/second for fine-tuning, sub-1% accuracy degradation with FP4, and impressive inference throughput. After spending 6+ days running intensive ML workloads on a DGX Spark (training multiple models from scratch, fine-tuning with LoRA, and benchmarking inference), I can tell you the real story.

**The short version:** NVIDIA's numbers are technically accurate. But they don't tell you about the FP16 precision issues, memory fragmentation requiring hard reboots, or the 15 hours I spent debugging "training failures" that turned out to be inference bugs.

This is the post I wish I'd read before diving in.

## What NVIDIA Showed Us

NVIDIA's blog highlights some impressive numbers:

**Fine-Tuning Performance:**
- Llama 3.2 3B: **82,739 tokens/sec** (full fine-tuning, BF16)
- Llama 3.1 8B: **53,657 tokens/sec** (LoRA, BF16)
- Llama 3.3 70B: **5,079 tokens/sec** (QLoRA, FP4)

**Inference Performance:**
- Qwen3 14B: **5,928 tokens/sec** prompt processing, 22.71 tokens/sec generation
- GPT-OSS-20B: **82.74 tokens/sec** generation

**Key Claims:**
- 1 petaflop of FP4 compute
- Less than 1% accuracy degradation with FP4
- 273 GB/sec memory bandwidth
- Support for 128GB+ models locally

Impressive on paper. Let's see how it holds up.

## My Testing Environment

**Hardware:**
- DGX Spark (ARM64 architecture)
- GB10 GPU (Blackwell generation, unified memory)
- Driver 580.95.05
- CUDA 13.0
- Ubuntu 24.04.3 LTS

**Software:**
- PyTorch 2.5.0 (NVIDIA container: nvcr.io/nvidia/pytorch:24.10-py3)
- Ollama 0.3.9 for inference
- Transformers 4.44.0

**Workloads:**
1. **Inference Benchmark:** Phi-3.5-mini-instruct (3.8B params) via Ollama
2. **Fine-Tuning:** 7 LoRA experiments on Gemma-3-4b-it for medical Q&A (10,000 examples)
3. **Training:** NanoChat project (125M param models from scratch)

**Duration:** 6+ consecutive days of ML work
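Before starting long jobs, a quick sanity check inside the container confirms that PyTorch actually sees the GB10, its compute capability, and BF16 support. This is a minimal sketch, not one of my original scripts:

```python
# Minimal environment sanity check (a sketch, not from my original scripts).
# Run inside the nvcr.io/nvidia/pytorch:24.10-py3 container.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # Blackwell GB10 -> sm_121
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    # GB10 uses unified memory: CPU and GPU share the 128 GB pool, so this
    # "total memory" figure is the shared pool, not dedicated VRAM.
    props = torch.cuda.get_device_properties(0)
    print(f"Total memory: {props.total_memory / 1e9:.0f} GB")
```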
## The Results: What Matches

### ✅ Training Performance is Real (When It Works)

My Gemma-3-4b-it LoRA fine-tuning achieved speeds comparable to NVIDIA's benchmarks. Training completed in 10-12 hours for 3 epochs on 10,000 examples with batch size 4, right in line with NVIDIA's Llama 3.1 8B numbers.

**Training configuration:**

```python
- Base Model: Gemma-3-4b-it (4B parameters)
- Method: LoRA (rank 16, alpha 32)
- Precision: BF16 mixed precision
- Batch Size: 4
- Learning Rate: 2e-5
- Epochs: 3
- Dataset: 10,000 medical Q&A pairs
```

**Results across 7 experiments:**
- Experiments 1-5 (baseline): **70% accuracy**
- Experiment 7 (optimized): **82% accuracy**
- Experiment 8 (SFTTrainer): **84% accuracy**

All models trained successfully with smooth loss curves. Training throughput was excellent.

**Verdict:** ✅ NVIDIA's training performance claims are accurate.

### ✅ Inference Speed Scales as Expected

I benchmarked Phi-3.5-mini-instruct (3.8B params, Q4_K_M quantization) via Ollama:

```
Prompt Type              Tokens/Second
-----------------------------------------
Short (20 tokens)        83.25
Medium (34 tokens)       79.74
Long (113 tokens)        78.47
-----------------------------------------
Average                  ~80 tokens/sec
```

NVIDIA showed 22.71 tokens/sec for Qwen3 14B (nearly 4x larger). My 3.8B model at 80 tokens/sec is roughly 3.6x faster, which tracks with the size difference.

**Verdict:** ✅ Inference speed scales proportionally with model size.

### ✅ 4-Bit Quantization Works

NVIDIA claims "less than 1% accuracy degradation" with FP4. I used Q4_K_M (4-bit) quantization extensively via Ollama, and the model quality was excellent (coherent, contextually appropriate responses with no noticeable degradation).

**Verdict:** ✅ 4-bit quantization is production-viable.

## The Reality: What They Didn't Tell You

### ⚠️ FP16 GPU Inference Has Numerical Issues

**Update:** This section has been revised based on community feedback. My original title "GPU Inference is Fundamentally Broken" was overstated. The issue appears to be FP16-specific.

Here's what happened. After training my first model (Experiment 1), I loaded it for evaluation:

```python
model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float16,  # FP16 for inference
    device_map={"": 0}          # GPU
)

# Generate text
output = model.generate(...)
```

**Result:** Empty responses. PAD tokens. Sometimes inf/nan errors crashing CUDA.

I assumed my training had failed. I retrained with different hyperparameters (Experiment 2). Same result. Experiments 3, 4, and 5: all "failed." I spent 15+ hours debugging my training code, convinced I was doing something wrong.

Then I tried this:

```python
model = AutoModelForCausalLM.from_pretrained(
    "./models/exp-001/final",
    torch_dtype=torch.float32,  # CPU uses FP32
    device_map='cpu'            # CPU instead of GPU
)

# Generate text
output = model.generate(...)
```

**Result:** Perfect. Coherent, relevant medical responses. The model worked beautifully.

**All 7 experiments had trained successfully.** The training loss decreased smoothly. The models learned. But FP16 GPU inference was broken.

### What I Should Have Tested

**HN user `enum` asked the critical question:** "Does it work if you change to torch.bfloat16?"

I trained with BF16 (which worked perfectly) but tested inference with FP16 (which failed). **I never tested BF16 inference.**

This is a significant gap in my testing. The possibilities are:
- **FP16 inference is specifically broken** (numerical instability on this hardware)
- **BF16 inference might work fine** (matching training precision)

I can't access the original trained models to retest this now, but it's the obvious next experiment.
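If I still had those checkpoints, the retest would be a one-line change: load the model in BF16 (matching the training precision) and compare the output against the CPU/FP32 baseline. Here's a minimal sketch of that experiment; I haven't run it on this hardware, and the prompt is just a placeholder:

```python
# Sketch of the untested experiment: BF16 GPU inference, matching the BF16
# training precision. The checkpoint path is from the article; the prompt is
# a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./models/exp-001/final"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # BF16 instead of FP16
    device_map={"": 0},          # GPU
)

inputs = tokenizer("What are the symptoms of anemia?", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# If this produces coherent text while the FP16 run returns PAD tokens or
# inf/nan, the problem is FP16-specific rather than "GPU inference is broken."
```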
### The Pattern (Revised)

- ✅ **Training on GPU (BF16):** Works perfectly
- ✅ **Inference on CPU (FP32):** Works perfectly
- ❌ **Inference on GPU (FP16):** Produces inf/nan, empty responses, or crashes
- ❓ **Inference on GPU (BF16):** **Not tested yet** (likely works based on community feedback)
- ✅ **Inference via Ollama:** Works (see next section)

**My mistake:** Claiming "GPU inference is fundamentally broken" without testing all precision combinations. The issue is likely FP16-specific, not a fundamental GPU problem.

For PyTorch/Transformers users: **stick to BF16 for both training and inference**, or use CPU/Ollama for evaluation.

### ❌ Memory Fragmentation Causes System Hangs

During Experiment 6, my training script was humming along beautifully. Loss decreasing, checkpoints saving, everything looked perfect. Then at 88% complete (7.5 hours in), the system froze. Hard. No response to keyboard, SSH connection dead, GPU stuck. Hard reboot required. Progress lost.

**The issue:** GPU memory fragmentation during long-running training causes system-level instability.

**Required mitigations for all subsequent training:**

```python
# Critical: Clear GPU cache regularly
if step % 50 == 0:
    torch.cuda.empty_cache()

# Critical: Limit training duration
if time.time() - start_time > 2.5 * 3600:  # 2.5 hours
    logging.info("Approaching safety limit - saving and exiting")
    save_checkpoint(model, optimizer, step)
    break

# Critical: Checkpoint frequently (but not too frequently)
if step % 200 == 0:  # every 200 steps, not 100 - more frequent saves worsened fragmentation
    save_checkpoint(model, optimizer, step)

# Monitor GPU memory
if step % 100 == 0:
    allocated = torch.cuda.memory_allocated() / 1e9  # GB
    if allocated > 75:  # >75 GB on the 128 GB system
        logging.warning("High GPU memory - fragmentation risk")
```

After implementing these mitigations, I successfully trained Experiments 7 and 8 to completion. But the limitation is real: **maximum 2-3 hour training sessions** before restarting (a sketch of the checkpoint-resume pattern appears at the end of this section).

NVIDIA's benchmarks don't mention this constraint. Their tests were likely short-duration runs that didn't expose the fragmentation issue.

**Community note (from `eadwu`):** This might be a Linux kernel memory management issue rather than a pure hardware limitation. Similar issues occur on WSL and can be mitigated with memory compaction services. Worth investigating kernel-level solutions.
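Working within the 2-3 hour window means every run has to be resumable. Here's a minimal sketch of the chunked-session pattern, assuming generic `model`, `optimizer`, `dataloader`, and `train_step` objects and a hypothetical checkpoint path; it isn't my exact training loop, and it omits details like restoring the data order mid-epoch:

```python
# Sketch of chunked training sessions with resumable checkpoints.
# `model`, `optimizer`, `dataloader`, and `train_step` are placeholders for a
# standard PyTorch setup; the checkpoint path is hypothetical.
import os
import time
import torch

CKPT_PATH = "checkpoints/session.pt"
SESSION_LIMIT_S = 2.5 * 3600  # stay under the fragmentation window


def save_checkpoint(model, optimizer, step, path=CKPT_PATH):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)


def load_checkpoint(model, optimizer, path=CKPT_PATH):
    if not os.path.exists(path):
        return 0  # fresh run
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1


def train_session(model, optimizer, dataloader, train_step):
    start = time.time()
    step = load_checkpoint(model, optimizer)
    for batch in dataloader:  # skipping already-seen batches is omitted for brevity
        train_step(model, optimizer, batch)  # one forward/backward/update
        step += 1
        if step % 50 == 0:
            torch.cuda.empty_cache()  # mitigate fragmentation
        if step % 200 == 0:
            save_checkpoint(model, optimizer, step)
        if time.time() - start > SESSION_LIMIT_S:
            save_checkpoint(model, optimizer, step)
            break  # exit; the next run resumes from the saved step
```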
### ⚠️ llama.cpp: My Experience vs. Official Benchmarks

I tried running llama.cpp directly (without Ollama's wrapper):

```bash
./llama-cli \
  --model Phi-3.5-mini-instruct-Q4_K_M.gguf \
  --prompt "Explain what a neural network is in one sentence." \
  --n-predict 512
```

**Result:** Empty responses on all test prompts. Zero tokens generated.

**However:** HN user `veber-alex` pointed out that [official llama.cpp benchmarks](https://github.com/ggml-org/llama.cpp/discussions/16578) show the DGX Spark running multiple models successfully, including:
- Llama models working
- Qwen models working
- GPU acceleration confirmed
- Multiple quantization levels tested

**My assessment:** I likely hit a version mismatch, build configuration issue, or Phi-3.5-specific problem. The official benchmarks prove llama.cpp works on this hardware. My experience was real, but not representative of llama.cpp's capabilities on DGX Spark.

**Recommendation:** Build llama.cpp from source for best results, or use Ollama (which bundles a tested version).

## Root Cause: Blackwell + ARM64 Combination

**Update:** My original section claimed "ARM64 + CUDA support is brand new." HN user `bradfa` correctly pointed out: "Aarch64 and CUDA has been a thing for many years on Jetson boards." I need to be more precise.

The issues exist because of three factors converging:

### 1. ARM64 Architecture (Mature, But...)
- ARM64 + CUDA: **Mature** (Jetson boards since ~2015)
- Most ML tools primarily tested on x86_64
- PyTorch available via NVIDIA Docker containers (recommended path)
- Some Python packages may need building from source

### 2. Blackwell GB10 GPU (New)
- Newest GPU generation (sm_121 compute capability)
- **Unified memory architecture** (CPU and GPU share 128GB RAM)
- Limited real-world production testing vs. established datacenter GPUs
- Driver maturity: 6-12 months behind older architectures

### 3. CUDA 13.0 (Latest)
- Released ~6 months ago
- Works well with established workflows
- Requires PyTorch 2.5+ (cutting edge)

**The specific combination that's bleeding edge:**

```
Blackwell GB10 + ARM64 + CUDA 13.0 = New Territory
        ↓
Limited production testing of this specific stack
        ↓
Edge cases in numerical precision (FP16 inference)
        ↓
Memory management challenges (training)
```

NVIDIA's benchmarks likely use:
- TensorRT-LLM (not standard PyTorch)
- Short-duration controlled tests
- Configurations that avoid the fragmentation trigger
- BF16 consistently (not mixed FP16/BF16)

## Precision Deep Dive (Updated)

Here's what actually works vs. what's broken:

| Use Case | Precision | Device | Status | Performance |
|----------|-----------|--------|--------|-------------|
| Training | BF16 | GPU | ✅ Works | Excellent |
| Training | FP32 | GPU | ✅ Works | Slower but stable |
| Inference | FP16 | GPU | ❌ Broken | inf/nan errors |
| Inference | BF16 | GPU | ❓ Not Tested | Unknown (likely works) |
| Inference | FP32 | CPU | ✅ Works | Slower, reliable |
| Inference | Q4_K_M | GPU (Ollama) | ✅ Works | 80 tok/sec |

**The key insight:** FP16 GPU inference has issues. BF16 likely works. CPU works fine.

## Ollama: GPU-Accelerated Inference That Works

**Update:** My original article stated "Inference via Ollama: Works (CPU-optimized backend)." HN user `jasonjmcghee` asked: "From the article it sounds like ollama runs cpu inference not GPU inference. Is that the case for you?"

I was wrong. **Ollama IS using GPU.** After posting, I verified:

```bash
# During Ollama inference
nvidia-smi dmon -s u

# gpu    sm   mem   enc   dec   jpg   ofa
# Idx     %     %     %     %     %     %
    0    96     0     0     0     0     0
```

**96% GPU utilization during inference.** It's definitely using the GPU.

**Why memory shows 0%:** The GB10 has a unified memory architecture where CPU and GPU share the same 128GB RAM pool. Traditional discrete GPU memory metrics don't apply here.

✅ **Inference via Ollama:**

```bash
# Stable, reliable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M

# Results: 23 tokens/sec, GPU-accelerated
```

Ollama uses GPU acceleration with optimized quantized inference. The combination of quantization (Q4_K_M) and GPU acceleration gives production-ready performance.
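For anyone reproducing the throughput numbers, Ollama's HTTP API exposes the raw counters directly: `/api/generate` returns `eval_count` and `eval_duration` (plus the prompt-side equivalents), which is all you need to compute tokens/sec. Below is a minimal sketch of a benchmark call; the options mirror the configuration in the appendix, but this isn't my exact harness:

```python
# Sketch: measure generation and prompt-processing tokens/sec via Ollama's
# /api/generate endpoint. Durations are reported in nanoseconds.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "phi3.5:3.8b-mini-instruct-q4_K_M"


def benchmark(prompt, model=MODEL):
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7, "top_p": 0.9,
                    "num_predict": 512, "repeat_penalty": 1.1},
    }, timeout=300)
    resp.raise_for_status()
    data = resp.json()
    gen_tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    prompt_tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
    return gen_tps, prompt_tps


gen_tps, prompt_tps = benchmark("Explain what a neural network is in one sentence.")
print(f"generation: {gen_tps:.1f} tok/s, prompt processing: {prompt_tps:.1f} tok/s")
```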
## What This Means for Production

### What's Production-Ready

✅ **Training with workarounds:**

```python
# Implement these in all training scripts:
# 1. torch.cuda.empty_cache() every 50 steps
# 2. Maximum 2-3 hour sessions with checkpointing
# 3. Monitor GPU memory continuously
# 4. Use BF16 consistently (training AND inference)

# Result: Successfully trained 7+ models
```

✅ **Inference via Ollama:**

```bash
# GPU-accelerated, stable, good performance
ollama run phi3.5:3.8b-mini-instruct-q4_K_M

# 23 tok/s, production-ready
```

✅ **CPU inference for evaluation:**

```python
# Slow but 100% reliable
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    device_map='cpu'
)

# Perfect for batch evaluation
```

### What Needs More Testing

⚠️ **GPU inference (PyTorch/Transformers):**
- FP16 has numerical instability (confirmed)
- BF16 not tested yet (likely works)
- **Test this yourself before production use**

⚠️ **Multi-hour unattended training:**
- Memory fragmentation risk after 3-8 hours
- Requires babysitting or session limits
- **Train in chunks with checkpointing**

### What About TensorRT-LLM?

**Update:** HN user `renaudr` asked: "Have you tried to run GPT-OSS-120b using TRT-LLM?"

I haven't. NVIDIA's benchmarks likely used TensorRT-LLM, their optimized inference engine. I focused on standard PyTorch/Transformers workflows (what most practitioners use).

If you need production GPU inference performance matching NVIDIA's benchmarks, **TensorRT-LLM is the recommended path**. It likely avoids the FP16 precision issues I encountered. That's a separate deep-dive I haven't done yet.

## My Actual Numbers vs. NVIDIA's Claims

| Metric | NVIDIA | My Experience | Match? |
|--------|---------|---------------|--------|
| Training speed | 53,657-82,739 tok/sec | Comparable relative speed | ✅ |
| Inference speed | 22-82 tok/sec | 23 tok/sec (Ollama, GPU) | ✅ |
| 4-bit quality | <1% degradation | No noticeable degradation | ✅ |
| Training precision | BF16 | BF16 works perfectly | ✅ |
| **GPU inference** | **Implied working** | **FP16 broken, BF16 untested** | ⚠️ |
| **Training stability** | **Not mentioned** | **2-3 hour limit** | ❌ |
| **Memory management** | **Not mentioned** | **Manual cache clearing** | ❌ |

## Recommendations

### If You're Using DGX Spark

**DO:**
- Implement GPU cache clearing (every 50 steps)
- Limit training sessions to 2-3 hours
- Checkpoint frequently (every 200+ steps)
- **Use BF16 consistently** for training and inference
- Test BF16 inference before claiming GPU inference is broken
- Use Ollama for production inference (GPU-accelerated)
- Build llama.cpp from source if needed
- Monitor NVIDIA driver updates

**DON'T:**
- Use FP16 for GPU inference (numerical instability)
- Plan 8+ hour unattended training
- Skip the workarounds
- Assume your experience generalizes without more testing

### If You're Considering DGX Spark

**It's worth it if:**
- You're an experienced ML engineer
- You need local training capabilities
- You can implement stability workarounds
- Ollama or TRT-LLM inference meets your needs
- You're willing to monitor driver updates

**Look elsewhere if:**
- You need plug-and-play GPU inference
- You expect production-ready performance out of the box
- You need long unattended training runs
- You don't have time for workarounds

## The Silver Lining

Despite the issues, I successfully:

✅ Fine-tuned 7 models with 70-84% accuracy on medical Q&A
✅ Achieved 23 tokens/sec GPU-accelerated inference via Ollama
✅ Validated NVIDIA's training performance claims
✅ Built a production-ready medical chatbot
✅ Documented all workarounds for future users

**The hardware is powerful.** But it requires expert-level knowledge to navigate current limitations.
## Lessons Learned

### 1. Test All Precision Combinations

When FP16 inference failed, I should have tested BF16 inference (matching training precision) before claiming GPU inference was broken. **Complete your testing matrix before drawing conclusions.**

### 2. Hardware Issues Can Look Like Software Bugs

I spent 15+ hours debugging my "broken" training code. The training was fine; FP16 inference had numerical issues. **Check your assumptions systematically.**

### 3. Benchmark Numbers Are True But Incomplete

NVIDIA's numbers are accurate for their test conditions. But they don't reveal precision constraints, stability limits, or workarounds needed for production. **Context matters.**

### 4. Bleeding Edge Means Trade-offs

Blackwell + ARM64 + CUDA 13.0 is cutting edge. That means bugs, limitations, and workarounds. **Wait 6-12 months if you need stability, or be ready to troubleshoot.**

### 5. Community Feedback is Invaluable

The HN community caught my incomplete testing, overclaimed conclusions, and factual errors. **Ship early, iterate publicly, accept criticism gracefully.**

## Community Acknowledgments

This article improved significantly thanks to Hacker News feedback. Special thanks to:

- **enum** - Caught the BF16 vs FP16 testing gap, shared PyTorch installation insights
- **veber-alex** - Provided llama.cpp official benchmarks link
- **bradfa** - Corrected ARM64+CUDA maturity claims (Jetson history)
- **furyofantares** - Questioned incomplete testing and overclaimed conclusions
- **stuckinhell** - Shared contrary experience (their DGX inference works fine)
- **jasonjmcghee** - Asked about Ollama CPU vs GPU usage
- **eadwu** - Explained memory fragmentation might be kernel-level
- **renaudr** - Suggested TRT-LLM testing

Technical discussion like this makes everyone smarter. I'm grateful for the pushback.

## What I'd Tell NVIDIA

If NVIDIA engineers are reading this, here's constructive feedback:

**Document these in your communications:**

1. **Precision guidance:**
   - FP16 inference behavior on Blackwell+ARM64
   - BF16 consistency recommendation (training and inference)
   - When to use TRT-LLM vs. PyTorch

2. **Stability constraints:**
   - Training session duration considerations
   - Memory management best practices
   - GPU cache clearing recommendations

3. **Inference setup:**
   - Which backends work out-of-box (TRT-LLM vs PyTorch)
   - Unified memory architecture implications
   - Ollama as a tested inference path

4. **Real-world examples:**
   - Long-running training strategies
   - Production inference patterns
   - Monitoring and mitigation tactics

Your benchmarks are accurate. They'd be more valuable with context about setup, limitations, and recommended practices.

## Conclusion: Powerful But Not Plug-and-Play

The NVIDIA DGX Spark delivers on raw performance when you work within its constraints. Training throughput matches the benchmarks. Inference speed is excellent (via Ollama). The hardware potential is real.

But it's not plug-and-play. FP16 GPU inference has issues. Memory fragmentation limits training sessions. You need expertise to navigate the current limitations.

**NVIDIA's benchmarks are technically true.** They're just not the whole truth.

For ML engineers willing to implement workarounds and test thoroughly, the DGX Spark is a powerful tool. For teams expecting production-ready performance out of the box, the maturity isn't there yet (especially for standard PyTorch workflows on ARM64).

**My verdict:** Cautiously recommended for experts. Wait 6-12 months if you need stability. And test BF16 inference sooner than I did.
---

## Appendix: Full Experimental Data

### Inference Benchmark Details

**Test Model:** Phi-3.5-mini-instruct (3.8B parameters)
**Quantization:** Q4_K_M (4-bit)
**Engine:** Ollama 0.3.9 (llama.cpp backend)
**Acceleration:** GPU (96% utilization confirmed)

**Configuration:**

```json
{
  "temperature": 0.7,
  "top_p": 0.9,
  "max_tokens": 512,
  "repeat_penalty": 1.1,
  "num_runs_per_prompt": 3
}
```

**Results by Prompt Category:**

| Category | Prompt Length | Avg Response Length | Tokens/Sec |
|----------|---------------|---------------------|------------|
| Short | 20 tokens | 46 tokens | 83.25 |
| Medium | 34 tokens | 456 tokens | 79.74 |
| Long | 113 tokens | 512 tokens | 78.47 |

### Fine-Tuning Experiment Summary

**Base Model:** google/gemma-3-4b-it
**Dataset:** PubMedQA artificial (10,000 medical Q&A pairs)
**Method:** LoRA fine-tuning

**Configuration:**

```python
{
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "learning_rate": 2e-5,
    "batch_size": 4,
    "epochs": 3,
    "precision": "bf16",
    "gradient_accumulation_steps": 2
}
```

**Results:**

| Experiment | Training Time | Final Loss | FP16 GPU Test | CPU Test | Accuracy |
|------------|--------------|------------|---------------|----------|----------|
| Exp 1-5 | ~3 hours each | Decreased | ❌ | ✅ | 70% |
| Exp 6 | 7.5h (hung) | N/A | N/A | N/A | N/A |
| Exp 7 | ~12 hours | 1.82 | ❌ | ✅ | 82% |
| Exp 8 | ~10.5 hours | 1.61 | ❌ | ✅ | **84%** |

**Best Model (Experiment 8):**
- Used SFTTrainer instead of base Trainer
- Text-based dataset (no pre-tokenization)
- Dynamic padding
- +14% improvement over baseline

### System Configuration

```yaml
Hardware:
  Platform: NVIDIA DGX Spark
  Architecture: ARM64 (aarch64)
  GPU: GB10 (Blackwell, unified memory)
  GPU Driver: 580.95.05
  CUDA: 13.0
  Memory: 128 GB unified (CPU/GPU share)

Software:
  OS: Ubuntu 24.04.3 LTS
  Kernel: 6.11.0-1016-nvidia
  PyTorch: 2.5.0a0+e000cf0ad9.nv24.10 (Docker)
  Container: nvcr.io/nvidia/pytorch:24.10-py3
  Transformers: 4.44.0
  Ollama: 0.3.9
  CUDA Runtime: 13.0
  cuDNN: 9.5.1

Stability Mitigations:
  - GPU cache clearing: Every 50 steps
  - Checkpoint interval: 200 steps
  - Max session duration: 2.5 hours
  - Memory monitoring: Every 100 steps
  - Inference: BF16 recommended (FP16 has issues)
```

---

<div class="callout" data-callout="tip">
<div class="callout-title">Related Articles</div>
<div class="callout-content">

**Follow-up:** [[AI Systems & Architecture/dgx-spark-week-one-finding-the-right-stack|Week One Update: Finding the Right Stack]] - The root cause and 3.6x performance breakthrough!

**More from the DGX Lab Chronicles:**
- [[Practical Applications/dgx-lab-intelligent-gateway-heuristics-vs-ml-day-1|Day 1: When Simple Heuristics Beat ML by 95,000x]]
- [[Practical Applications/dgx-lab-supercharged-bashrc-ml-workflows-day-2|Day 2: Supercharge Your Shell with 50+ ML Productivity Aliases]]
- [[Practical Applications/dgx-lab-building-complete-rag-infrastructure-day-3|Day 3: Building a Complete RAG Infrastructure]]

**Production ML Resources:**
- [[Practical Applications/building-production-ml-workspace-part-1-structure|Building a Production ML Workspace Series]]
- [[Emerging Trends/the-hidden-crisis-in-llm-fine-tuning-catastrophic-forgetting|The Hidden Crisis in LLM Fine-Tuning]]

</div>
</div>

---

**About This Series:** I'm documenting my journey building production ML systems on an NVIDIA DGX Spark. The wins, the losses, the mistakes, and the corrections. This is Day 4 of the DGX Lab Chronicles.
**Want to follow along?** Every article shares real code, actual performance data, and honest assessments (including when I get things wrong).

**Found errors or have corrections?** The HN community made this article significantly better. Feel free to reach out.

---

**Published:** October 26, 2025
**Updated:** October 27, 2025
**Series:** DGX Lab Chronicles (Day 4)
**Reading Time:** 12 minutes