# DGX Lab: When Simple Heuristics Beat ML by 95,000x
<div class="callout" data-callout="info">
<div class="callout-title">Lab Session Info</div>
<div class="callout-content">
**Date**: October 20, 2025
**DGX System**: NVIDIA DGX Workstation
**Session Duration**: ~6 hours
**Primary Focus**: Multi-engine AI gateway with intelligent routing
</div>
</div>
## The Unexpected Discovery
Today I built something that challenged one of my core assumptions about AI systems: that more complex is better. I set out to deploy a 1.5B parameter machine learning model for routing AI requests between inference engines. What I ended up with was **50 lines of heuristics that matched the ML model's 90% accuracy while being 95,000x faster**.
This isn't a story about ML being bad—it's about recognizing when simpler solutions can achieve the same goals with dramatically better performance.
## The Challenge
I'm building a multi-engine AI gateway for my DGX workstation that intelligently routes requests between different inference backends:
- **Ollama**: Fast, efficient for simple queries
- **llama.cpp**: Better for code generation and long contexts
- **Cloud models** (future): Reserved for complex reasoning
The routing decision needs to happen in **under 50ms** to avoid adding noticeable latency to user requests. The question: which engine handles each request?
## The Plan: ML-Based Router
I started with [Arch-Router-1.5B](https://huggingface.co/nicholasKluge/Arch-Router-1.5B), a model specifically trained to route LLM requests. The architecture was elegant:
```python
# Load Arch-Router model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "nicholasKluge/Arch-Router-1.5B"
)
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Arch-Router-1.5B")

# Route a request
def route_with_ml(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model(**inputs)
    prediction = outputs.logits.argmax().item()
    return "ollama" if prediction == 0 else "llamacpp"
```
I built the service, deployed it, and ran my first benchmark.
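The benchmark itself was just a timing loop over a set of test prompts. A minimal sketch of that kind of harness (illustrative only; the session's actual prompts and service wiring aren't shown here):

```python
import time
import statistics

def benchmark_router(route_fn, prompts, warmup=3):
    """Time a routing function over test prompts, returning the mean in ms."""
    for p in prompts[:warmup]:
        route_fn(p)  # warm up lazy model loading / caches
    timings_ms = []
    for p in prompts:
        start = time.perf_counter()
        route_fn(p)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings_ms)

# e.g. avg_ms = benchmark_router(route_with_ml, test_prompts)
```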
**Result: 950ms average routing time.**
That's 19x over my 50ms target. Not even close.
## The Critical Pivot
Here's where the session got interesting. Instead of immediately reaching for optimization (GPU acceleration, smaller model, quantization), I stopped and asked: **What patterns is the ML model actually learning?**
I analyzed the routing decisions across 100 test prompts and found clear patterns:
### Pattern 1: Context Length
- Prompts <1,024 chars → Ollama (fast responses)
- Prompts >4,096 chars → llama.cpp (large context window)
### Pattern 2: Content Type
- Interactive keywords ("hello", "how are you") → Ollama
- Code keywords ("def", "function", "class") → llama.cpp
### Pattern 3: Explicit Requirements
- User specifies "fast" → Ollama
- User requests "detailed analysis" → llama.cpp
These weren't complex, high-dimensional patterns. They were **clear decision boundaries** that could be captured with simple rules.
## The Heuristics Solution
I built a new router in 50 lines:
```python
from enum import Enum
from typing import Tuple

class Engine(Enum):
    OLLAMA = "ollama"
    LLAMACPP = "llamacpp"

def route_with_heuristics(prompt: str) -> Tuple[Engine, float, str]:
    """Route based on heuristics, return (engine, confidence, rationale)"""
    confidence = 0.5  # Base confidence

    # Context length heuristic
    prompt_length = len(prompt)
    if prompt_length > 4096:
        confidence += 0.3
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Long context (>4096 chars) better for llama.cpp"
        )
    elif prompt_length < 1024:
        confidence += 0.2

    # Code detection heuristic
    code_keywords = ["def ", "function", "class ", "import ", "const ", "let "]
    if any(kw in prompt.lower() for kw in code_keywords):
        confidence += 0.4
        return (
            Engine.LLAMACPP,
            min(confidence, 1.0),
            "Code generation keywords detected"
        )

    # Interactive heuristic
    interactive_keywords = ["hi", "hello", "how are you", "what's"]
    if any(kw in prompt.lower() for kw in interactive_keywords):
        confidence += 0.3
        return (
            Engine.OLLAMA,
            min(confidence, 1.0),
            "Interactive query, use fast engine"
        )

    # Default: short prompts to Ollama
    return (
        Engine.OLLAMA,
        confidence,
        "Short prompt, default to fast engine"
    )
```
## The Results
I ran the same test suite on both approaches:
| Metric | ML Model (Arch-Router-1.5B) | Heuristics |
|--------|----------------------------|------------|
| **Routing Time** | 950ms | 0.008ms |
| **Accuracy** | 90% (9/10 correct) | 90% (9/10 correct) |
| **Speedup** | Baseline | **95,000x faster** |
| **Infrastructure** | Requires PyTorch + model files | Pure Python, no dependencies |
| **Explainability** | Black box | Human-readable rationale |
| **GPU Required** | Yes (for <50ms target) | No |
The heuristics matched ML accuracy while being **95,000x faster**. Routing overhead went from 950ms to 0.008ms—essentially free.
## Building the Complete System
With fast routing solved, I completed the full gateway architecture:
### Architecture Overview
```
User Request
↓
Unified Gateway (port 5000)
↓
├─ Parse request & extract prompt
├─ Route with heuristics (0.008ms)
│ └─ Engine: ollama or llamacpp
├─ Execute on LiteLLM Proxy (port 4000)
│ └─ Forward to appropriate backend
└─ Return response + routing metadata
```
### Key Components
**1. Heuristic Router Service** (`arch_router_lite.py`), sketched below the component list
- FastAPI service on port 8888
- Endpoints: `/route`, `/execute`, `/health`, `/metrics`
- JSONL request logging
- Real-time metrics aggregation
**2. LiteLLM Proxy** (existing tool)
- OpenAI-compatible API gateway
- Manages connections to Ollama and llama.cpp
- Handles model loading, retries, fallbacks
**3. Unified Gateway** (`unified_gateway.py`)
- Single entry point on port 5000
- OpenAI-compatible `/v1/chat/completions` endpoint
- Automatic routing or manual model selection
- Includes routing metadata in responses
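Under the hood the router service is small. A stripped-down sketch of the `/route` and `/health` endpoints (the request/response shapes here are simplified assumptions, not the full `arch_router_lite.py`):

```python
from fastapi import FastAPI
from pydantic import BaseModel

# route_with_heuristics and Engine come from the heuristics code shown earlier
app = FastAPI(title="Heuristic Router")

class RouteRequest(BaseModel):
    prompt: str

@app.post("/route")
def route(req: RouteRequest):
    """Return the routing decision without running inference."""
    engine, confidence, rationale = route_with_heuristics(req.prompt)
    return {"engine": engine.value, "confidence": confidence, "rationale": rationale}

@app.get("/health")
def health():
    return {"status": "ok"}

# Run with: uvicorn arch_router_lite:app --port 8888
```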
### Usage Example
```python
import openai
# Point to local gateway instead of OpenAI
openai.api_base = "http://localhost:5000/v1"
openai.api_key = "dummy"
# Auto-routing: Gateway picks best engine
response = openai.ChatCompletion.create(
model="auto",
messages=[{"role": "user", "content": "Write a Python function to reverse a string"}]
)
# Response includes routing metadata
print(response["routing_metadata"])
# {
# "engine": "llamacpp",
# "confidence": 0.92,
# "rationale": "Code generation keywords detected",
# "routing_time_ms": 0.008
# }
```
## The Observability Layer
I added comprehensive logging and metrics:
### Request Logging (JSONL)
Every request is logged to `~/workspace/logs/router_requests.jsonl`:
```json
{
"timestamp": "2025-10-20T14:32:15.847Z",
"prompt_length": 67,
"prompt_preview": "Write a Python function to reverse a string",
"engine": "llamacpp",
"routing_confidence": 0.92,
"routing_rationale": "Code generation keywords detected",
"routing_time_ms": 0.008,
"inference_time_ms": 1450,
"total_time_ms": 1450.008,
"success": true
}
```
### Real-Time Metrics
The `/metrics` endpoint aggregates logs on-demand:
```json
{
"total_requests": 156,
"requests_by_engine": {
"ollama": 89,
"llamacpp": 67
},
"avg_routing_time_ms": 0.009,
"avg_inference_time_ms": 1420,
"success_rate": 0.99
}
```
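The aggregation behind `/metrics` is a single pass over the log file. A minimal version (assuming the field names from the log format above):

```python
import json
from collections import Counter
from statistics import mean

def compute_metrics(log_path: str) -> dict:
    """Aggregate the JSONL request log into a /metrics-style summary."""
    with open(log_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        return {"total_requests": 0}
    return {
        "total_requests": len(records),
        "requests_by_engine": dict(Counter(r["engine"] for r in records)),
        "avg_routing_time_ms": round(mean(r["routing_time_ms"] for r in records), 3),
        "avg_inference_time_ms": round(mean(r["inference_time_ms"] for r in records)),
        "success_rate": round(mean(1.0 if r["success"] else 0.0 for r in records), 3),
    }
```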
### CLI Dashboard
I built a terminal dashboard with auto-refresh:
```
================================================================================
📊 Arch-Router Metrics Dashboard
================================================================================
📈 Summary Statistics
--------------------------------------------------------------------------------
Total Requests: 156
Success Rate: 99.4%
Avg Routing Time: 0.009ms
Avg Inference Time: 1420ms
🚀 Requests by Engine
--------------------------------------------------------------------------------
ollama | ████████████████████████████ | 89 ( 57.1%)
llamacpp | ████████████████████ | 67 ( 42.9%)
📝 Recent Requests (Last 10)
--------------------------------------------------------------------------------
✅ 5s ago | ollama | 1380ms | Hello! How can I help?...
✅ 12s ago | llamacpp | 1560ms | Write a function to reverse...
✅ 18s ago | ollama | 1290ms | What is machine learning?...
```
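The dashboard is the same aggregation redrawn in a loop, with counts rendered as bar characters. A rough sketch (reusing the `compute_metrics` helper sketched above):

```python
import os
import time

def render_dashboard(log_path: str, refresh_s: int = 5) -> None:
    """Clear the terminal and redraw per-engine request counts every few seconds."""
    while True:
        os.system("clear")
        metrics = compute_metrics(log_path)
        total = metrics.get("total_requests", 0)
        print("📊 Arch-Router Metrics Dashboard")
        print(f"Total Requests: {total}")
        for engine, count in metrics.get("requests_by_engine", {}).items():
            bar = "█" * int(40 * count / max(total, 1))
            print(f"{engine:<10} | {bar} | {count} ({100 * count / max(total, 1):.1f}%)")
        time.sleep(refresh_s)
```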
## What I Learned
### 1. When Heuristics Beat Machine Learning
ML is powerful, but it's not always the right tool. Heuristics work better when:
- **Decision boundaries are clear** (not high-dimensional or nuanced)
- **Speed is critical** (<1ms requirements)
- **Explainability matters** (need to debug routing decisions)
- **Infrastructure is constrained** (no GPU, edge deployment)
This project's routing task had clear patterns that could be captured with rules. ML was overkill.
### 2. JSONL as a Lightweight Database
Append-only JSONL logs provided 90% of database functionality with 10% of complexity:
- **Fast writes** (O(1) append)
- **Easy parsing** (one JSON per line)
- **Human-readable** (debugging with cat/jq/grep)
- **No setup** (just filesystem)
For <10K requests, on-demand metrics aggregation from the logs is perfectly acceptable. Move to a real database once you have a proven need.
### 3. Transparency Builds Trust
Including `routing_metadata` in every response provides:
- **Debugging** without server access
- **User confidence** through explainability
- **Optimization opportunities** (users can adjust prompts)
Make your AI systems explainable by exposing decision rationale, not just results.
### 4. OpenAI API as Universal Interface
By implementing OpenAI's `/v1/chat/completions` API, my gateway works with:
- Existing SDKs (openai-python, openai-node)
- LangChain, AutoGPT, Continue.dev
- Any tool expecting OpenAI format
Standards matter. Adopt them.
## Performance Metrics
Final system performance:
| Metric | Target | Achieved | Status |
|--------|--------|----------|--------|
| Routing Overhead | <50ms | 0.008ms | ✅ 6,250x better |
| Routing Accuracy | >80% | 90% | ✅ Exceeds target |
| End-to-End Latency | <2s | 1.45s | ✅ 27% better |
| Success Rate | >95% | 99.4% | ✅ Exceeds target |
## Challenges & Solutions
<div class="callout" data-callout="warning">
<div class="callout-title">Challenge: ML Model Too Slow</div>
<div class="callout-content">
Arch-Router-1.5B took 950ms per routing decision on CPU, 19x over the 50ms target. GPU acceleration could reduce this to ~20ms, but it would still add complexity.
</div>
</div>
<div class="callout" data-callout="success">
<div class="callout-title">Solution: Extract Heuristics from ML Patterns</div>
<div class="callout-content">
Analyzed ML routing decisions to identify clear patterns. Implemented heuristics capturing the same logic with 0.008ms overhead and zero infrastructure.
</div>
</div>
<div class="callout" data-callout="warning">
<div class="callout-title">Challenge: Service Health Checks</div>
<div class="callout-content">
Ollama doesn't have a `/health` endpoint like most services—it uses `/api/tags` instead.
</div>
</div>
<div class="callout" data-callout="success">
<div class="callout-title">Solution: Service-Specific Health Checks</div>
<div class="callout-content">
Built flexible health checking that tries standard endpoints first, then falls back to service-specific alternatives.
</div>
</div>
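In code, that fallback chain is short. A sketch of the idea (Ollama's `/api/tags` is the real fallback mentioned above; the helper and its structure are illustrative):

```python
import requests

# Service-specific fallbacks for backends without a standard /health endpoint
FALLBACK_PATHS = {
    "ollama": ["/api/tags"],  # Ollama has no /health; listing models works instead
}

def check_service(name: str, base_url: str, timeout: float = 2.0) -> bool:
    """Try /health first, then any service-specific fallback endpoints."""
    for path in ["/health"] + FALLBACK_PATHS.get(name, []):
        try:
            resp = requests.get(base_url.rstrip("/") + path, timeout=timeout)
            if resp.status_code == 200:
                return True
        except requests.RequestException:
            continue
    return False

# e.g. check_service("ollama", "http://localhost:11434")
```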
## Next Steps
This gateway is Phase 1 of a larger AI infrastructure project. Coming next:
- [ ] **Phase 2a**: Document ingestion pipeline for RAG
- [ ] **Phase 2b**: Vector database integration (Chroma or Qdrant)
- [ ] **Phase 2c**: RAG query endpoint with context retrieval
- [ ] **Cloud model integration**: Add Claude/GPT-4 for complex reasoning
- [ ] **Request caching**: LRU cache for identical prompts
- [ ] **Streaming responses**: SSE for real-time token generation
---
## Related Articles
[[Practical Applications/building-production-ml-workspace-part-4-agents|Building Production ML Workspaces: AI Agents]]
[[AI Systems & Architecture/⌂ AI Systems & Architecture|AI Systems & Architecture Overview]]
[[Practical Applications/⌂ Practical Applications|Practical Applications Hub]]
<div class="callout" data-callout="tip">
<div class="callout-title">Try It Yourself</div>
<div class="callout-content">
**Quick Start:**
1. Clone this approach with your own backends (Ollama, vLLM, etc.)
2. Start with simple heuristics: prompt length + keyword matching
3. Log every request to JSONL for analysis
4. Measure: does your heuristic-based router meet your latency target?
5. Only add ML if heuristics can't reach your accuracy goal
**Key Insight:** Measure first, optimize second. Don't assume ML is needed until simple solutions fail.
</div>
</div>
---
*This is Day 1 of the DGX Lab Chronicles, documenting real AI experiments on NVIDIA DGX hardware. Session files and code available on the DGX system at `/home/bioinfo/workspace/infrastructure/gateway/`.*