Inference Core Loop of an Inference Engine - Sladyn's Engineering Field Manual

# Inference: Core Loop of an Inference Engine

Inference is the step where a trained model is used to generate output from input. For a large language model, the weights were already learned during training. At inference time, we provide input tokens and ask the model to predict what should come next.

## Step 1: Load The Model

The inference system loads the model weights into memory. In practice, this is usually hidden behind libraries and serving systems, but the core idea is simple: the model has to be available before it can produce predictions.

## Step 2: Tokenize The Input

Text is not passed to the model as raw strings. It is converted into tokens. For a prompt like:

```text
What is the capital of France?
```

the tokenizer converts the text into token IDs. Those token IDs become the input to the model.

## Step 3: Generate One Token At A Time

At a simplified level, generation looks like this:

```python
import torch

# Assumes model, input_ids (shape [1, seq_len]), max_new_tokens,
# eos_token_id, and device are already defined.
generated_ids = []
for _ in range(max_new_tokens):
    outputs = model(input_ids)  # forward pass over the whole sequence so far
    # Greedy decoding: highest-scoring token at the last position.
    next_token_id = outputs.logits[0, -1, :].argmax().item()
    generated_ids.append(next_token_id)
    if next_token_id == eos_token_id:
        break
    # Append the new token so the next step conditions on it.
    input_ids = torch.cat(
        [input_ids, torch.tensor([[next_token_id]], device=device)],
        dim=1,
    )
```

The model produces logits, which are raw, unnormalized scores over the possible next tokens. A decoding strategy (here, greedy `argmax`) chooses the next token, the loop appends it to the sequence, and the process repeats until the end-of-sequence token appears or `max_new_tokens` is reached.

## Autoregression

This process is autoregressive: each new token depends on the tokens that came before it. That is why generation is inherently sequential. The model cannot know token 50 until it has generated token 49.

## Why Inference Engineering Gets Hard

The simple loop hides the real production complexity:

- Batching many user requests.
- Managing GPU memory.
- Caching keys and values.
- Scheduling prefill and decode work.
- Keeping latency predictable.
- Serving many model sizes and traffic patterns.

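
To see the tokenize-then-generate loop from Steps 2 and 3 run end to end without loading any weights, here is a minimal sketch in plain Python. Everything in it is a hypothetical stand-in: `VOCAB`, `tokenize`, `toy_model`, and `greedy_generate` are illustrative names, the tokenizer just splits on whitespace, and the "model" is a hard-coded rule rather than a neural network. Only the shape of the loop matches the real thing.

```python
# Toy stand-ins (hypothetical, not a real library): a tiny vocabulary,
# a whitespace tokenizer, and a rule-based "model" that returns logits.
VOCAB = ["<eos>", "What", "is", "the", "capital", "of", "France?", "Paris"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["<eos>"]


def tokenize(text):
    """Map whitespace-separated words to token IDs."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def toy_model(input_ids):
    """Return one raw score (logit) per vocabulary entry for the next token.

    Rule: after "France?" prefer "Paris"; after anything else prefer <eos>.
    """
    logits = [0.0] * len(VOCAB)
    if input_ids[-1] == TOKEN_TO_ID["France?"]:
        logits[TOKEN_TO_ID["Paris"]] = 5.0
    else:
        logits[EOS_ID] = 5.0
    return logits


def greedy_generate(prompt, max_new_tokens=8):
    """Autoregressive greedy decoding: one token per loop iteration."""
    input_ids = tokenize(prompt)
    generated_ids = []
    for _ in range(max_new_tokens):
        logits = toy_model(input_ids)
        # Greedy decoding: index of the highest logit.
        next_token_id = max(range(len(logits)), key=logits.__getitem__)
        generated_ids.append(next_token_id)
        if next_token_id == EOS_ID:
            break
        # The new token becomes part of the input for the next step.
        input_ids = input_ids + [next_token_id]
    return generated_ids


generated = greedy_generate("What is the capital of France?")
print([VOCAB[i] for i in generated])  # ['Paris', '<eos>']
```

Note that the second token (`<eos>`) can only be chosen after the first (`Paris`) has been appended to the input, which is the sequential dependency described under Autoregression. A real engine would swap the rule for a forward pass and typically add a key/value cache so earlier positions are not recomputed each step.
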
## Related

- [[Inference Prefill and Decode]]
- [[Field Notes/KV Cache]]
- [[Topics/AI Infrastructure]]
- [[Learning Paths/AI Infrastructure from First Principles]]