Welcome to the singularity. The pace at which tools and best practices in LLM engineering are evolving makes it almost impossible to keep up unless you are developing nearly every day. This article provides an overview of the state of the art as of early 2025.
At the core of LLM engineering are, of course, the models themselves. Frontier (or massive-scale) models are the paid offerings such as ChatGPT. Open source (or [[open weight]]) models such as Llama are also available.
There are multiple ways to use models:
- **Chat interfaces**: all you need is a text box. Simply type your chat into the service's web interface and get a response. Start for free or upgrade to a paid plan with a monthly subscription.
- **Cloud APIs**: Interact with LLMs running in the cloud with an API. Frameworks like [[LangChain]] wrap multiple APIs to provide a seamless experience.
- **Managed AI cloud services**: interact with Cloud APIs through an interface managed by the service provider. Includes [[Amazon Bedrock]], [[Google Vertex]], and [[Azure AI Studio]].
- **Direct inference**: Run the LLM locally or on a [[virtual machine]]. Use [[Ollama]] to run locally or [[HuggingFace]] to access models in the cloud.
## frontier model
Frontier refers to the largest and most capable models offered by LLM providers. These are typically not free or [[open weight]].
Frontier models are offered by LLM providers including
- OpenAI
- Anthropic
- Google
- Cohere
- Meta
- Perplexity
## open weight
An open weight model may be downloaded entirely for local inference, fine tuning and other tasks (given sufficient compute and storage). Contrast with a [[frontier model]], which may only be used for inference on the vendor's servers.
## ollama
[Ollama](https://ollama.com/) was originally launched to make Llama, the open-source LLM offered by Meta, easy to run locally. It now offers [many different models](https://ollama.com/search) in one convenient package.
## install ollama
Download the [Ollama](https://ollama.com/) installer by clicking the download on the home page. Select the correct installer for your operating system. Find the executable in your `Downloads/` folder and run the installer (all you have to do is click Install).
> [!Example]- Installation Screens
Open [[Git Bash]] or [[Windows PowerShell]] to run one of the LLMs. To run `llama3.2`, use
```bash
ollama run llama3.2
```
This might take a few minutes depending on the model size and your internet connection.
Once the model is loaded, ask a question
```bash
What is the airspeed velocity of an unladen swallow?
```
To quit the session
```bash
/bye
```
## serve ollama locally
Ollama can be served locally on `localhost` to enable endpoint API calls. This should happen automatically if you opt to run `ollama` on start. If you visit [http://localhost:11434/](http://localhost:11434/) you should see the message `Ollama is running`. If not, open [[Bash]] and enter `ollama serve`.
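With the server running, you can also hit the endpoint directly from Python. Here is a minimal sketch using the `requests` library and Ollama's `/api/generate` route (it assumes `llama3.2` has already been pulled):
```python
import requests

# Call the local Ollama REST API directly; stream=False returns one complete response
payload = {"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False}
r = requests.post("http://localhost:11434/api/generate", json=payload)
print(r.json()["response"])
```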
## ollama package
The `ollama` package is a [[Python]] library for running LLMs. Make sure to [[serve ollama locally]] first.
```python
import ollama
MODEL = "llama3.2"
messages = [
{"role": "user", "content": "Describe some of the business applications of Generative AI"}
]
response = ollama.chat(model=MODEL, messages=messages)
print(response['message']['content'])
```
## OpenAI API
The OpenAI API will let you use OpenAI's frontier models in your application. To use the OpenAI API, go to the [OpenAI developer platform](https://platform.openai.com/docs/overview) and log in.
> [!Note]
> Many other LLM providers have replicated OpenAI's web endpoints, allowing the OpenAI client library to work with their models as well. For example
> ```python
> from openai import OpenAI
>
> ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
> response = ollama_via_openai.chat.completions.create(
>     model=MODEL,
>     messages=messages
> )
> print(response.choices[0].message.content)
> ```
> will run your local ollama model (if being served).
You must supply at least \$5 in credits to use the API. Open Settings (gear icon in the top right) and then the Billing page from the side panel. Load the amount you are comfortable with. You can choose to enable auto recharge, but I don't recommend it until you are sure how much your application is going to cost.
Next, open the Dashboard page (in the top nav bar) and open the API keys page from the side panel. Click **+ Create new secret key**. Optionally, give it a descriptive name. Copy the key from the dialog box. You will not be able to see the key again! (But just delete and re-create the key if you lose it for any reason).
Save the key into an [[environment file]] like this (no spaces around the `=`)
```.env
OPENAI_API_KEY=sk-proj-...
```
> [!Tip]
> Use [[nano]] from [[Bash]] to create a `.env` file and update with your secret key
> ```bash
> nano .env
> ```
> type `OPENAI_API_KEY=` and paste the key. Then use `Ctrl+O` to save and `Ctrl+X` to exit. Confirm you saved the file correctly with
> ```bash
> cat .env
> ```
> Then type `clear` to clear the screen so your key is no longer showing.
### basic requests
The basic request format is a call to the `openai.chat.completions.create` API. Set it up so that `user_prompt` is its own function that builds the prompt from a few parameters, making it easy to reuse for similar requests (see the sketch after the code block).
```python
response = openai.chat.completions.create(
    model=MODEL,
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt()}
    ]
)
result = response.choices[0].message.content
```
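For example, a hypothetical `user_prompt` builder for a summarization task might look like the following (the names and wording are illustrative, not from any particular library):
```python
def user_prompt(title="Example page", text="..."):
    # Build the prompt from parameters so similar requests share one template
    return (
        f"You are looking at a page titled '{title}'.\n"
        f"Summarize its contents in short markdown:\n\n{text}"
    )
```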
### streaming responses
You can also stream responses, for example when working in a Jupyter Notebook, to get the same feel as when using the chat interface.
```python
from IPython.display import Markdown, display, update_display

def stream_response(system_prompt, user_prompt):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt()}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        # Strip stray code fences so the Markdown renders cleanly as it streams
        response = response.replace("```", "").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)

stream_response(system_prompt, user_prompt)
```
## Anthropic API
Open the Anthropic Console, log in, and create an API key. The setup mirrors the [[OpenAI API]]: save the key in your [[environment file]] as `ANTHROPIC_API_KEY`.
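A minimal sketch of a request with the `anthropic` Python library (the model name here is just an example; pick any current Claude model, and note that `max_tokens` is required):
```python
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment
claude = anthropic.Anthropic()
message = claude.messages.create(
    model="claude-3-5-sonnet-latest",   # example model name
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}]
)
print(message.content[0].text)
```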
## hardware requirements
- What it's like running on a Microsoft Surface 9; Dell Precision 5560
- Build your own machine
- NVIDIA CUDA GPU
- Apple Silicon
- alternative: cloud compute
## running LLMs on a Mac
Use Apple Silicon to run LLMs on a Mac.
https://www.youtube.com/watch?v=bp2eev21Qfo
## open WebUI
Open WebUI is a fully-featured, full-stack solution for building your own LLM application. To use Open WebUI, simply clone the Open WebUI repository. Out of the box, it provides a chatbot interface just like [[ChatGPT]].
Open WebUI is built on [[Svelte]] with a [[Python]] backend and has a [[Docker]] configuration. You also need [[Node.js]] installed.
```bash
mamba create -n openwebui
mamba activate openwebui
pip install -r requirements.txt
npm install
npm run build
bash start.sh
```
Open WebUI requires authentication; however, it only connects to a local database. (Check out the code to see how that works!)
## model context protocol
Model Context Protocol (MCP) is the latest development in [[LLM engineering]], introduced by [[Anthropic]] in November 2024. First, create an MCP server that serves resources and tools to any MCP host. The MCP host can spin up that server on demand and include it in its workflow. You can create an MCP server for your local desktop filesystem or any web-based tool running locally or in the cloud. Try starting with this tutorial to [extend](https://modelcontextprotocol.io/quickstart/user) Claude for Desktop and check out the [awesome-mcp-servers](https://github.com/punkpeye/awesome-mcp-servers) repo on GitHub.
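As a rough sketch of the shape of an MCP server, here is a tool-serving example assuming the official `mcp` Python SDK and its `FastMCP` helper (check the SDK docs for the current API, which is evolving quickly):
```python
from mcp.server.fastmcp import FastMCP

# Create a named server that an MCP host (e.g. Claude for Desktop) can launch on demand
mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

if __name__ == "__main__":
    mcp.run()
```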
> [!Tip]- Additional Resources
> - [Hugging Face | What is MCP, and why is everyone suddenly talking about it?](https://huggingface.co/blog/Kseniase/mcp)
> - [LangChain (YouTube) | Understanding MCP From Scratch](https://www.youtube.com/watch?v=CDjjaTALI68)
## Windsurf
[Windsurf](https://codeium.com/windsurf) is a popular [[IDE]] for [[LLM engineering]] offered by [[codeium]]. Check out the [launch video](https://www.youtube.com/watch?v=bVNNvWq6dKo).
Windsurf is built around these principles:
- Trajectories: the agent can predict future work based on past work
- Meta-learning: learns preferences over time
- Scale with intelligence: Windsurf will grow with the industry, adopting the best practices
Windsurf claims that 90% of the code its users commit is written by the agent, not the developer.
## Vellum
Use [Vellum AI](https://www.vellum.ai/llm-leaderboard) LLM Leaderboard to compare model costs and context windows (scroll down for *Context window, cost and speed comparison*).
## replit
## structured outputs
Structured outputs define specific data structures for LLM responses. This can be very helpful for ensuring the response from an LLM can be parsed in downstream application code.
For the simplest use cases, you can provide an example via [[one-shot prompting]].
For OpenAI only, you can use `response_format={"type": "json_object"}`, but you must also mention JSON somewhere in your prompt.
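A minimal sketch of the JSON mode in use (the prompt and keys are illustrative; it assumes the `openai` client and `MODEL` from above):
```python
import json

response = openai.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Reply only in JSON with keys 'name' and 'price'."},
        {"role": "user", "content": "Extract the product from: 'The X100 costs $49.'"}
    ],
    response_format={"type": "json_object"}
)
# The content is now guaranteed to be valid JSON, so it can be parsed downstream
data = json.loads(response.choices[0].message.content)
```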
## webdev arena
## gradio
Gradio (from [[HuggingFace]]) lets you build and share machine learning applications, but is best known as a UI wrapper for LLM applications.
For LLM apps often all you need is a text box for input and text box for output. Gradio makes this super simple. It also allows you to quickly share demos with your colleagues with its unique sharing feature.
```python
import gradio as gr

def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()
```
Pass `share=True` to `launch` to create a public URL where others can access your application. Note that your machine will be used to run the code; gradio is not spinning up a VM. This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces ([https://huggingface.co/spaces](https://huggingface.co/spaces)).
For basic chatbots, Gradio expects a function `chat(message, history)` where `message` is the prompt to use and `history` is the past conversation in the model's required format.
```python
def chat(message, history):
    # System prompt, then prior turns, then the new user message
    messages = (
        [{"role": "system", "content": system_message}]
        + history
        + [{"role": "user", "content": message}]
    )
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=messages,
        stream=True
    )
    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        yield response
```
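To hook the generator up to Gradio's built-in chat UI, a one-line sketch (`type="messages"` tells Gradio to pass `history` as a list of OpenAI-style message dicts, which is what `chat` above expects):
```python
gr.ChatInterface(fn=chat, type="messages").launch()
```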
## HuggingFace
HuggingFace offers models (1M+ as of last count), datasets (like [[Kaggle]]), and [[Hugging Face Spaces]] to build and deploy apps.
HuggingFace has libraries `hub` for models, `datasets` for datasets, `transformers` for custom model architectures, `peft` for finetuning, `trl` for reinforcement learning, and `accelerate` for hardware optimization.
To use HuggingFace, first create a HuggingFace account (and confirm your email address). Then create a new token (API key) of type "Write" under Access Tokens. Store the key in your [[environment file]]. Don't worry about the fine-grained permissions, unless you know what you're doing; just select the "Write" tab.
## inference endpoint
HuggingFace makes it very easy to run inference on the cloud by deploying a model. There is a cost associated but it can be a good solution for short term deployments.
```python
from huggingface_hub import InferenceClient

client = InferenceClient(URL, token=hf_token)  # URL of your deployed endpoint
client.text_generation(message)
```
### transformers
```bash
mamba install transformers
```
#### pipelines
The `pipeline` module provides a very simple interface for [[inference]] and common [[NLP]] tasks. Provide `model=` to specify a model, otherwise HuggingFace will select the default model for that task. If using a GPU, also supply `device='cuda'`.
```python
from huggingface_hub import login
from transformers import pipeline

# Load API key in Colab
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")  # add device="cuda" if using a GPU
result = classifier("text to classify")

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
result = ner("text with entities")

# Question Answering with Context
question_answerer = pipeline("question-answering")
result = question_answerer(question="question", context="context to search")

# Text Summarization
summarizer = pipeline("summarization")
text = """Text to summarize"""
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])

# Translation English to Spanish (specify a model; the built-in defaults only cover a few language pairs)
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
result = translator("Text to translate")
print(result[0]['translation_text'])

# Classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "Text to classify",
    candidate_labels=["technology", "sports", "politics"]
)

# Text Generation
generator = pipeline("text-generation")
result = generator("Beginning of text")
print(result[0]['generated_text'])
```
#### models
[[HuggingFace]] uses [[PyTorch]] under the hood to run models. The `AutoModelForCausalLM` class covers auto-regressive (decoder-only) models.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name is a placeholder for the Hub id of the model to load
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
# Set input
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Get output
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
Optionally, use [[quantization]] to reduce the precision of the model weights.
To clean up after loading a model to free up space on the GPU use
```python
del inputs, output, model
torch.cuda.empty_cache()
```
> [!Tip]- Additional Resources
> - [Learn LLMs on Hugging Face](https://huggingface.co/learn)
### HuggingFace Spaces
HuggingFace Spaces hosts AI applications for free. Applications are often built with [[Gradio]] or [[Streamlit]].
## tokenizer
A tokenizer converts text to tokens. Each model has its own tokenizer. You must use the same tokenizer during inference.
A tokenizer contains a vocabulary--the exhaustive list of all tokens in the system. A tokenizer focused on English text will have a very different vocabulary from a tokenizer focused on coding. Special tokens may be reserved for the beginning of the text string, the start of a sentence, the end of a sentence, etc. These special tokens require no special treatment during inference or fine tuning; they are treated the same as all other tokens. Where a token represents the beginning of a word, it will include a space in front. Tokens are also typically case sensitive.
The tokenizer decomposes the text string into the vocabulary of tokens. Each token is identified by a unique ID.
In [[HuggingFace]] use the `AutoTokenizer` from the `transformers` library.
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('<model>', trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # pad with the end-of-sequence token
```
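To see the mapping in action, a quick sketch (the IDs you get depend entirely on the model's vocabulary):
```python
text = "Tokenizers map text to integer IDs"
tokens = tokenizer.encode(text)
print(tokens)                                   # list of integer token IDs
print(tokenizer.convert_ids_to_tokens(tokens))  # the string piece behind each ID
print(tokenizer.decode(tokens))                 # reassembles the original text
```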
### chat template
Some models may include a chat template to format a chat into a series of tokens.
For example, a typical chat template is
```python
messages = [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```
Printing the prompt shows the additional tokens used to provide context to the model (this is a Phi-3 template; others will be different).
```
<|system|>
You are a helpful assistant
<|user|>
Tell a light-hearted joke for a room of Data Scientists
<|assistant|>
```
## instruct
Models that have been finetuned for chat are called **instruct** models.
## quantization
Quantization reduces the precision of each of the weights from 32-bit numbers to 8-bit numbers (or similar). The accuracy is reduced but only slightly. Ensure the `bitsandbytes` library is installed.
```bash
pip install bitsandbytes
```
Use the `BitsAndBytesConfig` class of the [[HuggingFace]] `transformers` library.
```python
from transformers import BitsAndBytesConfig
import torch
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
```
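To apply the config, pass it to `from_pretrained` when loading the model (a minimal sketch; `model_name` is a placeholder for any causal LM on the Hub):
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,                          # placeholder Hub model id
    device_map="auto",
    quantization_config=quant_config
)
```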
## chinchilla scaling law
The chinchilla scaling law, originally proposed by the Google DeepMind team, states that for the [[transformer]] architecture the number of parameters in a model should scale roughly in proportion to the number of training tokens.
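As a rough rule of thumb from the original paper (a summary, not a precise prescription), the compute-optimal number of training tokens $D$ scales with the parameter count $N$ as
$$D \approx 20\,N$$
i.e., on the order of 20 training tokens per model parameter.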
## LLM benchmarks
Benchmarks are useful for understanding and comparing the performance of [[LLM]]s in different domains. Benchmarks are not perfect--they can be too narrow in scope, miss important measures of reasoning, and there is the potential for the tests to leak into the models themselves through direct training on test questions or overfitting to benchmarks.
Benchmarks are collected on various leaderboards including
- [Open LLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/)
- [BigCode](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard) (coding specific)
- [LLM Perf](https://huggingface.co/spaces/optimum/llm-perf-leaderboard) (performance on specific hardware)
- [HuggingFace Others](https://huggingface.co/spaces?q=leaderboard) (domain-specific models)
- [Vellum](https://www.vellum.ai/llm-leaderboard) (API cost, context windows)
- [SEAL](https://scale.com/leaderboard) (expert skills)
- [Chatbot Arena](https://beta.lmarena.ai/leaderboard) (human preference)
Common benchmarks include
- **ARC**: Reasoning
- **DROP**: Language Composition
- **HellaSwag**: Common Sense
- **MMLU**: Understanding
- **TruthfulQA**: Accuracy
- **Winogrande**: Context
- **GSM8K**: Math
- **ELO**: Chat
- **HumanEval**: Python coding
- **MultiPL-E**: Coding (general)
The newer and more difficult benchmarks include
- **GPQA** (Google-proof Q&A): PhD-level questions
- **BBH** (Big-Bench Hard): Future capabilities of LLMs (but no longer!)
- **Math Lv 5**: High school math competition puzzles
- **IFEVAL**: Difficult instructions to follow
- **MuSR**: Multi-step soft reasoning (solve 1,000 word murder mystery novels)
- **MMLU-PRO**: More nuanced understanding tasks
## Retrieval Augmented Generation
RAG (Retrieval Augmented Generation), also referred to as **grounding** (especially in the Google ecosystem), is a method for bringing additional context to an LLM at inference time to improve responses.
## LangChain
LangChain was developed in 2022 to provide a simplifying framework for working with LLMs. As the APIs for each LLM provider have converged, the need for such a framework has decreased. The people behind LangChain have recently released [[LangGraph]] which can be especially useful for agentic systems.
LangChain is useful for RAG applications.
LangChain provides abstractions for LLMs, retrievers, and memory. Creating a RAG pipeline is quite easy with these three abstractions.
```python
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

llm = ChatOpenAI(temperature=0.7)
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
retriever = vectorstore.as_retriever()
conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=retriever, memory=memory
)
```
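A sketch of the chain in use (the question is illustrative; the chain retrieves relevant chunks and keeps the conversation in memory):
```python
result = conversation_chain.invoke({"question": "Can you describe the company in a few sentences?"})
print(result["answer"])
```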
### LangChain Expression Language
LangChain Expression Language (LCEL) is a declarative way to compose LangChain components into chains using a pipe (`|`) syntax rather than imperative code.
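A minimal LCEL sketch (assumes the `langchain-openai` package and an OpenAI key in your environment):
```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Tell me a joke about {topic}")
chain = prompt | ChatOpenAI() | StrOutputParser()  # components compose with the pipe operator
print(chain.invoke({"topic": "data scientists"}))
```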
## chunking
Chunking is an important step in RAG to break the knowledge base down to fit inside the context window. Chunking is as much an art as a science. How well you do it will influence the performance of your RAG system.
```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)  # documents loaded earlier
```
## vector embeddings
Vector embeddings map tokens (or words, sentences, chunks, documents, graph nodes, and concepts) to vectors in a high-dimensional space that encode semantic meaning, such that similar items are near each other.
Options include
- Word2Vec (2013)
- BERT (2018)
- OpenAI Embeddings (2024)
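For example, the [[LangChain]] wrapper around OpenAI embeddings (a minimal sketch; assumes the `langchain-openai` package and `OPENAI_API_KEY`) produces the `embeddings` object used in the vector store snippets below:
```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # defaults to an OpenAI embedding model
vector = embeddings.embed_query("What is retrieval augmented generation?")
print(len(vector))               # dimensionality of the embedding space
```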
## vector database
A vector database efficiently stores and retrieves vector embeddings.
## chroma
Chroma is an open source [[vector database]].
Chroma has a [[LangChain]] integration (the `langchain_chroma` package) for easy use.
```python
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=db_name
)
```
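Once built, the vector store can be queried directly (a sketch; the query text is illustrative and `k` is the number of chunks to return):
```python
docs = vectorstore.similarity_search("How does the product pricing work?", k=3)
for doc in docs:
    print(doc.page_content[:100])
```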
Alternatives include MongoDB and Postgres for persistent storage, or Facebook's FAISS for in-memory search.
## FAISS
Facebook AI Similarity Search (FAISS) is an open source library for search by semantic similarity. It can act as an in-memory vector store for [[RAG]] applications.
Use LangChain to access FAISS for RAG.
```python
from langchain.vectorstores import FAISS
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
```
## t-distributed stochastic neighbor embedding
t-distributed stochastic neighbor embedding (t-SNE) is a common method for reducing the dimensionality of vector embeddings, for example to visualize them in 2D or 3D.
```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=3) # 3D
reduced_vectors = tsne.fit_transform(vectors)
```
## graph RAG
- https://graphrag.com/
- https://neo4j.com/blog/genai/graphrag-manifesto/
- https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/ (2 hour course)
- [Neo4j in the cloud](https://neo4j.com/free-graph-database/) (free!)
- [Paco Nathan](https://sessionize.com/pacoid/): see three talks
- Use the [GlobalFishingWatch API](https://github.com/GlobalFishingWatch/gfw-api-python-client) to explore potential fisheries abuses as a portfolio project
- https://research.google/blog/the-evolution-of-graph-learning/