Welcome to the singularity. The pace at which the tools and best practices in LLM engineering are evolving makes it almost impossible to keep up unless you are developing almost every day. This article provides an overview of the state of the art as of early 2025.

At the core of LLM engineering, of course, are the LLM models. Frontier (or massive-scale) models are the paid offerings behind services like ChatGPT. Open-source (or [[open weight]]) models such as Llama are also available.

There are multiple ways to use models:

- **Chat interfaces**: all you need is a text box. Simply type your message into the service's web interface and get a response. Start for free or upgrade to paid plans with monthly subscriptions.
- **Cloud APIs**: interact with LLMs running in the cloud through an API. Frameworks like [[LangChain]] wrap multiple APIs to provide a seamless experience.
- **Managed AI cloud services**: interact with cloud APIs through an interface managed by the service provider. Examples include [[Amazon Bedrock]], [[Google Vertex]], and [[Azure AI Studio]].
- **Direct inference**: run the LLM locally or on a [[virtual machine]]. Use [[Ollama]] to run locally or [[HuggingFace]] to access models in the cloud.

## frontier model

Frontier refers to the largest and most capable models offered by LLM providers. These are typically not free or [[open weight]]. Frontier models are offered by LLM providers including

- OpenAI
- Anthropic
- Google
- Cohere
- Meta
- Perplexity

## open weight

An open weight model may be downloaded entirely for local inference, fine tuning, and other tasks (given sufficient compute and storage). Contrast with a [[frontier model]], which may only be used for inference on the vendor's servers.

## ollama

[Ollama](https://ollama.com/) was originally launched to make Meta's open-source LLM, Llama, easy to run. It now offers [many different models](https://ollama.com/search) in one convenient package.

## install ollama

Download the [Ollama](https://ollama.com/) installer by clicking the download button on the home page. Select the correct installer for your operating system. Find the executable in your `Downloads/` folder and run the installer (all you have to do is click Install).

> [!Example]- Installation Screens
> ![img](https://storage.googleapis.com/ei-dev-assets/assets/OllamaSetup.tmp_2Zf1bYdarC.png)

Open [[Git Bash]] or [[Windows PowerShell]] to run one of the LLMs. To run `llama3.2`, use

```bash
ollama run llama3.2
```

This might take a few minutes depending on the size of the model and your internet connection. Once the model is loaded, ask a question

```bash
What is the airspeed velocity of an unladen swallow?
```

To quit the session

```bash
/bye
```

## serve ollama locally

Ollama can be served locally on `localhost` to enable endpoint API calls. This should happen automatically if you opt to run `ollama` at startup. If you visit [http://localhost:11434/](http://localhost:11434/) you should see the message `Ollama is running`. If not, open [[Bash]] and enter `ollama serve`.

## ollama package

The `ollama` package is a [[Python]] library for running LLMs. Make sure to [[serve ollama locally]] first.

```python
import ollama

MODEL = "llama3.2"
messages = [
    {"role": "user", "content": "Describe some of the business applications of Generative AI"}
]

response = ollama.chat(model=MODEL, messages=messages)
print(response['message']['content'])
```
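The `ollama` package can also stream a reply as it is generated rather than waiting for the full response. A minimal sketch, assuming the same `MODEL` and `messages` as above and a locally served Ollama:

```python
import ollama

MODEL = "llama3.2"
messages = [
    {"role": "user", "content": "Describe some of the business applications of Generative AI"}
]

# stream=True returns an iterator of partial responses
stream = ollama.chat(model=MODEL, messages=messages, stream=True)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```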
## OpenAI API

The OpenAI API lets you use OpenAI's frontier models in your application. To use the OpenAI API, go to the [OpenAI developer platform](https://platform.openai.com/docs/overview) and log in.

> [!Note]
> Many other LLM providers have replicated OpenAI's web endpoints, allowing the OpenAI client library to work with their models as well. For example
> ```python
> from openai import OpenAI
>
> ollama_via_openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
> response = ollama_via_openai.chat.completions.create(
>     model=MODEL,
>     messages=messages
> )
> print(response.choices[0].message.content)
> ```
> will run your local ollama model (if it is being served).

You must supply at least \$5 in credits to use the API. Open Settings (gear icon in the top right) and then the Billing page from the side panel. Load the amount you are comfortable with. You can choose to enable auto recharge, but I don't recommend it until you are sure how much your application is going to cost.

Next, open the Dashboard page (in the top nav bar) and open the API keys page from the side panel. Click **+ Create new secret key**. Optionally, give it a descriptive name. Copy the key from the dialog box. You will not be able to see the key again! (But you can just delete and re-create the key if you lose it for any reason.) Save the key into an [[environment file]] like this (no spaces):

```.env
OPENAI_API_KEY=sk-proj-...
```

> [!Tip]
> Use [[nano]] from [[Bash]] to create a `.env` file and update it with your secret key
> ```bash
> nano .env
> ```
> Type `OPENAI_API_KEY=` and paste the key. Then use `Ctrl+O` to save and `Ctrl+X` to exit. Confirm you saved the file correctly with
> ```bash
> cat .env
> ```
> Then type `clear` to clear the screen so your key is no longer showing.

### basic requests

The basic request format is a call to the `openai.chat.completions` API (where `openai` is an `OpenAI()` client created with your key). Set it up so that `user_prompt()` is its own function that returns a prompt built from a few parameters, for re-use with similar prompts.

```python
response = openai.chat.completions.create(
    model=MODEL,
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt()}
    ]
)
result = response.choices[0].message.content
```

### streaming responses

You can also stream responses, for example when working in a Jupyter Notebook, to get the same feel as the chat interfaces.

```python
from IPython.display import Markdown, display, update_display

def stream_response(system_prompt, user_prompt):
    stream = openai.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_prompt()}
        ],
        stream=True
    )
    response = ""
    display_handle = display(Markdown(""), display_id=True)
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        response = response.replace("```", "").replace("markdown", "")
        update_display(Markdown(response), display_id=display_handle.display_id)

stream_response(system_prompt, user_prompt)
```

## Anthropic API

Open the Anthropic Console and log in. The setup mirrors OpenAI: load credits, create an API key, and save it to your [[environment file]] as `ANTHROPIC_API_KEY`.
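The section above stops short of an actual request, so here is a minimal sketch using Anthropic's `anthropic` Python package. It assumes your key is stored as `ANTHROPIC_API_KEY` and reuses the `system_prompt` / `user_prompt()` pattern from the OpenAI examples; the model name is just an example.

```python
import anthropic

claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = claude.messages.create(
    model="claude-3-5-sonnet-latest",  # example model name; use any current Claude model
    max_tokens=500,                    # required by the Messages API
    system=system_prompt,              # the system prompt is a top-level argument, not a message
    messages=[
        {"role": "user", "content": user_prompt()}
    ]
)
print(response.content[0].text)
```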
## hardware requirements

- What it's like running on a Microsoft Surface 9; Dell Precision 5560
- Build your own machine
- NVIDIA CUDA GPU
- Apple Silicon
- alternative: cloud compute

## running LLMs on a Mac

Use Apple Silicon to run LLMs on a Mac.

https://www.youtube.com/watch?v=bp2eev21Qfo

## open WebUI

Open WebUI is a fully-featured, full-stack solution for building your own LLM application. To use Open WebUI, simply clone the Open WebUI repository. Out of the box, it provides a chatbot interface just like [[ChatGPT]]. Open WebUI is built on [[Svelte]] with a [[Python]] backend and has a [[Docker]] configuration. You also need [[Node.js]] installed.

```bash
mamba create -n openwebui
mamba activate openwebui
pip install -r requirements.txt
npm install
npm run build
bash start.sh
```

Open WebUI requires authentication; however, it only connects to a local database. (Check out that code to see how it works!)

## model context protocol

Model context protocol (MCP) is the latest big thing in [[LLM engineering]], introduced by [[Anthropic]] in November 2024. First, create an MCP server that serves resources and tools to any MCP host. The MCP host can spin up that server on demand to include it in its workflow. You can create an MCP server for your local desktop filesystem or any web-based tool running locally or in the cloud.

Try starting with this tutorial to [extend](https://modelcontextprotocol.io/quickstart/user) Claude for Desktop and check out the [awesome-mcp-servers](https://github.com/punkpeye/awesome-mcp-servers) repo on GitHub.

> [!Tip]- Additional Resources
> - [Hugging Face | What is MCP, and why is everyone suddenly talking about it?](https://huggingface.co/blog/Kseniase/mcp)
> - [LangChain (YouTube) | Understanding MCP From Scratch](https://www.youtube.com/watch?v=CDjjaTALI68)

## Windsurf

[Windsurf](https://codeium.com/windsurf) is a popular [[IDE]] for [[LLM engineering]] offered by [[codeium]]. Check out the [launch video](https://www.youtube.com/watch?v=bVNNvWq6dKo). Windsurf is built around these principles:

- Trajectories: the agent can predict future work based on past work
- Meta-learning: learns preferences over time
- Scale with intelligence: Windsurf will grow with the industry, adopting the best practices

Windsurf claims that 90% of the code its users commit is written by the agent, not the developer.

## Vellum

Use the [Vellum AI](https://www.vellum.ai/llm-leaderboard) LLM Leaderboard to compare model costs and context windows (scroll down for *Context window, cost and speed comparison*).

## replit

## structured outputs

Structured outputs define specific data structures for LLM responses. This can be very helpful for ensuring the response from an LLM can be parsed in downstream application code. For the simplest use cases, you can simply provide an example via [[one-shot prompting]]. For OpenAI only, you can use `response_format={"type": "json_object"}`, but you must also mention JSON in your prompt.

## webdev arena

## gradio

Gradio (from [[HuggingFace]]) lets you build and share machine learning applications, but is best known as a UI wrapper for LLM applications. For LLM apps, often all you need is a text box for input and a text box for output. Gradio makes this super simple. It also allows you to quickly share demos with your colleagues through its unique sharing feature.

```python
import gradio as gr

def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
demo.launch()
```

Pass `share=True` to `launch` to create a public URL where others can access your application. Note that your machine will be used to run the code; Gradio is not spinning up a VM. This share link expires in 72 hours.

For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces ([https://huggingface.co/spaces](https://huggingface.co/spaces)).
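To turn the text-box-in, text-box-out pattern into an LLM app, wrap an API call in the interface function. A minimal sketch, assuming an OpenAI key in your environment; the helper `ask_llm` and the model name are illustrative:

```python
import gradio as gr
from openai import OpenAI

openai = OpenAI()      # assumes OPENAI_API_KEY is set in your environment
MODEL = "gpt-4o-mini"  # illustrative model name; swap in whichever model you use

def ask_llm(user_message):
    # single-turn call: one text box in, one Markdown box out
    response = openai.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.choices[0].message.content

demo = gr.Interface(
    fn=ask_llm,
    inputs=gr.Textbox(label="Your prompt", lines=6),
    outputs=gr.Markdown(label="Response")
)
demo.launch()
```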
For basic chatbots (`gr.ChatInterface`), Gradio expects a function `chat(message, history)` where `message` is the prompt to use and `history` is the past conversation in the model's required format.

```python
def chat(message, history):
    messages = [{"role": "system", "content": system_message}] + history + [{"role": "user", "content": message}]

    stream = openai.chat.completions.create(
        model=MODEL,
        messages=messages,
        stream=True
    )

    response = ""
    for chunk in stream:
        response += chunk.choices[0].delta.content or ''
        yield response
```

## HuggingFace

HuggingFace offers models (1M+ at last count), datasets (like [[Kaggle]]), and [[Hugging Face Spaces]] to build and deploy apps. HuggingFace has the libraries `hub` for models, `datasets` for datasets, `transformers` for custom model architectures, `peft` for fine-tuning, `trl` for reinforcement learning, and `accelerate` for hardware optimization.

To use HuggingFace, first create a HuggingFace account (and confirm your email address). Then create a new token (API key) of type "Write" under Access Tokens. Store the key in your [[environment file]]. Don't worry about the fine-grained permissions unless you know what you're doing; just select the "Write" tab.

## inference endpoint

HuggingFace makes it very easy to run inference in the cloud by deploying a model to an inference endpoint. There is a cost associated, but it can be a good solution for short-term deployments.

```python
from huggingface_hub import InferenceClient

client = InferenceClient(URL, token=hf_token)
client.text_generation(message)
```

### transformers

```bash
mamba install transformers
```

#### pipelines

The `pipeline` module provides a very simple interface for [[inference]] on common [[NLP]] tasks. Provide `model=` to specify a model, otherwise HuggingFace will select the default model for that task. If using a GPU, also supply `device='cuda'`.

```python
from huggingface_hub import login
from transformers import pipeline

# Load API key in Colab
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")  # device="cuda" if GPU
result = classifier("text to classify")

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
result = ner("text with entities")

# Question Answering with Context
question_answerer = pipeline("question-answering")
result = question_answerer(question="question", context="context containing the answer")

# Text Summarization
summarizer = pipeline("summarization")
text = """Text to summarize"""
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])

# Translation English to Spanish (no default en->es model, so specify one)
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
result = translator("Text to translate")
print(result[0]['translation_text'])

# Zero-Shot Classification
classifier = pipeline("zero-shot-classification")
result = classifier(
    "Text to classify",
    candidate_labels=["technology", "sports", "politics"]
)

# Text Generation
generator = pipeline("text-generation")
result = generator("Beginning of text")
print(result[0]['generated_text'])
```

#### models

[[HuggingFace]] uses [[PyTorch]] under the hood to run models. The `CausalLM` classes (e.g. `AutoModelForCausalLM`) cover all auto-regressive models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = '<model>'  # an instruct model from the HuggingFace hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Set input
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Get output
output = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```

Optionally, use [[quantization]] to reduce the precision of the model weights.
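A minimal sketch of what that looks like at load time, assuming `quant_config` is the `BitsAndBytesConfig` built in the [[quantization]] section below and `'<model>'` stands in for the model you are loading:

```python
from transformers import AutoModelForCausalLM

# quant_config is the BitsAndBytesConfig defined in the quantization section below
model = AutoModelForCausalLM.from_pretrained(
    '<model>',
    quantization_config=quant_config,  # load the weights in 4-bit precision
    device_map="auto"
)
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:,.1f} MB")
```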
To clean up after loading a model and free up space on the GPU, use

```python
del inputs, output, model
torch.cuda.empty_cache()
```

> [!Tip]- Additional Resources
> - [Learn LLMs on Hugging Face](https://huggingface.co/learn)

### HuggingFace Spaces

HuggingFace Spaces hosts AI applications for free. Applications are often built with [[Gradio]] or [[Streamlit]].

## tokenizer

A tokenizer converts text to tokens. Each model has its own tokenizer, and you must use that same tokenizer during inference. A tokenizer contains a vocabulary--the exhaustive list of all tokens in the system. A tokenizer focused on English text will have a very different vocabulary from one focused on code. Special tokens may be reserved for the beginning of the text string, the start of a sentence, the end of a sentence, etc. These special tokens require no special treatment during inference or fine tuning; they are treated the same as all other tokens. Where a token represents the beginning of a word, it will include a leading space. Tokens are also typically case sensitive.

The tokenizer decomposes a text string into tokens from its vocabulary. Each token is identified by a unique ID. In [[HuggingFace]], use `AutoTokenizer` from the `transformers` library.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('<model>', trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # pad with the end-of-sequence token
```

### chat template

Some models include a chat template that formats a chat into a series of tokens. For example, a typical usage is

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```

which prints the prompt with the additional tokens used to give the model context (this is a Phi-3 template; others will differ).

```
<|system|>
You are a helpful assistant
<|user|>
Tell a light-hearted joke for a room of Data Scientists
<|assistant|>
```

## instruct

Models that have been fine-tuned for chat are called **instruct** models.

## quantization

Quantization reduces the precision of each of the weights from 32-bit numbers to 8-bit or even 4-bit numbers. The accuracy is reduced, but only slightly.

Ensure the `bitsandbytes` library is installed.

```bash
pip install bitsandbytes
```

Use the `BitsAndBytesConfig` class from the [[HuggingFace]] `transformers` library.

```python
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # quantize twice
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
```

## chinchilla scaling law

The Chinchilla scaling law, originally proposed by the Google DeepMind team, states that for the [[transformer]] architecture the number of parameters in a model should scale proportionally with the number of training tokens.

## LLM benchmarks

Benchmarks are useful for understanding and comparing the performance of [[LLM]]s in different domains. Benchmarks are not perfect: they can be too narrow in scope, miss important measures of reasoning, and the tests can leak into the models themselves through direct training on test questions or overfitting to the benchmarks.
Benchmarks are collected on various leaderboards including

- [Open LLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/)
- [BigCode](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard) (coding specific)
- [LLM Perf](https://huggingface.co/spaces/optimum/llm-perf-leaderboard) (performance on specific hardware)
- [HuggingFace Others](https://huggingface.co/spaces?q=leaderboard) (domain-specific models)
- [Vellum](https://www.vellum.ai/llm-leaderboard) (API cost, context windows)
- [SEAL](https://scale.com/leaderboard) (expert skills)
- [Chatbot Arena](https://beta.lmarena.ai/leaderboard) (human preference)

Common benchmarks include

- **ARC**: Reasoning
- **DROP**: Language Composition
- **HellaSwag**: Common Sense
- **MMLU**: Understanding
- **TruthfulQA**: Accuracy
- **Winogrande**: Context
- **GSM8K**: Math
- **ELO**: Chat
- **HumanEval**: Python coding
- **MultiPL-E**: Coding (general)

The newer and more difficult benchmarks include

- **GPQA** (Google-proof Q&A): PhD-level questions
- **BBH** (Big-Bench Hard): Future capabilities of LLMs (but no longer!)
- **Math Lv 5**: High school math competition puzzles
- **IFEval**: Difficult instructions to follow
- **MuSR**: Multi-step soft reasoning (e.g., solving 1,000-word murder mystery novels)
- **MMLU-PRO**: More nuanced understanding tasks

## Retrieval Augmented Generation

RAG (Retrieval Augmented Generation), also referred to as **grounding** (especially in the Google ecosystem), is a method for bringing additional context to an LLM at inference time to improve its responses.

## LangChain

LangChain was developed in 2022 to provide a simplifying framework for working with LLMs. As the APIs of the LLM providers have converged, the need for such a framework has decreased. The people behind LangChain have more recently released [[LangGraph]], which can be especially useful for agentic systems.

LangChain is useful for RAG applications. LangChain has abstractions for LLMs, retrievers, and memory. Creating a RAG pipeline is quite easy with these three abstractions.

```python
from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

llm = ChatOpenAI(temperature=0.7)
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
retriever = vectorstore.as_retriever()  # vectorstore is a Chroma or FAISS store (see below)

conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)
```

### LangChain Expression Language

LangChain Expression Language (LCEL) is a declarative way to compose LangChain components into chains by piping them together (for example, `prompt | llm | output_parser`) instead of writing imperative glue code.

## chunking

Chunking is an important step in RAG: it breaks the knowledge base down into pieces that fit inside the context window. Chunking is as much an art as a science, and how well you do it will influence the performance of your RAG system.

```python
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```

## vector embeddings

Vector embeddings map tokens (or words, sentences, chunks, documents, graph nodes, and concepts) to points in a high-dimensional space that encode semantic meaning, such that similar items are near each other. Options include

- Word2Vec (2013)
- BERT (2018)
- OpenAI Embeddings (2024)
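A minimal sketch of generating embeddings and comparing them with cosine similarity, here using OpenAI's embeddings endpoint; the model name and example sentences are illustrative:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Quarterly revenue rose 8%."
]
response = client.embeddings.create(model="text-embedding-3-small", input=sentences)
vectors = [np.array(item.embedding) for item in response.data]

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# semantically similar sentences score higher than unrelated ones
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```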
## vector database

A vector database efficiently stores and retrieves vector embeddings.

## chroma

Chroma is an open-source [[vector database]]. Chroma is integrated with [[LangChain]] for easy use.

```python
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=db_name
)
```

Alternatives include MongoDB and Postgres for persistent storage, or Facebook's FAISS for an in-memory store.

## FAISS

Facebook AI Similarity Search (FAISS) is an open-source library for search by semantic similarity. It can act as an in-memory vector store for [[RAG]] applications. Use LangChain to access FAISS for RAG.

```python
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
```

## t-stochastic neighborhood embeddings

t-distributed stochastic neighbor embedding (t-SNE) is a common method for reducing the dimensionality of vector embeddings, for example for visualization.

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=3)  # reduce to 3D
reduced_vectors = tsne.fit_transform(vectors)
```

## graph RAG

- https://graphrag.com/
- https://neo4j.com/blog/genai/graphrag-manifesto/
- https://www.deeplearning.ai/short-courses/knowledge-graphs-rag/ (2 hour course)
- [Neo4j in the cloud](https://neo4j.com/free-graph-database/) (free!)
- [Paco Nathan](https://sessionize.com/pacoid/): see three talks
- Use the [GlobalFishingWatch API](https://github.com/GlobalFishingWatch/gfw-api-python-client) to explore potential fisheries abuses as a portfolio project
- https://research.google/blog/the-evolution-of-graph-learning/