## prerequisites

- it's better if you have a VPS or a physical server somewhere on which you can test your deployment, but you can do all the steps on your own machine anyhow
- you can write and run basic Python code
- you know what an LLM is
- you know what LLM tokens are

## what RAG systems are

_Retrieval-Augmented Generation_ systems combine:

- retrieval of information that sits outside the knowledge base of a given LLM
- generation of answers that are more contextualized, accurate, and relevant, on the LLM's end

This hybrid approach leverages the precision of retrieval methods and the creativity of generation models, making it highly effective for tasks requiring detailed and factual responses, such as question answering and document summarization.

## the use-case

All companies manage some sort of written documentation, to be used internally or to be published externally. We want to build a RAG system that facilitates access to and querying of this written data, and that is both easy to set up and as cheap as it can be, in terms of computational power and billing.

The system I have in mind:

- can run on CPU only
- has a UI
- uses a free and open-source embedding model and LLM
- uses `pgvector`, as a lot of systems out there run on Postgres

I personally believe in a future where tiny machines and IoT devices will host powerful optimized models that will remove most of the current need to send everything to a giga model in some giga server farm somewhere (and all the privacy issues that necessarily arise from this practice). Open-source LLMs make daily progress toward bringing us closer to that future.

For this exercise, let's use the open-sourced technical documentation of [Scalingo](https://scalingo.com/), a well-known cloud provider in France. This documentation can be found at https://github.com/Scalingo/documentation. The actual documentation lies in the `src/_posts` folder.

## step 1: download `ollama` and select the right model

Nowadays, running powerful LLMs locally is ridiculously easy with tools such as [`ollama`](https://ollama.com/). Just follow the installation instructions for your OS. From now on, we'll assume you're using bash on Ubuntu.

### what is `ollama`?

`ollama` is a versatile tool designed for running large language models (LLMs) locally on your computer. It offers a streamlined and user-friendly way to leverage powerful AI models like Llama 3, Mistral, and others without relying on cloud services. This approach provides significant benefits in terms of speed, privacy, and cost efficiency, as all data processing happens locally, eliminating the need for data transfers to external servers. Additionally, [its integration with Python](https://github.com/ollama/ollama-python) enables seamless incorporation into existing workflows and projects.

The documentation of `ollama` says "You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models", which means you can run relatively small models easily on a low-end server.

### 7B models? tokens?

In the world of generative AI, you will often see terms like `tokens` and `parameters` pop up. In our demo, we will try to run the smallest models possible, so that our consumption footprint stays very low; let's cap ourselves at a maximum of 7B parameters.

Parameters in a machine learning model are essentially the weights and biases that the model learns during the training process. These parameters are critical because they enable the model to learn and generalize from the training data.

Tokens are the pieces of data that LLMs process. In simpler terms, a token can be as small as a single character or as large as a whole word. The specific definition of a token can vary depending on the language model and its tokenizer, but generally, tokens represent the smallest unit of meaningful data for the model. They are numerical representations of units of semantic meaning.

You will often see the term `token` used in conjunction with the idea of a `context window`, which represents how many tokens an LLM can keep in memory during a conversation. The longer the context window, the longer a meaningful conversation can (in theory) be conducted with the LLM.
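To get an intuition for how text maps to tokens, here is a minimal sketch that counts tokens with a Hugging Face tokenizer; the `transformers` dependency and the exact tokenizer repo are my own assumptions, and any tokenizer will do to get a feel for it:

```python
# a minimal sketch, assuming `pip install transformers`
# and that deepseek-ai/deepseek-coder-1.3b-instruct hosts the matching tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-instruct")

text = "Deploy a Node.js express app on Scalingo"
tokens = tokenizer.tokenize(text)
print(len(tokens), tokens)  # roughly one token per word or sub-word
```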
### selecting the right model

With `ollama`, running an LLM (a Small Language Model in our case, let's use that term for now) is as easy as `ollama run <model_name>`. So, after browsing the [`ollama` models library](https://ollama.com/library) for the most popular yet smallest models, I downloaded a few of them and tested them in my terminal; some of them kept outputting nonsensical answers on my machine (such as `codegemma:2b`), so I discarded them.

I found out, by tinkering around rather than with systematic tests (although that would be quite interesting), that `deepseek-coder:1.3b` offers a particularly good performance-to-answer-quality ratio.

`deepseek-coder:1.3b` is a 1.3B-parameter SLM (Small Language Model) that weighs only 776MB 🀯, [developed by DeepSeek](https://www.deepseek.com/), a major Chinese AI company. It has been trained on a high-quality dataset of 2 trillion tokens. This model is optimized for running on various hardware, including mobile devices, which enables local inference without needing cloud connectivity. Its strengths are:

- a long 16K-token context window
- high scalability, as the 1.3B and the larger models in the series can suit various types of machines and deployments
- it has been created for coding tasks, which may make it suitable for technical documentation RAG
- it's small, yet it performs quite well on various benchmarks

Some use-cases of such a model are:

- environments requiring strict data privacy
- mobile agentic applications that can run code
- industrial IoT devices performing intelligent tasks on the edge

## step 2: make sure that we can run on CPU only

Just open `htop` or `btop` and, in another terminal tab or window, run:

`ollama run deepseek-coder:1.3b`

In the conversation, tell the LLM to generate a very long sentence and then go back to your `htop`: this will give you a quick sense of the resource consumption of the model's inference.

Still, we need to be absolutely sure that this thing can run on a consumer-grade server as well, provided that it is powerful enough: I have an 8 GB RAM, 2-CPU-core VPS somewhere. This server has no GPU, so let's run it there:

![[ollama-vps-1.png]]

As you can see, I am warned that I will use the thing in CPU-only mode. Alright, let's pull `deepseek-coder:1.3b` on this remote server. After running it on the VPS, I noticed the token throughput was a little bit slower on this remote machine, but the thing works like a charm!

Now I'm thinking "hello, costless intelligent scraping with other general-purpose small-footprint models" πŸ€‘
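If you'd rather script this sanity check than chat interactively, a few lines with the [`ollama` Python library](https://github.com/ollama/ollama-python) mentioned earlier do the trick; this sketch assumes `pip install ollama`, a running `ollama` daemon, and the model already pulled:

```python
# a quick CPU sanity check, assuming a local ollama daemon
# with deepseek-coder:1.3b already pulled
from datetime import datetime

import ollama

start = datetime.now()
response = ollama.chat(
    model="deepseek-coder:1.3b",
    messages=[{"role": "user", "content": "Write a bash one-liner that prints the available RAM."}],
)
print(response["message"]["content"])
print(f"took {(datetime.now() - start).total_seconds():.1f}s")
```

Watching `htop` while this runs gives you the same picture as the interactive session.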
## step 3: RAG with LlamaIndex

Now that our LLM setup is ready, let's put together a RAG system using the famous [RAG in 5 lines of code LlamaIndex example](https://docs.llamaindex.ai/en/stable/getting_started/starter_example_local/), which we'll tweak a little bit to meet our requirements. Basically, we will:

- download Scalingo's documentation to disk
- set up a vector store using an open-source embeddings model from [HuggingFace](https://huggingface.co/)
- load our local instance of `deepseek-coder:1.3b` via LlamaIndex
- create an index with the vector store and our documents
- query our documentation from our terminal!

### wait, what is LlamaIndex in the first place?

[LlamaIndex](https://docs.llamaindex.ai/en/stable/) is a data framework for building context-augmented LLM applications. With it, you can create:

- autonomous agents that can perform research and take actions
- Q&A chatbots
- tools to extract data from various data sources
- tools to summarize, complete, classify, and otherwise transform written content

All these use-cases basically augment what LLMs can do with more relevant context than their initial knowledge base and abilities. LlamaIndex has been designed to let LLMs query large-scale data efficiently. Currently, LlamaIndex officially supports Python and TypeScript.

### and what about embeddings?

Embeddings are basically a way of representing data, in our case text, as vectors (often represented as lists of numbers). A vector is a quantity that has both magnitude and direction. A 2-D vector such as `[3,4]`, for instance, can be thought of as a point in a 2-dimensional space (like an X-Y plane). Vectors used as embeddings by LLMs are high-dimensional vectors that capture a lot of semantic intricacies in text.

These embeddings are produced by specialized neural networks that learn to identify patterns and relationships between words based on their context; needless to say, these models are trained on large datasets of text. Embeddings can be used for document similarity analysis, clustering, enhancing search algorithms, and more.

For the embeddings model, we'll use [`nomic-embed-text-v1.5`](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5), which performs better than several OpenAI models served through their API! This model:

- produces high-dimensional vectors (up to 768 dimensions)
- aligns semantically similar tokens very well
- supports various embedding dimensions (from 64 to 768)
- has a long context of up to 8192 tokens, which makes it suitable for very large documents and pieces of content
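To make "semantically similar vectors" concrete, here is a small sketch (not part of the final app) that embeds two sentences with this model through LlamaIndex's `HuggingFaceEmbedding` wrapper, the same one the main script uses, and compares them with a hand-rolled cosine similarity:

```python
# a minimal sketch: comparing two sentences with nomic-embed-text-v1.5
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="nomic-ai/nomic-embed-text-v1.5",
    trust_remote_code=True,
)

a = embed_model.get_text_embedding("How do I deploy a Node.js app?")
b = embed_model.get_text_embedding("Deploying an express application")

# plain-Python cosine similarity, to avoid pulling in extra dependencies
dot = sum(x * y for x, y in zip(a, b))
norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
print(len(a), dot / norm)  # 768 dimensions; the closer to 1, the more semantically related
```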
πŸ‘¨β€πŸ’» Without a UI and without `pgvector`, our app' looks like this (I am showing you the full script, then we'll provide more explanations on some parts of it) => ```python from datetime import datetime from dotenv import load_dotenv from llama_index.core import ( # function to create better responses get_response_synthesizer, SimpleDirectoryReader, Settings, # abstraction that integrates various storage backends StorageContext, VectorStoreIndex ) from llama_index.core.postprocessor import SimilarityPostprocessor from llama_index.core.query_engine import RetrieverQueryEngine from llama_index.core.retrievers import VectorIndexRetriever from llama_index.embeddings.huggingface import HuggingFaceEmbedding from llama_index.llms.ollama import Ollama from llama_index.vector_stores.postgres import PGVectorStore import logging import os import psycopg2 from sqlalchemy import make_url import sys def set_local_models(model: str = "deepseek-coder:1.3b"): # use Nomic Settings.embed_model = HuggingFaceEmbedding( model_name="nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True ) # setting a high request timeout in case you need to build an answer based on a large set of documents Settings.llm = Ollama(model=model, request_timeout=120) # ! comment if you don't want to see everything that's happening under the hood logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout)) # time the execution start = datetime.now() # of course, you can store db credentials in some secret place if you want connection_string = "postgresql://postgres:postgres@localhost:5432" db_name = "postgres" vector_table = "knowledge_base_vectors" conn = psycopg2.connect(connection_string) conn.autocommit = True load_dotenv() set_local_models() PERSIST_DIR = "data" documents = SimpleDirectoryReader(os.environ.get("KNOWLEDGE_BASE_DIR"), recursive=True).load_data() url = make_url(connection_string) vector_store = PGVectorStore.from_params( database=db_name, host=url.host, password=url.password, port=url.port, user=url.username, table_name="knowledge_base_vectors", # embed dim for this model can be found on https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 embed_dim=768 ) storage_context = StorageContext.from_defaults(vector_store=vector_store) # if index does not exist create it # index = VectorStoreIndex.from_documents( # documents, storage_context=storage_context, show_progress=True # ) # if index already exists, load it index = VectorStoreIndex.from_vector_store(vector_store=vector_store) # configure retriever retriever = VectorIndexRetriever( index=index, similarity_top_k=10, ) # configure response synthesizer response_synthesizer = get_response_synthesizer(streaming=True) # assemble query engine query_engine = RetrieverQueryEngine( retriever=retriever, response_synthesizer=response_synthesizer, # discarding nodes which similarity is below a certain threshold node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)], ) # getting the query from the command line query = "help me get started with Node.js express app deployment" if len(sys.argv) >= 2: query = " ".join(sys.argv[1:]) response = query_engine.query(query) # print(textwrap.fill(str(response), 100)) response.print_response_stream() end = datetime.now() # print the time it took to execute the script print(f"Time taken: {(end - start).total_seconds()}") ``` #### the `StorageContext` LlamaIndex represents data as `indices`, `nodes`, and `vectors`. 
#### the `RetrieverQueryEngine`

The `RetrieverQueryEngine` in LlamaIndex is a versatile query engine designed to fetch relevant context from an index based on a user's query. It consists of:

- a data retriever
- a response synthesizer

In our case, the data retriever is the `VectorIndexRetriever` that we have plugged into our Postgres vector database.

#### the `SimilarityPostprocessor`

With this LlamaIndex module, we make sure that only a subset of the retrieved data is used for the final output, based on a similarity score threshold. It's basically a filter for nodes.

#### `get_response_synthesizer`

This function generates a response from the language model using:

- a query
- a set of text chunks retrieved from the storage context

The text chunks themselves are processed by the LLM using a configurable strategy: the _response mode_. Response modes include:

- **compact**: combines text chunks into larger consolidated chunks that fit within the context window of the LLM, reducing the number of calls needed; this is the default mode
- **refine**: iteratively generates and refines an answer by going through each text chunk; this mode makes a separate LLM call per node (something to keep in mind if you're paying for tokens), making it suitable for detailed answers
- **tree summarize**: recursively merges text chunks and summarizes them in a bottom-up fashion (i.e. building a tree from leaves to root) => it is a "summary of summaries"
- and more [in the docs](https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/#configuring-the-response-mode)!
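Switching modes is a one-argument change on the `get_response_synthesizer` call from the script above; the sketch below is my own illustration, and the `ResponseMode` import path is the one recent LlamaIndex versions expose, as far as I know:

```python
# a minimal sketch: swapping the response mode of the synthesizer
from llama_index.core import get_response_synthesizer
from llama_index.core.response_synthesizers import ResponseMode

# "summary of summaries" instead of the default compact mode
response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.TREE_SUMMARIZE,
    streaming=True,
)
# the rest of the query engine assembly stays exactly the same
```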
## step 4: serve the app with Streamlit

Now that we have a working script, let's wire it to a Streamlit UI. Streamlit is an open-source Python framework designed to simplify the creation and sharing of interactive data applications. It's particularly popular among data scientists and machine learning engineers due to its ease of use and its ability to transform Python scripts into fully functional web applications with minimal code.

Again, this can be done with very few lines of code once you've added `streamlit` to your Python requirements (here, the query engine construction from the previous script has been moved into a `rag` module that exposes a `get_streamed_rag_query_engine` helper):

```python
import logging
import streamlit as st
import sys

from rag import get_streamed_rag_query_engine

# ! comment if you don't want to see everything that's happening under the hood
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# initialize chat history
if "history" not in st.session_state:
    st.session_state.history = []


def get_streamed_res(input_prompt: str):
    query_engine = get_streamed_rag_query_engine()
    res = query_engine.query(input_prompt)
    for x in res.response_gen:
        yield x


st.title("technical documentation RAG demo πŸ€–πŸ“š")

# display the chat messages history
for message in st.session_state.history:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# react to user input
if prompt := st.chat_input("Hello πŸ‘‹"):
    # display the user message
    with st.chat_message("user"):
        st.markdown(prompt)
    # add it to the chat history
    st.session_state.history.append({"role": "user", "content": prompt})
    # display the bot response
    with st.chat_message("assistant"):
        response = st.write_stream(get_streamed_res(prompt))
    # add the bot response to the history as well
    st.session_state.history.append({"role": "assistant", "content": response})
```

... less than 50 lines of code and you have a functional chat UI 😎 The responses could be perfected, but the result is truly impressive, considering how small our model is:

![[streamlit-app-llamaindex-demo-1.webm]]

![[streamlit-app-llamaindex-demo-2.webm]]

## wrapping it up

As you can see, it is easier than ever to build context-rich applications that are cheap in both resource consumption and actual money. There are many more ways to improve this little demo, such as:

- create a multimodal knowledge base RAG using `llava`
- deploy the thing on various platforms (VPS, bare metal server, serverless containers, etc.)
- enhance the generated answers using various grounding techniques
- implement a human-in-the-loop feature, where actual humans take over the bot when things get difficult with a given customer
- make the system more _agentic_ by letting it evaluate if the user's query has been fulfilled, if the user's query is relevant, etc.
- package the app and build it in WebAssembly
- parallelize the calls to the SLM on response generation
- update existing vectors with contents from the same source instead of systematically adding to the vector database
- update the vectorized documentation on a schedule

... don't hesitate to PR at https://github.com/yactouat/documentation-rag-demo if you'd like to improve it!