### Advanced Retrieval for AI with Chroma

These are notes from a [crash course on Retrieval Augmented Generation](https://learn.deeplearning.ai/advanced-retrieval-for-ai/lesson/1/introduction) (RAG) taught by [Anton Troynikov](https://www.troynikov.io/), the co-founder of [Chroma](https://docs.trychroma.com/getting-started), a vector database system designed to work with LLMs.

#### How RAGs work

RAGs are architectural systems designed to augment a Large Language Model's (LLM) existing knowledge. For example, if you are a business that wants to create an AI-augmented internal knowledge base (KB), you might link an OpenAI API call with a vector database of your own company's internal documents. This would allow users to query the KB with natural language and search that database more efficiently.

The flow of actions works in the following way (a minimal sketch follows the list):

- A user inputs a query that is to be passed to an LLM
- Before passing the query to the LLM, the query is converted into an embedding
- The system then looks for relevant information in a vector database (like the KB mentioned above). It does this by finding nearest neighbors in the embedding space. The text of those embeddings is then returned.
- The returned text is then inserted into the query as additional information, with a prompt that instructs the LLM to rely upon the provided information for its answer.
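To make these steps concrete, here is a minimal sketch of the loop using `chromadb` and the OpenAI Python client (both are used later in these notes). The collection name, example documents, and prompt wording are placeholders invented for illustration, not code from the course.

```python
import chromadb
from openai import OpenAI

# 1. Build a small vector database; Chroma embeds the documents with its default embedding function
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="internal_kb")  # placeholder name
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our refund policy allows returns within 30 days of purchase.",
        "Support tickets are answered within two business days.",
    ],
)

# 2. Embed the user's query and find its nearest neighbors in the embedding space
query = "How long do customers have to return a product?"
results = collection.query(query_texts=[query], n_results=2)
retrieved = results["documents"][0]

# 3. Insert the retrieved text into the prompt and instruct the LLM to rely on it
context = "\n".join(retrieved)
prompt = (
    "Answer the question using only the information provided below.\n\n"
    f"Information:\n{context}\n\n"
    f"Question: {query}"
)

openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Later in these notes the collection is built from a real document (Microsoft's 2022 annual report) by a course helper function, but the shape of the loop stays the same.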
### RAG Pitfalls (Relevancy and Distraction)

RAGs can look up information based on embedding similarity; however, a common issue is that the returned documents will be **talking about** similar topics without actually providing the **answer**. That is, the information we end up telling the model to rely upon can actually *distract* the model by providing information that is not relevant to the query. Some solutions to this problem are discussed in the following sections.

#### Query expansion

Reference: https://arxiv.org/abs/2305.03653

One approach to this problem is called "query expansion", which amounts to a clever form of prompt engineering. The way it works is to first ask the LLM to provide an **example answer**, which is then used as input to find matching documents.

For example, here is a function that calls OpenAI with a prompt (in `content`) that illustrates how this would work. In this example, you can imagine that we are asking the LLM to search an annual financial report document for answers.

```python
def augment_query_generated(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. "
            "Provide an example answer to the given question, that might be found in a document like an annual report."
        },
        {"role": "user", "content": query}
    ]

    # openai_client is an OpenAI client instantiated earlier in the notebook
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
```

In practice, this would look like the below...

```python
original_query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(original_query)

joint_query = f"{original_query} {hypothetical_answer}"
print(joint_query)
```

... which might return something like:

```
Was there significant turnover in the executive team? In the fiscal year 2020, there was no significant turnover in the executive team. All key executives remained in their positions throughout the year, providing stability and continuity in leadership. This helped to maintain a cohesive and effective management team, ensuring the successful execution of our strategic initiatives. The executive team's tenure, expertise, and strong working relationships contributed to the company's continued growth and overall success in the market.
```

The above text then becomes the input that we pass to `chroma` ...

```python
results = chroma_collection.query(query_texts=joint_query, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents'][0]
```

**From a geometric perspective, we can imagine this as trying to place our query *closer* to more relevant answers within the embedding space.**

##### Query expansion with multiple queries

A similar approach is to ask the LLM to provide multiple versions of the same question. Here we get further into the prompt-engineering world, as the kinds of questions we get back can really vary based on the instruction we provide the LLM. Below is an example of a function that might do this in the same context.

```python
def augment_multiple_query(query, model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about an annual report. "
            "Suggest up to five additional related questions to help them find the information they need, for the provided question. "
            "Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. "
            "Make sure they are complete questions, and that they are related to the original question. "
            "Output one question per line. Do not number the questions."
        },
        {"role": "user", "content": query}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    content = content.split("\n")
    return content
```

If we print these ...

```python
original_query = "What were the most important factors that contributed to increases in revenue?"
augmented_queries = augment_multiple_query(original_query)

for query in augmented_queries:
    print(query)
```

... we get the below:

```
What were the top-performing product lines or services in terms of revenue growth?
How did changes in pricing affect revenue growth?
Were there any significant increases in customer acquisitions that impacted revenue?
Did any changes in market conditions or demographics contribute to the increase in revenue?
How did changes in marketing strategies or campaigns impact revenue growth?
```

In practice, it looks like the below.

> Note in the below:
> 1. `chroma` can handle multiple queries in parallel without issue by simply passing a list of queries
> 2. we must deduplicate documents that are retrieved by multiple similar queries!

```python
queries = [original_query] + augmented_queries

results = chroma_collection.query(query_texts=queries, n_results=5, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)
```

**From a geometric perspective, we should think about this as creating more queries to search a larger space!**

> *Note on API Costs!*
>
> This approach necessarily increases your API cost roughly $n$-fold, where $n$ is the number of queries you make. This is because each query returns its own set of candidate documents (minus duplicates), and all of those candidates end up in the context passed to the LLM.

#### Cross-encoder re-ranking

This is another approach that allows you to rerank your responses in a way that is more "local" and specific to the query: a cross-encoder scores each query and document together as a pair, rather than relying only on the embedding-space distances (computed from independently embedded queries and documents) that produced the initial results.

```python
# load packages
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
import numpy as np

# Set up embedding function
embedding_function = SentenceTransformerEmbeddingFunction()

# Load the chroma Microsoft annual report collection
# (load_chroma and word_wrap are helper functions provided with the course notebook)
chroma_collection = load_chroma(filename='microsoft_annual_report_2022.pdf',
                                collection_name='microsoft_annual_report_2022',
                                embedding_function=embedding_function)

# Make query, this time taking 10 results so we can rerank the "long tail"
query = "What has been the investment in research and development?"
results = chroma_collection.query(query_texts=query, n_results=10, include=['documents', 'embeddings'])

retrieved_documents = results['documents'][0]

for document in results['documents'][0]:
    print(word_wrap(document))
    print('')
```

Here, we can now load and begin using our cross-encoder:

```python
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Score each (query, document) pair directly
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)

print("Scores:")
for score in scores:
    print(score)
```

this will return:

```
Scores:
0.9869341
2.6445777
-0.26802987
-10.731592
-7.706605
-5.6469994
-4.297035
-10.933233
-7.038428
-7.324694
```

You can see how ranking these responses based on the cross-encoder score will raise up some of the documents that would not have been surfaced near the top previously.

##### Reranking with Query Expansion

This idea can be combined with query expansion as well.

```python
original_query = "What were the most important factors that contributed to increases in revenue?"
generated_queries = [
    "What were the major drivers of revenue growth?",
    "Were there any new product launches that contributed to the increase in revenue?",
    "Did any changes in pricing or promotions impact the revenue growth?",
    "What were the key market trends that facilitated the increase in revenue?",
    "Did any acquisitions or partnerships contribute to the revenue growth?"
]

# Combine queries and pass to chroma
queries = [original_query] + generated_queries
results = chroma_collection.query(query_texts=queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']

# Deduplicate the retrieved documents
unique_documents = set()
for documents in retrieved_documents:
    for document in documents:
        unique_documents.add(document)
unique_documents = list(unique_documents)

# Score every unique document against the *original* query
pairs = []
for doc in unique_documents:
    pairs.append([original_query, doc])
scores = cross_encoder.predict(pairs)

print("Scores:")
for score in scores:
    print(score)
```

Which will print something like the below:

```
Scores:
-10.148884
-8.505109
-5.1418324
-3.7948635
-1.1369953
-10.0839405
-7.754099
-5.27475
-4.6518917
-6.902092
-9.918428
-3.7681568
-11.0792675
-7.917177
-9.80788
-10.711212
-10.042843
-9.768024
-4.818485
-7.490655
-9.357723
-10.000138
-4.341766
```

To reorder, we can use the below:

```python
print("New Ordering:")
for o in np.argsort(scores)[::-1]:
    print(o)
```

which will print:

```
New Ordering:
4
11
3
22
8
18
2
7
9
19
6
13
1
20
17
14
10
21
16
5
0
15
12
```
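The `New Ordering` above is just a list of indices into `unique_documents`. To actually use the reranked results, a final step like the sketch below (not part of the course code; the cutoff of five is an arbitrary choice) maps those indices back to the documents and keeps only the top-scoring passages to pass on to the LLM.

```python
# Keep only the top-k documents according to the cross-encoder scores
# (top_k = 5 is an arbitrary cutoff chosen for illustration)
top_k = 5
top_indices = np.argsort(scores)[::-1][:top_k]
reranked_documents = [unique_documents[i] for i in top_indices]

# These reranked passages become the context for the final LLM call,
# just as in the basic RAG flow described at the top of these notes
for rank, doc in enumerate(reranked_documents, start=1):
    print(f"Rank {rank}:")
    print(word_wrap(doc))
    print('')
```

Whatever cutoff you choose, the context handed to the LLM is now ordered by query-specific relevance rather than by raw embedding distance alone.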
---

#### Related

#AI #LLM