Check out the latest blog by our Consultant, Tristan Hosken, as he explores Retrieval Augmented Generation (RAG). Tristan provides insights into advantages and disadvantages of RAG through hands-on experiments with AWS’s Bedrock and Azure’s OpenAI service.
It is hard to have missed the excitement around Large Language Models (LLMs) over the last year or so. At OpenCredo we have been following their development, and since hands-on experience is the best way to understand a technology, we have started to experiment with building an application around an LLM.
The use case we chose for our experimentation was to have a chat app that we could ask for information about a group of people, based on their CVs. The idea is that we can feed it the CVs of these people and then ask questions about their skills and experience. This use case was chosen because it involves semi-structured documents with quite a high density of information, rather than free flowing and verbose texts such as this blog post, so could present more of a challenge for the application.
This application lends itself well to Retrieval Augmented Generation (RAG). This method involves using data retrieved from a knowledge base to assist the LLM in generating its response. The power here is that a vast store of information (the knowledge base) can be used to tailor a generic LLM to a particular use case, using your own data, and without having to repeatedly train the model to be able to provide it with more recent data.
A further benefit of using a knowledge base is that we can attribute responses back to the source content; this helps build our confidence in the responses and allows us to check the original source for further information or context. It should also help to reduce the chance of hallucinations, by restricting the information in responses to information found in the knowledge base.
A high level overview of how RAG typically works is as follows:
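As a rough, generic sketch of that flow (not the implementation behind either service we tested): at ingestion time the documents are split into chunks and each chunk is embedded and stored; at query time the question is embedded, the most similar chunks are retrieved, and they are placed into the prompt given to the LLM. The `embed` function below is a toy word-count stand-in for a real embedding model, and every name in the sketch is illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion: embed every chunk and store it alongside its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Generation input: numbered context chunks followed by the question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below and cite the sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


store = ingest([
    "Alice: 5 years of Kubernetes and Terraform.",
    "Bob: Java developer, 3 years of Spring.",
    "Carol: data engineer using Spark and Kafka.",
])
top = retrieve("Who knows Kubernetes?", store, k=1)
prompt = build_prompt("Who knows Kubernetes?", top)  # this string would be sent to the LLM
```

Numbering the chunks in the prompt is one simple way to let the model point back at its sources. A word-count embedding only matches exact words, so it would not find related concepts the way a real embedding model can.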
The documents we will be ingesting into this RAG application will be CVs. As always, data privacy is top of our minds, but we also wanted to have the freedom to try out different models and services without having to slow down to delve into the security and privacy details first. Therefore, instead of using real CVs we created fictitious ones. We want this application to be able to work with any (sensible) CV format. So, while we know the format of the CVs, we will not be making use of this knowledge to pre-process the documents or prompting the LLM with this information to assist the generation of accurate responses.
Now that we have the problem we are using for the experiment (CV Q&A chatbot), our core methodology (RAG), and the data to form our knowledge base (fictitious CVs), we can think about how we want to implement this and the technologies we want to try out.
The two main services we will be using are AWS Bedrock and Azure OpenAI. It’s interesting to note the significant difference in how AWS and Azure have incorporated LLMs into their cloud services – AWS make a large number of internally and externally produced models accessible via Bedrock, while Azure have invested in partnering with OpenAI and solely offer OpenAI’s models as an integration.
We wanted to avoid spending time writing a full application just to test the performance of RAG with these LLMs. For Azure OpenAI we found a sample app on GitHub, and for AWS Bedrock we used its Knowledge Base feature, which is a managed RAG implementation that comes with a simple UI in the AWS console for testing. These are the two setups we will be using for the tests:
Our approach consisted of asking the same set of questions to each application and assessing the quality and accuracy of the responses. These models are not deterministic, so asking the same question multiple times can produce different answers. In this experiment we didn’t want to go down the route of asking each question X times and then calculating the percentage of times it was right; we just wanted to get an initial feel for the performance and behaviours – perhaps in the future…
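For reference, the "ask each question X times" route we did not take could look something like the sketch below. It is purely illustrative: `fake_ask` is a canned stand-in for a call to the chat application, and checking whether the expected name appears in the answer is a crude, assumed notion of "correct".

```python
from itertools import cycle
from typing import Callable


def accuracy_over_runs(
    ask: Callable[[str], str], question: str, expected: str, runs: int = 10
) -> float:
    """Ask the same question `runs` times and return the fraction of answers
    that mention the expected string (case-insensitive substring match)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = sum(1 for _ in range(runs) if expected.lower() in ask(question).lower())
    return hits / runs


# Hypothetical stand-in for the application: cycles through canned answers
# to mimic a model that does not always answer the same way.
_canned = cycle([
    "Bob has the most DevOps experience.",
    "Bob",
    "Alice",
    "Bob, with 8 years.",
])


def fake_ask(question: str) -> str:
    return next(_canned)


score = accuracy_over_runs(fake_ask, "Who has the most DevOps experience?", "Bob", runs=4)
```

Substring matching would wrongly count an answer such as "not Bob" as correct, so a real evaluation would need a stricter check.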
Before looking at where the application fell down, we should mention where it performed well. As we have all quickly become accustomed to over the last year or so, the answers were well crafted linguistically, and never failed to interpret the intent of the questions or the terms used. A point that I found particularly impressive, and a good demonstration of the power of RAG vs classical search applications, was that when asking for examples of platform engineering projects it managed to return examples of projects that don’t mention platform engineering explicitly but that did describe platform topics and tools. This shows how the LLM and embedding models’ ability to “understand” the context and meaning of words and synonymous concepts provides a significant benefit to query results.
Some of the issues
Both approaches performed comparably, so from here on we continue with only testing the Knowledge Base feature in AWS Bedrock, using Anthropic’s Claude V2 LLM.
From these initial results we came up with the hypothesis that we could improve the results if we could ensure that the context of each chunk was maintained. A briefly explored avenue was to try to process the document more intelligently, breaking it into chunks at “sensible” locations rather than by token count, and parsing tables so that they are just viewed as a single lump of text. However, this is clearly opening up a complex rabbit hole, so investigation in this direction was paused in favour of quicker paths.
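To make the difference concrete, here is a minimal sketch comparing fixed-size chunking with splitting at "sensible" locations. It rests on assumptions: words stand in for tokens, blank lines stand in for section boundaries, and the `with_context` helper (which prefixes each chunk with a label naming its source) is just one hypothetical way of keeping context attached to a chunk.

```python
def chunk_by_size(text: str, max_words: int) -> list[str]:
    """Fixed-size chunking: cut every max_words words, ignoring structure."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunk_by_section(text: str) -> list[str]:
    """Structure-aware chunking: split on blank lines so each section stays whole."""
    return [" ".join(block.split()) for block in text.split("\n\n") if block.strip()]


def with_context(chunks: list[str], label: str) -> list[str]:
    """Prefix every chunk with a label so it still says whose CV it came from."""
    return [f"[{label}] {chunk}" for chunk in chunks]


cv = (
    "Name: Alice Example\nRole: Platform Engineer\n\n"
    "Skills: Kubernetes, Terraform, AWS\n\n"
    "Experience: 5 years at ExampleCorp building CI/CD pipelines"
)

fixed = chunk_by_size(cv, max_words=8)   # the skills list is split across two chunks
sections = chunk_by_section(cv)          # one chunk per section
labelled = with_context(sections, "CV of Alice Example")
```

With the fixed-size split, the second chunk starts mid-way through the skills list and no longer says whose skills they are, which is the kind of lost context the hypothesis is about.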
As a quick and dirty experiment we removed the RAG part and just passed the contents of the CVs directly to Claude V2, prefixed with a simple prompt explaining that these are CVs. The results from asking our test set of questions were impressive: it got them all completely correct! This clearly isn’t a scalable approach though; we were only able to do this because our document set is small enough (<1k tokens per document) to easily fit within the 100k token context length. This is why RAG is important: it avoids this limitation.
However, this did inspire the next test, which was to not divide each CV into chunks, and instead generate the embeddings from whole documents. The results here were as good as when passing all the documents to Claude, except for still failing the question about who has the most DevOps experience.
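A minimal sketch of this "no retrieval" setup is shown below. The prompt wording is not the one we used, and `estimate_tokens` is a deliberately rough assumption (about four characters per token) rather than a real tokenizer; the point is only that the whole document set has to fit inside the context window.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def build_full_context_prompt(
    cvs: dict[str, str], question: str, context_limit: int = 100_000
) -> str:
    """Put every CV into a single prompt, failing if it exceeds the context limit."""
    header = "The following documents are CVs. Answer the question using only their contents."
    body = "\n\n".join(f"--- CV: {name} ---\n{text}" for name, text in cvs.items())
    prompt = f"{header}\n\n{body}\n\nQuestion: {question}"
    if estimate_tokens(prompt) > context_limit:
        raise ValueError("CVs do not fit in the context window; retrieval is needed")
    return prompt


cvs = {
    "Alice": "Platform engineer, 5 years of Kubernetes.",
    "Bob": "Java developer, 8 years.",
}
prompt = build_full_context_prompt(cvs, "Who knows Kubernetes?")
```

Once the estimate exceeds `context_limit`, the function raises instead of returning a prompt, which is the point at which something like RAG becomes necessary.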
What we’ve shown here is that the results are only as good as the chunks retrieved. We know the LLMs are capable if they are provided with the correct information, but if the wrong chunks, or chunks with incomplete information, are provided then we end up in a garbage in, garbage out situation. This highlights the importance of the chunking process in getting good results. The solution we found here was to have all of the information about a particular person in a single chunk. This type of solution won’t always be possible though, and it demonstrates the importance of understanding and considering your problem space when deciding on parameters such as chunk size and chunking algorithm. Claude V2 is perfectly capable of coping with larger chunk sizes, since its context window can be as large as 100k tokens. However, the vector embeddings are where large chunk sizes could cause issues, since each vector now has to encode more information and potentially more concepts, which could lead to less accurate retrieval. Additionally, the larger the chunk size, the fewer chunks can be presented to the LLM for the generation step before filling its context window.
We still observed incorrect answers when the question involved comparing across the whole set of documents (or a large segment of it), such as asking who has the most years of experience. I don’t think this is going to be easy to solve in the RAG domain, since correctly answering this question relies on collecting the years of experience from all documents before the comparison can be done, whereas current RAG algorithms retrieve only the chunks most similar to the question rather than filtering in such a way.
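The last trade-off is simple arithmetic, sketched below. The 100k window comes from the text above; the chunk sizes and the `reserved_tokens` budget for the question, instructions and answer are made-up numbers for illustration.

```python
def max_chunks_in_context(
    context_window: int, chunk_tokens: int, reserved_tokens: int = 2_000
) -> int:
    """How many chunks of a given size fit in the context window, after
    reserving room for the question, instructions and the generated answer."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return max(0, (context_window - reserved_tokens) // chunk_tokens)


small = max_chunks_in_context(100_000, chunk_tokens=300)     # 326 chunks
medium = max_chunks_in_context(100_000, chunk_tokens=1_000)  # 98 chunks
large = max_chunks_in_context(100_000, chunk_tokens=10_000)  # 9 chunks
```

In practice the number of chunks retrieved is usually far smaller than these upper bounds, but the same relationship holds: larger chunks leave room for fewer distinct sources in the generation step.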
Where RAG has really shown its value vs classical search methods is in understanding the content of the text and then being able to find sections that would be difficult to find using normal searches based around keywords. Our example of this was finding platform engineering projects without the project explicitly defining itself as such. The power of the LLM vs keyword based searches is that in the latter we would have missed these projects unless the user took the time and effort to include enough related keywords – and then sift through the extra unwanted results that were returned.
This experimentation was done with a pretty small set of documents (manual document generation was expensive in time and effort) and a low number of questions. It would be better to test with a significantly larger set of documents and more questions, to be able to get some more robust results.
Additionally, we started with two different apps because we wanted to get some exposure to both AWS’s and Azure’s AI services, and to see if there was much of a comparison to be made. In future, picking just one approach will make it easier to automate the experiments and therefore allow more rigorous testing and data collection.
In future tests we envisage digging deeper into chunk sizing, including introducing mixed chunk sizes, and using “intelligent” chunking rather than predetermined chunk sizes. As part of this intelligent chunking I would also want to add to the capabilities of the document parsing, such as trying to accurately represent tables, and potentially introducing caption generation for images.
We didn’t touch on any direct experimentation with the vector embeddings and their similarity search, but it could be interesting to focus on how the embedding and similarity search performs with different chunk sizes and the different number of topics/concepts covered in a chunk.
Finally, I’d like to explore mechanisms for handling questions that require comparisons across a large set of documents (the “Who has the most years of experience?” problem). Initially there are two approaches we could take, but both require upfront consideration of the subject of these questions, and therefore add to the level of tailoring needed for the particular use case.
The first approach is to extract certain information out of the documents, such as the number of years of experience, and attach this as metadata to the chunks for each document, or store it in a separate database. At query time, questions that require sorting or min/max queries can use this metadata to correctly identify relevant chunks in combination with the similarity search – leading to a hybrid approach that makes use of LLMs in addition to an exact, deterministic traditional query engine.
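A minimal sketch of this first approach might look as follows. Everything here is hypothetical: the `Chunk` structure, the field name `years_experience`, and in particular `extract_years`, which uses a regular expression as a stand-in for what would more likely be an LLM extraction step.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    person: str
    text: str
    years_experience: Optional[int] = None  # metadata attached at ingestion


def extract_years(text: str) -> Optional[int]:
    """Stand-in extractor: first 'N years' mention, or None if there is none."""
    match = re.search(r"(\d+)\s+years", text)
    return int(match.group(1)) if match else None


def most_experienced(chunks: list[Chunk]) -> Chunk:
    """Deterministic max query over the metadata; no similarity search involved."""
    candidates = [c for c in chunks if c.years_experience is not None]
    if not candidates:
        raise ValueError("no chunk has years_experience metadata")
    return max(candidates, key=lambda c: c.years_experience)


raw = {
    "Alice": "Platform engineer with 5 years of experience in Kubernetes.",
    "Bob": "DevOps engineer with 8 years of experience in AWS.",
    "Carol": "Data engineer working with Spark.",
}
chunks = [Chunk(person, text, extract_years(text)) for person, text in raw.items()]
winner = most_experienced(chunks)  # this chunk would then be passed to the LLM as context
```

The comparison itself is done by ordinary code over every document's metadata, so it does not depend on which chunks the similarity search happens to return.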
The second approach is to keep wholly within the RAG paradigm. On document ingestion we still require extracting metadata from the documents, but we create summary documents by adding all the metadata of each type into a single document that then itself gets ingested into the knowledge base. So, for example, we would have a document containing the years of experience of each person. Query time then remains exactly the same as vanilla RAG, and the “Who has the most years of experience?” question should pick up that document for use in the generation of its answer. For larger knowledge base sizes these summary documents would have to be chunked and this may impact the accuracy of the results, but this just opens up further avenues for experimentation. For example, we could reduce document size by partitioning the summary by specialism, ending up with a years of experience document for DevOps specialists, another for data engineers, etc.
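The summary documents for this second approach could be generated along the lines of the sketch below. The record fields (`name`, `specialism`, `years`) and the wording of the summary are assumptions; in practice the records would come from an extraction step over the CVs.

```python
from collections import defaultdict


def build_summary_documents(people: list[dict]) -> dict[str, str]:
    """Create one 'years of experience' summary document per specialism.
    Each returned document would be ingested into the knowledge base as usual."""
    by_specialism: dict[str, list[dict]] = defaultdict(list)
    for person in people:
        by_specialism[person["specialism"]].append(person)

    documents = {}
    for specialism, group in by_specialism.items():
        ordered = sorted(group, key=lambda p: p["years"], reverse=True)
        lines = [f"{p['name']}: {p['years']} years of experience" for p in ordered]
        documents[specialism] = (
            f"Years of experience for {specialism} specialists\n" + "\n".join(lines)
        )
    return documents


people = [
    {"name": "Alice", "specialism": "DevOps", "years": 5},
    {"name": "Bob", "specialism": "DevOps", "years": 8},
    {"name": "Carol", "specialism": "Data Engineering", "years": 3},
]
summaries = build_summary_documents(people)
```

Sorting each summary by years is an optional choice made here so that the answer to a "who has the most" question sits at the top of the document.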
In both cases we should be able to use the LLM for extracting data for the summary documents, allowing us to remain agnostic to the documents’ structures.
This blog is written exclusively by the OpenCredo team. We do not accept external contributions.