RAG vs Finetuning vs Prompt Engineering: A pragmatic view on LLM implementation
The first half of this article gives a brief introduction to the challenges of implementing LLMs in practice, while the second half focuses on approaches for addressing these challenges and how they compare to each other. Feel free to skip to the latter section if you are already familiar with LLM-related challenges in general.
The Challenge
With the onset of the Generative AI wave, specifically the text-generating subset built on Large Language Models (LLMs), almost every organization is interested in implementing a GenAI application for its own purposes. While this sounds relatively straightforward given the plethora of high-performing LLMs available today, the reality is a far cry from it. Many of the powerful LLMs can respond quite effectively to questions drawn from general knowledge, but when it comes to localized enterprise or domain-specific information the same models underperform by a large margin. And from an organization's perspective, responding based on localized knowledge is mostly what the need is. For example, a text-to-SQL generation model, in addition to generating syntactically correct SQL queries, also needs to know the business semantics of the enterprise and the data schema information, both of which are protected local information.
Now, let us try to generalize the areas where these powerful LLMs fall short when it comes to being fit for practical purposes in an enterprise.
Knowledge Cut-off
The knowledge cut-off problem refers to the model being unaware of any information beyond its training point in time or outside the scope of the training dataset used. Essentially, the model remains static, frozen to the time and scope it was trained on. This applies to any public event that occurred after the time of training as well as to private information that was not available for the model to train on. It becomes a major limitation for enterprise purposes, where private information local to the enterprise is key to responding accurately to requests.
Hallucination
This is a common issue faced by LLMs where the model generates factually incorrect content for the given context. The problem is even more severe in enterprise settings, where the model is expected to understand the intent of the request and respond based on the local context of the enterprise. The causes vary: not knowing the business vocabulary, unfamiliar question patterns, outdated training knowledge, lack of local knowledge private to the organization, and so on. Essentially, the model does not have the knowledge and contextual understanding of what is being requested and force-generates text based on its original generic training.
Blackbox Model
Explainability and interpretability of a model depend on how well it can explain how, why, and from where a particular output was generated. GenAI in general is severely lacking in this space: most models give no information on how a particular output was generated, or even the specific sources of information used for the current generation. In cases where we know the original data sources used for training and the model architecture, especially for in-house trained models, traceability is much improved. But even then, given the vast size of the training datasets, this tells us little about why a particular output was generated. This is a challenge for the trustworthiness and adoption of models.
Inefficient & Costly
The full-fledged training process involved in building an LLM is quite costly in both money and time, and it is not an efficient option for implementing a business use case in an enterprise. Even the operational costs of using LLMs can prove substantial, given the inefficiencies in the implementation approaches used, which remains true even when leveraging some of the methodologies we are going to discuss here. While a detailed discussion of the reasons for this inefficiency is out of scope, we will still touch on some of the contributing factors.
Security Risks & Ethical Concerns
Security vulnerability risks like data theft, data manipulation, and denial of service are common to any business application, especially one open to the internet. Enterprises follow various authentication and authorization mechanisms to ensure only the right party gets the right access. With AI in the picture, these security risks take on an altogether new dimension, as applications become less deterministic. This is especially true for LLMs, which respond to external requests with generated content. Furthermore, there are additional concerns of data and concept leakage when using external GenAI services hosted outside the organization's boundary of control. All of this, along with other factors, leads to multiple ethical concerns, including data privacy violations, sensitive information disclosure, distribution of harmful content, amplification of existing bias, potentially wrong guidance, lack of transparency, and more. Ethical concerns are a large subject area in themselves, and we will not go into a detailed discussion of that topic in this article.
The Solution
These limitations adversely affect GenAI applications leveraging LLMs in terms of accuracy of response, cost of implementation, and the trust factor for adoption, especially for business applications requiring responses contextual to enterprise-internal information. Now let us look at the most prominent techniques used to address these limitations: prompt engineering, finetuning, and retrieval augmented generation.
Prompt Engineering
Prompting is a technique of guiding a language model's response behaviour by refining the inputs supplied. Prompting methods can vary from simple phrases to detailed instructions, depending on the task requirements and model capability. The approach of designing and optimizing prompts for a specific task, to ask the right questions, is called prompt engineering.
Essentially, prompt engineering helps LLMs generate the most desirable response for the given purpose and context. It is a critical requirement for enterprise business applications demanding responses based on a proper understanding of the intent and context of the request. Some of the prompting techniques in practice are basic Direct Prompting, Role Prompting with a model role assignment strategy, Few-Shot Prompting with in-prompt demonstrations, Chain-of-Thought (CoT) Prompting with intermediate step guidance, Self-Ask Prompting with input decomposition, and so on. We will not go into the details of these techniques in this article, but the sketch below illustrates the general idea.
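To make this concrete, here is a minimal sketch of combining Role Prompting and Few-Shot Prompting for the text-to-SQL example mentioned earlier. The schema, example pairs, and question are hypothetical placeholders, not a real enterprise setup; the point is only how the engineered prompt is assembled before being sent to a model.

```python
# A minimal sketch of role + few-shot prompting for a text-to-SQL task.
# The schema, example pairs, and question below are illustrative placeholders.

SYSTEM_ROLE = "You are a SQL assistant for the sales data warehouse."

SCHEMA_HINT = """Tables:
  orders(order_id, customer_id, order_date, total_amount)
  customers(customer_id, region, segment)"""

FEW_SHOT_EXAMPLES = [
    ("Total revenue last month?",
     "SELECT SUM(total_amount) FROM orders "
     "WHERE order_date >= date_trunc('month', CURRENT_DATE) - INTERVAL '1 month';"),
    ("How many enterprise customers do we have?",
     "SELECT COUNT(*) FROM customers WHERE segment = 'enterprise';"),
]

def build_prompt(question: str) -> str:
    """Assemble role, schema context, demonstrations, and the user question."""
    shots = "\n\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in FEW_SHOT_EXAMPLES)
    return f"{SYSTEM_ROLE}\n\n{SCHEMA_HINT}\n\n{shots}\n\nQ: {question}\nSQL:"

print(build_prompt("Which region generated the most revenue this quarter?"))
```

The resulting string is what gets passed to the LLM; the role line sets behaviour, the schema hint supplies local context, and the demonstrations shape the expected output format.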
But prompt engineering alone cannot address the core issue of not having the required business or local enterprise knowledge. This is where we need to look into the next two methods.
Finetuning
Finetuning is the process of further training a pre-trained model on a task-specific dataset. Generally, in the case of LLMs, the base pre-training is done on very large datasets while the finetuning is done on much smaller ones.
The conventional approach to LLM finetuning is full finetuning, where all the parameters are open for update just as in the initial pre-training; the only real difference is the size and content of the dataset. Finetuning on task-specific data, especially the private data of the enterprise, brings in the much-needed domain/business knowledge and local context required for generating more accurate and desired responses. In addition to full finetuning, there are other variations: sequential finetuning on multiple related tasks or domains, multi-level finetuning (MLFT), a variant of the sequential method with a successively finer tuning approach, parameter-efficient finetuning (PEFT), where only selected parameters are updated, low-rank adaptation (LoRA), a variation of PEFT, Adapter Training of plugged-in lightweight modules, and so on. These improved methods help the model achieve better performance, better tuning efficiency, or both.
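As an illustration of the parameter-efficient end of this spectrum, below is a minimal LoRA sketch using the Hugging Face transformers and peft libraries. The base model name, target modules, and hyperparameters are assumptions chosen for illustration; a real setup would pick these based on the model architecture and the enterprise dataset.

```python
# A minimal LoRA (PEFT) sketch using the Hugging Face transformers/peft libraries.
# The base model name and hyperparameters are placeholders; a real run needs a
# task-specific, enterprise-private corpus and an appropriate base model.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Only small low-rank adapter matrices are trained; the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (LLaMA-style)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# From here, training proceeds as usual (e.g. with transformers.Trainer) on the
# enterprise-specific dataset; only the adapter weights are updated and saved.
```

Because only the adapter weights change, the tuned artifact is small and the cost per experiment is far lower than full finetuning, which is why PEFT-style methods are popular for enterprise adaptation.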
While finetuning enables the model to handle a much wider range of scenarios with improved accuracy and reduced hallucination, it is a relatively costly, time-consuming process that requires expertise. Moreover, the method still suffers from the time-period knowledge cut-off and, to some extent, hallucination.
Retrieval Augmented Generation (RAG)
RAG brings in the power of context retrieval from relevant data sources and combines it with the overall prompting strategy to generate contextually accurate responses grounded in facts. Essentially, the RAG technique enables the model to look up external information to improve response generation.
This is an extremely potent capability for the practical implementation of LLMs in enterprises, where the model can actively refer to the latest information private to the enterprise when generating a response. A simple RAG can be just a retrieval mechanism fetching static text content, appending it to the prompt and the request/input text, and feeding this engineered content to the LLM for response/output text generation. While this is the basic concept, implementations generally use approaches like vector-store-based search to get the contextually right subset of information relevant to the request, as in the sketch below.
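Here is a minimal RAG sketch under those assumptions: a handful of placeholder "private" documents are embedded with a sentence-transformers model, the closest ones to the query are retrieved by cosine similarity, and the result is prepended to the prompt. A production system would use a proper vector store and chunking strategy instead of an in-memory array.

```python
# A minimal RAG sketch: embed private documents, retrieve the closest ones for a
# query, and prepend them to the prompt. The documents are illustrative placeholders.

import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Refund requests above $500 require regional manager approval.",
    "The fiscal year at Acme Corp starts on 1 February.",
    "Enterprise-tier customers get a dedicated support channel.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

question = "When does our fiscal year begin?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)  # this engineered prompt is then passed to the LLM for generation
```

Because the retrieved passages can cite their source documents, this design also gives a degree of traceability that the raw model lacks.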
The RAG technique minimizes hallucinations, is time-relevant, transparent about the sourcing of information, and relatively cost-effective. While it is a top-suited option for generation grounded in private content, RAG on its own falls short when it comes to understanding internal/local business language and content. In such scenarios the model might fail to understand the right intent of the request, the meaning of business jargon, contextual derivations and definitions, and so on.
And The Best Approach Is...
We have seen that each of these approaches helps overcome various challenges faced by generic LLMs, and they are unavoidable for implementing practical GenAI business applications that can respond reasonably well to requests while keeping to the local enterprise context.
Now coming to the original question raised, RAG vs Finetuning vs Prompt Engineering - which is the best option? Let us first look at what each technique is good at.
While each method, if enhanced well enough, is capable of addressing most of the limitations mentioned, none can fully replace the others' capabilities. In fact, they complement each other, boosting each other's strong areas and countering the weak ones. Thus, using a combination of prompt engineering, finetuning, and RAG is one of the best approaches for implementing practical, performant business applications leveraging LLMs. This Adaptive RAG method works by leveraging optimized prompts to bring in rightly aligned requests, finetuned models to provide the right local context and business understanding, and generating the final response based on both the identified local context and the retrieved information. Adaptive RAG can be implemented as a single all-purpose model or as an ensemble chain of purpose-specific models; we will not go into more implementation detail in this article, but the sketch below shows how the pieces fit together.
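The following sketch shows one way the three techniques can be chained. The retrieval and generation functions are explicitly placeholders: the retrieval step stands in for a vector-store lookup like the one shown earlier, and the generation step stands in for inference against a domain-finetuned model through whatever serving API is in use.

```python
# A sketch of how the three techniques can be chained: retrieval supplies fresh
# private context, prompt engineering shapes the request, and a finetuned model
# (represented here by a placeholder function) provides the domain understanding.

def retrieve_context(question: str) -> str:
    """Placeholder for a vector-store lookup over enterprise documents."""
    return "Q3 revenue target for the EMEA region is 12M EUR."

def generate_with_finetuned_model(prompt: str) -> str:
    """Placeholder for inference against the domain-finetuned LLM."""
    return "<model response>"

def adaptive_rag_answer(question: str) -> str:
    context = retrieve_context(question)          # RAG: ground on local, current facts
    prompt = (                                    # prompt engineering: role + grounding rules
        "You are an analyst assistant. Answer strictly from the context; "
        "say 'unknown' if the context does not cover the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate_with_finetuned_model(prompt)  # finetuned model: domain vocabulary and intent

print(adaptive_rag_answer("What is the Q3 revenue target for EMEA?"))
```

In an ensemble-chain variant, each step could be served by a separate purpose-specific model rather than a single all-purpose one, but the flow of context through the prompt stays the same.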
Conclusion
In the current state of the art, the combination of prompt engineering, finetuning, and RAG with an ensemble of models, an Adaptive RAG (chain), can be the right choice for implementing a practical LLM-based business application. While it is still not the perfect solution for all the challenges we discussed earlier, it can address most of them to a good extent, and it is definitely an option worth trying for enterprises focusing on GenAI implementation in their applications.