20 July 2024
RAG: Retrieval Augmented Generation with Python
N.Rampersand
Retrieval Augmented Generation
Agenda
- Large Language Models
- Limitations of LLMs
- Retrieval Augmented Generation (RAG)
- Technical Deep Dive
- Live Demo
- The Potential of RAG
Large Language Models
- Generative AI: Understand and generate "human-like" text
- Architecture: Transformer architectures with self-attention mechanisms
- Models: GPT, Llama, BERT, Mistral, Phi
- Applications: Translation, summarisation, question answering, document analysis, coding assistance, research, customer service, etc.
Limitations of Large Language Models
- Hallucinations: Generating incorrect or nonsensical information.
- Knowledge Cutoff: Limited by the training data's cutoff date.
- Limited Interpretability: It is difficult to trace how the model arrives at its conclusions.
What is RAG?
- A hybrid approach that combines the generative capabilities of LLMs with real-time information retrieval.
- Enhances the accuracy and reliability of generative models.
- Builds trust by grounding responses in retrieved, verifiable sources.
RAG example
An overview of RAG
Why is RAG Important Today?
- Mitigates hallucinations and the knowledge-cutoff problem.
- Enhances transparency by giving access to the sources behind each answer.
- Builds trust.
Real-world Applications of RAG
- Customer Support
- Academic Research
- Healthcare and Diagnostics
- Documentation and Development
- Legal Document Analysis
- Financial Analysis
Main RAG Components
- Query Processing and Information Retrieval
- Processing Documents
- Response Generation
Processing Documents in a RAG System
Query Processing, Retrieval, and Response Generation
Types of RAG Systems
- Sparse Retrieval: Uses keywords and TF-IDF scores. Efficient, but not context-aware.
- Dense Retrieval: Uses embedding models to capture the semantic meaning of queries and documents. Context-aware (a minimal comparison is sketched below).
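As a rough illustration of the difference, the following sketch scores a toy corpus against one query with both approaches. It assumes scikit-learn and sentence-transformers are installed; the corpus, query, and model name are illustrative and not part of the later demo.

# Minimal sparse-vs-dense retrieval comparison (illustrative corpus and query)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

corpus = [
    "Paul Graham co-founded the startup accelerator Y Combinator.",
    "He wrote many essays about programming and startups.",
    "Lisp is a family of programming languages.",
]
query = "Did he enjoy writing code?"

# Sparse retrieval: TF-IDF keyword overlap, fast but purely lexical
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(corpus)
sparse_scores = cosine_similarity(tfidf.transform([query]), doc_vectors)[0]

# Dense retrieval: embeddings let "writing code" match "essays about programming"
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = cosine_similarity(model.encode([query]), model.encode(corpus))[0]

for doc, s, d in zip(corpus, sparse_scores, dense_scores):
    print(f"sparse={s:.2f}  dense={d:.2f}  {doc}")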
Simple RAG Architecture
Full RAG Process
The implementation process:
1. Load documents from a folder
2. Split the documents into chunks
3. Create vector embeddings of the chunks
4. Store the vector embeddings in a vector database
5. Get the user query
6. Create a vector embedding of the query
7. Conduct dense retrieval using a semantic search of the vector database
8. Filter the top 3 chunks according to similarity score
9. Generate the prompt by combining the user query, the retrieved context, and the prompt configuration
10. Send the prompt to the LLM
11. The LLM generates an accurate and reliable answer
Implementation of RAG
What Do We Need?
- LLM
- RAG Framework
- Data
- RAG Pipeline
Tools and Technologies
- Language: Python
- LLM runtime: Ollama
- Model: Llama 3 (8B)
- RAG Framework: LangChain
- Data: A biography about Paul Graham (text)
Note: Another capable RAG library is llama-index.
Implementation
- Imports
# Imports
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
- Document loader
# loader = PDFPlumberLoader("data/paul_graham_essay.pdf")
# loader = TextLoader("data/paul_graham_essay.txt")
loader = DirectoryLoader(
    "data/",
    glob="**/*.txt",
    loader_cls=TextLoader,
    use_multithreading=True,
    show_progress=True,
)
docs = loader.load()

# Check the number of pages
print("Number of pages in ingested data:", len(docs))
- Text splitting
# Split the documents into chunks
embedder = HuggingFaceEmbeddings(model_kwargs={'device': 'mps'})
text_splitter = SemanticChunker(embedder)
documents = text_splitter.split_documents(docs)
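SemanticChunker embeds sentences to decide where to split, which can be slow on a large corpus. A cheaper, commonly used alternative (not what this demo uses) is a fixed-size splitter; the chunk sizes below are illustrative, and depending on the LangChain version the import may live in langchain_text_splitters instead.

# Alternative: fixed-size splitting with overlap (illustrative sizes)
from langchain.text_splitter import RecursiveCharacterTextSplitter

fallback_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
documents = fallback_splitter.split_documents(docs)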
- Embedding
# Create vector embeddings
# Check number of chunks created
print("Number of chunks created: ", len(documents))

# Print the chunks
for i in range(len(documents)):
    print()
    print(f"CHUNK : {i+1}")
    print(documents[i].page_content)

# Create the vector store
vector = FAISS.from_documents(documents, embedder)
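Embedding every chunk on each run can be slow, so the FAISS store is often persisted to disk and reloaded later. The folder name below is illustrative, and recent LangChain versions require the allow_dangerous_deserialization flag when loading a locally saved index.

# Optional: save the vector store so chunks are not re-embedded on every run
vector.save_local("faiss_index")  # illustrative folder name

# Later, reload it with the same embedding model:
# vector = FAISS.load_local("faiss_index", embedder, allow_dangerous_deserialization=True)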
- Retriever
# Retrieval
retriever = vector.as_retriever(search_type="similarity", search_kwargs={"k": 3})
retrieved_docs = retriever.invoke("did he like programming?")
retrieved_docs
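To inspect what the retriever returned, you can loop over the Document objects; the 'source' key used below is the file path that DirectoryLoader/TextLoader record in each document's metadata.

# Print the retrieved chunks and the file each one came from
for i, doc in enumerate(retrieved_docs, start=1):
    print(f"--- Chunk {i} (source: {doc.metadata.get('source', 'unknown')}) ---")
    print(doc.page_content[:300])  # first 300 characters, for readability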
- LLM and prompt
from langchain.chains import RetrievalQA
from langchain.chains.llm import LLMChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")
- Prompt
prompt = """ 1. Use the following parts of context to answer the question at the end. 2. If you don't know the answer, then just say that "I don't know" but don't make up an answer on your own. 3. Keep the answer clear and limited to 3 or 4 sentences. There is no need to say "According to the context" Context: {context} Question: {question} Helpful Answer:"""
- RAG pipeline
contextual_prompt = PromptTemplate.from_template(prompt)

llm_pipeline = LLMChain(
    llm=llm,
    prompt=contextual_prompt,
    callbacks=None,
    verbose=True,
)

# Format each retrieved document before it is stuffed into the {context} slot
rag_prompt = PromptTemplate(
    input_variables=["page_content", "source"],
    template="Context:\ncontent:{page_content}\nsource:{source}",
)

retrieval = StuffDocumentsChain(
    llm_chain=llm_pipeline,
    document_variable_name="context",
    document_prompt=rag_prompt,
    callbacks=None,
)

chat = RetrievalQA(
    combine_documents_chain=retrieval,
    verbose=False,
    retriever=retriever,
    return_source_documents=False,
)
- Testing
print(chat("did paul graham like animals?")['result'])
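Depending on the LangChain version, calling the chain directly as above may trigger a deprecation warning; the equivalent call through the newer invoke interface looks roughly like this (RetrievalQA uses "query" as its default input key and "result" as its default output key).

# Same question via the invoke() interface
response = chat.invoke({"query": "did paul graham like animals?"})
print(response["result"])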
Data directory
Results