# Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a method for including (parts of) matching documents as context for questions to a Large Language Model (LLM).
This can help reduce hallucinations and wrong answers.
A system for RAG has two parts: a document database with a search index and a large language model.

When the user asks a question, the question is handled in two stages.
First, the question is used as a search query for the document database.
The search results are then sent together with the question to the LLM.
The LLM is prompted to answer the question based on the context in the search results.
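
The two stages can be sketched in plain Python. This is a toy illustration, not the LangChain pipeline built later in this chapter: the word-overlap scoring, the function names `search` and `build_prompt`, and the sample documents are all made up for the example.

```python
def search(question, documents, k=1):
    """Stage 1: rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())

    def overlap(document):
        return len(question_words & set(document.lower().split()))

    return sorted(documents, key=overlap, reverse=True)[:k]


def build_prompt(question, search_results):
    """Stage 2: combine the search results and the question into one prompt."""
    context = '\n'.join(search_results)
    return f'Context: {context}\n\nQuestion: {question}\n\nAnswer:'


# Made-up mini document collection
documents = [
    'Bitcoin transactions cannot be reversed.',
    'FAISS is a library for similarity search.',
]
question = 'Can bitcoin transactions be reversed?'
prompt = build_prompt(question, search(question, documents))
print(prompt)  # this text is what would be sent to the LLM
```

A real system replaces the word-overlap search with a vector search index and sends the prompt to an LLM.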

We will use [LangChain](https://www.langchain.com/), an open-source library for making applications with LLMs.
This chapter was inspired by the article
[Retrieval-Augmented Generation (RAG) with open-source Hugging Face LLMs using LangChain](
https://medium.com/@jiangan0808/retrieval-augmented-generation-rag-with-open-source-hugging-face-llms-using-langchain-bd618371be9d).

## Installing Software
We’ll need to install some libraries first:

In [1]:
!pip install --upgrade sentence-transformers huggingface-hub faiss-cpu sentencepiece protobuf langchain langchain-community pypdf

Defaulting to user installation because normal site-packages is not writeable


## The Language Model
We’ll use models from [HuggingFace](https://huggingface.co/), a website that has tools and models for machine learning.
We’ll use the open-source LLM [mistralai/Mistral-Nemo-Instruct-2407]( https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407).
This model has 12 billion parameters.
For comparison, one of the largest LLMs at the time of writing is Llama 3.1, with 405 billion parameters.
Still, Mistral-Nemo-Instruct is around 25 GB, which makes it quite a large model.
To run it, we must have a GPU with at least 25 GB memory.
It can also be run without a GPU, but that will be much slower.
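
The memory requirement follows roughly from the parameter count. Here is a back-of-the-envelope sketch, assuming that the weights are stored as 16-bit numbers (2 bytes per parameter) and ignoring the extra memory needed during text generation. The helper name `weights_size_gb` is made up for this example.

```python
def weights_size_gb(n_parameters, bytes_per_parameter=2):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_parameters * bytes_per_parameter / 1e9


# Assumption: 16-bit weights, i.e. 2 bytes per parameter
print(weights_size_gb(12e9))   # Mistral-Nemo-Instruct, about 12 billion parameters
print(weights_size_gb(405e9))  # Llama 3.1 with 405 billion parameters
```

Under these assumptions, 12 billion parameters take about 24 GB, which is in line with the model size mentioned above.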

### Model Storage Location
We must download the model we want to use.
Because of the requirements mentioned above, we run our program on the [Fox](https://www.uio.no/english/services/it/research/hpc/fox/) high-performance computer at UiO.
We must set the location where our program should store the models that we download from HuggingFace:

In [2]:
%env HF_HOME=/fp/projects01/ec367/huggingface/cache/

env: HF_HOME=/fp/projects01/ec367/huggingface/cache/


```{note}
If you run the program locally on your own computer, you might not need to set `HF_HOME`.
```

### HuggingFace Login
Even though the model Mistral-Nemo-Instruct-2407 is open source, we must log in to HuggingFace to download it.

In [3]:
from huggingface_hub import login
login()


### The Model
Now, we are ready to download and use the model.
To use the model, we create a *pipeline*.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method `HuggingFacePipeline.from_model_id()`, which automatically downloads the specified model from HuggingFace.

In [4]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id='mistralai/Mistral-Nemo-Instruct-2407',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 300,
        'temperature': 0.3,
        'num_beams': 4,
        'do_sample': True
    }
)


```{admonition} Pipeline Arguments
We give some arguments to the pipeline:
- `model_id`: the name of the model on HuggingFace
- `task`: the task you want to use the model for. Other alternatives are translation and summarization.
- `device`: the GPU hardware device to use. If we don't specify a device, no GPU will be used.
- `pipeline_kwargs`: additional parameters that are passed to the model.
    - `max_new_tokens`: maximum length of the generated text
    - `do_sample`: by default, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.
    - `temperature`: the temperature controls the statistical *distribution* of the next word and is usually between 0 and 1. A low temperature increases the probability of the most likely words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.
    - `num_beams`: by default, the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
```
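
To get a feeling for the temperature, here is a small sketch of the usual formulation: the model's raw scores are divided by the temperature before they are turned into probabilities with the softmax function. The three scores are made up, and the function name `softmax_with_temperature` is only for illustration.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw model scores into probabilities, scaled by the temperature."""
    scaled = [score / temperature for score in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words
scores = [2.0, 1.0, 0.1]
for temperature in (0.3, 1.0):
    probabilities = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

With the lower temperature, almost all of the probability goes to the highest-scoring word. With the higher temperature, the less likely words get a larger share.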

```{tip}
If you're working on a computer with less memory, you might need to try a smaller model.
You can try for example `mistralai/Mistral-7B-Instruct-v0.3` or `meta-llama/Llama-3.2-1B-Instruct`. The latter has only 1 billion parameters, and might be possible to use on a laptop, depending on how much memory it has.

If you use models from Meta, you might need to set `pad_token_id`:

    llm.pipeline.tokenizer.pad_token_id = llm.pipeline.tokenizer.eos_token_id

```

## Using the Language Model
Now, the language model is ready to use.
Let’s try to use only the language model without RAG.
We can send it a query:


In [5]:
query = 'what are the main problems with bitcoin?'
output = llm.invoke(query)
print(output)

what are the main problems with bitcoin? Bitcoin has several challenges and criticisms, including:

1. **Volatility**: Bitcoin's price is highly volatile, making it less suitable as a medium of exchange for everyday transactions. Its value can fluctuate significantly over short periods, which can make it difficult to use for purchasing goods and services.

2. **Scalability**: Bitcoin's network can only process a limited number of transactions per second (around 7), which can lead to slower transaction times and higher fees during periods of heavy usage. This is often referred to as the "block size debate."

3. **Energy Consumption**: Bitcoin's proof-of-work (PoW) consensus mechanism requires a large amount of energy to secure the network. This has led to concerns about its environmental impact. Some estimates suggest that Bitcoin's energy consumption is comparable to that of entire countries.

4. **Regulation**: Bitcoin's decentralized nature makes it difficult for governments to regul

This answer was generated based only on the information contained in the language model.
To improve the accuracy of the answer, we can provide the language model with additional context for our query.
To do that, we must load our document collection.


## The Vectorizer
Text must be [vectorized](vectorizing) before it can be processed.
Our HuggingFace pipeline will do that automatically for the large language model.
But we must make a vectorizer for the search index for our documents database.
As the vectorizer, we use an embedding model from HuggingFace.
Again, the HuggingFace library will automatically download the model.

In [6]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name='BAAI/bge-m3',
    model_kwargs = {'device': 'cuda:0'},
    #or: model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

```{admonition} Embeddings Arguments
These are the arguments to the embedding model:
- `model_name`: the name of the model on HuggingFace
- `device`: the hardware device to use, either a GPU or CPU
- `normalize_embeddings`: embeddings can have different magnitudes. Normalizing scales every embedding to length 1, so that comparisons between embeddings depend only on their direction.
```
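
The effect of normalization can be illustrated with two small vectors. This is a sketch in plain Python with made-up 2-dimensional vectors; the helper functions `normalize` and `dot` are written for this example and are not part of the embedding library.

```python
import math


def normalize(vector):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, but ten times longer

print(dot(a, b))                        # depends on the magnitudes
print(dot(normalize(a), normalize(b)))  # depends only on the direction
```

After normalization, the dot product of two vectors is the same as their cosine similarity.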

## Loading the Documents
We use a document loader from the LangChain library
to load all the PDFs in the folder called `documents`.

In [7]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

In [8]:
#from langchain_community.document_loaders import DirectoryLoader
#loader = DirectoryLoader('./documents/')

In [9]:
loader = PyPDFDirectoryLoader('./documents/')
docs = loader.load()

Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 29 0 (offset 0)
Ignoring wrong pointing object 35 0 (offset 0)
Ignoring wrong pointing object 43 0 (offset 0)
Ignoring wrong pointing object 112 0 (offset 0)
Ignoring wrong pointing object 137 0 (offset 0)
Ignoring wrong pointing object 148 0 (offset 0)


The document loader loads each PDF page as a separate 'document'.
This is partly for technical reasons because that is the way PDFs are structured.
But we would want to split our documents into smaller chunks anyway.
We can check how long our documents are. 
First, we define a function for this:


In [10]:
import statistics
def average_length(documents):
    return statistics.fmean([len(doc.page_content) for doc in documents])

Now, we can use this function on our documents:

In [11]:
print(f'Number of documents: {len(docs)}, average document length: {int(average_length(docs))}')
print('Maximum document length: ', max([len(doc.page_content) for doc in docs]))

Number of documents: 213, average document length: 2199
Maximum document length:  9839


We can examine one of the documents:

In [12]:
print(docs[0])

page_content='Pr
ogramming Languages and Law
A Research Agenda
James Grimmelmann
james.grimmelmann@cornell.edu
Cornell University
Law School and Cornell Tech
New York City, NY, USA
ABSTRACT
If code is law, then the language of law is a programming lan-
guage.Lawyersandlegalscholarscanlearnaboutlawbystudying
programming-language theory, and programming-language tools
can be usefully applied to legal problems. This article surveys the
history of research into programming languages and law and pre-
sents ten promising avenues for future efforts. Its goals are to ex-
plain how the combination of programming languages and law is
distinctive within the broader field of computer science and law,
and to demonstrate with concrete examples the remarkable power
of programming-language concepts in this new domain.
CCS CONCEPTS
•Software and its engineering →General programming lan-
guages; Domainspecificlanguages ;•Socialandprofessional
topics →Computing / technology policy.
KEYWORDS
programming l

## Splitting the Documents
Since we are only using PDFs with quite short pages, we could use them as they are.
Other, longer documents, for example long text documents or web pages, might need to be split into chunks.
We can use a text splitter from LangChain to split documents.


In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 5000, #  or less, like 700 for models with smaller context windows
    chunk_overlap  = 100,
)
docs = text_splitter.split_documents(docs)

```{admonition} Text  Splitter Arguments
These are the arguments to the text splitter:
- `chunk_size`: the maximum number of characters in each chunk. By default, this splitter measures length in characters, not in tokens or words.
- `chunk_overlap`: the maximum number of characters that are included in both chunks where the text is split.
```
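
The idea of chunks with overlap can be shown with a much simpler splitter. The sketch below cuts the text at fixed positions and counts characters with `len`. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which tries to split at paragraph and sentence boundaries first. The function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Each chunk starts chunk_overlap characters before the previous one ended.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk reached the end of the text
    return chunks


chunks = split_with_overlap('abcdefghijklmnopqrstuvwxyz', chunk_size=10, chunk_overlap=3)
print(chunks)
# prints ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk are repeated at the start of the next one, so text close to a split point appears in both chunks.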

We can check if the average and maximum document lengths have changed:

In [14]:
print(f'Number of documents: {len(docs)}, average document length: {int(average_length(docs))}')
print('Maximum document length: ', max([len(doc.page_content) for doc in docs]))

Number of documents: 226, average document length: 2075
Maximum document length:  4991

We can also embed one of the chunks to check the size of the embedding vectors:


In [15]:
import numpy as np
sample_embedding = np.array(huggingface_embeddings.embed_query(docs[0].page_content))
#print('Sample embedding of a document chunk: ', sample_embedding)
print('Size of the embedding: ', sample_embedding.shape)

Size of the embedding:  (1024,)


## The Document Index
Next, we make a search index for our documents.
We will use this index for the retrieval part of 'Retrieval-Augmented Generation'.
We use the open-source library [FAISS](https://github.com/facebookresearch/faiss)
(Facebook AI Similarity Search) through LangChain.

In [16]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, huggingface_embeddings)

FAISS can find documents that match a search query:

In [17]:
query = 'what are the main problems with bitcoin?'
relevant_documents = vectorstore.similarity_search(query)
print(f'Number of documents found: {len(relevant_documents)}')

Number of documents found: 4


We can display the first document:

In [18]:
print(relevant_documents[0].page_content)

Yale Information Society Project 8 8 problems. Thefts, bugs, and other problems can be undone if detected in time. Cryptocurrencies lack this critical feature. This is why cryptocurrency thefts, as a fraction of the available currency, are orders of magnitude more common and severe than thefts in the normal financial system.  The largest significant electronic bank heist, targeting the Bank of Bangladesh, managed to steal roughly $100 million.8 Cryptocurrency hacks of similar magnitude are almost a monthly occurrence; indeed, in the largest cryptocurrency hack on record, of Axie Infinity’s “Ronin Bridge,” hackers stole over $600 million.9 This ease of theft is inherent in the very nature of cryptocurrency. Stealing $10 million in physical cash requires that someone break into a secure location and move 100 kilograms of physical paper. Stealing $10 million in a traditional bank transfer requires both breaking into the bank’s computer and also quickly moving the money through a series of

For our RAG application we need to access the search engine through an interface called a retriever:

In [19]:
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})

```{admonition} Retriever Arguments
These are the arguments to the retriever:
- `k`: the number of documents to return (k-nearest-neighbor search)
```
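
What a k-nearest-neighbor search does can be sketched with a brute-force search over a handful of vectors. The vectors below are made up, Euclidean distance is used for illustration, and the function name `top_k` is hypothetical. A library like FAISS is built to do this efficiently for large collections.

```python
def top_k(query_vector, document_vectors, k):
    """Return the indices of the k vectors closest to the query (brute force)."""
    def squared_distance(vector):
        return sum((q - v) ** 2 for q, v in zip(query_vector, vector))

    ranked = sorted(range(len(document_vectors)),
                    key=lambda i: squared_distance(document_vectors[i]))
    return ranked[:k]


# Hypothetical 2-dimensional embeddings; the real embeddings have 1024 dimensions
document_vectors = [[0.0, 1.0], [1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
query_vector = [0.9, 0.5]
print(top_k(query_vector, document_vectors, k=3))
# prints [2, 3, 1]
```

The result lists the positions of the three closest documents, with the best match first.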

## Making a Prompt
We can use a *prompt* to tell the language model how to answer.
The prompt should contain a few short, helpful instructions.
In addition, we provide placeholders for the context and the question.
LangChain replaces these with the actual context and question when we execute a query.


In [20]:
from langchain.prompts import PromptTemplate

prompt_template = '''You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Keep the answer concise.
Context: {context}

Question: {input}

Answer:
'''

prompt = PromptTemplate(template=prompt_template,
                        input_variables=['context', 'input'])
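
Filling in the placeholders works much like Python's own string formatting. The following sketch uses `str.format` directly on a shortened template with made-up values. In the real chain, the context comes from the retriever and the question from the user.

```python
# Shortened version of the prompt template, with the same two placeholders
template = 'Context: {context}\n\nQuestion: {input}\n\nAnswer:'

# Hypothetical values for illustration
filled = template.format(
    context='Bitcoin transactions cannot be reversed.',
    input='Can bitcoin transactions be reversed?',
)
print(filled)
```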

## Making the «Chatbot»
Now we can use the function `create_retrieval_chain` from LangChain to make a chain for answering questions, a «chatbot».


In [21]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)

## Asking the «Chatbot»
Now, we can send our query to the chatbot.


In [22]:
result = rag_chain.invoke({'input': query})

In [23]:
print(result['answer'])

You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Keep the answer concise.
Context: Yale Information Society Project 8 8 problems. Thefts, bugs, and other problems can be undone if detected in time. Cryptocurrencies lack this critical feature. This is why cryptocurrency thefts, as a fraction of the available currency, are orders of magnitude more common and severe than thefts in the normal financial system.  The largest significant electronic bank heist, targeting the Bank of Bangladesh, managed to steal roughly $100 million.8 Cryptocurrency hacks of similar magnitude are almost a monthly occurrence; indeed, in the largest cryptocurrency hack on record, of Axie Infinity’s “Ronin Bridge,” hackers stole over $600 million.9 This ease of theft is inherent in the very nature of cryptocurrency. Stealing $10 million in physical cash requires that someone break int

This answer contains information about transaction fees from the context.
This information wasn’t in the previous answer, when we queried only the language model without Retrieval-Augmented Generation.
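
The printed output begins with the prompt itself, so the pipeline apparently returns the prompt together with the generated text. If we only want the generated part, one option is to cut the string at the last `Answer:` marker from our prompt template. This is a minimal sketch with a made-up output string. The helper name `extract_answer` is hypothetical and not part of LangChain, and it will cut in the wrong place if the generated text itself contains `Answer:`.

```python
def extract_answer(full_output, marker='Answer:'):
    """Return the text after the last occurrence of the marker."""
    _, found, answer = full_output.rpartition(marker)
    # If the marker is missing, return the whole output unchanged
    return answer.strip() if found else full_output.strip()


# Made-up example of an output that repeats the prompt before the answer
full_output = 'Context: ...\n\nQuestion: ...\n\nAnswer:\nTransactions cannot be undone.'
print(extract_answer(full_output))
```

With the chain from this chapter, the function could be applied to `result['answer']`.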