Retrieval-Augmented Generation#
Retrieval-Augmented Generation (RAG) is a method for including (parts of) matching documents as context for questions to a Large Language Model (LLM). This can help reduce hallucinations and wrong answers. A system for RAG has two parts: a document database with a search index and a large language model.
When the user asks a question, the question is handled in two stages. First, the question is used as a search query for the document database. The search results are then sent together with the question to the LLM. The LLM is prompted to answer the question based on the context in the search results.
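The two stages can be sketched in a few lines of plain Python. This is only a toy illustration: the word-overlap scoring, the function names `retrieve` and `build_prompt`, and the three example documents are assumptions made for this sketch, not part of the system built in this chapter.

```python
def retrieve(question, documents, k=2):
    """Stage 1: rank documents by how many words they share with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_documents):
    """Stage 2: combine the search results and the question into one prompt."""
    context = "\n".join(context_documents)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


# Hypothetical mini document collection
documents = [
    "Bitcoin can only process a few transactions per second.",
    "FAISS is a library for similarity search.",
    "Bitcoin transactions can become expensive during congestion.",
]
question = "Why are bitcoin transactions slow?"

top_documents = retrieve(question, documents, k=2)
prompt = build_prompt(question, top_documents)
print(prompt)  # this text would be sent to the LLM
```

A real system replaces the word-overlap scoring with a vector search index and sends the prompt to an LLM, as the rest of this chapter shows.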
We will use LangChain, an open-source library for making applications with LLMs. This chapter was inspired by the article Retrieval-Augmented Generation (RAG) with open-source Hugging Face LLMs using LangChain.
Installing Software#
We’ll need to install some libraries first:
!pip install --upgrade sentence-transformers huggingface-hub faiss-cpu sentencepiece protobuf langchain langchain-community pypdf
The Language Model#
We’ll use models from HuggingFace, a website that has tools and models for machine learning. We’ll use the open-source LLM mistralai/Mistral-Nemo-Instruct-2407. This model has 12 billion parameters. For comparison, one of the largest LLMs at the time of writing is Llama 3.1, with 405 billion parameters. Still, Mistral-Nemo-Instruct is around 25 GB, which makes it quite a large model. To run it, we need a GPU with at least 25 GB of memory. It can also be run without a GPU, but that will be much slower.
Model Storage Location#
We must download the model we want to use. Because of the requirements mentioned above, we run our program on the Fox high-performance computer at UiO. We must set the location where our program should store the models that we download from HuggingFace:
%env HF_HOME=/fp/projects01/ec367/huggingface/cache/
Note
If you run the program locally on your own computer, you might not need to set HF_HOME.
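The `%env` line only works in a notebook. In a plain Python script, the same variable can be set through `os.environ`. The sketch below assumes a hypothetical cache directory in the home folder; replace it with a location you can write to.

```python
import os

# Hypothetical cache location; replace with a directory you can write to.
cache_dir = os.path.join(os.path.expanduser("~"), "huggingface-cache")

# Set the variable before importing any HuggingFace library,
# because the cache location is read when the library is imported.
os.environ["HF_HOME"] = cache_dir

print(os.environ["HF_HOME"])
```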
HuggingFace Login#
Even though the model Mistral-Nemo-Instruct-2407 is open source, we must log in to HuggingFace to download it.
from huggingface_hub import login
login()
The Model#
Now, we are ready to download and use the model.
To use the model, we create a pipeline.
A pipeline can consist of several processing steps, but in this case, we only need one step.
We can use the method HuggingFacePipeline.from_model_id(), which automatically downloads the specified model from HuggingFace.
from langchain_community.llms import HuggingFacePipeline
llm = HuggingFacePipeline.from_model_id(
    model_id='mistralai/Mistral-Nemo-Instruct-2407',
    task='text-generation',
    device=0,
    pipeline_kwargs={
        'max_new_tokens': 300,
        'temperature': 0.3,
        'num_beams': 4,
        'do_sample': True
    }
)
Pipeline Arguments
We give some arguments to the pipeline:
‘model_id’: the name of the model on HuggingFace
‘task’: the task you want to use the model for. Other alternatives are translation and summarization.
‘device’: the GPU hardware device to use. If we don’t specify a device, no GPU will be used.
‘pipeline_kwargs’: additional parameters that are passed to the model:
‘max_new_tokens’: maximum length of the generated text
‘do_sample’: by default, the most likely next word is chosen. This makes the output deterministic. We can introduce some randomness by sampling among the most likely words instead.
‘temperature’: the temperature controls the statistical distribution of the next word and is usually between 0 and 1. A low temperature increases the probability of common words. A high temperature increases the probability of outputting a rare word. Model makers often recommend a temperature setting, which we can use as a starting point.
‘num_beams’: by default the model works with a single sequence of tokens/words. With beam search, the program builds multiple sequences at the same time, and then selects the best one in the end.
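The effect of the temperature can be illustrated with a small calculation. The sketch below assumes three made-up scores for candidate next words and converts them to probabilities with a softmax that divides the scores by the temperature. The function name `softmax_with_temperature` and the scores are invented for this example; the real model does this over its whole vocabulary.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities, scaled by the temperature."""
    scaled = [score / temperature for score in scores]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next words
scores = [2.0, 1.0, 0.1]

low = softmax_with_temperature(scores, 0.3)
high = softmax_with_temperature(scores, 1.0)

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the lower temperature, the highest-scoring word receives a larger share of the probability, and the lowest-scoring word becomes less likely to be sampled.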
Tip
If you’re working on a computer with less memory, you might need to try a smaller model.
You can try for example mistralai/Mistral-7B-Instruct-v0.3 or meta-llama/Llama-3.2-1B-Instruct. The latter has only 1 billion parameters, and might be possible to use on a laptop, depending on how much memory it has.
If you use models from Meta, you might need to set pad_token_id:
llm.pipeline.tokenizer.pad_token_id = llm.pipeline.tokenizer.eos_token_id
Using the Language Model#
Now, the language model is ready to use. Let’s try to use only the language model without RAG. We can send it a query:
query = 'what are the main problems with bitcoin?'
output = llm.invoke(query)
print(output)
what are the main problems with bitcoin? Bitcoin has several challenges and criticisms, including:
1. **Volatility**: Bitcoin's price is highly volatile, making it less suitable as a medium of exchange for everyday transactions. Its value can fluctuate significantly over short periods, which can make it difficult to use for purchasing goods and services.
2. **Scalability**: Bitcoin's network can only process a limited number of transactions per second (around 7), which can lead to slower transaction times and higher fees during periods of heavy usage. This is often referred to as the "block size debate."
3. **Energy Consumption**: Bitcoin's proof-of-work (PoW) consensus mechanism requires a large amount of energy to secure the network. This has led to concerns about its environmental impact. Some estimates suggest that Bitcoin's energy consumption is comparable to that of entire countries.
4. **Regulation**: Bitcoin's decentralized nature makes it difficult for governments to regulate. While this is often seen as a strength, it also means that there's no central authority to protect users if something goes wrong. This has led to concerns about consumer protection and money laundering.
5. **Security**: While Bitcoin's blockchain is secure, individual users' bitcoins can be stolen or lost if they don't properly secure their private keys. There have also been instances of exchanges being hacked, leading to significant losses for users.
6. **Adoption**: For Bitcoin to become a widely used currency, it needs to be adopted by a large number of people
This answer was generated based only on the information contained in the language model. To improve the accuracy of the answer, we can provide the language model with additional context for our query. To do that, we must load our document collection.
The Vectorizer#
Text must be vectorized before it can be processed. Our HuggingFace pipeline does that automatically for the large language model, but we must create a separate vectorizer for the search index of our document database. As the vectorizer, we use a text embedding model from HuggingFace. Again, the HuggingFace library will automatically download the model.
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
huggingface_embeddings = HuggingFaceBgeEmbeddings(
    model_name='BAAI/bge-m3',
    model_kwargs={'device': 'cuda:0'},
    # or: model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)
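The option normalize_embeddings scales every vector to length 1. One reason this is convenient is that the dot product of two unit-length vectors equals their cosine similarity. The sketch below checks this with two made-up two-dimensional vectors; real embeddings have many more dimensions, and the helper functions are written only for this illustration.

```python
import math


def normalize(vector):
    """Scale a vector so that its length becomes 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical embeddings of a query and a document
query_vector = [3.0, 4.0]
document_vector = [4.0, 3.0]

unit_query = normalize(query_vector)
unit_document = normalize(document_vector)

# Both lines print the same similarity (up to rounding)
print(dot(unit_query, unit_document))
print(cosine_similarity(query_vector, document_vector))
```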
Loading the Documents#
We use a document loader from the LangChain library to load all the PDFs in the folder called documents.
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader('./documents/')
docs = loader.load()
The document loader loads each PDF page as a separate ‘document’. This is partly for technical reasons because that is the way PDFs are structured. But we would want to split our documents into smaller chunks anyway. We can check how long our documents are. First, we define a function for this:
import statistics
def average_length(documents):
    # Mean number of characters per document
    return statistics.fmean([len(doc.page_content) for doc in documents])
Now, we can use this function on our documents:
print(f'Number of documents: {len(docs)}, average document length: {int(average_length(docs))}')
print('Maximum document length: ', max([len(doc.page_content) for doc in docs]))
Number of documents: 213, average document length: 2199
Maximum document length: 9839
We can examine one of the documents:
print(docs[0])
page_content='Pr
ogramming Languages and Law
A Research Agenda
James Grimmelmann
james.grimmelmann@cornell.edu
Cornell University
Law School and Cornell Tech
New York City, NY, USA
ABSTRACT
If code is law, then the language of law is a programming lan-
guage.Lawyersandlegalscholarscanlearnaboutlawbystudying
programming-language theory, and programming-language tools
can be usefully applied to legal problems. This article surveys the
history of research into programming languages and law and pre-
sents ten promising avenues for future efforts. Its goals are to ex-
plain how the combination of programming languages and law is
distinctive within the broader field of computer science and law,
and to demonstrate with concrete examples the remarkable power
of programming-language concepts in this new domain.
CCS CONCEPTS
•Software and its engineering →General programming lan-
guages; Domainspecificlanguages ;•Socialandprofessional
topics →Computing / technology policy.
KEYWORDS
programming languages, law
ACM Reference Format:
James Grimmelmann. 2022. Programming Languages and Law: A Research
Agenda.In Proceedings of the 2022 Symposium on Computer Science and Law
(CSLAW ’22), November 1–2, 2022, Washington, DC, USA. ACM, New York,
NY, USA, 11 pages. https://doi.org/10.1145/3511265.3550447
1 INTRODUCTION
Computer science contains multitudes. It ranges from pure math-
ematics to quantum physics, from the heights of theory to the
depths of systems engineering.
Some of its subfields speak to urgent problems law faces. Crim-
inal procedure [60] and national security law [27] cannot regulate
the world as it exists without taking account of whether, when,
and how data can be kept private. Other subfields provide new
perspectives on law. The “law as data” movement [9, 75] uses com-
putational methods like topic modeling and decision-tree learning
to analyze legal datasets in subjects as diverse as trademark in-
fringement, [17] judicial rhetoric, [74] and the network structure
of the United States Code. [59]
This
work is licensed under a Creative Commons Attribu-
tion International 4.0 License.
CSLAW ’22, November 1–2, 2022, Washington, DC, USA
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9234-1/22/11.
https://doi.org/10.1145/3511265.3550447Iwouldliketoarguethatthecomputer-sciencefieldof program-
ming-language (PL) theory speaks to law in both of these senses.
Notonlyisitindispensableforansweringcertainkindsofpractical
legal questions, but its application can “illuminate the entire law.”
[36] Just as microeconomics provides a new and illuminating way
to think about rights and remedies, and just as corpus linguistics
provides a new and illuminating way to think about legal inter-
pretation, PL theory provides new and illuminating ways to think
about familiar issues from all across the law.
Consider, for example, the M++ project to formalize French tax
law (described in more detail in Section 2.2). M++ is distinguished
from the kind of routine systems engineering that tax authorities
around the world perform on their computer systems by its rig-
orous use of PL theory to design a new programming language
for describing the provisions of the French tax code. On the one
hand, M++ is useful because it is a clean, modern language that
is amenable to correctness proofs, improving the reliability of tax
computations. On the other hand, M++ programs mirror the struc-
ture of the tax laws they formalize. Instead of treating the rules of
tax law as an ad hocdesign document, M++ treats the tax code as
though they were itself a program, one meant to be “executed” by
lawyers and accountants. The goal is not just to do the same thing
as the tax code, but to do it in the same way, section by section,
clause by clause.
To generalize, PL theory has something unique to offer law be-
cause there is a crucial similarity between lawyers and program-
mers: the way they use words. Computer science and law are both
linguistic professions. Programmers and lawyers use language to
create,manipulate,andinterpretcomplexabstractions.Aprogram-
mer who uses the right words in the right way makes a computer
do something. A lawyer who uses the right words in the right way
changes people’s rights and obligations. There is a nearly exact
analogy between the text of a program and the text of a law.
Thisparallel createsaunique opportunity for PLtheory as a dis-
cipline to contribute to law. Some CS subfields, such as artificial in-
telligence (AI), deal with legal structures. Others, such as natural
language processing (NLP), deal with legal language. But only PL
theory provides a principled, systematic framework to analyze le-
gal structures in terms of the linguistic expressions lawyers use to
create them. PL abstractions have an unmatched expressive power
in capturing the linguistic abstractions of law.
Over a decade ago, Paul Ohm proposed a new research agenda
for“computerprogrammingandlaw,”describingindetailthevalue
of executable code for legal scholarship: by gathering and analyz-
ing information about the law more efficiently, by communicating
155
' metadata={'source': 'documents/Grimmelmann - 2022 - Programming Languages and Law A Research Agenda.pdf', 'page': 0}
Splitting the Documents#
Since most of our PDF pages are quite short, we could use them as they are. Longer documents, such as web pages, may need to be split into chunks. We can use a text splitter from LangChain to split documents.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,  # or less, like 700 for models with smaller context windows
    chunk_overlap=100,
)
docs = text_splitter.split_documents(docs)
Text Splitter Arguments
These are the arguments to the text splitter:
‘chunk_size’: the maximum length of each chunk. By default, the splitter measures length in characters, not in tokens or words.
‘chunk_overlap’: the maximum number of characters that two neighbouring chunks share where the text is split.
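How the two arguments interact can be shown with a simplified splitter. For simplicity, this sketch counts characters and cuts at fixed positions, whereas the LangChain splitter prefers to cut at separators such as paragraph breaks. The function `split_with_overlap` is written for this illustration only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into chunks of at most chunk_size characters.

    Neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a split point still appears in full in at least one chunk when the overlap is large enough.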
We can check whether the average and maximum document lengths have changed:
print(f'Number of documents: {len(docs)}, average document length: {int(average_length(docs))}')
print('Maximum document length: ', max([len(doc.page_content) for doc in docs]))
Number of documents: 226, average document length: 2075
Maximum document length: 4991
The Document Index#
Next, we make a search index for our documents. We will use this index for the retrieval part of ‘Retrieval-Augmented Generation’. We use the open-source library FAISS (Facebook AI Similarity Search) through LangChain.
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(docs, huggingface_embeddings)
FAISS can find documents that match a search query:
query = 'what are the main problems with bitcoin?'
relevant_documents = vectorstore.similarity_search(query)
print(f'Number of documents found: {len(relevant_documents)}')
Number of documents found: 4
We can display the first document:
print(relevant_documents[0].page_content)
Yale Information Society Project 8 8 problems. Thefts, bugs, and other problems can be undone if detected in time. Cryptocurrencies lack this critical feature. This is why cryptocurrency thefts, as a fraction of the available currency, are orders of magnitude more common and severe than thefts in the normal financial system. The largest significant electronic bank heist, targeting the Bank of Bangladesh, managed to steal roughly $100 million.8 Cryptocurrency hacks of similar magnitude are almost a monthly occurrence; indeed, in the largest cryptocurrency hack on record, of Axie Infinity’s “Ronin Bridge,” hackers stole over $600 million.9 This ease of theft is inherent in the very nature of cryptocurrency. Stealing $10 million in physical cash requires that someone break into a secure location and move 100 kilograms of physical paper. Stealing $10 million in a traditional bank transfer requires both breaking into the bank’s computer and also quickly moving the money through a series of accounts to hide its origin, such that the victim’s bank cannot undo the theft. Stealing $10 million in cryptocurrency controlled by a computer, on the other hand, requires compromising the computer but—critically—the victim can’t recover the money.10 This creates significant friction in buying cryptocurrencies. Someone who wishes to sell cryptocurrencies cannot accept a conventional electronic payment. Instead they either have to have an established relationship with the buyer (to know the buyer poses an acceptable credit risk), accept cash, or accept an electronic payment and then wait for a few days.11 This drives up the price of buying cryptocurrency as all three options (validating credit risk, accepting cache, or waiting) incur additional expenses not present in other payment systems. 
Furthermore, the actual cryptocurrency transactions themselves can be surprisingly expensive.12 In order to act as a limit on spam transactions, where someone creates a huge number of useless transitions that need to be validated, slowing down the transaction verification process, any given cryptocurrency allows only a limited number of transactions per block in the blockchain. When the desired number of transactions is below this threshold, transactions are nearly free. But if the desired transaction rate exceeds this threshold, then prices can spiral as a fee auction is used to select which transactions to process due to the inelastic supply of available slots.
For our RAG application we need to access the search engine through an interface called a retriever:
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
Retriever Arguments
These are the arguments to the retriever:
‘k’: the number of documents to return (kNN search)
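What the setting k means can be sketched as a brute-force search: score every document vector against the query vector and keep the k best. The vectors and the function `top_k` below are made up for this example, and FAISS uses optimized index structures rather than this simple loop.

```python
def top_k(query_vector, document_vectors, k):
    """Return the indices of the k document vectors with the highest dot product."""
    scores = [
        sum(q * d for q, d in zip(query_vector, vector))
        for vector in document_vectors
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical unit-length embeddings of four documents
document_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.8, 0.6]]
query_vector = [1.0, 0.0]

nearest = top_k(query_vector, document_vectors, k=3)
print(nearest)  # [0, 3, 1]
```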
Making a Prompt#
We can use a prompt to tell the language model how to answer. The prompt should contain a few short, helpful instructions. In addition, we provide placeholders for the context and the question. LangChain replaces these with the actual context and question when we execute a query.
from langchain.prompts import PromptTemplate
prompt_template = '''You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Keep the answer concise.
Context: {context}
Question: {input}
Answer:
'''
prompt = PromptTemplate(template=prompt_template,
input_variables=['context', 'input'])
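Filling the placeholders works much like Python's own string formatting. The sketch below uses `str.format` on a shortened template to show roughly what happens when a query is executed. The context sentence is a made-up example, and in the real chain LangChain fills in the retrieved documents for us.

```python
# Shortened version of the template, with the same two placeholders
template = "Context: {context}\nQuestion: {input}\nAnswer:"

filled = template.format(
    context="Bitcoin can process three to seven transactions per second.",
    input="How many transactions per second can Bitcoin process?",
)
print(filled)
```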
Making the «Chatbot»#
Now we can use the function create_retrieval_chain from LangChain to make a chain for answering questions, a «chatbot».
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, combine_docs_chain)
Asking the «Chatbot»#
Now, we can send our query to the chatbot.
result = rag_chain.invoke({'input': query})
print(result['answer'])
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Keep the answer concise.
Context: Yale Information Society Project 8 8 problems. Thefts, bugs, and other problems can be undone if detected in time. Cryptocurrencies lack this critical feature. This is why cryptocurrency thefts, as a fraction of the available currency, are orders of magnitude more common and severe than thefts in the normal financial system. The largest significant electronic bank heist, targeting the Bank of Bangladesh, managed to steal roughly $100 million.8 Cryptocurrency hacks of similar magnitude are almost a monthly occurrence; indeed, in the largest cryptocurrency hack on record, of Axie Infinity’s “Ronin Bridge,” hackers stole over $600 million.9 This ease of theft is inherent in the very nature of cryptocurrency. Stealing $10 million in physical cash requires that someone break into a secure location and move 100 kilograms of physical paper. Stealing $10 million in a traditional bank transfer requires both breaking into the bank’s computer and also quickly moving the money through a series of accounts to hide its origin, such that the victim’s bank cannot undo the theft. Stealing $10 million in cryptocurrency controlled by a computer, on the other hand, requires compromising the computer but—critically—the victim can’t recover the money.10 This creates significant friction in buying cryptocurrencies. Someone who wishes to sell cryptocurrencies cannot accept a conventional electronic payment. Instead they either have to have an established relationship with the buyer (to know the buyer poses an acceptable credit risk), accept cash, or accept an electronic payment and then wait for a few days.11 This drives up the price of buying cryptocurrency as all three options (validating credit risk, accepting cache, or waiting) incur additional expenses not present in other payment systems. 
Furthermore, the actual cryptocurrency transactions themselves can be surprisingly expensive.12 In order to act as a limit on spam transactions, where someone creates a huge number of useless transitions that need to be validated, slowing down the transaction verification process, any given cryptocurrency allows only a limited number of transactions per block in the blockchain. When the desired number of transactions is below this threshold, transactions are nearly free. But if the desired transaction rate exceeds this threshold, then prices can spiral as a fee auction is used to select which transactions to process due to the inelastic supply of available slots.
JUNE 2018 | VOL. 61 | NO. 6 | COMMUNICATIONS OF THE ACM 23viewpoints
This was not because our Bitcoin
was stolen from a honeypot, rather the
graduate student who created the wallet
maintained a copy and his account was
compromised. If security experts can’t
safely keep cryptocurrencies on an Inter -
net-connected computer, nobody can. If
Bitcoin is the “Internet of money,” what
does it say that it cannot be safely stored
on an Internet connected computer?
Bugs can also naturally cause sig-
nificant damage to cryptocurrency
holdings. Although this potentially can
affect any cryptocurrency, the biggest
danger for bugs arises when cryptocur-
rencies are combined with “smart con-
tracts”—programs that are generally
immutable once deployed and that au-
tomatically execute upon the transfer
of currency. The most successful plat-
form for these is Ethereum, a crypto-
currency that allows writing programs
in a language called Solidity.
Bugs in these smart contracts can
be catastrophic. The first big smart
contract, the DAO or Decentralized Au -
tonomous Organization, sought to cre -
ate a democratic mutual fund where
investors could invest their Ethereum
and then vote on possible investments.
Approximately 10% of all Ethereum
ended up in the DAO before someone
discovered a reentrancy bug that en -
abled the attacker to effectively steal all
the Ethereum. The only reason this bug
and theft did not result in global losses
is that Ethereum developers released a
new version of the system that effective -
ly undid the theft by altering the sup -
posedly immutable blockchain.
Since then there have been other
catastrophic bugs in these smart con-
tracts, the biggest one in the Parity
Ethereum wallet software (see https://
bit.ly/2Fm7je4). The first bug enabled
the mass theft from “multisignature”
wallets, which supposedly required
multiple independent cryptographic
signatures on transfers as a way to pre-
vent theft. Fortunately, that bug caused
limited damage because a good thief
stole most of the money and then re-
turned it to the victims. Yet, the good
news was limited as a subsequent bug
rendered all of the new multisignature
wallets permanently inaccessible, ef-
fectively destroying some $150M in no-
tional value. This buggy code was large-
ly written by Gavin Wood, the creator
of the Solidity programming language and one of the founders of Ethereum.
Again, we have a situation where even
an expert’s efforts fell short.
Individual Economic Risks
Everything about the cryptocurrency
space is full of bubbles. Since all volatile
cryptocurrencies are actually substan -
tially inferior for legal purposes, this im -
plies that the actual value as currency is
effectively $0, so the only store of value
is in other utility for a distributed trust -
less public append-only ledger.
Yet the Bitcoin blockchain, due to
consolidation of mining into a few min -
ing pools, does not actually distribute
trust. Instead the system is effectively
controlled by less than 10 entities self-
selected by their willingness to consume
power and anyone using Bitcoin implic -
itly trusts a majority of these few entities.
Every proof of work blockchain seems to
experience similar consolidation as the
more efficient miners inevitably drive out
less efficient ones. Given the almost trivial
cost of building a three-transactions-per-
second distributed system with identified
and trusted entities using cryptographic
signatures instead of proof of work this
suggests the utility value for these cryp -
tocurrencies is also effectively $0. This
means everyone participating in the
cryptocurrency market is basing the val -
ue only on the price that somebody else
will pay—no different from tulip bulbs or
beanie babies—and are all vulnerable to
substantial and sudden collapse .
But further magnifying the prob-
lem is a large number of scams. There
is a current trend in “Initial Coin Of-
ferings,” mostly consisting of crypto-
graphic tokens implemented on top
of an existing cryptocurrencies such as
Bitcoin or Ethereum. Although claim-
ing to be crowd-sold tokens for pur-
chase of future services, the tradeable
nature of these tokens has resulted in
their acting as unregistered securities in a bubble market. There are also or-
ganized groups conducting pump-and-
dump schemes, complete with fancy
websites, animated advertisements,
and even placing paper advertisements
in BART commuter trains in San Fran-
cisco, CA. This market developed large-
ly in absence of regulation, although
regulators like the U.S. Securities and
Exchange Commission are finally start-
ing to pay attention.
Likewise, not only is a bubble often
a natural Ponzi scheme, there are many
explicit or likely Ponzi schemes. In the
early days of Bitcoin approximately 10%
of all Bitcoin were invested in Bitcoin
Savings and Trust, a Ponzi scheme run
by a pseudonymous individual known
The Death of Cryptocurrency | Nicholas Weaver 9 Bitcoin is particularly limited in this respect. Due to an early decision to limit spam by restricting the block size to just one megabyte, the Bitcoin network can only process somewhere between three and seven transactions per second worldwide. In comparison, the typical load on the VISA network is 1,700 transactions per second, and VISA has tested the system up to 64,000 transactions per second. During times of congestion, this can lead to the price for Bitcoin transactions reaching $50 or more. Other cryptocurrencies may have higher limits, which naturally leads them to be more vulnerable to spam. High congestion fees ensure that Bitcoin transactions can never be used for everyday, low-value payments. It is inconceivable that consumers would be willing to pay an extra $50 at the grocery store because they went shopping on a Saturday or Sunday afternoon. Cryptocurrency advocates will insist that “layer-two solutions” exist for this problem. They will often point to the Bitcoin “Lightning Network,” a protocol implemented on top of the underlying cryptocurrency, as an example of a solution. Unfortunately these don’t solve the fundamental problem of limited transaction capacity. Lightning works by creating a pre-funded payment channel between the user and a central relayer.13 From there the user can issue or receive payments that pass through a chain of relayers to the recipient. Eventually, a user may close the channel and receive the Bitcoin back onto the main blockchain. Thus, the internal payments no longer need to be recorded on the central blockchain. In creating, adding funds, and closing the channel, the user still needs to conduct a normal Bitcoin transaction. The Lightning network’s ability to create or close channels is limited by Bitcoin’s own transaction limitations. 
Therefore, Lightning cannot provide scaling as there is still a substantial limit on the number of channels that can be created, funded, or closed per second. The one example where Bitcoin did scale to a significant number of transactions was in El Salvador, though it scaled, ironically, by not actually using Bitcoin to process payments.14 The dictator of El Salvador, President Nayib Bukele, passed a law mandating that Bitcoin, along with the US dollar, would now be considered official currencies and merchants were
Question: what are the main problems with bitcoin?
Answer:
- High risk of theft due to lack of undo feature
- Expensive transactions, especially during congestion
- Limited transaction capacity (3-7 transactions per second)
- Centralization of mining power in few entities
- High number of scams and Ponzi schemes
- Bugs in smart contracts can lead to significant losses
- Not suitable for everyday, low-value payments due to high fees
This answer contains information about transaction fees from the context. This information wasn’t in the previous answer, when we queried only the language model without Retrieval-Augmented Generation.
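The printed result above repeats the whole prompt, including the retrieved context, before the actual answer. If we only want to show the answer, one option is to cut the text at the last occurrence of the marker Answer: from our prompt template. The helper `extract_answer` and the shortened example text are assumptions for this sketch. The approach fails if the generated answer itself contains the marker.

```python
def extract_answer(generated_text, marker="Answer:"):
    """Return only the text after the last occurrence of the marker."""
    _, found, answer = generated_text.rpartition(marker)
    # If the marker is missing, fall back to the full text
    return answer.strip() if found else generated_text.strip()


# Shortened, made-up stand-in for result['answer']
example_output = (
    "Context: some retrieved text\n"
    "Question: what are the main problems with bitcoin?\n"
    "Answer:\n"
    "- High risk of theft\n"
    "- Expensive transactions"
)
print(extract_answer(example_output))
```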