Using the Jina-embeddings-v2-base-en model in a Haystack RAG pipeline for legal document analysis

One foggy day in October 2023, I was narrowly excused from jury duty. I had mixed feelings about it, since it actually seemed like a pretty interesting case (Google v. Sonos). A few months later, I idly wondered how the proceedings turned out. I could just read the news, but what's the fun in that? Let's see how AI can solve this problem.

Jina.ai recently released jina-embeddings-v2-base-en. It's an open-source text embedding model that accepts inputs of up to 8192 tokens. That lets us split text into larger chunks, which helps when working with longer documents. One of the use cases this model is especially suited for is legal document analysis.

In this demo, we'll build a RAG pipeline to discover the outcome of the Google v. Sonos case, using the following technologies:

the jina-embeddings-v2-base-en model
Haystack, the open source LLM orchestration framework
Chroma to store our vector embeddings, via the Chroma Document Store Haystack integration
the open source Mistral 7B Instruct LLM

Prerequisites:

You need a Jina AI API key (a free one is available from Jina AI).
You also need a Hugging Face access token.

First, install all our required dependencies.
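
As a rough sketch, the install step could be driven from Python like this. The package list is an assumption based on the technologies named above (Haystack, the Jina and Chroma integrations, a PDF reader, and the Hugging Face client); adjust or pin versions as needed.

```python
import subprocess
import sys

# Assumed package names for this demo, not an authoritative requirements list.
PACKAGES = ("haystack-ai", "jina-haystack", "chroma-haystack", "pypdf", "huggingface_hub")


def pip_install_command(packages=PACKAGES):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", "--quiet", *packages]


def install_dependencies(packages=PACKAGES):
    """Run pip; raises subprocess.CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(packages))
```

Calling `install_dependencies()` once is enough. In a notebook, `%pip install` with the same package names does the same job.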

Then input our credentials.
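
One way to do this is to prompt for each key and keep it in an environment variable. This is a minimal sketch: the variable names `JINA_API_KEY` and `HF_API_TOKEN` are assumed to be the ones the Jina and Hugging Face components look up by default, and `ensure_credentials` is a helper name made up for this example.

```python
import os
from getpass import getpass

# Assumed environment variable names; change them if your setup differs.
REQUIRED_KEYS = ("JINA_API_KEY", "HF_API_TOKEN")


def ensure_credentials(names=REQUIRED_KEYS, prompt=getpass):
    """Prompt for each credential that is not already set in the environment."""
    for name in names:
        if not os.environ.get(name):
            # getpass hides the typed value so it never shows up in cell output.
            os.environ[name] = prompt(f"Enter {name}: ")
```

Call `ensure_credentials()` before building the pipelines. Keys that are already exported in your shell are left untouched.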

Build an Indexing Pipeline

At a high level, the LinkContentFetcher pulls the court document from its URL. Then we convert it from a PDF into a Document object that Haystack can understand.

We preprocess it by removing extra whitespace and repeated substrings. Then we split it into chunks, generate embeddings, and write these embeddings into the ChromaDocumentStore.
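
The steps above could be wired together roughly as follows. Treat this as a sketch under assumptions rather than the exact notebook code: the component names, the 500-word chunk size, and the `DuplicatePolicy.SKIP` setting are illustrative choices, and it assumes the `haystack-ai`, `jina-haystack`, `chroma-haystack`, and `pypdf` packages are installed.

```python
# Connections in the order described above:
# fetch -> convert PDF -> clean -> split -> embed -> write.
INDEXING_CONNECTIONS = [
    ("fetcher.streams", "converter.sources"),
    ("converter.documents", "cleaner.documents"),
    ("cleaner.documents", "splitter.documents"),
    ("splitter.documents", "embedder.documents"),
    ("embedder.documents", "writer.documents"),
]


def build_indexing_pipeline(document_store, split_length=500):
    """Assemble the indexing pipeline around an existing document store."""
    # Haystack imports live inside the function so this file can be
    # imported before the packages are installed.
    from haystack import Pipeline
    from haystack.components.converters import PyPDFToDocument
    from haystack.components.fetchers import LinkContentFetcher
    from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.types import DuplicatePolicy
    from haystack_integrations.components.embedders.jina import JinaDocumentEmbedder

    pipeline = Pipeline()
    pipeline.add_component("fetcher", LinkContentFetcher())
    pipeline.add_component("converter", PyPDFToDocument())
    # Repeated substrings are typically page headers and footers.
    pipeline.add_component("cleaner", DocumentCleaner(remove_repeated_substrings=True))
    # Assumed chunk size: 500 words stays well under the model's 8192-token limit.
    pipeline.add_component(
        "splitter", DocumentSplitter(split_by="word", split_length=split_length)
    )
    pipeline.add_component(
        "embedder", JinaDocumentEmbedder(model="jina-embeddings-v2-base-en")
    )
    # SKIP avoids duplicate-document errors when the pipeline is re-run.
    pipeline.add_component(
        "writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
    )
    for sender, receiver in INDEXING_CONNECTIONS:
        pipeline.connect(sender, receiver)
    return pipeline
```

To use it, create a `ChromaDocumentStore()` (from `haystack_integrations.document_stores.chroma`), pass it to `build_indexing_pipeline`, and call `run(data={"fetcher": {"urls": [...]}})` on the result with the URL of the PDF you want to index.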

Query pipeline

Now the real fun begins. Let's create a query pipeline so we can actually start asking questions. We write a prompt that lets us pass our documents to the Mistral-7B LLM. Then we initialize the LLM via the HuggingFaceAPIGenerator.

To use this model, you need to accept the conditions here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

In Haystack 2.0, retrievers are tightly coupled to DocumentStores. If we pass in the retriever we initialized earlier, this pipeline can search the embeddings we generated and pass the matching documents to the LLM.
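
Here is a sketch of how the query side could look. Several things are assumptions: the prompt wording, the component names, and the model id, which is taken from the model page linked above. Instead of reusing a retriever object, this version builds a `ChromaEmbeddingRetriever` on the same document store, which gives it access to the same embeddings.

```python
# Jinja template that places the retrieved chunks ahead of the question.
PROMPT_TEMPLATE = """Answer the question based on the content in the documents.
If you can't answer based on the given documents, say so.

Documents:
{% for doc in documents %}
  {{ doc.content }}
{% endfor %}

Question: {{ question }}
"""

# Assumed model id, copied from the Hugging Face link above.
LLM_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"

QUERY_CONNECTIONS = [
    ("text_embedder.embedding", "retriever.query_embedding"),
    ("retriever.documents", "prompt_builder.documents"),
    ("prompt_builder.prompt", "generator.prompt"),
]


def build_query_pipeline(document_store):
    """Assemble the query pipeline on top of an already indexed document store."""
    # Imports are local so this file loads even without the integrations installed.
    from haystack import Pipeline
    from haystack.components.builders import PromptBuilder
    from haystack.components.generators import HuggingFaceAPIGenerator
    from haystack_integrations.components.embedders.jina import JinaTextEmbedder
    from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever

    pipeline = Pipeline()
    # The question must be embedded with the same model as the documents.
    pipeline.add_component(
        "text_embedder", JinaTextEmbedder(model="jina-embeddings-v2-base-en")
    )
    pipeline.add_component(
        "retriever", ChromaEmbeddingRetriever(document_store=document_store)
    )
    pipeline.add_component("prompt_builder", PromptBuilder(template=PROMPT_TEMPLATE))
    pipeline.add_component(
        "generator",
        HuggingFaceAPIGenerator(
            api_type="serverless_inference_api",
            api_params={"model": LLM_MODEL},
        ),
    )
    for sender, receiver in QUERY_CONNECTIONS:
        pipeline.connect(sender, receiver)
    return pipeline
```

Pass in the same document store that the indexing pipeline wrote to. A fresh store would leave the retriever with nothing to search.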

Time to ask a question!
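
A small helper like the one below can run a question through the query pipeline. It is a sketch that assumes the pipeline has components named `text_embedder`, `retriever`, `prompt_builder`, and `generator`; the `top_k` and `max_new_tokens` defaults are illustrative, not tuned values.

```python
def build_query_inputs(question, top_k=3, max_new_tokens=350):
    """Route the question to both the embedder and the prompt builder."""
    return {
        "text_embedder": {"text": question},
        "retriever": {"top_k": top_k},
        "prompt_builder": {"question": question},
        "generator": {"generation_kwargs": {"max_new_tokens": max_new_tokens}},
    }


def ask(query_pipeline, question, **kwargs):
    """Run the pipeline and return the first generated reply."""
    result = query_pipeline.run(data=build_query_inputs(question, **kwargs))
    return result["generator"]["replies"][0]
```

For example, `print(ask(query_pipeline, "Summarize what happened in Google v. Sonos"))`. The question goes in twice because the embedder needs it to find relevant chunks and the prompt builder needs it to fill in the template.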

Alternate cases to explore

The indexing pipeline is written so that you can swap in other documents and analyze them. You can try plugging the following URLs (or any PDF written in English) into the indexing pipeline and re-running all the code blocks below it.

Google v. Oracle: https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf
JACK DANIEL’S PROPERTIES, INC. v. VIP PRODUCTS LLC: https://www.supremecourt.gov/opinions/22pdf/22-148_3e04.pdf

Note: if you want to change the prompt template, you'll also need to re-run the code blocks starting where the DocumentStore is defined.
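
Swapping in one of the cases listed above might look like this. The sketch assumes the indexing pipeline's first component is named `fetcher` and accepts a `urls` list; `index_case` is a helper name invented for this example.

```python
# URLs copied from the list above.
ALTERNATE_CASES = {
    "Google v. Oracle": "https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf",
    "Jack Daniel's Properties, Inc. v. VIP Products LLC": (
        "https://www.supremecourt.gov/opinions/22pdf/22-148_3e04.pdf"
    ),
}


def index_case(indexing_pipeline, case_name):
    """Feed one case's PDF URL to the fetcher at the head of the indexing pipeline."""
    url = ALTERNATE_CASES[case_name]
    return indexing_pipeline.run(data={"fetcher": {"urls": [url]}})
```

Starting from a fresh document store before calling `index_case` keeps chunks from different cases from mixing in retrieval.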

Wrapping it up

Thanks for reading! If you're interested in learning more about the technologies used here, check out these blog posts: