Notebooks
M
Milvus
Rag With Milvus And Unstructured

Rag With Milvus And Unstructured

image-searchvector-databasesemantic-searchIntegrationmilvusembeddingsunstructured-dataquestion-answeringLLMmilvus-bootcampdeep-learningimage-recognitionimage-classificationaudio-searchPythonragNLP

Build a RAG with Milvus and Unstructured

Unstructured provides a platform and tools to ingest and process unstructured documents for Retrieval Augmented Generation (RAG) and model fine-tuning. It offers both a no-code UI platform and serverless API services, allowing users to process data on Unstructured-hosted compute resources.

In this tutorial, we will use Unstructured to ingest PDF documents and then use Milvus to build a RAG pipeline.

Preparation

Dependencies and Environment

[ ]

Installation Options:

  • For processing all document formats: pip install "unstructured[all-docs]"
  • For specific formats (e.g., PDF): pip install "unstructured[pdf]"
  • For more installation options, see the Unstructured documentation

If you are using Google Colab, to enable dependencies just installed, you may need to restart the runtime (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

We will use OpenAI as the LLM in this example. You should prepare the api key OPENAI_API_KEY as an environment variable.

[2]

Prepare Milvus and OpenAI clients

You can use the Milvus client to create a Milvus collection and insert data into it.

[3]

As for the argument of MilvusClient:

  • Setting the uri as a local file, e.g../milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
  • If you have large scale of data, say more than a million vectors, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server address and port as your uri, e.g.http://localhost:19530. If you enable the authentication feature on Milvus, use "<your_username>:<your_password>" as the token, otherwise don't set the token.
  • If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and Api key in Zilliz Cloud.

Check if the collection already exists and drop it if it does.

[4]

Prepare a OpenAI client to generate embeddings and generate responses.

[5]

Generate a test embedding and print its dimension and first few elements.

[6]
1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]

Create Milvus Collection

We will create a collection with the following schema:

  • id: the primary key, which is a unique identifier for each document.
  • vector: the embedding of the document.
  • text: the text content of the document.
  • metadata: the metadata of the document.

Then we build an AUTOINDEX index on the vector field. And then create the collection.

[7]

Load data from Unstructured

Unstructured provides a flexible and powerful ingestion pipeline to process various file types, including PDF, HTML, and more. We will partition and chunk a local PDF file. And then load the data into Milvus.

[16]

Let's examine the partitioned elements from the PDF file. Each element represents a chunk of content extracted by Unstructured's partitioning process.

[15]
What is Milvus?

Milvus is a high-performance, highly scalable vector database that runs efficiently across a wide range of environments, from a laptop to large-scale distributed systems. It is available as both open-source software and a cloud service.

Insert data into Milvus.

[10]
{'insert_count': 29, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28], 'cost': 0}

Retrieve and Generate Response

Define a function to retrieve relevant documents from Milvus.

[11]

Define a function to generate a response using the retrieved documents in the RAG pipeline.

[12]

Let's test the RAG pipeline with a sample question.

[13]
Question: What is the Advanced Search Algorithms in Milvus?
Answer: The Advanced Search Algorithms in Milvus include a wide range of in-memory and on-disk indexing/search algorithms such as IVF, HNSW, and DiskANN. These algorithms have been deeply optimized, and Milvus delivers 30%-70% better performance compared to popular implementations like FAISS and HNSWLib.