ReadTheDocs Zilliz LangChain
ReadTheDocs Retrieval Augmented Generation (RAG) using Zilliz Free Tier
In this notebook, we use the Milvus documentation pages to create a chatbot about our product. The chatbot follows the RAG steps: retrieve chunks of data using semantic vector search, then feed the question plus the retrieved context as a prompt to an LLM to generate an answer.
Many RAG demos use OpenAI for the Embedding Model and ChatGPT for the Generative AI model. In this notebook, we will demo a fully open source RAG stack.
Using open-source Q&A with retrieval saves money, since retrieval, evaluation, and development iterations are all free calls against our own data. We only make a paid call to OpenAI once, for the final chat generation step.
Let's get started!
Download Milvus documentation.
The data we'll use is our own product documentation web pages. ReadTheDocs is an open-source, free documentation hosting platform, where documentation is written with the Sphinx document generator.
The code block below downloads the web pages into a local directory called rtdocs.
I've already uploaded the rtdocs data folder to GitHub, so you should see it if you cloned my repo.
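The original download/load cell is not reproduced here; below is a minimal sketch of reading the local rtdocs folder into LangChain Document objects. The DirectoryLoader/TextLoader choice is an assumption, not necessarily the loader used in the notebook.

```python
# Sketch only: load the already-downloaded ReadTheDocs HTML pages into LangChain Documents.
# Assumes the pages live in ./rtdocs; the loader choice here is illustrative.
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader("rtdocs", glob="**/*.html", loader_cls=TextLoader)
docs = loader.load()
print(f"loaded {len(docs)} documents")
```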
loaded 22 documents
Start up Milvus running in local Docker (or Zilliz free tier)
⚠️ Make sure the pymilvus version you pip install matches the Milvus server .yml file. Major and minor versions should all match.
- Install Docker
- Start your Docker Desktop
- Download the latest docker-compose.yml (or run the wget command below, replacing the version with the one you are using)
wget https://github.com/milvus-io/milvus/releases/download/v2.4.0-rc.1/milvus-standalone-docker-compose.yml -O docker-compose.yml
- From your terminal:
- cd into the directory where you saved the .yml file (usually the same dir as this notebook)
- docker compose up -d
- verify (either in terminal or on Docker Desktop) the containers are running
- From your code (see notebook code below):
- Import milvus
- Connect to the local milvus server
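A minimal connection sketch using the pymilvus MilvusClient; the exact client setup in the notebook may differ.

```python
# Sketch: connect to the local Milvus standalone server started by docker compose.
from pymilvus import MilvusClient, __version__ as pymilvus_version

mc = MilvusClient(uri="http://localhost:19530")  # default Milvus standalone port
print(f"Pymilvus: {pymilvus_version}")
```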
Pymilvus: 2.4.0 v2.4.0-rc.1-dev
Optionally, use Zilliz free tier cluster
To use fully managed Milvus, start a Zilliz Cloud free trial.
- Choose the default "Starter" option when you provision > Create collection > Give it a name > Create cluster and collection.
- On the Cluster main page, copy your API Key and store it locally in a .env variable. See the note below for how to do that.
- Also on the Cluster main page, copy the Public Endpoint URI.
💡 Note: To keep your tokens private, best practice is to use an env variable. See how to save an API key in an env variable.
👇 In Jupyter, you need a .env file (in the same dir as your notebooks) containing lines like this:
- ZILLIZ_API_KEY=f370c...
- OPENAI_API_KEY=sk-H...
- VARIABLE_NAME=value...
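As a hedged example, the .env values can be loaded with python-dotenv and passed to MilvusClient. The ZILLIZ_URI variable name below is illustrative (the notebook only lists ZILLIZ_API_KEY); substitute however you store the Public Endpoint URI.

```python
# Sketch: connect to a Zilliz Cloud free-tier cluster using credentials kept in a .env file.
import os
from dotenv import load_dotenv
from pymilvus import MilvusClient

load_dotenv()  # reads the .env file in the notebook directory
mc = MilvusClient(
    uri=os.environ["ZILLIZ_URI"],        # hypothetical variable holding the Public Endpoint URI
    token=os.environ["ZILLIZ_API_KEY"],  # the API key copied from the Cluster main page
)
```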
Load the Embedding Model checkpoint and use it to create vector embeddings
What are Embeddings?
Check out this blog for an introduction to embeddings.
An excellent place to start is by selecting an embedding model from the HuggingFace MTEB Leaderboard, sorted descending by the "Retrieval Average" column, since this task is the most relevant to RAG. Then choose the smallest, highest-ranking embedding model. But beware! Some models listed are overfit to the training data, so they won't perform on your data as promised.
Milvus (and Zilliz) only supports tested embedding models that are not overfit.
In this notebook, we will use the open-source BGE-M3 model, which supports:
- over 100 languages
- context lengths of up to 8192 tokens
- multiple embedding types: dense (semantic), sparse (lexical), and multi-vector (ColBERT) reranking.
BGE-M3 holds the distinction of being the first embedding model to offer support for all three retrieval methods, achieving state-of-the-art performance on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmark tests. Paper, HuggingFace
Milvus, the world's first Open Source Vector Database, plays a vital role in semantic search with scalable, efficient storage and search for Generative AI workflows. Its advanced functionalities include metadata filtering and hybrid search. Since version 2.4, Milvus has built-in support for BGE-M3.
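A sketch of loading the built-in BGE-M3 embedding function that ships with pymilvus (pip install "pymilvus[model]"); this is one way to produce the device and dimension printouts below, and the exact arguments may differ from the notebook.

```python
# Sketch: load the BGE-M3 checkpoint via the pymilvus model package.
import torch
from pymilvus.model.hybrid import BGEM3EmbeddingFunction

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"device: {device}")

embedding_fn = BGEM3EmbeddingFunction(model_name="BAAI/bge-m3", device=device, use_fp16=False)
print(f"dense_dim: {embedding_fn.dim['dense']} "
      f"sparse_dim: {embedding_fn.dim['sparse']} "
      f"colbert_dim: {embedding_fn.dim['colbert_vecs']}")
```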
device: cpu
Fetching 30 files: 0%| | 0/30 [00:00<?, ?it/s]
dense_dim: 1024 sparse_dim: 250002 colbert_dim: 1024
Create a Milvus collection
You can think of a collection in Milvus like a "table" in SQL databases. The collection will contain:
- Schema (or the schema-less Milvus client).
  💡 You'll need the vector EMBEDDING_DIM parameter from your embedding model. Typical values are:
  - 1024 for sbert embedding models
  - 1536 for ada-002 OpenAI embedding models
- Vector index for efficient vector search
- Vector distance metric for measuring nearest neighbor vectors
- Consistency level
In Milvus, transactional consistency is possible; however, according to the CAP theorem, some latency must be sacrificed. 💡 Searching documentation pages is not mission-critical, so eventually consistent is fine here.
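Here is a minimal schema sketch with MilvusClient, assuming the field names and types shown in the describe_collection() output further below.

```python
# Sketch: define the collection schema (field names/types mirror the describe_collection output below).
from pymilvus import MilvusClient, DataType

EMBEDDING_DIM = 1024  # dense vector dimension of BGE-M3

schema = MilvusClient.create_schema(auto_id=True, enable_dynamic_field=False)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("sparse_vector", DataType.SPARSE_FLOAT_VECTOR)
schema.add_field("dense_vector", DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM)
schema.add_field("chunk", DataType.VARCHAR, max_length=65535)
schema.add_field("source", DataType.VARCHAR, max_length=65535)
schema.add_field("h1", DataType.VARCHAR, max_length=100)
schema.add_field("h2", DataType.VARCHAR, max_length=65535)
```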
Add a Vector Index
The vector index determines the vector search algorithm used to find the closest vectors in your data to the query a user submits.
Most vector indexes use different sets of parameters depending on whether the database is:
- inserting vectors (creation mode) - vs -
- searching vectors (search mode)
Scroll down the docs page to see a table listing different vector indexes available on Milvus. For example:
- FLAT - deterministic exhaustive search
- IVF_FLAT or IVF_SQ8 - Inverted-file (cluster-based) index (stochastic approximate search)
- HNSW - Graph index (stochastic approximate search)
- AUTOINDEX - OSS or Zilliz cloud automatic index based on type of GPU, size of data.
Besides a search algorithm, we also need to specify a distance metric, that is, a definition of what is considered "close" in vector space. In the cell below, the HNSW search index is chosen. Its possible distance metrics are one of:
- L2 - L2-norm
- IP - Dot-product
- COSINE - Angular distance
💡 Most use cases work better with normalized embeddings, in which case L2 is useless (every vector has length=1) and IP and COSINE are the same. Only choose L2 if you plan to keep your embeddings unnormalized.
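A sketch of adding the vector indexes and creating the collection itself, using the client and schema from the sketches above. The HNSW creation parameters (M, efConstruction) and the SPARSE_INVERTED_INDEX on the sparse field are illustrative choices, not the notebook's exact values.

```python
# Sketch: add vector indexes and create the collection with eventual consistency.
COLLECTION_NAME = "MilvusDocs"

index_params = mc.prepare_index_params()
index_params.add_index(
    field_name="dense_vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 64},  # illustrative creation-mode parameters
)
index_params.add_index(
    field_name="sparse_vector",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)

# Drop any stale copy of the collection, then create it fresh.
if mc.has_collection(COLLECTION_NAME):
    mc.drop_collection(COLLECTION_NAME)
    print(f"Successfully dropped collection: `{COLLECTION_NAME}`")

mc.create_collection(
    collection_name=COLLECTION_NAME,
    schema=schema,
    index_params=index_params,
    consistency_level="Eventually",
)
print(f"Successfully created collection: `{COLLECTION_NAME}`")
```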
Successfully dropped collection: `MilvusDocs` Successfully created collection: `MilvusDocs`
Chunking
Before embedding, it is necessary to decide your chunk strategy, chunk size, and chunk overlap. In this demo, I will use:
- Strategy = Use markdown header hierarchies. Keep markdown sections together unless they are too long.
- Chunk size = Use the embedding model's MAX_SEQ_LENGTH parameter
- Overlap = Rule-of-thumb 10-15%
- Function =
  - Langchain's HTMLHeaderTextSplitter to split markdown sections.
  - Langchain's RecursiveCharacterTextSplitter to split up long text recursively.
Notice below, each chunk is grounded with the document source page.
In addition, header titles are kept together with the chunk of markdown text.
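Below is a sketch of the two-step split, assuming the raw pages are HTML, MAX_SEQ_LENGTH=512, and roughly 10% overlap (matching the chunk_size/chunk_overlap printed underneath).

```python
# Sketch: split each page on h1/h2 headers, then recursively split long sections,
# carrying the source page along in the metadata so every chunk stays grounded.
from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

MAX_SEQ_LENGTH = 512
chunk_overlap = int(MAX_SEQ_LENGTH * 0.10)  # rule-of-thumb 10-15%

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "h1"), ("h2", "h2")])
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=MAX_SEQ_LENGTH, chunk_overlap=chunk_overlap
)

chunks = []
for doc in docs:
    header_docs = html_splitter.split_text(doc.page_content)  # Documents with h1/h2 metadata
    for d in header_docs:
        d.metadata["source"] = doc.metadata["source"]  # keep the source page with every chunk
    chunks.extend(child_splitter.split_documents(header_docs))

print(f"docs: {len(docs)}, split into chunks: {len(chunks)}")
```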
chunk_size: 512, chunk_overlap: 51.0
chunking time: 0.028937101364135742
docs: 22, split into: 22
split into chunks: 304, type: list of <class 'langchain_core.documents.base.Document'>
Looking at a sample chunk...
Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed Milvus FREE Search Home v2.4.x Abo
{'h1': 'Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed Milvus FREE Search Home v2.4.x Abo', 'source': 'rtdocs/quickstart.html'}
Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed Milvus FREE Search Home v2.4.x Abo
{'h1': 'Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed Milvus FREE Search Home v2.4.x Abo', 'source': 'https://milvus.io/docs/quickstart.md'}
Use the built-in Milvus BGE-M3 embedding functions. The output will be 2 vectors:
- embeddings['dense'][i] is a list of numpy arrays, one per chunk. Milvus supports more than one dense embedding vector if desired, so i is the ith dense embedding vector.
- embeddings['sparse'][:, [i]] is a scipy sparse matrix where each column represents a chunk.
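As a sketch, the chunk texts can be embedded in a single call with the same BGE-M3 function loaded earlier; the variable names here are illustrative.

```python
# Sketch: embed all chunk texts with the BGE-M3 embedding function from earlier.
import time

chunk_texts = [c.page_content for c in chunks]

start = time.time()
embeddings = embedding_fn.encode_documents(chunk_texts)
print(f"Embedding time for {len(chunk_texts)} chunks: {time.time() - start:.2f} seconds")

dense_vectors = embeddings["dense"]    # one float32 numpy array per chunk
sparse_vectors = embeddings["sparse"]  # scipy sparse matrix, one sparse vector per chunk
```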
Inference Embeddings: 100%|██████████| 19/19 [00:33<00:00, 1.79s/it]
Embedding time for 304 chunks: 34.03 seconds
Insert data into Milvus
For each original text chunk, we'll write the sextuplet (chunk, h1, h2, source, dense_vector, sparse_vector) into the database.
The Milvus Client wrapper can only handle loading data from a list of dictionaries.
Otherwise, in general, Milvus supports loading data from:
- pandas dataframes
- list of dictionaries
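A sketch of assembling the list of dictionaries and inserting it. Field names must match the schema; the sparse slicing below assumes one row per chunk in the BGE-M3 output, so adjust the slice if your sparse matrix is oriented the other way.

```python
# Sketch: build one dict per chunk and insert them all at once.
import time

rows = []
for i, chunk in enumerate(chunks):
    rows.append({
        "chunk": chunk.page_content,
        "h1": chunk.metadata.get("h1", "")[:100],
        "h2": chunk.metadata.get("h2", ""),
        "source": chunk.metadata.get("source", ""),
        "dense_vector": dense_vectors[i],
        "sparse_vector": sparse_vectors[[i], :],  # one-row scipy sparse slice for this chunk
    })

print("Start inserting entities")
start = time.time()
mc.insert(collection_name=COLLECTION_NAME, data=rows)
print(f"Milvus insert time for {len(rows)} vectors: {time.time() - start:.2f} seconds")
```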
304
<class 'dict'> 6
{'chunk': 'Why Milvus Docs Tutorials Tools Blog Community Stars0 Try Managed '
'Milvus FREE Search Home v2.4.x About Milvus Get '
'StartedPrerequisitesInstall MilvusInstall SDKsQuickstart Concepts '
'User Guide Embeddings Administration Guide Tools Integrations '
'Example Applications FAQs API reference Quickstart This guide '
'explains how to connect to your Milvus cluster and performs CRUD '
'operations in minutes Before you start You have installed Milvus '
'standalone or Milvus cluster. You have installed preferred SDKs. '
'You can',
'dense_vector': array([-0.01666467, 0.05284622, -0.05246124, ..., -0.0182556 ,
0.03670057, -0.00945159], dtype=float32),
'h1': 'Why Milvus Docs Tutorials Tools Blog Community Sta',
'h2': '',
'source': 'https://milvus.io/docs/quickstart.md',
'sparse_vector': <1x250002 sparse array of type '<class 'numpy.float32'>'
with 63 stored elements in Compressed Sparse Row format>}
Start inserting entities Milvus insert time for 304 vectors: 0.22 seconds
Aside - example Milvus collection API calls
https://milvus.io/docs/manage-collections.md#View-Collections
Below are some common API calls for checking a collection.
- .num_entities flushes data and executes a row count.
- .describe_collection() gives details about the schema, index, and collection.
- .query() gives back selected data from the collection.
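Hedged sketches of those calls with MilvusClient (note that the ORM-style .num_entities belongs to the Collection object rather than MilvusClient, so a count(*) query is used here instead):

```python
# Sketch: a few ways to inspect the collection.
from pprint import pprint

# Row count via a count(*) query.
print(mc.query(collection_name=COLLECTION_NAME, filter="", output_fields=["count(*)"]))

# Schema, index, and collection details.
pprint(mc.describe_collection(collection_name=COLLECTION_NAME))

# Selected data back from the collection.
pprint(mc.query(collection_name=COLLECTION_NAME, filter="", output_fields=["source"], limit=3))
```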
Count rows: 304
timing: 0.0056 seconds
{'aliases': [],
'auto_id': True,
'collection_id': 449197422121357429,
'collection_name': 'MilvusDocs',
'consistency_level': 3,
'description': '',
'enable_dynamic_field': False,
'fields': [{'auto_id': True,
'description': '',
'field_id': 100,
'is_primary': True,
'name': 'id',
'params': {},
'type': <DataType.INT64: 5>},
{'description': '',
'field_id': 101,
'name': 'sparse_vector',
'params': {},
'type': <DataType.SPARSE_FLOAT_VECTOR: 104>},
{'description': '',
'field_id': 102,
'name': 'dense_vector',
'params': {'dim': 1024},
'type': <DataType.FLOAT_VECTOR: 101>},
{'description': '',
'field_id': 103,
'name': 'chunk',
'params': {'max_length': 65535},
'type': <DataType.VARCHAR: 21>},
{'description': '',
'field_id': 104,
'name': 'source',
'params': {'max_length': 65535},
'type': <DataType.VARCHAR: 21>},
{'description': '',
'field_id': 105,
'name': 'h1',
'params': {'max_length': 100},
'type': <DataType.VARCHAR: 21>},
{'description': '',
'field_id': 106,
'name': 'h2',
'params': {'max_length': 65535},
'type': <DataType.VARCHAR: 21>}],
'num_partitions': 1,
'num_shards': 1,
'properties': {}}
timing: 0.0031 seconds
[{'count(*)': 304}]
timing: 0.0096 seconds
Ask a question about your data
So far in this demo notebook:
- Your custom data has been mapped into a vector embedding space
- Those vector embeddings have been saved into a vector database
Next, you can ask a question about your custom data!
💡 In LLM vocabulary:
- Query is the generic term for user questions. A query can be a batch of many individual questions, up to maybe 1000 at a time!
- Question usually refers to a single user question.
In our example below, the user question is "What is AUTOINDEX in Milvus Client?"
Semantic Search = very fast search of the entire knowledge base to find the TOP_K documentation chunks with the embeddings closest to the user's query.
💡 For consistency, the same embedding model should always be used for all the data embeddings and the query.
query length: 75
Execute a vector search
Search Milvus using PyMilvus API.
💡 By their nature, vector searches are "semantic" searches. For example, if you were to search for "leaky faucet":
- Traditional keyword search - one or both of the words "leaky" and "faucet" would have to match some text in the document for it to be returned.
- Semantic search - results containing words like "drippy" and "taps" would be returned as well, because these words mean the same thing even though they are different words.
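A sketch of embedding the question with the same BGE-M3 model and running a dense vector search with TOP_K=2 (matching the two hits printed below). The sample question mirrors the one used in the LLM calls later in the notebook; parameter names follow the MilvusClient search API.

```python
# Sketch: embed the user's question with the same model, then search the dense vector field.
import time

SAMPLE_QUESTION = "What do the parameters for HNSW mean?"
query_embeddings = embedding_fn.encode_queries([SAMPLE_QUESTION])

TOP_K = 2
start = time.time()
results = mc.search(
    collection_name=COLLECTION_NAME,
    data=[query_embeddings["dense"][0]],
    anns_field="dense_vector",
    limit=TOP_K,
    output_fields=["chunk", "source", "h1", "h2"],
)
print(f"Milvus Client search time: {time.time() - start} seconds")

for hit in results[0]:
    print(f"Retrieved result #{hit['id']} distance = {hit['distance']}")
```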
Milvus Client search time for 304 vectors: 0.012432098388671875 seconds type: <class 'pymilvus.client.abstract.Hits'>, count: 2
Assemble and inspect the search result
The search result is in the variable results[0], which contains TOP_K objects of type 'pymilvus.client.abstract.Hits'.
Retrieved result #449197422118233005 distance = 0.032522473484277725 Retrieved result #449197422118233006 distance = 0.032522473484277725
Use an LLM to Generate a chat response to the user's question using the Retrieved Context.
Many different generative LLMs exist these days. Check out the lmsys leaderboard.
In this notebook, we'll try these LLMs:
- The newly released open-source Llama 3 from Meta.
- The cheapest, paid model from Anthropic Claude3 Haiku.
- The standard in its price category, gpt-3.5-turbo, from OpenAI.
Length long text to summarize: 1017
Try Llama3 with Ollama to generate a human-like chat response to the user's question
Follow the instructions to install ollama and pull a model.
https://github.com/ollama/ollama
View details about which models are supported by ollama.
https://ollama.com/library/llama3
That page says ollama run llama3 will by default pull the latest "instruct" model, which is fine-tuned for chat/dialogue use cases.
The other kind of llama3 models are the "pre-trained" base models.
Example: ollama run llama3:text ollama run llama3:70b-text
The gguf format means the model runs on CPU. gg = "Georgi Gerganov", creator of the ggml C library model format, which was recently replaced by gguf.
Quantization (think of it like vector compaction) can lead to higher throughput at the expense of lower accuracy. For the curious, quantization meanings can be found on:
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main.
Below are just the main quantization types:
- q4_0: Original quant method, 4-bit.
- q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
- q5_0: Higher accuracy, higher resource usage and slower inference.
- q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
- q6_k: Uses Q8_K for all tensors
- q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
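A hedged sketch of calling the local Llama 3 model through the ollama Python package (pip install ollama). The prompt wording and the way the retrieved context is assembled from the search hits are illustrative, not the notebook's exact code.

```python
# Sketch: ask the locally running llama3 model to answer using the retrieved context.
import ollama

context = "\n".join(hit["entity"]["chunk"] for hit in results[0])  # chunks retrieved above
prompt = (
    "Use the following context to answer the question. Be brief and factual.\n"
    f"Context: {context}\n"
    f"Question: {SAMPLE_QUESTION}\n"
    "Answer:"
)

response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
print(response["message"]["content"])
```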
MODEL:llama3:latest, FORMAT:gguf, PARAMETER_SIZE:8B, QUANTIZATION_LEVEL:Q4_0,
('According to the provided context, the parameters for HNSW (Hierarchical '
'Navigable Small World Graph) are: * `M`: The maximum degree of nodes on '
'each layer of the graph. This value can improve recall rate at the cost of '
'increased search time. * `ef` (in construction or searching targets): A '
'search range parameter that can be used to specify the search scope. In '
'simpler terms, these parameters help control how efficiently HNSW searches '
'for nearest neighbors in a dataset. By adjusting these values, you can '
'balance the trade-off between search accuracy and speed.')
Use Anthropic to generate a human-like chat response to the user's question
We've practiced retrieval for free on our own data using open-source LLMs.
Now let's make a call to the paid Claude3. List of models
- Opus - most expensive
- Sonnet
- Haiku - least expensive!
Prompt engineering tutorials
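A sketch of the Claude 3 Haiku call using the anthropic SDK; the ANTHROPIC_API_KEY env variable and the system prompt wording are assumptions, and `context` / `SAMPLE_QUESTION` come from the retrieval step above.

```python
# Sketch: ask Claude 3 Haiku to answer using the retrieved context.
import os
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])  # assumed env variable

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    temperature=0.1,
    system="Answer the question using only the provided context.",
    messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {SAMPLE_QUESTION}"}],
)
print(f"Model: claude-3-haiku-20240307 Question: {SAMPLE_QUESTION}")
print(message.content[0].text)
```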
Model: claude-3-haiku-20240307 Question: What do the parameters for HNSW mean?
Try OpenAI to generate a human-like chat response to the user's question
We've practiced retrieval for free on our own data using open-source LLMs.
Now let's make a call to the paid OpenAI GPT.
💡 Note: For use cases that need to always be factually grounded, use very low temperature values, while more creative tasks can benefit from higher temperatures.
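A sketch of the equivalent OpenAI call with a low temperature to keep the answer grounded; the prompt wording is illustrative, and `context` / `SAMPLE_QUESTION` come from the retrieval step above.

```python
# Sketch: ask gpt-3.5-turbo to answer using the retrieved context, with a low temperature.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0.1,  # low temperature keeps the answer grounded in the retrieved context
    messages=[
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {SAMPLE_QUESTION}"},
    ],
)
print(f"Question: {SAMPLE_QUESTION}")
print(response.choices[0].message.content)
```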
Question: What do the parameters for HNSW mean?
('Answer: The parameters for HNSW are as follows:\n'
'- M: Maximum degree of the node, limiting the number of connections each '
'node can have on each layer of the graph. It ranges from 2 to 2048.\n'
'- efConstruction: Used during index building to specify a search range.\n'
'- ef: Used when searching for targets to specify a search range.')
Author: Christy Bergman Python implementation: CPython Python version : 3.11.8 IPython version : 8.22.2 torch : 2.2.2 pymilvus : 2.4.0 langchain: 0.1.16 ollama : 0.1.8 anthropic: 0.25.6 openai : 1.14.3 conda environment: py311-ray