BigQuery Weaviate DSPy RAG

How to build a RAG System with Weaviate, BigQuery, and DSPy

Retrieval-Augmented Generation (RAG) systems combine the power of Large Language Models with knowledge sources, such as databases.

This tutorial shows how to use DSPy to combine multiple knowledge sources. Weaviate provides vector search over chunks of the Weaviate blog posts. Google's BigQuery holds structured information about the blog authors: their names, which team they work on at Weaviate, how many blogs they have written, and whether they are active members of the Weaviate team.

We will use DSPy to create our RAGwithContextFusion agent, which routes queries, converts natural language questions into SQL commands to send to BigQuery, and uses the retrieved context to answer questions. In this notebook, DSPy is configured to use the Gemini LLM.

Connect DSPy to the Gemini API

Image source: https://gemini.google.com/
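The connection code for this step is not preserved in this copy of the notebook. The following is a minimal sketch, assuming a recent DSPy release that exposes `dspy.LM` and `dspy.configure`, and assuming the key is stored in a `GEMINI_API_KEY` environment variable. The model string and the helper names `gemini_api_key` and `connect_gemini` are illustrative, not taken from the notebook.

```python
import os


def gemini_api_key(env=os.environ):
    """Read the Gemini API key from the environment, failing loudly if absent."""
    key = env.get("GEMINI_API_KEY", "")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY before connecting DSPy to Gemini.")
    return key


def connect_gemini(model="gemini/gemini-1.5-flash"):
    """Create a DSPy language model client and make it the default LM.

    The model identifier is an assumption; substitute the Gemini model you use.
    """
    import dspy  # imported lazily so gemini_api_key works without DSPy installed

    lm = dspy.LM(model, api_key=gemini_api_key())
    dspy.configure(lm=lm)
    return lm


# Usage (needs network access and a valid key):
# lm = connect_gemini()
# print(lm("Say hello!"))
```

Calling the configured client with a short prompt is a quick way to confirm that the key and model name are accepted before building anything on top of it.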

[2]

['Hello!']

[133]

["**Google BigQuery** is a fully managed, serverless data warehouse that enables fast and cost-effective analysis of large datasets. It is a cloud-based service that allows users to store, query, and analyze data at scale.\n\n**Key Features:**\n\n* **Massive Scalability:** BigQuery can handle datasets up to petabytes in size, making it suitable for large-scale data analysis.\n* **Fast Query Performance:** BigQuery uses a distributed processing engine to execute queries quickly, even on massive datasets.\n* **Serverless Architecture:** BigQuery is a fully managed service, eliminating the need for infrastructure management and maintenance.\n* **Cost-Effective:** BigQuery charges only for the data stored and the queries executed, making it a cost-effective solution for data analysis.\n* **Standard SQL Support:** BigQuery supports standard SQL, making it easy for users to write complex queries and perform advanced data analysis.\n* **Integration with Google Cloud Platform:** BigQuery seamlessly integrates with other Google Cloud services, such as Cloud Storage, Cloud Dataflow, and Cloud Machine Learning.\n* **Data Sharing and Collaboration:** BigQuery allows users to share datasets and collaborate with others, enabling data-driven decision-making across teams.\n\n**Use Cases:**\n\nBigQuery is used for a wide range of data analysis applications, including:\n\n* Business Intelligence and Reporting\n* Data Exploration and Visualization\n* Machine Learning and Data Science\n* Log Analysis and Monitoring\n* Data Warehousing and Data Lake Management\n\n**Benefits:**\n\n* **Reduced Time to Insight:** BigQuery's fast query performance enables users to quickly extract insights from large datasets.\n* **Cost Optimization:** The serverless architecture and pay-as-you-go pricing model help organizations optimize their data analysis costs.\n* **Improved Data Governance:** BigQuery provides data access controls and audit logs, ensuring data security and compliance.\n* **Enhanced Collaboration:** Data sharing and collaboration features facilitate data-driven decision-making across teams.\n* **Scalability and Flexibility:** BigQuery's massive scalability and flexible data ingestion options allow organizations to handle growing data volumes and evolving analysis needs."]

Load Unstructured Text Data into Weaviate

Load Unstructured Text Data into Memory
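The loading code is not preserved in this copy. As a rough sketch of the idea, the helpers below read Markdown blog posts from a folder, separate the `---` delimited front matter from the body, and group sentences into fixed-size chunks. The folder layout, the `*.mdx` pattern, the chunk size, and all function names are assumptions made for illustration. Whether the notebook strips front matter before chunking is not shown.

```python
import re
from pathlib import Path


def split_front_matter(markdown):
    """Separate a leading '---' delimited front-matter block from the body."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", markdown, flags=re.DOTALL)
    if match is None:
        return "", markdown
    return match.group(1), match.group(2)


def chunk_sentences(text, sentences_per_chunk=5):
    """Split text on sentence-ending punctuation and group into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def load_blog_chunks(folder, pattern="*.mdx", sentences_per_chunk=5):
    """Read every matching file under folder and return a flat list of chunks."""
    chunks = []
    for path in sorted(Path(folder).rglob(pattern)):
        _, body = split_front_matter(path.read_text(encoding="utf-8"))
        chunks.extend(chunk_sentences(body, sentences_per_chunk))
    return chunks
```

Sentence-based chunking is only one option. The regular expression here is deliberately simple and will mis-split abbreviations and code snippets.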

[3]

---
title: Combining LangChain and Weaviate
slug: combining-langchain-and-weaviate
authors: [erika]
date: 2023-02-21
tags: ['integrations']
image: ./img/hero.png
description: "LangChain is one of the most exciting new tools in AI. It helps overcome many limitations of LLMs, such as hallucination and limited input lengths."
---
![Combining LangChain and Weaviate](./img/hero.png)

Large Language Models (LLMs) have revolutionized the way we interact and communicate with computers. These machines can understand and generate human-like language on a massive scale. LLMs are a versatile tool that is seen in many applications like chatbots, content creation, and much more. Despite being a powerful tool, LLMs have the drawback of being too general.

Create a Weaviate Schema and Import Data
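The schema and import cell is missing from this copy. The sketch below shows one way to do it with the v4 Weaviate Python client. The collection name `WeaviateBlogChunk`, the choice of the `text2vec-cohere` vectorizer, and the helper names are assumptions. The `content` property name matches the key shown in the query output later in this notebook. Parameter names have shifted between client releases, so check them against the version you have installed.

```python
def create_blog_collection(client, name="WeaviateBlogChunk"):
    """Create a collection with one text property (assumed schema)."""
    import weaviate.classes.config as wc  # lazy import: needs weaviate-client v4

    return client.collections.create(
        name=name,
        # Vectorizer choice is an assumption; use the module enabled on your instance.
        vectorizer_config=wc.Configure.Vectorizer.text2vec_cohere(),
        properties=[wc.Property(name="content", data_type=wc.DataType.TEXT)],
    )


def to_objects(chunks):
    """Turn raw text chunks into property dicts, skipping empty chunks."""
    return [{"content": chunk} for chunk in chunks if chunk.strip()]


def import_chunks(collection, chunks):
    """Batch-insert chunks into the collection and return the client's result."""
    return collection.data.insert_many(to_objects(chunks))


# Usage (needs a running Weaviate instance):
# import weaviate
# client = weaviate.connect_to_local()
# collection = create_blog_collection(client)
# import_chunks(collection, ["first chunk", "second chunk"])
# client.close()
```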

[289]

Query Test
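A quick way to check the import is a semantic search that returns the stored properties of the top match. The sketch below assumes a v4 client collection handle and a configured vectorizer. `search_chunks` is a hypothetical helper name.

```python
def search_chunks(collection, query, limit=1):
    """Run a near-text search and return the properties of each hit."""
    response = collection.query.near_text(query=query, limit=limit)
    return [obj.properties for obj in response.objects]


# Usage (needs a populated collection):
# collection = client.collections.get("WeaviateBlogChunk")
# print(search_chunks(collection, "How does garbage collection work in Go?")[0])
```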

[290]

{'content': "In a garbage-collected language, such as Go, C#, or Java, the programmer doesn't have to deallocate objects manually after using them. A GC cycle runs periodically to collect memory no longer needed and ensure it can be assigned again. Using a garbage-collected language is a trade-off between development complexity and execution time. Some CPU time has to be spent at runtime to run the GC cycles. Go's Garbage collector is highly concurrent and [quite efficient](https://tip.golang.org/doc/gc-guide#Understanding_costs)."}

Load Structured Data into BigQuery

From cloud.google.com/bigquery, "BigQuery is a fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud". For example, companies often store information about transactions or customer relationships in structured tables.

Image source: https://cloud.google.com/bigquery

Connect to BigQuery

Install the google-cloud-bigquery Python client with pip.

This tutorial is written with google-cloud-bigquery==3.21.0

[65]


[66]

[67]

Google Cloud Data Marketplace with BigQuery

You can access many datasets in the Google Cloud Data Marketplace!

Maybe your RAG application needs to know the most commonly occurring names of residents in Texas!
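The query cell is not preserved here. As an illustration, the sketch below queries the public `bigquery-public-data.usa_names.usa_1910_2013` table, which has `state`, `name`, and `number` columns. The notebook's exact SQL is unknown, so the aggregation, the `total` alias, and the helper names are assumptions, and the rows it returns need not match the output shown in this section.

```python
import re

USA_NAMES_TABLE = "bigquery-public-data.usa_names.usa_1910_2013"


def top_names_sql(state="TX", limit=3):
    """Build SQL for the most common names in one US state."""
    # Validate inputs because they are interpolated into the SQL string.
    if not re.fullmatch(r"[A-Z]{2}", state):
        raise ValueError("state must be a two-letter uppercase code, e.g. 'TX'")
    limit = int(limit)
    return (
        "SELECT name, SUM(number) AS total "
        f"FROM `{USA_NAMES_TABLE}` "
        f"WHERE state = '{state}' "
        "GROUP BY name ORDER BY total DESC "
        f"LIMIT {limit}"
    )


def top_names(client, state="TX", limit=3):
    """Run the query and return (name, total) tuples."""
    rows = client.query(top_names_sql(state, limit)).result()
    return [(row["name"], row["total"]) for row in rows]


# Usage (needs Google Cloud credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# print(top_names(client))
```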

[336]

Row(('Ruby', 314), {'name': 0, 'number': 1})
Row(('Louise', 127), {'name': 0, 'number': 1})
Row(('Carrie', 63), {'name': 0, 'number': 1})

Learn more about the Google Cloud Marketplace in the 95th Weaviate Podcast with Dai Vu, Director of Google Cloud Marketplace and ISV GTM, and Bob van Luijt, Weaviate Co-Founder and CEO!

Custom Schema

The table schema for this example was created in the Google Cloud console. The cells below insert rows into that table and read one back.
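The insertion code is not preserved in this copy. A minimal sketch follows, assuming the table already exists with the columns that appear in the row output of this section (`Name`, `Team`, `Blogs_Written`, `Active_Weaviate_Team_Member`). It passes a full `project.dataset.table` string to `insert_rows_json` rather than using `Client.dataset`. The helper names are hypothetical.

```python
def table_id(project, dataset, table):
    """Build the fully qualified table ID that BigQuery accepts as a string."""
    return f"{project}.{dataset}.{table}"


def insert_rows(client, full_table_id, rows):
    """Stream JSON rows into a table and raise if BigQuery reports errors."""
    errors = client.insert_rows_json(full_table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected rows: {errors}")
    return len(rows)


# Usage (needs Google Cloud credentials and an existing table):
# from google.cloud import bigquery
# client = bigquery.Client()
# target = table_id("bigquery-playground-422417", "WeaviateBlogs", "BlogInfo")
# insert_rows(client, target, [{
#     "Name": "Abdel Rodriguez",
#     "Team": "Applied Research",
#     "Blogs_Written": 5,
#     "Active_Weaviate_Team_Member": True,
# }])
```

`insert_rows_json` returns a list of per-row errors, so an empty list means every row was accepted.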

[69]

/var/folders/41/8dp_379x15d8zz4ppsjthdw40000gn/T/ipykernel_8009/3365144275.py:1: PendingDeprecationWarning: Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
  table_ref = bigquery_client.dataset("WeaviateBlogs").table("BlogInfo")

bigquery-playground-422417.WeaviateBlogs.BlogInfo
Rows inserted successfully.

[70]

Row(('Abdel Rodriguez', 'Applied Research', 5, True), {'Name': 0, 'Team': 1, 'Blogs_Written': 2, 'Active_Weaviate_Team_Member': 3})

Storing Monitoring Logs in Weaviate and BigQuery

[315]

/var/folders/41/8dp_379x15d8zz4ppsjthdw40000gn/T/ipykernel_8009/497179841.py:7: PendingDeprecationWarning: Client.dataset is deprecated and will be removed in a future version. Use a string like 'my_project.my_dataset' or a cloud.google.bigquery.DatasetReference object, instead.
  table_ref = bigquery_client.dataset("WeaviateBlogs").table("RAGLogs")

RAGwithContextFusion Program

Image source: https://dspy-docs.vercel.app/

Now we turn to our RAGwithContextFusion program, which draws on four information sources:

Blog chunks stored in Weaviate
Author metadata stored in BigQuery
RAG logs stored in Weaviate
RAG logs stored in BigQuery

To answer questions, the program will route queries to the appropriate information sources, looping when multiple rounds of queries are needed.
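The routing loop described above can be sketched in plain Python. In the notebook the routing and answering decisions are made by the LLM through DSPy. Here they are passed in as ordinary callables so that the control flow is visible on its own. The `Route` member names, the `ANSWER` stop signal, and the `max_hops` limit are assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    """Possible destinations for a query (names are hypothetical)."""

    BLOG_CHUNKS = "weaviate_blog_chunks"
    AUTHOR_METADATA = "bigquery_author_metadata"
    WEAVIATE_LOGS = "weaviate_rag_logs"
    BIGQUERY_LOGS = "bigquery_rag_logs"
    ANSWER = "answer"  # signals that enough context has been gathered


def run_with_routing(question, choose_route, tools, answer, max_hops=3):
    """Gather context from routed sources, then answer.

    choose_route(question, contexts) returns a Route.
    tools maps a Route to a callable that takes the question and returns context.
    answer(question, contexts) produces the final response.
    """
    contexts = []
    for _ in range(max_hops):
        route = choose_route(question, contexts)
        if route is Route.ANSWER:
            break
        contexts.append(tools[route](question))
    return answer(question, contexts)
```

Capping the number of hops keeps a confused router from looping indefinitely, at the cost of sometimes answering with incomplete context.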

DSPy Signatures and Route Enum

[316]

Database Tools

[317]

Structured Schema Info and Data Source Metadata
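For the LLM to write valid SQL, it needs a description of the table it is querying. The cell that builds this description is missing from this copy, so the following is only a sketch of one way to render it as prompt text. The column names come from the row output earlier in this notebook. The column types are inferred from that sample row and are assumptions, as is the helper name `describe_table`.

```python
# Column names are taken from the BlogInfo row shown earlier; types are assumed.
AUTHOR_TABLE_SCHEMA = {
    "Name": "STRING",
    "Team": "STRING",
    "Blogs_Written": "INTEGER",
    "Active_Weaviate_Team_Member": "BOOLEAN",
}


def describe_table(full_table_id, schema):
    """Render a table schema as plain text suitable for an LLM prompt."""
    lines = [f"Table `{full_table_id}` has the following columns:"]
    lines += [f"- {column} ({col_type})" for column, col_type in schema.items()]
    return "\n".join(lines)


# Usage:
# print(describe_table(
#     "bigquery-playground-422417.WeaviateBlogs.BlogInfo", AUTHOR_TABLE_SCHEMA
# ))
```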

[318]

RAGwithContextFusion

[333]

[324]

[332]

Prediction(
    answer='Zain Hasan and Erika Cardenas are the most frequent authors of Weaviate blog posts, with 20 posts each.'
)

[326]

Prediction(
    answer='Ref2Vec infers a centroid vector from a user\'s references to other vectors. This vector is updated in real-time to reflect the user\'s preferences and actions. Ref2Vec integrates with Weaviate through the "user-as-query" method, where the user\'s vector is used as a query to fetch relevant products.'
)