Building an Advanced RAG System with Self-Querying Retrieval

This notebook shows how to incorporate self-querying retrieval into a RAG application using Unstructured, MongoDB and LangGraph.

Step 1: Install required libraries

- **langgraph**: Python package to build stateful, multi-actor applications with LLMs

- **openai**: Python package to interact with OpenAI APIs

- **pymongo**: Python package to interact with MongoDB databases and collections

- **sentence-transformers**: Python package for open-source language models

- **unstructured-ingest**: Python package for data processing using Unstructured


Step 2: Setup prerequisites

  • Set the Unstructured API key and URL: Steps to obtain the API key and URL are here

  • Set the AWS access keys: Steps to obtain the AWS access keys are here

  • Set the MongoDB connection string: Follow the steps here to get the connection string from the Atlas UI.

  • Set the OpenAI API key: Steps to obtain an API key are here
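One way to wire these settings in is to read them from environment variables and fail early when any are missing. The following is a minimal sketch; the variable names (`UNSTRUCTURED_API_KEY`, `MONGODB_URI`, and so on) are assumptions chosen for illustration, not names this notebook is known to use.

```python
import os

# Hypothetical variable names; adjust them to match your own setup.
REQUIRED_SETTINGS = [
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_URL",
    "AWS_KEY",
    "AWS_SECRET",
    "MONGODB_URI",
    "OPENAI_API_KEY",
]


def load_settings(names=REQUIRED_SETTINGS, env=None):
    """Return a dict of settings, raising if any value is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {name: env[name] for name in names}
```

Calling `load_settings()` at the top of the notebook surfaces a missing key immediately instead of midway through the pipeline.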


Step 3: Partition, chunk and embed PDF files

Let's set up the PDF preprocessing pipeline with Unstructured. The pipeline will:

  1. Ingest data from an S3 bucket/local directory
  2. Partition documents: extract text and metadata, split the documents into document elements, such as titles, paragraphs (narrative text), tables, images, lists, etc. Learn more about document elements in the [Unstructured documentation](https://docs.unstructured.io/api-reference/api-services/document-elements).
  3. Chunk the documents.
  4. Embed the documents with the BAAI/bge-base-en-v1.5 embedding model from the Hugging Face Hub.
  5. Save the results locally.
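To make the partition-then-chunk idea concrete, here is a toy illustration in plain Python. It is not Unstructured's implementation, and the chunking rule (start a new chunk at each `Title` element or when a size limit is reached) is an assumption chosen for the example. The element dicts only mimic the `type` and `text` fields of partitioned output.

```python
def chunk_elements(elements, max_characters=500):
    """Group element dicts into text chunks.

    A new chunk starts at every Title element, or when adding the next
    element would push the current chunk past max_characters.
    """
    chunks, current, current_len = [], [], 0
    for element in elements:
        text = element["text"]
        starts_section = element["type"] == "Title"
        too_long = current_len + len(text) > max_characters
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

In the real pipeline, each resulting chunk is what gets embedded and stored as one document.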

Step 4: Add custom metadata to the processed documents

For each document, we want to add the company name and fiscal year as custom metadata, to enable smart pre-filtering for more precise document retrieval.

Luckily, Form 10-K documents have a more or less standard page with this information, so we can extract it with regular expressions.
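As a rough sketch of what such extraction might look like, the patterns below target wording of the form "WALMART INC. ANNUAL REPORT ON FORM 10-K FOR THE FISCAL YEAR ENDED JANUARY 31, 2023", which appears in the sample output in Step 10. Both patterns and the function name are assumptions for illustration; real filings vary, so expect to adjust them.

```python
import re

# Assumed pattern: "... fiscal year ended <Month> <day>, <year>" (case-insensitive).
YEAR_RE = re.compile(
    r"fiscal\s+year\s+ended\s+[A-Za-z]+\s+\d{1,2},\s+(\d{4})", re.IGNORECASE
)
# Assumed pattern: an upper-case company name directly before the report title.
COMPANY_RE = re.compile(
    r"([A-Z][A-Z0-9&.,' -]+?)\s+ANNUAL\s+REPORT\s+ON\s+FORM\s+10-K"
)


def extract_company_and_year(text):
    """Return (company, year); either value is None when its pattern is absent."""
    company = COMPANY_RE.search(text)
    year = YEAR_RE.search(text)
    return (
        company.group(1).strip() if company else None,
        int(year.group(1)) if year else None,
    )
```

The year is returned as an integer here so that it can be compared against numeric filter values at query time.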


We'll walk through the directory with the embedding results, find the company name and year for each document, and add them as custom metadata to all elements of that document.
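A minimal sketch of that walk is shown below, assuming each result file is a JSON list of element dicts with `text` and `metadata` keys. The function names are hypothetical, and `extract` stands for any callable that maps a document's text to a `(company, year)` pair. The `custom_metadata` location matches the field paths seen in the Step 10 output.

```python
import json
from pathlib import Path


def add_custom_metadata(elements, company, year):
    """Attach company and year to every element under metadata.custom_metadata."""
    for element in elements:
        custom = element.setdefault("metadata", {}).setdefault("custom_metadata", {})
        custom["company"] = company
        custom["year"] = year
    return elements


def annotate_directory(results_dir, extract):
    """Rewrite each JSON file under results_dir in place; return the file count."""
    updated = 0
    for path in sorted(Path(results_dir).rglob("*.json")):
        elements = json.loads(path.read_text(encoding="utf-8"))
        # Search the whole document, since the values may sit in any element.
        full_text = "\n".join(element.get("text", "") for element in elements)
        company, year = extract(full_text)
        add_custom_metadata(elements, company, year)
        path.write_text(json.dumps(elements, indent=2), encoding="utf-8")
        updated += 1
    return updated
```

Existing metadata keys are left untouched; only the `custom_metadata` sub-document is added or overwritten.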


Step 5: Write the processed documents to MongoDB

To write the final processed documents to MongoDB, we will rerun the same pipeline, changing only the destination from local to MongoDB. The pipeline will not repeat the partitioning, chunking and embedding steps, because their results are already cached in WORK_DIR. Instead, it will pick up the customized embedding results and load them into a MongoDB collection.


Next, we are going to use LangGraph to build our investment assistant. With LangGraph, we can build LLM systems as graphs with a shared state, conditional edges, and cyclic loops between nodes.

Step 6: Define graph state

Let's first define the state of our graph. The state is a mutable object that tracks different attributes as we pass through the nodes in the graph. We can include custom attributes within the state that represent parameters we want to track.
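As a sketch, the state could be declared as a `TypedDict`. The `metadata`, `filter`, `context` and `memory` keys match the node outputs shown in Step 10; the `question` key and the choice of reducer are assumptions. The example uses `operator.add` to append to `memory` so that it needs only the standard library; LangGraph also provides an `add_messages` reducer in `langgraph.graph.message` for chat message lists.

```python
import operator
from typing import Annotated, Dict, List, TypedDict


class GraphState(TypedDict):
    """Attributes tracked as the graph moves from node to node."""

    question: str  # assumed key holding the user's query
    metadata: Dict  # metadata extracted from the question
    filter: Dict  # MongoDB filter definition built from the metadata
    context: str  # text of the retrieved documents
    # The Annotated reducer tells LangGraph to append to this list
    # rather than overwrite it when a node returns a new value.
    memory: Annotated[List, operator.add]
```

Keys without a reducer are simply replaced by whatever value the most recent node returns.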


Step 7: Define graph nodes

Next, let's add the graph nodes. Nodes in LangGraph are functions or tools that your system has access to in order to complete the task. Each node updates one or more attributes in the graph state with its return value after it executes. Our assistant has four nodes:

  1. Metadata Extractor: Extract metadata from a natural language query
  2. Filter Generator: Generate a MongoDB Query API filter definition
  3. MongoDB Atlas Vector Search: Retrieve documents from MongoDB using semantic search
  4. Answer Generator: Generate an answer to the user question

Metadata Extractor


Filter Generator
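To show the shape of the filter this node produces, here is a deterministic stand-in written in plain Python. It is not the notebook's node, which has an LLM generate the filter. The function name and the rule of casting year values to integers are assumptions, chosen to reproduce the filter shown in the Step 10 output.

```python
def build_filter(metadata):
    """Turn {field: [values]} into a MongoDB filter; return {} when empty."""
    clauses = []
    for field, values in metadata.items():
        # Assumption: years are stored as integers, as in the sample filter.
        if field.endswith(".year"):
            values = [int(value) for value in values]
        if len(values) == 1:
            clauses.append({field: {"$eq": values[0]}})
        elif values:
            clauses.append({field: {"$in": values}})
    if not clauses:
        return {}
    # A single clause needs no $and wrapper.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

Given the metadata extracted for the Walmart 2023 query, this returns the same `$and` filter that appears in the sample run.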


MongoDB Atlas Vector Search
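A sketch of the aggregation pipeline behind this node follows. The `$vectorSearch` stage and its `index`, `path`, `queryVector`, `numCandidates`, `limit` and `filter` fields are part of Atlas Vector Search; the index name, the `embeddings` and `text` field names, and the default limits are assumptions for illustration.

```python
def build_vector_search_pipeline(
    query_vector,
    filter_=None,
    index="vector_index",  # assumed index name
    path="embeddings",  # assumed field holding the chunk embeddings
    limit=5,
    num_candidates=150,
):
    """Build an aggregation pipeline that runs a (pre-filtered) vector search."""
    stage = {
        "index": index,
        "path": path,
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    # Only add the pre-filter when one was generated.
    if filter_:
        stage["filter"] = filter_
    return [
        {"$vectorSearch": stage},
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

The pipeline can be passed to `collection.aggregate(...)` on a `pymongo` collection. The query vector must come from the same embedding model used at ingestion time, and any field referenced in the filter must be declared as a `filter` field in the vector search index definition.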


Answer Generator


Step 8: Define conditional edges

Conditional edges in LangGraph decide which node in the graph to visit next. Here, we have a single conditional edge to skip filter generation and go directly to the vector search step if no metadata was extracted from the user query.
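Such a routing function might look like the sketch below. The function name is hypothetical; the returned node names and the log lines mirror those in the Step 10 output.

```python
def check_metadata_extracted(state):
    """Return the name of the next node based on whether metadata was found."""
    print("---CHECK FOR METADATA---")
    if state.get("metadata"):
        print("---DECISION: GENERATE FILTER---")
        return "generate_filter"
    print("---DECISION: SKIP TO VECTOR SEARCH---")
    return "vector_search"
```

An empty metadata dict is falsy, so queries that mention no company or year go straight to the vector search node.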


Step 9: Build the graph/flow

This is where we actually define the flow of the graph by connecting the nodes with edges.


Step 10: Execute the graph

[41]
---EXTRACTING METADATA---
---CHECK FOR METADATA---
---DECISION: GENERATE FILTER---
Node extract_metadata:
{'metadata': {'metadata.custom_metadata.company': ['WALMART INC.'], 'metadata.custom_metadata.year': ['2023']}}
---GENERATING FILTER DEFINITION---
Node generate_filter:
{'filter': {'$and': [{'metadata.custom_metadata.company': {'$eq': 'WALMART INC.'}}, {'metadata.custom_metadata.year': {'$eq': 2023}}]}}
---PERFORMING VECTOR SEARCH---
Node vector_search:
{'context': 'DOCUMENTS INCORPORATED BY REFERENCE\n\nDocument Portions of the registrant\'s Proxy Statement for the Annual Meeting of Shareholders to be held May 31, 2023 (the "Proxy Statement")\n\nParts Into Which Incorporated Part III\n\nWalmart Inc. Form 10-K For the Fiscal Year Ended January 31, 2023\n\nTable of Contents\n\nWALMART INC. ANNUAL REPORT ON FORM 10-K FOR THE FISCAL YEAR ENDED JANUARY 31, 2023\n\nAll references in this Annual Report on Form 10-K, the information incorporated into this Annual Report on Form 10-K by reference to information in the Proxy Statement of Walmart Inc. for its Annual Shareholders\' Meeting to be held on May 31, 2023 and in the exhibits to this Annual Report on Form 10-K to "Walmart Inc.," "Walmart," "the Company," "our Company," "we," "us" and "our" are to the Delaware corporation named "Walmart Inc." and, except where expressly noted otherwise or the context otherwise requires, that corporation\'s consolidated subsidiaries.\n\nPART I\n\nWalmart International includes numerous formats divided into two major categories: retail and wholesale. These categories consist of many formats, including: supercenters, supermarkets, hypermarkets, warehouse clubs (including Sam\'s Clubs) and cash & carry, as well as eCommerce through walmart.com.mx, walmart.ca, flipkart.com, walmart.cn and other sites. Walmart International had net sales of $101.0 billion for fiscal 2023, representing 17% of our fiscal 2023 consolidated net sales, and had net sales of $101.0 billion and $121.4 billion for fiscal 2022 and 2021, respectively. The gross profit rate is lower than that of Walmart U.S. primarily because of its format mix.\n\nWalmart International\'s strategy is to create strong local businesses powered by Walmart which means being locally relevant and customer-focused in each of the markets it operates. 
We are being deliberate about where and how we choose to operate and continue to re-shape the portfolio to best enable long-term, sustainable and profitable growth. As such, we have taken certain strategic actions to strengthen our Walmart International portfolio for the long-term, which include the following highlights over the last three years:\n\nDivested of Walmart Argentina in November 2020.\n\nDivested of Asda Group Limited ("Asda"), our retail operations in the U.K., in February 2021.\n\nDivested of a majority stake in Seiyu, our retail operations in Japan, in March 2021.\n\nOmni-channel. Walmart U.S. provides an omni-channel experience to customers, integrating retail stores and eCommerce, through services such as pickup and delivery, in-home delivery, ship-from-store, and digital pharmacy fulfillment options. As of January 31, 2023, we had more than 4,600 pickup locations and more than 3,900 same-day delivery locations. Our Walmart+ membership offering provides enhanced omni-channel shopping benefits including unlimited free shipping on eligible items with no order minimum, unlimited delivery from store, fuel discounts, access to Paramount+ streaming service, and mobile scan & go for a streamlined in-store shopping experience. We have several eCommerce websites, the largest of which is walmart.com. We define eCommerce sales as sales initiated by customers digitally and fulfilled by a number of methods including our dedicated eCommerce fulfillment centers and leveraging our stores, as well as certain other business offerings that are part of our flywheel strategy, such as our Walmart Connect advertising business. The following table provides the approximate size of our retail stores as of January 31, 2023:\n\nOur strategy is to make every day easier for busy families, operate with discipline, sharpen our culture and become more digital, and make trust a competitive advantage. 
Making life easier for busy families includes our commitment to price leadership, which has been and will remain a cornerstone of our business, as well as increasing convenience to save our customers time. By leading on price, we earn the trust of our customers every day by providing a broad assortment of quality merchandise and services at everyday low prices ("EDLP"). EDLP is our pricing philosophy under which we price items at a low price every day so our customers trust that our prices will not change under frequent promotional activity. Everyday low cost ("EDLC") is our commitment to control expenses so our cost savings can be passed along to our customers.\n\nOur operations comprise three reportable segments: Walmart U.S., Walmart International and Sam\'s Club. Our fiscal year ends on January 31 for our United States ("U.S.") and Canadian operations. We consolidate all other operations generally using a one-month lag and on a calendar year basis. Our discussion is as of and for the fiscal years ended January 31, 2023 ("fiscal 2023"), January 31, 2022 ("fiscal 2022") and January 31, 2021 ("fiscal 2021"). During fiscal 2023, we generated total revenues of $611.3 billion, which was comprised primarily of net sales of $605.9 billion.'}
---GENERATING THE ANSWER---
Node generate_answer:
{'memory': [HumanMessage(content='DOCUMENTS INCORPORATED BY REFERENCE\n\nDocument Portions of the registrant\'s Proxy Statement for the Annual Meeting of Shareholders to be held May 31, 2023 (the "Proxy Statement")\n\nParts Into Which Incorporated Part III\n\nWalmart Inc. Form 10-K For the Fiscal Year Ended January 31, 2023\n\nTable of Contents\n\nWALMART INC. ANNUAL REPORT ON FORM 10-K FOR THE FISCAL YEAR ENDED JANUARY 31, 2023\n\nAll references in this Annual Report on Form 10-K, the information incorporated into this Annual Report on Form 10-K by reference to information in the Proxy Statement of Walmart Inc. for its Annual Shareholders\' Meeting to be held on May 31, 2023 and in the exhibits to this Annual Report on Form 10-K to "Walmart Inc.," "Walmart," "the Company," "our Company," "we," "us" and "our" are to the Delaware corporation named "Walmart Inc." and, except where expressly noted otherwise or the context otherwise requires, that corporation\'s consolidated subsidiaries.\n\nPART I\n\nWalmart International includes numerous formats divided into two major categories: retail and wholesale. These categories consist of many formats, including: supercenters, supermarkets, hypermarkets, warehouse clubs (including Sam\'s Clubs) and cash & carry, as well as eCommerce through walmart.com.mx, walmart.ca, flipkart.com, walmart.cn and other sites. Walmart International had net sales of $101.0 billion for fiscal 2023, representing 17% of our fiscal 2023 consolidated net sales, and had net sales of $101.0 billion and $121.4 billion for fiscal 2022 and 2021, respectively. The gross profit rate is lower than that of Walmart U.S. primarily because of its format mix.\n\nWalmart International\'s strategy is to create strong local businesses powered by Walmart which means being locally relevant and customer-focused in each of the markets it operates. 
We are being deliberate about where and how we choose to operate and continue to re-shape the portfolio to best enable long-term, sustainable and profitable growth. As such, we have taken certain strategic actions to strengthen our Walmart International portfolio for the long-term, which include the following highlights over the last three years:\n\nDivested of Walmart Argentina in November 2020.\n\nDivested of Asda Group Limited ("Asda"), our retail operations in the U.K., in February 2021.\n\nDivested of a majority stake in Seiyu, our retail operations in Japan, in March 2021.\n\nOmni-channel. Walmart U.S. provides an omni-channel experience to customers, integrating retail stores and eCommerce, through services such as pickup and delivery, in-home delivery, ship-from-store, and digital pharmacy fulfillment options. As of January 31, 2023, we had more than 4,600 pickup locations and more than 3,900 same-day delivery locations. Our Walmart+ membership offering provides enhanced omni-channel shopping benefits including unlimited free shipping on eligible items with no order minimum, unlimited delivery from store, fuel discounts, access to Paramount+ streaming service, and mobile scan & go for a streamlined in-store shopping experience. We have several eCommerce websites, the largest of which is walmart.com. We define eCommerce sales as sales initiated by customers digitally and fulfilled by a number of methods including our dedicated eCommerce fulfillment centers and leveraging our stores, as well as certain other business offerings that are part of our flywheel strategy, such as our Walmart Connect advertising business. The following table provides the approximate size of our retail stores as of January 31, 2023:\n\nOur strategy is to make every day easier for busy families, operate with discipline, sharpen our culture and become more digital, and make trust a competitive advantage. 
Making life easier for busy families includes our commitment to price leadership, which has been and will remain a cornerstone of our business, as well as increasing convenience to save our customers time. By leading on price, we earn the trust of our customers every day by providing a broad assortment of quality merchandise and services at everyday low prices ("EDLP"). EDLP is our pricing philosophy under which we price items at a low price every day so our customers trust that our prices will not change under frequent promotional activity. Everyday low cost ("EDLC") is our commitment to control expenses so our cost savings can be passed along to our customers.\n\nOur operations comprise three reportable segments: Walmart U.S., Walmart International and Sam\'s Club. Our fiscal year ends on January 31 for our United States ("U.S.") and Canadian operations. We consolidate all other operations generally using a one-month lag and on a calendar year basis. Our discussion is as of and for the fiscal years ended January 31, 2023 ("fiscal 2023"), January 31, 2022 ("fiscal 2022") and January 31, 2021 ("fiscal 2021"). During fiscal 2023, we generated total revenues of $611.3 billion, which was comprised primarily of net sales of $605.9 billion.'), AIMessage(content='During fiscal 2023, Walmart generated total revenues of $611.3 billion, primarily comprised of net sales of $605.9 billion. Walmart International had net sales of $101.0 billion, representing 17% of the fiscal 2023 consolidated net sales.')]}
---FINAL ANSWER---
During fiscal 2023, Walmart generated total revenues of $611.3 billion, primarily comprised of net sales of $605.9 billion. Walmart International had net sales of $101.0 billion, representing 17% of the fiscal 2023 consolidated net sales.
[42]
---EXTRACTING METADATA---
---CHECK FOR METADATA---
---DECISION: SKIP TO VECTOR SEARCH---
Node extract_metadata:
{'metadata': {}}
---PERFORMING VECTOR SEARCH---
Node vector_search:
{'context': ''}
---GENERATING THE ANSWER---
Node generate_answer:
{'memory': [HumanMessage(content=''), AIMessage(content='You asked for the sales summary for Walmart for 2023.')]}
---FINAL ANSWER---
You asked for the sales summary for Walmart for 2023.
[43]
---EXTRACTING METADATA---
---CHECK FOR METADATA---
---DECISION: SKIP TO VECTOR SEARCH---
Node extract_metadata:
{'metadata': {}}
---PERFORMING VECTOR SEARCH---
Node vector_search:
{'context': ''}
---GENERATING THE ANSWER---
Node generate_answer:
{'memory': [HumanMessage(content=''), AIMessage(content="I DON'T KNOW")]}
---FINAL ANSWER---
I DON'T KNOW