NVIDIA 01 Graph Triplet Extraction

Graph Triplet Extraction for Financial Documents

SEC (Securities and Exchange Commission) filings, such as 10-K reports, contain vast amounts of structured and unstructured data about a company's financials, risks, strategies, and operations. Extracting graph triplets from these documents provides several benefits:

Structured Representation: Converts unstructured text into structured knowledge in the form of (subject, relation, object) triplets, making it easier to analyze relationships between entities.
Enhanced Financial Analysis: Enables analysts to identify connections between companies, risks, financial metrics, and market conditions.
Scalability: Automates the extraction process for large volumes of filings across multiple companies and years.
Risk Assessment: Helps uncover hidden risks by linking entities (e.g., "TAUTACHROME INC.") to specific risk factors (e.g., "Market Risk").
Compliance and Strategy Insights: Identifies regulatory or operational dependencies that can impact business strategies.

By extracting graph triplets from SEC documents, we can transform raw text into actionable insights that are easier to query and visualize.

This notebook demonstrates how to extract graph triples from SEC 10-K filings using NVIDIA's AI endpoints. The extracted triples are useful for building a GraphRAG (Graph-based Retrieval-Augmented Generation) system, enhancing the knowledge graph with detailed financial information.

How Does a Knowledge Graph Help with Multiple SEC Documents?

A knowledge graph organizes extracted triplets into a connected network of entities and relationships. This structure is particularly valuable for analyzing multiple SEC filings:

Unified View Across Companies: Combines data from multiple filings into a single graph, enabling cross-company comparisons. Example: Identify common risk factors faced by companies in the same industry.
Queryable Relationships: Allows users to query specific relationships (e.g., "What market risks does a company face?") without manually sifting through documents.
Interconnected Insights: Links related entities across documents (e.g., subsidiaries, competitors, or shared risks). Example: Connect "TAUTACHROME INC" to its financial performance metrics and regulatory obligations.
Scalability for Large Datasets: Handles thousands of filings efficiently by representing them as nodes and edges in a graph. Example: Visualize how different companies are affected by the same regulation or market condition.
Improved Decision-Making: Provides a holistic view of a company's ecosystem, enabling better risk assessment, compliance tracking, and strategic planning.

A knowledge graph built from SEC filings transforms disparate data into an interconnected web of insights that can be queried and analyzed at scale.

What Will You Learn in This Notebook?

This notebook demonstrates how to extract graph triplets from SEC filings and build a knowledge graph. By the end of this notebook, you will learn:

Triplet Extraction: How to extract (subject, relation, object) triplets from SEC filings using natural language processing techniques. Example: ("TAUTACHROME INC", "Faces", "Market Risk").
Building a Knowledge Graph: How to construct a knowledge graph from extracted triplets using tools like NetworkX. Relabel nodes with meaningful entity names (e.g., "TAUTACHROME INC") and edges with relation names (e.g., "Faces").
Querying the Knowledge Graph: How to query the graph for insights using LangChain's GraphQAChain. Example Query: "What risk factors does TAUTACHROME INC face?"
Applications of Knowledge Graphs: Learn how knowledge graphs can be used for financial analysis, risk assessment, compliance tracking, and strategic decision-making.

By following this notebook, you will gain hands-on experience in transforming raw text from SEC filings into actionable insights through graph-based representations.

Import Necessary Libraries

[ ]

Ensure that your NVIDIA API key is set. This key is required to access NVIDIA's AI endpoints, which are used for processing the SEC filings. You can obtain your API key from NVIDIA's AI portal.
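
A minimal sketch of one way to make sure the key is available before any endpoint is called. The helper name `ensure_nvidia_api_key` is hypothetical, and both the `NVIDIA_API_KEY` variable name and the `nvapi-` prefix check are assumptions about how the key is usually supplied, not something this notebook confirms.

```python
import getpass
import os


def ensure_nvidia_api_key(env_var="NVIDIA_API_KEY"):
    """Return the API key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the key while it is typed, so it never shows up in cell output.
        key = getpass.getpass(f"Enter your {env_var}: ").strip()
    # Assumption: keys issued for NVIDIA's AI endpoints start with "nvapi-".
    if not key.startswith("nvapi-"):
        raise ValueError(f"{env_var} does not look like an NVIDIA API key")
    os.environ[env_var] = key
    return key
```

Calling `ensure_nvidia_api_key()` once near the top of the notebook keeps the key out of the saved notebook file while still making it visible to later cells through the environment.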

[ ]

Sample Dataset

For this example, we'll use 2021 SEC documents hosted on Kaggle and store them in our data directory.

Define Functions for Preprocessing and Triple Extraction

Preprocess JSON Content

This function preprocesses JSON content by decoding Unicode escape sequences and normalizing characters. Preprocessing ensures that the text is in a suitable format for extraction.
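
The description above can be sketched as follows. This is an illustrative version, not the notebook's own cell: it assumes the escapes appear as literal `\uXXXX` sequences in the text and that NFKC is an acceptable normalization form. It does not recombine surrogate pairs.

```python
import re
import unicodedata

# Matches a literal backslash-u followed by exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def preprocess_json_content(content):
    """Decode literal \\uXXXX escapes, then normalize the characters."""
    # Turn each literal escape sequence into the character it encodes.
    decoded = _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), content)
    # NFKC folds compatibility characters (ligatures, non-breaking spaces,
    # full-width forms) into their plain equivalents.
    return unicodedata.normalize("NFKC", decoded)
```

NFKC leaves typographic quotes such as U+2019 untouched, so a separate mapping would be needed if plain ASCII quotes are required downstream.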

[ ]

Preprocess Text

This function preprocesses text by replacing company-specific pronouns with the company name. This step is important for accurate entity recognition and disambiguation.
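
A rough sketch of this kind of substitution, assuming the references to resolve are "the Company", "we", "our", and "us". The pattern list is an assumption chosen for illustration; real filings may need a longer one.

```python
import re

# Case is matched explicitly so that "US" (United States) is left alone.
_COMPANY_REFERENCES = re.compile(r"\b([Tt]he Company|[Ww]e|[Oo]ur|[Uu]s)\b")


def preprocess_text(text, company_name):
    """Replace first-person and generic company references with the company name."""

    def replace(match):
        # "our" is possessive, so it becomes "<name>'s"; everything else becomes the name.
        if match.group(1).lower() == "our":
            return f"{company_name}'s"
        return company_name

    # A single pass avoids re-matching text inside an already inserted name.
    return _COMPANY_REFERENCES.sub(replace, text)
```

The replacement does not fix verb agreement ("We face" becomes "TAUTACHROME INC face"), which is usually acceptable because the output feeds an extraction prompt rather than a reader.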

[ ]

Process Response

This function processes the response from the language model, ensuring that the output is properly formatted as a list of graph triples.
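
One way to do this defensively is sketched below. It assumes the model was asked to answer with a Python-style list of 3-item tuples, possibly wrapped in extra prose; the actual response format depends on the prompt used in the notebook.

```python
import ast
import re


def process_response(response_text):
    """Return the well-formed (subject, relation, object) triples found in a model reply."""
    # Grab everything from the first "[" to the last "]" so surrounding prose is ignored.
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if not match:
        return []
    try:
        # literal_eval parses Python literals only; it never executes code.
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, (list, tuple)):
        return []
    triples = []
    for item in parsed:
        # Keep only items that are exactly three strings.
        if (
            isinstance(item, (list, tuple))
            and len(item) == 3
            and all(isinstance(part, str) for part in item)
        ):
            triples.append(tuple(part.strip() for part in item))
    return triples
```

Malformed items are dropped silently here; logging them instead can help when tuning the extraction prompt.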

[ ]

Extract Triples for Section

This function extracts graph triples for a given section of text. It uses LangChain's text splitting and prompt templates to generate triples that depict relationships between entities in the text.
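
The overall flow (split the section, fill a prompt per chunk, collect the triples) can be sketched without any third-party dependency. Note the swap: plain character chunking and `str.format` stand in for LangChain's text splitter and prompt template, and `llm` is a hypothetical callable that takes a prompt string and returns the model's reply. The prompt wording is also an assumption.

```python
import ast

PROMPT = (
    "Extract (subject, relation, object) triples from the text below. "
    "Answer with a Python list of 3-tuples of strings and nothing else.\n\n"
    "Text:\n{text}"
)


def split_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def extract_triples_for_section(section_text, llm, chunk_size=1000, overlap=100):
    """Call `llm` once per chunk and return the de-duplicated triples in first-seen order."""
    triples, seen = [], set()
    for chunk in split_text(section_text, chunk_size, overlap):
        reply = llm(PROMPT.format(text=chunk))
        try:
            parsed = ast.literal_eval(reply.strip())
        except (ValueError, SyntaxError):
            continue  # skip chunks whose reply is not a valid Python literal
        if not isinstance(parsed, (list, tuple)):
            continue
        for item in parsed:
            is_triple = (
                isinstance(item, (list, tuple))
                and len(item) == 3
                and all(isinstance(part, str) for part in item)
            )
            # Overlapping chunks often repeat a triple, so keep only the first copy.
            if is_triple and tuple(item) not in seen:
                seen.add(tuple(item))
                triples.append(tuple(item))
    return triples
```

Overlap between chunks reduces the chance that a relationship is lost because its subject and object fall on either side of a chunk boundary, at the cost of a few duplicate extractions.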

[ ]

Extract Triples from File

This function extracts triples from a given JSON file containing SEC filing data. It processes specific sections and generates triples for each section.
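
A sketch of the file-level loop, under stated assumptions: the filing is a JSON object whose section texts sit under keys such as `item_1`, `item_1A`, and `item_7` (hypothetical key names; check them against the dataset), and `extract_fn` is any callable that maps section text to a list of triples.

```python
import json

# Hypothetical section keys; adjust to match the fields in the dataset actually used.
SECTION_KEYS = ("item_1", "item_1A", "item_7")


def extract_triples_from_file(path, extract_fn, section_keys=SECTION_KEYS):
    """Load one filing and return {section_key: triples} for the sections that have text."""
    with open(path, encoding="utf-8") as handle:
        filing = json.load(handle)
    results = {}
    for key in section_keys:
        text = filing.get(key)
        # Skip sections that are missing, empty, or not plain text.
        if isinstance(text, str) and text.strip():
            results[key] = extract_fn(text)
    return results
```

Keeping the results keyed by section makes it possible to trace each triple back to the part of the filing it came from.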

[ ]

Extract graph triples from SEC 10-K documents

We're also going to save the results to file in JSON format, organizing the data for further analysis or integration into a knowledge graph.
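
Saving and reloading could look like the sketch below. The layout (a JSON object mapping a filing name to its list of triples) and the function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def save_triples(triples_by_file, output_path):
    """Write {filing_name: [triples]} to a JSON file, creating parent folders as needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # JSON has no tuple type, so each triple is stored as a 3-item list.
    serializable = {
        name: [list(triple) for triple in triples]
        for name, triples in triples_by_file.items()
    }
    with output_path.open("w", encoding="utf-8") as handle:
        json.dump(serializable, handle, indent=2, ensure_ascii=False)
    return output_path


def load_triples(path):
    """Read the file back and restore each triple as a tuple."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return {name: [tuple(triple) for triple in triples] for name, triples in data.items()}
```

Persisting the triples separately from the graph means the slow extraction step only has to run once.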

[ ]

Skip the next cell because of this lab's time limit. Feel free to run it after the lab to extract new graph triplets.

[ ]

Accelerated Graph Construction with cuGraph and NetworkX

Now that we have our graph triples, we can construct our full knowledge graph for the corpus of 10-K documents.

In the next section, we demonstrate how to construct a graph using NetworkX and optionally accelerate it with cuGraph (GPU-accelerated graph analytics library). The graph is built from triples extracted from SEC 10-K filings, where each triple represents a relationship between two entities. This process is useful for creating knowledge graphs that can be queried for insights or used in downstream machine learning tasks.
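
A minimal NetworkX sketch of the construction step. Storing the relation in an edge attribute named `relation` is an assumption made here for illustration. GPU acceleration through cuGraph's NetworkX backend is configured outside this code (for example through an environment variable set before `networkx` is imported) and depends on the installed versions, so it is not shown.

```python
import networkx as nx


def build_graph(triples):
    """Build a directed graph with one edge per (subject, relation, object) triple."""
    graph = nx.DiGraph()
    for subject, relation, obj in triples:
        # Nodes are created implicitly; the relation name is stored on the edge.
        graph.add_edge(subject, obj, relation=relation)
    return graph


triples = [
    ("TAUTACHROME INC", "Faces", "Market Risk"),
    ("TAUTACHROME INC", "Files", "10-K"),
]
graph = build_graph(triples)
```

A `DiGraph` keeps only one edge per ordered pair of nodes, so a later triple between the same two entities overwrites the earlier relation; `nx.MultiDiGraph` is the alternative when parallel relations must be preserved.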

[ ]

Save the knowledge graph object
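
One simple way to persist the graph is Python's `pickle` module, sketched below. This is an assumption about the storage format; the notebook's own cell may use a different one (GML, for instance).

```python
import pickle

import networkx as nx


def save_graph(graph, path):
    """Serialize the graph object to disk."""
    with open(path, "wb") as handle:
        pickle.dump(graph, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_graph(path):
    """Load a graph saved with save_graph."""
    # Only unpickle files you created yourself: pickle can run arbitrary code on load.
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle preserves node and edge attributes exactly, but the file is tied to Python, so a text format is a better fit when the graph has to be shared with other tools.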

[ ]

Run a sample query on the graph
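
Before involving a language model, it can help to look at the facts a question about an entity would be answered from. The sketch below is a plain NetworkX lookup, not the LangChain chain itself; it assumes relation names are stored in an edge attribute called `relation`, and `describe_entity` is a hypothetical helper.

```python
import networkx as nx


def describe_entity(graph, entity):
    """Return one 'subject relation object' sentence per outgoing edge of `entity`."""
    if entity not in graph:
        return []
    return [
        f"{entity} {data.get('relation', 'related to')} {target}"
        for _, target, data in graph.out_edges(entity, data=True)
    ]


graph = nx.DiGraph()
graph.add_edge("TAUTACHROME INC", "Market Risk", relation="Faces")
graph.add_edge("TAUTACHROME INC", "10-K", relation="Files")
facts = describe_entity(graph, "TAUTACHROME INC")
```

Sentences like these are the kind of context a graph question-answering chain passes to the model alongside the user's question.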

[ ]

Adjusting the PromptTemplate via LangChain

[ ]

Add custom context retrieval and chat template

[ ]