PDF Azure AI Document Intelligence
Overview
This notebook provides the following:
- Parses PDFs that contain text and tables with Azure AI Document Intelligence. Each PDF is saved as a JSON file so that it can be loaded into Elasticsearch.
- Loads the JSON files into Elasticsearch. This notebook uses the Elasticsearch Python client to create an index with E5 and ELSER semantic_text mappings.
- Once the data is loaded into Elasticsearch, you can ask questions in Playground and get answers grounded in truth. The index "id" field uses the following naming convention: PDF_FILENAME.pdf_PAGENUMBER. That allows you to see PDF and page number in the "document sources" link.
This notebook cannot be used to parse PDF images.
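The "id" naming convention described above (PDF_FILENAME.pdf_PAGENUMBER) can be illustrated with a small sketch. The helper names `make_doc_id` and `split_doc_id` are hypothetical and are not taken from the notebook.

```python
def make_doc_id(pdf_filename: str, page_number: int) -> str:
    """Build an id of the form PDF_FILENAME.pdf_PAGENUMBER."""
    return f"{pdf_filename}_{page_number}"


def split_doc_id(doc_id: str) -> tuple[str, int]:
    """Recover (filename, page number) from an id.

    Splits on the LAST underscore, so filenames that contain
    underscores themselves are still handled.
    """
    filename, _, page = doc_id.rpartition("_")
    return filename, int(page)


print(make_doc_id("report.pdf", 12))   # report.pdf_12
print(split_doc_id("report.pdf_12"))   # ('report.pdf', 12)
```

Keeping the page number as the final suffix is what lets the source link show both the PDF and the page.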
Install python dependencies
[ ]
Create a .env file that has the following entries.
Elasticsearch
- You must have a functional Elasticsearch environment that has an Enterprise-level license
- The fastest way to get up and running is to follow the Elastic Serverless Get started guide
ES_URL=?
ES_API_KEY=?
Azure AI Document Intelligence
AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=?
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=?
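Before running the rest of the notebook, it can help to confirm that all four entries are filled in. The sketch below assumes the .env file has already been loaded into the process environment (for example with python-dotenv's `load_dotenv()`); the helper name `missing_settings` is hypothetical.

```python
import os
from typing import Mapping

# The four entries listed in the .env file above.
REQUIRED_SETTINGS = (
    "ES_URL",
    "ES_API_KEY",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset, empty, or still the "?" placeholder."""
    return [
        name
        for name in REQUIRED_SETTINGS
        if env.get(name, "").strip() in ("", "?")
    ]


if __name__ == "__main__":
    # Assumption: the .env values were loaded into os.environ beforehand.
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All settings present.")
```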
Create input and output folders
- /pdf - place your PDF files in this input folder
- /json - the parser writes one JSON file per PDF to this output folder
[ ]
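A minimal sketch of creating the two folders, assuming they live under the current working directory. The function name `ensure_folders` is hypothetical.

```python
from pathlib import Path


def ensure_folders(base: Path = Path(".")) -> tuple[Path, Path]:
    """Create the pdf (input) and json (output) folders if they do not exist."""
    pdf_dir = base / "pdf"
    json_dir = base / "json"
    for folder in (pdf_dir, json_dir):
        # exist_ok=True makes the call safe to repeat on re-runs
        folder.mkdir(parents=True, exist_ok=True)
    return pdf_dir, json_dir


if __name__ == "__main__":
    pdf_dir, json_dir = ensure_folders()
    print(pdf_dir, json_dir)
```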
Download PDF files
- This notebook downloads 4 recent Elastic SEC 10-Q quarterly reports
- If you already have PDF files, feel free to place them in the ./pdf folder
[ ]
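The download step could look roughly like the sketch below, which uses only the standard library. The list of report URLs is not reproduced here, so `urls` is a hypothetical parameter you would supply yourself, and `download_pdfs` is an illustrative name rather than the notebook's own function.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_pdfs(urls: list[str], dest: Path) -> list[Path]:
    """Download each URL into dest, skipping files that already exist."""
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path segment of the URL as the local filename
        target = dest / Path(urlparse(url).path).name
        if not target.exists():
            with urllib.request.urlopen(url) as response, open(target, "wb") as out:
                shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Calling `download_pdfs(my_urls, Path("pdf"))` would place the files in the input folder; files that are already present are left untouched.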
Set Azure AI Document Intelligence Imports and Environment Variables
[ ]
Parse paragraphs using AnalyzeResult
This function extracts the paragraph text via an AnalyzeResult on a PDF file.
[ ]
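As a rough illustration of paragraph extraction, the sketch below groups paragraph text by page. It assumes the result object exposes `paragraphs`, each with `content` and `bounding_regions[...].page_number`, as the Azure SDK's AnalyzeResult does; it does not call the service, and `paragraphs_by_page` is a hypothetical name. A stand-in object is used so the example runs without credentials.

```python
from collections import defaultdict
from types import SimpleNamespace


def paragraphs_by_page(result) -> dict[int, list[str]]:
    """Group paragraph text by the page each paragraph starts on."""
    pages = defaultdict(list)
    for paragraph in getattr(result, "paragraphs", None) or []:
        regions = getattr(paragraph, "bounding_regions", None) or []
        if not regions:
            continue  # no page information, so the paragraph cannot be placed
        # File the paragraph under the first page it appears on
        pages[regions[0].page_number].append(paragraph.content)
    return dict(pages)


# Stand-in for an AnalyzeResult (assumption: same attribute names as the SDK)
sample = SimpleNamespace(paragraphs=[
    SimpleNamespace(content="First paragraph.",
                    bounding_regions=[SimpleNamespace(page_number=1)]),
    SimpleNamespace(content="Second paragraph.",
                    bounding_regions=[SimpleNamespace(page_number=2)]),
])
print(paragraphs_by_page(sample))
# {1: ['First paragraph.'], 2: ['Second paragraph.']}
```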
Parse tables using AnalyzeResult
This function extracts the table text via an AnalyzeResult on a PDF file.
[ ]
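One way to turn a table into text is to rebuild its grid from the cells and emit one line per row. The sketch assumes each table exposes `row_count`, `column_count`, and `cells` with `row_index`, `column_index`, and `content`, as the Azure SDK's table objects do. The function name `table_to_text` and the pipe-separated output format are illustrative choices, not the notebook's.

```python
from types import SimpleNamespace


def table_to_text(table) -> str:
    """Render a table as pipe-separated rows, one row per line."""
    grid = [["" for _ in range(table.column_count)]
            for _ in range(table.row_count)]
    for cell in table.cells:
        # Cells that span several rows or columns are written only
        # at their top-left position in this simplified version
        grid[cell.row_index][cell.column_index] = cell.content.strip()
    return "\n".join(" | ".join(row) for row in grid)


# Stand-in for a table object (assumption: same attribute names as the SDK)
sample_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Item"),
        SimpleNamespace(row_index=0, column_index=1, content="Value"),
        SimpleNamespace(row_index=1, column_index=0, content="x"),
        SimpleNamespace(row_index=1, column_index=1, content="1"),
    ],
)
print(table_to_text(sample_table))
```

Keeping each row on its own line preserves the association between a label and its values, which helps when the text is later used to answer questions about figures.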
Combine paragraph and table text
[ ]
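Combining the two sources can be as simple as concatenating them page by page. This sketch assumes paragraph and table text are each available as a dict keyed by page number; that shape and the name `combine_page_text` are assumptions made for illustration.

```python
def combine_page_text(
    paragraphs: dict[int, list[str]],
    tables: dict[int, list[str]],
) -> dict[int, str]:
    """Merge paragraph and table text into one string per page."""
    combined = {}
    for page in sorted(set(paragraphs) | set(tables)):
        # Paragraph text first, then any tables found on the same page
        parts = paragraphs.get(page, []) + tables.get(page, [])
        combined[page] = "\n\n".join(parts)
    return combined


print(combine_page_text({1: ["Intro text."]}, {1: ["a | b"], 2: ["c | d"]}))
# {1: 'Intro text.\n\na | b', 2: 'c | d'}
```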
Bring it all together
[ ]
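The final step, writing one JSON file per PDF with one record per page, might be sketched as follows. Only the "id" field and its naming convention come from the overview; the `page_number` and `text` field names and the function `write_pdf_json` are hypothetical.

```python
import json
from pathlib import Path


def write_pdf_json(pdf_name: str, page_text: dict[int, str], json_dir: Path) -> Path:
    """Write one JSON file for a PDF, containing one record per page."""
    records = [
        {
            "id": f"{pdf_name}_{page}",   # PDF_FILENAME.pdf_PAGENUMBER
            "page_number": page,          # hypothetical field name
            "text": text,                 # hypothetical field name
        }
        for page, text in sorted(page_text.items())
    ]
    json_dir.mkdir(parents=True, exist_ok=True)
    out_path = json_dir / f"{Path(pdf_name).stem}.json"
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return out_path
```

For example, `write_pdf_json("report.pdf", {1: "page one text"}, Path("json"))` would produce `json/report.json` holding a single record with the id `report.pdf_1`.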
Set imports for the elasticsearch client and environment variables
[ ]
Create index in Elastic Cloud Serverless
[ ]
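An index with E5 and ELSER semantic_text mappings could be defined along these lines. The field names are hypothetical, and the two inference endpoint ids are assumed to be the preconfigured endpoints on your deployment; check which ids are available before using them.

```python
def build_index_mappings(
    e5_inference_id: str = ".multilingual-e5-small-elasticsearch",  # assumed endpoint id
    elser_inference_id: str = ".elser-2-elasticsearch",             # assumed endpoint id
) -> dict:
    """Return mappings with one semantic_text field per model."""
    return {
        "properties": {
            "id": {"type": "keyword"},
            "page_number": {"type": "integer"},
            # The raw page text is copied into both semantic_text fields
            "text": {"type": "text", "copy_to": ["text_e5", "text_elser"]},
            "text_e5": {"type": "semantic_text", "inference_id": e5_inference_id},
            "text_elser": {"type": "semantic_text", "inference_id": elser_inference_id},
        }
    }


# Usage with the Elasticsearch Python client (index name is hypothetical):
# es.indices.create(index="pdf-pages", mappings=build_index_mappings())
```

Using `copy_to` means each page's text is sent once and embedded by both models at ingest time, so the two retrieval approaches can be compared on the same data.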
[ ]
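Loading the JSON files into Elasticsearch, as described in the overview, can be sketched with a generator of bulk actions. The sketch assumes each JSON file holds a list of page records with an "id" field and reuses that value as the document `_id`, which is an assumption; the index name is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def bulk_actions(json_dir: Path, index_name: str) -> Iterator[dict]:
    """Yield one bulk action per page record found in json_dir."""
    for json_file in sorted(json_dir.glob("*.json")):
        records = json.loads(json_file.read_text(encoding="utf-8"))
        for record in records:
            yield {
                "_index": index_name,
                "_id": record["id"],   # assumption: reuse the page id as _id
                "_source": record,
            }


# Usage with the Elasticsearch Python client:
# import os
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch(os.environ["ES_URL"], api_key=os.environ["ES_API_KEY"])
# helpers.bulk(es, bulk_actions(Path("json"), "pdf-pages"))
```

Because the id is deterministic, re-running the load overwrites existing pages instead of creating duplicates.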
Prompt List
- Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?
- Provide an Income Taxes summary for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?
- How has the balance sheet changed for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?