Pdf Azure Ai Document Intelligence

Overview

This notebook does the following:

  1. Parses PDFs that contain text and tables with Azure AI Document Intelligence. Each PDF is saved as a JSON file so that it can be loaded into Elasticsearch.
  2. Loads the JSON files into Elasticsearch. The notebook uses the Elasticsearch Python client to create an index with E5 and ELSER semantic_text mappings.
  3. Lets you ask questions in Playground once the data is loaded, and get answers grounded in your documents. The index "id" field uses the naming convention PDF_FILENAME.pdf_PAGENUMBER, so the "document sources" link shows both the PDF and the page number.
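
The id convention can be illustrated with a small sketch. The helper names and the file name "report.pdf" are hypothetical and are not taken from the notebook; only the PDF_FILENAME.pdf_PAGENUMBER pattern comes from the text above.

```python
def make_doc_id(pdf_filename: str, page_number: int) -> str:
    """Build an id in the PDF_FILENAME.pdf_PAGENUMBER convention."""
    return f"{pdf_filename}_{page_number}"


def split_doc_id(doc_id: str) -> tuple[str, int]:
    """Recover (pdf_filename, page_number); splits on the LAST underscore
    so that file names containing underscores still round-trip."""
    filename, _, page = doc_id.rpartition("_")
    return filename, int(page)


doc_id = make_doc_id("report.pdf", 12)  # "report.pdf" is a placeholder name
print(doc_id)
print(split_doc_id(doc_id))
```

Splitting on the last underscore is a design choice in this sketch: the page number is always the final component, whatever the file name looks like.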

This notebook cannot be used to parse PDF images.

Install Python dependencies

[ ]

Create a .env file that has the following entries.

Elasticsearch

  • You must have a working Elasticsearch environment with an Enterprise-level license
  • The fastest way to get up and running is to follow the Elastic Serverless "Get started" guide

ES_URL=?
ES_API_KEY=?

Azure AI Document Intelligence

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=?
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=?
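
Before running the later steps it can help to confirm that all four entries are present. The sketch below is an assumption about how one might do that with the standard library only. The notebook itself may use a package such as python-dotenv instead, and the function names here are hypothetical.

```python
import os

# The four variable names listed in the .env sections above.
REQUIRED_SETTINGS = (
    "ES_URL",
    "ES_API_KEY",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY",
)


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into a dict (comments and blanks skipped)."""
    settings = {}
    if not os.path.exists(path):
        return settings
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_settings(settings):
    """Return the required names that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not settings.get(name)]


# Usage: settings = read_env_file(); print(missing_settings(settings))
```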

Create input and output folders

  • ./pdf - place your PDF files in this input folder
  • ./json - the parser writes one JSON file per PDF to this output folder
[ ]
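
A minimal sketch of this step, assuming the two folders live next to the notebook. The function name is hypothetical.

```python
from pathlib import Path


def ensure_folders(base="."):
    """Create the pdf (input) and json (output) folders if they do not exist."""
    base_path = Path(base)
    pdf_dir = base_path / "pdf"
    json_dir = base_path / "json"
    for folder in (pdf_dir, json_dir):
        # exist_ok=True makes the call safe to re-run
        folder.mkdir(parents=True, exist_ok=True)
    return pdf_dir, json_dir


# Usage: pdf_dir, json_dir = ensure_folders()
```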

Download PDF files

  • This notebook downloads four recent Elastic SEC 10-Q quarterly reports
  • If you already have PDF files, place them in the ./pdf folder instead
[ ]
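
The download cell is not shown here, so the following is only a sketch of one way to fetch a list of PDFs with the standard library. The URL list is deliberately left to the caller because the actual report URLs are not given in this text. The function name is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_pdfs(urls, pdf_dir="pdf"):
    """Download each URL into pdf_dir, skipping files that already exist."""
    target = Path(pdf_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        name = Path(urlparse(url).path).name
        destination = target / name
        if not destination.exists():
            urlretrieve(url, destination)
        saved.append(destination)
    return saved


# Usage: download_pdfs(["<url of a PDF>"], "pdf")
```

Skipping existing files means PDFs you placed in the folder yourself are never overwritten.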

Set Azure AI Document Intelligence Imports and Environment Variables

[ ]
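
As a rough sketch of this step (not the notebook's actual cell), a client could be built from the two Azure entries in the .env file. This assumes the azure-ai-documentintelligence package. The notebook may use a different SDK or version. The function name and the `settings` mapping (for example `os.environ`) are hypothetical.

```python
def build_document_client(settings):
    """Create a Document Intelligence client from a mapping of settings."""
    # Read the settings first so a missing entry fails fast with a KeyError.
    endpoint = settings["AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT"]
    api_key = settings["AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY"]

    # Imported lazily so this module loads even without the Azure SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(api_key)
    )


# Usage: import os; client = build_document_client(os.environ)
```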

Parse paragraphs using AnalyzeResult

This function extracts paragraph text from the AnalyzeResult returned for a PDF file.

[ ]
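
One possible shape for such a function is sketched below. It assumes the result object exposes `paragraphs`, each with `content` and `bounding_regions` whose entries carry a `page_number`, as the Azure SDK's AnalyzeResult does. Grouping by page is an assumption made here because the index ids are page-based. The function name is hypothetical.

```python
from collections import defaultdict


def paragraphs_by_page(result):
    """Group paragraph text by page number: {page_number: "text..."}."""
    pages = defaultdict(list)
    for paragraph in result.paragraphs or []:
        regions = paragraph.bounding_regions or []
        if not regions:
            continue  # no page information, so it cannot be assigned to a page
        # Assign the paragraph to the page where it starts.
        pages[regions[0].page_number].append(paragraph.content)
    return {page: "\n".join(chunks) for page, chunks in sorted(pages.items())}
```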

Parse tables using AnalyzeResult

This function extracts table text from the AnalyzeResult returned for a PDF file.

[ ]
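
A sketch of how table extraction might work, assuming each table in the result exposes `row_count`, `column_count`, `cells` (with `row_index`, `column_index`, `content`) and `bounding_regions`, as in the Azure SDK's AnalyzeResult. Rendering rows as pipe-separated text is an assumption of this sketch, and the function names are hypothetical.

```python
def table_to_text(table):
    """Render one table as lines of ' | '-separated cell text."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    return "\n".join(" | ".join(row) for row in grid)


def tables_by_page(result):
    """Group rendered tables by page number: {page_number: "table text..."}."""
    pages = {}
    for table in result.tables or []:
        regions = table.bounding_regions or []
        if not regions:
            continue
        # Assign the table to the page where it starts.
        pages.setdefault(regions[0].page_number, []).append(table_to_text(table))
    return {page: "\n\n".join(chunks) for page, chunks in sorted(pages.items())}
```

Keeping each row on one line preserves which values belong together, which matters when a question asks for a specific figure.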

Combine paragraph and table text

[ ]
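
Combining the two could look like the sketch below. It assumes paragraph and table text have each been collected into a dict keyed by page number. That layout and the function name are assumptions, not the notebook's code.

```python
def combine_page_text(paragraph_text, table_text):
    """Merge two {page_number: text} dicts into one, paragraphs first."""
    combined = {}
    for page in sorted(set(paragraph_text) | set(table_text)):
        parts = [paragraph_text.get(page, ""), table_text.get(page, "")]
        # Drop empty parts so pages with only one kind of text stay clean.
        combined[page] = "\n\n".join(part for part in parts if part)
    return combined


print(combine_page_text({1: "Intro"}, {1: "A | B", 2: "C | D"}))
```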

Bring it all together

[ ]
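
The final parsing step writes one JSON file per PDF. The sketch below assumes the page text is already available as a dict keyed by page number. The "text" field name, the output file name, and the function names are assumptions. The "id" format follows the PDF_FILENAME.pdf_PAGENUMBER convention described in the overview.

```python
import json
from pathlib import Path


def build_records(pdf_filename, page_text):
    """One record per page, with an id of the form PDF_FILENAME.pdf_PAGENUMBER."""
    return [
        {"id": f"{pdf_filename}_{page}", "text": text}
        for page, text in sorted(page_text.items())
    ]


def write_records(pdf_filename, page_text, json_dir="json"):
    """Write the records for one PDF to <json_dir>/<pdf_filename>.json."""
    out_dir = Path(json_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{pdf_filename}.json"
    records = build_records(pdf_filename, page_text)
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return out_path


# Usage: write_records("report.pdf", {1: "page one text"})  # placeholder name
```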

Set imports for the elasticsearch client and environment variables

[ ]
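
A sketch of creating the client from the two Elasticsearch entries in the .env file, assuming the official elasticsearch Python package. The function name and the `settings` mapping (for example `os.environ`) are hypothetical.

```python
def build_es_client(settings):
    """Create an Elasticsearch client from a mapping of settings."""
    # Read the settings first so a missing entry fails fast with a KeyError.
    url = settings["ES_URL"]
    api_key = settings["ES_API_KEY"]

    # Imported lazily so this module loads even without the client installed.
    from elasticsearch import Elasticsearch

    return Elasticsearch(url, api_key=api_key)


# Usage: import os; es = build_es_client(os.environ)
```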

Create index in Elastic Cloud Serverless

[ ]
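
An index with E5 and ELSER semantic_text fields might be created as sketched below. The field names, the use of `copy_to`, and the two inference endpoint ids are assumptions: the ids shown are the preconfigured endpoints in recent Elasticsearch releases, so check which ones exist in your deployment. `es` is assumed to be an Elasticsearch client.

```python
def build_mappings(
    e5_endpoint=".multilingual-e5-small-elasticsearch",  # assumed endpoint id
    elser_endpoint=".elser-2-elasticsearch",             # assumed endpoint id
):
    """Mappings with one plain text field copied into two semantic_text fields."""
    return {
        "properties": {
            "id": {"type": "keyword"},
            "text": {"type": "text", "copy_to": ["text_e5", "text_elser"]},
            "text_e5": {"type": "semantic_text", "inference_id": e5_endpoint},
            "text_elser": {"type": "semantic_text", "inference_id": elser_endpoint},
        }
    }


def create_index(es, index_name):
    """Create the index if it does not exist; return True when it was created."""
    if es.indices.exists(index=index_name):
        return False
    es.indices.create(index=index_name, mappings=build_mappings())
    return True
```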
[ ]
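
Loading the JSON files can be sketched with the client's bulk helper. This assumes each file in the json folder holds a list of records with an "id" key. The function names are hypothetical.

```python
import json
from pathlib import Path


def iter_actions(json_dir, index_name):
    """Yield one bulk action per record found in the *.json files."""
    for path in sorted(Path(json_dir).glob("*.json")):
        for record in json.loads(path.read_text(encoding="utf-8")):
            # Reusing the record id as _id makes re-runs overwrite, not duplicate.
            yield {"_index": index_name, "_id": record["id"], "_source": record}


def load_json_folder(es, json_dir, index_name):
    """Send all actions to Elasticsearch with the bulk helper."""
    # Imported lazily so this module loads even without the client installed.
    from elasticsearch import helpers

    return helpers.bulk(es, iter_actions(json_dir, index_name))


# Usage: load_json_folder(es, "json", "<your index name>")
```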

Prompt List

  1. Compare and contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024.
  2. Provide an Income Taxes summary for Q2-2025, Q1-2025, Q4-2024 and Q3-2024.
  3. How has the balance sheet changed for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?