Batch OCR

OCR at scale via Mistral's Batch API


Apply OCR to Convert Images into Text

Optical Character Recognition (OCR) lets you retrieve text data from images. With Mistral OCR, you can do this quickly and accurately, extracting text from hundreds or thousands of images (or PDFs).

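In this simple cookbook, we will extract text from a set of images using two methods:

  • Without Batch: one OCR request per image
  • With Batch: all requests submitted together through Batch Inference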

Used

  • OCR
  • Batch Inference

Setup

First, let's install the mistralai and datasets packages (for example, with pip install mistralai datasets).

[ ]
Output: mistralai 1.5.0 and datasets 3.2.0 (with their dependencies) were already installed.

We can now set up our client. You can create an API key on La Plateforme.
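
The client setup could look like the sketch below. It assumes the key is stored in an environment variable named MISTRAL_API_KEY; that name and the two helper functions are illustrative choices, not taken from the original notebook.

```python
import os


def get_api_key(env_var: str = "MISTRAL_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return api_key


def make_client():
    """Build a Mistral client from the key found in the environment."""
    # Deferred import so get_api_key() can be used without the SDK installed.
    from mistralai import Mistral

    return Mistral(api_key=get_api_key())
```

Calling make_client() once and reusing the returned object for every request below keeps the key handling in one place.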

[ ]

Without Batch

As an example, let's use Mistral OCR to extract text from multiple images.

We will use a dataset containing raw image data. To send an image to the API as an image URL, we need to encode its bytes in base64 and wrap them in a data URL. For more information, please visit our Vision Documentation.
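
As a rough sketch of that encoding step, the hypothetical helper below accepts either raw bytes or an object with a PIL-style save() method and returns a base64 data URL. The function name and the JPEG default are assumptions made for illustration.

```python
import base64
from io import BytesIO


def encode_image_data(image, mime_type: str = "image/jpeg") -> str:
    """Return a base64 data URL for raw bytes or a PIL-style image object."""
    if isinstance(image, (bytes, bytearray)):
        # Raw bytes are encoded as-is and labelled with the given MIME type.
        raw = bytes(image)
    else:
        # Assumption: the object exposes PIL's Image.save(fp, format=...).
        # Images with an alpha channel may need converting to RGB first.
        buffer = BytesIO()
        image.save(buffer, format="JPEG")
        raw, mime_type = buffer.getvalue(), "image/jpeg"
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed wherever the API expects an image URL.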

[ ]

For this demo, we will use a simple dataset containing numerous documents and scans in image format. Specifically, we will use the HuggingFaceM4/DocumentVQA dataset, loaded via the datasets library.

We will download only 100 samples for this demonstration.
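
One way to fetch only the first 100 samples is to stream the dataset instead of downloading it in full. This is a sketch under assumptions: the "train" split and the load_subset helper are illustrative, and the original notebook may load the data differently.

```python
from itertools import islice


def load_subset(n_samples: int = 100, dataset_name: str = "HuggingFaceM4/DocumentVQA"):
    """Stream the dataset and keep only the first n_samples rows."""
    # Deferred import: requires the datasets package and network access.
    from datasets import load_dataset

    # streaming=True iterates over the remote data without a full download.
    # The "train" split is an assumption made for this sketch.
    stream = load_dataset(dataset_name, split="train", streaming=True)
    return list(islice(stream, n_samples))
```

Each returned row is a dictionary; the image itself is expected under a key such as "image", which is worth checking against the dataset card.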

[ ]

With our subset of 100 samples ready, we can loop through each image to extract the text.

We will save the results in a new dataset and export it as a JSONL file.
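
The loop and the export could be sketched as follows. The client is passed in as an argument, the model name "mistral-ocr-latest" is an assumption, and a plain list of dictionaries stands in for the new dataset; ocr_images and save_jsonl are hypothetical helper names.

```python
import json


def ocr_images(client, image_urls, model="mistral-ocr-latest"):
    """Send one OCR request per image URL and collect the extracted text."""
    results = []
    for index, url in enumerate(image_urls):
        response = client.ocr.process(
            model=model,
            document={"type": "image_url", "image_url": url},
        )
        # The response holds a list of pages; join their markdown text.
        text = "\n".join(page.markdown for page in response.pages)
        results.append({"id": index, "text": text})
    return results


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because every image costs one round trip, the total run time grows linearly with the number of samples.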

[ ]
100%|██████████| 100/100 [02:13<00:00,  1.33s/it]
[ ]

Perfect, we have extracted the text from all 100 samples. However, this process can be made more cost-efficient with Batch Inference.

With Batch

To use Batch Inference, we need to create a JSONL file containing all the image data and request information for our batch.

Let's create a function called create_batch_file to handle this task by generating a file in the proper format.
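
A minimal sketch of create_batch_file is shown here. It assumes each line of the batch file is a JSON object with a custom_id and a body holding the OCR request; the exact fields should be checked against the Batch API documentation.

```python
import json


def create_batch_file(image_urls, output_file):
    """Write one OCR request per line in JSONL format."""
    with open(output_file, "w", encoding="utf-8") as f:
        for index, url in enumerate(image_urls):
            entry = {
                # custom_id lets us match each result back to its input.
                "custom_id": str(index),
                "body": {
                    "document": {"type": "image_url", "image_url": url},
                },
            }
            f.write(json.dumps(entry) + "\n")
```

For example, create_batch_file(image_urls, "batch_file.jsonl") writes one request line per image, where image_urls is the list of base64 data URLs and the file name is arbitrary.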

[ ]

The next step is to encode each image in base64 and store the resulting data URL, which will go into the batch file.

[ ]
100%|██████████| 100/100 [00:04<00:00, 24.48it/s]

We can now create our batch file.

[ ]

With everything ready, we can upload the batch file to the API.
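
The upload could be sketched like this, assuming the SDK's files.upload call with purpose="batch"; upload_batch_file is a hypothetical wrapper and the client is passed in as an argument.

```python
import os


def upload_batch_file(client, path):
    """Upload the JSONL batch file and return the uploaded file object."""
    with open(path, "rb") as f:
        return client.files.upload(
            file={"file_name": os.path.basename(path), "content": f},
            # Marks the file as input for a batch job.
            purpose="batch",
        )
```

The id of the returned object is what the batch job refers to in the next step.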

[ ]

The file is uploaded, but the batch inference has not started yet. To initiate it, we need to create a job.
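
Creating the job might look like the sketch below. The model name, the "/v1/ocr" endpoint string, and the metadata value are assumptions to verify against the Batch API reference; start_ocr_batch_job is a hypothetical helper.

```python
def start_ocr_batch_job(client, file_id, model="mistral-ocr-latest"):
    """Create a batch job that runs OCR on every request in the uploaded file."""
    return client.batch.jobs.create(
        input_files=[file_id],
        model=model,
        # Assumption: the OCR endpoint is addressed as "/v1/ocr" in batch jobs.
        endpoint="/v1/ocr",
        # Free-form metadata; the key and value here are illustrative.
        metadata={"job_type": "testing"},
    )
```

The returned job object carries an id that is used to query progress afterwards.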

[ ]

Our batch is ready and running!

We can retrieve information using the following method:
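
A sketch of such a status check follows. It assumes the job object exposes status, total_requests, failed_requests, and succeeded_requests; job_summary is a hypothetical helper that formats them in the same shape as the output shown in this cookbook.

```python
def job_summary(client, job_id):
    """Fetch the job and return it together with a printable progress summary."""
    job = client.batch.jobs.get(job_id=job_id)
    finished = job.succeeded_requests + job.failed_requests
    # Guard against division by zero for a job with no requests.
    percent = round(finished / job.total_requests * 100, 2) if job.total_requests else 0.0
    summary = "\n".join(
        [
            f"Status: {job.status}",
            f"Total requests: {job.total_requests}",
            f"Failed requests: {job.failed_requests}",
            f"Successful requests: {job.succeeded_requests}",
            f"Percent done: {percent}%",
        ]
    )
    return job, summary
```

Printing the second return value gives a compact progress report for the job.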

[ ]
Status: QUEUED
Total requests: 100
Failed requests: 0
Successful requests: 0
Percent done: 0.0%

Let's automate this feedback loop and download the results once they are ready!
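
The polling loop could be sketched as below. It assumes that "QUEUED" and "RUNNING" are the only non-final statuses and that the finished job exposes an output_file id; wait_and_download and its poll_seconds and sleep parameters are illustrative.

```python
import time


def wait_and_download(client, job_id, poll_seconds=2.0, sleep=time.sleep):
    """Poll the job until it leaves the queue, then download its output file."""
    while True:
        job = client.batch.jobs.get(job_id=job_id)
        if job.status not in ("QUEUED", "RUNNING"):
            break
        # Wait between polls to avoid hammering the API.
        sleep(poll_seconds)
    if job.status != "SUCCESS":
        raise RuntimeError(f"Batch job ended with status {job.status}")
    return client.files.download(file_id=job.output_file)
```

Raising on any final status other than SUCCESS keeps a failed or cancelled job from being mistaken for an empty result.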

[ ]
Status: SUCCESS
Total requests: 100
Failed requests: 0
Successful requests: 100
Percent done: 100.0%
[ ]
<Response [200 OK]>

Done! With this method, you can perform OCR tasks in bulk in a very cost-effective way.