Elastic Esql Getting Started

Esql Getting Started

openai-chatgptlangchain-pythonchatgptesqlgenaielasticsearchelasticopenaiAIchatlogvectordatabasenotebooksPythonsearchgenaistackvectorelasticsearch-labslangchainapplications

alph-notebooks/elasticsearch-labs / esql-getting-started.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Getting started with ES|QL

In this notebook you'll learn the basics of the Elasticsearch Query Language (ES|QL). You'll be using the official Elasticsearch Python client.

You'll learn how to:

Run an ES|QL query
Use processing commands
Sort a table
Query data
Chain processing commands
Compute values
Calculate statistics
Access columns
Create a histogram
Enrich data
Process data

ℹ️ ES|QL is generally available as of Elastic stack version 8.14.0.

Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up here for a free trial.

Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.

Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.

First we need to install the elasticsearch Python client.

[1]

Collecting elasticsearch
  Downloading elasticsearch-8.14.0-py3-none-any.whl (480 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.2/480.2 kB 3.0 MB/s eta 0:00:00
Collecting elastic-transport<9,>=8.13 (from elasticsearch)
  Downloading elastic_transport-8.13.1-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.5/64.5 kB 5.3 MB/s eta 0:00:00
Requirement already satisfied: urllib3<3,>=1.26.2 in /usr/local/lib/python3.10/dist-packages (from elastic-transport<9,>=8.13->elasticsearch) (2.0.7)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from elastic-transport<9,>=8.13->elasticsearch) (2024.6.2)
Installing collected packages: elastic-transport, elasticsearch
Successfully installed elastic-transport-8.13.1 elasticsearch-8.14.0

[2]

Initialize the Elasticsearch client

Now we can instantiate the Elasticsearch Python client, providing the CLOUD ID and API key for your deployment.

ℹ️ If you're running Elasticsearch locally or on self-managed infrastructure, you'll need to pass in the Elasticsearch host instead. Read the docs about how to connect to Elasticsearch locally.

[3]

Elastic Cloud ID: ··········
Elastic API Key: ··········

Add sample data to Elasticsearch

Before we index our sample dataset, let's create an index named sample_data with the correct mappings.

[4]

Now we can index the data using the Elasticsearch Python client's bulk helpers.

[5]

Successfully indexed 7 documents.

[6]

[7]

Your first ES|QL query

Each ES|QL query starts with a source command. A source command produces a table, typically with data from Elasticsearch.

The FROM source command returns a table with documents from a data stream, index, or alias. Each row in the resulting table represents a document. This query returns up to 500 documents from the sample_data index:

[8]

@timestamp               | client_ip    | event_duration | message              
--------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected         
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error     
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error     
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error     
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1

Each column corresponds to a field, and can be accessed by the name of that field.

ℹ️ ES|QL keywords are case-insensitive. FROM sample_data is identical to from sample_data.

Processing commands

A source command can be followed by one or more processing commands, separated by a pipe character: |. Processing commands change an input table by adding, removing, or changing rows and columns. Processing commands can perform filtering, projection, aggregation, and more.

For example, you can use the LIMIT command to limit the number of rows that are returned, up to a maximum of 10,000 rows:

[9]

@timestamp               | client_ip    | event_duration | message              
--------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected

Sort a table

Another processing command is the SORT command. By default, the rows returned by FROM don’t have a defined sort order. Use the SORT command to sort rows on one or more columns:

[10]

@timestamp               | client_ip    | event_duration | message              
--------------------------------------------------------------------------------
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error     
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error     
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error     
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected         
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3

Query the data

Use the WHERE command to query the data. For example, to find all events with a duration longer than 5ms:

[11]

@timestamp               | client_ip   | event_duration | message         
--------------------------------------------------------------------------
2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153        | Connection error
2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755        | Connection error

WHERE supports several operators.

For example, you can use LIKE to run a wildcard query against the message column:

[12]

@timestamp               | client_ip    | event_duration | message              
--------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1

More processing commands

There are many other processing commands, like KEEP and DROP to keep or drop columns, ENRICH to enrich a table with data from indices in Elasticsearch, and DISSECT and GROK to process data. Refer to Processing commands for an overview.

Chain processing commands

You can chain processing commands, separated by a pipe character: |. Each processing command works on the output table of the previous command. The result of a query is the table produced by the final processing command.

The following example first sorts the table on @timestamp, and next limits the result set to 3 rows:

[13]

@timestamp               | client_ip   | event_duration | message              
-------------------------------------------------------------------------------
2023-10-23T13:55:01.543Z | 172.21.3.15 | 1756467        | Connected to 10.1.0.1
2023-10-23T13:53:55.832Z | 172.21.3.15 | 5033755        | Connection error     
2023-10-23T13:52:55.015Z | 172.21.3.15 | 8268153        | Connection error

ℹ️ The order of processing commands is important. First limiting the result set to 3 rows before sorting those 3 rows would most likely return a result that is different than this example, where the sorting comes before the limit.

Compute values

Use the EVAL command to append columns to a table, with calculated values. For example, the following query appends a duration_ms column. The values in the column are computed by dividing event_duration by 1,000,000. In other words: event_duration converted from nanoseconds to milliseconds.

[14]

@timestamp               | client_ip    | event_duration | message               | duration_ms
----------------------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3 | 3.450233   
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2 | 2.764889   
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected          | 1.232382   
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error      | 0.725448   
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error      | 8.268153   
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error      | 5.033755   
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1 | 1.756467

EVAL supports several functions. For example, to round a number to the closest number with the specified number of digits, use the ROUND function:

[15]

@timestamp               | client_ip    | event_duration | message               | duration_ms
----------------------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3 | 3.5        
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2 | 2.8        
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected          | 1.2        
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error      | 0.7        
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error      | 8.3        
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error      | 5.0        
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1 | 1.8

Calculate statistics

You can also use ES|QL to aggregate your data. Use the STATS ... BY command to calculate statistics.

For example, to calculate the median duration:

[16]

median_duration
---------------
2764889.0

You can calculate multiple stats with one command:

[17]

median_duration | max_duration
------------------------------
2764889.0       | 8268153

Use BY to group calculated stats by one or more columns. For example, to calculate the median duration per client IP:

[18]

median_duration | client_ip   
------------------------------
1232382.0       | 172.21.0.5  
2764889.0       | 172.21.2.113
3450233.0       | 172.21.2.162
3395111.0       | 172.21.3.15

Access columns

You can access columns by their name. If a name contains special characters, it needs to be quoted with backticks (`).

Assigning an explicit name to a column created by EVAL or STATS is optional. If you don’t provide a name, the new column name is equal to the function expression. For example:

[19]

@timestamp               | client_ip    | event_duration | message               | event_duration/1000000.0
-----------------------------------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3 | 3.450233                
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2 | 2.764889                
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected          | 1.232382                
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error      | 0.725448                
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error      | 8.268153                
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error      | 5.033755                
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1 | 1.756467

In this query, EVAL adds a new column named event_duration/1000000.0. Because its name contains special characters, to access this column, quote it with backticks:

[20]

MEDIAN(`event_duration/1000000.0`)
----------------------------------
2.764889

Create a histogram

To track statistics over time, ES|QL enables you to create histograms using the BUCKET function. BUCKET creates human-friendly bucket sizes and returns a value for each row that corresponds to the resulting bucket the row falls into.

ℹ️ The BUCKET function must be used together with the STATS ... BY command. It replaces the AUTO_BUCKET function which was removed in 8.14.0.

For example, the following query creates hourly buckets for the data on October 23rd and creates a histogram to count the number of events per hour.

[49]

COUNT(*) | bucket                  
-----------------------------------
2        | 2023-10-23T12:00:00.000Z
5        | 2023-10-23T13:00:00.000Z

Or the median duration per hour:

[27]

median_duration | bucket                  
------------------------------------------
3107561.0       | 2023-10-23T12:00:00.000Z
1756467.0       | 2023-10-23T13:00:00.000Z

Enrich data

ES|QL enables you to enrich a table with data from indices in Elasticsearch, using the ENRICH command.

ℹ️ Before you can use ENRICH, you first need to create and execute an enrich policy.

The following requests create and execute a policy called clientip_policy. The policy links an IP address to an environment ("Development", "QA", or "Production").

[45]

ObjectApiResponse({'status': {'phase': 'COMPLETE'}})

After creating and executing a policy, you can use it with the ENRICH command:

[46]

@timestamp               | event_duration | client_ip    | env        
----------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 3450233        | 172.21.2.162 | QA         
2023-10-23T12:27:28.948Z | 2764889        | 172.21.2.113 | QA         
2023-10-23T13:33:34.937Z | 1232382        | 172.21.0.5   | Development
2023-10-23T13:51:54.732Z | 725448         | 172.21.3.15  | Production 
2023-10-23T13:52:55.015Z | 8268153        | 172.21.3.15  | Production 
2023-10-23T13:53:55.832Z | 5033755        | 172.21.3.15  | Production 
2023-10-23T13:55:01.543Z | 1756467        | 172.21.3.15  | Production

You can use the new env column that’s added by the ENRICH command in subsequent commands. For example, to calculate the median duration per environment:

[47]

median_duration | env        
-----------------------------
3107561.0       | QA         
1232382.0       | Development
3395111.0       | Production

For more about data enrichment with ES|QL, refer to Data enrichment.

Process data

Your data may contain unstructured strings that you want to structure to make it easier to analyze the data. For example, the sample data contains log messages like:

	"Connected to 10.1.0.3"

By extracting the IP address from these messages, you can determine which IP has accepted the most client connections.

To structure unstructured strings at query time, you can use the ES|QL DISSECT and GROK commands. DISSECT works by breaking up a string using a delimiter-based pattern. GROK works similarly, but uses regular expressions. This makes GROK more powerful, but generally also slower.

In this case, no regular expressions are needed, as the message is straightforward: "Connected to ", followed by the server IP. To match this string, you can use the following DISSECT command:

[37]

@timestamp               | client_ip    | event_duration | message               | server_ip
--------------------------------------------------------------------------------------------
2023-10-23T12:15:03.360Z | 172.21.2.162 | 3450233        | Connected to 10.1.0.3 | 10.1.0.3 
2023-10-23T12:27:28.948Z | 172.21.2.113 | 2764889        | Connected to 10.1.0.2 | 10.1.0.2 
2023-10-23T13:33:34.937Z | 172.21.0.5   | 1232382        | Disconnected          | None     
2023-10-23T13:51:54.732Z | 172.21.3.15  | 725448         | Connection error      | None     
2023-10-23T13:52:55.015Z | 172.21.3.15  | 8268153        | Connection error      | None     
2023-10-23T13:53:55.832Z | 172.21.3.15  | 5033755        | Connection error      | None     
2023-10-23T13:55:01.543Z | 172.21.3.15  | 1756467        | Connected to 10.1.0.1 | 10.1.0.1

This adds a server_ip column to those rows that have a message that matches this pattern. For other rows, the value of server_ip is null.

You can use the new server_ip column that’s added by the DISSECT command in subsequent commands. For example, to determine how many connections each server has accepted:

[48]

COUNT(*) | server_ip
--------------------
1        | 10.1.0.3 
1        | 10.1.0.2 
1        | 10.1.0.1

ℹ️ To learn more about data processing with ES|QL, refer to Data processing with DISSECT and GROK.

Learn more

To learn more about ES|QL, refer to: