Elastic Crawler To Open Crawler Migration

Hello, future Open Crawler user!


This notebook is designed to help you migrate your Elastic Crawler configurations to Open Crawler-friendly YAML!

We recommend running each cell individually and in order, as each cell depends on the previous cells having been run. We also recommend running each cell only once, as re-running cells may result in errors or incorrect YAML files.

Setup

First, let's start by making sure elasticsearch and other required dependencies are installed and imported by running the following cell:

[ ]
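The setup cell is not shown here, but a minimal, dependency-free sketch of what it needs to accomplish is below. The package names `elasticsearch` and `yaml` (PyYAML) are the ones this kind of notebook typically relies on; treat the exact list as an assumption.

```python
# Sketch of a dependency check for the setup cell. In a notebook you would
# typically install with `%pip install elasticsearch pyyaml` first.
import importlib.util

REQUIRED = ["elasticsearch", "yaml"]  # assumed dependency list

def missing_packages(names):
    """Return the subset of module names that are not importable."""
    return [n for n in names if importlib.util.find_spec(n) is None]

todo = missing_packages(REQUIRED)
if todo:
    print(f"Please install before continuing: {', '.join(todo)}")
else:
    print("All dependencies present.")
```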

We are going to need a few things from your Elasticsearch deployment before we can migrate your configurations:

  • Your Elasticsearch Endpoint URL
  • Your Elasticsearch Endpoint Port number
  • An API key

You can find your Endpoint URL and port number by visiting your Elasticsearch Overview page in Kibana.

You can create a new API key from the Stack Management -> API keys menu in Kibana. Be sure to copy or write down your key in a safe place, as it will be displayed only once upon creation.

[ ]
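The connection cell itself depends on your deployment, but the shape of it can be sketched. Everything below is illustrative: the helper name, the example hostname, and the default of 443 when the port is left blank are assumptions, and the actual client call is shown only as a comment.

```python
# Sketch of the connection step (names and hostname are illustrative).
# The real cell would use the official Python client, roughly:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(hosts=[endpoint], api_key=api_key)
#   print(es.info()["tagline"])  # expect "You Know, for Search"

def build_endpoint(url, port):
    """Join the endpoint URL and port, defaulting to 443 when port is blank."""
    url = url.rstrip("/")
    port = str(port).strip() or "443"
    return f"{url}:{port}"

print(build_endpoint("https://my-deployment.es.us-east-1.aws.elastic.cloud", ""))
```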

If the cell above printed our tagline 'You Know, for Search', we are connected and ready to go!

If not, please double-check the endpoint URL, port number, and API key you provided above.

Step 1: Acquire Basic Configurations

First, we need to establish what Crawlers you have and their basic configuration details. This migration notebook will attempt to pull configurations for every distinct Crawler you have in your Elasticsearch instance.

[ ]
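As a rough illustration of what this step produces, the sketch below collects search hits into an in-memory dict keyed by each Crawler's document id. The sample hit shape and the `index_name` field are assumptions, not the notebook's exact query results.

```python
# Illustrative only: fold Elasticsearch search hits into a dict of basic
# Crawler configurations. Field names here are hypothetical.
sample_hits = [
    {"_id": "crawler-1", "_source": {"index_name": "search-blog"}},
    {"_id": "crawler-2", "_source": {"index_name": "search-docs"}},
]

def collect_crawler_configs(hits):
    """Map each Crawler's document id to its basic configuration fields."""
    return {hit["_id"]: dict(hit["_source"]) for hit in hits}

configs = collect_crawler_configs(sample_hits)
print(f"{len(configs)} Crawlers found: {sorted(configs)}")
```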

Before continuing, please verify in the output above that the correct number of Crawlers was found.

Now that we have some basic data about your Crawlers, let's use this information to get more configuration values!

Step 2: URLs, Sitemaps, and Crawl Rules

In the next cell, we will need to query Elasticsearch for information about each Crawler's domain URLs, seed URLs, sitemaps, and crawling rules.

[ ]
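To make the crawl-rules part of this step concrete, here is a hedged sketch of translating one Elastic Crawler rule into the shape Open Crawler reads under a domain's crawl rules. The field names (`policy`, `rule`, `pattern`, `type`) are assumptions based on the two products' documented rule concepts, not a verified mapping.

```python
# Hedged sketch: convert one Elastic Crawler rule dict into an
# Open Crawler-style rule dict. Key names are assumptions.
def to_open_crawler_rule(rule):
    """Convert one Elastic Crawler rule to an Open Crawler rule."""
    return {
        "policy": rule["policy"],   # e.g. "allow" or "deny"
        "type": rule["rule"],       # e.g. "begins", "contains", "regex"
        "pattern": rule["pattern"],
    }

elastic_rules = [
    {"policy": "deny", "rule": "begins", "pattern": "/private"},
    {"policy": "allow", "rule": "regex", "pattern": ".*"},
]
print([to_open_crawler_rule(r) for r in elastic_rules])
```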

Step 3: Extracting the Extraction Rules

In the next cell, we will find any extraction rules you set for your Elastic Crawlers.

[ ]
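As a sketch of what "finding extraction rules" can look like once the data is in hand, the snippet below groups extracted field names by the URL filter they apply to. The rule shape (`url_filter`, `field_name`) is hypothetical; the notebook's cell reads the real rules from your deployment.

```python
# Illustrative sketch: summarize extraction rules by URL filter.
# The dict keys used here are assumptions about the rule shape.
def summarize_extraction_rules(rules):
    """Group extraction-rule field names by the URL filter they apply to."""
    summary = {}
    for rule in rules:
        url_filter = rule.get("url_filter", "*")
        summary.setdefault(url_filter, []).append(rule["field_name"])
    return summary

rules = [
    {"url_filter": "/blog/*", "field_name": "author"},
    {"url_filter": "/blog/*", "field_name": "published_at"},
]
print(summarize_extraction_rules(rules))
```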

Step 4: Schedules

In the next cell, we will gather any specific time schedules your Crawlers have set. Please note that interval time schedules are not supported by Open Crawler and will be ignored.

[ ]
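The filtering described above (keep specific-time schedules, ignore interval schedules) can be sketched as follows. The schedule shape, with a `type` of `"cron"` versus `"interval"`, is an assumption for illustration.

```python
# Sketch: keep cron-style (specific time) schedules and set aside interval
# schedules, which Open Crawler does not support. Keys are assumptions.
def split_schedules(schedules):
    """Return (kept cron schedules, ignored interval schedules)."""
    kept, ignored = [], []
    for schedule in schedules:
        target = kept if schedule.get("type") == "cron" else ignored
        target.append(schedule)
    return kept, ignored

kept, ignored = split_schedules([
    {"type": "cron", "expression": "0 0 * * *"},
    {"type": "interval", "every": "2h"},
])
print(f"kept {len(kept)} schedule(s), ignored {len(ignored)} interval schedule(s)")
```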

Step 5: Creating the Open Crawler YAML configuration files

In this final step, we will create the actual YAML files you need to get up and running with Open Crawler!

The next cell performs some final transformations to the in-memory data structure that is keeping track of your configurations.

[ ]
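A minimal sketch of that final transformation: merge the pieces gathered in Steps 1-4 into one config dict per Crawler. The key names (`output_index`, `domains`, `schedule`) echo Open Crawler's documented top-level options, but the exact structure here is an assumption.

```python
# Hedged sketch: assemble per-Crawler data into a single config dict.
def assemble_config(basic, domains, schedules):
    """Combine basic details, domains, and schedules into one dict."""
    config = {"output_index": basic["index_name"], "domains": domains}
    if schedules:
        config["schedule"] = schedules
    return config

cfg = assemble_config(
    {"index_name": "search-blog"},
    [{"url": "https://example.com", "seed_urls": ["https://example.com/"]}],
    ["0 0 * * *"],
)
print(sorted(cfg))
```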

Wait! Before we continue on to creating our YAML files, we're going to need your input on a few things.

In the next cell, please enter the following details about the Elasticsearch instance you will be using with Open Crawler. This instance can be Elastic Cloud Hosted, Serverless, or a local instance.

  • The Elasticsearch endpoint URL
  • The port number of your Elasticsearch endpoint (Optional, will default to 443 if left blank)
  • An API key
[ ]

This is the final step! You have two options here:

  • The "Write to YAML" cell will create one YAML file for each Crawler you have.
  • The "Print to output" cell will print each Crawler's configuration YAML in the Notebook, so you can copy-paste them into your Open Crawler YAML files manually.

Feel free to run both! You can run Option 2 first to see the output before running Option 1 to save the configs into YAML files.

Option 1: Write to YAML file

[ ]
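To show the shape of Option 1 without pulling in dependencies, here is a sketch that renders a config dict as YAML with a tiny hand-rolled emitter (dicts and lists of scalars only) and prints it; the notebook's real cell would typically use PyYAML's `yaml.safe_dump` and write one file per Crawler. The config keys and the file-naming comment are assumptions.

```python
# Sketch of Option 1. A real cell would use yaml.safe_dump; this tiny
# emitter handles only nested dicts and lists of scalars.
def to_yaml_lines(mapping, indent=0):
    """Render a nested dict as simple YAML lines (sketch only)."""
    pad = "  " * indent
    lines = []
    for key, val in mapping.items():
        if isinstance(val, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(val, indent + 1))
        elif isinstance(val, list):
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {item}" for item in val)
        else:
            lines.append(f"{pad}{key}: {val}")
    return lines

config = {
    "output_index": "search-blog",
    "domains": {"url": "https://example.com"},
    "seed_urls": ["https://example.com/blog"],
}
yaml_text = "\n".join(to_yaml_lines(config)) + "\n"
print(yaml_text)
# To write one file per Crawler, something like:
#   Path(f"crawler-{config['output_index']}.yml").write_text(yaml_text)
```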

Option 2: Print to output

[ ]

Next Steps

Now that the YAML files have been generated, you can visit the Open Crawler GitHub repository to learn more about how to deploy Open Crawler: https://github.com/elastic/crawler#quickstart

If you find any problems with this Notebook, please feel free to create an issue in the elasticsearch-labs repository: https://github.com/elastic/elasticsearch-labs/issues