Getting Started with Tavily Crawl and Map
Web crawling is the process of automatically navigating through websites by following hyperlinks to discover numerous web pages and URLs. For autonomous web agents, this capability is essential for accessing deep web data which might be difficult to retrieve via search.
Tavily’s /crawl and /map endpoints seamlessly integrate crawling and scraping capabilities with the following features:
- URL Discovery: Identifies links via parallelized sitemap parsing and page traversal.
- Recursive Navigation: Balances breadth-first and depth-first search to explore nested pages efficiently.
- Content Extraction: Captures content from both dynamic and static web pages.
- Customizable: Users can define crawl scope and intent for tailored results.
Setup
Follow these steps to set up:
- Sign up for Tavily at app.tavily.com to get your API key. Refer to the screenshots linked below for step-by-step guidance:
- Copy your API keys from your Tavily and OpenAI account dashboard.
- Paste your API keys into the cell below and execute the cell.
Install dependencies in the cell below.
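The setup cells are not reproduced here, so the following is only a minimal sketch of one way to handle the keys. It assumes the Python SDK is installed from the `tavily-python` package and that the keys live in environment variables named `TAVILY_API_KEY` and `OPENAI_API_KEY`; the helper `load_api_key` is a hypothetical name introduced for illustration.

```python
import os

# Assumed install step (run in a shell or a notebook cell):
#   pip install tavily-python


def load_api_key(name, env=None):
    """Return the API key stored under `name`, or raise a clear error.

    `env` defaults to the process environment; passing a plain dict
    makes the helper easy to try out without touching real variables.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


# Example with an in-memory mapping instead of the real environment.
demo_key = load_api_key("TAVILY_API_KEY", {"TAVILY_API_KEY": " tvly-demo "})
print(demo_key)
```

Reading keys from the environment keeps them out of the notebook itself, which matters if the notebook is shared.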
Define the Target Website
We'll specify the base URL to crawl. For this example, we're using the Tavily webpage.
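As a rough sketch of what this step could look like, the snippet below defines the target and builds (without sending) a crawl request using only the standard library. The site address `https://tavily.com`, the endpoint URL `https://api.tavily.com/crawl`, the bearer-token header, and the helper names are all assumptions made for illustration; check them against the Tavily API reference before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://tavily.com"  # assumed address of the Tavily webpage
CRAWL_ENDPOINT = "https://api.tavily.com/crawl"  # assumed REST endpoint


def build_crawl_request(api_key, url, **params):
    """Build (but do not send) a POST request for the crawl endpoint."""
    body = {"url": url, **params}
    return urllib.request.Request(
        CRAWL_ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send a prepared request and decode the JSON response (needs network)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request is purely local; nothing is sent until send() is called.
request = build_crawl_request("tvly-demo", BASE_URL)
print(request.get_method(), request.full_url)
```

With a real key, `send(request)` would perform the crawl and return the parsed response.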
API Response Format
The crawler returns a standardized API response format with the following structure:
Let's view the data for one of the crawled pages.
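The response body is not shown in this text, so the sketch below works from an assumed shape: a dictionary with `base_url`, `response_time`, and a `results` list whose entries carry `url` and `raw_content`. The sample values are placeholders, not real crawl output, and `preview_page` is a hypothetical helper.

```python
def preview_page(response, index=0, max_chars=200):
    """Return the URL and a truncated content preview for one crawled page."""
    page = response["results"][index]
    content = page.get("raw_content") or ""
    suffix = "..." if len(content) > max_chars else ""
    return page["url"], content[:max_chars] + suffix


# Placeholder response with the assumed shape (not real crawl output).
sample_response = {
    "base_url": "example.com",
    "results": [
        {"url": "https://example.com/", "raw_content": "Example home page text."},
        {"url": "https://example.com/docs", "raw_content": "x" * 500},
    ],
    "response_time": 0.0,
}

url, preview = preview_page(sample_response, index=1, max_chars=50)
print(url)
print(preview)
```

Truncating the preview keeps notebook output readable, since raw page content can run to many thousands of characters.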
As you can see, we've combined sitemapping and extract functionality into a single crawl endpoint, allowing for seamless discovery of URLs and their associated raw content. If you're interested in just the links (without the full page content), you can use the Map endpoint. This endpoint performs the exact same sitemapping as crawl without returning the raw page content. It's a faster and more cost-effective way to retrieve all the links from a site.
The API response will only contain the sitemapped URLs.
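A minimal sketch of a Map call follows. It assumes the client object exposes a `map(url=...)` method and that the response keeps the discovered URLs as plain strings under `results`; both are assumptions about the SDK rather than something stated in this text. `FakeClient` is an offline stand-in so the sketch runs without a key or network access.

```python
def map_site(client, url):
    """Return the list of URLs discovered for `url` via the client's map method."""
    response = client.map(url=url)
    return list(response.get("results", []))


class FakeClient:
    """Offline stand-in that mimics the assumed map() response shape."""

    def map(self, url):
        return {
            "base_url": url,
            "results": [f"https://{url}/", f"https://{url}/docs"],
        }


links = map_site(FakeClient(), "example.com")
for link in links:
    print(link)
```

Swapping `FakeClient()` for a real SDK client should be the only change needed, provided the assumed method name and response shape hold.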
Crawl Depth and Breadth Visualization
In this example, if max_breadth=2, the crawler will reach only 2 of the blue nodes. If max_depth=3, the crawler will reach at most the yellow level.
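To make the two limits concrete, here is a toy breadth-first model over a hand-written link graph. It is not Tavily's actual traversal (which is described above as balancing breadth-first and depth-first search); it simply assumes that `max_breadth` caps the links followed from each page and `max_depth` caps the number of hops from the start page. The graph and the function name `reachable` are made up for illustration.

```python
from collections import deque


def reachable(graph, start, max_depth, max_breadth):
    """Toy crawl: follow at most `max_breadth` links per page and
    stop expanding pages that are `max_depth` hops from `start`."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in graph.get(page, [])[:max_breadth]:
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure: each key links to the pages in its list.
graph = {
    "root": ["a", "b", "c"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": ["a1x"],
}

print(reachable(graph, "root", max_depth=1, max_breadth=2))
print(reachable(graph, "root", max_depth=3, max_breadth=3))
```

In the first call only two of the root's three children are visited because of the breadth cap; in the second, the deepest page `a1x` is reached because it sits exactly three hops from the root.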
Deeper Crawl
Now, let's see it in action. Feel free to play around with the limit, max_depth, and max_breadth parameters. The limit parameter plays a crucial role in managing web crawling scope. It defines the maximum number of pages to be scraped, which becomes especially valuable when dealing with extensive websites or when following external links. If not specified, the crawler might endlessly navigate through interconnected pages, wasting resources and processing time.
Let's view all the links we crawled.
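The deeper crawl and the link listing could be sketched as below. The sketch assumes the client's `crawl` method accepts `url`, `limit`, `max_depth`, and `max_breadth` as keyword arguments and returns pages under `results`; the default values are arbitrary examples, and `StubClient` is an offline stand-in that only imitates the effect of `limit`.

```python
def deeper_crawl(client, url, limit=50, max_depth=2, max_breadth=10):
    """Run a bounded crawl and return the URLs of the pages that came back."""
    response = client.crawl(
        url=url,
        limit=limit,              # hard cap on the number of pages processed
        max_depth=max_depth,      # how many hops from the base URL
        max_breadth=max_breadth,  # how many links to follow per page
    )
    return [page["url"] for page in response.get("results", [])]


class StubClient:
    """Offline stand-in that mimics the assumed crawl() signature."""

    def __init__(self, urls):
        self.urls = urls
        self.last_call = None

    def crawl(self, url, **params):
        self.last_call = {"url": url, **params}
        limit = params.get("limit") or len(self.urls)
        pages = [{"url": u, "raw_content": ""} for u in self.urls[:limit]]
        return {"base_url": url, "results": pages}


stub = StubClient([f"https://example.com/page{i}" for i in range(5)])
crawled_links = deeper_crawl(stub, "example.com", limit=3)
for link in crawled_links:
    print(link)
```

Setting `limit` explicitly, as here, is the simplest guard against a crawl that grows far beyond what was intended.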
Intelligent Crawling
The instructions parameter allows the crawler to intelligently navigate through a website using natural language. For best performance, keep the instruction specific and concise - think of it as a query rather than a prompt.
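A hedged sketch of an instruction-guided crawl is shown below. It assumes `instructions` is passed to the client's `crawl` method as a keyword argument alongside the other parameters; the instruction text is a made-up example, and `RecordingClient` is an offline stand-in that records the call rather than performing any real filtering.

```python
def guided_crawl(client, url, instructions, **params):
    """Crawl `url` with a natural-language instruction and
    return a mapping of page URL to raw content."""
    response = client.crawl(url=url, instructions=instructions, **params)
    return {
        page["url"]: page.get("raw_content", "")
        for page in response.get("results", [])
    }


class RecordingClient:
    """Offline stand-in: records the arguments and returns one placeholder page."""

    def __init__(self):
        self.calls = []

    def crawl(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "base_url": kwargs["url"],
            "results": [
                {"url": f"https://{kwargs['url']}/docs/sdk", "raw_content": "placeholder"}
            ],
        }


recorder = RecordingClient()
pages = guided_crawl(
    recorder,
    "example.com",
    instructions="Find all pages about the Python SDK",  # short and query-like
    limit=10,
)
print(list(pages))
```

The instruction here is a single short phrase, in line with the advice to treat it as a query rather than a prompt.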
As you can see, the web crawler can dynamically retrieve relevant content.
Next Steps
Looking for more use cases? Check out our related notebooks: