Scrapegraph Burr Lancedb

🕷️ Chat with your webpage with scrapegraph, burr and lancedb

🔧 Install dependencies

[1]

🔑 Import ScrapeGraph and OpenAI API keys

You can find the Scrapegraph API key here

[2]
Scrapegraph API key:
··········
OpenAI API key:
··········

🚀 Define the extraction and query flow using burr and run it

burr is an open-source orchestration framework that makes it easy to develop applications that make decisions (chatbots, agents, simulations, etc.). It also ships with a self-hosted UI for tracing what happens inside the application.

Check the GitHub repo

Our goal is to define a flow (DAG) that:

  1. Fetches markdown from webpages (scrapegraph)
  2. Chunks the content and stores it in a vector store (lancedb)
  3. Queries the database and generates an answer using an LLM

We can see all of this as Nodes connected together in a Graph, where the Nodes are the actions we want to perform.

In burr, we define actions by adding the action decorator to each function and specifying which fields of the graph's state that function reads and which it writes.
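To make the reads/writes contract concrete, here is a minimal plain-Python sketch. It is a toy stand-in, not burr's decorator or its API: the `action` helper, the dict-based `State`, and the placeholder function bodies are all assumptions made for illustration. It only shows the idea that each step declares the state keys it consumes and the keys it produces.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def action(reads: List[str], writes: List[str]) -> Callable:
    """Toy stand-in for an action decorator: enforce declared reads and writes."""

    def decorator(fn: Callable[[State], State]) -> Callable[[State], State]:
        def wrapper(state: State) -> State:
            # The function only sees the keys it declared in `reads`.
            visible = {key: state[key] for key in reads}
            result = fn(visible)
            # The function may only return keys it declared in `writes`.
            undeclared = set(result) - set(writes)
            if undeclared:
                raise KeyError(f"undeclared writes: {sorted(undeclared)}")
            # Return a new state instead of mutating the old one.
            return {**state, **result}

        wrapper.reads, wrapper.writes = reads, writes
        return wrapper

    return decorator


@action(reads=["url"], writes=["markdown"])
def fetch_webpage(state: State) -> State:
    # Placeholder body: a real action would call the scraping service here.
    return {"markdown": f"# Content of {state['url']}"}


@action(reads=["markdown"], writes=["chunks"])
def embed_and_store(state: State) -> State:
    # Placeholder body: split on blank lines instead of embedding anything.
    return {"chunks": str(state["markdown"]).split("\n\n")}


# Run the two steps in order, threading the state through each one.
state: State = {"url": "https://scrapegraphai.com/"}
for step in (fetch_webpage, embed_and_store):
    state = step(state)
```

Declaring reads and writes up front is what lets an orchestrator check that each step's inputs exist before it runs and trace exactly which step changed which part of the state.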

All imports

[66]

🔍 Define fetch_webpage action to fetch and convert a webpage into markdown

Here we use markdownify to fetch a webpage and convert it into markdown, a format well suited to LLM ingestion.

You can find more info in the official scrapegraph documentation
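As a rough picture of what the markdownify call does on the wire, the sketch below builds (but does not send) the POST request using only the standard library. The endpoint URL is the one shown in this notebook's request logs; the `website_url` payload field and the `SGAI-APIKEY` header name are assumptions, and `build_markdownify_request` is a hypothetical helper, so check the official documentation before relying on them.

```python
import json
import urllib.request

# Endpoint taken from the request logs of this notebook.
MARKDOWNIFY_URL = "https://api.scrapegraphai.com/v1/markdownify"


def build_markdownify_request(website_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for a page as markdown."""
    # Assumption: the payload field is `website_url`.
    body = json.dumps({"website_url": website_url}).encode("utf-8")
    return urllib.request.Request(
        MARKDOWNIFY_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API key travels in an `SGAI-APIKEY` header.
            "SGAI-APIKEY": api_key,
        },
    )


request = build_markdownify_request("https://scrapegraphai.com/", "your-api-key")
# Sending is left out on purpose; with a valid key it would look like:
# with urllib.request.urlopen(request) as response:
#     payload = json.loads(response.read())
```

In the notebook itself the scrapegraph client takes care of this request, which is why the logs show only the client being initialized and the POST being made.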

[67]
💬 2024-12-29 13:56:08,576 🔑 Initializing Client
INFO:scrapegraph:🔑 Initializing Client
💬 2024-12-29 13:56:08,582 ✅ Client initialized successfully
INFO:scrapegraph:✅ Client initialized successfully
[68]

📁 Define embed_and_store action to chunk the markdown and store it in a local vector store

Define the data structure to hold the chunks in the lancedb vector store

[69]

Utils to create chunks based on the number of tokens
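A minimal sketch of token-based chunking, assuming whitespace-separated words as tokens. This is a stand-in for a model tokenizer, not the notebook's actual utility; `chunk_by_tokens` and its `overlap` parameter are hypothetical names introduced for illustration.

```python
from typing import List


def chunk_by_tokens(text: str, max_tokens: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `max_tokens` tokens.

    Tokens are whitespace-separated words here, a stand-in for a model tokenizer.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the window has reached the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks


chunks = chunk_by_tokens("one two three four five six seven", max_tokens=3, overlap=1)
# chunks == ["one two three", "three four five", "five six seven"]
```

A small overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of storing some tokens twice.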

[52]

Let's define the action. It creates a local vector store named webpages if one is not present and adds the chunks to the chunks table
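The create-if-missing-then-append pattern can be sketched with an in-memory stand-in. None of this is lancedb's API: `ToyStore`, `Chunk`, its fields, and `store_chunks` are hypothetical names, and no embeddings are computed. The point is only that the table is created on first use and later runs append to it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    # Hypothetical fields; the notebook's actual schema may differ.
    url: str
    position: int
    text: str


class ToyStore:
    """In-memory stand-in for a local database holding named tables."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Chunk]] = {}

    def open_or_create(self, name: str) -> List[Chunk]:
        # Create the table only on first use, so repeated runs keep earlier rows.
        return self.tables.setdefault(name, [])


def store_chunks(store: ToyStore, url: str, texts: List[str]) -> int:
    """Append one row per chunk to the `chunks` table and return its new size."""
    table = store.open_or_create("chunks")
    table.extend(Chunk(url=url, position=i, text=t) for i, t in enumerate(texts))
    return len(table)


store = ToyStore()
store_chunks(store, "https://scrapegraphai.com/", ["first chunk", "second chunk"])
total = store_chunks(store, "https://scrapegraphai.com/", ["third chunk"])
```

Because the second call reuses the existing table, `total` counts the rows from both calls.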

[53]

💬 Define ask_question action to retrieve the most relevant chunks from the vector store and query them with an LLM

Fetches the three most relevant chunks for the user query and generates an answer
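The retrieve-then-ask step can be sketched as follows. This toy uses word-count vectors and cosine similarity in place of a real embedding model and vector search, and it stops at building the prompt instead of calling an LLM. The helper names (`embed`, `cosine`, `top_k`, `build_prompt`) and the prompt wording are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "apples are red or green",
    "bananas are yellow",
    "green apples taste sour",
    "the sky is blue",
    "trees grow tall",
]
best = top_k("what color are apples", chunks)
prompt = build_prompt("what color are apples", best)
```

Limiting the context to a few top-ranked chunks keeps the prompt short and focused on the passages most likely to contain the answer.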

[54]

🤖 Define the burr application graph and run it

[ ]
[63]
[64]
💬 2024-12-29 13:51:54,568 🔍 Starting markdownify request for https://scrapegraphai.com/
INFO:scrapegraph:🔍 Starting markdownify request for https://scrapegraphai.com/
💬 2024-12-29 13:51:54,577 🚀 Making POST request to https://api.scrapegraphai.com/v1/markdownify
INFO:scrapegraph:🚀 Making POST request to https://api.scrapegraphai.com/v1/markdownify
💬 2024-12-29 13:51:57,646 ✅ Request completed successfully: POST https://api.scrapegraphai.com/v1/markdownify
INFO:scrapegraph:✅ Request completed successfully: POST https://api.scrapegraphai.com/v1/markdownify
💬 2024-12-29 13:51:57,649 ✨ Markdownify request completed successfully
INFO:scrapegraph:✨ Markdownify request completed successfully
Request ID: d646737f-2dbd-4c6d-aecb-fc5f2c3e132d
Markdown Content: [Star us on GitHub0](https://github.com/ScrapeGraphAI/Scrapegraph-ai)

## Transform Websites into  
Structured Data

### Just One Prompt Away

Transform any website into clean, organized data for AI a... (truncated)
[65]
The founders of ScrapeGraphAI are:

1. **** - Founder & Technical Lead
   - [LinkedIn profile of ](https://www.linkedin.com/in/perinim/)

2. **Marco Vinciguerra** - Founder & Software Engineer
   - [LinkedIn profile of Marco Vinciguerra](https://www.linkedin.com/in/marco-vinciguerra-7ba365242/)

3. **Lorenzo Padoan** - Founder & Product Engineer
   - [LinkedIn profile of Lorenzo Padoan](https://www.linkedin.com/in/lorenzo-padoan-4521a2154)

🖼️ Visualize the traces with Burr UI

[70]
[74]

🔗 Resources

Made with ❤️ by the ScrapeGraphAI Team