Scrapegraph Llama Index
🕷️ Extract Company Info with llama-index-tools-scrapegraphai
🔧 Install dependencies
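The dependencies can be installed with pip. The package name below follows the title of this notebook; the exact PyPI name may differ between releases, so treat it as an assumption and check the llama-index docs if the install fails:

```shell
pip install llama-index llama-index-tools-scrapegraphai pydantic pandas
```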
🔑 Import ScrapeGraph API key
You can find your ScrapeGraph API key in the ScrapeGraphAI dashboard.
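A minimal sketch of loading the key from an environment variable so it never appears in the notebook. The variable name `SGAI_API_KEY` and the placeholder value are assumptions; use whatever name your setup expects:

```python
import os

# Read the ScrapeGraph API key from the environment. Set it in your shell
# first, e.g.:  export SGAI_API_KEY="sgai-..."
# The setdefault below only provides a placeholder so the cell runs;
# replace it with your real key or remove it entirely.
os.environ.setdefault("SGAI_API_KEY", "sgai-your-api-key-here")
sgai_api_key = os.environ["SGAI_API_KEY"]
```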
📝 Defining an Output Schema for Webpage Content Extraction
If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.
Pydantic Schema Quick Guide
Types of Schemas
- Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.
```python
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
```

Example output JSON after AI extraction:

```json
{
  "title": "ScrapeGraphAI: The Best Content Extraction Tool",
  "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
```
- Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
```python
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")
```

Example output JSON after AI extraction:

```json
{
  "repositories": [
    {
      "name": "google-gemini/cookbook",
      "description": "Examples and guides for using the Gemini API",
      "stars": 8036,
      "forks": 1001,
      "today_stars": 649,
      "language": "Jupyter Notebook"
    },
    {
      "name": "TEN-framework/TEN-Agent",
      "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
      "stars": 3224,
      "forks": 311,
      "today_stars": 361,
      "language": "Python"
    }
  ]
}
```
Key Takeaways
- Simple Schema: Perfect for small, straightforward extractions.
- Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."
Both approaches give the AI a clear structure to follow, ensuring that the extracted content matches exactly what you need.
🚀 Initialize ScrapegraphToolSpec tools and start extraction
Here we use `scrapegraph_smartscraper` to extract structured data from a webpage using AI.
If you already have an HTML file, you can upload it and use `scrapegraph_local_scrape` instead.
You can find more info in the official llama-index documentation.
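A minimal sketch of the extraction call, assuming the `ScrapegraphToolSpec` import path and the `scrapegraph_smartscraper` signature shown below match the installed release of the tool package; verify both against the llama-index docs. The API call is guarded so the cell only hits the network when a key is configured:

```python
import os

# Prompt and target page from this notebook's example run.
prompt = "Extract information about science news articles"
url = "https://www.wired.com/tag/science/"

if os.environ.get("SGAI_API_KEY"):  # only call the API when a key is set
    from llama_index.tools.scrapegraph import ScrapegraphToolSpec

    tool_spec = ScrapegraphToolSpec()
    # schema= accepts a Pydantic model class like the ones defined above;
    # passing None lets the service infer a structure from the prompt.
    response = tool_spec.scrapegraph_smartscraper(
        prompt=prompt,
        url=url,
        api_key=os.environ["SGAI_API_KEY"],
        schema=None,
    )
    print(response)
```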
Print the response
Company Info:
```json
{
  "request_id": "2f41ad97-b0d3-4d44-8c0d-9734c347f595",
  "status": "completed",
  "website_url": "https://www.wired.com/tag/science/",
  "user_prompt": "Extract information about science news articles",
  "result": {
    "company_name": "WIRED",
    "description": "WIRED is a magazine that covers the intersection of technology, science, culture, and politics. It is the essential source of information and ideas that make sense of a world in constant transformation, illuminating how technology is changing every aspect of our lives.",
    "founders": [
      {
        "name": "Louis Rossetto",
        "role": "Co-founder",
        "linkedin": "NA"
      },
      {
        "name": "Jane Metcalfe",
        "role": "Co-founder",
        "linkedin": "NA"
      }
    ],
    "logo": "https://www.wired.com/verso/static/wired-us/assets/logo-header.svg",
    "partners": [],
    "pricing_plans": [],
    "contact_emails": [],
    "social_links": {
      "linkedin": "NA",
      "twitter": "https://twitter.com/wired/",
      "github": "NA"
    },
    "privacy_policy": "http://www.condenast.com/privacy-policy#privacypolicy",
    "terms_of_service": "https://www.condenast.com/user-agreement/",
    "api_status": "NA"
  },
  "error": ""
}
```
💾 Save the output to a CSV file
Let's create a pandas DataFrame and display tables with the extracted content.
Show flattened tables
Save the responses to CSV
Data saved to CSV files
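The flattening step can be sketched with `pandas.json_normalize`, which turns the nested `result` object into a one-row table with underscore-joined column names. The `response` stub below mimics the shape of the output above so the snippet is self-contained; in the notebook you would use the real response dict instead:

```python
import pandas as pd

# Stub with the same shape as the API response above (assumption: you would
# replace this with the actual `response` returned by the tool).
response = {
    "result": {
        "company_name": "WIRED",
        "description": "Technology and science magazine.",
        "social_links": {"twitter": "https://twitter.com/wired/"},
    }
}

# Flatten nested keys into columns like "social_links_twitter".
df = pd.json_normalize(response["result"], sep="_")

# Save the flattened table to CSV.
df.to_csv("company_info.csv", index=False)
print(df.columns.tolist())
```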
🔗 Resources
- 🚀 Get your API Key: ScrapeGraphAI Dashboard
- 🐙 GitHub: ScrapeGraphAI GitHub
- 💼 LinkedIn: ScrapeGraphAI LinkedIn
- 🐦 Twitter: ScrapeGraphAI Twitter
- 💬 Discord: Join our Discord Community
- 🦙 LlamaIndex: ScrapeGraph docs
Made with ❤️ by the ScrapeGraphAI Team