Scrapegraph SDK
🕷️ Extract Company Info with Official Scrapegraph SDK
🔧 Install dependencies
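The install cell is not shown on this page. As a rough sketch, the dependencies could be installed from Python by calling pip for the current interpreter. The package list is an assumption (`scrapegraph-py` for the SDK, `pydantic` for schemas, `pandas` for the CSV export), and the helper names are hypothetical.

```python
import subprocess
import sys

# Assumed package list; adjust it to match your environment.
PACKAGES = ["scrapegraph-py", "pydantic", "pandas"]


def pip_install_command(packages):
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", "--quiet", *packages]


def install(packages=PACKAGES):
    """Run pip; raises subprocess.CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(packages))


# install()  # uncomment to install the packages
```

In a notebook, the `%pip install` magic with the same package names would likely do the same job.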
🔑 Import ScrapeGraph API key
You can find the Scrapegraph API key here
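A minimal sketch of how the key could be loaded, assuming it lives in the `SGAI_API_KEY` environment variable and is requested with a hidden prompt when missing. The helper name `ensure_api_key` is hypothetical; the messages mirror the output shown on this page.

```python
import getpass
import os


def ensure_api_key(env_var="SGAI_API_KEY", prompt=getpass.getpass):
    """Return the key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        print(f"{env_var} not found in environment.")
        # getpass hides the typed characters, so the key is not echoed.
        key = prompt(f"Please enter your {env_var}: ")
        os.environ[env_var] = key
        print(f"{env_var} has been set in the environment.")
    return key


# sgai_api_key = ensure_api_key()  # uncomment to run interactively
```

Passing the prompt function as a parameter keeps the helper easy to exercise without typing a real key.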
SGAI_API_KEY not found in environment.
Please enter your SGAI_API_KEY: ··········
SGAI_API_KEY has been set in the environment.
📝 Defining an Output Schema for Webpage Content Extraction
If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.
Pydantic Schema Quick Guide
Types of Schemas
- Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
  "title": "ScrapeGraphAI: The Best Content Extraction Tool",
  "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
- Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")
# Example Output JSON after AI extraction
{
  "repositories": [
    {
      "name": "google-gemini/cookbook",
      "description": "Examples and guides for using the Gemini API",
      "stars": 8036,
      "forks": 1001,
      "today_stars": 649,
      "language": "Jupyter Notebook"
    },
    {
      "name": "TEN-framework/TEN-Agent",
      "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
      "stars": 3224,
      "forks": 311,
      "today_stars": 361,
      "language": "Python"
    }
  ]
}
Key Takeaways
- Simple Schema: Perfect for small, straightforward extractions.
- Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."
Both approaches give the AI a clear structure to follow, so the extracted content comes back in the shape you defined.
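For the company-info extraction in this notebook, the same nesting idea applies. The schema cell itself is not shown on this page, so the following is only a sketch: the field names are taken from the example output further down, while the class names and field descriptions are assumptions.

```python
from typing import List

from pydantic import BaseModel, Field


class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")
    linkedin: str = Field(description="LinkedIn profile URL of the founder")


class PricingPlanSchema(BaseModel):
    tier: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the plan as shown on the page")
    credits: int = Field(description="Number of credits included in the plan")


class SocialLinksSchema(BaseModel):
    linkedin: str = Field(description="LinkedIn page URL")
    twitter: str = Field(description="Twitter/X profile URL")
    github: str = Field(description="GitHub repository or organization URL")


# Top-level schema: scalar fields plus nested lists and one nested object.
class CompanyInfoSchema(BaseModel):
    company_name: str = Field(description="Name of the company")
    description: str = Field(description="Short description of the company")
    founders: List[FounderSchema] = Field(description="Founders of the company")
    logo: str = Field(description="URL of the company logo")
    partners: List[str] = Field(description="Names of partner companies")
    pricing_plans: List[PricingPlanSchema] = Field(description="Available pricing plans")
    contact_emails: List[str] = Field(description="Contact email addresses")
    social_links: SocialLinksSchema = Field(description="Social media links")
    privacy_policy: str = Field(description="URL of the privacy policy")
    terms_of_service: str = Field(description="URL of the terms of service")
    api_status: str = Field(description="URL of the API status page")
```

Keeping the founders, pricing plans, and social links in their own models means each one can later be turned into its own table.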
🚀 Initialize SGAI Client and start extraction
Initialize the client for scraping (there's also an async version here)
Here we use the SmartScraper service, which uses AI to extract structured data from a webpage.
If you already have an HTML file, you can upload it and use LocalScraper instead.
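The client and request cells are not shown on this page; the sketch below is one plausible shape for them. It assumes the `scrapegraph-py` package exposes `Client(api_key=...)` with a `smartscraper(website_url, user_prompt, output_schema)` method that returns a dict containing `request_id` and `result`. The helper names, the target URL, and the prompt in the commented usage are hypothetical.

```python
import json


def make_client(api_key):
    """Create the SDK client (assumes `scrapegraph-py` is installed)."""
    # Imported lazily so the helpers below can be used without the SDK.
    from scrapegraph_py import Client
    return Client(api_key=api_key)


def extract(client, website_url, user_prompt, output_schema=None):
    """Call SmartScraper and return (request_id, result)."""
    response = client.smartscraper(
        website_url=website_url,
        user_prompt=user_prompt,
        output_schema=output_schema,  # a Pydantic model class, or None
    )
    return response["request_id"], response["result"]


def print_response(request_id, result, label="Company Info"):
    """Print the request ID followed by the result as indented JSON."""
    print(f"Request ID: {request_id}")
    print(f"{label}:")
    print(json.dumps(result, indent=2))


# Example usage (needs a valid key and network access):
# client = make_client("your-sgai-api-key")
# request_id, result = extract(
#     client,
#     website_url="https://scrapegraphai.com/",
#     user_prompt="Extract info about the company",
#     output_schema=PageInfoSchema,
# )
# print_response(request_id, result)
```

Taking the client as a parameter keeps `extract` independent of how the client is built, which also makes it simple to swap in the async client mentioned above.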
Print the response
Request ID: 87a7ea1a-9dd4-4d1d-ae76-b419ead57c11
Company Info:
{
  "company_name": "ScrapeGraphAI",
  "description": "ScrapeGraphAI is a powerful AI scraping API designed for efficient web data extraction to power LLM applications and AI agents. It enables developers to perform intelligent AI scraping and extract structured information from websites using advanced AI techniques.",
  "founders": [
    {
      "name": "",
      "role": "Founder & Technical Lead",
      "linkedin": "https://www.linkedin.com/in/perinim/"
    },
    {
      "name": "Marco Vinciguerra",
      "role": "Founder & Software Engineer",
      "linkedin": "https://www.linkedin.com/in/marco-vinciguerra-7ba365242/"
    },
    {
      "name": "Lorenzo Padoan",
      "role": "Founder & Product Engineer",
      "linkedin": "https://www.linkedin.com/in/lorenzo-padoan-4521a2154/"
    }
  ],
  "logo": "https://scrapegraphai.com/images/scrapegraphai_logo.svg",
  "partners": [
    "PostHog",
    "AWS",
    "NVIDIA",
    "JinaAI",
    "DagWorks",
    "Browserbase",
    "ScrapeDo",
    "HackerNews",
    "Medium",
    "HackADay"
  ],
  "pricing_plans": [
    {
      "tier": "Free",
      "price": "$0",
      "credits": 100
    },
    {
      "tier": "Starter",
      "price": "$20/month",
      "credits": 5000
    },
    {
      "tier": "Growth",
      "price": "$100/month",
      "credits": 40000
    },
    {
      "tier": "Pro",
      "price": "$500/month",
      "credits": 250000
    }
  ],
  "contact_emails": [
    "contact@scrapegraphai.com"
  ],
  "social_links": {
    "linkedin": "https://www.linkedin.com/company/101881123",
    "twitter": "https://x.com/scrapegraphai",
    "github": "https://github.com/ScrapeGraphAI/Scrapegraph-ai"
  },
  "privacy_policy": "https://scrapegraphai.com/privacy",
  "terms_of_service": "https://scrapegraphai.com/terms",
  "api_status": "https://scrapegraphapi.openstatus.dev"
}
💾 Save the output to a CSV file
Let's create pandas DataFrames from the extracted content and display them as tables.
Show flattened tables
Save the results to CSV
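The flattening and export cells are not shown on this page. One way to do it, sketched under assumptions: split the nested result into one DataFrame per list or dict field, keep the scalar fields in a single-row table, and write each table to its own CSV file. The function names and the one-file-per-table layout are hypothetical.

```python
from pathlib import Path

import pandas as pd


def to_tables(result):
    """Split a nested result dict into flat DataFrames keyed by table name."""
    # Scalar fields go into a single-row "company" table.
    scalars = {k: v for k, v in result.items() if not isinstance(v, (list, dict))}
    tables = {"company": pd.DataFrame([scalars])}
    for key, value in result.items():
        if isinstance(value, dict):
            tables[key] = pd.DataFrame([value])
        elif isinstance(value, list):
            if value and isinstance(value[0], dict):
                # List of objects: one row per object.
                tables[key] = pd.DataFrame(value)
            else:
                # List of plain values: a single column named after the field.
                tables[key] = pd.DataFrame({key: value})
    return tables


def save_tables(tables, directory="."):
    """Write each table to <directory>/<name>.csv and return the file paths."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, df in tables.items():
        path = out / f"{name}.csv"
        df.to_csv(path, index=False)
        paths.append(path)
    return paths


# tables = to_tables(result)   # `result` is the extracted dict
# save_tables(tables)
# print("Data saved to CSV files")
```

Separate files avoid repeating the company-level columns on every founder or pricing row, at the cost of having several CSVs to manage.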
Data saved to CSV files
🔗 Resources
- 🚀 Get your API Key: ScrapeGraphAI Dashboard
- 🐙 GitHub: ScrapeGraphAI GitHub
- 💼 LinkedIn: ScrapeGraphAI LinkedIn
- 🐦 Twitter: ScrapeGraphAI Twitter
- 💬 Discord: Join our Discord Community
Made with ❤️ by the ScrapeGraphAI Team