Scrapegraph Sdk
🕷️ Extract Wired Science News with Official Scrapegraph SDK
🔧 Install dependencies
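The install cell itself is not shown here. As a minimal sketch, the check below reports which of the packages this notebook relies on are not yet importable. The module name `scrapegraph_py` (and the matching pip package `scrapegraph-py`) is an assumption, not something stated in the text; `pydantic` and `pandas` are used later in this notebook.

```python
from importlib.util import find_spec
from typing import List

def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Assumed import names; adjust them if your environment differs.
REQUIRED = ["scrapegraph_py", "pydantic", "pandas"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are importable.")
```

Anything listed as missing can then be installed with pip before running the remaining cells.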
🔑 Import ScrapeGraph API key
You can find your ScrapeGraph API key in the ScrapeGraphAI Dashboard.
SGAI_API_KEY not found in environment.
Please enter your SGAI_API_KEY: ··········
SGAI_API_KEY has been set in the environment.
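The output above is consistent with a cell that reads the key from the environment and prompts for it only when it is missing. A minimal sketch of that behaviour follows. The helper name `ensure_env_key` and its `ask` parameter are hypothetical; the original cell is not shown.

```python
import os
from getpass import getpass
from typing import Callable

def ensure_env_key(name: str = "SGAI_API_KEY",
                   ask: Callable[[str], str] = getpass) -> str:
    """Return the key from the environment, prompting once if it is absent."""
    value = os.environ.get(name)
    if not value:
        print(f"{name} not found in environment.")
        # getpass hides the typed characters, so the key does not end up
        # in the notebook output.
        value = ask(f"Please enter your {name}: ")
        os.environ[name] = value
        print(f"{name} has been set in the environment.")
    return value
```

Calling `sgai_api_key = ensure_env_key()` in a notebook cell would prompt on the first run and reuse the stored value afterwards.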
📝 Defining an Output Schema for Webpage Content Extraction
If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.
Pydantic Schema Quick Guide
Types of Schemas
- Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
  "title": "ScrapeGraphAI: The Best Content Extraction Tool",
  "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
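To see how the schema acts as a blueprint, the sketch below loads the example JSON into `PageInfoSchema` and shows that a payload with a missing field is rejected. The helper `is_valid_page` is hypothetical and only illustrates the check.

```python
from pydantic import BaseModel, Field, ValidationError

class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

raw = {
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured "
                   "content extraction from websites.",
}

# Building the model validates the payload against the schema.
page = PageInfoSchema(**raw)
print(page.title)

def is_valid_page(payload: dict) -> bool:
    """Return True when the payload satisfies PageInfoSchema."""
    try:
        PageInfoSchema(**payload)
        return True
    except ValidationError:
        return False
```

A payload such as `{"title": "Only a title"}` fails validation because `description` is required.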
- Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")
# Example Output JSON after AI extraction
{
  "repositories": [
    {
      "name": "google-gemini/cookbook",
      "description": "Examples and guides for using the Gemini API",
      "stars": 8036,
      "forks": 1001,
      "today_stars": 649,
      "language": "Jupyter Notebook"
    },
    {
      "name": "TEN-framework/TEN-Agent",
      "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
      "stars": 3224,
      "forks": 311,
      "today_stars": 361,
      "language": "Python"
    }
  ]
}
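Once a nested result is loaded into the schema, each list entry becomes a typed object rather than a plain dictionary. The sketch below illustrates this using the example values above (with the second description shortened); the variable names are illustrative only.

```python
from typing import List
from pydantic import BaseModel, Field

class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(
        description="List of GitHub trending repositories"
    )

raw = {
    "repositories": [
        {"name": "google-gemini/cookbook",
         "description": "Examples and guides for using the Gemini API",
         "stars": 8036, "forks": 1001, "today_stars": 649,
         "language": "Jupyter Notebook"},
        {"name": "TEN-framework/TEN-Agent",
         "description": "TEN Agent is a conversational AI powered by TEN",
         "stars": 3224, "forks": 311, "today_stars": 361,
         "language": "Python"},
    ]
}

# Nested dictionaries are converted into RepositorySchema instances.
trending = ListRepositoriesSchema(**raw)

# Typed attributes make simple aggregations straightforward.
top_today = max(trending.repositories, key=lambda repo: repo.today_stars)
total_stars = sum(repo.stars for repo in trending.repositories)
print(top_today.name, total_stars)
```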
Key Takeaways
- Simple Schema: Perfect for small, straightforward extractions.
- Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."
Both approaches give the AI a clear structure to follow, so the extracted content comes back in the shape you expect.
🚀 Initialize SGAI Client and start extraction
Initialize the client for scraping (an async version is also available).
Here we use the Smartscraper service to extract structured data from a webpage using AI.
If you already have an HTML file, you can upload it and use Localscraper instead.
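The client cell is not shown, so the following is only a sketch of what the call might look like. It assumes the SDK is importable as `scrapegraph_py`, exposes a `Client` class that takes `api_key`, and has a `smartscraper` method accepting `website_url`, `user_prompt` and an optional `output_schema`. The URL, the prompt and both helper names are hypothetical as well.

```python
from typing import Any, Dict, Optional

# Assumed target page and prompt; neither appears in the text above.
WIRED_SCIENCE_URL = "https://www.wired.com/category/science/"
PROMPT = "Extract the first 10 news in the page"

def build_request(website_url: str, user_prompt: str,
                  output_schema: Optional[type] = None) -> Dict[str, Any]:
    """Collect the keyword arguments for a Smartscraper call."""
    request: Dict[str, Any] = {
        "website_url": website_url,
        "user_prompt": user_prompt,
    }
    if output_schema is not None:
        # A Pydantic model class describing the expected output.
        request["output_schema"] = output_schema
    return request

def run_smartscraper(api_key: str, request: Dict[str, Any]) -> Any:
    """Send the request; needs the SDK, a valid key and network access."""
    # Imported lazily so the rest of the sketch runs without the SDK.
    from scrapegraph_py import Client  # assumed import path

    client = Client(api_key=api_key)  # assumed constructor signature
    return client.smartscraper(**request)  # assumed method and parameters

request = build_request(WIRED_SCIENCE_URL, PROMPT)
```

Keeping the request as plain data makes it easy to inspect before any credits are spent; `run_smartscraper(sgai_api_key, request)` would then perform the actual extraction.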
Print the response
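A minimal sketch of how output in the format shown below could be produced, assuming the response behaves like a dictionary with `request_id` and `result` keys (an assumption inferred from the printed output). The helper `format_response` is hypothetical, and the sample holds a single item with a placeholder request ID.

```python
import json

def format_response(response: dict, ensure_ascii: bool = True) -> str:
    """Render the request ID followed by the pretty-printed result."""
    body = json.dumps(response["result"], indent=2, ensure_ascii=ensure_ascii)
    return f"Request ID: {response['request_id']}\nScience News:\n{body}"

sample_response = {
    "request_id": "example-request-id",  # placeholder, not a real ID
    "result": {
        "news": [
            {
                "category": "Health",
                "title": "RFK Plans to Take on Big Pharma. "
                         "It’s Easier Said Than Done",
                "link": "https://www.wired.com/story/rfks-plan-to-take-on-big-pharma/",
                "author": "Emily Mullin",
            }
        ]
    },
}

print(format_response(sample_response))
```

By default `json.dumps` escapes non-ASCII characters, which is why a curly apostrophe appears as `\u2019` in the output. Passing `ensure_ascii=False` prints such characters as they are.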
Request ID: 6bf82e33-44af-4064-83c6-b447192d68da
Science News:
{
  "news": [
    {
      "category": "Science",
      "title": "The Study That Called Out Black Plastic Utensils Had a Major Math Error",
      "link": "https://www.wired.com/story/black-plastic-utensils-study-math-error-correction/",
      "author": "Beth Mole, Ars Technica"
    },
    {
      "category": "Environment",
      "title": "Generative AI and Climate Change Are on a Collision Course",
      "link": "https://www.wired.com/story/true-cost-generative-ai-data-centers-energy/",
      "author": "Sasha Luccioni"
    },
    {
      "category": "Xenotransplantation",
      "title": "A Third Person Has Received a Transplant of a Genetically Engineered Pig Kidney",
      "link": "https://www.wired.com/story/a-third-person-has-received-a-transplant-of-a-genetically-engineered-pig-kidney/",
      "author": "Emily Mullin"
    },
    {
      "category": "Health",
      "title": "Antibodies Could Soon Help Slow the Aging Process",
      "link": "https://www.wired.com/story/antibodies-could-soon-help-slow-the-aging-process/",
      "author": "Andrew Steele"
    },
    {
      "category": "Science",
      "title": "Good at Reading? Your Brain May Be Structured Differently",
      "link": "https://www.wired.com/story/good-at-reading-your-brain-may-be-structured-differently/",
      "author": "Mikael Roll"
    },
    {
      "category": "Health",
      "title": "Mega-Farms Are Driving the Threat of Bird Flu",
      "link": "https://www.wired.com/story/mega-farms-are-driving-the-threat-of-bird-flu/",
      "author": "Georgina Gustin"
    },
    {
      "category": "Health",
      "title": "RFK Plans to Take on Big Pharma. It\u2019s Easier Said Than Done",
      "link": "https://www.wired.com/story/rfks-plan-to-take-on-big-pharma/",
      "author": "Emily Mullin"
    },
    {
      "category": "Environment",
      "title": "How Christmas Trees Could Become a Source of Low-Carbon Protein",
      "link": "https://www.wired.com/story/how-christmas-trees-could-become-a-source-of-low-carbon-protein/",
      "author": "Alexa Phillips"
    },
    {
      "category": "Environment",
      "title": "Creating a Global Package to Solve the Problem of Plastics",
      "link": "https://www.wired.com/story/global-plastics-treaty-united-nations/",
      "author": "Susan Solomon"
    },
    {
      "category": "Environment",
      "title": "These 3 Things Are Standing in the Way of a Global Plastics Treaty",
      "link": "https://www.wired.com/story/these-3-things-are-standing-in-the-way-of-a-global-plastics-treaty/",
      "author": "Steve Fletcher and Samuel Winton"
    }
  ]
}
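The result above has the same nested shape as the repository example: a `news` list whose items carry `category`, `title`, `link` and `author`. A schema matching that shape could look like the sketch below. The class names `NewsItemSchema` and `ListNewsSchema` and the field descriptions are assumptions inferred from the output, since the schema cell is not shown.

```python
from collections import Counter
from typing import List
from pydantic import BaseModel, Field

class NewsItemSchema(BaseModel):
    category: str = Field(description="Category of the article")
    title: str = Field(description="Title of the article")
    link: str = Field(description="URL of the article")
    author: str = Field(description="Author of the article")

class ListNewsSchema(BaseModel):
    news: List[NewsItemSchema] = Field(description="List of news articles")

# Three items copied from the output above.
sample_result = {
    "news": [
        {"category": "Science",
         "title": "The Study That Called Out Black Plastic Utensils Had a Major Math Error",
         "link": "https://www.wired.com/story/black-plastic-utensils-study-math-error-correction/",
         "author": "Beth Mole, Ars Technica"},
        {"category": "Environment",
         "title": "Generative AI and Climate Change Are on a Collision Course",
         "link": "https://www.wired.com/story/true-cost-generative-ai-data-centers-energy/",
         "author": "Sasha Luccioni"},
        {"category": "Environment",
         "title": "How Christmas Trees Could Become a Source of Low-Carbon Protein",
         "link": "https://www.wired.com/story/how-christmas-trees-could-become-a-source-of-low-carbon-protein/",
         "author": "Alexa Phillips"},
    ]
}

parsed = ListNewsSchema(**sample_result)

# Count how many articles fall into each category.
per_category = Counter(item.category for item in parsed.news)
print(per_category.most_common())
```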
💾 Save the output to a CSV file
Let's create a pandas DataFrame and display the extracted content as a table.
Save it to CSV
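A minimal sketch of both steps, assuming the extracted items are available as a list of dictionaries like the `news` list above. The helper name `news_to_csv` is hypothetical, and the sample holds only two of the ten items.

```python
from typing import Dict, List
import pandas as pd

def news_to_csv(news: List[Dict[str, str]],
                path: str = "wired_news.csv") -> pd.DataFrame:
    """Build a DataFrame from the news items and write it to a CSV file."""
    df = pd.DataFrame(news)
    # index=False keeps the row numbers out of the file.
    df.to_csv(path, index=False)
    print(f"Data saved to {path}")
    return df

sample_news = [
    {"category": "Health",
     "title": "Antibodies Could Soon Help Slow the Aging Process",
     "link": "https://www.wired.com/story/antibodies-could-soon-help-slow-the-aging-process/",
     "author": "Andrew Steele"},
    {"category": "Health",
     "title": "Mega-Farms Are Driving the Threat of Bird Flu",
     "link": "https://www.wired.com/story/mega-farms-are-driving-the-threat-of-bird-flu/",
     "author": "Georgina Gustin"},
]
```

Calling `df = news_to_csv(sample_news)` writes the file and returns the DataFrame, which a notebook renders as a table when it is the last expression in a cell.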
Data saved to wired_news.csv
🔗 Resources
- 🚀 Get your API Key: ScrapeGraphAI Dashboard
- 🐙 GitHub: ScrapeGraphAI GitHub
- 💼 LinkedIn: ScrapeGraphAI LinkedIn
- 🐦 Twitter: ScrapeGraphAI Twitter
- 💬 Discord: Join our Discord Community
Made with ❤️ by the ScrapeGraphAI Team