ScrapeGraph SDK
🕷️ Extract GitHub Trending Repositories with the Official ScrapeGraph SDK
🔧 Install dependencies
🔑 Import ScrapeGraph API key
You can find your ScrapeGraph API key in the ScrapeGraphAI Dashboard.
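The cell that loads the key is not shown, so here is a minimal sketch of one way to do it. It assumes the key lives in the `SGAI_API_KEY` environment variable (the name that appears in the output) and falls back to a hidden prompt; the helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical.

```python
import os
from getpass import getpass


def ensure_api_key(name="SGAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    if not os.environ.get(name):
        # getpass hides the typed characters, so the key is not echoed into the output.
        os.environ[name] = prompt_fn(
            f"{name} not found in environment. Please enter your {name}: "
        )
        print(f"{name} has been set in the environment.")
    return os.environ[name]
```

Calling `sgai_api_key = ensure_api_key()` in a notebook prompts once and reuses the stored value on later runs of the cell.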
SGAI_API_KEY not found in environment. Please enter your SGAI_API_KEY: ··········
SGAI_API_KEY has been set in the environment.
📝 Defining an Output Schema for Webpage Content Extraction
If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.
Pydantic Schema Quick Guide
Types of Schemas
- Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
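To see how the schema acts as a blueprint, the example JSON can be loaded back into the model. This is only a sketch: it restates `PageInfoSchema` so the snippet runs on its own, and it uses plain keyword construction rather than any SDK-specific call.

```python
import json

from pydantic import BaseModel, Field


class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")


# The example output from above, as a raw JSON string.
raw = """
{
  "title": "ScrapeGraphAI: The Best Content Extraction Tool",
  "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
"""

# Building the model checks that both required fields are present and are strings.
page = PageInfoSchema(**json.loads(raw))
print(page.title)  # ScrapeGraphAI: The Best Content Extraction Tool
```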
- Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")

# Example Output JSON after AI extraction
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}
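The nested schema can be exercised the same way. The sketch below restates both models so it runs standalone, parses one record taken from the example output, and shows that an incomplete record is rejected; the `incomplete` payload is made up for illustration.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")


class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")


data = {
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook",
        }
    ]
}

# Each dict in the list is converted into a RepositorySchema instance.
parsed = ListRepositoriesSchema(**data)
top = parsed.repositories[0]
print(top.name, top.stars)

# Hypothetical bad payload: every field except "name" is missing.
incomplete = {"repositories": [{"name": "owner/repo"}]}
try:
    ListRepositoriesSchema(**incomplete)
    rejected = False
except ValidationError:
    rejected = True
print("incomplete record rejected:", rejected)
```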
Key Takeaways
- Simple Schema: Perfect for small, straightforward extractions.
- Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."
Both approaches give the AI a clear structure to follow, so the extracted content comes back in the shape you defined.
🚀 Initialize SGAI Client and start extraction
Initialize the client for scraping (an async version is also available).
Here we use the SmartScraper service to extract structured data from a webpage using AI.
If you already have an HTML file, you can upload it and use LocalScraper instead.
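The extraction cell itself is missing, so the following is a hedged sketch of what the call might look like. It assumes the client exposes a `smartscraper` method that takes `website_url`, `user_prompt`, and `output_schema` keyword arguments, which is my understanding of the `scrapegraph-py` package; the URL, the prompt text, and the helper name `fetch_trending_repositories` are illustrative and not taken from the notebook.

```python
def fetch_trending_repositories(client, output_schema=None):
    """Ask SmartScraper for the trending repositories and return the raw response.

    `client` is any object with a `smartscraper` method; the keyword names below
    are assumptions about the SDK and should be checked against its documentation.
    """
    return client.smartscraper(
        website_url="https://github.com/trending",  # assumed target page
        user_prompt="Extract the GitHub trending repositories",  # hypothetical prompt
        output_schema=output_schema,
    )


# Possible usage, assuming `pip install scrapegraph-py` and the key set earlier:
#   import os
#   from scrapegraph_py import Client
#   sgai_client = Client(api_key=os.environ["SGAI_API_KEY"])
#   repo_response = fetch_trending_repositories(sgai_client, ListRepositoriesSchema)
```

Passing the client in as an argument keeps the helper easy to try with a stub object before spending API credits.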
Print the response
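A minimal sketch of the printing step, assuming the response is a dictionary with `request_id` and `result` keys (inferred from the output format, not confirmed by the text). The function name `format_response` is hypothetical.

```python
import json


def format_response(response):
    """Render the request ID and the extracted result as readable text."""
    lines = [
        f"Request ID: {response['request_id']}",
        "Trending Repositories:",
        # indent=2 pretty-prints the nested structure.
        json.dumps(response["result"], indent=2),
    ]
    return "\n".join(lines)


sample = {
    "request_id": "1e3b00ff-4b55-497c-8046-8ec5503cdafd",
    "result": {"repositories": []},
}
print(format_response(sample))
```

By default `json.dumps` escapes non-ASCII characters as `\uXXXX` sequences, which is why some descriptions in the output appear that way; passing `ensure_ascii=False` prints them as readable text instead.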
Request ID: 1e3b00ff-4b55-497c-8046-8ec5503cdafd
Trending Repositories:
{
"repositories": [
{
"name": "Byaidu/PDFMathTranslate",
"description": "PDF scientific paper translation with preserved formats - \u57fa\u4e8e AI \u5b8c\u6574\u4fdd\u7559\u6392\u7248\u7684 PDF \u6587\u6863\u5168\u6587\u53cc\u8bed\u7ffb\u8bd1\uff0c\u652f\u6301 Google/DeepL/Ollama/OpenAI \u7b49\u670d\u52a1\uff0c\u63d0\u4f9b CLI/GUI/Docker",
"stars": 8902,
"forks": 633,
"today_stars": 816,
"language": "Python"
},
{
"name": "bigskysoftware/htmx",
"description": "htmx - high power tools for HTML",
"stars": 39143,
"forks": 1324,
"today_stars": 186,
"language": "JavaScript"
},
{
"name": "commaai/openpilot",
"description": "openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 275+ supported cars.",
"stars": 50945,
"forks": 9206,
"today_stars": 132,
"language": "Python"
},
{
"name": "google-gemini/cookbook",
"description": "Examples and guides for using the Gemini API",
"stars": 8108,
"forks": 1011,
"today_stars": 1221,
"language": "Jupyter Notebook"
},
{
"name": "stripe/stripe-ios",
"description": "Stripe iOS SDK",
"stars": 2179,
"forks": 994,
"today_stars": 19,
"language": "Swift"
},
{
"name": "RIOT-OS/RIOT",
"description": "RIOT - The friendly OS for IoT",
"stars": 5234,
"forks": 2017,
"today_stars": 168,
"language": "C"
},
{
"name": "zju3dv/EasyVolcap",
"description": "EasyVolcap: Accelerating Neural Volumetric Video Research",
"stars": 802,
"forks": 52,
"today_stars": 30,
"language": "Python"
},
{
"name": "TEN-framework/TEN-Agent",
"description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more. It offers real-time capabilities to see, hear, and speak, along with advanced tools like weather checks, web search, and RAG.",
"stars": 3245,
"forks": 313,
"today_stars": 296,
"language": "Python"
},
{
"name": "DS4SD/docling",
"description": "Get your documents ready for gen AI",
"stars": 15201,
"forks": 774,
"today_stars": 281,
"language": "Python"
},
{
"name": "Guovin/iptv-api",
"description": "\ud83d\udcfaIPTV\u7535\u89c6\u76f4\u64ad\u6e90\u66f4\u65b0\u5de5\u5177\uff1a\u2728\u592e\u89c6\u9891\u3001\ud83d\udcf1\u536b\u89c6\u3001\u2615\u5404\u7701\u4efd\u5730\u65b9\u53f0\u3001\ud83c\udf0f\u6e2f\u00b7\u6fb3\u00b7\u53f0\u3001\ud83c\udfa5\u7535\u5f71\u3001\ud83c\udfae\u6e38\u620f\u3001\ud83c\udfb5\u97f3\u4e50\u3001\ud83c\udfad\u7ecf\u5178\u5267\u573a\uff1b\u652f\u6301IPv4/IPv6\uff1b\u652f\u6301\u81ea\u5b9a\u4e49\u589e\u52a0\u9891\u9053\uff1b\u652f\u6301\u805a\u5408\u6e90\u3001\u4ee3\u7406\u6e90\u3001\u8ba2\u9605\u6e90\u3001\u5173\u952e\u5b57\u641c\u7d22\uff1b\u6bcf\u5929\u81ea\u52a8\u66f4\u65b0\u4e24\u6b21\uff0c\u7ed3\u679c\u53ef\u7528\u4e8eTVBox\u7b49\u64ad\u653e\u8f6f\u4ef6\uff1b\u652f\u6301\u5de5\u4f5c\u6d41\u3001Docker(amd64/arm64/arm v7)\u3001\u547d\u4ee4\u884c\u3001GUI\u8fd0\u884c\u65b9\u5f0f | IPTV live TV source update tool",
"stars": 9046,
"forks": 1938,
"today_stars": 101,
"language": "Python"
},
{
"name": "fatedier/frp",
"description": "A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.",
"stars": 87828,
"forks": 13502,
"today_stars": 64,
"language": "Go"
},
{
"name": "facebookresearch/AnimatedDrawings",
"description": "Code to accompany \"A Method for Animating Children's Drawings of the Human Figure\"",
"stars": 10766,
"forks": 955,
"today_stars": 38,
"language": "Python"
},
{
"name": "gorilla/websocket",
"description": "Package gorilla/websocket is a fast, well-tested and widely used WebSocket implementation for Go.",
"stars": 22633,
"forks": 3495,
"today_stars": 13,
"language": "Go"
},
{
"name": "DefiLlama/chainlist",
"description": "NA",
"stars": 2368,
"forks": 2476,
"today_stars": 5,
"language": "JavaScript"
},
{
"name": "open-telemetry/opentelemetry-collector",
"description": "OpenTelemetry Collector",
"stars": 4570,
"forks": 1497,
"today_stars": 4,
"language": "Go"
},
{
"name": "RocketChat/Rocket.Chat",
"description": "The communications platform that puts data protection first.",
"stars": 41169,
"forks": 10877,
"today_stars": 73,
"language": "TypeScript"
},
{
"name": "langgenius/dify",
"description": "Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.",
"stars": 54976,
"forks": 8083,
"today_stars": 127,
"language": "TypeScript"
}
]
}
💾 Save the output to a CSV file
Let's create a pandas DataFrame and display the extracted content as a table.
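One way this step could look, assuming the extracted result is a dictionary with a `repositories` list as in the output shown earlier. The two records are copied from the schema example; the variable names are illustrative.

```python
import pandas as pd

# Stand-in for the extracted result, using two records from the schema example.
result = {
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook",
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python",
        },
    ]
}

# A list of dicts maps directly to rows; the dict keys become the column names.
df = pd.DataFrame(result["repositories"])
print(df[["name", "stars", "today_stars", "language"]])
```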
Save it to CSV
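A short sketch of the save step using `DataFrame.to_csv`. The file name matches the one in the output message; the helper `save_repositories` is a hypothetical wrapper added for illustration.

```python
import pandas as pd


def save_repositories(records, path="trending_repositories.csv"):
    """Write a list of repository dicts to a CSV file and return the DataFrame."""
    df = pd.DataFrame(records)
    # index=False leaves out the row-number column, so the file holds only the data.
    df.to_csv(path, index=False)
    print(f"Data saved to {path}")
    return df
```

For example, `save_repositories(result["repositories"])` would write every extracted repository to `trending_repositories.csv` in the working directory, where `result` is the extracted dictionary.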
Data saved to trending_repositories.csv
🔗 Resources
- 🚀 Get your API Key: ScrapeGraphAI Dashboard
- 🐙 GitHub: ScrapeGraphAI GitHub
- 💼 LinkedIn: ScrapeGraphAI LinkedIn
- 🐦 Twitter: ScrapeGraphAI Twitter
- 💬 Discord: Join our Discord Community
Made with ❤️ by the ScrapeGraphAI Team