ScrapeGraphAI Scrapegraph SDK

alph-notebooks/scrapegraph-py / scrapegraph_sdk.ipynb

🕷️ Extract GitHub Trending Repositories with the Official Scrapegraph SDK

🔧 Install `dependencies`

🔑 Import `ScrapeGraph` API key

You can find your Scrapegraph API key in the ScrapeGraphAI Dashboard.

SGAI_API_KEY not found in environment.
Please enter your SGAI_API_KEY: ··········
SGAI_API_KEY has been set in the environment.
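
The prompt flow shown above can be sketched with the standard library alone. This is a minimal sketch, not the notebook's original cell: the helper name `ensure_api_key` and its `env` and `prompt` parameters are hypothetical and exist only to make the logic easy to exercise without a real terminal.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
    name: str = "SGAI_API_KEY",
) -> str:
    """Return the API key from `env`, prompting for it once if it is missing.

    `ensure_api_key`, `env`, and `prompt` are hypothetical names for this sketch.
    """
    existing = env.get(name)
    if existing:
        return existing

    print(f"{name} not found in environment.")
    # getpass hides the typed characters, which matches the masked input above.
    key = prompt(f"Please enter your {name}: ").strip()
    if not key:
        raise ValueError(f"{name} must not be empty")

    env[name] = key
    print(f"{name} has been set in the environment.")
    return key
```

In a notebook, calling `ensure_api_key()` with no arguments would read from and write to `os.environ`, so later cells can pick the key up from the environment.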

📝 Defining an `Output Schema` for Webpage Content Extraction

If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.

Pydantic Schema Quick Guide

Types of Schemas

Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.

from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}

Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.

from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")

# Example Output JSON after AI extraction
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}
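
To see how the nested schema above acts as a blueprint, the sketch below loads one entry of the example JSON into the models and rejects a payload that is missing fields. It assumes `pydantic` is installed; the helper `is_valid` is a hypothetical name introduced for illustration, and the models are constructed with keyword arguments so the code does not depend on a specific Pydantic major version.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")


class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(
        description="List of GitHub trending repositories"
    )


# One entry taken from the example output above.
example = {
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook",
        }
    ]
}

# Building the model validates every nested repository against RepositorySchema.
parsed = ListRepositoriesSchema(**example)
first = parsed.repositories[0]


def is_valid(payload: dict) -> bool:
    """Hypothetical helper: True if `payload` matches ListRepositoriesSchema."""
    try:
        ListRepositoriesSchema(**payload)
        return True
    except ValidationError:
        return False
```

After validation, fields are available as typed attributes (`first.stars` is an `int`), and a repository entry that omits required fields fails validation instead of passing through silently.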

Key Takeaways

Simple Schema: Perfect for small, straightforward extractions.
Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."

Both approaches give the AI a clear structure to follow, so the extracted content comes back with the fields and types you asked for.

🚀 Initialize `SGAI Client` and start extraction

Initialize the client for scraping. The SDK also provides an async version of the client.

Here we use the Smartscraper service to extract structured data from a webpage using AI.

If you already have an HTML file, you can upload it and use Localscraper instead.
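
The Smartscraper call described above might look roughly like the sketch below. Several things here are assumptions rather than content from the notebook: the wrapper name `extract_trending`, the target URL, the prompt text, and the keyword names `website_url`, `user_prompt`, and `output_schema`, which reflect how the `scrapegraph-py` client is commonly called. Check them against the SDK version you have installed. The wrapper accepts any object that exposes a `smartscraper` method, so it can be tried with a stand-in client before spending API credits.

```python
from typing import Any, Dict, Optional

# Assumed target page and prompt; adjust both to your use case.
TRENDING_URL = "https://github.com/trending"
PROMPT = "Extract the GitHub trending repositories"


def extract_trending(client: Any, output_schema: Optional[type] = None) -> Dict[str, Any]:
    """Hypothetical wrapper around the client's `smartscraper` method.

    `output_schema` is expected to be a Pydantic model class such as
    ListRepositoriesSchema; passing None lets the service choose the shape.
    """
    return client.smartscraper(
        website_url=TRENDING_URL,
        user_prompt=PROMPT,
        output_schema=output_schema,
    )


# Assumed real-world usage (requires the scrapegraph-py package and an API key):
#
#   import os
#   from scrapegraph_py import Client
#
#   client = Client(api_key=os.environ["SGAI_API_KEY"])
#   response = extract_trending(client, output_schema=ListRepositoriesSchema)
```

Keeping the request in a small function makes it easy to swap the URL or prompt and to reuse the same schema across several pages.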

Print the response

Request ID: 1e3b00ff-4b55-497c-8046-8ec5503cdafd
Trending Repositories:
{
  "repositories": [
    {
      "name": "Byaidu/PDFMathTranslate",
      "description": "PDF scientific paper translation with preserved formats - \u57fa\u4e8e AI \u5b8c\u6574\u4fdd\u7559\u6392\u7248\u7684 PDF \u6587\u6863\u5168\u6587\u53cc\u8bed\u7ffb\u8bd1\uff0c\u652f\u6301 Google/DeepL/Ollama/OpenAI \u7b49\u670d\u52a1\uff0c\u63d0\u4f9b CLI/GUI/Docker",
      "stars": 8902,
      "forks": 633,
      "today_stars": 816,
      "language": "Python"
    },
    {
      "name": "bigskysoftware/htmx",
      "description": "htmx - high power tools for HTML",
      "stars": 39143,
      "forks": 1324,
      "today_stars": 186,
      "language": "JavaScript"
    },
    {
      "name": "commaai/openpilot",
      "description": "openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 275+ supported cars.",
      "stars": 50945,
      "forks": 9206,
      "today_stars": 132,
      "language": "Python"
    },
    {
      "name": "google-gemini/cookbook",
      "description": "Examples and guides for using the Gemini API",
      "stars": 8108,
      "forks": 1011,
      "today_stars": 1221,
      "language": "Jupyter Notebook"
    },
    {
      "name": "stripe/stripe-ios",
      "description": "Stripe iOS SDK",
      "stars": 2179,
      "forks": 994,
      "today_stars": 19,
      "language": "Swift"
    },
    {
      "name": "RIOT-OS/RIOT",
      "description": "RIOT - The friendly OS for IoT",
      "stars": 5234,
      "forks": 2017,
      "today_stars": 168,
      "language": "C"
    },
    {
      "name": "zju3dv/EasyVolcap",
      "description": "EasyVolcap: Accelerating Neural Volumetric Video Research",
      "stars": 802,
      "forks": 52,
      "today_stars": 30,
      "language": "Python"
    },
    {
      "name": "TEN-framework/TEN-Agent",
      "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more. It offers real-time capabilities to see, hear, and speak, along with advanced tools like weather checks, web search, and RAG.",
      "stars": 3245,
      "forks": 313,
      "today_stars": 296,
      "language": "Python"
    },
    {
      "name": "DS4SD/docling",
      "description": "Get your documents ready for gen AI",
      "stars": 15201,
      "forks": 774,
      "today_stars": 281,
      "language": "Python"
    },
    {
      "name": "Guovin/iptv-api",
      "description": "\ud83d\udcfaIPTV\u7535\u89c6\u76f4\u64ad\u6e90\u66f4\u65b0\u5de5\u5177\uff1a\u2728\u592e\u89c6\u9891\u3001\ud83d\udcf1\u536b\u89c6\u3001\u2615\u5404\u7701\u4efd\u5730\u65b9\u53f0\u3001\ud83c\udf0f\u6e2f\u00b7\u6fb3\u00b7\u53f0\u3001\ud83c\udfa5\u7535\u5f71\u3001\ud83c\udfae\u6e38\u620f\u3001\ud83c\udfb5\u97f3\u4e50\u3001\ud83c\udfad\u7ecf\u5178\u5267\u573a\uff1b\u652f\u6301IPv4/IPv6\uff1b\u652f\u6301\u81ea\u5b9a\u4e49\u589e\u52a0\u9891\u9053\uff1b\u652f\u6301\u805a\u5408\u6e90\u3001\u4ee3\u7406\u6e90\u3001\u8ba2\u9605\u6e90\u3001\u5173\u952e\u5b57\u641c\u7d22\uff1b\u6bcf\u5929\u81ea\u52a8\u66f4\u65b0\u4e24\u6b21\uff0c\u7ed3\u679c\u53ef\u7528\u4e8eTVBox\u7b49\u64ad\u653e\u8f6f\u4ef6\uff1b\u652f\u6301\u5de5\u4f5c\u6d41\u3001Docker(amd64/arm64/arm v7)\u3001\u547d\u4ee4\u884c\u3001GUI\u8fd0\u884c\u65b9\u5f0f | IPTV live TV source update tool",
      "stars": 9046,
      "forks": 1938,
      "today_stars": 101,
      "language": "Python"
    },
    {
      "name": "fatedier/frp",
      "description": "A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.",
      "stars": 87828,
      "forks": 13502,
      "today_stars": 64,
      "language": "Go"
    },
    {
      "name": "facebookresearch/AnimatedDrawings",
      "description": "Code to accompany \"A Method for Animating Children's Drawings of the Human Figure\"",
      "stars": 10766,
      "forks": 955,
      "today_stars": 38,
      "language": "Python"
    },
    {
      "name": "gorilla/websocket",
      "description": "Package gorilla/websocket is a fast, well-tested and widely used WebSocket implementation for Go.",
      "stars": 22633,
      "forks": 3495,
      "today_stars": 13,
      "language": "Go"
    },
    {
      "name": "DefiLlama/chainlist",
      "description": "NA",
      "stars": 2368,
      "forks": 2476,
      "today_stars": 5,
      "language": "JavaScript"
    },
    {
      "name": "open-telemetry/opentelemetry-collector",
      "description": "OpenTelemetry Collector",
      "stars": 4570,
      "forks": 1497,
      "today_stars": 4,
      "language": "Go"
    },
    {
      "name": "RocketChat/Rocket.Chat",
      "description": "The communications platform that puts data protection first.",
      "stars": 41169,
      "forks": 10877,
      "today_stars": 73,
      "language": "TypeScript"
    },
    {
      "name": "langgenius/dify",
      "description": "Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.",
      "stars": 54976,
      "forks": 8083,
      "today_stars": 127,
      "language": "TypeScript"
    }
  ]
}
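
The `\uXXXX` sequences in the descriptions above are what Python's `json.dumps` emits by default for non-ASCII characters. The sketch below shows the difference that `ensure_ascii=False` makes when printing. The `response` dictionary is a hand-built stand-in with an abbreviated description, and its `request_id` and `result` keys are assumptions about the response shape.

```python
import json

# Stand-in response; the key names and the shortened description are assumptions.
response = {
    "request_id": "1e3b00ff-4b55-497c-8046-8ec5503cdafd",
    "result": {
        "repositories": [
            {
                "name": "Byaidu/PDFMathTranslate",
                "description": "PDF scientific paper translation with preserved formats - \u57fa\u4e8e AI",
                "stars": 8902,
                "forks": 633,
                "today_stars": 816,
                "language": "Python",
            }
        ]
    },
}

# Default behaviour: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(response["result"], indent=2)

# ensure_ascii=False keeps the original characters readable in the output.
readable = json.dumps(response["result"], indent=2, ensure_ascii=False)

print(f"Request ID: {response['request_id']}")
print("Trending Repositories:")
print(readable)
```

Both strings decode to the same data with `json.loads`; only the on-screen representation differs.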

💾 Save the output to a `CSV` file

Let's create a pandas DataFrame and display the extracted content as a table.

Save it to CSV

Data saved to trending_repositories.csv
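
The notebook uses pandas for this step. As a dependency-free alternative, the sketch below writes the same columns with Python's standard `csv` module. It is a swapped-in technique, not the notebook's original code. The function name `repositories_to_csv` and the `FIELDS` list are hypothetical, and the two sample rows are copied from the output shown earlier.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

# Column order follows the fields of RepositorySchema.
FIELDS: List[str] = ["name", "description", "stars", "forks", "today_stars", "language"]


def repositories_to_csv(repositories: Iterable[Dict[str, object]], out: TextIO) -> None:
    """Hypothetical helper: write repository dicts to `out` as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for repo in repositories:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({field: repo.get(field, "") for field in FIELDS})


repositories = [
    {
        "name": "bigskysoftware/htmx",
        "description": "htmx - high power tools for HTML",
        "stars": 39143,
        "forks": 1324,
        "today_stars": 186,
        "language": "JavaScript",
    },
    {
        "name": "fatedier/frp",
        "description": "A fast reverse proxy to help you expose a local server behind a NAT or firewall to the internet.",
        "stars": 87828,
        "forks": 13502,
        "today_stars": 64,
        "language": "Go",
    },
]

# Write to an in-memory buffer here; pass an open file object to save to disk.
buffer = io.StringIO()
repositories_to_csv(repositories, buffer)
csv_text = buffer.getvalue()
```

To save to disk, open the target with `open("trending_repositories.csv", "w", newline="", encoding="utf-8")` and pass the file object as `out`; `newline=""` is what the `csv` module documentation recommends to avoid blank lines on Windows.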

🔗 Resources

🚀 Get your API Key: ScrapeGraphAI Dashboard
🐙 GitHub: ScrapeGraphAI GitHub
💼 LinkedIn: ScrapeGraphAI LinkedIn
🐦 Twitter: ScrapeGraphAI Twitter
💬 Discord: Join our Discord Community

Made with ❤️ by the ScrapeGraphAI Team
