🕷️ Extract Company Info with llama-index-tools-scrapegraphai


🔧 Install dependencies

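The original install cell is not shown; a minimal version might look like the following (package name taken from the notebook title, plus pandas for the CSV step later on):

```shell
pip install llama-index llama-index-tools-scrapegraphai pandas
```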

🔑 Import ScrapeGraph API key

You can find your ScrapeGraph API key in the ScrapeGraphAI dashboard.

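The key-loading cell is not shown; one way to sketch it, assuming the key was exported under the environment variable name `SGAI_API_KEY` (an assumption, not part of the original notebook):

```python
import os

# Assumption: the key was exported as SGAI_API_KEY before starting the notebook.
sgai_api_key = os.environ.get("SGAI_API_KEY", "")

if not sgai_api_key:
    print("Warning: SGAI_API_KEY is not set; extraction calls will fail.")
```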

📝 Defining an Output Schema for Webpage Content Extraction

If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.

Pydantic Schema Quick Guide

Types of Schemas

  1. Simple Schema
    Use this when you want to extract straightforward information, such as a single piece of content.

```python
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
```

Example output JSON after AI extraction:

```json
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
```

  2. Complex Schema (Nested)
    If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.

```python
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")
```

Example output JSON after AI extraction:

```json
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}
```

Key Takeaways

  • Simple Schema: Perfect for small, straightforward extractions.
  • Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."

Both approaches give the AI a clear structure to follow, ensuring that the extracted content matches exactly what you need.


🚀 Initialize ScrapegraphToolSpec tools and start extraction

Here we use scrapegraph_smartscraper to extract structured data from a webpage using AI.

If you already have an HTML file, you can upload it and use scrapegraph_local_scrape instead.

You can find more info in the official llama-index documentation

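The extraction cells themselves are not shown; a sketch of what they might contain, assuming the ScrapegraphToolSpec interface from llama-index-tools-scrapegraphai (the method name comes from this notebook, but the exact signature is an assumption and may differ in your installed version):

```python
from llama_index.tools.scrapegraph import ScrapegraphToolSpec

tool_spec = ScrapegraphToolSpec()

# Assumed call shape: a natural-language prompt, the target URL, and the
# ScrapeGraph API key imported earlier in the notebook.
response = tool_spec.scrapegraph_smartscraper(
    prompt="Extract information about science news articles",
    url="https://www.wired.com/tag/science/",
    api_key=sgai_api_key,  # the key loaded in the API-key step above
)
```

If you already have an HTML file on disk, swap in scrapegraph_local_scrape as noted above.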

Print the response

Company Info:
{
  "request_id": "2f41ad97-b0d3-4d44-8c0d-9734c347f595",
  "status": "completed",
  "website_url": "https://www.wired.com/tag/science/",
  "user_prompt": "Extract information about science news articles",
  "result": {
    "company_name": "WIRED",
    "description": "WIRED is a magazine that covers the intersection of technology, science, culture, and politics. It is the essential source of information and ideas that make sense of a world in constant transformation, illuminating how technology is changing every aspect of our lives.",
    "founders": [
      {
        "name": "Louis Rossetto",
        "role": "Co-founder",
        "linkedin": "NA"
      },
      {
        "name": "Jane Metcalfe",
        "role": "Co-founder",
        "linkedin": "NA"
      }
    ],
    "logo": "https://www.wired.com/verso/static/wired-us/assets/logo-header.svg",
    "partners": [],
    "pricing_plans": [],
    "contact_emails": [],
    "social_links": {
      "linkedin": "NA",
      "twitter": "https://twitter.com/wired/",
      "github": "NA"
    },
    "privacy_policy": "http://www.condenast.com/privacy-policy#privacypolicy",
    "terms_of_service": "https://www.condenast.com/user-agreement/",
    "api_status": "NA"
  },
  "error": ""
}

💾 Save the output to a CSV file

Let's create pandas DataFrames and display tables of the extracted content.

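The flattening cells are not shown; one way to sketch them with pandas, using an abbreviated copy of the smartscraper result printed above (the variable names and CSV file names here are illustrative assumptions):

```python
import pandas as pd

# Abbreviated copy of the "result" object from the response above.
result = {
    "company_name": "WIRED",
    "logo": "https://www.wired.com/verso/static/wired-us/assets/logo-header.svg",
    "founders": [
        {"name": "Louis Rossetto", "role": "Co-founder", "linkedin": "NA"},
        {"name": "Jane Metcalfe", "role": "Co-founder", "linkedin": "NA"},
    ],
    "social_links": {"linkedin": "NA", "twitter": "https://twitter.com/wired/", "github": "NA"},
}

# Scalar fields become a one-row company table.
scalars = {k: v for k, v in result.items() if not isinstance(v, (list, dict))}
company_df = pd.DataFrame([scalars])

# Nested structures are flattened into their own tables.
founders_df = pd.json_normalize(result["founders"])
social_df = pd.json_normalize(result["social_links"])

print(company_df)
print(founders_df)

# Save each flattened table to its own CSV file.
company_df.to_csv("company_info.csv", index=False)
founders_df.to_csv("founders.csv", index=False)
social_df.to_csv("social_links.csv", index=False)
print("Data saved to CSV files")
```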

Show flattened tables


Save the responses to CSV

Data saved to CSV files

🔗 Resources


Made with ❤️ by the ScrapeGraphAI Team