Scrapegraph Llama Index


🕷️ Extract House Listings from Zillow with llama-index and ScrapegraphAI APIs


🔧 Install dependencies

[ ]

🔑 Import ScrapeGraph API key

You can find the Scrapegraph API key here

[ ]
SGAI_API_KEY not found in environment.
SGAI_API_KEY has been set in the environment.
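
The cell above is empty in this export, so here is a minimal sketch of one way to produce the two messages shown. It assumes the key is stored in the `SGAI_API_KEY` environment variable (the name in the output); the helper name `ensure_api_key`, its `prompt` parameter, and the "found in environment" message are hypothetical and not taken from the notebook.

```python
import os
from getpass import getpass


def ensure_api_key(name="SGAI_API_KEY", prompt=getpass):
    """Return the key stored under `name`, prompting for it only if it is missing."""
    key = os.environ.get(name)
    if key:
        print(f"{name} found in environment.")
        return key

    print(f"{name} not found in environment.")
    # getpass hides the typed value, so the key never appears in the cell output.
    key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise ValueError(f"No value entered for {name}.")

    os.environ[name] = key
    print(f"{name} has been set in the environment.")
    return key


# In a notebook cell (prompts once, then reuses the environment variable):
#   sgai_api_key = ensure_api_key()
```

Passing the prompt function as a parameter keeps the helper easy to exercise without an interactive terminal.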

📝 Defining an Output Schema for Webpage Content Extraction

If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.

Pydantic Schema Quick Guide

Types of Schemas

  1. Simple Schema
    Use this when you want to extract straightforward information, such as a single piece of content.
from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}

  2. Complex Schema (Nested)
    If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")

# Example Output JSON after AI extraction
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}

Key Takeaways

  • Simple Schema: Perfect for small, straightforward extractions.
  • Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."

Both approaches give the AI a clear structure to follow, ensuring that the extracted content matches exactly what you need.
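
To make the "blueprint" idea concrete, the sketch below loads data back into a trimmed version of the nested schema. It keeps only two of the repository fields for brevity, and the `good` and `bad` payloads are illustrative values, not output from the API. Pydantic accepts the payload that matches the schema and raises `ValidationError` for the one that does not.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    stars: int = Field(description="Star count of the repository")


class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(
        description="List of GitHub trending repositories"
    )


good = {"repositories": [{"name": "google-gemini/cookbook", "stars": 8036}]}
bad = {"repositories": [{"name": "google-gemini/cookbook", "stars": "many"}]}

# A payload that follows the schema becomes typed Python objects.
parsed = ListRepositoriesSchema(**good)
print(parsed.repositories[0].stars)

# A payload with the wrong type for `stars` is rejected.
try:
    ListRepositoriesSchema(**bad)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```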

[ ]
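
The schema cell is empty in this export. A schema consistent with the response printed later in this notebook could look like the sketch below. The field names and the top-level `houses` key come from that response; the class names `HouseSchema` and `HousesListingSchema` and all field descriptions are assumptions.

```python
from typing import List

from pydantic import BaseModel, Field


class HouseSchema(BaseModel):
    price: int = Field(description="Listing price of the house")
    bedrooms: int = Field(description="Number of bedrooms")
    bathrooms: int = Field(description="Number of bathrooms")
    square_feet: int = Field(description="Living area in square feet")
    address: str = Field(description="Street address of the house")
    city: str = Field(description="City where the house is located")
    state: str = Field(description="Two-letter state code")
    zip_code: str = Field(description="ZIP code, kept as text")
    tags: List[str] = Field(description="Highlights shown on the listing")
    agent_name: str = Field(description="Name of the listing agent")
    agency: str = Field(description="Agency that lists the house")


class HousesListingSchema(BaseModel):
    houses: List[HouseSchema] = Field(description="List of houses for sale")


# One record copied from the response shown later in this notebook.
listing = HousesListingSchema(
    houses=[
        {
            "price": 449000,
            "bedrooms": 3,
            "bathrooms": 3,
            "square_feet": 2166,
            "address": "3229 Jennings St",
            "city": "San Francisco",
            "state": "CA",
            "zip_code": "94124",
            "tags": [],
            "agent_name": "Michelle K. Pender",
            "agency": "ENGEL & VOELKERS SAN FRANCISCO",
        }
    ]
)
print(listing.houses[0].address)
```

Keeping `zip_code` as a string rather than an integer avoids losing leading zeros for ZIP codes in other regions.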

🚀 Initialize ScrapegraphToolSpec tools and start extraction

Here we use SmartScraperTool to extract structured data using AI from a webpage.

If you already have an HTML file, you can upload it and use LocalScraperTool instead.

You can find more info in the official LlamaIndex documentation.

[ ]

Invoke the tool

[ ]

As you may have noticed, we do not pass the llm_output_schema when invoking the tool. This makes life easier for AI agents, since they do not need to generate a schema themselves, which carries a high risk of failure. Instead, we force the tool to always return structured output that follows your previously defined schema. To find out more, check the following README.

Print the response

[ ]
Trending Repositories:
{
  "request_id": "628bdf64-26f9-486a-9f2f-f3b5ac9c0421",
  "status": "completed",
  "website_url": "https://www.zillow.com/san-francisco-ca/",
  "user_prompt": "Extract information about houses for sale",
  "result": {
    "houses": [
      {
        "price": 449000,
        "bedrooms": 3,
        "bathrooms": 3,
        "square_feet": 2166,
        "address": "3229 Jennings St",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94124",
        "tags": [],
        "agent_name": "Michelle K. Pender",
        "agency": "ENGEL & VOELKERS SAN FRANCISCO"
      },
      {
        "price": 950000,
        "bedrooms": 2,
        "bathrooms": 2,
        "square_feet": 1686,
        "address": "401 Huron Ave",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94112",
        "tags": [
          "Cozy fireplace"
        ],
        "agent_name": "Allison Chapleau",
        "agency": "COMPASS"
      },
      {
        "price": 207555,
        "bedrooms": 1,
        "bathrooms": 1,
        "square_feet": 1593,
        "address": "2040 Fell St APT 10",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94117",
        "tags": [],
        "agent_name": "Trista Elizabeth Bernasconi",
        "agency": "COMPASS"
      },
      {
        "price": 795000,
        "bedrooms": 4,
        "bathrooms": 2,
        "square_feet": 2000,
        "address": "515 Athens St",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94112",
        "tags": [
          "Level fenced rear yard"
        ],
        "agent_name": "Darin J. Holwitz",
        "agency": "COMPASS"
      },
      {
        "price": 599000,
        "bedrooms": 1,
        "bathrooms": 1,
        "square_feet": 0,
        "address": "380 Dolores St #6",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94114",
        "tags": [],
        "agent_name": "Melody A. Hultgren",
        "agency": "NOVA REAL ESTATE"
      },
      {
        "price": 875000,
        "bedrooms": 2,
        "bathrooms": 2,
        "square_feet": 907,
        "address": "426 Fillmore St #A",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94117",
        "tags": [
          "Sleek finishes"
        ],
        "agent_name": "NA",
        "agency": "NA"
      },
      {
        "price": 335512,
        "bedrooms": 2,
        "bathrooms": 2,
        "square_feet": 886,
        "address": "1688 Pine St #101",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94109",
        "tags": [],
        "agent_name": "Trista Elizabeth Bernasconi",
        "agency": "COMPASS"
      },
      {
        "price": 899000,
        "bedrooms": 4,
        "bathrooms": 2,
        "square_feet": 1680,
        "address": "351 Chenery St",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94131",
        "tags": [
          "South-facing panoramic views"
        ],
        "agent_name": "Easton S. Thodos",
        "agency": "THESEUS REAL ESTATE"
      },
      {
        "price": 155659,
        "bedrooms": 0,
        "bathrooms": 1,
        "square_feet": 514,
        "address": "52 Kirkwood Ave #203",
        "city": "San Francisco",
        "state": "CA",
        "zip_code": "94124",
        "tags": [
          "Modern cabinetry"
        ],
        "agent_name": "Lynn Anne Bell",
        "agency": "CHRISTIE'S INT'L R.E. SF"
      }
    ]
  },
  "error": ""
}
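
Before using the data, it can help to confirm that the request succeeded. The sketch below assumes the response is a plain dictionary shaped like the one printed above, with `status`, `result`, and `error` keys. The helper name `extract_houses` is hypothetical, and the response literal is a trimmed copy of that output.

```python
import json


def extract_houses(response):
    """Return the list of house records, or raise if the request did not succeed."""
    if response.get("status") != "completed" or response.get("error"):
        reason = response.get("error") or response.get("status")
        raise RuntimeError(f"Extraction failed: {reason}")
    return response["result"]["houses"]


# Trimmed copy of the response printed above (two listings, fewer fields).
response = {
    "request_id": "628bdf64-26f9-486a-9f2f-f3b5ac9c0421",
    "status": "completed",
    "result": {
        "houses": [
            {"price": 449000, "bedrooms": 3, "address": "3229 Jennings St"},
            {"price": 950000, "bedrooms": 2, "address": "401 Huron Ave"},
        ]
    },
    "error": "",
}

houses = extract_houses(response)
print(json.dumps(houses[0], indent=2))
```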

💾 Save the output to a CSV file

Let's create a pandas dataframe and show the table with the extracted content

[ ]
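
This cell is also empty in the export. As a rough sketch, a list of house dictionaries like the one under `result["houses"]` can be handed directly to `pd.DataFrame`, which turns each key into a column. The two records below are copied from the response above with some fields omitted; the variable names are assumptions.

```python
import pandas as pd

# Two listings copied from the response, with a subset of the fields.
houses = [
    {"price": 449000, "bedrooms": 3, "bathrooms": 3, "square_feet": 2166,
     "address": "3229 Jennings St", "zip_code": "94124", "tags": []},
    {"price": 950000, "bedrooms": 2, "bathrooms": 2, "square_feet": 1686,
     "address": "401 Huron Ave", "zip_code": "94112", "tags": ["Cozy fireplace"]},
]

# Each dictionary becomes a row; each key becomes a column.
df = pd.DataFrame(houses)
print(df)
```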

Save it to CSV

[ ]
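
A minimal sketch of the save step, assuming a dataframe built from the house records as described above. The file name `houses_forsale.csv` is hypothetical. Because `tags` holds Python lists, the sketch joins them into plain text first so the CSV cells stay readable.

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"price": 449000, "address": "3229 Jennings St", "zip_code": "94124", "tags": []},
        {"price": 950000, "address": "401 Huron Ave", "zip_code": "94112",
         "tags": ["Cozy fireplace"]},
    ]
)

# Flatten the list column so each CSV cell contains ordinary text.
df["tags"] = df["tags"].apply(", ".join)

# index=False leaves out the dataframe's row numbers.
csv_path = "houses_forsale.csv"
df.to_csv(csv_path, index=False)

# Read the file back, keeping ZIP codes as text rather than integers.
restored = pd.read_csv(csv_path, dtype={"zip_code": str})
print(restored)
```

Reading `zip_code` back with `dtype=str` matters for regions whose ZIP codes start with zero, which would otherwise lose the leading digit.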

🔗 Resources

ScrapeGraph API Banner

Made with ❤️ by the ScrapeGraphAI Team