Scrapegraph Langchain
🕷️ Extract House Listings with langchain-scrapegraph
🔧 Install dependencies
🔑 Import ScrapeGraph API key
You can find your ScrapeGraph API key in the ScrapeGraphAI Dashboard.
SGAI_API_KEY found in environment.
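A minimal sketch of how the key might be loaded, assuming it is stored in the SGAI_API_KEY environment variable shown in the output above. The helper name load_api_key is hypothetical, and the hidden prompt is just one way to handle a missing key.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "SGAI_API_KEY") -> str:
    """Return the API key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if key:
        print(f"{env_var} found in environment.")
        return key
    # Hidden prompt, so the key is not echoed into the notebook output.
    key = getpass(f"{env_var} not found, please enter it: ")
    # Store it so that later cells can read the same variable.
    os.environ[env_var] = key
    return key
```

Calling load_api_key() once near the top of the notebook keeps the key out of the source code itself.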
📝 Defining an Output Schema for Webpage Content Extraction
If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.
Pydantic Schema Quick Guide
Types of Schemas
- Simple Schema
Use this when you want to extract straightforward information, such as a single piece of content.
from pydantic import BaseModel, Field
# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")
# Example Output JSON after AI extraction
{
"title": "ScrapeGraphAI: The Best Content Extraction Tool",
"description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}
- Complex Schema (Nested)
If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.
from pydantic import BaseModel, Field
from typing import List
# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")
# Example Output JSON after AI extraction
{
"repositories": [
{
"name": "google-gemini/cookbook",
"description": "Examples and guides for using the Gemini API",
"stars": 8036,
"forks": 1001,
"today_stars": 649,
"language": "Jupyter Notebook"
},
{
"name": "TEN-framework/TEN-Agent",
"description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
"stars": 3224,
"forks": 311,
"today_stars": 361,
"language": "Python"
}
]
}
Key Takeaways
- Simple Schema: Perfect for small, straightforward extractions.
- Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."
Both approaches give the AI a clear structure to follow, ensuring that the extracted content matches exactly what you need.
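For the house listings extracted later in this notebook, a nested schema along the same lines might look like the sketch below. The field names are taken from the printed response further down; the class names HouseSchema and HousesListingSchema, the field types, and the descriptions are assumptions made for illustration.

```python
from typing import List

from pydantic import BaseModel, Field


# Schema for a single listing (field names mirror the keys in the printed response)
class HouseSchema(BaseModel):
    price: int = Field(description="Listing price as an integer")
    bedrooms: int = Field(description="Number of bedrooms")
    bathrooms: int = Field(description="Number of bathrooms")
    square_feet: int = Field(description="Living area in square feet")
    address: str = Field(description="Street address of the house")
    city: str = Field(description="City where the house is located")
    state: str = Field(description="State abbreviation")
    # Kept as a string: the response quotes ZIP codes and uses "NA" when one is missing.
    zip_code: str = Field(description="ZIP code")
    tags: List[str] = Field(description="Short tags describing the listing")
    agent_name: str = Field(description="Name of the listing agent")
    agency: str = Field(description="Agency the agent works for")


# Schema for the whole page: a list of listings
class HousesListingSchema(BaseModel):
    houses: List[HouseSchema] = Field(description="List of houses listed on the page")
```

Keeping zip_code as a string is a deliberate choice here: an integer field would reject placeholder values such as "NA".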
🚀 Initialize langchain-scrapegraph tools and start extraction
Here we use SmartScraperTool to extract structured data from a webpage using AI.
If you already have an HTML file, you can upload it and use LocalScraperTool instead.
You can find more info in the official LangChain documentation.
Invoke the tool
As you may have noticed, we are not passing the llm_output_schema while invoking the tool. This makes life easier for AI agents, since they do not need to generate a schema themselves, a step with a high risk of failure. Instead, we force the tool to always return structured output that follows your previously defined schema. To find out more, check the README.
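The pattern described above can be sketched roughly as follows. The llm_output_schema argument is named in the text; the import path langchain_scrapegraph.tools and the input keys website_url and user_prompt are assumptions based on the package documentation as I recall it, so check them against the README. The helper names build_tool and extract are hypothetical.

```python
from typing import Any, Dict, Optional


def build_tool(output_schema: Optional[type] = None) -> Any:
    """Create a SmartScraperTool, optionally bound to a Pydantic schema."""
    # Imported lazily so that extract() below can be tried without the package installed.
    # Assumption: the tool lives in langchain_scrapegraph.tools.
    from langchain_scrapegraph.tools import SmartScraperTool

    if output_schema is None:
        return SmartScraperTool()
    # The schema is bound once here, not passed on every call.
    return SmartScraperTool(llm_output_schema=output_schema)


def extract(tool: Any, website_url: str, user_prompt: str) -> Dict[str, Any]:
    """Invoke the tool with only a URL and a prompt."""
    # Assumption: the tool accepts these two input keys.
    return tool.invoke({"website_url": website_url, "user_prompt": user_prompt})
```

With a schema class of your own, the call would be something like extract(build_tool(YourSchema), url, prompt). Because the schema is attached when the tool is built, an agent calling the tool only has to supply the URL and the prompt.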
Print the response
Houses:
{
"houses": [
{
"price": 549000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 0,
"address": "380 14th St Unit 405",
"city": "San Francisco",
"state": "CA",
"zip_code": "94103",
"tags": [
"New construction"
],
"agent_name": "Eddie O'Sullivan",
"agency": "Elevation Real Estate"
},
{
"price": 1799000,
"bedrooms": 4,
"bathrooms": 2,
"square_feet": 2735,
"address": "123 Grattan St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94117",
"tags": [
"Edwardian-style",
"investment",
"owner-occupants"
],
"agent_name": "Sean Engmann",
"agency": "eXp Realty of Northern CA Inc."
},
{
"price": 1995000,
"bedrooms": 7,
"bathrooms": 3,
"square_feet": 3330,
"address": "1590 Washington St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94109",
"tags": [
"Nob Hill",
"3-unit building",
"investment"
],
"agent_name": "Eddie O'Sullivan",
"agency": "Elevation Real Estate"
},
{
"price": 549000,
"bedrooms": 0,
"bathrooms": 1,
"square_feet": 477,
"address": "240 Lombard St Unit 835",
"city": "San Francisco",
"state": "CA",
"zip_code": "94111",
"tags": [
"Bay view",
"studio",
"modern appliances"
],
"agent_name": "Tim Gullicksen",
"agency": "Corcoran Icon Properties"
},
{
"price": 5495000,
"bedrooms": 10,
"bathrooms": 7,
"square_feet": 6505,
"address": "1057 Steiner St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94115",
"tags": [
"Victorian",
"Bed & Breakfast",
"Gilded Age"
],
"agent_name": "Bonnie Spindler",
"agency": "Corcoran Icon Properties"
},
{
"price": 925000,
"bedrooms": 2,
"bathrooms": 1,
"square_feet": 779,
"address": "2 Fallon Place Unit 57",
"city": "San Francisco",
"state": "CA",
"zip_code": "94133",
"tags": [
"Russian Hill",
"views",
"exclusive-use deck"
],
"agent_name": "Eddie O'Sullivan",
"agency": "Elevation Real Estate"
},
{
"price": 898000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1175,
"address": "5160 Diamond Heights Blvd Unit 208C",
"city": "San Francisco",
"state": "CA",
"zip_code": "94131",
"tags": [],
"agent_name": "Joe Polyak",
"agency": "Rise Homes"
},
{
"price": 1700000,
"bedrooms": 4,
"bathrooms": 2,
"square_feet": 1950,
"address": "1351 26th Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94122",
"tags": [],
"agent_name": "Glenda Queensbury",
"agency": "Referral Realty-BV"
},
{
"price": 1899000,
"bedrooms": 3,
"bathrooms": 2,
"square_feet": 1560,
"address": "340 Yerba Buena Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94127",
"tags": [],
"agent_name": "Jeannie Anderson",
"agency": "Coldwell Banker Realty"
},
{
"price": 850000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1055,
"address": "588 Minna Unit 604",
"city": "San Francisco",
"state": "CA",
"zip_code": "94103",
"tags": [],
"agent_name": "Mohamed Lakdawala",
"agency": "Remax Prestigious Properties"
},
{
"price": 1990000,
"bedrooms": 3,
"bathrooms": 1,
"square_feet": 1280,
"address": "1450 Diamond St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94131",
"tags": [],
"agent_name": "Mary Anne Villamil",
"agency": "Kinetic Real Estate"
},
{
"price": 849000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 855,
"address": "81 Lansing St Unit 401",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105",
"tags": [],
"agent_name": "Kristen Haenggi",
"agency": "Compass"
},
{
"price": 1080000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 936,
"address": "451 Kansas St Unit 466",
"city": "San Francisco",
"state": "CA",
"zip_code": "94107",
"tags": [],
"agent_name": "Maureen DeBoer",
"agency": "LKJ Realty"
},
{
"price": 1499000,
"bedrooms": 4,
"bathrooms": 2,
"square_feet": 2145,
"address": "486 Yale St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94134",
"tags": [],
"agent_name": "Alicia Atienza",
"agency": "Statewide Realty"
},
{
"price": 1140000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 998,
"address": "588 Minna Unit 801",
"city": "San Francisco",
"state": "CA",
"zip_code": "94103",
"tags": [],
"agent_name": "Milan Jezdimirovic",
"agency": "Compass"
},
{
"price": 1988000,
"bedrooms": 2,
"bathrooms": 1,
"square_feet": 3800,
"address": "183 19th Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94121",
"tags": [
"Amazing Property",
"Marina Style",
"Needs TLC"
],
"agent_name": "Leo Cheung",
"agency": "eXp Realty of California, Inc"
},
{
"price": 1218000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1275,
"address": "1998 Pacific Ave Unit 202",
"city": "San Francisco",
"state": "CA",
"zip_code": "94109",
"tags": [
"Light-filled",
"Freshly painted",
"Walker's paradise"
],
"agent_name": "Grace Sun",
"agency": "Compass"
},
{
"price": 895000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 837,
"address": "425 1st St Unit 2501",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105",
"tags": [
"Unobstructed bay bridge views",
"Open layout"
],
"agent_name": "Matt Fuller",
"agency": "Jackson Fuller Real Estate"
},
{
"price": 1499000,
"bedrooms": 3,
"bathrooms": 1,
"square_feet": 1500,
"address": "Unlisted Address",
"city": "San Francisco",
"state": "CA",
"zip_code": "NA",
"tags": [
"Contractor's Special",
"Fixer-upper"
],
"agent_name": "Jaymee Faith Sagisi",
"agency": "IMPACT"
},
{
"price": 900000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 930,
"address": "1101 Green St Unit 302",
"city": "San Francisco",
"state": "CA",
"zip_code": "94109",
"tags": [
"Historic Art Deco",
"Abundant natural light"
],
"agent_name": "NA",
"agency": "NA"
},
{
"price": 858000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 1104,
"address": "260 King St Unit 557",
"city": "San Francisco",
"state": "CA",
"zip_code": "94107",
"tags": [],
"agent_name": "Miyuki Takami",
"agency": "eXp Realty of California, Inc"
},
{
"price": 945000,
"bedrooms": 2,
"bathrooms": 1,
"square_feet": 767,
"address": "307 Page St Unit 1",
"city": "San Francisco",
"state": "CA",
"zip_code": "94102",
"tags": [],
"agent_name": "NA",
"agency": "NA"
},
{
"price": 1099000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1330,
"address": "1080 Sutter St Unit 202",
"city": "San Francisco",
"state": "CA",
"zip_code": "94109",
"tags": [],
"agent_name": "Annette Liberty",
"agency": "Coldwell Banker Realty"
},
{
"price": 950000,
"bedrooms": 4,
"bathrooms": 3,
"square_feet": 2090,
"address": "3328 26th St Unit 3330",
"city": "San Francisco",
"state": "CA",
"zip_code": "94110",
"tags": [],
"agent_name": "Isaac Munene",
"agency": "Coldwell Banker Realty"
},
{
"price": 1088000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1065,
"address": "1776 Sacramento St Unit 503",
"city": "San Francisco",
"state": "CA",
"zip_code": "94109",
"tags": [],
"agent_name": "Marilyn Becklehimer",
"agency": "Dio Real Estate"
},
{
"price": 1788888,
"bedrooms": 4,
"bathrooms": 3,
"square_feet": 1856,
"address": "2317 15th St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94114",
"tags": [],
"agent_name": "Joel Gile",
"agency": "Sequoia Real Estate"
},
{
"price": 1650000,
"bedrooms": 3,
"bathrooms": 2,
"square_feet": 1547,
"address": "2475 47th Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94116",
"tags": [],
"agent_name": "Lucy Goldenshteyn",
"agency": "Redfin"
},
{
"price": 998000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1202,
"address": "50 Lansing St Unit 201",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105",
"tags": [],
"agent_name": "Tracey Broadman",
"agency": "Vanguard Properties"
},
{
"price": 1595000,
"bedrooms": 3,
"bathrooms": 5,
"square_feet": 1995,
"address": "15 Joy St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94110",
"tags": [],
"agent_name": "Mike Stack",
"agency": "Vanguard Properties"
},
{
"price": 1028000,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1065,
"address": "50 Lansing St Unit 403",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105",
"tags": [],
"agent_name": "Robyn Kaufman",
"agency": "Vivre Real Estate"
},
{
"price": 999000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 1021,
"address": "338 Spear St Unit 6J",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105",
"tags": [
"Spacious",
"Balcony",
"Bright courtyard views"
],
"agent_name": "Paul Hwang",
"agency": "Skybox Realty"
},
{
"price": 799800,
"bedrooms": 2,
"bathrooms": 2,
"square_feet": 1109,
"address": "10 Innes Ct",
"city": "San Francisco",
"state": "CA",
"zip_code": "94124",
"tags": [
"New Construction",
"1-car garage"
],
"agent_name": "Lennar",
"agency": "Lennar"
},
{
"price": 529880,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 740,
"address": "10 Innes Ct",
"city": "San Francisco",
"state": "CA",
"zip_code": "94124",
"tags": [
"New Construction",
"1-car garage"
],
"agent_name": "Lennar",
"agency": "Lennar"
},
{
"price": 489000,
"bedrooms": 1,
"bathrooms": 1,
"square_feet": 741,
"address": "10 Innes Ct",
"city": "San Francisco",
"state": "CA",
"zip_code": "94124",
"tags": [
"New Construction",
"1-car garage"
],
"agent_name": "Lennar",
"agency": "Lennar"
},
{
"price": 1359000,
"bedrooms": 4,
"bathrooms": 2,
"square_feet": 1845,
"address": "170 Thrift St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94112",
"tags": [
"Updated",
"Natural light"
],
"agent_name": "Cristal Wright",
"agency": "Vanguard Properties"
},
{
"price": 1295000,
"bedrooms": 3,
"bathrooms": 1,
"square_feet": 1214,
"address": "1922 43rd Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94116",
"tags": [],
"agent_name": "Mila Romprey",
"agency": "Premier Realty Associates"
},
{
"price": 1098000,
"bedrooms": 3,
"bathrooms": 1,
"square_feet": 1006,
"address": "150 Putnam St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94110",
"tags": [],
"agent_name": "Genie Mantzoros",
"agency": "Epic Real Estate & Asso. Inc."
},
{
"price": 1189870,
"bedrooms": 3,
"bathrooms": 2,
"square_feet": 1436,
"address": "327 Ordway St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94134",
"tags": [],
"agent_name": "Shawn Zahraie",
"agency": "Affinity Enterprises, Inc"
},
{
"price": 899000,
"bedrooms": 2,
"bathrooms": 1,
"square_feet": 1118,
"address": "272 Farallones St",
"city": "San Francisco",
"state": "CA",
"zip_code": "94112",
"tags": [],
"agent_name": "Janice Lee",
"agency": "Coldwell Banker Realty"
},
{
"price": 30000,
"bedrooms": 0,
"bathrooms": 0,
"square_feet": 0,
"address": "0 Evans Ave",
"city": "San Francisco",
"state": "CA",
"zip_code": "94124",
"tags": [
"Land",
"0.12 Acre",
"$251,467 per Acre"
],
"agent_name": "Heidy Carrera",
"agency": "Berkshire Hathaway HomeService"
}
]
}
💾 Save the output to a CSV file
Let's create a pandas DataFrame and show the table with the extracted content.
Save it to CSV
Data saved to houses_forsale.csv
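A minimal sketch of these two steps, assuming the response is a dict with a "houses" list like the one printed above. The function name houses_to_csv is hypothetical, and joining the tags into one string is a design choice of this sketch, not necessarily what the notebook does.

```python
import pandas as pd


def houses_to_csv(response: dict, path: str = "houses_forsale.csv") -> pd.DataFrame:
    """Turn the "houses" list into a DataFrame and write it to a CSV file."""
    # One row per house, one column per key.
    df = pd.DataFrame(response["houses"])
    if "tags" in df.columns:
        # Join each list of tags into one string so that every CSV cell holds a scalar.
        # A semicolon is used because some tags contain commas.
        df["tags"] = df["tags"].apply("; ".join)
    df.to_csv(path, index=False)
    print(f"Data saved to {path}")
    return df
```

In a notebook, leaving the returned DataFrame as the last expression of a cell displays it as a table.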
🔗 Resources
- 🚀 Get your API Key: ScrapeGraphAI Dashboard
- 🐙 GitHub: ScrapeGraphAI GitHub
- 💼 LinkedIn: ScrapeGraphAI LinkedIn
- 🐦 Twitter: ScrapeGraphAI Twitter
- 💬 Discord: Join our Discord Community
- 🦜 Langchain: ScrapeGraph docs
Made with ❤️ by the ScrapeGraphAI Team