Scrapegraph Langchain

🕷️ Extract Wired Science News with langchain-scrapegraph

🔧 Install dependencies

[ ]

🔑 Import ScrapeGraph API key

You can find the Scrapegraph API key here

[ ]
SGAI_API_KEY not found in environment.
Please enter your SGAI_API_KEY: ··········
SGAI_API_KEY has been set in the environment.
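
The cell that produced the messages above is not shown, so the following is only a minimal sketch of one way to get the same behavior: read the key from the environment and prompt for it if it is missing. The helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical and not part of any ScrapeGraph package.

```python
import getpass
import os


def ensure_api_key(env_var="SGAI_API_KEY", prompt_fn=getpass.getpass):
    """Return the key from the environment, prompting for it only if missing."""
    key = os.environ.get(env_var)
    if key:
        print(f"{env_var} found in environment.")
        return key

    print(f"{env_var} not found in environment.")
    # getpass masks the typed characters, so the key is not echoed in the output.
    key = prompt_fn(f"Please enter your {env_var}: ").strip()
    if not key:
        raise ValueError(f"No value provided for {env_var}.")

    # Store the key for the rest of the session.
    os.environ[env_var] = key
    print(f"{env_var} has been set in the environment.")
    return key
```

Calling `ensure_api_key()` in a notebook cell prompts once; later calls in the same session reuse the stored value.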

📝 Defining an Output Schema for Webpage Content Extraction

If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.

Pydantic Schema Quick Guide

Types of Schemas

  1. Simple Schema
    Use this when you want to extract straightforward information, such as a single piece of content.

from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}

  2. Complex Schema (Nested)
    If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.

from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")

# Example Output JSON after AI extraction
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}

Key Takeaways

  • Simple Schema: Perfect for small, straightforward extractions.
  • Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."

Both approaches give the AI a clear structure to follow, ensuring that the extracted content matches exactly what you need.
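
To make the "blueprint" idea concrete, here is a small sketch that checks two payloads against a trimmed-down version of the nested schema (only `name` and `stars` are kept for brevity). It uses plain Pydantic validation and does not call the ScrapeGraph API; the `good` and `bad` payloads are illustrative.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    stars: int = Field(description="Star count of the repository")


class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")


# A payload that matches the schema is parsed into typed objects.
good = {"repositories": [{"name": "google-gemini/cookbook", "stars": 8036}]}
parsed = ListRepositoriesSchema(**good)
print(parsed.repositories[0].stars)

# A payload with a missing required field ("stars") is rejected.
bad = {"repositories": [{"name": "google-gemini/cookbook"}]}
try:
    ListRepositoriesSchema(**bad)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```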

[ ]
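
The schema cell for this notebook is not shown. Judging from the fields in the printed response further down (`news`, `category`, `title`, `link`, `author`), it could look roughly like the sketch below. The class names `NewsItemSchema` and `ListNewsSchema` and the field descriptions are assumptions.

```python
from typing import List

from pydantic import BaseModel, Field


# Schema for a single news item (field names taken from the printed output).
class NewsItemSchema(BaseModel):
    category: str = Field(description="Category of the news")
    title: str = Field(description="Title of the news")
    link: str = Field(description="Link to the news")
    author: str = Field(description="Author of the news")


# Schema for the whole page: a list of news items under the "news" key.
class ListNewsSchema(BaseModel):
    news: List[NewsItemSchema] = Field(description="List of news from the page")
```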

🚀 Initialize langchain-scrapegraph tools and start extraction

Here we use SmartScraperTool to extract structured data using AI from a webpage.

If you already have an HTML file, you can upload it and use LocalScraperTool instead.

You can find more information in the official LangChain documentation.

[ ]
[ ]
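
The two cells above are empty in this export. The sketch below shows how the call is commonly written with `langchain-scrapegraph`. It assumes that `SmartScraperTool` reads `SGAI_API_KEY` from the environment, accepts a Pydantic model through `llm_output_schema`, and takes `website_url` and `user_prompt` as input keys. Check the package documentation for your installed version. The helper names, the URL, and the prompt are illustrative.

```python
from typing import Any, Dict, Optional, Type


def build_request(website_url: str, user_prompt: str) -> Dict[str, str]:
    """Assemble the input dictionary passed to the tool."""
    return {"website_url": website_url, "user_prompt": user_prompt}


def extract(website_url: str, user_prompt: str, schema: Optional[Type] = None) -> Any:
    """Run the extraction. Needs langchain-scrapegraph installed and SGAI_API_KEY set."""
    # Imported lazily so this sketch loads even where the package is missing.
    from langchain_scrapegraph.tools import SmartScraperTool

    # Assumption: the schema is optional; without it the tool returns free-form data.
    if schema is not None:
        tool = SmartScraperTool(llm_output_schema=schema)
    else:
        tool = SmartScraperTool()
    return tool.invoke(build_request(website_url, user_prompt))


# Example call (performs a network request and consumes API credits):
# response = extract(
#     "https://www.wired.com/category/science/",  # assumed URL
#     "Extract the news articles listed on the page",  # assumed prompt
#     schema=None,  # or any Pydantic model describing the desired output
# )
```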

Print the response

[ ]
Science News:
{
  "news": [
    {
      "category": "Science",
      "title": "NASA Postpones Return of Stranded Starliner Astronauts to March",
      "link": "https://www.wired.com/story/boeing-starliner-astronauts-stranded-until-march-nasa/",
      "author": "Fernanda Gonz\u00e1lez"
    },
    {
      "category": "Public Health",
      "title": "CDC Confirms First US Case of Severe Bird Flu",
      "link": "https://www.wired.com/story/cdc-confirms-first-us-case-of-severe-bird-flu/",
      "author": "Emily Mullin"
    },
    {
      "category": "Science",
      "title": "The Study That Called Out Black Plastic Utensils Had a Major Math Error",
      "link": "https://www.wired.com/story/black-plastic-utensils-study-math-error-correction/",
      "author": "Beth Mole, Ars Technica"
    },
    {
      "category": "Health",
      "title": "A Third Person Has Received a Transplant of a Genetically Engineered Pig Kidney",
      "link": "https://www.wired.com/story/a-third-person-has-received-a-transplant-of-a-genetically-engineered-pig-kidney/",
      "author": "Emily Mullin"
    },
    {
      "category": "Health",
      "title": "Antibodies Could Soon Help Slow the Aging Process",
      "link": "https://www.wired.com/story/antibodies-could-soon-help-slow-the-aging-process/",
      "author": "Andrew Steele"
    },
    {
      "category": "Science",
      "title": "Good at Reading? Your Brain May Be Structured Differently",
      "link": "https://www.wired.com/story/good-at-reading-your-brain-may-be-structured-differently/",
      "author": "Mikael Roll"
    },
    {
      "category": "Environment",
      "title": "Mega-Farms Are Driving the Threat of Bird Flu",
      "link": "https://www.wired.com/story/mega-farms-are-driving-the-threat-of-bird-flu/",
      "author": "Georgina Gustin"
    },
    {
      "category": "Environment",
      "title": "How Christmas Trees Could Become a Source of Low-Carbon Protein",
      "link": "https://www.wired.com/story/how-christmas-trees-could-become-a-source-of-low-carbon-protein/",
      "author": "Alexa Phillips"
    },
    {
      "category": "Environment",
      "title": "Creating a Global Package to Solve the Problem of Plastics",
      "link": "https://www.wired.com/story/global-plastics-treaty-united-nations/",
      "author": "Susan Solomon"
    },
    {
      "category": "Environment",
      "title": "These 3 Things Are Standing in the Way of a Global Plastics Treaty",
      "link": "https://www.wired.com/story/these-3-things-are-standing-in-the-way-of-a-global-plastics-treaty/",
      "author": "Steve Fletcher and Samuel Winton"
    },
    {
      "category": "Environment",
      "title": "Environmental Sensing Is Here, Tracking Everything from Forest Fires to Threatened Species",
      "link": "https://www.wired.com/story/environmental-sensing-is-here-tracking-everything-from-forest-fires-to-threatened-species/",
      "author": "Sabrina Weiss"
    },
    {
      "category": "Climate",
      "title": "Generative AI and Climate Change Are on a Collision Course",
      "link": "https://www.wired.com/story/true-cost-generative-ai-data-centers-energy/",
      "author": "Sasha Luccioni"
    },
    {
      "category": "Climate",
      "title": "Climate Change Is Destroying Monarch Butterflies\u2019 Winter Habitat",
      "link": "https://www.wired.com/story/global-warming-threatens-the-monarch-butterfly-sanctuary-but-this-scientist-prepares-a-new-home-for-them/",
      "author": "Andrea J. Arratibel"
    },
    {
      "category": "Politics",
      "title": "More Humanitarian Organizations Will Harness AI\u2019s Potential",
      "link": "https://www.wired.com/story/humanitarian-organizations-artificial-intelligence/",
      "author": "David Miliband"
    },
    {
      "category": "Environment",
      "title": "Chocolate Has a Sustainability Problem. Science Thinks It's Found the Answer",
      "link": "https://www.wired.com/story/chocolate-has-a-sustainability-problem-science-thinks-its-found-the-answer/",
      "author": "Eve Thomas"
    },
    {
      "category": "Energy",
      "title": "Big Tech Will Scour the Globe in Its Search for Cheap Energy",
      "link": "https://www.wired.com/story/big-tech-data-centers-cheap-energy/",
      "author": "Azeem Azhar"
    },
    {
      "category": "Environment",
      "title": "Humans Will Continue to Live in an Age of Incredible Food Waste",
      "link": "https://www.wired.com/story/food-production-energy-waste/",
      "author": "Vaclav Smil"
    },
    {
      "category": "Energy",
      "title": "A Uranium-Mining Boom Is Sweeping Through Texas",
      "link": "https://www.wired.com/story/a-uranium-mining-boom-is-sweeping-through-texas-nuclear-energy/",
      "author": "Dylan Baddour"
    },
    {
      "category": "Space",
      "title": "The End Is Near for NASA\u2019s Voyager Probes",
      "link": "https://www.wired.com/story/the-end-is-near-for-nasas-voyager-probes/",
      "author": "Luca Nardi"
    },
    {
      "category": "Space",
      "title": "The Mystery of How Supermassive Black Holes Merge",
      "link": "https://www.wired.com/story/how-do-merging-supermassive-black-holes-pass-the-final-parsec/",
      "author": "Jonathan O\u2019Callaghan"
    },
    {
      "category": "Space",
      "title": "Starship\u2019s Next Launch Could Be Just Two Weeks Away",
      "link": "https://www.wired.com/story/starships-next-launch-could-be-just-two-weeks-away/",
      "author": "Eric Berger, Ars Technica"
    },
    {
      "category": "Math",
      "title": "The Simple Math Behind Public Key Cryptography",
      "link": "https://www.wired.com/story/how-public-key-cryptography-really-works-using-only-simple-math/",
      "author": "John Pavlus"
    },
    {
      "category": "Math",
      "title": "Everyone Is Capable of Mathematical Thinking--Yes, Even You",
      "link": "https://www.wired.com/story/everyone-is-capable-of-mathematical-thinking-yes-even-you/",
      "author": "Kelsey Houston-Edwards"
    },
    {
      "category": "Math",
      "title": "The Physics of the Macy\u2019s Thanksgiving Day Parade Balloons",
      "link": "https://www.wired.com/story/the-physics-of-the-macys-thanksgiving-day-parade-balloons/",
      "author": "Rhett Allain"
    },
    {
      "category": "Math",
      "title": "Mathematicians Just Debunked the \u2018Bunkbed Conjecture\u2019",
      "link": "https://www.wired.com/story/maths-bunkbed-conjecture-has-been-debunked/",
      "author": "Joseph Howlett"
    },
    {
      "category": "Biotech",
      "title": "Muscle Implants Could Allow Mind-Controlled Prosthetics--No Brain Surgery Required",
      "link": "https://www.wired.com/story/amputees-could-control-prosthetics-with-just-their-thoughts-no-brain-surgery-required-phantom-neuro/",
      "author": "Emily Mullin"
    },
    {
      "category": "Biotech",
      "title": "Combining AI and Crispr Will Be Transformational",
      "link": "https://www.wired.com/story/combining-ai-and-crispr-will-be-transformational/",
      "author": "Jennifer Doudna"
    },
    {
      "category": "Biotech",
      "title": "Neuralink Plans to Test Whether Its Brain Implant Can Control a Robotic Arm",
      "link": "https://www.wired.com/story/neuralink-robotic-arm-controlled-by-mind/",
      "author": "Emily Mullin"
    },
    {
      "category": "Public Health",
      "title": "A Mysterious Respiratory Disease Has the Democratic Republic of the Congo on High Alert",
      "link": "https://www.wired.com/story/drc-mysterious-respiratory-disease-children-who-africa/",
      "author": "Marta Musso"
    },
    {
      "category": "Health",
      "title": "Skip the Sea Kelp Supplements",
      "link": "https://www.wired.com/story/pass-on-sea-kelp-supplements/",
      "author": "Boutayna Chokrane"
    },
    {
      "category": "Sports",
      "title": "Why Soccer Players Are Training in the Dark",
      "link": "https://www.wired.com/story/why-soccer-players-are-training-in-the-dark-okkulo-football-sunderland-leeds-united-neuroscience/",
      "author": "RM Clark"
    },
    {
      "category": "Environment",
      "title": "A Parasite That Eats Cattle Alive Is Creeping North Toward the US",
      "link": "https://www.wired.com/story/a-parasite-that-eats-cattle-alive-is-creeping-north-toward-the-us/",
      "author": "Geraldine Castro"
    },
    {
      "category": "Technology",
      "title": "Lasers Are Making It Easier to Find Buried Land Mines",
      "link": "https://www.wired.com/story/this-laser-system-can-locate-landmines-with-high-accuracy/",
      "author": "Ritsuko Kawai"
    },
    {
      "category": "Health",
      "title": "Mark Cuban\u2019s War on Drug Prices: \u2018How Much Fucking Money Do I Need?\u2019",
      "link": "https://www.wired.com/story/big-interview-mark-cuban-2024/",
      "author": "Marah Eakin"
    },
    {
      "category": "Environment",
      "title": "Can Artificial Rain, Drones, or Satellites Clean Toxic Air?",
      "link": "https://www.wired.com/story/artificial-rain-drones-and-satellites-can-tech-clean-indias-toxic-air/",
      "author": "Arunima Kar"
    },
    {
      "category": "Health",
      "title": "These Stem Cell Treatments Are Worth Millions. Donors Get Paid $200",
      "link": "https://www.wired.com/story/stem-cells-cost-rich-16500-donors-get-paid-200-cellcolabs-sweden/",
      "author": "Matt Reynolds"
    },
    {
      "category": "Environment",
      "title": "The $60 Billion Potential Hiding in Your Discarded Gadgets",
      "link": "https://www.wired.com/story/a-dollar60-billion-a-year-climate-solution-is-sitting-in-our-junk-drawers/",
      "author": "Vince Beiser"
    },
    {
      "category": "Health",
      "title": "Tune In to the Healing Powers of a Decent Playlist",
      "link": "https://www.wired.com/story/music-therapy-health-care/",
      "author": "Daniel Levitin"
    },
    {
      "category": "Human History",
      "title": "The Whole Story of How Humans Evolved From Great Apes",
      "link": "https://www.wired.com/story/the-whole-story-of-how-humans-evolved-from-great-apes-homo-erectus-hominin-lucy/",
      "author": "John Gowlett"
    },
    {
      "category": "Health",
      "title": "Why an Offline Nuclear Reactor Led to Thousands of Hospital Appointments Being Canceled",
      "link": "https://www.wired.com/story/why-an-offline-nuclear-reactor-led-to-thousands-of-hospital-appointments-being-cancelled/",
      "author": "Chris Baraniuk"
    }
  ]
}
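
The `\uXXXX` sequences in the output above (for example `Gonz\u00e1lez`) are what `json.dumps` produces by default for non-ASCII characters. Assuming the response is a plain dictionary, a sketch of the printing step, with an optional `ensure_ascii=False` to keep such characters readable, could be:

```python
import json

# Stand-in for the dictionary returned by the tool (one item from the output above).
response = {
    "news": [
        {
            "category": "Science",
            "title": "NASA Postpones Return of Stranded Starliner Astronauts to March",
            "link": "https://www.wired.com/story/boeing-starliner-astronauts-stranded-until-march-nasa/",
            "author": "Fernanda González",
        }
    ]
}

# Default behavior: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(response, indent=2)

# ensure_ascii=False keeps characters such as "á" as they are.
readable = json.dumps(response, indent=2, ensure_ascii=False)

print("Science News:")
print(readable)
```

Both strings decode to the same data; only the on-screen representation differs.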

💾 Save the output to a CSV file

Let's create a pandas DataFrame and display the extracted content as a table.

[ ]
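
A minimal sketch of this step, assuming the response is a dictionary with a `news` list as in the printed output (only two items are reproduced here):

```python
import pandas as pd

# Stand-in for the extraction result; items copied from the printed output.
response = {
    "news": [
        {
            "category": "Science",
            "title": "NASA Postpones Return of Stranded Starliner Astronauts to March",
            "link": "https://www.wired.com/story/boeing-starliner-astronauts-stranded-until-march-nasa/",
            "author": "Fernanda González",
        },
        {
            "category": "Public Health",
            "title": "CDC Confirms First US Case of Severe Bird Flu",
            "link": "https://www.wired.com/story/cdc-confirms-first-us-case-of-severe-bird-flu/",
            "author": "Emily Mullin",
        },
    ]
}

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(response["news"])
print(df)
```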

Save it to CSV

[ ]
Data saved to wired_news.csv
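
The save step can be sketched as follows. The file name `wired_news.csv` comes from the message above; the helper name `save_news_csv` is hypothetical, and `index=False` is a choice made here to avoid writing the row index as an extra column.

```python
import pandas as pd

# Stand-in DataFrame with one row from the extracted news.
df = pd.DataFrame(
    [
        {
            "category": "Math",
            "title": "The Simple Math Behind Public Key Cryptography",
            "link": "https://www.wired.com/story/how-public-key-cryptography-really-works-using-only-simple-math/",
            "author": "John Pavlus",
        }
    ]
)


def save_news_csv(frame: pd.DataFrame, path: str = "wired_news.csv") -> str:
    """Write the DataFrame to a CSV file and return the path."""
    frame.to_csv(path, index=False)
    print(f"Data saved to {path}")
    return path


# Example call (writes a file in the current working directory):
# save_news_csv(df)
```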

🔗 Resources

Made with ❤️ by the ScrapeGraphAI Team