Scrapegraph Llama Index

🕷️ Extract Wired Science News Info with llama-index and ScrapegraphAI's APIs

🔧 Install dependencies

[ ]
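A minimal sketch for checking that the dependencies are present before running the rest of the notebook. The pip package names are assumptions (in particular `llama-index-tools-scrapegraphai`), so verify them on PyPI; `missing_packages` is a hypothetical helper that only reports which import names cannot be found in the current environment.

```python
import importlib.util

# Import name -> pip package name.
# ASSUMPTION: these pip names are not given in the text; verify them on PyPI.
REQUIRED = {
    "llama_index": "llama-index",
    "llama_index.tools.scrapegraph": "llama-index-tools-scrapegraphai",
    "pydantic": "pydantic",
    "pandas": "pandas",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    missing = []
    for module_name, pip_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is itself missing.
            found = False
        if not found:
            missing.append(pip_name)
    return missing


print("Missing packages:", missing_packages() or "none")
```

Anything reported as missing can then be installed with `pip install <name>` from a notebook cell or a shell.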

🔑 Import ScrapeGraph API key

You can find your ScrapeGraph API key in your ScrapeGraphAI dashboard.

[ ]
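One way to load the key without hard-coding it in the notebook is to read it from an environment variable and fall back to a hidden prompt. This is a sketch under assumptions: the variable name `SGAI_API_KEY` and the helper `load_api_key` are illustrative and not taken from the text.

```python
import os
from getpass import getpass

# ASSUMPTION: the environment variable name is illustrative, not from the text.
API_KEY_ENV_VAR = "SGAI_API_KEY"


def load_api_key(env_var=API_KEY_ENV_VAR, prompt_fn=getpass):
    """Return the key from the environment, prompting once if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        key = prompt_fn("ScrapeGraph API key: ")
        os.environ[env_var] = key  # cache for later cells in the same session
    return key
```

Calling `load_api_key()` in a notebook prompts without echoing the key, which keeps it out of the saved cell output.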

📝 Defining an Output Schema for Webpage Content Extraction

If you already know what you want to extract from a webpage, you can define an output schema using Pydantic. This schema acts as a "blueprint" that tells the AI how to structure the response.

Pydantic Schema Quick Guide

Types of Schemas

  1. Simple Schema
    Use this when you want to extract straightforward information, such as a single piece of content.

from pydantic import BaseModel, Field

# Simple schema for a single webpage
class PageInfoSchema(BaseModel):
    title: str = Field(description="The title of the webpage")
    description: str = Field(description="The description of the webpage")

# Example Output JSON after AI extraction
{
    "title": "ScrapeGraphAI: The Best Content Extraction Tool",
    "description": "ScrapeGraphAI provides powerful tools for structured content extraction from websites."
}

  2. Complex Schema (Nested)
    If you need to extract structured information with multiple related items (like a list of repositories), you can nest schemas.

from pydantic import BaseModel, Field
from typing import List

# Define a schema for a single repository
class RepositorySchema(BaseModel):
    name: str = Field(description="Name of the repository (e.g., 'owner/repo')")
    description: str = Field(description="Description of the repository")
    stars: int = Field(description="Star count of the repository")
    forks: int = Field(description="Fork count of the repository")
    today_stars: int = Field(description="Stars gained today")
    language: str = Field(description="Programming language used")

# Define a schema for a list of repositories
class ListRepositoriesSchema(BaseModel):
    repositories: List[RepositorySchema] = Field(description="List of GitHub trending repositories")

# Example Output JSON after AI extraction
{
    "repositories": [
        {
            "name": "google-gemini/cookbook",
            "description": "Examples and guides for using the Gemini API",
            "stars": 8036,
            "forks": 1001,
            "today_stars": 649,
            "language": "Jupyter Notebook"
        },
        {
            "name": "TEN-framework/TEN-Agent",
            "description": "TEN Agent is a conversational AI powered by TEN, integrating Gemini 2.0 Multimodal Live API, OpenAI Realtime API, RTC, and more.",
            "stars": 3224,
            "forks": 311,
            "today_stars": 361,
            "language": "Python"
        }
    ]
}

Key Takeaways

  • Simple Schema: Perfect for small, straightforward extractions.
  • Complex Schema: Use nesting to extract lists or structured data, like "a list of repositories."

Both approaches give the AI a clear structure to follow, so the extracted content comes back in the shape you defined.

[ ]
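For the Wired science page, a nested schema along the following lines would match the fields that appear in the extraction output shown later in this notebook (`category`, `title`, `link`, `author`). The class names `NewsItemSchema` and `ListNewsSchema` are hypothetical, and the field descriptions are illustrative wording rather than the notebook's original text.

```python
from typing import List

from pydantic import BaseModel, Field


# Hypothetical schema for a single article on the page.
class NewsItemSchema(BaseModel):
    category: str = Field(description="Category or section label of the article")
    title: str = Field(description="Headline of the article")
    link: str = Field(description="URL of the article")
    author: str = Field(description="Author of the article")


# Hypothetical schema for the list of articles.
class ListNewsSchema(BaseModel):
    news: List[NewsItemSchema] = Field(description="List of news articles on the page")


# Validate one item copied from the extraction output to check the shape.
sample = ListNewsSchema(
    news=[
        {
            "category": "Science",
            "title": "The Simple Math Behind Public Key Cryptography",
            "link": "https://www.wired.com/story/how-public-key-cryptography-really-works-using-only-simple-math/",
            "author": "John Pavlus",
        }
    ]
)
print(sample.news[0].author)
```

Constructing the model from plain dictionaries raises a validation error if a field is missing, which is a quick way to confirm that a response has the expected structure.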

🚀 Initialize ScrapegraphToolSpec tools and start extraction

Here we use scrapegraph_smartscraper to extract structured data using AI from a webpage.

If you already have an HTML file, you can upload it and use scrapegraph_local_scrape instead.

You can find more info in the official llama-index documentation.

[ ]
[ ]
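The call itself might look roughly like the sketch below. The URL and prompt are taken from the response printed further down; the import path, the no-argument constructor, and the keyword arguments (`prompt`, `url`, `api_key`, `schema`) are assumptions about the llama-index integration and should be checked against its documentation. The helpers `extract_science_news` and `make_tool` are hypothetical.

```python
WIRED_SCIENCE_URL = "https://www.wired.com/category/science/"
PROMPT = "Extract the first 10 news in the page"


def extract_science_news(tool, api_key, schema=None):
    """Call scrapegraph_smartscraper on any object that provides it.

    ASSUMPTION: the keyword arguments (prompt, url, api_key, schema) are not
    confirmed by the text; check them against the llama-index documentation.
    """
    return tool.scrapegraph_smartscraper(
        prompt=PROMPT,
        url=WIRED_SCIENCE_URL,
        api_key=api_key,
        schema=schema,
    )


def make_tool():
    """Build the real tool. ASSUMPTION: this import path may differ by version."""
    # Imported lazily so the sketch loads even without the integration installed.
    from llama_index.tools.scrapegraph.base import ScrapegraphToolSpec

    return ScrapegraphToolSpec()
```

With the integration installed, `response = extract_science_news(make_tool(), api_key, schema=...)` would perform the request. Passing the tool in as an argument also lets the function be exercised with a stub, without network access or API credits.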

Print the response

[ ]
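A sketch of how the response could be printed, assuming it is a plain dictionary. The `response` value here is a trimmed stand-in with a single item, and `format_response` is a hypothetical helper. With `json.dumps` defaults, non-ASCII characters such as the curly apostrophe are written as `\u2019` escapes.

```python
import json

# Trimmed, illustrative stand-in for the real response (one item only).
response = {
    "status": "completed",
    "website_url": "https://www.wired.com/category/science/",
    "result": {
        "news": [
            {
                "category": "WIRED World",
                "title": "Transforming the Moon Into Humanity\u2019s First Space Hub",
                "link": "https://www.wired.com/story/moon-humanity-industrial-space-hub/",
                "author": "Saurav Shroff",
            }
        ]
    },
    "error": "",
}


def format_response(data):
    """Pretty-print with two-space indentation; non-ASCII is escaped by default."""
    return json.dumps(data, indent=2)


print("Science News:")
print(format_response(response))
```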
Science News:
{
  "request_id": "5369dc4a-ddd3-4e7d-9938-741f6388c804",
  "status": "completed",
  "website_url": "https://www.wired.com/category/science/",
  "user_prompt": "Extract the first 10 news in the page",
  "result": {
    "news": [
      {
        "category": "WIRED World",
        "title": "Transforming the Moon Into Humanity\u2019s First Space Hub",
        "link": "https://www.wired.com/story/moon-humanity-industrial-space-hub/",
        "author": "Saurav Shroff"
      },
      {
        "category": "WIRED World",
        "title": "How Do You Live a Happier Life? Notice What Was There All Along",
        "link": "https://www.wired.com/story/happiness-habituation-experiment-in-living/",
        "author": "Tali Sharot"
      },
      {
        "category": "Year In Review",
        "title": "24 Things That Made the World a Better Place in 2024",
        "link": "https://www.wired.com/story/24-things-that-made-the-world-a-better-place-in-2024-good-news/",
        "author": "Rob Reddick"
      },
      {
        "category": "Health",
        "title": "Healthier Cities Will Require a Strong Dose of Nature",
        "link": "https://www.wired.com/story/healthier-cities-will-require-a-strong-dose-of-nature/",
        "author": "Kathy Willis"
      },
      {
        "category": "Health",
        "title": "There\u2019s Still Time to Get Ahead of the Next Global Pandemic",
        "link": "https://www.wired.com/story/global-pandemic-public-health-lessons-preparedness/",
        "author": "Caitlin Rivers"
      },
      {
        "category": "Health",
        "title": "Give Your Social Health a Decent Workout",
        "link": "https://www.wired.com/story/social-health-relationships-community/",
        "author": "Kasley Killam"
      },
      {
        "category": "Science",
        "title": "The World\u2019s First Crispr Drug Gets a Slow Start",
        "link": "https://www.wired.com/story/the-worlds-first-crispr-drug-gets-a-slow-start-sickle-cell-beta-thalassemia-vertex/",
        "author": "Emily Mullin"
      },
      {
        "category": "Environment",
        "title": "To Improve Your Gut Microbiome, Spend More Time in Nature",
        "link": "https://www.wired.com/story/to-improve-your-gut-microbiome-spend-more-time-in-nature-kathy-willis/",
        "author": "Kathy Willis"
      },
      {
        "category": "Environment",
        "title": "This Tropical Virus Is Spreading Out of the Amazon to the US and Europe",
        "link": "https://www.wired.com/story/this-tropical-virus-is-spreading-out-of-the-amazon-to-the-us-and-europe/",
        "author": "Geraldine Castro"
      },
      {
        "category": "Environment",
        "title": "How Christmas Trees Could Become a Source of Low-Carbon Protein",
        "link": "https://www.wired.com/story/how-christmas-trees-could-become-a-source-of-low-carbon-protein/",
        "author": "Alexa Phillips"
      },
      {
        "category": "Environment",
        "title": "Creating a Global Package to Solve the Problem of Plastics",
        "link": "https://www.wired.com/story/global-plastics-treaty-united-nations/",
        "author": "Susan Solomon"
      },
      {
        "category": "Climate",
        "title": "December Wildfires Are Now a Thing",
        "link": "https://www.wired.com/story/december-wildfires-are-now-a-thing/",
        "author": "Kylie Mohr"
      },
      {
        "category": "Climate",
        "title": "Generative AI and Climate Change Are on a Collision Course",
        "link": "https://www.wired.com/story/true-cost-generative-ai-data-centers-energy/",
        "author": "Sasha Luccioni"
      },
      {
        "category": "Climate",
        "title": "Climate Change Is Destroying Monarch Butterflies\u2019 Winter Habitat",
        "link": "https://www.wired.com/story/global-warming-threatens-the-monarch-butterfly-sanctuary-but-this-scientist-prepares-a-new-home-for-them/",
        "author": "Andrea J. Arratibel"
      },
      {
        "category": "Politics",
        "title": "More Humanitarian Organizations Will Harness AI\u2019s Potential",
        "link": "https://www.wired.com/story/humanitarian-organizations-artificial-intelligence/",
        "author": "David Miliband"
      },
      {
        "category": "Energy",
        "title": "Electric Vehicle Charging Is Going to Get Political",
        "link": "https://www.wired.com/story/electric-vehicle-charging-is-going-to-get-political/",
        "author": "Aarian Marshall"
      },
      {
        "category": "Politics",
        "title": "Big Tech Will Scour the Globe in Its Search for Cheap Energy",
        "link": "https://www.wired.com/story/big-tech-data-centers-cheap-energy/",
        "author": "Azeem Azhar"
      },
      {
        "category": "Environment",
        "title": "Humans Will Continue to Live in an Age of Incredible Food Waste",
        "link": "https://www.wired.com/story/food-production-energy-waste/",
        "author": "Vaclav Smil"
      },
      {
        "category": "Energy",
        "title": "A Uranium-Mining Boom Is Sweeping Through Texas",
        "link": "https://www.wired.com/story/a-uranium-mining-boom-is-sweeping-through-texas-nuclear-energy/",
        "author": "Dylan Baddour"
      },
      {
        "category": "Science",
        "title": "A Spacecraft Is About to Fly Into the Sun\u2019s Atmosphere for the First Time",
        "link": "https://www.wired.com/story/parker-solar-probe-atmosphere/",
        "author": "Eric Berger, Ars Technica"
      },
      {
        "category": "Science",
        "title": "What\u2019s the Winter Solstice, Anyway?",
        "link": "https://www.wired.com/story/winter-solstice/",
        "author": "Reece Rogers"
      },
      {
        "category": "Physics and Math",
        "title": "Viewers of Quantum Events Are Also Subject to Uncertainty",
        "link": "https://www.wired.com/story/in-the-quantum-world-even-points-of-view-are-uncertain/",
        "author": "Anil Ananthaswamy"
      },
      {
        "category": "Science",
        "title": "How Does a Movie Projector Show the Color Black?",
        "link": "https://www.wired.com/story/how-does-a-movie-projector-show-the-color-black/",
        "author": "Rhett Allain"
      },
      {
        "category": "Science",
        "title": "Why Can\u2019t You Switch Seats in an Empty Airplane?",
        "link": "https://www.wired.com/story/why-cant-you-switch-seats-in-an-empty-airplane/",
        "author": "Rhett Allain"
      },
      {
        "category": "Science",
        "title": "The Simple Math Behind Public Key Cryptography",
        "link": "https://www.wired.com/story/how-public-key-cryptography-really-works-using-only-simple-math/",
        "author": "John Pavlus"
      },
      {
        "category": "Biotech",
        "title": "A Third Person Has Received a Transplant of a Genetically Engineered Pig Kidney",
        "link": "https://www.wired.com/story/a-third-person-has-received-a-transplant-of-a-genetically-engineered-pig-kidney/",
        "author": "Emily Mullin"
      },
      {
        "category": "Biotech",
        "title": "Muscle Implants Could Allow Mind-Controlled Prosthetics--No Brain Surgery Required",
        "link": "https://www.wired.com/story/amputees-could-control-prosthetics-with-just-their-thoughts-no-brain-surgery-required-phantom-neuro/",
        "author": "Emily Mullin"
      },
      {
        "category": "Biotech",
        "title": "Combining AI and Crispr Will Be Transformational",
        "link": "https://www.wired.com/story/combining-ai-and-crispr-will-be-transformational/",
        "author": "Jennifer Doudna"
      },
      {
        "category": "Biotech",
        "title": "Neuralink Plans to Test Whether Its Brain Implant Can Control a Robotic Arm",
        "link": "https://www.wired.com/story/neuralink-robotic-arm-controlled-by-mind/",
        "author": "Emily Mullin"
      },
      {
        "category": "Health",
        "title": "Meet the Next Generation of Doctors--and Their Surgical Robots",
        "link": "https://www.wired.com/story/next-generation-doctors-surgical-robots/",
        "author": "Neha Mukherjee"
      },
      {
        "category": "Health",
        "title": "AI Is Building Highly Effective Antibodies That Humans Can\u2019t Even Imagine",
        "link": "https://www.wired.com/story/labgenius-antibody-factory-machine-learning/",
        "author": "Amit Katwala"
      },
      {
        "category": "Psychology and Neuroscience",
        "title": "The Race to Translate Animal Sounds Into Human Language",
        "link": "https://www.wired.com/story/artificial-intelligence-translation-animal-sounds-human-language/",
        "author": "Arik Kershenbaum"
      },
      {
        "category": "Psychology and Neuroscience",
        "title": "An Uncertain Future Requires Uncertain Prediction Skills",
        "link": "https://www.wired.com/story/embrace-uncertainty-forecasting-prediction-skills/",
        "author": "David Spiegelhalter"
      },
      {
        "category": "Psychology and Neuroscience",
        "title": "These Rats Learned to Drive--and They Love It",
        "link": "https://www.wired.com/story/these-rats-learned-to-drive-and-they-love-it/",
        "author": "Kelly Lambert"
      },
      {
        "category": "Psychology and Neuroscience",
        "title": "Scientists Are Unlocking the Secrets of Your \u2018Little Brain\u2019",
        "link": "https://www.wired.com/story/cerebellum-brain-movement-feelings/",
        "author": "R Douglas Fields"
      },
      {
        "category": "Year in Review",
        "title": "Beyond Meat Says Being Attacked Has Just Made It Stronger",
        "link": "https://www.wired.com/story/beyond-meat-hits-back-against-the-haters-ethan-brown/",
        "author": "Matt Reynolds"
      },
      {
        "category": "Mental Health",
        "title": "How to Manage Food Anxiety Over the Holidays",
        "link": "https://www.wired.com/story/how-to-cope-with-food-anxiety-during-the-festive-season/",
        "author": "Alison Fixsen"
      },
      {
        "category": "extended trip",
        "title": "NASA Postpones Return of Stranded Starliner Astronauts to March",
        "link": "https://www.wired.com/story/boeing-starliner-astronauts-stranded-until-march-nasa/",
        "author": "Fernanda Gonz\u00e1lez"
      },
      {
        "category": "Public Health",
        "title": "CDC Confirms First US Case of Severe Bird Flu",
        "link": "https://www.wired.com/story/cdc-confirms-first-us-case-of-severe-bird-flu/",
        "author": "Emily Mullin"
      },
      {
        "category": "Health",
        "title": "Mega-Farms Are Driving the Threat of Bird Flu",
        "link": "https://www.wired.com/story/mega-farms-are-driving-the-threat-of-bird-flu/",
        "author": "Georgina Gustin"
      },
      {
        "category": "Health",
        "title": "RFK Plans to Take on Big Pharma. It\u2019s Easier Said Than Done",
        "link": "https://www.wired.com/story/rfks-plan-to-take-on-big-pharma/",
        "author": "Emily Mullin"
      },
      {
        "category": "Health",
        "title": "Designer Babies Are Teenagers Now--and Some of Them Need Therapy Because of It",
        "link": "https://www.wired.com/story/your-next-job-designer-baby-therapist/",
        "author": "Emi Nietfeld"
      },
      {
        "category": "Economics",
        "title": "US Meat, Milk Prices Should Spike if Donald Trump Carries Out Mass Deportation Schemes",
        "link": "https://www.wired.com/story/us-meat-milk-prices-should-spike-if-donald-trump-carries-out-mass-deportation-schemes/",
        "author": "Matt Reynolds"
      },
      {
        "category": "Health",
        "title": "An Augmented Reality Program Can Help Patients Overcome Parkinson\u2019s Symptoms",
        "link": "https://www.wired.com/story/lining-up-tech-to-help-banish-tremors-strolll-parkinsons/",
        "author": "Grace Browne"
      },
      {
        "category": "Environment",
        "title": "Meet the Plant Hacker Creating Flowers Never Seen (or Smelled) Before",
        "link": "https://www.wired.com/story/meet-the-plant-hacker-creating-flowers-never-seen-or-smelled-before/",
        "author": "Matt Reynolds"
      },
      {
        "category": "Environment",
        "title": "These 3 Things Are Standing in the Way of a Global Plastics Treaty",
        "link": "https://www.wired.com/story/these-3-things-are-standing-in-the-way-of-a-global-plastics-treaty/",
        "author": "Steve Fletcher and Samuel Winton"
      },
      {
        "category": "Education",
        "title": "Everyone Is Capable of Mathematical Thinking--Yes, Even You",
        "link": "https://www.wired.com/story/everyone-is-capable-of-mathematical-thinking-yes-even-you/",
        "author": "Kelsey Houston-Edwards"
      },
      {
        "category": "Public Health",
        "title": "A Mysterious Respiratory Disease Has the Democratic Republic of the Congo on High Alert",
        "link": "https://www.wired.com/story/drc-mysterious-respiratory-disease-children-who-africa/",
        "author": "Marta Musso"
      },
      {
        "category": "Health",
        "title": "Skip the Sea Kelp Supplements",
        "link": "https://www.wired.com/story/pass-on-sea-kelp-supplements/",
        "author": "Boutayna Chokrane"
      }
    ]
  },
  "error": ""
}

💾 Save the output to a CSV file

Let's create a pandas DataFrame and display the extracted content as a table.

[ ]
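A minimal sketch of the conversion, assuming the response is a dictionary whose `result.news` entry is a list of flat records as in the output above. The `response` value is a two-item stand-in and `news_to_dataframe` is a hypothetical helper.

```python
import pandas as pd

# Two items copied from the extraction output, as a stand-in for the full response.
response = {
    "result": {
        "news": [
            {
                "category": "WIRED World",
                "title": "Transforming the Moon Into Humanity\u2019s First Space Hub",
                "link": "https://www.wired.com/story/moon-humanity-industrial-space-hub/",
                "author": "Saurav Shroff",
            },
            {
                "category": "WIRED World",
                "title": "How Do You Live a Happier Life? Notice What Was There All Along",
                "link": "https://www.wired.com/story/happiness-habituation-experiment-in-living/",
                "author": "Tali Sharot",
            },
        ]
    }
}


def news_to_dataframe(data):
    """Build a table with one row per article and one column per field."""
    return pd.DataFrame(data["result"]["news"])


df = news_to_dataframe(response)
print(df[["category", "title", "author"]])
```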

Save it to CSV

[ ]
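Saving the table could be sketched as follows. The one-row `df` is a stand-in for the full table, `save_news_csv` is a hypothetical helper, and dropping the index with `index=False` is a design choice here rather than something the text specifies.

```python
import pandas as pd

# One row copied from the extraction output, as a stand-in for the full table.
df = pd.DataFrame(
    [
        {
            "category": "Climate",
            "title": "December Wildfires Are Now a Thing",
            "link": "https://www.wired.com/story/december-wildfires-are-now-a-thing/",
            "author": "Kylie Mohr",
        }
    ]
)


def save_news_csv(table, path="wired_news.csv"):
    """Write the table without the numeric index column and report the path."""
    table.to_csv(path, index=False)
    print(f"Data saved to {path}")
    return path
```

Calling `save_news_csv(df)` writes `wired_news.csv` to the current working directory and prints a confirmation message.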
Data saved to wired_news.csv

🔗 Resources

Made with ❤️ by the ScrapeGraphAI Team