ScrapegraphAI: Powering Web Scraping with LLMs

Guides, Python, Scraping, Sep-27-2024, 5 min read

Web scraping has evolved from simple rule-based extraction to more advanced techniques that rely on large language models (LLMs) for context-aware data extraction. ScrapegraphAI is at the forefront of this evolution, enabling web scraping through LLMs from providers such as OpenAI and Gemini, as well as local models served through Ollama. In this blog, we'll dive into what ScrapegraphAI is, how it works, and walk through a real-world example of scraping data from a website with proxy integration.

What Will You Learn?

In this blog, we will cover:

  • What ScrapegraphAI is and how it works
  • The basic usage of ScrapegraphAI for scraping websites
  • How to integrate proxies for better performance
  • A hands-on example using OpenAI’s GPT-4o-mini model to extract book data from the website Books to Scrape

What is ScrapegraphAI and How It Works

ScrapegraphAI is an open-source web scraping framework that uses large language models to extract data from websites dynamically. Unlike traditional scrapers that rely on rigid CSS selectors or XPath, ScrapegraphAI uses LLMs to interpret and extract structured data from a wide range of sources, including dynamic web pages and files such as PDFs. You simply specify the information you're after and ScrapegraphAI does the heavy lifting, which makes it more flexible and lower-maintenance than traditional scraping tools.

A key feature of ScrapegraphAI is that it lets you define a schema for the data you want to extract. You specify a structured format for the output, and ScrapegraphAI adjusts the extracted data to match it.
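
To make the schema idea concrete, here is a minimal standard-library sketch of what matching a record to a declared shape means. It does not call ScrapegraphAI: the Book dataclass and the conform_to_schema helper are hypothetical names we introduce purely for illustration. ScrapegraphAI itself expects the schema to be passed to the graph (its documentation shows Pydantic models for this), so check the docs for the exact form.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Book:
    """Hypothetical target schema for one extracted book."""
    name: str
    price: Optional[str] = None
    availability: Optional[str] = None


def conform_to_schema(raw: dict) -> Book:
    """Keep only the fields declared on Book; missing optional fields become None."""
    allowed = {f.name for f in fields(Book)}
    return Book(**{key: value for key, value in raw.items() if key in allowed})


# The undeclared "rating" key is dropped; "availability" falls back to None.
record = conform_to_schema(
    {"name": "A Light in the Attic", "price": "£51.77", "rating": "unknown"}
)
print(record)
```

The point of the sketch is only that a schema fixes which fields appear in the output, regardless of what extra details the page happens to contain.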

One of the standout features of ScrapegraphAI is its flexibility in choosing LLMs, with support for:

  • OpenAI’s GPT models like GPT-3.5 and GPT-4o-mini
  • Gemini for more specific use cases
  • Local models using Ollama for cost-effective, private scraping solutions
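
ScrapegraphAI graphs take a configuration dictionary with an "llm" entry, as the full example later in this post shows. The sketch below illustrates how switching providers could come down to swapping that entry. The build_llm_config helper is our own illustrative name, and the Ollama model name and base_url are assumptions (a local Ollama server on its default port with a llama3 model pulled). Gemini should follow the same pattern, but check the ScrapegraphAI documentation for the exact model prefix.

```python
def build_llm_config(provider: str, api_key: str = "") -> dict:
    """Return a graph_config dict for the chosen provider (model names are assumptions)."""
    if provider == "openai":
        llm = {"api_key": api_key, "model": "openai/gpt-4o-mini"}
    elif provider == "ollama":
        # Assumption: Ollama is running locally on its default port 11434.
        llm = {"model": "ollama/llama3", "base_url": "http://localhost:11434"}
    else:
        raise ValueError(f"unsupported provider: {provider}")
    return {"llm": llm}


print(build_llm_config("ollama"))
```

Keeping the provider choice in one small function means the rest of the scraping code stays the same when you move from a hosted model to a local one.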

Key Scraping Pipelines

ScrapegraphAI offers several standard scraping pipelines to fit various needs. Some of the most common ones include:

  • SmartScraperGraph: A single-page scraper that only needs a user prompt and an input source (website or local file).
  • SearchGraph: Extracts information from the top n search results of a search engine.
  • SpeechGraph: Scrapes data from a page and generates an audio file from the results.
  • ScriptCreatorGraph: Scrapes a single page and generates a Python script for future extractions.
  • SmartScraperMultiGraph: Scrapes data from multiple pages using a single prompt and a list of URLs.
  • ScriptCreatorMultiGraph: Similar to the previous one but generates Python scripts for multi-page scraping.
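
As a rough sketch of the multi-page pipeline from the list above, the code below builds a list of listing-page URLs and hands them to SmartScraperMultiGraph. The catalogue_urls and scrape_pages helpers are our own illustrative names, the page-N.html pattern is an assumption about how Books to Scrape paginates its catalogue, and the constructor arguments follow the prompt/source/config pattern with a list of URLs as the source. Verify these against the version of ScrapegraphAI you install.

```python
def catalogue_urls(pages: int) -> list[str]:
    """Build URLs for the first `pages` listing pages (assumed URL pattern)."""
    base = "https://books.toscrape.com/catalogue/page-{}.html"
    return [base.format(n) for n in range(1, pages + 1)]


def scrape_pages(prompt: str, pages: int, config: dict):
    """Run one prompt over several listing pages (requires scrapegraphai)."""
    # Imported lazily so the URL helper works without the library installed.
    from scrapegraphai.graphs import SmartScraperMultiGraph

    graph = SmartScraperMultiGraph(
        prompt=prompt,
        source=catalogue_urls(pages),
        config=config,
    )
    return graph.run()


print(catalogue_urls(3))
```

Each extra page means more LLM calls, so it is worth starting with a small page count while you tune the prompt.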

In the next section, we’ll focus on the SmartScraperGraph, which allows for single-page scraping by simply providing a prompt and a source URL.

Basic Usage of ScrapegraphAI

Prerequisites

To follow along, you need to install a few dependencies. Run the two commands below; the second one downloads the browsers that Playwright uses to load pages:

pip install scrapegraphai openai python-dotenv
playwright install

  • scrapegraphai: This is the core package for ScrapegraphAI.
  • openai: We'll use OpenAI’s GPT-4o-mini model for scraping.
  • python-dotenv: This will allow us to securely load environment variables like API keys from a .env file.

Once you’ve installed these, make sure you have your OpenAI API Key ready. Store it in a .env file to keep your credentials secure:

OPENAI_APIKEY=your_openai_api_key
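
A missing or misspelled key in the .env file usually only surfaces later as an authentication error from the API. A small guard like the one below can fail earlier with a clearer message. The require_env helper is a hypothetical name of ours, not part of python-dotenv or ScrapegraphAI.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value


# Typical use, after load_dotenv() has run:
# openai_key = require_env("OPENAI_APIKEY")
```

Call it right after load_dotenv() so the script stops before any scraping or LLM request is attempted.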

Code Example: Scraping Data from Books to Scrape

Let’s say we want to extract information about all the books on Books to Scrape, including:

  • Book Name
  • Price
  • Availability
  • Reviews

Here’s a code example using ScrapegraphAI’s SmartScraperGraph pipeline:

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load the OpenAI API key from .env file
load_dotenv()
openai_key = os.getenv("OPENAI_APIKEY")

# Define configuration for the LLM
graph_config = {
   "llm": {
      "api_key": openai_key,
      "model": "openai/gpt-4o-mini",
   },
}

prompt = """
Extract all the books from this website including
- Book Name
- Price
- Availability 
- Reviews
"""

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
   prompt=prompt,
   source="https://books.toscrape.com/",
   config=graph_config
)


# Run the scraper and print the result
if __name__ == '__main__':
   result = smart_scraper_graph.run()
   print(result)

Explanation of the Code:

  • LLM Configuration: We configure ScrapegraphAI to use OpenAI's GPT-4o-mini model by providing the API key and specifying the model name.
  • Prompt: The user-defined prompt instructs the AI to extract information from the website about each book, including the name, price, availability, and reviews.
  • Source URL: We provide the URL of the website we want to scrape.
  • Running the Scraper: The run() method starts the scraping process and returns the extracted data, which we then print. In this example, the data contains one dictionary per book with the requested details.

Example Output

Here’s an example of what the output might look like:

{'Book Name': 'A Light in the Attic', 'Price': '£51.77', 'Availability': 'In stock', 'Reviews': 'NA'},
{'Book Name': 'Tipping the Velvet', 'Price': '£53.74', 'Availability': 'In stock', 'Reviews': 'NA'},
{'Book Name': 'Soumission', 'Price': '£50.10', 'Availability': 'In stock', 'Reviews': 'NA'},
{'Book Name': 'Sharp Objects', 'Price': '£47.82', 'Availability': 'In stock', 'Reviews': 'NA'},
# ... more books ...

As you can see, the scraper successfully pulls details for each book in a structured format, ready for use in your data pipeline.
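
To feed these records into a pipeline, you will usually want numeric prices and a portable file format. The sketch below uses only the standard library and two rows copied from the example output above. The parse_price and to_csv helpers are illustrative names, and the code assumes every row has exactly the four keys shown. Since LLM output can vary between runs, treat that as an assumption to validate.

```python
import csv
import io

FIELDS = ["Book Name", "Price", "Availability", "Reviews"]

rows = [
    {"Book Name": "A Light in the Attic", "Price": "£51.77", "Availability": "In stock", "Reviews": "NA"},
    {"Book Name": "Tipping the Velvet", "Price": "£53.74", "Availability": "In stock", "Reviews": "NA"},
]


def parse_price(text: str) -> float:
    """Convert a price string such as '£51.77' to a float."""
    return float(text.replace("£", "").strip())


def to_csv(books: list[dict]) -> str:
    """Serialize book rows to CSV text with a numeric price column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for book in books:
        writer.writerow({**book, "Price": parse_price(book["Price"])})
    return buffer.getvalue()


print(to_csv(rows))
```

Writing to an in-memory buffer keeps the example self-contained; in a real script you would pass an open file to csv.DictWriter instead.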

Proxy Integration

When scraping at scale or targeting websites with anti-scraping measures, integrating proxies becomes essential to avoid IP bans, CAPTCHAs, and rate limiting. Using proxies not only provides anonymity but also helps you scrape large amounts of data with fewer interruptions.

One of the best options for this is residential proxies, as they come from real residential IP addresses, making them harder to detect and block.

Residential proxies from ProxyScrape are perfect for web scraping scenarios, especially when targeting websites with strict anti-scraping measures. We offer rotating IP addresses from various locations, ensuring that your requests appear as if they are coming from real users. This helps to bypass restrictions, evade bans, and ensure continuous access to the data you need.

Now let’s see how proxies are integrated with ScrapegraphAI:

from dotenv import load_dotenv
import os
from scrapegraphai.graphs import SmartScraperGraph

# Load the OpenAI API key from .env file
load_dotenv()
openai_key = os.getenv("OPENAI_APIKEY")

# Define the configuration with proxy integration
graph_config = {
   "llm": {
      "api_key": openai_key,
      "model": "openai/gpt-4o-mini",
   },
   "loader_kwargs": {
      "proxy": {
         "server": "rp.proxyscrape.com:6060",
         "username": "your_username",
         "password": "your_password",
      },
   },
}

prompt = """
Extract all the books from this website including
- Book Name
- Price
- Availability 
- Reviews
"""

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
   prompt=prompt,
   source="https://books.toscrape.com/",
   config=graph_config
)

# Run the scraper and print the result
if __name__ == '__main__':
   result = smart_scraper_graph.run()
   print(result)

Explanation of Proxy Integration:

  • Proxy Configuration: The proxy is set under the loader_kwargs key in the graph_config. Here, you define your proxy server address, username, and password.
  • Effect: The requests ScrapegraphAI makes to load the target page are routed through the specified proxy server, which helps in bypassing restrictions or avoiding IP bans on the target website.
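
Hardcoding proxy credentials in the script makes them easy to leak, so one option is to read them from the same .env file as the API key. The sketch below builds the loader_kwargs block from environment variables. The proxy_from_env helper and the PROXY_SERVER, PROXY_USERNAME, and PROXY_PASSWORD variable names are hypothetical choices of ours; only the server, username, and password keys come from the configuration shown above.

```python
import os


def proxy_from_env() -> dict:
    """Build the loader_kwargs proxy block from environment variables."""
    return {
        "proxy": {
            # Falls back to the server address used in the example above.
            "server": os.getenv("PROXY_SERVER", "rp.proxyscrape.com:6060"),
            # os.environ[...] raises KeyError if the credential is missing.
            "username": os.environ["PROXY_USERNAME"],
            "password": os.environ["PROXY_PASSWORD"],
        }
    }


# Typical use, after load_dotenv() has run:
# graph_config["loader_kwargs"] = proxy_from_env()
```

Failing with a KeyError on a missing credential is deliberate here: it is better for the script to stop than to scrape without the proxy you intended to use.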

Conclusion

In this blog, we explored the power of ScrapegraphAI, a modern web scraping tool that uses large language models (LLMs) to extract structured data from websites intelligently. We walked through its key features, including various scraping pipelines like the SmartScraperGraph, and provided a practical example of scraping book data from a website using OpenAI’s GPT-4o-mini model.

Also, we showed how to integrate proxies, especially ProxyScrape's residential proxies. Proxies are crucial for staying anonymous, bypassing restrictions, and maintaining data access, especially with sites that use anti-scraping tactics like IP bans or rate limits.

By integrating ProxyScrape's residential proxies, you ensure your web scraping activities are more efficient, secure, and scalable, even on the most challenging websites.