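Web Scraping with MechanicalSoup

Python, How To, Scraping, September 12, 2024, 5 min read

Web scraping has become an essential tool in the digital age, especially for web developers, data analysts, and digital marketers. Imagine being able to extract valuable information from websites quickly and efficiently. This is where MechanicalSoup comes in. This guide explores the ins and outs of web scraping with MechanicalSoup and offers practical tips to help you get started.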

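The Role of MechanicalSoup in Web Scraping

MechanicalSoup is a Python library that simplifies web scraping by providing a straightforward interface for automating interaction with web pages. It handles forms and links, and it can navigate sites that require basic user actions such as submitting forms and following links. It does not execute JavaScript, which makes it best suited to automating tasks on sites with static content that do not require complex user behavior.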

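Getting Started: Setting Up MechanicalSoup for Web Scraping

Before diving into the specifics of web scraping, let's set up MechanicalSoup. Installation is straightforward and takes only a few steps.

Installing MechanicalSoup

To install MechanicalSoup, you need Python installed on your machine. You can then install MechanicalSoup with pip, Python's package installer. Open a terminal and run the following command:

pip install mechanicalsoup

Setting Up the Environment

Once MechanicalSoup is installed, set up your development environment. You will need a code editor, such as Visual Studio Code or PyCharm, to write and run Python scripts. Make sure the "BeautifulSoup" and "requests" libraries are installed as well; pip normally pulls both in as dependencies of MechanicalSoup.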

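First Steps with MechanicalSoup

Creating your first web scraping script with MechanicalSoup takes a few basic steps. Start by importing the library and initializing a browser object. Here is a simple example to get you started:

import mechanicalsoup

# Create a browser object that tracks the current page and cookies
browser = mechanicalsoup.StatefulBrowser()
# Fetch and parse the page
browser.open("https://www.scrapethissite.com/pages/")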
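As a quick check that the page actually loaded, the steps above can be wrapped in a small helper. This is a minimal sketch rather than part of the original example: the function name open_and_describe and the keys it returns are hypothetical. It relies only on the response returned by open(), the url property, and the parsed page in browser.page. The browser is passed in as a parameter, so the helper itself needs no imports.

```python
def open_and_describe(browser, url):
    """Open url with the given browser and return a few facts about the result.

    `browser` is expected to behave like mechanicalsoup.StatefulBrowser.
    The function name and the returned keys are illustrative only.
    """
    response = browser.open(url)
    # open() returns a requests.Response; fail early on 4xx/5xx status codes
    response.raise_for_status()

    # browser.page is the BeautifulSoup tree of the page that was just loaded
    title_tag = browser.page.find("title")
    return {
        "url": browser.url,
        "status": response.status_code,
        "title": title_tag.get_text(strip=True) if title_tag else None,
    }
```

With a real browser, this would be called as open_and_describe(mechanicalsoup.StatefulBrowser(), "https://www.scrapethissite.com/pages/").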

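Understanding the Basics of Web Scraping with MechanicalSoup

Now that MechanicalSoup is set up, let's explore the basics of web scraping. Understanding these fundamentals will let you build more complex scraping scripts.

Handling Forms

In MechanicalSoup, the "select_form()" method is used to locate and work with forms.
The argument to select_form() is a CSS selector. In the code example below, we use the example site to fill in a simple single-field search form. Because the page in our case contains only one form, browser.select_form() with no argument is enough. Otherwise, you would have to pass a CSS selector to the select_form() method. To see the fields on the form, you can use the print_summary() method, which prints the HTML element of each field. Given that the form contains two elements, a text field and a button, we only need to fill in the text field and then submit the form: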

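import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.scrapethissite.com/pages/forms/?page_num=1")

# Select the form (the page has only one, so no CSS selector is needed)
search_form = browser.select_form()

# print_summary() prints the fields itself and returns None
search_form.print_summary()
# Fill in the text field named "q"
search_form.set("q", "test")

# Submit the selected form
browser.submit_selected()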

Here is the output printed by the code above:

<input class="form-control" id="q" name="q" placeholder="Search for Teams" type="text"/>
<input class="btn btn-primary" type="submit" value="Search"/>
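When a page contains more than one form, the same steps apply with an explicit CSS selector. The sketch below shows one way to package them. The function name submit_search, its parameters, and the default field name "q" are assumptions made for illustration; the browser is expected to behave like a mechanicalsoup.StatefulBrowser and is passed in by the caller.

```python
def submit_search(browser, url, query, form_selector="form", field_name="q"):
    """Open url, fill one text field of the chosen form, and submit it.

    `browser` is expected to behave like mechanicalsoup.StatefulBrowser.
    form_selector is a CSS selector; the default "form" picks the first form.
    """
    browser.open(url)
    # Pass a more specific selector when the page contains several forms
    form = browser.select_form(form_selector)
    form.set(field_name, query)
    response = browser.submit_selected()
    response.raise_for_status()
    # After submitting, browser.url is the address of the results page
    return browser.url
```

Keeping the selector and field name as parameters means the same helper can be pointed at a different form without changing its body.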

Handling Pagination

Web scraping often involves working through multiple pages of data. MechanicalSoup does not provide a built-in way to step through pages using pagination links, so we have to follow those links ourselves.
On the example site we are using, the pagination links are rendered as an HTML list: a "<ul>" element with the class "pagination", in which each "<li>" wraps an "<a>" link to one page.

So we first select the list that holds the pagination links with "browser.page.select_one('ul.pagination')".
Then ".select('li')[1::]" selects every "<li>" element inside the pagination list, starting from the second one. This returns a list of "<li>" elements, which we iterate over in a for loop. For each "<li>" element we extract the "<a>" tag and pass it to the "follow_link()" method to navigate to that page.
Here is the full example:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.scrapethissite.com/pages/forms/?page_num=1")

# Walk the pagination items, skipping the first <li>
for link in browser.page.select_one('ul.pagination').select('li')[1:]:
    next_page_link = link.select_one('a')
    # Follow the <a> tag to load that page
    browser.follow_link(next_page_link)
    print(browser.url)
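A variation on this loop is to collect the page URLs first and visit them afterwards, which keeps link extraction separate from navigation. The sketch below is one possible way to do that, not the approach used above. The helper names collect_pagination_urls and visit_all are hypothetical, and the default selector assumes the "ul.pagination" structure described earlier.

```python
from urllib.parse import urljoin


def collect_pagination_urls(browser, link_selector="ul.pagination li a"):
    """Return the absolute URLs of the pagination links on the current page.

    `browser` is expected to behave like mechanicalsoup.StatefulBrowser with a
    page already opened. Duplicates are dropped and the order is preserved.
    """
    urls = []
    for anchor in browser.page.select(link_selector):
        href = anchor.get("href")
        if not href:
            continue  # skip <a> tags without a target
        # Resolve relative links against the address of the current page
        absolute = urljoin(browser.url, href)
        if absolute not in urls:
            urls.append(absolute)
    return urls


def visit_all(browser, urls):
    """Open each URL in turn and return the URLs the browser ended up on."""
    visited = []
    for url in urls:
        browser.open(url)
        visited.append(browser.url)
    return visited
```

Because the list of URLs is built before any navigation happens, it can also be filtered or logged before the pages are requested.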

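Setting Up Proxies

When scraping websites or automating web interactions, using proxies can be crucial for bypassing geo-restrictions, managing rate limits, or preventing IP bans. MechanicalSoup works together with the "requests" library, so a proxy configuration can be set on a requests session and passed to the browser. Here is how to set up proxies in MechanicalSoup for your web scraping tasks: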

import mechanicalsoup
import requests

def create_proxy_browser():
    # Define your proxy configuration (example values)
    # requests expects proxy URLs in the form scheme://user:password@host:port
    proxies = {
        "http": "http://username:password@rp.proxyscrape.com:6060",
        "https": "http://username:password@rp.proxyscrape.com:6060",
    }

    # Create a session object with proxy settings
    session = requests.Session()
    session.proxies.update(proxies)

    # Optionally, you can add headers or other session settings here
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    })

    # Create a MechanicalSoup StatefulBrowser using the configured session
    browser = mechanicalsoup.StatefulBrowser(session=session)
    return browser

# Usage
browser = create_proxy_browser()
response = browser.open("https://www.scrapethissite.com/pages/forms/?page_num=1")
print(response.text)  # Outputs the content of the page
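Credentials that contain characters such as "@" or ":" can break a hand-written proxy URL. The following sketch builds the URL from its parts and percent-encodes the credentials. The helper names build_proxy_url and build_proxies are hypothetical, and the "http" scheme is an assumption that should be checked against what your proxy provider specifies.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username, password, scheme="http"):
    """Build a proxy URL in the scheme://user:password@host:port form.

    Username and password are percent-encoded so that characters such as
    "@" or ":" inside the credentials do not break the URL.
    """
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


def build_proxies(host, port, username, password):
    """Return a proxies mapping that routes both HTTP and HTTPS traffic."""
    proxy_url = build_proxy_url(host, port, username, password)
    return {"http": proxy_url, "https": proxy_url}
```

The resulting dictionary can be passed to session.proxies.update() in place of the hard-coded example values.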

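Ethical and Legal Considerations for Web Scraping

Web scraping can raise ethical and legal concerns. Understanding these considerations is essential to avoid potential problems.

Respecting Website Policies

Always review a website's terms of service before scraping. Some websites explicitly prohibit scraping, while others may have specific guidelines. Ignoring these policies can lead to legal consequences.

Avoiding Server Overload

Frequent requests to a website can overload its server and disrupt its operation. Adding delays between requests and respecting the site's "robots.txt" file helps avoid this. Here is how to add a delay:

import time

time.sleep(2)  # Pause for 2 seconds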
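The two recommendations above, honoring "robots.txt" and pausing between requests, can be combined in a short sketch. It uses Python's standard urllib.robotparser module. The helper names allowed_urls and crawl_politely are hypothetical, the robots.txt text is assumed to have been downloaded beforehand, and the browser is expected to behave like a mechanicalsoup.StatefulBrowser.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Keep only the URLs that the given robots.txt text allows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(browser, urls, delay_seconds=2.0):
    """Open each URL with a pause between consecutive requests."""
    visited = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        browser.open(url)
        visited.append(url)
    return visited
```

A fixed delay is the simplest choice; a longer or randomized pause may be more appropriate for small sites.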

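Data Privacy

Make sure the data you collect does not violate privacy regulations such as GDPR. Handle personal information with care and collect it only when necessary.

Conclusion

Web scraping with MechanicalSoup offers web developers, data analysts, and digital marketers a powerful and flexible solution. By following the steps outlined in this guide, you can extract valuable data from websites, automate repetitive tasks, and gain a competitive edge in your field.

Whether you are a seasoned professional or just starting out, MechanicalSoup gives you the tools you need to succeed. Remember to always consider the ethical and legal aspects, follow best practices, and keep improving your skills.

Ready to take your web scraping skills to the next level? Start experimenting with MechanicalSoup today and unlock the full potential of web data extraction. Happy scraping!

By ProxyScrape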