eBay is one of the world's largest online marketplaces, hosting millions of products across all kinds of categories, which makes it a rich and valuable target for scraping product data.
In this guide, we'll show you how to build a simple Python script that searches for a keyword, extracts product details such as title, price, currency, availability, reviews, and rating, and saves the data to a CSV file. This tutorial is a great fit for beginners who want to learn web scraping the right way, with tips on respecting terms of service and using proxies responsibly.
If you just want the full implementation, here is the complete Python script for scraping product details from eBay using proxies. Copy and paste it into your environment to get started:
import re
import csv
import time
import requests
from bs4 import BeautifulSoup
proxies = {
"http": "http://username:[email protected]:6060",
"https": "http://username:[email protected]:6060",
}
def get_product_information(product_url) -> dict:
    r = requests.get(product_url, proxies=proxies)
    soup = BeautifulSoup(r.text, features="html.parser")

    product_title = soup.find("h1", {"class": "x-item-title__mainTitle"}).text
    product_price = soup.find("div", {"class": "x-price-primary"}).text.split(" ")[-1]
    currency = soup.find("div", {"class": "x-price-primary"}).text.split(" ")[0]

    # Locate the element that holds the available quantity of the product
    quantity_available = soup.find("div", {"class": "x-quantity__availability"})
    if quantity_available is not None:
        # Use a regex to check whether the text contains a string like "5 available"
        match = re.search(r"\d+\savailable", quantity_available.text)
        if match is not None:
            # Extract the number by splitting the matched string on the space and taking the first element
            quantity_available = match.group().split(" ")[0]
        else:
            quantity_available = "NA"
    else:
        quantity_available = "NA"

    total_reviews = soup.find("span", {"class": "ux-summary__count"})
    if total_reviews is not None:
        total_reviews = total_reviews.text.split(" ")[0]
    else:
        total_reviews = "NA"

    rating = soup.find("span", {"class": "ux-summary__start--rating"})
    if rating is not None:
        rating = rating.text
    else:
        rating = "NA"

    product_info = {
        "product_url": product_url,
        "title": product_title,
        "product_price": product_price,
        "currency": currency,
        "availability": quantity_available,
        "nr_reviews": total_reviews,
        "rating": rating,
    }
    return product_info
def save_to_csv(products, csv_file_name="products.csv"):
    # Write the list of dictionaries to a CSV file
    with open(csv_file_name, mode='w', newline='') as csv_file:
        # Create a csv.DictWriter object
        writer = csv.DictWriter(csv_file, fieldnames=products[0].keys())
        # Write the header (keys of the dictionary)
        writer.writeheader()
        # Write the rows (values of the dictionaries)
        writer.writerows(products)
    print(f"Data successfully written to {csv_file_name}")
def main(keyword_to_search: str):
    products = []

    r = requests.get(f"https://www.ebay.com/sch/i.html?_nkw={keyword_to_search}", proxies=proxies)
    soup = BeautifulSoup(r.text, features="html.parser")

    # Skip the first two entries, which are placeholders rather than actual product results
    for item in soup.find_all("div", {"class": "s-item__info clearfix"})[2::]:
        item_url = item.find("a").get("href")
        product_info: dict = get_product_information(item_url)
        print(product_info)
        # Add a 2-second delay between requests to avoid overloading the server and reduce the risk of being blocked
        time.sleep(2)
        products.append(product_info)

    # Save the collected data to a CSV file
    save_to_csv(products)


if __name__ == '__main__':
    keywords = "laptop bag"
    main(keywords)
Remember to update the proxies variable with your own username and password before using the script.
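Before launching the full scraper, it can help to confirm that traffic is actually routed through the proxy. A minimal check (a sketch; https://api.ipify.org is just one convenient IP-echo service) is to request your apparent IP and verify it is not your own address:

import requests

# Placeholder credentials; substitute your own ProxyScrape username and password
proxies = {
    "http": "http://username:[email protected]:6060",
    "https": "http://username:[email protected]:6060",
}

# If the proxy is active, this prints the proxy's exit IP rather than your own
r = requests.get("https://api.ipify.org", proxies=proxies, timeout=10)
print("Requests are leaving via:", r.text)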
Our approach streamlines the process, focusing on four key parts: setting up the environment, configuring the proxies, fetching and parsing the search results, and extracting and saving the product details.
Starting with the right tools is crucial. You'll need to set up a project directory, a virtual environment, and the required packages:
mkdir ebay_scraping
cd ebay_scraping
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
pip install requests bs4
In this example, we'll use rotating ProxyScrape residential proxies to stay anonymous and protect our private IP from being blacklisted.
First, let's import the libraries we need for this web scraping project:

import re
import csv
import time
import requests
from bs4 import BeautifulSoup
As mentioned above, we'll use rotating ProxyScrape residential proxies in this tutorial, but you can use other proxies or no proxies at all.
proxies = {
"http": "http://username:[email protected]:6060",
"https": "http://username:[email protected]:6060",
}
First, let's walk through the search process used in this tutorial. We'll query eBay's search URL with the "laptop bag" keyword, which looks like this: https://www.ebay.com/sch/i.html?_nkw=laptop+bag

We'll send a request to this URL with requests.get(). Once we receive the response, we'll parse the HTML content using BeautifulSoup (bs4) to extract each product's URL. Each product link lives inside a <div> element with the class s-item__info clearfix. To extract these links, we use BeautifulSoup (bs4) to find all matching <div> elements, then iterate over each one, locate its <a> element, and read the href attribute, which contains the product URL.
def main(keyword_to_search: str):
    products = []

    r = requests.get(f"https://www.ebay.com/sch/i.html?_nkw={keyword_to_search}", proxies=proxies)
    soup = BeautifulSoup(r.text, features="html.parser")

    for item in soup.find_all("div", {"class": "s-item__info clearfix"})[2::]:
        item_url = item.find("a").get("href")
        product_info: dict = get_product_information(item_url)
        # Adding a 1-second delay between requests to avoid overloading the server and reduce the risk of being blocked
        time.sleep(1)
        products.append(product_info)

    # Save the collected data to a CSV file
    save_to_csv(products)
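One detail worth noting: main() interpolates the keyword directly into the URL, so multi-word searches like "laptop bag" rely on requests re-quoting the space for you. A slightly more explicit variant (a sketch, not part of the original script) passes the keyword via the params argument so requests builds and encodes the query string itself:

import requests

proxies = {
    "http": "http://username:[email protected]:6060",
    "https": "http://username:[email protected]:6060",
}

# requests percent-encodes the _nkw value, so spaces in the keyword are handled safely
r = requests.get("https://www.ebay.com/sch/i.html", params={"_nkw": "laptop bag"}, proxies=proxies)
print(r.url)  # shows the final, encoded search URL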
Next, let's introduce the get_product_information function. It takes a product URL as input, sends a request to that URL, and then uses BeautifulSoup (bs4) to parse the product information with specific selectors and regex patterns.
def get_product_information(product_url) -> dict:
    r = requests.get(product_url, proxies=proxies)
    soup = BeautifulSoup(r.text, features="html.parser")

    product_title = soup.find("h1", {"class": "x-item-title__mainTitle"}).text
    product_price = soup.find("div", {"class": "x-price-primary"}).text.split(" ")[-1]
    currency = soup.find("div", {"class": "x-price-primary"}).text.split(" ")[0]

    # Locate the element that holds the available quantity of the product
    quantity_available = soup.find("div", {"class": "x-quantity__availability"})
    if quantity_available is not None:
        # Use a regex to check whether the text contains a string like "5 available"
        match = re.search(r"\d+\savailable", quantity_available.text)
        if match is not None:
            # Extract the number by splitting the matched string on the space and taking the first element
            quantity_available = match.group().split(" ")[0]
        else:
            quantity_available = "NA"
    else:
        quantity_available = "NA"

    total_reviews = soup.find("span", {"class": "ux-summary__count"})
    if total_reviews is not None:
        total_reviews = total_reviews.text.split(" ")[0]
    else:
        total_reviews = "NA"

    rating = soup.find("span", {"class": "ux-summary__start--rating"})
    if rating is not None:
        rating = rating.text
    else:
        rating = "NA"

    product_info = {
        "product_url": product_url,
        "title": product_title,
        "product_price": product_price,
        "currency": currency,
        "availability": quantity_available,
        "nr_reviews": total_reviews,
        "rating": rating,
    }
    return product_info
Finally, we organize the parsed product attributes into a dictionary, which the function then returns.
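One caveat: the .text calls on the title and price above will raise an AttributeError whenever soup.find() returns None, which can happen if eBay changes its markup or serves a different page layout. A small defensive helper (a sketch, not part of the original script; safe_text is a name introduced here for illustration) keeps the scraper running by falling back to "NA":

from bs4 import BeautifulSoup

def safe_text(soup: BeautifulSoup, tag: str, class_name: str, default: str = "NA") -> str:
    """Return the stripped text of the first matching element, or a default if it is missing."""
    element = soup.find(tag, {"class": class_name})
    return element.text.strip() if element is not None else default

# Mirroring the selectors used above:
# product_title = safe_text(soup, "h1", "x-item-title__mainTitle")
# price_text = safe_text(soup, "div", "x-price-primary")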
Now it's time to save our results using Python's built-in csv library. The save_to_csv(products) function accepts products as input, a list of dictionaries containing product details as described earlier, and saves the data to a CSV file whose name is set by the csv_file_name parameter, which defaults to "products.csv".
def save_to_csv(products, csv_file_name="products.csv"):
    # Write the list of dictionaries to a CSV file
    with open(csv_file_name, mode='w', newline='') as csv_file:
        # Create a csv.DictWriter object
        writer = csv.DictWriter(csv_file, fieldnames=products[0].keys())
        # Write the header (keys of the dictionary)
        writer.writeheader()
        # Write the rows (values of the dictionaries)
        writer.writerows(products)
    print(f"Data successfully written to {csv_file_name}")
In this tutorial, we demonstrated how to scrape eBay by building a Python script that searches for a keyword, extracts product details, and saves the data to a CSV file. Along the way, we covered essential scraping techniques such as handling HTML elements, using proxies for anonymity, and following scraping ethics. The script could be improved further by adding pagination and the ability to handle multiple keywords.
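As one possible direction, eBay's search results accept a page-number query parameter (_pgn), so pagination could be layered on roughly like this (a sketch that reuses the script's imports and proxies variable, and assumes _pgn continues to work as a page index):

def scrape_keyword_pages(keyword_to_search: str, max_pages: int = 3) -> list:
    """Collect product URLs across several result pages for a single keyword."""
    all_item_urls = []
    for page in range(1, max_pages + 1):
        r = requests.get(
            "https://www.ebay.com/sch/i.html",
            params={"_nkw": keyword_to_search, "_pgn": page},
            proxies=proxies,
        )
        soup = BeautifulSoup(r.text, features="html.parser")
        # Skip the first two entries, as in main(), since they are not product results
        for item in soup.find_all("div", {"class": "s-item__info clearfix"})[2::]:
            all_item_urls.append(item.find("a").get("href"))
        time.sleep(1)  # keep the same polite delay between page requests
    return all_item_urls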
Always remember to scrape responsibly, respect the website's terms of service, and use techniques like rate limiting to avoid disruptions. To make your scraping tasks more reliable and efficient, consider our premium proxy services at ProxyScrape. Whether you need residential, datacenter, or mobile proxies, we've got you covered. Check out our offerings and take your web scraping projects to the next level!
Happy scraping!