Choosing the Right Selector for Web Scraping: CSS or XPath

Scraping, Differences, Feb 7, 2022 · 5 min read

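Do you know which selector is right for your web scraping project? Over the past decade, web scraping has become a popular way to extract data from the internet. It helps businesses acquire and analyze data so they can make better business decisions, and thanks to automation, it has never been easier than it is today.

Whichever tool or framework you choose, though, you have to make one important decision to ensure your scraper extracts data correctly: whether to use XPath or CSS selectors to pick out the elements of a web page. That is what you'll learn in this article.

Let's dive in with some examples.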

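What is an XPath selector?

XPath stands for XML Path Language. It uses a non-XML syntax to select a tag or a group of tags from an XML or HTML document, which is exactly what web scraping needs. XPath lets you write expressions that address an HTML or XML element directly, without walking the whole tree by hand.

To see how XPath reaches an element, let's work through some HTML. I'll assume you already know basic HTML.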

<!doctype html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
  <meta charset="utf-8">
  <title>Awesome Products at your Fingertips</title>
</head>
<body>
  <h1>Description of product features</h1>
  <p>These products are great. So let's just look at the features!</p>
  <ul id="product-list" class="basic-list">
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
  </ul>
</body>
</html>

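You can type the code above into an editor of your choice and save it as products.html. Then open it in a browser (preferably Google Chrome, since that's what we'll use for this example).

When the browser loads this code, it parses the HTML and builds a tree representation of the elements. This tree is called the DOM (Document Object Model): <html> sits at the top, <head> and <body> are its children, and the remaining elements nest beneath them.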
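To make the tree shape concrete, here is a minimal Python sketch that parses the page and prints one indented line per element. It uses the standard library's xml.etree.ElementTree, which is an XML parser and not a browser, so the sketch assumes a well-formed copy of the sample (no doctype, no namespace, self-closed <meta/>). The names PAGE, tree_lines and dom_outline are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample page (an assumption made so that an XML
# parser can read it; a browser would accept the original HTML as-is).
PAGE = """\
<html lang="en">
  <head>
    <meta charset="utf-8"/>
    <title>Awesome Products at your Fingertips</title>
  </head>
  <body>
    <h1>Description of product features</h1>
    <p>These products are great. So let's just look at the features!</p>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""


def tree_lines(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(PAGE)
dom_outline = tree_lines(root)
print("\n".join(dom_outline))
```

The printed outline starts with html, shows head and body one level in, and ends with the three li elements nested under ul.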

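The DOM is a large topic in its own right. For now, our focus is XPath: how to navigate straight to an element without traversing the entire tree. So let's start with the basic XPath terminology.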

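What is a node?

The most basic unit in XPath is the node. A node is any single item in the DOM tree you just saw; as the article goes on, you'll see that nodes cover elements, their attributes, the strings assigned to them, and so on. XPath distinguishes seven types of node in an XML or HTML page. Let's look at each one.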

Element node – These are the tags of an HTML document. In our example, the element nodes include <title>, <meta>, <body>, <h1>, <p>, <ul> and <li>.

Attribute node – Attributes attach properties to element nodes, for example to give them a unique name or to hook up CSS styles. In our example, the <ul> element has the attributes "id" and "class". More on ids and classes later.

Text node (atomic value) – This is the text inside an element. Here, the text nodes are the text inside <title>, <h1>, <p> and each of the three <li> elements.

These three are the most important, but for completeness it's worth knowing the remaining four as well.

Namespace: Since XPath also deals with XML documents, it has to cope with namespaces, which qualify element and attribute names so that names from different vocabularies don't collide. A namespace is identified by a URI (Uniform Resource Identifier). In an XHTML-style document you can declare it like this: <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">.

Comment: These are the commented-out parts of an HTML or XML document, which the parser does not treat as content.

Processing instruction: A node of the form <?target data?> that passes an instruction to the application processing the document, for example <?xml-stylesheet type="text/css" href="style.css"?> in an XML file. Processing instructions are rare in HTML pages, which pull in external CSS and JavaScript with ordinary elements: <link> for stylesheets and <script src="..."> for scripts.

Root node: This is the top of the tree. Strictly speaking, XPath's root is the document node itself, whose single element child is <html>; in everyday use, <html> is treated as the root.

How are nodes related to each other?

Parent: The node exactly one level up in the DOM tree. Each element has exactly one parent. In our example, the parent of each <li> element is <ul>.
Children: The nodes exactly one level down in the DOM tree. In our example, <h1>, <p> and <ul> are children of the <body> element.
Siblings: Nodes that share the same parent. In our example, the three <li> elements are siblings of one another because they share the parent <ul>. Likewise, <h1>, <p> and <ul> are siblings because they share the parent <body>.
Descendants: Any nodes one or more levels down. For instance, <title> is a descendant of <head>, and every <li> is a descendant of <body>.
Ancestors: Any nodes one or more levels up. In our example, the ancestors of an <li> element are <ul>, <body> and <html>.
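The five relationships above can be checked in a few lines of Python. This is a sketch under assumptions: it uses the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of the sample page, and because ElementTree elements don't store a link to their parent, it builds a child-to-parent dictionary first. The names PAGE, parent_of and ancestors are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (an assumption for this sketch).
PAGE = """\
<html>
  <head><title>Awesome Products at your Fingertips</title></head>
  <body>
    <h1>Description of product features</h1>
    <p>These products are great.</p>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# ElementTree has no parent pointer, so map every child to its parent.
parent_of = {child: parent for parent in root.iter() for child in parent}

head = root.find("head")
body = root.find("body")
first_li = root.find("body/ul/li")

# Parent: exactly one level up.
parent_tag = parent_of[first_li].tag
# Children: exactly one level down.
children_of_body = [child.tag for child in body]
# Siblings: other nodes with the same parent.
sibling_texts = [li.text for li in parent_of[first_li] if li is not first_li]
# Descendants: one or more levels down.
descendants_of_head = [node.tag for node in head.iter() if node is not head]


def ancestors(node):
    """Walk up the parent map and collect every ancestor's tag."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node.tag)
    return chain


print(parent_tag, children_of_body, sibling_texts)
print(descendants_of_head, ancestors(first_li))
```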

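How do you find the XPath of an HTML element?

There are two ways to do this: an absolute path and a relative path. Let's demonstrate with the sample code above, which you've hopefully saved to your local disk and opened in the browser.

Once the page has loaded, hover over Item 1 and right-click. Then choose "Inspect" from the menu that appears.

You can then get the full XPath by right-clicking the <li> element in the Elements panel, choosing "Copy" from the drop-down menu, and then "Copy full XPath".

Paste it into a text file or anywhere else and you'll get: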

/html/body/ul/li[1]

This is called the absolute path. The steps below show how it is built up.

Absolute path

With this method, you work your way backward, from the target element all the way to the root element. Each time you write an element, you put a forward slash in front of it. Here is how you could work out the XPath for the first <li> element by hand:

Step 1 => li[1] (the 1 denotes the first li element)
Step 2 => /li[1]
Step 3 => ul/li[1]
Step 4 => /ul/li[1]
Step 5 => body/ul/li[1]
Step 6 => /body/ul/li[1]
Step 7 => html/body/ul/li[1]
Step 8 => /html/body/ul/li[1]

Although this looks long-winded, it helps you understand how a full XPath is put together. Now let's look at the relative method.
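The backward walk from the target element to the root can be sketched in Python. This is an illustration under assumptions, not how the browser implements "Copy full XPath": it uses the standard library's xml.etree.ElementTree on a trimmed, well-formed copy of the sample page, and the helper name absolute_xpath is made up. A position such as [1] is added only when an element has same-named siblings, which matches the path shown earlier.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (an assumption for this sketch).
PAGE = """\
<html>
  <head><title>Awesome Products at your Fingertips</title></head>
  <body>
    <h1>Description of product features</h1>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""


def absolute_xpath(root, target):
    """Build an absolute XPath by walking from target up to the root."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        step = node.tag
        if parent is not None:
            same_tag = [child for child in parent if child.tag == node.tag]
            if len(same_tag) > 1:
                # XPath positions start at 1, not 0.
                step += f"[{same_tag.index(node) + 1}]"
        steps.append(step)
        node = parent
    # Steps were collected target-first, so reverse them and add the slashes.
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(PAGE)
items = root.findall("body/ul/li")
first_path = absolute_xpath(root, items[0])
third_path = absolute_xpath(root, items[2])
print(first_path)
print(third_path)
```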

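Relative path

In Chrome, choosing "Copy XPath" instead of "Copy full XPath" gives the relative form:

//*[@id="product-list"]/li[1]

As you can see, it is much shorter. Since the <li> element has no id attribute of its own, the path starts at the nearest ancestor that has one, the <ul> element with the id "product-list", and is relative to it.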
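As a quick check, the relative expression can be evaluated in Python. The sketch below assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and evaluates paths relative to an element, so the expression gets a leading dot and single quotes; the page is a trimmed, well-formed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (an assumption for this sketch).
PAGE = """\
<html>
  <body>
    <h1>Description of product features</h1>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# //*[@id="product-list"]/li[1], written the way ElementTree expects it.
first_item = root.find(".//*[@id='product-list']/li[1]")
# Dropping the [1] position selects every <li> under the list.
all_items = root.findall(".//*[@id='product-list']/li")

print(first_item.text)
print([li.text for li in all_items])
```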

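The difference between relative and absolute paths

The notable difference is that a full XPath is hard to read and hard to maintain. The other obvious problem is that an absolute XPath breaks as soon as any element between the root and the target changes. That makes the relative XPath the sensible choice.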
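The fragility of absolute paths is easy to reproduce. In this sketch, a hypothetical redesign wraps the list in a <div>; the absolute path stops matching while the id-based relative path keeps working. It assumes the standard library's xml.etree.ElementTree, which evaluates paths from the root element, so /html/body/ul/li[1] is written as ./body/ul/li[1]. The page strings and the helper first_item_text are invented for the illustration.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """\
<html><body>
  <ul id="product-list"><li>Item 1</li><li>Item 2</li></ul>
</body></html>
"""

# Hypothetical redesign: the same list, now wrapped in a <div>.
REDESIGNED = """\
<html><body>
  <div class="wrapper">
    <ul id="product-list"><li>Item 1</li><li>Item 2</li></ul>
  </div>
</body></html>
"""

# /html/body/ul/li[1], written relative to the <html> root element.
ABSOLUTE = "./body/ul/li[1]"
RELATIVE = ".//*[@id='product-list']/li[1]"


def first_item_text(page, path):
    """Return the text of the first match, or None if nothing matches."""
    match = ET.fromstring(page).find(path)
    return None if match is None else match.text


for label, page in (("original", ORIGINAL), ("redesigned", REDESIGNED)):
    print(label, first_item_text(page, ABSOLUTE), first_item_text(page, RELATIVE))
```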

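Before going further, though, let's look at XPath's pros and cons.

Pros and cons of XPath

Pros

With XPath, you don't need to worry if you don't know an element's exact name, because the contains() function lets you look for partial matches. You can also travel up the DOM when querying for the items you want to scrape.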
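A partial match with contains() looks like //li[contains(text(), "2")] in a full XPath 1.0 engine. Python's standard xml.etree.ElementTree implements only a subset of XPath and has no contains(), so the sketch below swaps in a plain Python substring test to show what such an expression selects; it is an emulation, not the XPath function itself. The page is a trimmed, well-formed copy of the sample.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (an assumption for this sketch).
PAGE = """\
<html>
  <body>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Emulates //li[contains(text(), "2")]: keep <li> elements whose text
# contains the substring "2".
text_matches = [li.text for li in root.iter("li") if "2" in (li.text or "")]

# Emulates //*[contains(@class, "basic")]: keep elements whose class
# attribute contains the substring "basic".
class_matches = [el.tag for el in root.iter() if "basic" in el.get("class", "")]

print(text_matches)
print(class_matches)
```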

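Another important advantage of XPath is that it works in older, legacy browsers such as outdated versions of Internet Explorer.

Cons

As mentioned above, its biggest drawback is that it breaks easily when an element in the path changes. It can also be harder to read than the CSS selectors covered below.

In addition, retrieving elements with XPath is generally slower than with CSS selectors.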

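What are CSS selectors?

As you know, CSS stands for Cascading Style Sheets, and it is mainly used to style the HTML elements of a web page. Styling includes setting fonts, colors and background images, aligning and positioning elements, and increasing or decreasing the space between paragraphs.

To style an HTML element, you first have to point at it, and that is the job of a CSS selector. Let's take a simple example, starting with the markup in the next section.

How are CSS selectors built?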

<h1 id="main-heading" class="header-styles" name="h1name">What are CSS Selectors?</h1>

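Here are CSS selectors for the element above:

h1 - selects by the element name
#main-heading - the # specifies the element's id
.header-styles - the dot denotes a class name
[name="h1name"] - attributes go in square brackets

You can also combine them:

h1.header-styles - selects h1 elements that have the class header-styles.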
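One way to see what each selector picks is to write it next to a path with the same meaning. Python's standard library has no CSS selector engine, so this sketch pairs each selector with a hand-written xml.etree.ElementTree path; the pairing and the extra <p> element are assumptions made for illustration, not the output of a real CSS-to-XPath converter.

```python
import xml.etree.ElementTree as ET

# The <h1> from the article plus a hypothetical <p> that shares its class.
SNIPPET = """\
<body>
  <h1 id="main-heading" class="header-styles" name="h1name">What are CSS Selectors?</h1>
  <p class="header-styles">A paragraph that shares the class.</p>
</body>
"""

# CSS selector -> an ElementTree path with the same meaning for this snippet.
# Note: @class='...' compares the whole attribute value, whereas a CSS class
# selector matches one space-separated token, so this is a simplification.
CSS_TO_PATH = {
    "h1": ".//h1",
    "#main-heading": ".//*[@id='main-heading']",
    ".header-styles": ".//*[@class='header-styles']",
    '[name="h1name"]': ".//*[@name='h1name']",
    "h1.header-styles": ".//h1[@class='header-styles']",
}

body = ET.fromstring(SNIPPET)
matched_tags = {
    css: [element.tag for element in body.findall(path)]
    for css, path in CSS_TO_PATH.items()
}
for css, tags in matched_tags.items():
    print(f"{css:20} -> {tags}")
```

The class selector on its own matches both elements, while combining it with the element name narrows the match to the <h1>.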

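Advanced selectors

The > operator selects children. The + operator selects the sibling that immediately follows an element, while the ~ operator selects all the siblings that follow it. Here are a few examples:

ul#product-list > li – selects <li> elements that are direct children of the <ul> with the id product-list.
ul#product-list li – selects <li> elements that are descendants of the <ul> with the id product-list.
ul#product-list + li – selects an <li> element that comes immediately after the <ul> with the id product-list, as its next sibling.
ul#product-list ~ li – selects all <li> elements that come after the <ul> with the id product-list, as its later siblings.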
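The difference between the four combinators shows up once the markup has a nested list and some siblings after the <ul>. The sketch below emulates them with Python's standard xml.etree.ElementTree, since the standard library has no CSS engine. The snippet is hypothetical, and the sibling examples use <p> in place of <li> because an <li> would not normally sit next to a <ul>.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a nested list plus two paragraphs after the <ul>.
SNIPPET = """\
<body>
  <ul id="product-list">
    <li>Item 1</li>
    <li>Item 2<ol><li>Item 2a</li></ol></li>
  </ul>
  <p>First paragraph</p>
  <p>Second paragraph</p>
</body>
"""

body = ET.fromstring(SNIPPET)
product_list = body.find("ul[@id='product-list']")

# ul#product-list > li : direct children only.
children = [li.text for li in product_list.findall("li")]
# ul#product-list li : descendants at any depth.
descendants = [li.text for li in product_list.findall(".//li")]

# Sibling combinators, emulated on the parent's list of children.
body_children = list(body)
siblings_after = body_children[body_children.index(product_list) + 1:]
# ul#product-list + p : only the sibling immediately after the <ul>.
adjacent = [el.text for el in siblings_after[:1] if el.tag == "p"]
# ul#product-list ~ p : every later sibling that is a <p>.
following = [el.text for el in siblings_after if el.tag == "p"]

print(children)
print(descendants)
print(adjacent)
print(following)
```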

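Pros and cons of CSS selectors

Pros

Unlike XPath, which Beautiful Soup does not support, CSS selectors are supported by most scraping libraries. They are also easier to learn and maintain than XPath. Almost every browser supports CSS, apart from legacy browsers older than Internet Explorer 8, and hardly anyone uses those today.

Cons

Even leaving old versions of Internet Explorer aside, CSS may still be rendered inconsistently across browsers.

Because CSS comes in several versions, it can confuse developers and beginners alike.

Another concern in today's web technologies is the security of CSS.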

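What is the difference between CSS selectors and XPath?

The clear difference is that XPath is bidirectional: you can traverse the DOM tree in both directions. CSS can only go from parent to child, so it flows one way.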
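Upward traversal is what the ".." step is for. The sketch below starts from a child that is easy to identify by its text and climbs to its parent and grandparent. It assumes Python 3.7 or later, where the standard library's xml.etree.ElementTree accepts the [.='text'] predicate, and it uses a trimmed, well-formed copy of the sample page.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the sample page (an assumption for this sketch).
PAGE = """\
<html>
  <body>
    <h1>Description of product features</h1>
    <ul id="product-list" class="basic-list">
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Child to parent: find the <li> whose text is "Item 2", then step up once.
parent = root.find(".//li[.='Item 2']/..")
# Two steps up reach the grandparent.
grandparent = root.find(".//li[.='Item 2']/../..")

print(parent.tag, parent.get("id"))
print(grandparent.tag)
```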

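As discussed in the earlier sections, XPath is harder to maintain and harder to read. On top of that, although XPath works in legacy browsers, the way it is handled differs from browser to browser.

So in this respect, CSS has the edge.

How should you choose between the two?

XPath stands out in specific areas, such as traversing up the DOM tree, because CSS can only travel from parent to child. When it comes to speed, CSS comes out ahead.

That said, the speed difference between XPath and CSS doesn't matter much in web scraping. Other factors weigh more heavily, network latency among them.

If you use Beautiful Soup, CSS will be your choice, since Beautiful Soup does not support XPath.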

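Conclusion

There is no single right answer to which selector you should use in a web scraping project. As you've seen in this article, XPath has the advantage in some situations, while CSS shines in others.

So you need to weigh the specific points we've discussed, such as traversal, browser support and the capabilities of your tools. We hope you'll put what you've learned to use, and stay tuned for more articles.

By ProxyScrape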