The Ultimate XPath Cheat Sheet. How to Easily Write Powerful Selectors

Mihai Maxim, December 16, 2022

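XPath Cheat Sheet?

Have you ever needed to write a CSS selector that does not depend on class names? If not, consider yourself lucky. If so, our XPath Cheat Sheet is just what you need. The web is full of data. Entire businesses depend on pulling some of that data together to bring new services to the world. APIs are very useful, but not every website has an open API. Sometimes you have to get what you need the old-fashioned way: you have to build a scraper for the website. Modern websites counter scraping by renaming their CSS classes, so it is better to write selectors that rely on something more stable. In this article, you will learn how to write selectors based on the layout of the page's DOM nodes.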

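What is XPath and how can I try it out?

XPath stands for XML Path Language. It uses a path notation (as in URLs) to provide a flexible way of pointing to any part of an XML document.

XPath is mainly used in XSLT, but it can also be used as a much more powerful way of navigating through the DOM of any XML-like language document, such as HTML and SVG, using XPathExpression, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features. XPath | MDN (mozilla.org)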

Path notation?

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>

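There are two kinds of paths: relative and absolute.

The unique (or absolute) path to My third paragraph. is /html/body/div/div/p

A relative path to My third paragraph. is //body/div/div/p
My Second Heading => //body/div/h2
My first paragraph. => //body/p

Notice that I am using //body. Relative paths use // to jump straight to the desired element.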

The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.

For example, //div/p returns both My second paragraph. and My third paragraph.
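If you would rather check this from a script than from the browser, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. Two assumptions to keep in mind: ElementTree implements only a small subset of XPath, and it evaluates paths relative to an element, so "./body/..." stands in for "/html/body/..." and ".//" stands in for the leading "//".

```python
import xml.etree.ElementTree as ET

# The sample page from above, minus the DOCTYPE line.
PAGE = """
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# Full path from the root: matches exactly one <p>.
third = [p.text for p in root.findall("./body/div/div/p")]

# "Any <div>, then a <p> child": matches every occurrence in the document.
every_div_p = [p.text for p in root.findall(".//div/p")]

print(third)        # ['My third paragraph.']
print(every_div_p)  # ['My second paragraph.', 'My third paragraph.']
```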

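You can test this example in your browser to get a better feel for it!

Paste the code into an .html file and open it in your browser. Open the developer tools, press Ctrl+F in the Elements tab, paste the XPath locator into the small input bar, and press Enter.

You can also right-click any tag in the Elements tab and choose "Copy XPath" to get the XPath of that tag.

Notice how I switch between "My second paragraph." and "My third paragraph."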

Also, another important thing to know is that it is not necessary for a path to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.

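/html/body/div/div/p is no longer a unique path: it now matches both <p> tags in the last <div>.

If you have followed along this far, congratulations: you are on the right track to mastering XPath. Now you are ready for the fun part.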

Brackets

You can use square brackets to select specific elements.

In this case, //body/div/div[2]/p[3] only selects the last <p> tag.
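Here is a small sketch of positional brackets, again with the standard-library ElementTree (paths start with "." because ElementTree evaluates them relative to an element). The page below is hypothetical and made up for this sketch; it is not the page from the screenshots.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second nested <div> holds three <p> tags.
PAGE = """
<html>
<body>
    <div>
        <div>
            <p>only child</p>
        </div>
        <div>
            <p>first</p>
            <p>second</p>
            <p>third</p>
        </div>
    </div>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Positions are 1-based: div[2] is the second <div> child, p[3] the third <p>.
picked = [p.text for p in root.findall("./body/div/div[2]/p[3]")]

# last() gives the same result here without hard-coding the index.
last_p = [p.text for p in root.findall("./body/div/div[2]/p[last()]")]

print(picked)  # ['third']
print(last_p)  # ['third']
```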

Attributes

You can also use attributes to select elements.

//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and have the "not-important" class.

//div[@id] => select all the <div> tags that have an id attribute.

//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"

//div[@class="p-children" and @id="important"]/p[3] => same as above

//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"

Note that @ marks the start of an attribute name.
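A quick sketch of attribute predicates, using the standard-library ElementTree on a hypothetical page made up for this example. ElementTree only understands the chained form [@a="x"][@b="y"]; the "and" / "or" forms need a full XPath engine, so they are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, invented for this sketch.
PAGE = """
<html>
<body>
    <p class="not-important">skip me</p>
    <div class="p-children">
        <p>lonely child</p>
    </div>
    <div class="p-children" id="important">
        <p>I am the first child</p>
        <p>I am the second child</p>
        <p>I am the third child</p>
    </div>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# All <p> tags inside <body> that have class="not-important".
not_important = [p.text for p in root.findall("./body//p[@class='not-important']")]

# All <div> tags that carry an id attribute, whatever its value.
with_id = [d.get("id") for d in root.findall(".//div[@id]")]

# Third <p> inside the <div> that has both the class and the id.
third = [
    p.text
    for p in root.findall(".//div[@class='p-children'][@id='important']/p[3]")
]

print(not_important)  # ['skip me']
print(with_id)        # ['important']
print(third)          # ['I am the third child']
```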

Functions

XPath provides a set of useful functions that you can use inside the brackets.

position() => returns the index of the element
Ex: //body/div[position()=1] selects the first <div> in the <body>

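last() => returns the position of the last element in the current context (the context size)
Ex: //div/p[last()] selects the last <p> child of every <div> tag

count(node-set) => returns the number of nodes in the node-set
Ex: count(//body/div) returns the number of child <div> tags inside the <body>. The form //body/count(div) is only valid in XPath 2.0 and later, which browsers do not support.

node() or * => node() matches any node (elements, text nodes, comments), while * matches elements only
Ex: //div/node() selects all the child nodes of all the <div> tags; //div/* selects only their child elements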

text() => returns the text of the element
Ex: //p/text() returns the text of all the <p> elements

concat(string1, string2) => merges string1 and string2

contains(@attribute, "value") => returns true if @attribute contains "value" 
Ex:
 //p[contains(text(),"I am the third child")] selects all the <p> tags that have the "I am the third child" text value.

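starts-with(@attribute, "value") => returns true if @attribute starts with "value"
ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later only; not available in XPath 1.0 engines such as browsers)

substring(@attribute, start_index, length) => returns the substring of the attribute value that starts at start_index (1-based) and is length characters long
Ex:
//p[substring(text(),3,12)="am the third"] => true if text() = "I am the third child"

normalize-space() => acts like text(), but strips leading and trailing whitespace and collapses inner runs of whitespace into a single space
Ex: normalize-space(" example ") = "example"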

string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have the text length of 20
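To see several of these functions in action outside the browser, here is a hedged sketch. It assumes the third-party lxml package is installed (the standard-library ElementTree does not implement XPath functions), and the page is hypothetical, made up for this example.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical page, invented for this sketch.
PAGE = """
<html>
<body>
    <div class="p-children" id="important">
        <p>I am the first child</p>
        <p>I am the second child</p>
        <p>I am the third child</p>
    </div>
    <div>
        <p class="not-important">  padded   text  </p>
    </div>
</body>
</html>
"""

root = etree.fromstring(PAGE.strip())

# count() returns a number (a float in XPath 1.0).
div_count = root.xpath("count(//body/div)")

# last() picks the last <p> child of each <div>.
last_ps = root.xpath("//div/p[last()]/text()")

# contains() and substring() both select the third paragraph here.
by_contains = root.xpath('//p[contains(text(),"third")]/text()')
by_substring = root.xpath('//p[substring(text(),3,12)="am the third"]/text()')

# normalize-space() trims the ends and collapses inner whitespace.
trimmed = root.xpath('normalize-space(//p[@class="not-important"])')

# string-length() with no argument measures the text of the context node.
long_ps = root.xpath("//p[string-length()=21]/text()")

print(div_count)     # 2.0
print(last_ps)       # ['I am the third child', '  padded   text  ']
print(by_contains)   # ['I am the third child']
print(by_substring)  # ['I am the third child']
print(trimmed)       # padded text
print(long_ps)       # ['I am the second child']
```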

These functions can be a bit hard to remember. Luckily, The Ultimate XPath Cheat Sheet comes with a helpful example:

//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]

//p[text()=<expression_return_value>] will select all the <p> elements that have the text value equal to the return value of the condition.

//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".

If there is only one <p> tag that satisfies this condition, then we can pass the return_value to the substring function.

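substring(return_value,1,15) returns the first 15 characters of the return_value string.

substring(text(),16,20) returns up to 20 characters starting at position 16 of the text() value that we used in //p[text()=<expression_return_value>]. For a 20-character text, those are its last 5 characters.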

Finally, concat() will merge the two substrings and create the return value of <expression_return_value>.
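The walkthrough above can be tried on a tiny page. This is only a sketch: it assumes the third-party lxml package is installed, and the page and its texts are hypothetical. Note that the <p class="not-important"> tag always matches its own expression, because its first 15 characters plus its remainder simply rebuild its own text.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical page: two paragraphs share their first 15 characters.
PAGE = """
<html>
<body>
    <p class="not-important">I am the third wheel</p>
    <p>I am the first child</p>
    <p>I am the third child</p>
</body>
</html>
"""

root = etree.fromstring(PAGE.strip())

# First 15 characters of the "not-important" paragraph.
prefix = root.xpath('substring(//p[@class="not-important"]/text(),1,15)')

# Keep each <p> whose text equals: shared prefix + its own tail from position 16.
matches = root.xpath(
    "//p[text()=concat("
    'substring(//p[@class="not-important"]/text(),1,15), '
    "substring(text(),16,20))]/text()"
)

print(repr(prefix))  # 'I am the third '
print(matches)       # ['I am the third wheel', 'I am the third child']
```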

Path nesting

XPath supports path nesting. That's cool, but what exactly do I mean by path nesting?

Let's try something new: /html/body/div[./div[./p]]

You can read it as "Select all the <div> children of the <body> that have a <div> child, where that child is itself the parent of a <p> element."

If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]

This now translates to "Select all the div children of the body that have a <p> descendant"

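In this example, /html/body/div[./div[./p]] and /html/body/div[.//p] yield the same result.

By now, I am sure you are wondering what those dots in ./ and .// are all about.

The dot represents the context node (self). When used inside a pair of brackets, it refers to the specific tag that opened the brackets. Let's dig a little deeper.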

In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">

/html/body/div[.//p] translates to

   /html/body/div[1][/html/body/div[1]//p]
and /html/body/div[2][/html/body/div[2]//p].

/html/body/div[2][/html/body/div[2]//p] is true, so /html/body/div[2] is returned.

In our case, the dot ensures that /html/body/div and /html/body/div//p refer to the same <div>

Now, let's see what happens if we leave the dot out.

/html/body/div[/html/body/div//p] would return both 
<div class="no-content">  and <div class="content">

Why? Because /html/body/div//p is an absolute path, so it evaluates to the same (true) result for both /html/body/div[1] and /html/body/div[2].

/html/body/div[/html/body/div//p] actually translates to "Select all the <div> children of the <body> if /html/body/div//p is true."

/html/body/div//p is true if the <body> has a <div> child and that child has a <p> descendant. In our case, this statement is always true.
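The difference the dot makes is easy to reproduce. The sketch below assumes the third-party lxml package is installed and uses a hypothetical page with one empty <div> and one <div> that wraps a nested <p>; classes() is a helper made up for this example.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical page, invented for this sketch.
PAGE = """
<html>
<body>
    <div class="no-content"></div>
    <div class="content">
        <div>
            <p>Some text</p>
        </div>
    </div>
</body>
</html>
"""

root = etree.fromstring(PAGE.strip())


def classes(query):
    """Return the class attribute of every element the query selects."""
    return [el.get("class") for el in root.xpath(query)]


# With the dot, the nested path is checked against each candidate <div>.
nested = classes("/html/body/div[./div[./p]]")
any_depth = classes("/html/body/div[.//p]")

# Without the dot, the predicate is a document-wide test that is always true.
no_dot = classes("/html/body/div[/html/body/div//p]")

print(nested)     # ['content']
print(any_depth)  # ['content']
print(no_dot)     # ['no-content', 'content']
```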

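Unfortunately, other XPath cheat sheets don't mention nesting at all. I think nesting is awesome. It lets you scan the document for different patterns before returning anything. The only downside is that queries written this way can become hard to follow. Fortunately, there are other ways.

Axes

You can use axes to locate nodes relative to other context nodes.

Let's explore some of them.

The four main axes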

//p/ancestor::div => selects all the divs that are ancestors of <p>

How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.

//p/parent::div => selects all the <div> tags that are parents of <p>

How I read it: Get all the <p> tags and of all their parents, if the parent is a <div>, select it.

//div/child::p=> selects all the <p> tags that are children of <div> tags.

How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.

//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.

How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.

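Now it's time to rewrite the earlier expressions:

/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div

But /html/body/div[.//p] is not equivalent to /html/body/div/p/ancestor::div

The good news is that a small tweak fixes it.

/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]. This works because ancestor is a reverse axis, so [last()] picks the outermost <div> ancestor of each <p>.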
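These equivalences can be checked with a short script. As before, this is a sketch that assumes the third-party lxml package is installed; the page and the classes() helper are hypothetical.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical page: the <p> sits two <div> levels below the <body>.
PAGE = """
<html>
<body>
    <div class="no-content"></div>
    <div class="content">
        <div class="inner">
            <p>Some text</p>
        </div>
    </div>
</body>
</html>
"""

root = etree.fromstring(PAGE.strip())


def classes(query):
    """Return the class attribute of every element the query selects."""
    return [el.get("class") for el in root.xpath(query)]


ancestors = classes("//p/ancestor::div")  # every <div> above the <p>
parents = classes("//p/parent::div")      # only the direct parent

# Nested form versus its axis-based rewrite.
nested = classes("/html/body/div[./div[./p]]")
via_parent = classes("/html/body/div/div/p/parent::div/parent::div")

# The naive rewrite only looks at <p> tags that are direct children.
naive = classes("/html/body/div/p/ancestor::div")

# [last()] on the reverse ancestor axis keeps the outermost <div>.
tweaked = classes("/html/body/div//p/ancestor::div[last()]")

print(ancestors)   # ['content', 'inner']
print(parents)     # ['inner']
print(nested)      # ['content']
print(via_parent)  # ['content']
print(naive)       # []
print(tweaked)     # ['content']
```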

Other important axes

//p/following-sibling::span => for each <p> tag, select its following <span> siblings.

//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.

//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.

In our example, //title/following::span selects all the <span> tags in the document.

//p/preceding::div => selects all the <div> tags that appear in the DOM before any <p> tag. But it ignores ancestors, attribute nodes and namespace nodes.

In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no_content">.

Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor for them. As I mentioned, the preceding axis ignores ancestors.

<div class="p-children"> is selected because it is not an ancestor for the <p> tags inside <div class="p-children" id="important">
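Here is a compact sketch of the sibling, following and preceding axes. It assumes the third-party lxml package is installed, and it runs on a hypothetical page that is much smaller than the one in the screenshots, so the selected elements differ from the ones listed above.

```python
from lxml import etree  # third-party; assumed installed (pip install lxml)

# Hypothetical page, invented for this sketch.
PAGE = """
<html>
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <div class="no-content"></div>
    <div class="content">
        <span>before</span>
        <p>Some text</p>
        <span>after</span>
    </div>
</body>
</html>
"""

root = etree.fromstring(PAGE.strip())

# Siblings share the same parent as the <p>.
following_sib = root.xpath("//p/following-sibling::span/text()")
preceding_sib = root.xpath("//p/preceding-sibling::span/text()")

# following:: covers everything after the <title> in document order.
after_title = root.xpath("//title/following::span/text()")

# preceding:: skips ancestors, so <div class="content"> is left out.
preceding_divs = [d.get("class") for d in root.xpath("//p/preceding::div")]

print(following_sib)   # ['after']
print(preceding_sib)   # ['before']
print(after_title)     # ['before', 'after']
print(preceding_divs)  # ['no-content']
```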

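Summary

Congratulations, you made it. You have added a brand-new tool to your selector toolbox! If you are building a web scraper or automating web tests, this XPath Cheat Sheet will come in handy! If you are looking for a smoother way to traverse the DOM, you have come to the right place. Either way, XPath is worth a try. Who knows, you may find even more use cases for it.
Are you interested in web scraping? Check out WebScrapingAPI - Contact Us. If you want to scrape the web, we will be happy to help. In the meantime, consider trying WebScrapingAPI - Product for free.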