Mihai Maxim | Last updated on Mar 31, 2026 | 6 min read

The Ultimate XPath Cheat Sheet. How to Easily Write Powerful Selectors.

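XPath cheat sheet?

Have you ever needed to write a CSS selector that doesn't depend on class names? If your answer is no, you are lucky. If it is yes, our XPath cheat sheet is exactly what you need. The web is full of data, and many businesses exist by aggregating that data into new services. APIs are very useful, but not every website offers an open one. Sometimes you have to get the data the old-fashioned way, by building a scraper for the site. Modern websites fight scraping by renaming their CSS classes, so it is wiser to write selectors that rely on something more stable. In this article, you will learn how to write selectors based on the layout of the page's DOM nodes.

What is XPath and how can I try it?

XPath stands for XML Path Language. It uses a path notation (similar to URLs) to provide a flexible way of addressing any part of an XML document.

"XPath is mainly used in XSLT, but can also be used as a much more powerful way of navigating through the DOM of any XML-like language document using XPathExpression, such as HTML and SVG, instead of relying on the Document.getElementById() or Document.querySelectorAll() methods, the Node.childNodes properties, and other DOM Core features." (XPath | MDN, mozilla.org)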

Path notation?

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>

There are two types of paths: relative and absolute.

The unique path (or absolute path) to "My third paragraph." is /html/body/div/div/p

A relative path to "My third paragraph." is //body/div/div/p
For "My Second Heading" => //body/div/h2
For "My first paragraph." => //body/p

Notice that I'm using //body. Relative paths use // to skip straight to the desired element, wherever it sits in the document.

The usage of //<path> also implies that it should look for all occurrences of <path> in the document, regardless of what came before <path>.

For example, //div/p returns both "My second paragraph." and "My third paragraph."
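The same paths can also be tried outside the browser. Below is a minimal sketch that uses Python's standard-library xml.etree.ElementTree, which only supports a limited XPath subset. It assumes the sample page is well-formed XML (the DOCTYPE line is left out), and because ElementTree paths start at the root <html> element, "./body/..." stands in for "/html/body/..." and ".//" stands in for "//". The helper name texts is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample page from above, minus the DOCTYPE line.
PAGE = """
<html lang="en">
<head>
    <title>Nothing to see here</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <div>
        <h2>My Second Heading</h2>
        <p>My second paragraph.</p>
        <div>
            <h3>My Third Heading</h3>
            <p>My third paragraph.</p>
        </div>
    </div>
</body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [element.text for element in root.findall(path)]


# Stands in for /html/body/div/div/p: one fixed route from the root.
third = texts("./body/div/div/p")

# Stands in for //div/p: every <p> that is a child of any <div>.
div_paragraphs = texts(".//div/p")

print(third)           # ['My third paragraph.']
print(div_paragraphs)  # ['My second paragraph.', 'My third paragraph.']
```

A full XPath engine accepts the expressions exactly as written in the article; the rewritten prefixes here are only a concession to ElementTree.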

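You can test this example in your browser to get a better feel for it!

Paste the code into an .html file and open it in your browser. Open the developer tools and press Ctrl + F. Paste the XPath locator into the small input box and press Enter.

You can also right-click any tag in the Elements tab and choose "Copy XPath" to get its XPath.

Notice how I cycle between "My second paragraph." and "My third paragraph."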

Another important thing to know is that a path does not need to contain // in order to return multiple elements. Let's see what happens when I add another <p> in the last <div>.

/html/body/div/div/p is no longer a unique path: it now matches both <p> tags inside that <div>.

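If you have followed along this far, congratulations, you are on the right track to mastering XPath. Now you are ready to explore the more interesting stuff.

Brackets

You can use brackets to select specific elements. Positions inside brackets start at 1, not 0.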

In this case, //body/div/div[2]/p[3] only selects the last <p> tag.
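As a rough illustration, the bracket syntax can be tried with Python's standard-library ElementTree, which understands numeric positions and last(). The markup below is hypothetical (the article's own page is not reproduced here), and "./body/..." stands in for "//body/..." because ElementTree paths start at the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second inner <div> holds three <p> tags.
root = ET.fromstring("""
<html><body>
  <div>
    <div><p>a</p></div>
    <div><p>first</p><p>second</p><p>third</p></div>
  </div>
</body></html>
""".strip())

# Positions are 1-based: div[2] is the second <div> child, p[3] its third <p>.
by_index = [p.text for p in root.findall("./body/div/div[2]/p[3]")]

# last() picks the final <p> without hard-coding its position.
by_last = [p.text for p in root.findall("./body/div/div[2]/p[last()]")]

print(by_index)  # ['third']
print(by_last)   # ['third']
```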

Attributes

You can also use attributes to select elements.

//body//p[@class="not-important"] => select all the <p> tags that are inside a <body> tag and whose class attribute is exactly "not-important".

//div[@id] => select all the <div> tags that have an id attribute.

//div[@class="p-children"][@id="important"]/p[3] => select the third <p> that is within a <div> tag that has both class="p-children" and id="important"

//div[@class="p-children" and @id="important"]/p[3] => same as above

//div[@class="p-children" or @id="important"]/p[3] => select the third <p> that is within a <div> that has class="p-children" or id="important"

Notice that @ marks the start of an attribute name.
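Here is a small sketch of attribute predicates using Python's standard-library ElementTree. The markup is hypothetical and only reuses the class and id values from the examples above. ElementTree's XPath subset is limited and, as far as I know, has no "and"/"or" operators, so only the chained-bracket form of the "and" case is shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup reusing the class and id values from the examples above.
root = ET.fromstring("""
<html><body>
  <div class="p-children" id="important">
    <p>one</p><p>two</p><p>three</p>
  </div>
  <div class="p-children"><p class="not-important">skip me</p></div>
  <div><p>plain</p></div>
</body></html>
""".strip())

# //div[@id]: every <div> that has an id attribute, whatever its value.
with_id = root.findall(".//div[@id]")

# //body//p[@class="not-important"]: exact match on the attribute value.
not_important = root.findall("./body//p[@class='not-important']")

# //div[@class="p-children"][@id="important"]/p[3]: both conditions must hold.
third = root.findall(".//div[@class='p-children'][@id='important']/p[3]")

print([div.get("id") for div in with_id])  # ['important']
print([p.text for p in not_important])     # ['skip me']
print([p.text for p in third])             # ['three']
```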

Functions

XPath provides a set of useful functions that you can use inside the brackets.

position() => returns the index of the current element among the elements being filtered (starting at 1)
Ex: //body/div[position()=1] selects the first <div> in the <body>

last() => returns the number of elements being filtered, which is also the index of the last one
Ex: //div/p[last()] selects the last <p> child of every <div> tag

count(node-set) => returns the number of nodes
Ex: count(//body/div) returns the number of child <div> tags inside the <body>. The //body/count(div) form needs XPath 2.0 or later; browsers implement XPath 1.0.

node() => matches any node, including text nodes and comments
* => matches any element
Ex: //div/* selects all the child elements of all the <div> tags; //div/node() also returns their text and comment children

text() => selects the text nodes of the element
Ex: //p/text() returns the text of all the <p> elements

concat(string1, string2) => merges string1 with string2

contains(@attribute, "value") => returns true if @attribute contains "value"
Ex:
//p[contains(text(),"I am the third child")] selects all the <p> tags whose text contains "I am the third child".

starts-with(@attribute, "value") => returns true if @attribute starts with "value"
ends-with(@attribute, "value") => returns true if @attribute ends with "value" (XPath 2.0 and later, so it is not available in the browser)

substring(@attribute,start_index,length) => returns length characters of the attribute value, starting at start_index (the first character has index 1)
Ex:
//p[substring(text(),3,12)="am the third"] => returns true if text() = "I am the third child"

normalize-space() => acts like text(), but it strips leading and trailing whitespace and collapses inner runs of whitespace into a single space
Ex: normalize-space(" example ") = "example"

string-length() => returns the length of the text
Ex: //p[string-length()=20] returns all the <p> tags that have a text length of 20
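The indexing of substring() is easy to get wrong: positions are 1-based and the third argument is a length, not an end index. The sketch below models that behavior in plain Python for integer arguments only (the real function also defines rounding rules for fractional numbers), alongside a simple model of normalize-space(). The function names are made up for this example, and Python's str.split() treats a few more characters as whitespace than XPath does.

```python
def xpath_substring(text: str, start: int, length: int) -> str:
    """Model XPath 1.0 substring() for integer arguments.

    Keeps the characters whose 1-based position p satisfies
    start <= p < start + length.
    """
    first = max(start, 1)   # positions below 1 do not exist
    stop = start + length   # first position that is excluded
    if stop <= first:
        return ""
    return text[first - 1:stop - 1]


def xpath_normalize_space(text: str) -> str:
    """Trim both ends and collapse inner whitespace runs to one space."""
    return " ".join(text.split())


sentence = "I am the third child"  # 20 characters long

print(xpath_substring(sentence, 3, 12))      # am the third
print(xpath_substring(sentence, 16, 20))     # child
print(xpath_normalize_space("  example  "))  # example
```

Asking for 20 characters from position 16 simply stops at the end of the string, which is why the second call returns only the last 5 characters.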

These functions can be a bit hard to remember. Luckily, The Ultimate XPath Cheat Sheet provides a helpful example:

//p[text()=concat(substring(//p[@class="not-important"]/text(),1,15), substring(text(),16,20))]

//p[text()=<expression_return_value>] will select all the <p> elements whose text value equals the return value of the expression.

//p[@class="not-important"]/text() returns the text values of all the <p> tags that have class="not-important".

If there is only one <p> tag that satisfies this condition, then we can pass the return_value to the substring function.

substring(return_value,1,15) will return the first 15 characters of the return_value string.

substring(text(),16,20) will return up to 20 characters, starting at position 16, of the same text() value that we used in //p[text()=<expression_return_value>]. For a 20-character text, that is its last 5 characters.

Finally, concat() will merge the two substrings and create the return value of <expression_return_value>.

Path nesting

XPath supports path nesting. That's cool, but what exactly does path nesting mean?

Let's try something new: /html/body/div[./div[./p]]

You can read it as "Select all the <div> children of the <body> that have a <div> child. That child must in turn be the parent of a <p> element."

If you don't care about the parent of the <p> element, you can write: /html/body/div[.//p]

This now translates to "Select all the div children of the body that have a <p> descendant"

In this particular example, /html/body/div[./div[./p]] and /html/body/div[.//p] yield the same result.

By now, I'm sure that you are wondering what is up with those dots in ./ and .// 

The dot represents the current element. When used inside a pair of brackets, it refers to the specific tag that the brackets are filtering. Let's dive a little deeper.

In our example, /html/body/div returns two divs:
<div class="no-content"> and <div class="content">

/html/body/div[.//p] translates to:

    /html/body/div[1][/html/body/div[1]//p]
and /html/body/div[2][/html/body/div[2]//p]

/html/body/div[2][/html/body/div[2]//p] is true, so it returns /html/body/div[2] 

In our case, the dot ensures that /html/body/div and /html/body/div//p refer to the same <div>

Now let's look at what would have happened if it didn't.

/html/body/div[/html/body/div//p] would return both
<div class="no-content">  and <div class="content">

Why? Because /html/body/div//p is true for both /html/body/div[1] and /html/body/div[2].

/html/body/div[/html/body/div//p] actually translates to "Select all the <div> children of the <body> if /html/body/div//p is true".

/html/body/div//p is true if the <body> has a <div> child and that child has a <p> descendant. In our case, this statement is always true.
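The difference between the two predicates can be mimicked in Python. The sketch below is an emulation, not an XPath evaluation: the standard-library ElementTree does not accept nested paths inside brackets, so the predicate is written as an ordinary Python filter. The markup is hypothetical and only borrows the two class names mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with the two top-level <div> tags named in the text.
root = ET.fromstring("""
<html><body>
  <div class="no-content"><span>nothing here</span></div>
  <div class="content"><div><p>I am a paragraph</p></div></div>
</body></html>
""".strip())

candidates = root.findall("./body/div")  # stands in for /html/body/div

# /html/body/div[.//p]: the inner path starts at each candidate <div>,
# so every <div> is judged on its own descendants.
relative = [d.get("class") for d in candidates if d.find(".//p") is not None]

# /html/body/div[/html/body/div//p]: the inner path starts at the document
# root, so it gives the same answer no matter which <div> is being tested.
page_has_match = root.find("./body/div//p") is not None
absolute = [d.get("class") for d in candidates if page_has_match]

print(relative)  # ['content']
print(absolute)  # ['no-content', 'content']
```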

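Unfortunately, other XPath cheat sheets don't mention nesting. I think it is remarkable: it lets you scan the document for one pattern and return something else. The only downside is that queries written this way can be hard to follow. The good news is that there are other ways to do it.

You can use axes to locate nodes relative to other context nodes.

Let's explore some of them.

The four major axes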

//p/ancestor::div => selects all the divs that are ancestors of <p>

How I read it: Get all the <p> tags, for each <p> look through its ancestors. If you find <div> tags, select them.

//p/parent::div => selects all the <div> tags that are parents of <p> 

How I read it: Get all the <p> tags and of all their parents, if the parent is a <div>, select it.

//div/child::p => selects all the <p> tags that are children of <div> tags.

How I read it: Get all the <div> tags and their children, if the child is a <p>, select it.

//div/descendant::p => selects all the <p> tags that are descendants of <div> tags.

How I read it: Get all the <div> tags and their descendants, if the descendant is a <p>, select it.
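The "How I read it" descriptions translate almost word for word into code. The sketch below emulates the parent and ancestor axes in plain Python on top of the standard-library ElementTree, which has no axis syntax of its own. The markup, the parent_of map and the ancestors helper are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one <p> nested in two <div> tags, one outside any <div>.
root = ET.fromstring("""
<html><body>
  <div class="content"><div class="inner"><p>deep</p></div></div>
  <section><p>not inside a div</p></section>
</body></html>
""".strip())

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Yield the parent, grandparent, ... of element, nearest first."""
    while element in parent_of:
        element = parent_of[element]
        yield element


paragraphs = root.findall(".//p")

# //p/parent::div: look at each <p>'s parent and keep it if it is a <div>.
parent_divs = [
    parent_of[p].get("class") for p in paragraphs if parent_of[p].tag == "div"
]

# //p/ancestor::div: walk up from each <p> and keep every <div> on the way.
ancestor_divs = [
    a.get("class") for p in paragraphs for a in ancestors(p) if a.tag == "div"
]

print(parent_divs)    # ['inner']
print(ancestor_divs)  # ['inner', 'content']
```

A real XPath engine would also remove duplicates and return the matches in document order; this sketch keeps them in the order in which they were found.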

Now it's time to rewrite the earlier expressions:

/html/body/div[./div[./p]] is equivalent to /html/body/div/div/p/parent::div/parent::div

But /html/body/div[.//p] is NOT equivalent to /html/body/div//p/ancestor::div

The good news is that we can tweak it a little bit.

/html/body/div//p/ancestor::div[last()] is equivalent to /html/body/div[.//p]

Other important axes

//p/following-sibling::span => for each <p> tag, select its following <span> siblings.

//p/preceding-sibling::span => for each <p> tag, select its preceding <span> siblings.

//title/following::span => selects all the <span> tags that appear in the DOM after the <title>.

In our example, //title/following::span selects all the <span> tags in the document.

//p/preceding::div => selects all the <div> tags that appear in the DOM before any <p> tag. But it ignores ancestors, attribute nodes and namespace nodes.

In our case, //p/preceding::div only selects <div class="p-children"> and <div class="no_content">.

Most of the <p> tags are in <div class="content">, but this <div> is not selected because it is a common ancestor for them. As I mentioned, the preceding axis ignores ancestors.

<div class="p-children"> is selected because it is not an ancestor of the <p> tags inside <div class="p-children" id="important">.

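Wrapping up

Congratulations, you made it. You have added a brand new tool to your selector toolbox! Whether you are building a web scraper or automating web tests, this XPath cheat sheet will come in handy. If you are looking for a smoother way to traverse the DOM, you are in the right place. Either way, XPath is worth a try. Who knows, maybe you will discover even more use cases for it. Does the concept of web scraping sound interesting to you? You can reach us here: WebScrapingAPI - Contact. If you want to scrape the web, we will gladly assist you along the way. In the meantime, consider trying WebScrapingAPI for free: WebScrapingAPI - Product.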

About the author
Mihai Maxim, Full Stack Developer @ WebScrapingAPI

Mihai Maxim is a Full Stack Developer at WebScrapingAPI. He contributes across product areas and helps build reliable tools and features for the platform.