
Processing Data with XPath Expressions in Python

Date: 2022-07-21 13:42:55

XPath expressions

1. XPath syntax

The examples in this section refer to the following XML document:

    <bookstore>
      <book>
        <title lang='eng'>Harry Potter</title>
        <price>999</price>
      </book>
      <book>
        <title lang='eng'>Learning XML</title>
        <price>888</price>
      </book>
    </bookstore>

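1.1 Selecting nodes

XPath uses path expressions to select a node or a set of nodes in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.

When you pick a tag with the Chrome plugin, the plugin adds the attribute class='xh-highlight' to the selected tag. That attribute is not part of the original page, so do not rely on it in your own expressions.

The most useful expressions are listed below: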

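Expression   Description
nodename     Selects elements with this name.
/            Selects from the root node; also separates one element from the next in a path.
//           Selects matching nodes anywhere in the document below the current node, regardless of their position.
.            Selects the current node.
..           Selects the parent of the current node.
@            Selects attributes.
text()       Selects text.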

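Examples

Path expression        Result
bookstore              Selects the bookstore element.
/bookstore             Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book         Selects all book elements that are children of bookstore.
//book                 Selects all book elements, regardless of where they are in the document.
bookstore//book        Selects all book elements that are descendants of bookstore, regardless of where they sit below bookstore.
//book/title/@lang     Selects the value of the lang attribute of every title under a book.
//book/title/text()    Selects the text of every title under a book.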
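As a minimal sketch, the expressions in this table can be tried against the sample document with lxml. The variable names below are illustrative, and `etree.fromstring` is used here because the sample is XML rather than HTML.

```python
from lxml import etree

# Sample document from the start of this section.
xml = """
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>999</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>888</price>
  </book>
</bookstore>
"""

root = etree.fromstring(xml.strip())  # root is the <bookstore> element

books = root.xpath("/bookstore/book")        # absolute path from the document root
all_books = root.xpath("//book")             # book elements at any depth
langs = root.xpath("//book/title/@lang")     # attribute values, returned as strings
titles = root.xpath("//book/title/text()")   # text nodes, returned as strings

print(len(books), len(all_books))
print(langs)
print(titles)
```

Paths that end in an element name return element objects, while paths that end in `@lang` or `text()` return strings.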

The same expressions work on HTML pages:

Goal                                              Expression
Text of every h1                                  //h1/text()
href of every a tag                               //a/@href
Text of the title under html/head                 /html/head/title/text()
href of the link tag under html/head              /html/head/link/@href
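The four HTML expressions can be checked against a small page. The page content below is made up for illustration; only the XPath expressions come from the text above.

```python
from lxml import etree

# Hypothetical page, invented for this example.
page = """
<html>
  <head>
    <title>Demo page</title>
    <link rel="stylesheet" href="/static/site.css">
  </head>
  <body>
    <h1>Heading</h1>
    <a href="/a.html">A</a>
    <a href="/b.html">B</a>
  </body>
</html>
"""

doc = etree.HTML(page)  # HTML parser, tolerant of unclosed tags such as <link>

h1_text = doc.xpath("//h1/text()")
hrefs = doc.xpath("//a/@href")
page_title = doc.xpath("/html/head/title/text()")
css_href = doc.xpath("/html/head/link/@href")

print(h1_text, hrefs, page_title, css_href)
```

Every call returns a list, even when only one node matches, so a single value still has to be taken out with an index.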

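1.2 Finding specific nodes

Path expression                          Result
//title[@lang='eng']                     Selects all title elements whose lang attribute is eng.
/bookstore/book[1]                       Selects the first book element that is a child of bookstore.
/bookstore/book[last()]                  Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]                Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()>1]            Selects the book elements under bookstore, starting from the second one.
//book/title[text()='Harry Potter']      Selects the title elements under any book whose text is exactly Harry Potter.
/bookstore/book[price>35.00]/title       Selects the title elements of those book elements under bookstore whose price element has a value greater than 35.00.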

Note: in XPath, positions start at 1, not 0. The last element is at position last(), and the second-to-last is at last()-1.
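The predicates above can be sketched with lxml as follows. To make the position and price filters visible, this example assumes a third, hypothetical book ("Cheap Book", price 30) that is not part of the original sample document.

```python
from lxml import etree

# The original sample plus one hypothetical book added for illustration.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>999</price></book>
  <book><title lang="eng">Learning XML</title><price>888</price></book>
  <book><title lang="fr">Cheap Book</title><price>30</price></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

eng_titles = root.xpath("//title[@lang='eng']/text()")
first = root.xpath("/bookstore/book[1]/title/text()")               # positions start at 1
last = root.xpath("/bookstore/book[last()]/title/text()")
second_last = root.xpath("/bookstore/book[last()-1]/title/text()")
from_second = root.xpath("/bookstore/book[position()>1]/title/text()")
by_text = root.xpath("//book/title[text()='Harry Potter']")         # returns elements
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")

print(eng_titles, first, last, second_last, from_second, len(by_text), expensive)
```

The price comparison works on the text of the price element, which XPath converts to a number before comparing.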

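1.3 Selecting unknown nodes

XPath wildcards select XML elements whose names you do not know in advance.

Wildcard   Description
*          Matches any element node.
@*         Matches any attribute node.
node()     Matches a node of any type.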

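Examples

The table below lists some path expressions and their results:

Path expression   Result
/bookstore/*      Selects all child elements of the bookstore element.
//*               Selects all elements in the document.
//title[@*]       Selects all title elements that have at least one attribute.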
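A short sketch of the wildcard expressions, run against the sample document with lxml. The `//price[@*]` query is an extra, added here only to contrast with `//title[@*]`.

```python
from lxml import etree

xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>999</price></book>
  <book><title lang="eng">Learning XML</title><price>888</price></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

child_tags = [el.tag for el in root.xpath("/bookstore/*")]  # direct children only
element_count = len(root.xpath("//*"))                      # every element in the document
titles_with_attrs = root.xpath("//title[@*]")               # title elements with any attribute
prices_with_attrs = root.xpath("//price[@*]")               # price has no attributes here
title_attr_values = root.xpath("//title/@*")                # values of all attributes on title

print(child_tags, element_count, len(titles_with_attrs),
      len(prices_with_attrs), title_attr_values)
```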

1.4 Selecting several paths

The "|" operator in a path expression lets you select several paths at once.

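Examples

The table below lists some path expressions and their results:

Path expression                      Result
//book/title | //book/price          Selects all title and price elements of book elements.
//title | //price                    Selects all title and price elements in the document.
/bookstore/book/title | //price      Selects all title elements of book elements under bookstore, plus all price elements in the document.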
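The "|" operator can be sketched with lxml as below. The result is a single list containing the nodes matched by either side, with no node repeated.

```python
from lxml import etree

xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title><price>999</price></book>
  <book><title lang="eng">Learning XML</title><price>888</price></book>
</bookstore>
"""
root = etree.fromstring(xml.strip())

# Elements matched by either path end up in one result list.
nodes = root.xpath("//book/title | //book/price")
tags = [el.tag for el in nodes]

# Both sides match the same title elements here, so each appears only once.
same_titles = root.xpath("/bookstore/book/title | //title")

print(tags)
print(len(same_titles))
```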

Example:

    from lxml import etree

    text = '''
    <div>
      <ul>
        <li class='item-1'><a href='https://www.haobala.com/bcjs/link1.html' rel='external nofollow'>first item</a></li>
        <li class='item-1'><a href='https://www.haobala.com/bcjs/link2.html' rel='external nofollow'>second item</a></li>
        <li class='item-inactive'><a href='https://www.haobala.com/bcjs/link3.html' rel='external nofollow'>third item</a></li>
        <li class='item-1'><a href='https://www.haobala.com/bcjs/link4.html' rel='external nofollow'>fourth item</a></li>
        <li class='item-0'><a href='https://www.haobala.com/bcjs/link5.html' rel='external nofollow'>fifth item</a>
      </ul>
    </div>
    '''

    # etree.HTML parses the string as HTML and repairs it
    # (the last <li> above is missing its closing tag)
    html = etree.HTML(text)

    # Get the list of hrefs and the list of titles
    href_list = html.xpath("//li[@class='item-1']/a/@href")
    title_list = html.xpath("//li[@class='item-1']/a/text()")

    # Assemble them into dicts. zip pairs the two lists by position,
    # which stays correct even if the same href appears twice.
    for href, title in zip(href_list, title_list):
        item = {'href': href, 'title': title}
        print(item)

    # When an expression selects elements, xpath() returns Element objects,
    # and each of them has its own xpath() method. During extraction we can
    # therefore group by some tag first, then extract data within each group.
    li_list = html.xpath("//li[@class='item-1']")

    # Continue extracting data within each group
    for li in li_list:
        hrefs = li.xpath('./a/@href')
        titles = li.xpath('./a/text()')
        item = {
            'href': hrefs[0] if hrefs else None,
            'title': titles[0] if titles else None,
        }
        print(item)
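The grouped extraction above guards every lookup against an empty result. One possible way to keep that guard in a single place is a small helper; `first_or_none` is a hypothetical name introduced here, not an lxml function, and the HTML snippet is invented for the example.

```python
from lxml import etree


def first_or_none(node, expr):
    """Return the first XPath result for expr relative to node, or None.

    Assumes expr selects a node-set, so that xpath() returns a list.
    """
    results = node.xpath(expr)
    return results[0] if results else None


# Hypothetical snippet: the second link has no href attribute.
html = etree.HTML("""
<ul>
  <li class="item-1"><a href="link1.html">first item</a></li>
  <li class="item-1"><a>no href here</a></li>
</ul>
""")

items = [
    {
        "href": first_or_none(li, "./a/@href"),
        "title": first_or_none(li, "./a/text()"),
    }
    for li in html.xpath("//li[@class='item-1']")
]
print(items)
```

Grouping by `li` first keeps each href paired with its own title, so a missing field in one group becomes None instead of shifting the values of later groups.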

