Review:
Basic tools for processing HTML data:
- requests: sends requests and offers basic ways of handling the response data
- BeautifulSoup, reference: https://editor.csdn.net/md/?articleId=118095431
- regular expressions, reference: https://editor.csdn.net/md/?articleId=117717623

Processing data with XPath:
```python
from lxml import etree

# Parse a local file; for a page fetched with requests, use
# etree.HTML(response.text) instead
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)  # serialized document, handy for debugging

result = html.xpath('/html//li/a')  # all <a> children of <li> elements
for item in result:
    print(item)
result2 = html.xpath('//li[@class="item-3"]')  # class is exactly "item-3"
print(result2)
# class attribute of the parent of the matching <a>
result1 = html.xpath('//a[@href="https://hao.360.cn/?a1004"]/../@class')
print(result1)
# text of the <a> with the given href
result3 = html.xpath('//li/a[@href="https://hao.360.cn/?a1004"]/text()')
print(result3)
result4 = html.xpath('//li[contains(@class,"sp")]/a/text()')  # class contains "sp"
print(result4)
# combine conditions with "and" between predicates, not inside contains()
result5 = html.xpath('//li[contains(@class,"sp") and @name="123"]/a/text()')
print(result5)
result6 = html.xpath("//li[2]")  # index predicates are 1-based: the second <li>
print(result6)
```
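The paths above assume a local test.html that is not shown here. As a quick self-contained check, the same selector patterns can be run against an inline HTML string (the markup below is made up for illustration):

```python
from lxml import etree

# Minimal made-up document mirroring the structure the selectors expect
html_doc = """
<html><body>
<ul>
  <li class="item-1"><a href="https://example.com/1">first</a></li>
  <li class="item-2 sp"><a href="https://example.com/2">second</a></li>
  <li class="item-3"><a href="https://example.com/3">third</a></li>
</ul>
</body></html>
"""
tree = etree.HTML(html_doc)

links = tree.xpath('/html//li/a')  # all <a> children of <li> elements
print(len(links))                  # 3

# <li> whose class attribute contains the substring "sp"
texts = tree.xpath('//li[contains(@class, "sp")]/a/text()')
print(texts)                       # ['second']

# class attribute of the parent of the matching <a>
parent_cls = tree.xpath('//a[@href="https://example.com/2"]/../@class')
print(parent_cls)                  # ['item-2 sp']
```

Note that etree.HTML accepts a string directly, which is the form you get from requests, while etree.parse reads from a file.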
Worked example: scraping 58.com rental listings
```python
import requests
from lxml import etree

url = 'https://bj.58.com/chuzu/?PGTID=0d100000-0000-12c6-fb47-a49a0f7be1ee&ClickID=2'
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# each listing is an <li> inside <ul class="house-list">
li_list = tree.xpath("//ul[@class='house-list']/li")
with open("58.txt", "w", encoding="utf-8") as fp:
    for li in li_list:
        # the title sits in the second <div> of each <li>
        title = li.xpath("./div[2]/h2/a/text()")[0]
        print(title)
        fp.write(title + "\n")
```
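The loop above depends on 58.com's live markup (the house-list class and the div[2]/h2/a layout may change at any time). The extraction pattern itself can be exercised offline against a stub page; the HTML below is invented to mirror the assumed structure:

```python
from lxml import etree

# Invented stub that mimics the assumed 58.com listing layout
page_text = """
<ul class="house-list">
  <li><div>img</div><div><h2><a> Room A </a></h2></div></li>
  <li><div>img</div><div><h2><a> Room B </a></h2></div></li>
</ul>
"""
tree = etree.HTML(page_text)
li_list = tree.xpath("//ul[@class='house-list']/li")

# same relative path as in the real scraper: second <div>, then h2/a text
titles = [li.xpath("./div[2]/h2/a/text()")[0].strip() for li in li_list]
print(titles)  # ['Room A', 'Room B']
```

Relative paths starting with `./` are evaluated against each `li` element, which is what lets the same sub-expression run once per listing.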