1. Page analysis
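All of the house listings on the page sit under //div[@class="shop_list shop_list_4"]/dl[@class="clearfix"], with one dl element per listing. Taking the title as an example, the path can be checked with the xpath-helper browser plugin.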
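To see how this XPath narrows the page down to individual listings, here is a minimal sketch run against a hypothetical, heavily simplified fragment. The sample markup, the listing titles, and the extract_titles helper are assumptions made for illustration, not the real page. It uses the standard library's ElementTree, which supports only a subset of XPath, instead of parsel, so it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like the listing markup described above.
SAMPLE = """
<html><body>
  <div class="shop_list shop_list_4">
    <dl class="clearfix">
      <dd><h4 class="clearfix"><a href="/chushou/1.htm"><span> Sample listing A </span></a></h4></dd>
    </dl>
    <dl class="clearfix">
      <dd><h4 class="clearfix"><a href="/chushou/2.htm"><span>Sample listing B</span></a></h4></dd>
    </dl>
    <dl class="other"><dd>not a listing</dd></dl>
  </div>
</body></html>
"""


def extract_titles(markup):
    """Return the stripped title text of every dl[@class='clearfix'] listing."""
    root = ET.fromstring(markup.strip())
    # Same container/listing path as in the article, in ElementTree's XPath subset.
    listings = root.findall(".//div[@class='shop_list shop_list_4']/dl[@class='clearfix']")
    titles = []
    for dl in listings:
        # Relative path from the listing node down to the title span.
        span = dl.find(".//h4[@class='clearfix']/a/span")
        if span is not None and span.text:
            titles.append(span.text.strip())
    return titles


titles = extract_titles(SAMPLE)
print(titles)
```

The dl with class "other" is skipped because the predicate matches the class attribute exactly. Real pages are rarely well-formed XML, which is one reason to use an HTML-aware parser such as parsel for the actual scrape.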
2. Code walkthrough
import csv

import parsel
import requests

url = 'https://xian.esf.fang.com/house/i37/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
}

# Fetch the listing page and decode it as UTF-8.
response = requests.get(url=url, headers=headers)
response.encoding = 'utf-8'
data = response.text

# Each <dl class="clearfix"> under the listing container is one house.
selector = parsel.Selector(data)
dls = selector.xpath('//div[@class="shop_list shop_list_4"]/dl[@class="clearfix"]')
print(len(dls))  # 0 means the response did not contain the expected markup

# Open the CSV once in append mode and write one row per listing.
with open('house.csv', mode='a', encoding='utf-8', newline='') as f:
    csv_write = csv.writer(f)
    for dl in dls:
        # Title; .get() returns None when the node is missing.
        name = dl.xpath('.//h4[@class="clearfix"]/a/span/text()').get()
        if name:
            name = name.strip()

        addr = dl.xpath('.//p[@class="add_shop"]/span/text()').get()

        # Text of the price spans, plus the bold figure nested inside them.
        price = dl.xpath('.//dd[@class="price_right"]/span/text()').getall()
        price_w = dl.xpath('.//dd[@class="price_right"]/span/b/text()').getall()
        # Guard the index access so a listing without both values cannot raise IndexError.
        if len(price) > 1 and price_w:
            price[1] = price_w[0]
        price = "|".join(price)

        # Text fragments of the tel_shop paragraph, collapsed into one |-separated string.
        room = dl.xpath('.//p[@class="tel_shop"]/text()').getall()
        room = "|".join("".join(room).split())
        # Extracted as in the original script, but not written to the CSV.
        area = dl.xpath('.//p[@class="tel_shop"]/i').re(r"[\d~㎡]+")

        orig_url = dl.xpath('.//h4[@class="clearfix"]/a/@href').get()

        row = [name, price, addr, room, orig_url]
        print(row)
        csv_write.writerow(row)
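The cleanup steps in the loop (stripping the title, collapsing the whitespace in the room text, writing a row) can be checked without touching the network. The sketch below is only an illustration: clean_room and to_row are hypothetical helper names, the input values are made up to resemble what the XPath queries might return, and an in-memory buffer stands in for house.csv.

```python
import csv
import io


def clean_room(fragments):
    """Join text fragments, collapse whitespace runs, and separate tokens with '|'."""
    return "|".join("".join(fragments).split())


def to_row(name, price, addr, room, orig_url):
    """Build one CSV row, turning missing values (None) into empty strings."""
    name = name.strip() if name else ""
    return [name, price or "", addr or "", clean_room(room), orig_url or ""]


# Hypothetical values shaped like what the XPath queries might return.
row = to_row(
    "  Sample listing  ",
    "150|10000",
    None,                          # address node missing
    ["\n 3-bed ", "\n 120sqm \n"],  # raw text fragments with stray whitespace
    "/chushou/1.htm",
)

buffer = io.StringIO()  # in-memory stand-in for house.csv
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

Keeping the cleanup in small functions like these makes it easier to tell whether an empty result comes from the parsing logic or from the response itself.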
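Scraping result: the saved CSV file. Note: in this example, scraping the second page returned empty content. I will update the article once this is solved; if you have a good solution, please leave a comment.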