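This article scrapes the title, synopsis, and rating of each movie listed on the target page.
Preparation:
Tool: Spyder
Libraries used: requests, csv, and etree from lxml
1. Initial setup: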
import requests

url = "https://film.sohu.com/list_4_0_0_0_0_1_60.html?channeled=1200100000"
# Send a browser-like User-Agent so the request looks like a normal page visit.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}
resp = requests.get(url, headers=headers)
resp.encoding = "utf-8"
2. Parse the response so it can be queried with XPath
from lxml import etree

html = etree.HTML(resp.text)
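3. Locate the elements
In the browser, select the text 神秘巨星 (Secret Superstar), then right-click -> Inspect. In the source panel on the right, move up to the element that wraps one whole movie and sits at the same level as the other movies (the li element in the path below). Right-click it -> Copy -> Copy XPath:
/html/body/div[4]/div[2]/ul/li[1]
li[1] holds the information for 神秘巨星 only. Dropping the [1] makes the expression match every li, that is, every movie on the page.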
lis = html.xpath("/html/body/div[4]/div[2]/ul/li")
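The effect of dropping the positional predicate can be checked on a tiny document. The sketch below is only an illustration: the markup is made up, and it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) so that it runs without lxml. With lxml, html.xpath() accepts the same expressions in their absolute /html/... form.

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the real page structure (an assumption).
PAGE = """
<html><body>
  <div><ul>
    <li><p>Movie A</p></li>
    <li><p>Movie B</p></li>
    <li><p>Movie C</p></li>
  </ul></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# With the positional predicate, only the first <li> matches.
first_only = root.findall("./body/div/ul/li[1]")

# Without it, every <li> under the <ul> matches.
all_items = root.findall("./body/div/ul/li")

# Read one field from each matched item.
titles = [li.findtext("p") for li in all_items]
print(len(first_only), len(all_items), titles)
```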
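4. Extract each field from every item
# The paths start with "./" so they are resolved relative to the current li.
for li in lis:
    title = li.xpath("./div[2]/div[1]/text()")
    abstract = li.xpath("./div[2]/div[2]/text()")
    score = li.xpath("./div[1]/div[3]/span/text()")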
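Each xpath() call with text() returns a list of strings, which may be empty when an item lacks that field. A minimal sketch of one way to reduce such a list to a single clean string follows. The helper name first_text and the sample values are hypothetical and not part of the original script.

```python
def first_text(values, default=""):
    """Return the first non-blank string in values, stripped, else default."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Sample inputs shaped like xpath() results (made up for illustration).
title = first_text(["\n  Secret Superstar  "])
missing = first_text([])
print(repr(title), repr(missing))
```

Inside the loop this could be used as title = first_text(li.xpath("./div[2]/div[1]/text()")).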
The complete code:
"""
Created on Sun Oct 3 17:37:09 2021
@author: yingzi
E-mail:guotaomath@163.com
"""
import csv

import requests
from lxml import etree

url = "https://film.sohu.com/list_4_0_0_0_0_1_60.html?channeled=1200100000"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}

resp = requests.get(url, headers=headers)
resp.encoding = "utf-8"
html = etree.HTML(resp.text)
resp.close()

# Each li under this list holds one movie.
lis = html.xpath("/html/body/div[4]/div[2]/ul/li")

# newline="" stops the csv module from writing blank lines between rows on Windows.
with open("movie.csv", mode="w", encoding="utf-8", newline="") as f:
    csvwriter = csv.writer(f)
    for li in lis:
        # xpath() returns a list of strings; join it so the CSV cell holds
        # plain text instead of the list's repr such as ['...'].
        title = "".join(li.xpath("./div[2]/div[1]/text()")).strip()
        abstract = "".join(li.xpath("./div[2]/div[2]/text()")).strip()
        score = "".join(li.xpath("./div[1]/div[3]/span/text()")).strip()
        csvwriter.writerow([title, abstract, score])

print("over!!!")
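Because xpath() returns a list for every field, it may help to see what csv.writer does with a list cell compared with a plain string. The sketch below writes to an in-memory io.StringIO buffer instead of movie.csv, and its sample values are made up. Both are assumptions for illustration only.

```python
import csv
import io


def rows_to_csv(rows):
    """Write rows with csv.writer into memory and return the CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


# One row whose cells are lists, as returned by xpath() (made-up values).
raw = [[["Secret Superstar"], ["A girl chases her dream"], ["8.5"]]]
# The same row with every list joined into a plain string.
flat = [["".join(cell).strip() for cell in row] for row in raw]

as_lists = rows_to_csv(raw)
as_text = rows_to_csv(flat)
print(as_lists)
print(as_text)
```

A list cell is converted with str(), so the brackets and quotes of the list end up in the file. Joining the list first leaves only the text.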
Still to do: extract only the number from the rating field.
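One possible approach is a regular expression that picks the first number out of the rating text. This is only a sketch: the helper name extract_score is hypothetical, and the sample strings are guesses, since the exact text the page puts in the rating element is not shown here.

```python
import re

# Matches an integer or a decimal number such as 9 or 8.5.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_score(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


# Made-up rating strings for illustration.
print(extract_score("评分:8.5分"))
print(extract_score("暂无评分"))
```

In the loop, the joined rating text could be passed through extract_score before the row is written, with None written as an empty cell.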