Python Course Review
Today's Python practical-training class covered web scraping: crawling the text of the Baidu online novel Journey to the West, and downloading songs from the NetEase Cloud Music hot songs chart.
Crawling the Baidu online novel Journey to the West
'''
Software engineering practical-training class content
'''
import requests
import time

# Fetch the chapter catalog of Journey to the West from Baidu's novel API
url = "http://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%224306063500%22}"
resp1 = requests.get(url)
print(resp1)
data1 = resp1.json()
print(data1)
print(type(data1))
titleList = data1["data"]["novel"]["items"]
print(titleList)
for x in titleList:
    # Request each chapter's content by its cid
    url1 = "http://dushu.baidu.com/api/pc/getChapterContent?data={%22book_id%22:%224306063500%22,%22cid%22:%224306063500|" + x["cid"] + "%22,%22need_bookinfo%22:1}"
    time.sleep(1)  # throttle requests to be polite to the server
    resp2 = requests.get(url1)
    data2 = resp2.json()
    text1 = data2["data"]["novel"]["content"]
    print(text1)
    # Save each chapter to its own text file, named after the chapter title
    path = "D:\\Desktop\\西游记\\" + x["title"] + ".txt"
    with open(path, 'w', encoding='utf8') as f:
        f.write(text1)
    print("================", x["title"] + " download finished! ====================")
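Hand-writing the percent-encoded JSON in the query string (all those `%22` escapes) is error-prone. As a sketch of an alternative, the payload can be built with `json.dumps` and encoded with `urllib.parse.quote`; the `cid` value below is a made-up placeholder for illustration, not a real chapter id from the API.

```python
import json
from urllib.parse import quote

book_id = "4306063500"
cid = "1569782244"  # hypothetical chapter id, for illustration only

# Build the JSON payload as a compact string, then percent-encode it
payload = json.dumps(
    {"book_id": book_id, "cid": f"{book_id}|{cid}", "need_bookinfo": 1},
    separators=(",", ":"),
)
url1 = "http://dushu.baidu.com/api/pc/getChapterContent?data=" + quote(payload)
print(url1)
```

This produces the same kind of URL as the hand-built string, but any change to the payload (extra fields, different ids) stays valid automatically.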
Crawling the NetEase Cloud Music hot songs chart
import requests
from lxml import etree
import time

# Fetch the NetEase Cloud Music "Hot Songs" chart page
url = "https://music.163.com/discover/toplist?id=3778678"
resp1 = requests.get(url)
body = resp1.text
html = etree.HTML(body)
# Song links and titles live in the hidden <ul class="f-hide"> list in the page source
data1 = html.xpath("//ul[@class='f-hide']/li/a/@href")
data2 = html.xpath("//ul[@class='f-hide']/li/a/text()")
for i in range(len(data1)):
    # Build a direct mp3 link from the song id in the href (e.g. /song?id=12345)
    url1 = "https://link.hhtjim.com/163/" + data1[i].split("=")[1] + ".mp3"
    time.sleep(1)  # throttle requests
    resp2 = requests.get(url1)
    data3 = resp2.content  # .content gives raw bytes, suitable for binary files
    path = "D:\\Desktop\\music\\" + data2[i] + ".mp3"
    with open(path, "wb") as f:
        f.write(data3)
    print("=============", data2[i] + " download finished ========")
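One pitfall in the loop above: song titles scraped from the chart can contain characters that Windows forbids in file names (`\ / : * ? " < > |`), which makes `open()` fail. A minimal sketch of a sanitizing helper (the name `safe_filename` is my own, not part of the class code):

```python
import re

def safe_filename(name: str) -> str:
    # Replace characters that Windows forbids in file names: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', "_", name).strip()

# Usage in the loop would look like:
#   path = "D:\\Desktop\\music\\" + safe_filename(data2[i]) + ".mp3"
```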
Note: Before crawling a website, first determine whether the data you need lives on the client side or the server side.
How to check: right-click the page and view the page source, then look for the content you need. If it appears in the source, you can crawl that URL directly. If not, open DevTools -> Network -> XHR and refresh the page, then look through the requests to find the one that returns the content you need, and crawl that request's URL instead.
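The manual source-view check above can also be scripted: fetch the raw HTML and test whether the needed text is already in it. The helper name below is my own invention, and the HTML snippets are made-up stand-ins for a real page.

```python
def rendered_server_side(page_source: str, needle: str) -> bool:
    # If the text appears in the raw HTML, the data came with the page (server side);
    # otherwise it is likely loaded later via XHR (client side)
    return needle in page_source

# In practice: rendered_server_side(requests.get(url).text, "some expected text")
server_html = "<ul class='f-hide'><li><a href='/song?id=1'>Song A</a></li></ul>"
client_html = "<div id='app'></div>"  # empty shell filled in by JavaScript
```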
- .text returns the response body decoded as a Unicode string (use it when you want text)
- .content returns the raw bytes of the response body (use it for images and other binary files)
- .json() parses the response body as JSON and returns the resulting Python object (use it for JSON APIs)
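The three return types can be illustrated without a network call. The byte string below merely stands in for a response body; this is a sketch of the str/bytes/parsed-JSON distinction, not the requests API itself.

```python
import json

# Stand-in for resp.content: the raw bytes of a response body
raw = '{"title": "西游记"}'.encode("utf8")

text = raw.decode("utf8")   # what resp.text gives you: a str
parsed = json.loads(raw)    # what resp.json() gives you: a parsed Python object

print(type(raw), type(text), type(parsed))
```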