Topics covered
- Basic web scraping
- Data parsing (XPath and regular expressions)
- Basics of multithreaded concurrency (thread pool)
import requests
import os
import random
from lxml import etree
from multiprocessing.dummy import Pool
import re
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}
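# Fetch the category page and select the <li> element of each video
url = 'https://www.pearvideo.com/category_5'
page_text = requests.get(url=url, headers=headers).content
tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="listvideoListUl"]/li[@class="categoryem "]')
urls = []
# Create the output directory if it does not exist yet
if not os.path.exists('./myvideos'):
    os.mkdir('./myvideos')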
for li in li_list:
    # Video title (used as the file name) and detail-page URL
    v_name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
    v_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
    v_page_text = requests.get(url=v_url, headers=headers).content
    v_tree = etree.HTML(v_page_text)
    # The video address is not in the page HTML; it is fetched through an ajax request
    ajax_url = 'https://www.pearvideo.com/videoStatus.jsp?'
    ajax_id = str(v_tree.xpath('//*[@id="detailsbd"]/div[1]/div[2]/div/div[1]/div/div[1]/@data-id')[0])
    print('ajax_id:', ajax_id)
    params = {
        'contId': ajax_id,
        'mrd': str(random.random())
    }
    # The ajax request carries a Referer pointing at the video's detail page
    ajax_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'Referer': 'https://www.pearvideo.com/video_' + ajax_id
    }
    dic_obj = requests.get(url=ajax_url, headers=ajax_headers, params=params).json()
    video_url = dic_obj["videoInfo"]["videos"]["srcUrl"]
    print('video_url:', video_url)
    # srcUrl is a fake address: capture the segment between the 8-digit date
    # directory and the first '-', then replace it with 'cont-<ajax_id>'
    ex = r'.*?/third/\d{8}/(.*?)-'
    word = re.findall(ex, video_url, re.S)[0]
    print('word:', word)
    new_video_url = re.sub(re.escape(word), 'cont-' + ajax_id, video_url)
    dic = {
        'name': v_name,
        'url': new_video_url
    }
    urls.append(dic)
def get_video_data(dic_):
    # Download one video and save it under ./myvideos
    url_ = dic_['url']
    print(dic_['name'], 'downloading...')
    v_data = requests.get(url=url_, headers=headers).content
    v_path = './myvideos/' + dic_['name']
    with open(v_path, 'wb') as fp:
        fp.write(v_data)
    print(dic_['name'], 'downloaded successfully!')
# multiprocessing.dummy.Pool is a thread pool: run up to 4 downloads concurrently
pool = Pool(4)
pool.map(get_video_data, urls)
pool.close()
pool.join()
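
The thread-pool step at the end of the script can be tried in isolation. The following is a minimal sketch, not the script itself: `fake_download` and the `tasks` list are hypothetical stand-ins (a short sleep replaces the real network request, and the example.com URLs are placeholders) so that it runs without network access. It only illustrates that `Pool.map` from `multiprocessing.dummy` hands each dict to a worker thread and returns the results in input order.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing API


def fake_download(task):
    """Hypothetical stand-in for a blocking download: wait, then report a size."""
    time.sleep(0.1)  # simulates network latency
    return task["name"], len(task["url"])


# Hypothetical task list, shaped like the dicts collected in `urls`.
tasks = [
    {"name": "a.mp4", "url": "https://example.com/a"},
    {"name": "b.mp4", "url": "https://example.com/bb"},
    {"name": "c.mp4", "url": "https://example.com/ccc"},
    {"name": "d.mp4", "url": "https://example.com/dddd"},
]

start = time.perf_counter()
with Pool(4) as pool:
    # map() blocks until every task has finished and keeps the input order
    results = pool.map(fake_download, tasks)
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the four sleeps overlap, the elapsed time should be close to one sleep rather than four. Threads suit this kind of I/O-bound work, where most of the time is spent waiting on the network.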
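Summary
- If XPath cannot find the video address on the initial video page, check the browser's Network panel to see whether the address is fetched dynamically through an ajax request.
- Pay attention to where the values in the ajax request's headers and params come from.
  For details on the Referer header: https://www.jianshu.com/p/1a6abab212ed
- The video address in the ajax response is fake. Compare it with the real address and rewrite it accordingly; this is also practice in string replacement with regular expressions via the re module: https://blog.csdn.net/zss041962/article/details/79089215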
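
The last two points can be sketched without touching the network. This is an illustration under stated assumptions, not the site's documented API: the helper names `build_status_request` and `fix_video_url` are hypothetical, the shortened User-Agent is a placeholder, and the sample id and sample URL are made-up values that only follow the shape the script's regular expression expects (`/third/<8 digits>/<segment>-...`).

```python
import random
import re
from urllib.parse import urlencode

STATUS_URL = "https://www.pearvideo.com/videoStatus.jsp"


def build_status_request(cont_id):
    """Return (url, headers) for the ajax status request of one video."""
    # contId comes from the detail page's data-id; mrd is just a random number
    params = {"contId": cont_id, "mrd": str(random.random())}
    headers = {
        "User-Agent": "Mozilla/5.0",  # shortened placeholder
        # Referer points back at the detail page of the same video
        "Referer": "https://www.pearvideo.com/video_" + cont_id,
    }
    return STATUS_URL + "?" + urlencode(params), headers


def fix_video_url(src_url, cont_id):
    """Replace the segment after the date directory with 'cont-<id>'."""
    pattern = r"(/third/\d{8}/)[^-/]+(?=-)"
    return re.sub(pattern, lambda m: m.group(1) + "cont-" + cont_id, src_url, count=1)


# Made-up sample values, for illustration only
sample_id = "1668776"
sample_src = "https://video.pearvideo.com/mp4/third/20200417/1587123456789-11742488-104906-hd.mp4"

url, headers = build_status_request(sample_id)
fixed = fix_video_url(sample_src, sample_id)

print(url)
print(headers["Referer"])
print(fixed)
```

Anchoring the pattern on the date directory, rather than substituting the captured text wherever it occurs, keeps the replacement from touching any other part of the URL that happens to contain the same digits.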