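This post downloads the free audio lessons from the Geek Time (极客时间) website.
First, analyze the web page
Take the course 数据结构与算法之美: https://time.geekbang.org/column/intro/126. Open the first lesson and you can see a request named article. Its response has a data field, and data contains audio_download_url, which points to an MP3. Paste that URL straight into the browser to confirm it really is the audio for this lesson of the course. That URL is our target.
Note that the request path is not the page URL above; the path and its parameters appear in the code below.
Because the site requires login, build your own request headers by copying them from the browser. They are long, and the cookie inside them expires, so it has to be updated from time to time.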
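As a rough sketch of the lookup described above, the helper below walks a parsed response down to `data.audio_download_url` and returns `None` when any level is missing. The function name `extract_audio_url` and the sample payloads are illustrative assumptions; only the `data` and `audio_download_url` keys come from the response described here.

```python
def extract_audio_url(response_json):
    """Return data.audio_download_url from a parsed response, or None."""
    if not isinstance(response_json, dict):
        return None
    data = response_json.get("data")
    if not isinstance(data, dict):
        return None
    url = data.get("audio_download_url")
    # Treat empty strings and non-string values as "no audio available".
    return url if isinstance(url, str) and url else None


# Hypothetical payloads shaped like the response described above.
sample = {"data": {"audio_download_url": "https://example.com/lesson.mp3"}}
locked = {"data": []}
print(extract_audio_url(sample))
print(extract_audio_url(locked))
```

Checking each level explicitly avoids an exception when the response carries no usable `data` object.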
Downloading a single audio file
import json
import requests
headers2= {
'Accept':'application/json, text/plain, */*',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6',
'Connection':'keep-alive',
# 'Content-Length' is left out: requests sets it from the request body
'Content-Type':'application/json',
'Cookie':'_ga=GA1.2.1030138631.1629027991; LF_ID=1629027991552-8121479-3323698; GCID=6d8a6fa-13351d1-3345bda-421bf56; GRID=6d8a6fa-13351d1-3345bda-421bf56; gksskpitn=7c61cb9d-642e-443b-b3e5-a696c5eb3fc4; _gid=GA1.2.369696871.1631087158; GCESS=BgMEqZ84YQQEAC8NAAsCBgAKBAAAAAAIAQMHBKBBO7gNAQEGBB66HJMMAQEBCM.VKAAAAAAABQQAAAAAAgSpnzhhCQEB; Hm_lvt_59c4ff31a9ee6263811b23eb921a5083=1630833486,1630917251,1631087158,1631100840; Hm_lvt_022f847c4e3acd44d4a2481d9187f1e6=1630833486,1630917251,1631087158,1631100840; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%222659791%22%2C%22first_id%22%3A%2217b49a27e840-0fa330fb0d8d53-404b032d-1049088-17b49a27e8692%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E5%BC%95%E8%8D%90%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Faccount.infoq.cn%2Fsyncinfoq%2F%3Fto%3Dff6d71c6643d6044%26redirect%3Dhttps%253A%252F%252Ftime.geekbang.org%252Fcolumn%252Fintro%252F100017301%22%2C%22%24latest_landing_page%22%3A%22https%3A%2F%2Ftime.geekbang.org%2Fcolumn%2Fintro%2F100017301%22%2C%22%24latest_utm_source%22%3A%22shequn%22%2C%22%24latest_utm_medium%22%3A%220817%22%2C%22%24latest_utm_campaign%22%3A%22newregister%22%7D%2C%22%24device_id%22%3A%2217b49a27e840-0fa330fb0d8d53-404b032d-1049088-17b49a27e8692%22%7D; Hm_lpvt_59c4ff31a9ee6263811b23eb921a5083=1631101079; _gat=1; Hm_lpvt_022f847c4e3acd44d',
'Host':'time.geekbang.org',
'Origin':'https://time.geekbang.org',
'Referer':'https://time.geekbang.org/column/intro/100017301',
'sec-ch-ua':'"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'sec-ch-ua-mobile':'?0',
'Sec-Fetch-Dest':'empty',
'Sec-Fetch-Mode':'cors',
'Sec-Fetch-Site':'same-origin',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
}
data2 = {"id":"40961","include_neighbors":'true',"is_freelyread":'true'}
res2 = requests.post(url='https://time.geekbang.org/serv/v1/article', data=json.dumps(data2),headers=headers2).json()
url=res2.get('data').get('audio_download_url')
print(res2)
print(url)
res3 = requests.get(url)
with open('xx.mp3', 'wb') as f:
    f.write(res3.content)
This downloads the specified MP3.
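For larger files it may be preferable not to hold the whole MP3 in memory. The following is a minimal sketch, assuming the chunks come from something like `requests.get(url, stream=True).iter_content(chunk_size=8192)`. The helper name `save_chunks` is hypothetical, and it accepts any iterable of bytes so it can be tried without a network connection.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for response.iter_content(...); real use would pass that iterator.
demo_path = os.path.join(tempfile.mkdtemp(), "xx.mp3")
demo_size = save_chunks([b"ID3", b"", b"\x00\x01"], demo_path)
print(demo_size)
```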
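Batch processing
Open the table of contents on the course home page and you can see the browser load an XHR named articles. Its response has a data field, data contains a list, and each item in the list includes the id of a lesson.
The request path and method, along with the request parameters, are as follows.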
data = {"cid":'126',"size":'500',"prev":'0',"order":"earliest","sample":'false'}
res = requests.post(url='https://time.geekbang.org/serv/v1/column/articles', data=json.dumps(data),headers=headers2).json()
params = res.get('data')['list']
print(params)
Iterate over this list and pull out the id and title we need.
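import time  # needed for time.sleep below

for i in params:
    article_id = i.get('id')
    title = i.get('article_sharetitle')
Reuse the single-download approach to try to get the URL of each audio file.
    if all((article_id, title)):
        print(article_id, title)
        data2 = {"id": str(article_id), "include_neighbors": 'true', "is_freelyread": 'true'}
        time.sleep(2)  # pause between requests
        res2 = requests.post(url='https://time.geekbang.org/serv/v1/article', data=json.dumps(data2),
                             headers=headers2).json()
        try:
            res_url = res2.get('data').get('audio_download_url')
            print(res_url)
        except AttributeError:
            # 'data' is missing or not a dict, so there is nothing to download
            print('No resource here')
            res_url = None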
Download from each valid URL.
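def download_mp3(title, url: str):
    # Skip lessons with no URL (None) and anything that is not an MP3
    if not url or not url.endswith('mp3'):
        return
    res = requests.get(url)
    name = 'vm6/' + title + '.mp3'  # the vm6/ directory must already exist
    with open(name, 'wb') as f:
        f.write(res.content)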
Sometimes the title needs cleaning up to remove characters that cannot appear in a file name.
import re

def validateTitle(title):
    # Characters to strip; the set includes the space character, so spaces are removed too
    punctuation = '!,;:?"\'、,;“ ” 《 》【】? + * & /'
    new_title = re.sub(r'[{}]+'.format(punctuation), '', title)
    return new_title.strip()
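As an alternative sketch, the version below strips the characters Windows forbids in file names (`\ / : * ? " < > |`) plus ASCII control characters, instead of working from a punctuation list. The name `safe_filename` and the fallback value `untitled` are assumptions for illustration and do not come from the code above.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, fallback="untitled"):
    """Remove forbidden characters and trim surrounding spaces and dots."""
    cleaned = _FORBIDDEN.sub("", title).strip(" .")
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or fallback


print(safe_filename('01 | Why learn data structures?'))
```

Unlike the punctuation-list approach, this keeps spaces and ordinary punctuation in the title and removes only what the file system would reject.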
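The results look like this: a None means no URL was returned. Those lessons are locked on the course site and only open up after payment, so I don't have access to them.
After many runs, errors start to appear. When that happens I suggest logging in to your account again to get a fresh cookie. Also, don't fetch too much in one go; try extracting in segments and updating your cookie as you go.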
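One way to follow the advice about extracting in segments is to split the lesson ids into small batches and handle one batch per run or per fresh cookie. This is a minimal sketch; the helper `batched_ids`, the sample ids after the first, and the batch size of 3 are arbitrary assumptions.

```python
def batched_ids(ids, size):
    """Yield consecutive slices of ids, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Hypothetical lesson ids; a real run would use the ids from the articles list.
demo_batches = list(batched_ids([40961, 40962, 40963, 40964, 40965, 40966, 40967], 3))
print(demo_batches)
```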