When crawling large files, you cannot simply read the whole response into memory, because memory is limited. Instead, the request is optimized into a resumable download: the file is streamed in chunks, and an HTTP Range header lets the download pick up from where it left off.
import sys
import os
import requests
from concurrent.futures import ThreadPoolExecutor
from fake_useragent import UserAgent

ua = UserAgent()


class Downloader:
    def __init__(self, url, file_path):
        self.url = url
        self.file_path = file_path

    def start(self):
        requests.adapters.DEFAULT_RETRIES = 5
        s = requests.session()
        s.keep_alive = False

        # First request: only the headers are needed to learn the total size.
        res_length = s.get(self.url, stream=True)
        total_size = int(res_length.headers['Content-Length'])
        res_length.close()

        # If a partial file already exists, resume from its current size.
        if os.path.exists(self.file_path):
            temp_size = os.path.getsize(self.file_path)
            print("Current: %d bytes, total: %d bytes, downloaded: %2.2f%%"
                  % (temp_size, total_size, 100 * temp_size / total_size))
        else:
            temp_size = 0
            print("Total: %d bytes, starting download..." % (total_size,))

        # Ask the server for the remaining bytes only.
        headers = {
            'Range': 'bytes=%d-' % temp_size,
            'Connection': 'close',
            'User-Agent': ua.random,
        }
        res_left = s.get(self.url, stream=True, headers=headers)

        # Append to the existing file and redraw the progress bar per chunk.
        with open(self.file_path, "ab") as f:
            for chunk in res_left.iter_content(chunk_size=1024):
                temp_size += len(chunk)
                f.write(chunk)
                f.flush()
                done = int(50 * temp_size / total_size)
                sys.stdout.write("\r[%s%s] %d%%"
                                 % ('█' * done, ' ' * (50 - done),
                                    100 * temp_size / total_size))
                sys.stdout.flush()


def async_url(url):
    try:
        # Name the local file after the last path segment of the URL.
        file_path = os.path.basename(url)
        downloader = Downloader(url, file_path)
        downloader.start()
    except Exception as e:
        print(e)


pool = ThreadPoolExecutor(20)

if __name__ == '__main__':
    video_list = []  # fill with the URLs to download
    for url in video_list:
        pool.submit(async_url, url)
    pool.shutdown(wait=True)
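Resuming only works if the server honors Range requests. A quick check, shown as a minimal sketch below (the URL is a placeholder), is to send a one-byte Range request and look for a 206 Partial Content response or an Accept-Ranges: bytes header:

import requests

url = "https://example.com/video.mp4"  # placeholder URL
resp = requests.get(url, headers={'Range': 'bytes=0-0'}, stream=True)
if resp.status_code == 206 or resp.headers.get('Accept-Ranges') == 'bytes':
    print("server supports resumable downloads")
else:
    print("server ignores Range; the download would restart from byte 0")
resp.close()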
Q1: SSL verification warnings appear
requests can be told to skip SSL certificate verification by passing verify=False to the request; the resulting InsecureRequestWarning can then be silenced with:
requests.packages.urllib3.disable_warnings()
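A minimal sketch combining the two (the URL is a placeholder; verify=False skips certificate checks, disable_warnings() only mutes the warning):

import requests

requests.packages.urllib3.disable_warnings()           # silence InsecureRequestWarning
resp = requests.get("https://example.com/video.mp4",   # placeholder URL
                    stream=True,
                    verify=False)                       # skip SSL certificate verification
print(resp.status_code)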
Q2: Failed to establish a new connection: [Errno 60] Operation timed out?
This usually comes down to one of the following:
- The link itself is unreachable, so reads keep failing until they time out. Paste the link into a browser and check whether it opens normally.
- The target address is on an external network; without a VPN, or with a VPN whose bandwidth is very limited, this error appears easily.
- Set an explicit, small timeout on the request, e.g. response = requests.get(url, timeout=5, verify=True); if the link is dead, waiting longer will not help (see the sketch after this list).
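A minimal sketch of failing fast on a dead link (the URL is a placeholder):

import requests

url = "https://example.com/video.mp4"  # placeholder URL
try:
    response = requests.get(url, timeout=5, stream=True)  # give up after 5 seconds
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("timed out:", url)
except requests.exceptions.RequestException as e:
    print("request failed:", e)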
Q3: Slow download speed?
Download speed is bounded by physical bandwidth; it mainly depends on your network connection.
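If bandwidth itself is not the limit, one small tweak is to read larger chunks in the download loop of Downloader.start above, so Python does fewer write/flush iterations per megabyte. This is only a sketch of that one change (same variables as in the loop above) and only helps when per-chunk overhead, not the network, is the bottleneck:

for chunk in res_left.iter_content(chunk_size=1024 * 1024):  # 1 MiB instead of 1 KiB
    temp_size += len(chunk)
    f.write(chunk)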