I had just started learning web scraping and tried the following code:
from urllib import request
from http import cookiejar
cookie_support = request.HTTPCookieProcessor(cookiejar.CookieJar())
opener = request.build_opener(cookie_support, request.HTTPHandler)
request.install_opener(opener)
content = request.urlopen('https://movie.douban.com/').read().decode('utf-8')
print(content)
It raised an error:
Traceback (most recent call last):
File "E:\Program\python3.7sourcecode-master\python3.7sourcecode-master\chapter18\cookie_request.py", line 11, in <module>
content = request.urlopen('https://movie.douban.com/').read().decode('utf-8')
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\MAIBENBEN\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 418:
Process finished with exit code 1
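To see which status code the server returned without the script crashing, the exception can be caught. Below is a minimal sketch; the helper names `fetch` and `describe` are my own, and the `simulated` error is only a stand-in built locally so that the example runs without network access.

```python
import io
from http.client import HTTPMessage
from urllib import error, request


def fetch(url, timeout=10):
    """Return (status_code, body_bytes); HTTP errors are returned, not raised."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except error.HTTPError as exc:
        # HTTPError doubles as a response object: it has a code and a body.
        return exc.code, exc.read()


def describe(exc):
    """Format an HTTPError as a short one-line message."""
    return f"HTTP {exc.code} from {exc.url}"


# Stand-in for the error in the traceback above (no request is sent here).
simulated = error.HTTPError(
    "https://movie.douban.com/", 418, "", HTTPMessage(), io.BytesIO(b"")
)
print(describe(simulated))
```

Calling `fetch('https://movie.douban.com/')` would then return the status code instead of ending in a traceback.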
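The error is HTTP Error 418. After looking into it, the cause turned out to be that the request was sent without a browser User-Agent header (by default urllib identifies itself as Python-urllib/x.y, which the site apparently rejects). I changed the code as follows: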
from urllib.request import urlopen, Request
url = 'https://movie.douban.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
ret = Request(url, headers=headers)
res = urlopen(ret)
content = res.read().decode('utf-8')
print(content)
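The first snippet used a cookie-aware opener, while this one only sets the header. The two can be combined by giving the opener a default User-Agent. This is a sketch under my own assumptions: the function name `build_opener_with_cookies` is hypothetical, and I have not verified that Douban needs cookies at all.

```python
from http import cookiejar
from urllib import request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
)


def build_opener_with_cookies(user_agent=USER_AGENT):
    """Return (opener, jar): an opener that stores cookies and sends user_agent."""
    jar = cookiejar.CookieJar()
    # build_opener adds the default HTTP and HTTPS handlers on its own.
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib/x.y" header with a browser-like one.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = build_opener_with_cookies()
```

Calling `opener.open(url)` then sends the header on every request and collects any cookies the server sets in `jar`.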
Then another error appeared.
This is a bug I ran into when writing data out in Python: UnicodeEncodeError: 'gbk' codec can't encode character '\xee' in position 21865
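Solution:
Setting the encoding to utf-8 in the settings solves the problem. The Douban page declares html lang="zh-CN" and is served as UTF-8, so it can contain characters such as '\xee' that the gbk codec named in the error message (the default on a Chinese-locale Windows system) cannot represent.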
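The same problem can also be avoided in code by naming the encoding explicitly when writing the page out. The sketch below uses hypothetical helper names (`fits_in_gbk`, `save_utf8`) and a short stand-in string instead of the real page.

```python
import os
import tempfile


def fits_in_gbk(text):
    """Return True if every character in text can be encoded as GBK."""
    try:
        text.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


def save_utf8(path, text):
    """Write text to path as UTF-8, regardless of the locale default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


sample = "豆瓣电影 \xee"  # stand-in for the downloaded page content
out_path = os.path.join(tempfile.gettempdir(), "douban_sample.html")
save_utf8(out_path, sample)

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as f:
    round_trip = f.read()
```

When the error comes from `print` rather than from a file, `sys.stdout.reconfigure(encoding='utf-8')` (available since Python 3.7) is another option, assuming the console can display UTF-8.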
In the end the script ran successfully and printed:
<!DOCTYPE html>
<html lang="zh-CN" class="ua-windows ua-webkit">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="renderer" content="webkit">
    <meta name="referrer" content="always">
    <meta name="google-site-verification" content="ok0wCgT20tBBgo9_zat2iAcimtN4Ftf5ccsh092Xeyw" />
    <title> 豆瓣电影 Top 250 </title>