Analysis: We want to scrape the short comments of a movie on Douban. The first 10 pages can be fetched without logging in, but from page 10 onward the data is only available after login. The plan is to log in once with Selenium; since the pages are static, the cookies can be saved and then attached to requests for the later logged-in requests. Alternatively, you can log in in the browser and copy the page's cookie directly into the requests call.
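The second approach mentioned above (copying the browser's cookie straight into requests) can look like the minimal sketch below. The Cookie value shown is a placeholder, not a real Douban cookie; paste the string from your browser's developer tools instead.
import requests

# Sketch: pass the Cookie header copied from the browser's developer tools.
# The cookie string below is a placeholder, not a real Douban cookie.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Cookie': 'bid=xxxx; dbcl2="12345678:xxxxxxxx"',  # placeholder values
}
r = requests.get('https://movie.douban.com/subject/33454980/comments?status=P', headers=headers)
print(r.status_code)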
I originally wanted to POST the login form directly with requests and save the cookies, but the ticket and randstr parameters change every time. They appear after the slider captcha is solved, and I could not figure out how they are generated, so a direct POST login is not possible. Now to the main part.
Step 1: Log in semi-automatically with Selenium and save the cookies.
import time
from selenium import webdriver

username = 'your username'   # your Douban login account
password = 'your password'

def Login(url):
    """Open the Douban login page, fill in the form and return the cookies."""
    wb.maximize_window()
    wb.get(url)
    time.sleep(3)
    # switch to the password-login tab and fill in the credentials
    wb.find_element_by_class_name('account-tab-account').click()
    wb.find_element_by_class_name('account-form-input').send_keys(username)
    wb.find_element_by_id('password').send_keys(password)
    wb.find_element_by_class_name('btn-active').click()
    # leave 20 seconds to finish the slider captcha by hand
    time.sleep(20)
    # after login, the top-right menu shows your nickname; replace with your own
    if 'your nickname' == wb.find_element_by_css_selector('.bn-more>span').text:
        print('login succeeded')
        return wb.get_cookies()
    else:
        print('login failed')
        return False

def write(cookies):
    """Save the cookies to a file."""
    with open('cookies.txt', 'w') as f:
        f.write(str(cookies))

if __name__ == '__main__':
    wb = webdriver.Chrome('resource/chromedriver.exe')
    url = 'https://accounts.douban.com/passport/login'
    cookie = Login(url)
    write(cookie)
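As an alternative to serializing the cookie list with str() and parsing it back later, the cookies returned by wb.get_cookies() (a list of dicts) can also be stored as JSON. This is just a sketch of that design choice; the file name cookies.json is an example, not the one used in the rest of this post.
import json

def write_cookie_json(cookies, path='cookies.json'):
    """Alternative sketch: store the cookie list as JSON instead of str(),
    so it can be read back with json.load() rather than an eval-style parser."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cookies, f, ensure_ascii=False)

def read_cookie_json(path='cookies.json'):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)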
Step 2: Read the saved cookies, pass them to requests to make logged-in requests, and scrape the short comments; here we scrape the first 20 pages. Note: requests only needs the name and value of each cookie, nothing else, so you have to iterate over the saved cookies and pull out 'name' and 'value'. Method 1 is to loop over them and store the name/value pairs in a dict (see the sketch below); method 2 is to create a RequestsCookieJar object and call its set() method.
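A minimal sketch of method 1, assuming the same list of cookie dicts returned by wb.get_cookies(); the full script below uses method 2 with RequestsCookieJar.
def cookies_to_dict(cookie_list):
    """Method 1 sketch: turn the list of cookie dicts into a plain
    {name: value} dict, which requests also accepts as its cookies argument."""
    cookie_dict = {}
    for item in cookie_list:
        cookie_dict[item['name']] = item['value']
    return cookie_dict

# usage: requests.get(url, headers=headers, cookies=cookies_to_dict(cookie_list))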
import time
import requests
from ast import literal_eval
from requests.cookies import RequestsCookieJar
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'
}

def get_html(url, cookie, num):
    """
    Fetch one page of comments.
    :param url: the URL to scrape
    :param cookie: cookies to attach to the request
    :param num: start offset of this page (page index * 20)
    :return: str
    """
    r = requests.get(url, headers=headers, cookies=cookie, params={'start': num})
    return r.text

def read_cookie():
    """Read the cookies saved in step 1."""
    with open('resource/豆瓣cookie.txt', 'r') as f:
        cookies = f.read()
    # the file was written with str(), so parse it back into a list of dicts
    cookies = literal_eval(cookies)
    return cookies

def beautiful(text):
    """Parse one page of HTML and return all the short comments on it."""
    new = []
    soup = BeautifulSoup(text, 'lxml')
    shorts = soup.select('.short')
    for short in shorts:
        new.append(short.string)
    return new

def write(totals):
    """Append the comments to a text file, one per line."""
    with open('resource/你是我的荣耀.txt', 'a', encoding='utf-8', newline="") as f:
        for i in totals:
            f.write(i + '\n')

if __name__ == '__main__':
    url = 'https://movie.douban.com/subject/33454980/comments?status=P'
    cookie = read_cookie()
    # method 2: put the saved name/value pairs into a RequestsCookieJar
    jar = RequestsCookieJar()
    for i in cookie:
        jar.set(i['name'], i['value'])
    total = []
    for i in range(20):
        time.sleep(5)
        text = get_html(url, jar, i * 20)
        # '年轻人' is the author's nickname; when logged in the page contains
        # '<nickname>的帐号', so replace this with your own nickname
        if '年轻人的帐号' in text:
            print('login succeeded')
        else:
            print('login failed')
        shorts = beautiful(text)
        for short in shorts:
            total.append(short)
    write(total)
Step 3: Finally, read the comments back from the file and build a word cloud. jieba is used for word segmentation.
import wordcloud
import jieba

def read_txt():
    """Read the saved comments."""
    with open('resource/你是我的荣耀.txt', 'r', encoding='utf-8') as f:
        reader = f.read()
    return reader

def fenci(context):
    """Segment the text with jieba, keeping only words longer than one character."""
    s = jieba.lcut(context)
    news = []
    for i in s:
        if len(i) > 1:
            news.append(i)
    return ' '.join(news)

if __name__ == '__main__':
    reader = read_txt()
    text = fenci(reader)
    word = wordcloud.WordCloud(font_path='msyh.ttc',
                               background_color='white',
                               max_words=50).generate(text)
    word.to_file('你是我的荣耀.png')
Result image (word cloud screenshot):
Summary:
1. When passing cookies to requests, only each cookie's name and value are needed. Just pay attention to what data type each step returns.
2. The User-Agent header must be included, otherwise the request is rejected by the anti-scraping checks with a 403 error.
Also, if you scrape too fast your IP will get banned (a Douban ban seems to last about one day). An IP proxy can work around this; I have not learned to use one yet, but the basic idea is sketched below.
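A minimal sketch of routing a request through a proxy with requests' proxies parameter. The proxy address below is a placeholder; substitute a proxy you actually control or rent.
import requests

# Sketch: send the request through a proxy; the address is a placeholder.
proxies = {
    'http': 'http://127.0.0.1:7890',   # placeholder proxy address
    'https': 'http://127.0.0.1:7890',  # placeholder proxy address
}
r = requests.get('https://movie.douban.com/subject/33454980/comments?status=P',
                 headers={'User-Agent': 'Mozilla/5.0'}, proxies=proxies, timeout=10)
print(r.status_code)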