
[Python Knowledge Base] From Getting Started to Getting Buried: Scraping Weibo's Refreshed Recommendations and Latest Friend Posts with Python | Cookie Reuse | Simulated Login

Preface:
This blog is only a record of my own learning progress. My knowledge is limited, so if you spot anything wrong, please point it out in the comments. Everyone is welcome to come and discuss. (Some material comes from the internet; if anything infringes, it will be removed immediately.)

Disclaimer

  • The code is for learning purposes only; if it is reposted and used for any illegal activity, you bear the legal responsibility yourself.
  • All of the code is original; reposting is not permitted and counts as infringement.

Overview

Results

(screenshots of the scraped output omitted)

Code Walkthrough

Imports

import requests                  # HTTP requests
import time                      # pauses between requests
import json                      # decode the JSON responses
from selenium import webdriver   # simulated login to grab cookies
import csv                       # write results to a CSV file

cookie

Reference blog

  • Here I use the Selenium simulated-login approach.
  • The function is as follows:
def get_cookies():
    driver = webdriver.Firefox()  # launch the browser
    url = 'https://www.weibo.com/'
    driver.get(url)  # open the page
    # Once it opens, log in manually one time
    time.sleep(3)
    input('Press Enter after you have finished logging in: ')
    time.sleep(3)
    dictcookies = driver.get_cookies()  # grab the cookies
    cookie = [item["name"] + "=" + item["value"] for item in dictcookies]
    cookiestr = ';'.join(item for item in cookie)
    print(cookiestr)
    with open('wbcookie.txt', 'w') as f:
        f.write(cookiestr)
    print('cookies saved!')
    driver.close()
  • These lines convert the cookies obtained from the driver into a format that requests can use directly:
 cookie = [item["name"] + "=" + item["value"] for item in dictcookies]
 cookiestr = ';'.join(item for item in cookie)
  • Then write it to a file:
    with open('wbcookie.txt', 'w') as f:
        f.write(cookiestr)
    print('cookies saved!')
    driver.close()
  • Reading the cookie back:
def read_cookie():
    try:
        print("[INFO]: trying to read the local cookie")
        with open('wbcookie.txt', 'r', encoding='utf8') as f:
            Cookies = f.read()
            # print(Cookies)
    except OSError:
        print("[ERROR]: read failed, please log in manually to refresh it")
        get_cookies()
        Cookies = read_cookie()
    return Cookies
  • So there is both a read path and a fallback when reading fails; a small sketch of loading the saved cookie back into a requests session follows below.
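For reference, here is a minimal sketch of how the saved cookie string can be read back and attached to request headers. load_cookie_header is a hypothetical helper name, not part of the original code; it simply mirrors what the main block later in this post does by hand.

import requests

def load_cookie_header(path='wbcookie.txt'):
    # Hypothetical helper: read the cookie string saved by get_cookies()
    # and build the headers dict that requests expects.
    with open(path, 'r', encoding='utf8') as f:
        cookie_str = f.read().strip()
    return {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Cookie': cookie_str,
    }

# Usage: attach the headers to every request made through the session.
rs = requests.session()
headers = load_cookie_header()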

Page Analysis

Scraping the friends' timeline


  • Found it.

  • Click on it to take a look:

  • https://weibo.com/ajax/feed/unreadfriendstimeline?list_id=100017243538816&refresh=4&since_id=0&count=10

  • This is the request sent the first time the page is opened.

  • Refreshing shows a different list_id each time,

  • but removing list_id entirely makes no difference.

  • Next, scroll the page down

  • and capture the new request:

  • https://weibo.com/ajax/feed/unreadfriendstimeline?list_id=100017243538816&refresh=4&max_id=1638885029309985&count=10

  • Looking closely, passing the max_id parameter is what takes you to the next page.

  • Searching the first page's response carefully, you will find that max_id in it.

  • So the plan is: scrape the posts on the first page along with its max_id,

  • then rebuild the URL with that max_id to fetch the next batch,

  • and keep looping.

def get_friend_new(number):
    # Scrape `number` pages of the friends' timeline, chaining pages through max_id
    max_id = 0
    results = []
    for k in range(number):
        if k == 0:
            url = 'https://www.weibo.com/ajax/feed/unreadfriendstimeline?list_id=&refresh=4&since_id=0&count=10'
        else:
            url = 'https://www.weibo.com/ajax/feed/unreadfriendstimeline?list_id=&refresh=4&max_id={}&count=10'.format(max_id)
        DATA = get_new(url)
        results.append(DATA[1])  # parsed rows for this page
        max_id = DATA[0]         # max_id used to build the next page's URL
    return results

Scraping the recommendations page

  • The two feeds work much the same way.
  • The difference is that here max_id simply counts up from 0,
  • and max_id is the only thing that changes in the URL each time.
def get_hot_new(number):
    # Scrape `number` pages of the recommendations feed; max_id is just the page index
    results = []
    for k in range(number):
        url = 'https://www.weibo.com/ajax/feed/hottimeline?since_id=0&refresh=0&group_id=102803&containerid=102803&extparam=discover%7Cnew_feed&max_id={}&count=10'.format(k)
        DATA = get_new(url)
        results.append(DATA[1])
    return results

Scraping Logic

Code Walkthrough

  • Since the page structure and the parsing are identical for both feeds, the logic is wrapped in a single function that both callers use.
    results = []
    r = rs.get(url, headers=headers)
    r.encoding = 'utf-8'
    str_r = r.text
    dict_r = json.loads(str_r)
    # print(dict_r['max_id'])
    max_id = dict_r['max_id']
  • Convert the response body into a dict,
  • and pull max_id out of it (a slightly more defensive variant is sketched right below).
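As a side note, the same two steps can be written with requests' built-in JSON decoding, and max_id can default to 0 if the field is ever missing. This is just a hedged alternative, not the code used in the rest of the post:

r = rs.get(url, headers=headers)
dict_r = r.json()                  # equivalent to json.loads(r.text)
max_id = dict_r.get('max_id', 0)   # fall back to 0 if the key is ever absent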
 # for i in dict_r['statuses']:
    #     print(i)

    # print(dict_r['statuses'][1])
    # for dict_key, dict_value in dict_r['statuses'][1].items():
    #      print(dict_key, '->', dict_value)
    # try:
    #     pic=[]
    # for dict_key, dict_value in  dict_r['statuses'][2].items():
    #     print(dict_key, '->', dict_value)

    # print(dict_value['original']['url'])
    # pic.append(dict_value['original']['url'])
    # except:
    #     pass

  • First, print the stored values to inspect their structure (that is what the commented-out lines above were for),
  • then print the key/value pairs of each post dict one by one.
    for i in dict_r['statuses']:
        data = []
        data.append(i['created_at'])
        # data.append(i['id'])
        data.append(i['user']['screen_name'])
        # data.append(i['user']['id'])
        text = i['text_raw'].replace(u'\u200b', '')
        text = text.replace(u'\n', '')
        data.append(text)
        data.append(i['source'])
        data.append(i['reposts_count'])
        data.append(i['comments_count'])
        data.append(i['attitudes_count'])
        try:
            pic = []
            for dict_key, dict_value in i['pic_infos'].items():
                # print(dict_key, '->', dict_value)
                # print(dict_value['original']['url'])
                pic.append(dict_value['original']['url'])

        except:
            pass
        try:
            vid = []
            vid.append(i['url_struct'][0]['short_url'])

        except:
            pass
  • The fields that every post shares are collected one by one.

  • Then we reach a fork.

  • Some posts carry pictures, others carry a video,

  • so two try blocks are used, one for each.

  • On top of that, a post can be either original or a repost,

  • so the pic and vid lists built here are not appended to data just yet.

        if ('retweeted_status' in i):
            data.append('转发')
            data.extend(pic)
            data.extend(vid)
            pic = []
            vid = []
            results.append(data)
            data = []

            data.append(i['retweeted_status']['created_at'])
            # data.append(i['retweeted_status']['id'])
            data.append(i['retweeted_status']['user']['screen_name'])
            # data.append(i['retweeted_status']['user']['id'])
            text = i['retweeted_status']['text_raw'].replace(u'\u200b', '')
            text = text.replace(u'\n', '')
            data.append(text)
            data.append(i['retweeted_status']['source'])
            data.append(i['retweeted_status']['reposts_count'])
            data.append(i['retweeted_status']['comments_count'])
            data.append(i['retweeted_status']['attitudes_count'])
            data.append('原创')
            try:
                pic = []
                for dict_key, dict_value in i['retweeted_status']['pic_infos'].items():
                    # print(dict_key, '->', dict_value)
                    # print(dict_value['original']['url'])
                    pic.append(dict_value['original']['url'])
            except:
                pass
            try:
                vid = []
                vid.append(i['url_struct'][0]['short_url'])

            except:
                pass
            data.extend(pic)
            data.extend(vid)
            results.append(data)
        else:
            data.append('原创')
            data.extend(pic)
            data.extend(vid)
            results.append(data)

  • It turns out that when a post is a repost, the original post's information is nested inside it.
  • So in the repost case:
  • append pic and vid and tag the row as a repost ('转发'),
  • push data into results,
  • reset data to empty,
  • repeat the same extraction steps,
  • and scrape the original post's information as well (a tidier way to pull the media fields is sketched after this list).
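The bare try/except blocks work, but the same extraction can also be expressed with dict.get(), which avoids swallowing unrelated errors. A small sketch of that alternative, assuming the same field names (pic_infos, original/url, url_struct, short_url) seen in the responses above; extract_media is a hypothetical helper, not part of the original code:

def extract_media(post):
    # Hypothetical helper: collect picture URLs and a short video URL from one
    # status dict, assuming every pic_infos entry carries an original/url field
    # (which is what the try blocks above guard against).
    pic = [v['original']['url'] for v in post.get('pic_infos', {}).values()]
    vid = []
    url_struct = post.get('url_struct') or []
    if url_struct and 'short_url' in url_struct[0]:
        vid.append(url_struct[0]['short_url'])
    return pic, vid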

The full function is below:

def get_new(url):
    results=[]
    r = rs.get(url, headers=headers)
    r.encoding = 'utf-8'
    str_r = r.text
    dict_r = json.loads(str_r)
    # print(dict_r['max_id'])
    max_id = dict_r['max_id']
    # for i in dict_r['statuses']:
    #     print(i)

    # print(dict_r['statuses'][1])
    # for dict_key, dict_value in dict_r['statuses'][1].items():
    #      print(dict_key, '->', dict_value)
    # try:
    #     pic=[]
    # for dict_key, dict_value in  dict_r['statuses'][2].items():
    #     print(dict_key, '->', dict_value)

    # print(dict_value['original']['url'])
    # pic.append(dict_value['original']['url'])
    # except:
    #     pass

    for i in dict_r['statuses']:
        data = []
        data.append(i['created_at'])
        # data.append(i['id'])
        data.append(i['user']['screen_name'])
        # data.append(i['user']['id'])
        text = i['text_raw'].replace(u'\u200b', '')
        text = text.replace(u'\n', '')
        data.append(text)
        data.append(i['source'])
        data.append(i['reposts_count'])
        data.append(i['comments_count'])
        data.append(i['attitudes_count'])
        try:
            pic = []
            for dict_key, dict_value in i['pic_infos'].items():
                # print(dict_key, '->', dict_value)
                # print(dict_value['original']['url'])
                pic.append(dict_value['original']['url'])

        except:
            pass
        try:
            vid = []
            vid.append(i['url_struct'][0]['short_url'])

        except:
            pass

        if ('retweeted_status' in i):
            data.append('转发')
            data.extend(pic)
            data.extend(vid)
            pic = []
            vid = []
            results.append(data)
            data = []

            data.append(i['retweeted_status']['created_at'])
            # data.append(i['retweeted_status']['id'])
            data.append(i['retweeted_status']['user']['screen_name'])
            # data.append(i['retweeted_status']['user']['id'])
            text = i['retweeted_status']['text_raw'].replace(u'\u200b', '')
            text = text.replace(u'\n', '')
            data.append(text)
            data.append(i['retweeted_status']['source'])
            data.append(i['retweeted_status']['reposts_count'])
            data.append(i['retweeted_status']['comments_count'])
            data.append(i['retweeted_status']['attitudes_count'])
            data.append('原创')
            try:
                pic = []
                for dict_key, dict_value in i['retweeted_status']['pic_infos'].items():
                    # print(dict_key, '->', dict_value)
                    # print(dict_value['original']['url'])
                    pic.append(dict_value['original']['url'])
            except:
                pass
            try:
                vid = []
                vid.append(i['url_struct'][0]['short_url'])

            except:
                pass
            data.extend(pic)
            data.extend(vid)
            results.append(data)
        else:
            data.append('原创')
            data.extend(pic)
            data.extend(vid)
            results.append(data)

    for j in results:
        print(j)
    time.sleep(1)
    return [max_id,results]
  • And since scraping the friends' timeline needs max_id,
  • get_new() returns a list
  • containing both max_id and the scraped rows results (it can also be unpacked directly, as shown below).
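Because the return value is a two-element list, the caller can unpack it directly instead of indexing. This is purely a stylistic variant of the loop body in get_friend_new():

        max_id, rows = get_new(url)   # same values as DATA[0] and DATA[1]
        results.append(rows)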

Writing to a File

def write(results):
    with open("爬取结果.csv", "a", encoding="gb18030", newline="") as csvfile:
        writer = csv.writer(csvfile)
        print("[INFO] writing to the csv file")
        for i in results:        # each i is the list of rows from one page
            writer.writerows(i)
  • Writing the file is as simple as that (one optional refinement, a header row, is sketched below).
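One optional refinement: because the file is opened in append mode, it never receives a header row. Below is a hedged sketch that writes one only when the file is new. write_with_header is a hypothetical variant, and the column labels are my own names for the fields appended in get_new(); rows vary in length because of the media URLs, so the header is approximate.

import csv
import os

def write_with_header(results, path="爬取结果.csv"):
    new_file = not os.path.exists(path)
    with open(path, "a", encoding="gb18030", newline="") as csvfile:
        writer = csv.writer(csvfile)
        if new_file:
            # hypothetical column labels, following the order used in get_new()
            writer.writerow(["created_at", "screen_name", "text", "source",
                             "reposts", "comments", "attitudes", "type", "media"])
        for page in results:   # each page is a list of rows
            writer.writerows(page)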

Main

if __name__ == "__main__":

    Cookies=read_cookie()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
         'Cookie': '{}'.format(Cookies)}

    #get_friend_new(10)
    results=get_hot_new(10)
    write(results)

Complete Code

from lxml import etree
import requests
import time
import json
from selenium import webdriver
import csv

def read_cookie():
    try:
        print("[INFO]: trying to read the local cookie")
        with open('wbcookie.txt', 'r', encoding='utf8') as f:
            Cookies = f.read()
            # print(Cookies)
    except OSError:
        print("[ERROR]: read failed, please log in manually to refresh it")
        get_cookies()
        Cookies = read_cookie()
    return Cookies

def get_cookies():
    driver = webdriver.Firefox()
    url = 'https://www.weibo.com/'
    driver.get(url)  # open the page
    # Once it opens, log in manually one time
    time.sleep(3)
    input('Press Enter after you have finished logging in: ')
    time.sleep(3)
    dictcookies = driver.get_cookies()  # grab the cookies
    cookie = [item["name"] + "=" + item["value"] for item in dictcookies]
    cookiestr = ';'.join(item for item in cookie)
    print(cookiestr)
    with open('wbcookie.txt', 'w') as f:
        f.write(cookiestr)
    print('cookies saved!')
    driver.close()


rs = requests.session()

# r = rs.get('https://m.weibo.cn/api/container/getIndex?containerid=1076032039679457')
#
# rs = requests.session()
# s = r.text
# #s=s.encode('utf-8').decode('unicode_escape')
# content = re.sub(r'(\\u[a-zA-Z0-9]{4})', lambda x: x.group(1).encode("utf-8").decode("unicode-escape"), s)

# print(content)

def get_hot_new(number):
    results = []
    for k in range(number):
        url='https://www.weibo.com/ajax/feed/hottimeline?since_id=0&refresh=0&group_id=102803&containerid=102803&extparam=discover%7Cnew_feed&max_id={}&count=10'.format(k)
        DATA = get_new(url)
        results.append(DATA[1])
    return results

def get_friend_new(number):
    max_id=0
    results = []
    for k in range(number):
        if(k==0):
            url = 'https://www.weibo.com/ajax/feed/unreadfriendstimeline?list_id=&refresh=4&since_id=0&count=10'
        else:
            url='https://www.weibo.com/ajax/feed/unreadfriendstimeline?list_id=&refresh=4&max_id={}&count=10'.format(max_id)
        DATA=get_new(url)
        results.append(DATA[1])
        max_id=DATA[0]
    return results

def get_new(url):
    results=[]
    r = rs.get(url, headers=headers)
    r.encoding = 'utf-8'
    str_r = r.text
    dict_r = json.loads(str_r)
    # print(dict_r['max_id'])
    max_id = dict_r['max_id']
    # for i in dict_r['statuses']:
    #     print(i)

    # print(dict_r['statuses'][1])
    # for dict_key, dict_value in dict_r['statuses'][1].items():
    #      print(dict_key, '->', dict_value)
    # try:
    #     pic=[]
    # for dict_key, dict_value in  dict_r['statuses'][2].items():
    #     print(dict_key, '->', dict_value)

    # print(dict_value['original']['url'])
    # pic.append(dict_value['original']['url'])
    # except:
    #     pass

    for i in dict_r['statuses']:
        data = []
        data.append(i['created_at'])
        # data.append(i['id'])
        data.append(i['user']['screen_name'])
        # data.append(i['user']['id'])
        text = i['text_raw'].replace(u'\u200b', '')
        text = text.replace(u'\n', '')
        data.append(text)
        data.append(i['source'])
        data.append(i['reposts_count'])
        data.append(i['comments_count'])
        data.append(i['attitudes_count'])
        try:
            pic = []
            for dict_key, dict_value in i['pic_infos'].items():
                # print(dict_key, '->', dict_value)
                # print(dict_value['original']['url'])
                pic.append(dict_value['original']['url'])

        except:
            pass
        try:
            vid = []
            vid.append(i['url_struct'][0]['short_url'])

        except:
            pass

        if ('retweeted_status' in i):
            data.append('转发')
            data.extend(pic)
            data.extend(vid)
            pic = []
            vid = []
            results.append(data)
            data = []

            data.append(i['retweeted_status']['created_at'])
            # data.append(i['retweeted_status']['id'])
            data.append(i['retweeted_status']['user']['screen_name'])
            # data.append(i['retweeted_status']['user']['id'])
            text = i['retweeted_status']['text_raw'].replace(u'\u200b', '')
            text = text.replace(u'\n', '')
            data.append(text)
            data.append(i['retweeted_status']['source'])
            data.append(i['retweeted_status']['reposts_count'])
            data.append(i['retweeted_status']['comments_count'])
            data.append(i['retweeted_status']['attitudes_count'])
            data.append('原创')
            try:
                pic = []
                for dict_key, dict_value in i['retweeted_status']['pic_infos'].items():
                    # print(dict_key, '->', dict_value)
                    # print(dict_value['original']['url'])
                    pic.append(dict_value['original']['url'])
            except:
                pass
            try:
                vid = []
                vid.append(i['url_struct'][0]['short_url'])

            except:
                pass
            data.extend(pic)
            data.extend(vid)
            results.append(data)
        else:
            data.append('原创')
            data.extend(pic)
            data.extend(vid)
            results.append(data)

    for j in results:
        print(j)
    time.sleep(1)
    return [max_id,results]

def write(results):
    with open("爬取结果.csv", "a", encoding="gb18030", newline="") as csvfile:
        writer = csv.writer(csvfile)
        print("[INFO]正在写入csv文件中")
        for i in results:
            writer.writerows(i)

if __name__ == "__main__":

    Cookies=read_cookie()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
         'Cookie': '{}'.format(Cookies)}

    #get_friend_new(10)
    results=get_hot_new(10)
    write(results)

  • The number of posts scraped and the fields kept
  • can be adjusted however you like.
  • That's it for this one.
  • Good night.