[Python Knowledge Base] The Scrapy crawler framework, part 5: simulating login


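1. Reusing existing cookies: override the start_requests method in the spider file under spiders/ so the first request carries the cookies, then extract data in the callback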

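import scrapy


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    # Override start_requests so the first request carries the saved cookies
    def start_requests(self):
        cookies = """pgv_pvi = 8991404032;
                        rpdid = kmkkkilsxxdosoqkoxqww;
                        CURRENT_QUALITY = 64;
                        bsource = search_360;
                        innersign = 1"""
        # Split each pair on the first '=' only and strip surrounding whitespace
        cookies = {
            name.strip(): value.strip()
            for name, value in (pair.split('=', 1) for pair in cookies.split(';'))
        }
        # Build the request; pass the callback itself (self.parse), do not call it
        yield scrapy.Request(self.start_urls[0], callback=self.parse,
                             cookies=cookies)

    def parse(self, response):
        print(response.text)
        print(response.xpath('//div'))  # append .extract() to get the matched strings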
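
The cookie string copied from a browser usually needs a little cleanup before it can be passed as the cookies argument. The following is a minimal sketch using only the standard library; the helper name parse_cookie_header and the sample string are assumptions made for illustration and are not part of Scrapy. It splits each pair on the first '=' only, so values that themselves contain '=' survive, and it skips empty fragments.

```python
def parse_cookie_header(raw):
    """Turn a 'name=value; name2=value2' string into a dict.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    cookies = {}
    for pair in raw.split(";"):
        if "=" not in pair:
            # Skip empty fragments such as a trailing ';'
            continue
        name, value = pair.split("=", 1)  # split on the first '=' only
        cookies[name.strip()] = value.strip()
    return cookies


# Sample input (made up): whitespace, a value containing '=', a trailing ';'
sample = "pgv_pvi = 8991404032;\n  token=abc=def; innersign = 1;"
parsed = parse_cookie_header(sample)
print(parsed)
```

The resulting dict can be passed directly as cookies=parsed when building the scrapy.Request in start_requests.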

2. Logging in via POST with scrapy.FormRequest

import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # Read the fields that the login form expects to be posted back
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()
        post_data = dict(
            login="noobpythoner",
            password="zhoudaweL23",
            authenticity_token=authenticity_token,
            utf8=utf8,
            commit=commit
        )
        # Send the form fields as a POST request to the login endpoint
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata=post_data,
            callback=self.after_login)

    def after_login(self, response):
        # Process the page returned after logging in
        # (the check below needs "import re" at the top of the file)
        # print(re.findall("noobpythoner|NoobPythoner", response.body.decode()))
        pass
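
The parse method above reads each form field with its own XPath. As a rough illustration of the same idea outside Scrapy, the sketch below gathers every hidden input of a form in one pass with the standard library's html.parser and merges the result with the credentials. The class name HiddenInputCollector, the function collect_hidden_fields, and the sample HTML are all hypothetical; a real login page may use different field names. Note that it only picks up type="hidden" inputs, so a submit value such as commit would still have to be added by hand.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and attributes.get("name"):
            # A missing value attribute is treated as an empty string
            self.fields[attributes["name"]] = attributes.get("value") or ""


def collect_hidden_fields(html):
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Made-up login form used only for this example
SAMPLE_FORM = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

hidden = collect_hidden_fields(SAMPLE_FORM)
# Merge the hidden fields with the credentials to form the POST data
post_data = {**hidden, "login": "your-username", "password": "your-password"}
print(post_data)
```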

3. If the form has an action attribute, FormRequest.from_response can also be used to locate the form automatically

import re

import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,  # the form is found in the response automatically; with several forms, pass extra arguments to select one
            formdata={"login": "noobpythoner", "password": "daWei1231"},
            callback=self.after_login
        )

    def after_login(self, response):
        # Process the page returned after logging in
        print(re.findall("noobpythoner|NoobPythoner", response.body.decode()))

Added: 2021-09-24 10:31:00 Updated: 2021-09-24 10:32:22
