[Python Knowledge Base] Scraping Basic Company Information from Aiqicha (爱企查)

Preface:

A few days ago my boss asked me to scrape data on manufacturing-related companies. The first plan was Qichacha (企查查), but, as you can guess, everything past page 6 is paywalled, so no freeloading there. Even the first 6 pages were hard to scrape, with rate-limiting blocking me at every turn. A VIP account can apparently export the data directly, no crawler needed, so I asked the boss whether we could just pay a little.

The boss's verdict: it's not that we can't afford it; a programmer shouldn't have to put up with this.

So after half a day of fruitless wrestling with Qichacha, I switched to Baidu's Aiqicha (爱企查), which I heartily recommend: a company with a conscience. A registered account can view roughly 100 pages of results, with no rate-limit in the way.

Enough talk; below are two crawlers I wrote.
The first scrapes the companies Aiqicha lists when you search for "制造" (manufacturing) and tick the "manufacturing" industry filter.
The second scrapes manufacturing companies based on a directory of top-500 manufacturers.

Part 1

First, use the xlwt library to create the Excel workbook and worksheet, and write the header row (you can skip the header; the sheet is just uglier without it).
You also need a savepath variable (a string) holding the path where the scraped data will be stored.

# Create the workbook object and worksheet
book = xlwt.Workbook(encoding="utf-8", style_compression=0)
sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)
col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
for i in range(0, 6):
    sheet.write(0, i, col[i])  # column headers
book.save(savepath)  # save

Open the Aiqicha site, search for 制造, and tick the manufacturing-industry filter (ticked or not, the URL is actually the same). Grab the URL and store it in a variable:

url = "https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0"

Looking at the page, each page displays 10 companies, so we keep a counter to advance the Excel row offset: each page scraped moves us 10 rows down.
As a rookie fresh out of school, of course I named it count!

# Each page holds 10 companies, i.e. 10 rows in the sheet; advance 10 rows per page
count = 1  # start at row 1 so the header in row 0 is not overwritten
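The row bookkeeping can be sketched with a toy loop (a sanity check only; it assumes the header sits in row 0 and each page contributes 10 data rows):

```python
count = 1  # row 0 holds the header, so data starts at row 1
for page in range(1, 4):  # pretend we scrape 3 pages
    print(f"page {page} -> rows {count}..{count + 9}")
    count = count + 10  # advance past the 10 rows just written
```

After three pages the counter sits at 31, i.e. the next free row.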

Then set up a for loop that scrapes page by page. (Note that range(1, 100) actually covers pages 1 through 99; use range(1, 101) if you want a full 100.)

for num in range(1, 100):
    # request the page, get back the html
    html = askURL(url, num)
    '''
    This is where the per-field extraction for each page goes;
    hold on, it gets filled in under the "Extracting the data" section.
    '''
    # advance the counter by 10
    count = count + 10

When scraping each page, we first request it and get an html string back.
From that html we pick out what we want and write it into the Excel sheet.
Requesting the page comes first; extracting the data comes after.

Requesting the page:
The request is wrapped in a function, askURL (adapted from another author, whose very clear article is linked at the end; I added the num parameter).
Watch the page closely while flipping through a few pages: the URL never changes. So press F12, click through steps 1-4 shown in the figure below, scroll the headers to the very bottom, and you'll find a variable p in the Query String Parameters that carries the page number.
Now you know what the num parameter is for: it supplies the value of p, placed in params to turn pages.
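As a sanity check on how the params dict ends up in the request, the standard library's urlencode builds exactly the query string that requests appends to the URL (page 3 here is an arbitrary example):

```python
from urllib.parse import urlencode

params = {'q': '制造', 'p': '3'}  # p carries the page number
query = urlencode(params)
print(query)  # → q=%E5%88%B6%E9%80%A0&p=3
full_url = "https://aiqicha.baidu.com/s?" + query
```

So changing num in the code is exactly equivalent to clicking a different page number in the browser.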

Steps 1-4 in that figure are:
Network,
Fetch/XHR,
any page number,
the request under Name (clicking a page number adds a new one; compare their Query String Parameters and you'll see it is p that changes).
(Screenshot: the well-hidden pagination mechanism)
From here it's routine: write out the function skeleton, then copy the entries from Network → Fetch/XHR → Headers one by one.
Note the Request Method under General in the headers: it is GET, so use get() with the requests library as well.
The code is as follows:

def askURL(url, num):

    # request headers (copy your own values from DevTools → Network → Headers)
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # paste the cookie string from your own logged-in session here
        'cookie': 'PASTE_YOUR_OWN_COOKIE_HERE',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }

    # request parameters
    params = {
        'q': '制造',
        'p': str(num)
    }

    response = requests.get(url, params=params, headers=headers)
    print(response.status_code)
    html = response.text
    return html

Let's print this html and take a look:

{
                "name": "\u65e5\u7acb\u6cf5<em>\u5236\u9020<\/em>",
                "entName": "\u65e5\u7acb\u6cf5\u5236\u9020(\u65e0\u9521)\u6709\u9650\u516c\u53f8",
                "type": "\u54c1\u724c\u9879\u76ee",
                "latestRound": "\u88ab\u6536\u8d2d",
                "projectSimilarCnt": 20,
                "fundCnt": 0,
                "investEventCnt": 0,
                "brief": "\u5927\u578b\u6cf5\u5236\u9020\u5546",
                "entLogo": "https:\/\/zhengxin-pub.cdn.bcebos.com\/financepic\/c03f59fc536166836f2932460b777e1c_fullsize.jpg",
                "entlogoWord": "",
                "linkUrl": "\/brand\/detail?pid=46081494222640&id=882667927",
                "pid": "46081494222640",
                "fundList": [],
                "brandId": 882667927,
                "investevent": [],
                "startDate": "2006-02-06",
                "projectBrandFrom": "\u6c5f\u82cf\u7701\u65e0\u9521\u5e02",
                "engName": "",
                "entLogoWord": "\u65e5"
            }

Huh? The keys before the colons are readable, but everything after them looks like gibberish? Scraped like that the data is useless! A friend recognized these as \uXXXX Unicode escape sequences rather than plain UTF-8 text. Easy fix: decode them inside askURL before returning the html.

    response = requests.get(url, params=params, headers=headers)
    print(response.status_code)
    response.encoding = response.apparent_encoding
    html = response.text
    return html.encode("utf-8").decode("unicode_escape")
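The decoding step can be checked in isolation (the sample string below is made up, mirroring the escaped JSON shown above):

```python
# the body arrives as ASCII text containing literal \uXXXX escape sequences
raw = '"entName": "\\u5236\\u9020"'
# the transform used above: re-encode to bytes, then interpret the escapes
decoded = raw.encode("utf-8").decode("unicode_escape")
print(decoded)  # → "entName": "制造"
```

Note that unicode_escape treats raw bytes as Latin-1, so this trick is only safe while the payload is pure ASCII with escapes, as it is here; parsing the body with json.loads would sidestep the issue entirely.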

That completes the page-request part.

Extracting the data:
The for-loop skeleton above contains a multi-line comment; this section fills it in.
Taking the company name as an example, first look at the decoded html:

{
            "pid": "28791745513157",
            "entName": "天津钢管<em>制<\/em><em>造<\/em>有限公司",
            "entType": "有限责任公司",
            "validityFrom": "2010-12-10",
            "domicile": "天津市东丽区津塘公路396号",
            "entLogo": "https:\/\/zhengxin-pub.cdn.bcebos.com\/logopic\/1c5bd3854fcc2c7430cfdb737d6ca37f_fullsize.jpg",
            "openStatus": "开业",
            "legalPerson": "张铭杰",
            "tags": {
                "abnormal": "<span class=\"zx-ent-tag abnormal\">经营异常<\/span>",
                "laTaxer": "<span class=\"zx-ent-tag laTaxer\">A级纳税人(2015)<\/span>"
            },
            "logoWord": "钢管制造",
            "titleName": "天津钢管制造有限公司",
            "titleLegal": "张铭杰",
            "titleDomicile": "天津市东丽区津塘公路396号",
            "levelAtaxer": [2015, 2014],
            "regCap": "980,000.0万",
            "scope": "一般项目:钢、铁冶炼;金属材料制造;钢压延加工;金属废料和碎屑加工处理;金属材料销售;高品质特种钢铁材料销售;金属制品销售;金属矿石销售;热力生产和供应;污水处理及其再生利用;固体废物治理;技术服务、技术开发、技术咨询、技术交流、技术转让、技术推广;装卸搬运;普通货物仓储服务(不含危险化学品等需许可审批的项目);国内货物运输代理;汽车租赁;机动车修理和维护;住房租赁;非居住房地产租赁;物业管理;园林绿化工程施工。(除依法须经批准的项目外,凭营业执照依法自主开展经营活动)。许可项目:技术进出口;货物进出口;特种设备制造;道路货物运输(不含危险货物);检验检测服务;特种设备检验检测服务;发电、输电、供电业务;餐饮服务;住宿服务;小食杂;烟草制品零售;食品生产;食品经营(销售预包装食品);文件、资料等其他印刷品印刷;包装装潢印刷品印刷;印刷品装订服务。(依法须经批准的项目,经相关部门批准后方可开展经营活动,具体经营项目以相关部门批准文件或许可证件为准)。",
            "regNo": "91120110566114496B",
            "hitReason": [{"企业名称": "天津钢管<em>制<\/em><em>造<\/em>有限公司"}, {"网站名称": "钢管<em>制造<\/em>有限公司"}, {"经营范围": "一般项目:钢、铁冶炼;金属材料<em>制造<\/em>;钢压延加工;金属废料和碎屑加工处理;金属材料销售;高品质特种钢铁材料销售;金属制品销售;金属矿石销售;热力生产和供应;污水处理及其再生利用;固体废物治理;技术服务、技术开发"}],
            "labels": {
                "opening": {"text": "开业", "style": "blue"},
                "abnormal": {"text": "经营异常", "style": "red"}
            },
            "personTitle": "法定代表人",
            "personId": "06e7660efcba8782c4f3c488a98a5b1f"
        }

Both entName and titleName are followed by the company name. Pick titleName, the one without <em> tags inside, as the anchor, and extract the value with re's findall:

Name = re.findall(r'"titleName":"(.*?)"', html, re.S)
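Run against a fragment condensed from the sample above, the pattern pulls out exactly the clean name:

```python
import re

html = ('"entName":"天津钢管<em>制<\\/em><em>造<\\/em>有限公司",'
        '"titleName":"天津钢管制造有限公司","titleLegal":"张铭杰"')
names = re.findall(r'"titleName":"(.*?)"', html, re.S)
print(names)  # → ['天津钢管制造有限公司']
```

The lazy (.*?) stops at the first closing quote, which is why titleName works as a clean anchor while entName would drag the <em> markup along.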

Since one page carries information on 10 companies, loop over them and write each record into the Excel sheet.

j = 0
for i in range(count, count+10):
    sheet.write(i, 0, Name[j])
    j = j + 1
    book.save(savepath)

The complete company-name extraction code:

        # company name
        Name = re.findall(r'"titleName":"(.*?)"', html, re.S)
        print(Name)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 0, Name[j])  # write into the name column
            j = j + 1
            book.save(savepath)

That covers everything; the full source code follows:

# -*- coding: utf-8 -*-
import re  # regular expressions, for text matching
import xlwt  # Excel writing
import requests

def main():
    # Aiqicha manufacturing-search URL
    url = "https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0"

    # path of the output .xls file
    savepath = "C:/StevenXu/0930crawl/AQCdocu.xls"

    # create the workbook object and worksheet
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)
    col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
    for i in range(0, 6):
        sheet.write(0, i, col[i])  # column headers
    book.save(savepath)  # save

    # each page holds 10 companies, i.e. 10 rows; advance 10 rows per page
    count = 1  # start at row 1 so the header row is not overwritten

    # scrape pages 1-99
    for num in range(1, 100):
        html = askURL(url, num)

        # print the html to inspect it
        print("HHHHHHHHHHHHHHHHHHHH", html)

        # company name
        Name = re.findall(r'"titleName":"(.*?)"', html, re.S)
        print(Name)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 0, Name[j])
            j = j + 1
            book.save(savepath)

        # legal representative
        LegalPerson = re.findall(r'"titleLegal":"(.*?)"', html, re.S)  # extract via regex
        print(LegalPerson)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 1, LegalPerson[j])
            j = j + 1
            book.save(savepath)

        # registered capital
        RegisteredCapital = re.findall(r'"regCap":"(.*?)"', html, re.S)
        print(RegisteredCapital)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 2, RegisteredCapital[j])
            j = j + 1
            book.save(savepath)

        # date of establishment
        EstablishDate = re.findall(r'"validityFrom":"(.*?)"', html, re.S)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 3, EstablishDate[j])
            j = j + 1
            book.save(savepath)

        # address
        Location = re.findall(r'"titleDomicile":"(.*?)"', html, re.S)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 4, Location[j])
            j = j + 1
            book.save(savepath)

        # business scope
        BusinessScope = re.findall(r'"scope":"(.*?)"', html, re.S)
        j = 0
        for i in range(count, count+10):
            sheet.write(i, 5, BusinessScope[j])
            j = j + 1
            book.save(savepath)

        # advance the counter by 10
        count = count + 10


# fetch the page content for the given URL
def askURL(url, num):

    # request headers (copy your own values from DevTools → Network → Headers)
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # paste the cookie string from your own logged-in session here
        'cookie': 'PASTE_YOUR_OWN_COOKIE_HERE',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }

    # request parameters
    params = {
        'q': '制造',
        'p': str(num)
    }

    response = requests.get(url, params=params, headers=headers)
    print(response.status_code)
    response.encoding = response.apparent_encoding
    html = response.text
    return html.encode("utf-8").decode("unicode_escape")


if __name__ == "__main__":  # program entry point
    main()
    print("Done scraping!")

Part 2

Roughly 80% of the code is identical to the first script; there are three changes:
1. Use the xlrd library to read the top-500 manufacturing-company directory from Excel, assigning the result to the variable Name.
2. Build the request URL by string concatenation, splicing Name into it.
3. params and num are no longer needed in askURL.

Each change is explained below:
Change 1
Open the Excel file and select the worksheet:

wb = xlrd.open_workbook(filename="C:/StevenXu/0930crawl/CL2015.xls", formatting_info=True)
CLsheet = wb.sheet_by_name("CL")

Change 2
Inside the main for loop, read each company name, assign it to Name, splice it into the URL, and call askURL to fetch that company's page.

        Name = CLsheet.cell(num, 0).value
        print(Name)
        url = "https://aiqicha.baidu.com/s?q="+str(Name)+"&t=0"
        html = askURL(url)
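One optional hardening, not in the original code (which leaves it to requests to encode the Chinese characters): percent-encode the name explicitly with urllib.parse.quote. The company name here is just an example value:

```python
from urllib.parse import quote

Name = "天津钢管制造有限公司"  # example value, as if read from the directory sheet
url = "https://aiqicha.baidu.com/s?q=" + quote(str(Name)) + "&t=0"
print(url)
```

This guarantees a well-formed URL even if the request is later sent through a tool that does not auto-encode.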

Change 3
Nothing to explain here: just delete everything related to params and num. If in doubt, see the full code below.

With that, the crawler is basically done. Two small problems showed up during actual scraping, though:
1. Searching one company name can return several results.
2. Some companies cannot be found at all.

Solutions:
Problem 1: take only the first result. All companies on a page land in one list, so just take element [0]. (There is still a small issue here; the author will patch it up later, duty calls.)
Problem 2: if titleName matches nothing, Name is necessarily empty; in that case just continue and move on to the next company.

The code that handles both problems:

        Name = re.findall(r'"titleName":"(.*?)"', TotalStr, re.S)
        print(Name)
        # if nothing matched, skip to the next company
        if Name == []:
            continue
        # take only Name[0]; the other fields follow the same pattern
        sheet.write(count, 0, Name[0])
        book.save(savepath)
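Both guards can be exercised on toy fragments (invented strings, one hit and one miss), using a small helper to mimic the loop body:

```python
import re

def first_title(fragment):
    """Return the first titleName match, or None when nothing matches."""
    Name = re.findall(r'"titleName":"(.*?)"', fragment, re.S)
    if Name == []:
        return None  # the real loop does `continue` here
    return Name[0]  # take only the first result

print(first_title('"titleName":"甲公司","titleName":"甲公司(分部)"'))  # → 甲公司
print(first_title('{"resultList": []}'))  # → None
```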

The full code of Part 2:

# -*- coding: utf-8 -*-
import re  # regular expressions, for text matching
import xlwt  # Excel writing
import xlrd  # Excel reading
import requests


def main():
    # directory of top-500 manufacturing companies
    wb = xlrd.open_workbook(filename="C:/StevenXu/0930crawl/CL2015.xls", formatting_info=True)
    CLsheet = wb.sheet_by_name("CL")

    savepath = "C:/StevenXu/0930crawl/AQCdocu.xls"  # the output .xls path

    book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook object
    sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)  # create the worksheet
    col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
    for i in range(0, 6):
        sheet.write(0, i, col[i])  # column headers
    book.save(savepath)  # save
    count = 1  # start at row 1 so the header row is not overwritten
    for num in range(0, 600):  # iterate over the directory rows (adjust to the sheet's actual row count)


        Name = CLsheet.cell(num, 0).value
        print(Name)
        url = "https://aiqicha.baidu.com/s?q="+str(Name)+"&t=0"
        html = askURL(url)

        # narrow the html down to the first result block (everything before the first regNo)
        TotalStr = str(re.findall(r'resultList(.*?)regNo', html, re.S))
        print(TotalStr)

        Name = re.findall(r'"titleName":"(.*?)"', TotalStr, re.S)
        print(Name)
        # if nothing matched, skip to the next company
        if Name == []:
            continue
        # take only Name[0]; the other fields follow the same pattern
        sheet.write(count, 0, Name[0])
        book.save(savepath)

        LegalPerson = re.findall(r'"titleLegal":"(.*?)"', TotalStr, re.S)  # extract via regex
        print(LegalPerson)
        sheet.write(count, 1, LegalPerson[0])
        book.save(savepath)

        RegisteredCapital = re.findall(r'"regCap":"(.*?)"', TotalStr, re.S)
        print(RegisteredCapital)
        sheet.write(count, 2, RegisteredCapital[0])
        book.save(savepath)

        EstablishDate = re.findall(r'"validityFrom":"(.*?)"', TotalStr, re.S)
        print(EstablishDate)
        sheet.write(count, 3, EstablishDate[0])
        book.save(savepath)


        Location = re.findall(r'"titleDomicile":"(.*?)"', TotalStr, re.S)
        print(Location)
        sheet.write(count, 4, Location[0])
        book.save(savepath)


        BusinessScope = re.findall(r'"scope":"(.*?)"', TotalStr, re.S)
        print(BusinessScope)
        sheet.write(count, 5, BusinessScope[0])
        book.save(savepath)

        count = count + 1


# fetch the page content for the given URL
def askURL(url):

    # request headers (copy your own values from DevTools → Network → Headers)
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # paste the cookie string from your own logged-in session here
        'cookie': 'PASTE_YOUR_OWN_COOKIE_HERE',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }

    response = requests.get(url, headers=headers)
    print(response.status_code)
    response.encoding = response.apparent_encoding
    html = response.text
    return html.encode("utf-8").decode("unicode_escape")

if __name__ == "__main__":  # program entry point
    main()
    print("Done scraping!")

References:
Web-scraping basics:
https://blog.csdn.net/bookssea/article/details/107309591
Paging when the URL does not change:
https://blog.csdn.net/weixin_43881394/article/details/108056798

Published: 2021-10-13 11:24:31 · Updated: 2021-10-13 11:24:42