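Preface:
A few days ago my boss asked me to scrape a list of manufacturing companies. The first idea was Qichacha (企查查), but as you probably know, it starts charging after page 6, so there is no free ride. Even those first six pages are hard to scrape, because its rate limiting kept blocking me. VIP accounts, however, seem to be able to export the data directly with no crawler at all, so I went to negotiate with the boss: maybe we could just spend a little money?
The boss put it bluntly: it is not that we cannot afford it, but a programmer should not have to put up with this.
So, after half a day of fruitless battling with Qichacha, I switched without hesitation to Baidu's Aiqicha (爱企查). I strongly recommend Aiqicha; it is a decent company. A registered account can view roughly 100 pages of data, with no rate limiting.
Enough chatter. Below are two crawlers I wrote myself. The first collects the manufacturing companies that Aiqicha lists when you search for "制造" (manufacturing) and then tick the "制造业" (manufacturing industry) filter. The second looks up the companies named in a list of the top 500 manufacturing companies.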
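Part one
First, use the xlwt library to create the Excel workbook and worksheet and write the header row (you can skip the header, it just looks uglier without one). You also need a savepath variable (a string) that says where the scraped data will be stored.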
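import xlwt

savepath = "C:/StevenXu/0930crawl/AQCdocu.xls"  # where the scraped data is saved
book = xlwt.Workbook(encoding="utf-8", style_compression=0)
sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)
# Header: name, legal representative, registered capital, date of establishment, address, business scope
col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
for i in range(0, 6):
    sheet.write(0, i, col[i])
book.save(savepath)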
Open the Aiqicha website, search for "制造", and tick 制造业 (the URL is the same whether or not you tick it). Copy the URL and store it in a variable.
url = "https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0"
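A quick aside that is not part of the original scripts: the q value in this URL is simply the search keyword, percent-encoded as UTF-8. A minimal sketch with the standard library's urllib.parse shows the round trip (the variable names here are my own):

```python
from urllib.parse import quote, unquote

encoded = "%E5%88%B6%E9%80%A0"    # the q value copied from the URL above
keyword = unquote(encoded)         # decode the percent-escapes as UTF-8
print(keyword)                     # the keyword that was typed into the search box
print(quote(keyword) == encoded)   # re-encoding gives the same string back
```

This is handy when you want to search for a different keyword: encode it with quote and swap it into the URL.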
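Looking at the page, each results page shows 10 companies, so set up a counter variable that tracks the current row in the Excel sheet: after each page is scraped, it moves down 10 rows. As a rookie fresh out of school, of course I call it count! It starts at 1 because row 0 already holds the header.
count = 1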
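Then set up a for loop that scrapes page by page, 100 pages in total.
for num in range(1, 101):  # pages 1 to 100; range() excludes its end value
    html = askURL(url, num)
    # The code that extracts each item on the page goes here.
    # Hold on, it gets filled in later, in the extraction part.
    count = count + 10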
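To make the row bookkeeping concrete, here is a small sketch of the arithmetic, assuming the header sits in row 0 and every page holds 10 companies. The constant and function names are hypothetical, not taken from the scripts:

```python
PAGE_SIZE = 10    # companies shown per results page
HEADER_ROWS = 1   # row 0 of the sheet holds the column titles

def first_row_for_page(page):
    """Return the sheet row where the first company of a 1-based page goes."""
    return HEADER_ROWS + (page - 1) * PAGE_SIZE

for page in (1, 2, 100):
    start = first_row_for_page(page)
    print(page, start, start + PAGE_SIZE - 1)  # page, first row, last row
```

The count variable in the loop tracks the same value incrementally: it is the first free row for the page being scraped.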
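When scraping each page, we first request the page, which gives back some html. We then pick the things we want out of that html and put them into Excel. Requesting the page comes first, extraction second.
Requesting the page: For this I defined a function askURL (copied from another expert's post; I added the num parameter. I link his article at the end, it is very clear). Look closely at the site and flip through a few pages: however you page, the URL in the address bar stays the same. So press F12, click the items numbered 1 to 4 in the screenshot in order, and scroll to the very bottom of the Headers panel. There, under Query String Parameters, a variable p holds the page number. So now you know why the num parameter exists: it supplies the value of p, which goes into params to turn the page.
The four items are: (1) Network, (2) Fetch/XHR, (3) any page number, and (4) the entry that appears under Name (clicking a page number adds one entry; compare the p value under Query String Parameters across entries and you will see that p is what changes). The rest is routine: write the skeleton of the function, then copy over the values from Network > Fetch/XHR > Headers one by one. Note the Request Method under General in the Headers panel: it is GET, so use get in the requests library as well. The code is as follows: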
import requests

def askURL(url, num):
    # Request headers copied from the browser's Network panel
    headers = {
        'Accept': 'application/json, text/plain, */*',
        # 'br' is only decoded by requests when a brotli package is installed
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # Paste the Cookie value of your own logged-in session here, as a single line
        'cookie': '<your cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }
    params = {
        'q': '制造',
        'p': str(num)  # page number
    }
    response = requests.get(url, params=params, headers=headers)
    print(response.status_code)
    html = response.text
    return html
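If you are curious what the params dict turns into, requests appends it to the query string of the URL. The sketch below rebuilds such a URL with the standard library only, so you can inspect it without sending a request. BASE_URL and build_search_url are names I made up for illustration, and I am assuming (as the Network panel suggests) that q, t and p are the only parameters that matter:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = "https://aiqicha.baidu.com/s"  # search endpoint without its query string

def build_search_url(keyword, page):
    """Return the search URL for a keyword and a 1-based page number."""
    query = urlencode({"q": keyword, "t": 0, "p": page})
    return f"{BASE_URL}?{query}"

url = build_search_url("制造", 3)
print(url)
print(parse_qs(urlsplit(url).query))  # the parameters, parsed back out
```

Printing the URL for two different page numbers makes it obvious that only p changes between pages.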
Try printing this html:
{
"name": "\u65e5\u7acb\u6cf5<em>\u5236\u9020<\/em>",
"entName": "\u65e5\u7acb\u6cf5\u5236\u9020(\u65e0\u9521)\u6709\u9650\u516c\u53f8",
"type": "\u54c1\u724c\u9879\u76ee",
"latestRound": "\u88ab\u6536\u8d2d",
"projectSimilarCnt": 20,
"fundCnt": 0,
"investEventCnt": 0,
"brief": "\u5927\u578b\u6cf5\u5236\u9020\u5546",
"entLogo": "https:\/\/zhengxin-pub.cdn.bcebos.com\/financepic\/c03f59fc536166836f2932460b777e1c_fullsize.jpg",
"entlogoWord": "",
"linkUrl": "\/brand\/detail?pid=46081494222640&id=882667927",
"pid": "46081494222640",
"fundList": [],
"brandId": 882667927,
"investevent": [],
"startDate": "2006-02-06",
"projectBrandFrom": "\u6c5f\u82cf\u7701\u65e0\u9521\u5e02",
"engName": "",
"entLogoWord": "\u65e5"
}
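Huh? The keys before the colons are readable, but why is everything after the colons gibberish? Scraped like this, it is useless! A friend spotted that these are \uXXXX Unicode escape sequences rather than raw UTF-8 text. That makes it easy: decode them in askURL before returning the html.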
response = requests.get(url,params=params,headers=headers)
print(response.status_code)
response.encoding = response.apparent_encoding
html = response.text
return html.encode("utf-8").decode("unicode_escape")
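For the curious, here is a small offline sketch of what that decoding step does, using one field from the printout above. It also shows json.loads as an alternative that I did not use in the scripts, plus one caveat of unicode_escape that is worth knowing about:

```python
import json

# One record in the escaped form the server returns (trimmed to two keys)
raw = r'{"brief": "\u5927\u578b\u6cf5\u5236\u9020\u5546", "fundCnt": 0}'

# Approach used in askURL: fine while the text is pure ASCII plus \uXXXX escapes
via_escape = raw.encode("utf-8").decode("unicode_escape")

# Alternative: a JSON parser decodes the escapes and returns Python objects
record = json.loads(raw)

print(via_escape)
print(record["brief"])

# Caveat: unicode_escape garbles text that already contains non-ASCII characters
already_decoded = record["brief"]
garbled = already_decoded.encode("utf-8").decode("unicode_escape")
print(garbled == already_decoded)  # False
```

So the unicode_escape trick is safe only as long as the response consists of ASCII and escape sequences, which is what the printout above looks like.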
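That wraps up the page request part.
Extracting the fields: The for loop skeleton above contains a placeholder comment; this part fills it in. Take the company name as an example. First look at the decoded html: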
{
"pid": "28791745513157",
"entName": "天津钢管<em>制<\/em><em>造<\/em>有限公司",
"entType": "有限责任公司",
"validityFrom": "2010-12-10",
"domicile": "天津市东丽区津塘公路396号",
"entLogo": "https:\/\/zhengxin-pub.cdn.bcebos.com\/logopic\/1c5bd3854fcc2c7430cfdb737d6ca37f_fullsize.jpg",
"openStatus": "开业",
"legalPerson": "张铭杰",
"tags": {
"abnormal": "<span class=\"zx-ent-tag abnormal\">经营异常<\/span>",
"laTaxer": "<span class=\"zx-ent-tag laTaxer\">A级纳税人(2015)<\/span>"
},
"logoWord": "钢管制造",
"titleName": "天津钢管制造有限公司",
"titleLegal": "张铭杰",
"titleDomicile": "天津市东丽区津塘公路396号",
"levelAtaxer": [2015, 2014],
"regCap": "980,000.0万",
"scope": "一般项目:钢、铁冶炼;金属材料制造;钢压延加工;金属废料和碎屑加工处理;金属材料销售;高品质特种钢铁材料销售;金属制品销售;金属矿石销售;热力生产和供应;污水处理及其再生利用;固体废物治理;技术服务、技术开发、技术咨询、技术交流、技术转让、技术推广;装卸搬运;普通货物仓储服务(不含危险化学品等需许可审批的项目);国内货物运输代理;汽车租赁;机动车修理和维护;住房租赁;非居住房地产租赁;物业管理;园林绿化工程施工。(除依法须经批准的项目外,凭营业执照依法自主开展经营活动)。许可项目:技术进出口;货物进出口;特种设备制造;道路货物运输(不含危险货物);检验检测服务;特种设备检验检测服务;发电、输电、供电业务;餐饮服务;住宿服务;小食杂;烟草制品零售;食品生产;食品经营(销售预包装食品);文件、资料等其他印刷品印刷;包装装潢印刷品印刷;印刷品装订服务。(依法须经批准的项目,经相关部门批准后方可开展经营活动,具体经营项目以相关部门批准文件或许可证件为准)。",
"regNo": "91120110566114496B",
"hitReason": [{"企业名称": "天津钢管<em>制<\/em><em>造<\/em>有限公司"}, {"网站名称": "钢管<em>制造<\/em>有限公司"}, {"经营范围": "一般项目:钢、铁冶炼;金属材料<em>制造<\/em>;钢压延加工;金属废料和碎屑加工处理;金属材料销售;高品质特种钢铁材料销售;金属制品销售;金属矿石销售;热力生产和供应;污水处理及其再生利用;固体废物治理;技术服务、技术开发"}],
"labels": {
"opening": {"text": "开业", "style": "blue"},
"abnormal": {"text": "经营异常", "style": "red"}
},
"personTitle": "法定代表人",
"personId": "06e7660efcba8782c4f3c488a98a5b1f"
}
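You can see that both entName and titleName are followed by the company name. Pick titleName, which has no em tags in the middle, as the anchor, and use findall from the regular expression module re to extract it.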
Name = re.findall(r'"titleName":\s*"(.*?)"', html, re.S)  # \s* tolerates a space after the colon
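To see the regular expression at work without hitting the site, here is a sketch that runs it on a trimmed copy of the record shown above. The sample string is shortened by me; real responses carry many more keys:

```python
import re

# Trimmed sample in the shape shown above
sample = (
    '{"entName": "天津钢管<em>制<\\/em><em>造<\\/em>有限公司", '
    '"titleName": "天津钢管制造有限公司", "titleLegal": "张铭杰"}'
)

# \s* allows an optional space after the colon; (.*?) stops at the next quote
pattern = r'"titleName":\s*"(.*?)"'
names = re.findall(pattern, sample, re.S)
print(names)
```

findall returns a list with one entry per match, so on a full results page you get one name per company, in page order.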
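Because one page holds the information for 10 companies, a loop is needed to write all of it into Excel.
j = 0
for i in range(count, count + min(10, len(Name))):  # a short page may hold fewer than 10
    sheet.write(i, 0, Name[j])
    j = j + 1
book.save(savepath)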
The complete code for extracting the company names:
Name = re.findall(r'"titleName":\s*"(.*?)"', html, re.S)
print(Name)
j = 0
for i in range(count, count + min(10, len(Name))):  # a short page may hold fewer than 10
    sheet.write(i, 0, Name[j])
    j = j + 1
book.save(savepath)
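The write loop above counts with a separate j. A slightly more forgiving variant, sketched here with a hypothetical helper rows_to_write, lets enumerate do the counting so that a page with fewer than 10 results just produces fewer rows:

```python
def rows_to_write(values, first_row, column):
    """Return (row, column, value) triples for one page of extracted values."""
    return [(first_row + offset, column, value)
            for offset, value in enumerate(values)]

names = ["Company A", "Company B", "Company C"]  # a short page: only 3 results
for row, column, value in rows_to_write(names, first_row=991, column=0):
    # in the crawler this line would be sheet.write(row, column, value)
    print(row, column, value)
```

The helper only computes positions, which makes it easy to check by hand before wiring it to the real sheet.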
With that, everything has been written. The full source code follows:
import re
import xlwt
import requests

# One regex per output column, in column order; \s* tolerates a space after the colon
FIELDS = (
    r'"titleName":\s*"(.*?)"',      # company name
    r'"titleLegal":\s*"(.*?)"',     # legal representative
    r'"regCap":\s*"(.*?)"',         # registered capital
    r'"validityFrom":\s*"(.*?)"',   # date of establishment
    r'"titleDomicile":\s*"(.*?)"',  # address
    r'"scope":\s*"(.*?)"',          # business scope
)


def main():
    url = "https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0"
    savepath = "C:/StevenXu/0930crawl/AQCdocu.xls"
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)
    # Header: name, legal representative, registered capital, date of establishment, address, business scope
    col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
    for i in range(0, 6):
        sheet.write(0, i, col[i])
    book.save(savepath)
    count = 1  # next free row; row 0 holds the header
    for num in range(1, 101):  # pages 1 to 100
        html = askURL(url, num)
        # Same extraction as for titleName, repeated for every column
        for column, pattern in enumerate(FIELDS):
            values = re.findall(pattern, html, re.S)
            print(values)
            # Slice to 10 so a short page cannot raise IndexError
            for offset, value in enumerate(values[:10]):
                sheet.write(count + offset, column, value)
        book.save(savepath)
        count = count + 10


def askURL(url, num):
    # Request headers copied from the browser's Network panel
    headers = {
        'Accept': 'application/json, text/plain, */*',
        # 'br' is only decoded by requests when a brotli package is installed
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # Paste the Cookie value of your own logged-in session here, as a single line
        'cookie': '<your cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }
    params = {
        'q': '制造',
        'p': str(num)  # page number
    }
    response = requests.get(url, params=params, headers=headers)
    print(response.status_code)
    response.encoding = response.apparent_encoding
    html = response.text
    # Turn the \uXXXX escape sequences into readable characters
    return html.encode("utf-8").decode("unicode_escape")


if __name__ == "__main__":
    main()
    print("Crawl finished!")
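Part two
About 80% of it is identical to the first script. There are three changes: 1. Use the xlrd library to read the Excel file that lists the top 500 manufacturing companies, and assign each name to the variable Name. 2. The request URL is built by string concatenation, with Name inserted into it. 3. askURL no longer needs params or num.
Each change is explained below. Change 1: open the Excel file and select the worksheet.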
wb = xlrd.open_workbook(filename="C:/StevenXu/0930crawl/CL2015.xls", formatting_info=True)
CLsheet = wb.sheet_by_name("CL")
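Change 2: inside the main for loop, read the company name, assign it to Name, build the URL, and call askURL to fetch the search results page for each company.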
Name = CLsheet.cell(num, 0).value
print(Name)
url = "https://aiqicha.baidu.com/s?q="+str(Name)+"&t=0"
html = askURL(url)
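One thing to watch with plain concatenation: a company name containing characters such as & or spaces would change the meaning of the query string. A hedged sketch of a safer variant, using urllib.parse.quote and a hypothetical helper search_url (the company name below is invented for the demo):

```python
from urllib.parse import quote

def search_url(name):
    """Return the Aiqicha search URL for one company name."""
    # safe="" also escapes "/" so the whole name stays inside the q value
    return "https://aiqicha.baidu.com/s?q=" + quote(str(name), safe="") + "&t=0"

print(search_url("A&B (China) Co., Ltd."))
```

Chinese names are percent-encoded as UTF-8 by the same call, so the helper works for the names read from the Excel list as well.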
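Change 3 needs no explanation: just delete everything related to params and num. If you are unsure how, look at the full code further down.
At this point it is basically done. In actual runs, though, two small problems came up: 1. Searching for one company name can return several results. 2. Some companies cannot be found.
Solutions: Problem 1: take only the first result. All companies on a page end up in one list, so [0] is enough. (There is still a small issue here that the author will fill in later; work just came in.) Problem 2: if titleName matches nothing, Name is empty. In that case just continue to start the next iteration and move on to the next company.
The code that handles both problems: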
Name = re.findall(r'"titleName":\s*"(.*?)"', TotalStr, re.S)
print(Name)
if not Name:  # company not found: skip to the next one
    continue
sheet.write(count, 0, Name[0])  # several hits: keep only the first
book.save(savepath)
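The same "first hit or skip" logic can be tried offline. The sketch below wraps it in a hypothetical helper first_match and feeds it two invented response fragments, one with two hits and one with none:

```python
import re

def first_match(pattern, text):
    """Return the first captured group for pattern in text, or None if absent."""
    found = re.findall(pattern, text, re.S)
    return found[0] if found else None

page_with_hits = '"titleName":"Company A","regCap":"100"},{"titleName":"Company B"'
page_without_hits = '"resultList":[]'

for page in (page_with_hits, page_without_hits):
    name = first_match(r'"titleName":\s*"(.*?)"', page)
    if name is None:
        print("no result, skipping")
        continue
    print("first result:", name)
```

Returning None instead of indexing [0] directly also protects the other columns, which can be missing even when the name is present.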
The full code of the second script:
import re
import xlwt
import xlrd
import requests

# One regex per output column, in column order; \s* tolerates a space after the colon
FIELDS = (
    r'"titleName":\s*"(.*?)"',      # company name
    r'"titleLegal":\s*"(.*?)"',     # legal representative
    r'"regCap":\s*"(.*?)"',         # registered capital
    r'"validityFrom":\s*"(.*?)"',   # date of establishment
    r'"titleDomicile":\s*"(.*?)"',  # address
    r'"scope":\s*"(.*?)"',          # business scope
)


def main():
    wb = xlrd.open_workbook(filename="C:/StevenXu/0930crawl/CL2015.xls", formatting_info=True)
    CLsheet = wb.sheet_by_name("CL")
    savepath = "C:/StevenXu/0930crawl/AQCdocu.xls"
    book = xlwt.Workbook(encoding="utf-8", style_compression=0)
    sheet = book.add_sheet('AQCdocu', cell_overwrite_ok=True)
    # Header: name, legal representative, registered capital, date of establishment, address, business scope
    col = ("名称", "法人", "注册资本", "成立日期", "地址", "经营范围")
    for i in range(0, 6):
        sheet.write(0, i, col[i])
    book.save(savepath)
    count = 1  # next free row; row 0 holds the header
    for num in range(0, min(600, CLsheet.nrows)):  # never read past the last row
        Name = CLsheet.cell(num, 0).value
        print(Name)
        url = "https://aiqicha.baidu.com/s?q="+str(Name)+"&t=0"
        html = askURL(url)
        # Keep only the text from "resultList" up to the first "regNo": the first search result
        TotalStr = str(re.findall(r'resultList(.*?)regNo', html, re.S))
        print(TotalStr)
        row = []
        for pattern in FIELDS:
            found = re.findall(pattern, TotalStr, re.S)
            print(found)
            row.append(found[0] if found else "")  # several hits: keep the first
        if not row[0]:  # no titleName: the company was not found
            continue
        for column, value in enumerate(row):
            sheet.write(count, column, value)
        book.save(savepath)
        count = count + 1


def askURL(url):
    # Request headers copied from the browser's Network panel
    headers = {
        'Accept': 'application/json, text/plain, */*',
        # 'br' is only decoded by requests when a brotli package is installed
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Connection': 'keep-alive',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
        # Paste the Cookie value of your own logged-in session here, as a single line
        'cookie': '<your cookie>',
        'Host': 'aiqicha.baidu.com',
        'Referer': 'https://aiqicha.baidu.com/s?q=%E5%88%B6%E9%80%A0&t=0',
        'sec-ch-ua': '";Not A Brand";v="99", "Chromium";v="94"',
        'sec-ch-ua-platform': '"Windows"',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
    }
    response = requests.get(url, headers=headers)
    print(response.status_code)
    response.encoding = response.apparent_encoding
    html = response.text
    # Turn the \uXXXX escape sequences into readable characters
    return html.encode("utf-8").decode("unicode_escape")


if __name__ == "__main__":
    main()
    print("Crawl finished!")
References:
Crawler basics: https://blog.csdn.net/bookssea/article/details/107309591
URL unchanged when paging: https://blog.csdn.net/weixin_43881394/article/details/108056798