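The previous post, "Crawlers for company information (1)", covered fetching data from Qichacha (企查查). This post analyzes the js that generates its cookies.
Qichacha's cookies mainly include the following: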
- acw_tc=701ec49416327465587377184eb448e3cf457f2bbf56789e0313b461cd
- QCCSESSID=42negcpgs96lali07famk9fsp2
- qcc_did=7009749f-0fb0-4fb4-ad93-c0bb260c9a81
- UM_distinctid=17c27475a7c266-092b008134d5d-513c1743-144000-17c27475a7d6d2
- CNZZDATA1254842228=841623527-1632737666-%7C1632737666
- zg_did=%7B%22did%22%3A%20%2217c27475d0bb15-0d46738956f8f1-513c1743-144000-17c27475d0c9cc%22%7D
- zg_294c2ba1ecc244809c552f8f6fd2a440=%7B%22sid%22%3A%201632746560786%2C%22updated%22%3A%201632746560791%2C%22info%22%3A%201632746560790%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22%22%7D
- _uab_collina=163274656143041952478997
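Several of these values are URL-encoded JSON, which is easier to read once decoded. A minimal Python sketch, using the zg_did value listed above (the helper name decode_cookie_value is made up for this illustration and is not part of the site's code):

```python
import json
from urllib.parse import unquote


def decode_cookie_value(raw: str) -> dict:
    """Percent-decode a cookie value and parse the result as JSON."""
    return json.loads(unquote(raw))


# zg_did exactly as it appears in the cookie list above
raw_zg_did = (
    "%7B%22did%22%3A%20%2217c27475d0bb15-0d46738956f8f1-"
    "513c1743-144000-17c27475d0c9cc%22%7D"
)
decoded = decode_cookie_value(raw_zg_did)
print(decoded["did"])
```

The same helper works for the zg_294c... cookie, whose decoded form is discussed in section 6.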
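Analyzing these cookies requires a packet capture. The capture used here was taken with Fiddler (a link will be attached once it passes review).
1. acw_tc, QCCSESSID
According to the Fiddler capture, both of these cookies are set by the server in its response.
2. CNZZDATAXXXXXX, UM_distinctid
Not required. CNZZDATA is the cookie of CNZZ analytics; CNZZ was acquired by Umeng (友盟), so UM_xxx is Umeng's cookie.
3. qcc_did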
qcc_did=7009749f-0fb0-4fb4-ad93-c0bb260c9a81
Searching the site's js files for the cookie name eventually turns up the relevant code in https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js:
function setDeviceId() {
    var deviceId = getCookie('qcc_did');
    // console.info(deviceId)
    if (!deviceId) {
        var uuid = generateUUID();
        setCookie('qcc_did', uuid, 24*365*3); // 3 years
    }
}
function generateUUID() {
    var d = new Date().getTime()
    if (window.performance && typeof window.performance.now === 'function') {
        d += performance.now()
    }
    var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
        var r = (d + Math.random() * 16) % 16 | 0
        d = Math.floor(d / 16)
        return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
    })
    return uuid
}
So qcc_did is generated by the function generateUUID, and pyexecjs can run that js function from Python.
import execjs
qcc_did_js = """function generateUUID() {
var d = new Date().getTime()
var window = {}
if (window.performance && typeof window.performance.now === 'function') {
d += performance.now()
}
var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
var r = (d + Math.random() * 16) % 16 | 0
d = Math.floor(d / 16)
return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
})
return uuid
}
"""
qcc_did = execjs.compile(qcc_did_js).call('generateUUID')
print(qcc_did)
In fact, this is the most important cookie: setting it is enough to access Qichacha normally. The others can be left unset.
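Since qcc_did is the cookie that matters, a version without pyexecjs may be handy. The following is a sketch of a direct Python port of generateUUID, assuming (as in the pyexecjs version) that performance.now() is unavailable; the name generate_uuid is my own:

```python
import random
import time


def generate_uuid() -> str:
    """Python port of the site's generateUUID(), without the performance.now() term."""
    d = int(time.time() * 1000)  # new Date().getTime()
    out = []
    for c in "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx":
        if c not in "xy":
            out.append(c)  # '-' and the fixed version digit '4' are copied as-is
            continue
        r = int((d + random.random() * 16) % 16)  # (d + Math.random() * 16) % 16 | 0
        d //= 16                                   # Math.floor(d / 16)
        out.append(format(r if c == "x" else (r & 0x3 | 0x8), "x"))
    return "".join(out)


print("qcc_did=" + generate_uuid())
```

The result has the usual version-4 UUID shape: the third group starts with 4 and the fourth group starts with 8, 9, a or b.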
4. _uab_collina
_uab_collina=163274656143041952478997
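This turns out to be a cookie from Alibaba's CDN; it is set in the js file https://g.alicdn.com/sd/ncpc/nc.js?t=1520579483.
The reference article "taobao去验证码文件cookie处理模块浅析 - LiveZingy" is enough to work it out.
The conclusion first: the value is the 13-digit current time in milliseconds followed by an 11-digit random numeric string.
4.1 The cookie name is assigned to the variable g
As usual, searching for the cookie name directly shows that it is assigned to g.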
var d, u, p, _ = window,
f = document,
g = "_uab_collina",
h = _.pointman && pointman._now ? pointman._now: (new Date).getTime();
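4.2 Searching for g shows that function r() calls o(g)
If e already has a value, it is returned directly; if e is empty, the expression after || is evaluated, so the next step is to find functions a and i.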
function r() {
var e, t = /Firefox/.test(navigator.userAgent);
if (t) try {
e = localStorage.getItem(g)
} catch(n) {}
return e = e || o(g),
e || (e = h + a(11), i(g, e, 3650)),
e
}
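4.3 function i()
It writes the cookie and sets its expiry. Of the three comma-separated expressions in the return statement of function r, only the second writes a value to the cookie, so the key to _uab_collina is h + a(11).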
function i(e, t, n) {
n = n || 7;
var o = new Date;
o.setTime(o.getTime() + 864e5 * n),
f.cookie = [encodeURIComponent(e), "=", encodeURIComponent("" + t), ";expires=", o.toGMTString()].join("")
}
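For reference, the string that i() assigns to document.cookie can be sketched in Python as below. This is an approximation under stated assumptions: urllib.parse.quote with an extended safe set stands in for encodeURIComponent, and the now parameter is a hypothetical addition that makes the example reproducible.

```python
import time
from email.utils import formatdate
from urllib.parse import quote

# Characters encodeURIComponent leaves unescaped beyond quote()'s defaults
_SAFE = "!~*'()"


def cookie_string(name, value, days=7, now=None):
    """Build 'name=value;expires=<GMT date>' the way i(e, t, n) does (n defaults to 7 days)."""
    now = time.time() if now is None else now
    # 864e5 ms * n in the js is n days; formatdate works in seconds
    expires = formatdate(now + 86400 * days, usegmt=True)
    return "".join([quote(name, safe=_SAFE), "=", quote(str(value), safe=_SAFE),
                    ";expires=", expires])


print(cookie_string("_uab_collina", "163274656143041952478997", days=3650))
```

The call in r() passes 3650, so the cookie is kept for roughly ten years.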
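4.4 Function a(e): a random numeric string of e digits
The code of the function follows this list. The basics it relies on are:
- substring(start,stop): returns the substring from start up to stop-1;
- substring(start): returns the characters from start to the end of the string;
- substr(start,length): if start<0, start is counted from the end (start = string length + start); if length<=0, an empty string is returned;
- substr(start): if length is not given, returns the characters from start to the end of the string;
- Math.random: returns a random value greater than or equal to 0.0 and less than 1.0; its string form typically has about 15 to 18 digits after the decimal point.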
function a(e) {
for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
return t.substring(t.length - e)
}
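The substring/substr rules listed before the function map onto Python slicing. The two helpers below are illustrative only (their names are mine) and cover just the cases described in the list, plus the argument swap that substring performs when start is greater than stop:

```python
def js_substring(s, start, stop=None):
    """Mimic String.prototype.substring: clamp to [0, len], swap if start > stop."""
    n = len(s)
    stop = n if stop is None else stop
    start = min(max(start, 0), n)
    stop = min(max(stop, 0), n)
    if start > stop:
        start, stop = stop, start
    return s[start:stop]


def js_substr(s, start, length=None):
    """Mimic String.prototype.substr: a negative start counts from the end."""
    n = len(s)
    if start < 0:
        start = max(n + start, 0)
    if length is None:
        return s[start:]
    if length <= 0:
        return ""
    return s[start:start + length]


sample = "0.8371"                         # shaped like Math.random().toString()
print(js_substr(sample, 2))               # digits after "0."
print(js_substring("123456", 6 - 4))      # last 4 characters, as in t.substring(t.length - e)
```

In a(e), substr(2) strips the leading "0." from each random number and substring(t.length - e) keeps the last e characters.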
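4.5 The variable h
Return value of getTime(): both Java and JavaScript have a Date type, and its getTime() method returns the number of milliseconds since the Unix epoch. For current dates that is a 13-digit number.
4.6 Implementation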
import execjs
_uab_collina_js = """function a(e) {
for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
return (new Date).getTime() + t.substring(t.length - e)
}"""
_uab_collina = execjs.compile(_uab_collina_js).call('a', 11)
print(_uab_collina)
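The same value can be produced without pyexecjs. This sketch follows the conclusion stated earlier (13-digit millisecond timestamp plus 11 random digits) rather than porting a(e) line by line: the digits are drawn one at a time, which is my substitution for slicing Math.random() output. The function name is hypothetical.

```python
import random
import time


def make_uab_collina(n_digits=11):
    """Current time in milliseconds followed by n_digits random decimal digits."""
    millis = str(int(time.time() * 1000))  # (new Date).getTime()
    digits = "".join(random.choice("0123456789") for _ in range(n_digits))
    return millis + digits


print("_uab_collina=" + make_uab_collina())
```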
5. zg_did
{"did": "178b59f60733ad-089245814caeb-45410429-144000-178b59f607440b"}
Searching for zg_did in https://tongji.qichacha.com/zhuge.js finds it.
5.1 Search for zg_did in zhuge.js
y.prototype._initDid = function(e) {
var t = n.cookie.get("_zg"),
i = "",
r = n.hasMobileSdk();
r.flag && (i = r.getDid()),
e = e || this.config.did || i || n.UUID(),
t && n.JSONDecode(t).uuid && (e = n.JSONDecode(t).uuid),
n.cookie.get("zg_did") || n.cookie.remove("zg_" + this._key);
var o = n.extend({},
this.config);
o.cookie_expire_days = this.config.did_cookie_expire_days,
this.did = new u("zg_did", o),
this.did.register_once({
did: e
},
"")
},
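The directly relevant parts are this.did = new u("zg_did", o) and the registration call after it, this.did.register_once({did: e}, ""). From the shape of zg_did it is easy to see that e is the long value itself.
When no did is passed in or configured and no _zg cookie overrides it, e comes from n.UUID() in e = e || this.config.did || i || n.UUID(), so the next step is to search for UUID.
5.2 UUID is defined in only one place in the file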
UUID: (n = function() {
for (var e = 1 * new Date,
t = 0; e == 1 * new Date;) t++;
return e.toString(16) + t.toString(16)
},
function() {
var e = (screen.height * screen.width).toString(16);
return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
function(e) {
var t, i, n = m,
r = [],
o = 0;
function a(e, t) {
var i, n = 0;
for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
return e ^ n
}
for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
r.unshift(255 & i),
r.length >= 4 && (o = a(o, r), r = []);
return r.length > 0 && (o = a(o, r)),
o.toString(16)
} () + "-" + e + "-" + n()
})
5.3 The value of m, found by searching
g = window.navigator,
v = window.document,
m = g.userAgent,
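The code in 5.2 shows the form of the return value: n()-random number-function(e)-e-n(), which matches the five dash-separated parts of zg_did one to one. The code renders every part in hexadecimal, so convert each part to decimal first.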
| Part | Hexadecimal | Decimal |
| --- | --- | --- |
| n() | 178b59f60733ad | 6627142960362413 |
| Random number | 089245814caeb | 0150789189585643 |
| function(e) | 45410429 | 1161888809 |
| e | 144000 | 1327104 |
| n() | 178b59f607440b | 6627142960366603 |
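The conversion can be checked with a few lines of Python. Because n() concatenates two hexadecimal strings (timestamp, then loop counter), it is more informative to split it than to convert it as one number. The split below assumes the timestamp occupies the first 11 hexadecimal digits, which holds for present-day millisecond timestamps; split_n is a name made up for this sketch.

```python
from datetime import datetime, timezone

did = "178b59f60733ad-089245814caeb-45410429-144000-178b59f607440b"
first_n, rand_hex, hash_hex, screen_hex, last_n = did.split("-")


def split_n(part, ts_hex_len=11):
    """Split an n() value into (millisecond timestamp, loop counter t)."""
    return int(part[:ts_hex_len], 16), int(part[ts_hex_len:], 16)


ms, counter = split_n(first_n)
print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc), counter)
print(int(hash_hex, 16), int(screen_hex, 16))
```

The two n() values in this sample differ by one millisecond, which fits them being computed at the start and end of the same UUID() call.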
5.4 Implementation
import execjs
zg_did_js = """
// n(): hexadecimal millisecond timestamp followed by a hexadecimal busy-loop counter
function n() {
    for (var e = 1 * new Date, t = 0; e == 1 * new Date;) t++;
    return e.toString(16) + t.toString(16)
}
// c(m): the UUID function from zhuge.js, declared by name so that call('c', ...) can find it;
// the user agent is passed in as m
function c(m) {
    var e = "144000";  // (screen.height * screen.width).toString(16), copied from the captured cookie
    return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
    function(e) {
        var t, i, n = m,
            r = [],
            o = 0;
        function a(e, t) {
            var i, n = 0;
            for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
            return e ^ n
        }
        for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
            r.unshift(255 & i),
            r.length >= 4 && (o = a(o, r), r = []);
        return r.length > 0 && (o = a(o, r)),
            o.toString(16)
    } () + "-" + e + "-" + n()
}
"""
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
zg_did = execjs.compile(zg_did_js).call('c', ua)
print(zg_did)
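Note: the n() produced here is not always 14 characters long, because t sometimes converts to only two hexadecimal digits; in that case a 0 needs to be added in front of it.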
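As a step towards dropping pyexecjs, the UUID logic can be sketched in Python. The user-agent hash is ported directly, with js's signed 32-bit arithmetic emulated. Two parts are assumptions of mine rather than exact ports: the random part uses 48 random bits formatted as 12 hexadecimal digits behind a leading 0 (approximating Math.random().toString(16).replace(".", "")), and the screen value defaults to the captured "144000". All function names are hypothetical.

```python
import random
import time


def _to_int32(x):
    """Wrap an integer to a signed 32-bit value, as js bitwise operators do."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def _fold(acc, chunk):
    """Inner a(e, t): pack up to 4 bytes little-endian and XOR into the accumulator."""
    n = 0
    for i, b in enumerate(chunk):
        n |= b << (8 * i)
    return _to_int32(acc ^ _to_int32(n))


def ua_hash(ua):
    """The anonymous function(e) in UUID: XOR-fold the user agent 4 bytes at a time."""
    acc, chunk = 0, []
    for ch in ua:
        chunk.insert(0, ord(ch) & 255)  # r.unshift(255 & i)
        if len(chunk) >= 4:
            acc, chunk = _fold(acc, chunk), []
    if chunk:
        acc = _fold(acc, chunk)
    return format(acc, "x")


def time_prefix():
    """n(): hex millisecond timestamp + hex count of loop turns within that millisecond."""
    e = int(time.time() * 1000)
    t = 0
    while e == int(time.time() * 1000):
        t += 1
    return format(e, "x") + format(t, "x")


def make_zg_did(ua, screen_hex="144000"):
    rand_hex = "0" + format(random.getrandbits(48), "012x")  # approximation, see lead-in
    return "-".join([time_prefix(), rand_hex, ua_hash(ua), screen_hex, time_prefix()])


print(make_zg_did("Mozilla/5.0 (Windows NT 10.0; WOW64)"))
```

This sketch does not pad t; whether the site cares about the length of n() is not something the js shown here settles.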
6. zg_294c2ba1ecc2XXXXXXXXXXXX
{"sid": 1632746560786, "updated": 1632746560791, "info": 1632746560790, "superProperty": "{}", "platform": "{}", "utm": "{}", "referrerDomain": ""}
There are four things to find:
- key: 294c2ba1ecc2XXXXXXXXXX
- "sid": 1632746560786
- "updated": 1632746560791
- "info": 1632746560790
The values show that sid, updated and info are all timestamps; searching https://tongji.qichacha.com/zhuge.js confirms it.
6.1 info
y.prototype._info = function(e) {
var t = this.cookie.props.info,
i = 1 * new Date;
……
this._batchTrack(r),
this.cookie.register({
info: i
},
"")
}
},
6.2 updated, sid
y.prototype._session = function(e) {
var t = !1,
i = this.cookie.props.updated,
r = this.cookie.props.sid,
o = 1 * new Date,
a = new Date;
if (0 == r || o > i + 60 * this.config.session_interval_mins * 1e3) {
……
r = e || o,
r *= 1;
……
this.cookie.register({
sid: r
},
""),
t = !0
}
return this.cookie.register({
updated: o
},
""),
t
},
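The visible part of _session can be summarized in Python as follows. This is only a sketch of the lines shown: the elided parts are left out, missing props are treated as 0 (an assumption), and the default of 30 minutes for session_interval_mins is a placeholder because the real configured value does not appear in the excerpt.

```python
def update_session(props, now_ms, session_interval_mins=30):
    """Refresh 'updated' on every call; start a new 'sid' when there is none or it has expired.

    Returns True when a new session was started, like the flag t in the js.
    """
    updated = props.get("updated", 0)
    sid = props.get("sid", 0)
    started = False
    if sid == 0 or now_ms > updated + 60 * session_interval_mins * 1000:
        props["sid"] = now_ms  # r = e || o, with no explicit session id passed in
        started = True
    props["updated"] = now_ms
    return started


props = {}
print(update_session(props, 1632746560786), props)
print(update_session(props, 1632746560791), props)
```

On a first visit sid and updated are therefore both close to the current time in milliseconds, which is consistent with the captured values.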
6.3 key
It is right there in the js at https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js; a regular expression is enough to extract it.
window.zhuge.load('294c2ba1ecc244809c552f8f6fd2a440',{
visualizer: false,
// debug: true,
autoTrack:false
});
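6.4 Implementation
import gzip
import re
import urllib.request
from io import BytesIO
# header is the request-header dict built in get_header() in the summary (it sends Accept-Encoding: gzip)
opener = urllib.request.build_opener()
request = urllib.request.Request(url="https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js", headers=header)
html_1 = opener.open(request, timeout=10).read()
# the response is gzip-compressed, so decompress it before decoding
htmls = gzip.GzipFile(fileobj=BytesIO(html_1)).read().decode('utf-8')
key = re.findall(r"window\.zhuge\.load\('([^']+)'", htmls)[0]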
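Putting section 6 together, the whole cookie can be assembled offline once the key is known. The sketch below runs the regular expression on a sample line instead of a live response, and encodes the JSON the way the captured cookie is encoded (a space after each colon, none after commas, everything percent-encoded). extract_key and build_zg_cookie are names chosen for this example.

```python
import json
import re
import time
from urllib.parse import quote


def extract_key(js_text):
    """Pull the app key out of the window.zhuge.load('...') call."""
    m = re.search(r"window\.zhuge\.load\('([0-9a-f]+)'", js_text)
    return m.group(1) if m else None


def build_zg_cookie(key, sid, updated, info):
    """Return 'zg_<key>=<url-encoded JSON>' in the same layout as the captured cookie."""
    payload = {
        "sid": sid, "updated": updated, "info": info,
        "superProperty": "{}", "platform": "{}", "utm": "{}", "referrerDomain": "",
    }
    value = quote(json.dumps(payload, separators=(",", ": ")), safe="")
    return "zg_" + key + "=" + value


sample_js = "window.zhuge.load('294c2ba1ecc244809c552f8f6fd2a440',{"
key = extract_key(sample_js)
now = int(time.time() * 1000)
# offsets of +5 and +4 ms mirror the captured sid/updated/info values
print(build_zg_cookie(key, now, now + 5, now + 4))
```

Feeding in the captured timestamps reproduces the cookie listed at the top of this post character for character.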
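7. Summary
That covers the js code behind the cookies on the Qichacha site and how to reproduce them. However, (1) pyexecjs is not installed in my company's environment, and (2) pyexecjs has been taken down and I cannot install it with pip, so my only option is to convert the js into Python code (ó﹏ò。). I am really not familiar with js, though, so the conversion will have to wait until I have time.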
# coding:utf-8
import execjs
import time
import urllib.request
import http.cookiejar
import re
# Accept-Encoding is gzip, so the response has to be decompressed
from io import BytesIO
import gzip
from lxml import etree
def get_header():
    ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
    header = {
        'Connection': 'keep-alive',
        'User-Agent': ua,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }
    # opener is not defined in the original listing; a cookie-aware opener is assumed here
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()))

    # qcc_did: run the site's generateUUID() through pyexecjs
    qcc_did_js = """function generateUUID() {
        var d = new Date().getTime()
        var window = {}
        if (window.performance && typeof window.performance.now === 'function') {
            d += performance.now()
        }
        var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
            var r = (d + Math.random() * 16) % 16 | 0
            d = Math.floor(d / 16)
            return (c === 'x' ? r : (r & 0x3 | 0x8)).toString(16)
        })
        return uuid
    }"""
    qcc_did = execjs.compile(qcc_did_js).call('generateUUID')
    qcc_did_str = 'qcc_did=' + qcc_did

    # _uab_collina: millisecond timestamp + 11 random digits
    _uab_collina_js = """function a(e) {
        for (var t = ""; t.length < e;) t += Math.random().toString().substr(2);
        return (new Date).getTime() + t.substring(t.length - e)
    }"""
    _uab_collina = execjs.compile(_uab_collina_js).call('a', 11)
    _uab_collina_str = '_uab_collina=' + _uab_collina

    # zg_did, variant 1: everything in js; n and c are declared by name so call('c', ...) can find c
    zg_did_js_1 = """function n() {
        for (var e = 1 * new Date, t = 0; e == 1 * new Date;) t++;
        return e.toString(16) + t.toString(16)
    }
    function c(m) {
        var e = "144000";  // hex of screen.height * screen.width, from the captured cookie
        return n() + "-" + Math.random().toString(16).replace(".", "") + "-" +
        function(e) {
            var t, i, n = m,
                r = [],
                o = 0;
            function a(e, t) {
                var i, n = 0;
                for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
                return e ^ n
            }
            for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
                r.unshift(255 & i),
                r.length >= 4 && (o = a(o, r), r = []);
            return r.length > 0 && (o = a(o, r)),
                o.toString(16)
        } () + "-" + e + "-" + n()
    }"""
    zg_did_1 = execjs.compile(zg_did_js_1).call('c', ua)

    # zg_did, variant 2: n() in Python, with t zero-padded to three hex digits
    def n():
        e = int(time.time() * 1000)
        t = 0
        while e == int(time.time() * 1000):
            t = t + 1
        tt = hex(t)[2:]
        while len(tt) < 3:
            tt = '0' + tt
        return hex(e)[2:] + tt

    uuid_js = """function uuid(m) {
        var e = "144000";
        return "-" + Math.random().toString(16).replace(".", "") + "-" +
        function(e) {
            var t, i, n = m,
                r = [],
                o = 0;
            function a(e, t) {
                var i, n = 0;
                for (i = 0; i < t.length; i++) n |= r[i] << 8 * i;
                return e ^ n
            }
            for (t = 0; t < n.length; t++) i = n.charCodeAt(t),
                r.unshift(255 & i),
                r.length >= 4 && (o = a(o, r), r = []);
            return r.length > 0 && (o = a(o, r)),
                o.toString(16)
        } () + "-" + e + "-"
    }"""
    zg_did_2 = n() + execjs.compile(uuid_js).call('uuid', ua) + n()
    zg_did_str = 'zg_did=' + '{"did": "' + zg_did_1 + '"}'

    # zg_<key>: fetch the key from zhuge.js, then fill in millisecond timestamps
    request = urllib.request.Request(url="https://www.qcc.com/material/theme/chacha/cms/v2/js/zhuge.js", headers=header)
    html_1 = opener.open(request, timeout=10).read()
    buff = BytesIO(html_1)
    f = gzip.GzipFile(fileobj=buff)
    htmls = f.read().decode('utf-8')
    key = re.findall(r"window\.zhuge\.load\('([^']+)'", htmls)[0]
    sid = int(time.time() * 1000)  # the captured values are 13-digit millisecond timestamps
    info = sid + 4
    updated = info + 1
    # kept in its own variable so that key is not overwritten
    zg_key_str = 'zg_' + key + '=' + '{"sid": ' + str(sid) + ', "updated": ' + str(updated) + ', "info": ' + str(info) + ', "superProperty": "{}", "platform": "{}","utm": "{}", "referrerDomain": ""}'

    # qcc_did alone is enough; the others can be appended, separated by '; ', if needed:
    # _uab_collina_str, zg_did_str, zg_key_str
    cookie = qcc_did_str
    header['Cookie'] = cookie
    return header