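Why export Zhihu answers?
"We were young then, schoolmates in the bloom of youth;
filled with a scholar's spirit, bold and unrestrained,
pointing at mountains and rivers, stirring the world with our words…"
Zhihu always has plenty of eye-opening, accessible, high-quality answers. They tend to approach a question from an angle readers rarely or never consider, and arrive at a fresh point of view.
Answers like these can be collapsed or deleted at any moment, so how could a compulsive collector sit still?!
So I spent an evening piling up some elegant yet brute-force code that exports a Zhihu answer to PDF in one step.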
Environment
CPU: Intel® Core™ i7-9750H CPU @ 2.60GHz
wkhtmltopdf install location: N:\wkhtmltox\bin\wkhtmltopdf.exe
Windows: Windows 10 Home
PyCharm: 2020.1.1 (Community Edition) Build #PC-201.7223.92, built on April 30, 2020
Test answer link: https://www.zhihu.com/question/463243373/answer/1983723459
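Source code
I needed this in a hurry, so the code is crude and brute-force in many places (for example, the "Zhihu image-host reprocessing" part could be done with the BeautifulSoup library instead). As long as it works for URLs of the form "https://www.zhihu.com/question/QUESTION_ID/answer/ANSWER_ID", that is good enough.
When I have time I may add multi-URL, multithreaded export to PDF, export of all top-voted answers under a single question, export of every answer in a favorites folder, and parsing and export for other sites that publish text and images.
Known issues: text-and-image layout may break, no login cookie handling, and essentially no robustness.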
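On the robustness point, one cheap guard is to check the shape of the URL before fetching anything. Below is a minimal sketch using only the standard library. It assumes answer links always look like the test link above, optionally followed by a query string or fragment. `ANSWER_URL` and `parse_answer_url` are hypothetical names for illustration and are not part of ZhihuCapture.py.

```python
import re

# Hypothetical helper, not part of ZhihuCapture.py.
# Assumed shape: https://www.zhihu.com/question/<digits>/answer/<digits>
ANSWER_URL = re.compile(
    r'^https?://www\.zhihu\.com/question/(\d+)/answer/(\d+)(?:[/?#].*)?$'
)


def parse_answer_url(url):
    """Return (question_id, answer_id) for a Zhihu answer URL, else None."""
    match = ANSWER_URL.match(url.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_answer_url('https://www.zhihu.com/question/463243373/answer/1983723459'))
# ('463243373', '1983723459')
print(parse_answer_url('https://zhuanlan.zhihu.com/p/123'))
# None
```

A check like this would reject anything that merely contains the words "zhihu" and "answer", at the cost of needing a second pattern for column (zhuanlan) articles.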
Here is the code:
"""
ZhihuCapture.py
@author: Felerdise
"""
import requests
import re
import pdfkit
import time


def GetZhiHuAnswer(url):
    # Send a desktop browser User-Agent with the request.
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36"
    }
    r = requests.get(url, headers=header)
    r.encoding = 'utf-8'
    html = r.text
    if ZhiHuAnswer:
        # Cut the page off at the initial-data script, if it is present.
        cut = html.find('<script id="js-initialData"')
        if cut != -1:
            html = html[:cut]
        # Hide the login column, header user info, "view all" card and modal.
        html = html.replace('"Question-mainColumnLogin"', '"Question-mainColumnLogin" style="display:none"')
        html = html.replace('"AppHeader-userInfo"', '"AppHeader-userInfo" style="display:none"')
        html = html.replace('"Card ViewAll"', '"Card ViewAll" style="display:none"')
        html = html.replace('"ModalWrap"', '"ModalWrap" style="display:none"')
        # Drop <noscript> blocks; re.S lets the match span line breaks.
        html = re.sub(r'<noscript>.*?</noscript>', '', html, flags=re.S)
        # Swap each lazy-loaded image's src for its data-actualsrc URL.
        # The first src="..." match is used, so this relies on src
        # appearing before data-actualsrc inside the tag.
        ImgObj = re.findall(r'<img .*?/>', html)
        for item in ImgObj:
            if 'data-actualsrc' in item:
                srcObj = re.findall(r'src="(.*?)"', item)[0]
                actualObj = re.findall(r'data-actualsrc="(.*?)"', item)[0]
                html = html.replace(srcObj, actualObj, 1)
    if ZhiHuZhuanLan:
        # Hide the sub-section under column (zhuanlan) articles.
        html = html.replace('Post-Sub Post-NormalSub"', 'Post-Sub Post-NormalSub" style="display:none"')
    # Use the page title as the PDF file name, with a fallback if it is missing.
    searchObj = re.search(r'<title data-react-helmet="true">(.*?)</title>', html)
    titleOverall = searchObj.group(1) if searchObj else 'zhihu_export'
    # Replace characters that Windows does not allow in file names.
    titleOverall = re.sub(r'[\\/:*?"<>|]', '_', titleOverall)
    try:
        config = pdfkit.configuration(wkhtmltopdf=r'N:\wkhtmltox\bin\wkhtmltopdf.exe')
    except OSError:
        # pdfkit raises this when the executable is not at the given path.
        print("Error: ExeDirectionError")
        return
    option = {
        'quiet': '',
        'page-size': 'Letter',
        'dpi': '300',
        'disable-smart-shrinking': '',
        'margin-top': '0in',
        'margin-right': '0in',
        'margin-bottom': '0in',
        'margin-left': '0in',
        'encoding': "UTF-8",
        'no-outline': None,
    }
    try:
        pdfkit.from_string(html, "{}.pdf".format(titleOverall), configuration=config, options=option)
    except Exception as e:
        print("Error: PDFGenerateError ({})".format(e))
        return
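

# Module-level flags read by GetZhiHuAnswer to pick the page type.
ZhiHuAnswer = False
ZhiHuZhuanLan = False

if __name__ == '__main__':
    keyword = input('Enter the Zhihu answer URL to parse:\n').strip('\r\n')
    TimeStart = time.time()
    if 'zhihu' in keyword:
        if 'answer' in keyword:
            ZhiHuAnswer = True
        elif 'zhuanlan' in keyword:
            ZhiHuZhuanLan = True
        # The function defined above is GetZhiHuAnswer (ZhiHuCaptor does not exist).
        GetZhiHuAnswer(keyword)
    else:
        print('Not a Zhihu URL, nothing to do.')
    TimeLapse = time.time() - TimeStart
    print('Time used: {:.2f}s'.format(TimeLapse))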
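The image step in the script replaces the first occurrence of the placeholder URL anywhere in the page, which is the brute-force part mentioned earlier. The post suggests BeautifulSoup for this. To stay dependency-free, the sketch below swaps in a different technique: `re.sub` with a callback, so each `<img>` tag is rewritten on its own. It assumes attribute values contain no raw `>` and that `data-actualsrc` holds the real image URL. `use_actual_src` and the sample URLs are hypothetical, made up for illustration.

```python
import re

# Hypothetical helper, not part of ZhihuCapture.py.
IMG_TAG = re.compile(r'<img\b[^>]*>')
ACTUAL_SRC = re.compile(r'\bdata-actualsrc="([^"]*)"')
# Match a plain src="..." but not the tail of data-actualsrc="..." or data-src="...".
PLAIN_SRC = re.compile(r'(?<![\w-])src="[^"]*"')


def use_actual_src(html):
    """Point each <img> at its data-actualsrc URL, touching only that tag."""
    def fix_tag(match):
        tag = match.group(0)
        actual = ACTUAL_SRC.search(tag)
        if actual is None:
            return tag  # no data-actualsrc, leave the tag unchanged
        new_src = 'src="{}"'.format(actual.group(1))
        # A function replacement avoids backslash escapes in the URL being interpreted.
        return PLAIN_SRC.sub(lambda m: new_src, tag, count=1)
    return IMG_TAG.sub(fix_tag, html)


sample = ('<img src="data:image/svg+xml;utf8,x" '
          'data-actualsrc="https://pic1.zhimg.com/a.jpg"/>')
print(use_actual_src(sample))
# <img src="https://pic1.zhimg.com/a.jpg" data-actualsrc="https://pic1.zhimg.com/a.jpg"/>
```

Because the rewrite is scoped to one tag at a time, two images that share the same placeholder, or a placeholder string that also appears elsewhere in the page, cannot interfere with each other.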
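PDF export test results
Screenshots of the beginning, middle, and end of the exported PDF are shown for illustration.
That's all for now, just another little tool I threw together (づ。◕‿‿◕。)づ
The PDF generated from the test answer link contains 5000+ characters and 10+ images, and the export took 8.50s.
If the screenshots used to show the PDF export results infringe on anyone's rights, please let me know right away and I will revise the infringing content.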