Why export Zhihu answers?

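“Young we were, schoolmates in the flower of youth;

With a scholar's bold spirit, we brushed all restraint aside.

We pointed to mountains and rivers, and stirred people with our words…”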

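Zhihu always has answers that catch the eye: high-quality, accessible explanations that approach a question from an angle readers rarely or never considered, and arrive at a fresh viewpoint from there.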

Answers like these can be collapsed or deleted at any moment, so how could a compulsive collector sit still?!

So I spent an evening piling up some elegant yet brute-force code that exports a Zhihu answer to PDF in one step.

Environment

CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

wkhtmltopdf install location: N:\wkhtmltox\bin\wkhtmltopdf.exe

Windows: Windows 10 Home

PyCharm: 2020.1.1 (Community Edition) Build #PC-201.7223.92, built on April 30, 2020

Test answer URL: https://www.zhihu.com/question/463243373/answer/1983723459

Source code

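I needed this in a hurry, so the code is crude and much of it is brute force (the “Zhihu image hosting rework” step, for example, could be done with the BeautifulSoup library instead). It is good enough as long as it handles URLs of the form “https://www.zhihu.com/question/<question ID>/answer/<answer ID>”.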
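The URL form above can be checked before any download is attempted. The following is a minimal sketch, not part of the original script: `ANSWER_URL` and `parse_answer_url` are hypothetical names, and the pattern assumes both IDs are purely numeric, as in the test URL.

```python
import re

# Hypothetical pattern for https://www.zhihu.com/question/<question ID>/answer/<answer ID>
# Assumption: both IDs consist of digits only.
ANSWER_URL = re.compile(r'^https?://www\.zhihu\.com/question/(\d+)/answer/(\d+)')


def parse_answer_url(url):
    """Return (question_id, answer_id) as strings, or None if the URL has another shape."""
    match = ANSWER_URL.match(url.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == '__main__':
    print(parse_answer_url('https://www.zhihu.com/question/463243373/answer/1983723459'))
```

A check like this is stricter than testing whether the string contains 'zhihu' and 'answer', at the cost of rejecting URL variants the pattern does not anticipate.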

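When I have time, I may add: multi-URL, multi-threaded export; export of all highly upvoted answers to a single question; export of every answer in a collection; and parsing and export for other sites that publish text-and-image posts.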
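As a rough idea of what the multi-URL, multi-threaded export could look like, here is a sketch built on the standard library's thread pool. `export_many` and `export_one` are hypothetical names; `export_one` stands for any function that exports a single URL. The script's own export function reads module-level flags, so it would first need those passed in as arguments before being safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def export_many(urls, export_one, max_workers=4):
    """Call export_one(url) for every URL in a thread pool.

    Returns a dict mapping each URL to None on success,
    or to the exception that export_one raised for it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL
        futures = {pool.submit(export_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                results[url] = None
            except Exception as err:  # keep going; one bad URL should not stop the rest
                results[url] = err
    return results


if __name__ == '__main__':
    # Stand-in exporter that only prints; replace with a real single-URL exporter
    print(export_many(['url-1', 'url-2'], lambda url: print('exporting', url)))
```

Threads suit this job because most of the time is spent waiting on the network and on the external PDF renderer rather than on Python code.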

Known issues: text and image layout may break, login cookies are not handled, and there is basically no robustness.

Here is the code:

# -*- coding: utf-8 -*-
"""
ZhihuCapture.py
@author: Felerdise
"""
# Imports; time is only used to time the export
import requests
import re
import pdfkit
import time

def GetZhiHuAnswer(url):
    # Set the User-Agent to Chrome, which effectively strips redundant page components
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36"
    }
    r = requests.get(url, headers=header)
    r.encoding = 'utf-8'
    html = r.text

    # Remove clutter: brute-force edits on the raw HTML
    # Answer pages carry a large block of redundant <script> data
    if ZhiHuAnswer:
        cut = html.find(r'<script id="js-initialData"')
        if cut != -1:  # find() returns -1 when the marker is absent
            html = html[:cut]
    # Login prompts (str.replace already replaces every occurrence)
    html = html.replace(r'"Question-mainColumnLogin"', r'"Question-mainColumnLogin" style="display:none"')
    html = html.replace(r'"AppHeader-userInfo"', r'"AppHeader-userInfo" style="display:none"')
    # "View all answers" card
    html = html.replace(r'"Card ViewAll"', r'"Card ViewAll" style="display:none"')
    # "Open in the Zhihu app" modal
    html = html.replace(r'"ModalWrap"', r'"ModalWrap" style="display:none"')
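    # Lazy-loaded images
    # Zhihu image hosting rework (data-actualsrc holds a slightly lower-quality image)
    html = re.sub(r'<noscript>.*?</noscript>', '', html, flags=re.S)
    for item in re.findall(r'<img .*?/>', html):
        actual = re.search(r'data-actualsrc="(.*?)"', item)
        # The lookbehind stops this from matching inside data-actualsrc="..."
        src = re.search(r'(?<![\w-])src="(.*?)"', item)
        if actual and src:
            # Swap the placeholder src for the real URL, inside this tag only
            fixed = item[:src.start(1)] + actual.group(1) + item[src.end(1):]
            html = html.replace(item, fixed, 1)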
    
    # Hide the "included in column" block on column (zhuanlan) articles
    if ZhiHuZhuanLan:
        html = html.replace(r'Post-Sub Post-NormalSub"', r'Post-Sub Post-NormalSub" style="display:none"')

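    # Extract the title; fall back to a fixed name if the tag is missing
    searchObj = re.search(r'<title data-react-helmet="true">(.*?)</title>', html)
    titleOverall = searchObj.group(1) if searchObj else 'zhihu_export'
    # Replace characters that Windows does not allow in file names
    titleOverall = re.sub(r'[\\/:*?"<>|]', '_', titleOverall)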

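    # Render the HTML source to PDF
    # Path to wkhtmltopdf.exe (absolute; adjust to your install location)
    try:
        config = pdfkit.configuration(wkhtmltopdf=r'N:\wkhtmltox\bin\wkhtmltopdf.exe')
    except OSError as err:  # raised when the executable cannot be found
        print("Error: ExeDirectionError ({})".format(err))
        return

    # Output options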
    option = {
        'quiet': '',
        'page-size': 'Letter',
        'dpi': '300',
        'disable-smart-shrinking': '',
        'margin-top': '0in',
        'margin-right': '0in',
        'margin-bottom': '0in',
        'margin-left': '0in',
        'encoding': "UTF-8",
        # 'custom-header': [
        # ],
        # 'cookie': [
        # ],
        'no-outline': None,
        # 'javascript-delay': '',
    }

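    # Write the HTML out as a PDF
    try:
        pdfkit.from_string(html, "{}.pdf".format(titleOverall), configuration=config, options=option)
    except OSError as err:  # wkhtmltopdf failures surface as IOError (OSError)
        print("Error: PDFGenerateError ({})".format(err))
        return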

ZhiHuAnswer = False
ZhiHuZhuanLan = False

if __name__ == '__main__':
    keyword = input('Enter the Zhihu answer URL to parse:\n').strip('\r\n')
    # keyword = 'https://www.zhihu.com/question/463243373/answer/1983723459'
    # Multiple URLs, multi-threaded
    # if '\n' in keyword:
    #     keyword = keyword.split('\n')
    #     pass
    TimeStart = time.time()
    # Originally meant as a hook for further extensions
    if 'zhihu' in keyword:
        if 'answer' in keyword:
            ZhiHuAnswer = True
        elif 'zhuanlan' in keyword:
            ZhiHuZhuanLan = True
        GetZhiHuAnswer(keyword)
    TimeLapse = time.time() - TimeStart
    print('Time used: {:.2f}s'.format(TimeLapse))
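One of the known issues is that login cookies are not handled. A minimal sketch of one way to start, assuming you copy the Cookie header string from a logged-in browser session by hand: `parse_cookie_header` is a hypothetical helper, not part of the script above, and the cookie names in the example are placeholders.

```python
def parse_cookie_header(raw):
    """Turn 'name1=value1; name2=value2' into a {name: value} dict.

    Only the first '=' in each pair separates name from value,
    so values that themselves contain '=' are kept whole.
    """
    cookies = {}
    for part in raw.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep and name:  # skip empty fragments and pairs without '='
            cookies[name] = value
    return cookies


if __name__ == '__main__':
    # Placeholder names and values, not real Zhihu cookies
    print(parse_cookie_header('name1=value1; name2=value2'))
```

The resulting dict can be handed to `requests.get` through its `cookies` parameter. For images that wkhtmltopdf fetches on its own, the commented-out `'cookie'` entry in the options would presumably take the same pairs as a list of `(name, value)` tuples; treat that as an assumption to verify against the pdfkit documentation.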