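Downloading a static web page
Fetching the page content
Open VS Code and choose File > Open Folder. Then press Ctrl+P to bring up the command menu, type >jup, and select Create New Blank Notebook.
Press Ctrl+S and save the file with the .ipynb extension. (Remember to run pip install urllib3 first.)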
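import urllib3

url = "http://jandan.net/p/110355"
# PoolManager handles connection pooling for the requests made through it.
http = urllib3.PoolManager()
response = http.request("GET", url)
# response.data holds the raw bytes; decode() with no argument assumes UTF-8.
response_data = response.data
html_content = response_data.decode()
print(html_content)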
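Calling decode() with no argument assumes the page is UTF-8, which fails on pages served in another encoding such as GBK. The following is a minimal sketch, not part of urllib3, of picking the charset from the Content-Type header when one is present. The helper name decode_html is hypothetical; with urllib3 the header value could be read with response.headers.get("Content-Type", "").

```python
import re


def decode_html(data: bytes, content_type: str = "") -> str:
    """Decode page bytes using the charset named in a Content-Type header.

    Hypothetical helper: falls back to UTF-8 when no usable charset is given.
    """
    match = re.search(r'charset="?([\w-]+)', content_type, flags=re.IGNORECASE)
    encoding = match.group(1) if match else "utf-8"
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or wrong charset: keep going with replacements.
        return data.decode("utf-8", errors="replace")


# Offline usage example with hand-made bytes instead of a real response.
gbk_text = decode_html("煎蛋".encode("gbk"), "text/html; charset=GBK")
plain_text = decode_html(b"<html></html>")
```

This sketch only looks at the HTTP header; a page can also declare its encoding in a meta tag, which is not handled here.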
Saving the page to a file
Write the string out to a file.
fo = open("jiandan.html", "w", encoding="utf-8")
fo.write(html_content)
fo.close()
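The same write can also be done with a with block, which closes the file even if write() raises an exception. This is a small sketch under that assumption; save_text and the temporary file name are made up for illustration.

```python
import os
import tempfile


def save_text(path, text):
    # The with block closes the file automatically, even on errors.
    with open(path, "w", encoding="utf-8") as fo:
        fo.write(text)


# Write into a temporary directory and read the file back to check it.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.html")
    save_text(target, "<html>煎蛋</html>")
    with open(target, encoding="utf-8") as fi:
        saved = fi.read()
```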

Wrapping it up as functions
def download_content(url):
    http = urllib3.PoolManager()
    response = http.request("GET", url)
    response_data = response.data
    html_content = response_data.decode()
    return html_content

def save_to_file(filename, content):
    fo = open(filename, "w", encoding="utf-8")
    fo.write(content)
    fo.close()

url = "http://jandan.net/"
html_content = download_content(url)
save_to_file("jiandan.html", html_content)
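Downloading a dynamic web page
First, install selenium.
Installation finished. (It errored again while I was writing this, orz, so I ran pip once more from cmd.)
Saving a dynamic page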
from selenium import webdriver

url = "http://movie.douban.com/tv"
# Chrome loads the page; page_source returns the HTML as the browser currently has it.
brow = webdriver.Chrome()
brow.get(url)
html_content = brow.page_source
save_to_file("douban_tv.html", html_content)
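The snippet above leaves the browser window open after the page is saved. One possible way to make sure it is always closed is a try/finally wrapper. This is only a sketch: fetch_rendered_html and FakeDriver are hypothetical names, and the fake driver stands in for a real browser so the example runs without selenium installed. With selenium, webdriver.Chrome could be passed as make_driver.

```python
def fetch_rendered_html(url, make_driver):
    """Open url in a browser created by make_driver and return its HTML.

    make_driver is any zero-argument callable returning an object with
    get(), page_source and quit(), for example selenium's webdriver.Chrome.
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Runs even if get() fails, so the browser never lingers.
        driver.quit()


class FakeDriver:
    """Stand-in for a real browser, used only to demonstrate the wrapper."""

    closed = False

    def get(self, url):
        self.page_source = "<html>" + url + "</html>"

    def quit(self):
        FakeDriver.closed = True


html = fetch_rendered_html("http://movie.douban.com/tv", FakeDriver)
```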
Saving multiple pages
for i in range(1, 6):
    url = "http://jandan.net"
    # Page 1 is the home page; later pages use /page/<n>.
    if i > 1:
        url = url + "/page/" + str(i)
    html_content = download_content(url)
    save_to_file("jiandan_" + str(i) + ".html", html_content)
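One way to make the paging logic easier to check is to compute each page's URL and file name before downloading anything. The sketch below assumes that the first page is the site root and that later pages follow the /page/N pattern used in the loop above; page_targets is a hypothetical helper, not part of urllib3.

```python
def page_targets(base_url, last_page, prefix="jiandan_"):
    """Return (url, filename) pairs for pages 1..last_page."""
    targets = []
    for page in range(1, last_page + 1):
        # Page 1 is the base URL itself; later pages get a /page/N suffix.
        url = base_url if page == 1 else f"{base_url}/page/{page}"
        targets.append((url, f"{prefix}{page}.html"))
    return targets


targets = page_targets("http://jandan.net", 3)
for url, filename in targets:
    print(url, "->", filename)
```

Each pair can then be passed to download_content and save_to_file in turn.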