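I. Installing Selenium, a WebDriver, and other required libraries in an Anaconda virtual environment
1. Main commands for managing virtual environments
1. Create a virtual environment
Open Anaconda Prompt and run: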
conda create -n env_name python=3.6
Here env_name is the name of your virtual environment; you can choose any name.
To install the packages you need at the same time:
conda create -n env_name numpy matplotlib python=3.6
2. List the virtual environments you have created
conda env list
My virtual environments are listed in the output of this command (screenshot).
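3. Activate a virtual environment
conda activate your_env_name
Here your_env_name is the name of the virtual environment (older conda releases on Windows also accept plain activate your_env_name). After activating, run
python --version
to check that the current Python version is the one you want, that is, the version installed in the virtual environment.
4. Exit the virtual environment
conda deactivate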
5. Delete a virtual environment
# Delete an entire environment
conda remove -n your_env_name --all
# Remove a single package from an environment
conda remove --name your_env_name package_name
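Besides python --version, the interpreter itself can report which environment it belongs to. The following is a small sketch; the function name describe_interpreter is made up for illustration, and it assumes that the last component of sys.prefix matches the conda environment name, which is the usual layout but not guaranteed.

```python
import sys
from pathlib import Path


def describe_interpreter():
    """Return basic facts about the interpreter that is currently running."""
    return {
        # Major.minor.micro, e.g. "3.6.13"
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Full path of the python executable in use
        "executable": sys.executable,
        # Assumption: for a conda environment, sys.prefix is the environment's
        # directory, so its last component is normally the environment name.
        "prefix_name": Path(sys.prefix).name,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running this script inside the activated environment should print a path that points into that environment's folder.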
2. Installing the packages needed for this experiment
selenium
pip install selenium
WebDriver
For Selenium to control a browser it also needs a driver, and the WebDriver for each browser has to be installed separately.
Here I download the driver for Chrome.
It can be downloaded from: https://npm.taobao.org/mirrors/chromedriver/
The download is an exe file (screenshot).
Add the folder that contains this file to PATH (screenshot).
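A quick way to confirm that the PATH change took effect is to ask Python where it finds the driver. This is a minimal sketch using only the standard library; the helper name find_driver is hypothetical.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in PATH; on Windows it
    # also tries the usual executable extensions such as .exe.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print("chromedriver found at", path)
```

If the script reports that the driver was not found, open a new Anaconda Prompt first, since a prompt that was already open does not see the updated PATH.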
II. Automated testing against Baidu
1. Open the browser and go to the Baidu search page
from selenium import webdriver
driver=webdriver.Chrome('D:\\software\\chromedriver_win32\\chromedriver.exe')
driver.get("https://www.baidu.com/")
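For me this kept raising an error when run, which is related to the version of the driver that was downloaded.
There is a simple way around it:
install webdriver_manager and let it handle the WebDriver management.
pip install webdriver_manager
After that it is just a matter of calling the library: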
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
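By the way, if you have downloaded a newer driver version yourself, you can point Selenium at it: the executable_path parameter of webdriver.Chrome specifies the path to the driver.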
driver = webdriver.Chrome(executable_path=r'C:\path\to\chromedriver.exe')
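The calls above match Selenium 3. In Selenium 4 the executable_path argument is deprecated and the driver path is passed through a Service object instead. The following is a minimal sketch under the assumption that Selenium 4 is installed; the function name open_chrome is made up for illustration, and the imports sit inside the function only so that the definition loads even where Selenium is missing.

```python
def open_chrome(driver_path=None):
    """Start Chrome, optionally with an explicit chromedriver path."""
    # Imported lazily so that defining this function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path is None:
        # Selenium 4.6 and later include Selenium Manager, which tries to
        # locate or download a matching driver on its own.
        return webdriver.Chrome()
    # Otherwise hand the path to the driver through a Service object.
    return webdriver.Chrome(service=Service(executable_path=driver_path))


if __name__ == "__main__":
    driver = open_chrome()
    driver.get("https://www.baidu.com/")
    driver.quit()
```

With webdriver_manager, the path returned by ChromeDriverManager().install() can be passed as driver_path in the same way.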
The revised code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.baidu.com/")
Result of running the revised code (screenshot).
Right-click the Baidu page and choose Inspect (screenshot).
The search box has the id kw.
Locate that element and type a value into it:
p_input = driver.find_element_by_id("kw")
p_input.send_keys('知乎')
Result of typing into the search box (screenshot).
Inspecting the page in the same way shows that the 百度一下 (Baidu search) button has the id su.
Click that button:
p_btn=driver.find_element_by_id('su')
p_btn.click()
Result of clicking the button (screenshot).
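The find_element_by_id helpers used in this section belong to the Selenium 3 API and were removed in Selenium 4.3. The same two steps can be sketched with find-by-locator calls and an explicit wait. This assumes Selenium 4; the function name search_baidu and the timeout parameter are illustrative, while the ids kw and su are the ones found above.

```python
SEARCH_BOX_ID = "kw"      # id of the Baidu search box, found by inspecting the page
SEARCH_BUTTON_ID = "su"   # id of the search button


def search_baidu(driver, keyword, timeout=10):
    """Type keyword into the search box and click the search button."""
    # Imported lazily so that defining this function does not require Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait until the search box exists in the DOM, then type into it.
    box = wait.until(EC.presence_of_element_located((By.ID, SEARCH_BOX_ID)))
    box.send_keys(keyword)
    # Wait until the button can be clicked, then click it.
    button = wait.until(EC.element_to_be_clickable((By.ID, SEARCH_BUTTON_ID)))
    button.click()
```

The explicit wait retries until the element is ready or the timeout expires, so the script does not depend on the page having finished loading at the moment the lookup runs.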
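III. Scraping quotes from a dynamic page
Open http://quotes.toscrape.com/js/ and analyze the page elements (screenshot).
Each quote sits in an element with the class quote, and the quote text inside it has the class text (a class, not an id).
Implementation: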
from selenium import webdriver
import csv
from tqdm import tqdm
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('http://quotes.toscrape.com/js/')

quote_head = ['名言', '作者']  # CSV header: quote, author
quote_path = 'C:\\Users\\28205\\Documents\\Tencent Files\\2820535964\\FileRecv\\quote_csv.csv'
quote_content = []


def write_csv(csv_head, csv_content, csv_path):
    '''
    Write a header row followed by the data rows to a CSV file.
    csv_head: the CSV header row
    csv_content: the data rows; each row has as many columns as csv_head
    csv_path: path of the CSV file to write
    '''
    with open(csv_path, 'w', newline='') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(csv_head)
        fileWriter.writerows(csv_content)
    print('Scraped data saved successfully')


# Every quote on the page is wrapped in an element with class "quote"
quote = driver.find_elements_by_class_name("quote")
for i in tqdm(range(len(quote))):
    quote_text = quote[i].find_element_by_class_name("text")
    quote_author = quote[i].find_element_by_class_name("author")
    quote_content.append([quote_text.text, quote_author.text])
write_csv(quote_head, quote_content, quote_path)
Result of running the quotes script (screenshot).
The scraped data in the CSV file (screenshot).
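The write_csv helper opens the file with the platform's default encoding. A variant with an explicit encoding can be sketched as follows; utf-8-sig is my assumption of a convenient choice here, because it writes a byte order mark that lets Excel recognize the file as UTF-8. The read_csv and demo helpers and the sample row are made up purely to show a round trip.

```python
import csv
import os
import tempfile


def write_csv(csv_head, csv_content, csv_path, encoding="utf-8-sig"):
    """Write a header row followed by the data rows to csv_path."""
    with open(csv_path, "w", newline="", encoding=encoding) as file:
        writer = csv.writer(file)
        writer.writerow(csv_head)
        writer.writerows(csv_content)


def read_csv(csv_path, encoding="utf-8-sig"):
    """Read the whole CSV file back as a list of rows."""
    with open(csv_path, newline="", encoding=encoding) as file:
        return list(csv.reader(file))


def demo():
    """Round trip with a non-ASCII header and a field that contains a comma."""
    rows = [["Hello, world", "Somebody"]]  # illustrative sample data
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "quotes.csv")
        write_csv(["名言", "作者"], rows, path)
        return read_csv(path)


if __name__ == "__main__":
    print(demo())
```

The csv module quotes the field that contains a comma, so it comes back as a single column when the file is read again.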
IV. Selenium: scraping JD books with requests + Selenium
Open the JD page and inspect its elements to work out the tag ids of the information to scrape (screenshot).
The search button has no explicit id, so it cannot be fetched directly by id.
The tags that hold the price, the name, and so on can be seen in the product list (screenshot).
Implementation:
from selenium import webdriver
import time
import csv
from tqdm import tqdm
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://www.jd.com/")
time.sleep(3)

goods_info_list = []
goods_num = 200  # number of items to collect
goods_head = ['价格', '名字', '链接']  # CSV header: price, name, link
goods_path = 'C:\\Users\\28205\\Documents\\Tencent Files\\2820535964\\FileRecv\\qinming.csv'

# Type the search term (the book series 法医秦明) into the search box
p_input = driver.find_element_by_id("key")
p_input.send_keys('法医秦明')
# The search button has no id, so find it through its parent form
form_field = driver.find_element_by_class_name('form')
s_btn = form_field.find_element_by_tag_name('button')
s_btn.click()


def write_csv(csv_head, csv_content, csv_path):
    # Same helper as in the quotes script; repeated so this script runs on its own
    with open(csv_path, 'w', newline='') as file:
        fileWriter = csv.writer(file)
        fileWriter.writerow(csv_head)
        fileWriter.writerows(csv_content)


def get_price_and_name(goods):
    # Return the price element, the name element, and the product link
    goods_price = goods.find_element_by_css_selector('div.p-price')
    goods_name = goods.find_element_by_css_selector('div.p-name')
    goods_href = goods.find_element_by_css_selector('div.p-img>a').get_property('href')
    return goods_price, goods_name, goods_href


def drop_down(web_driver):
    # Scroll to the bottom of the page and wait for it to finish loading
    web_driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(3)


def crawl_a_page(web_driver, goods_num):
    # Collect items from the current page; return how many are still needed
    drop_down(web_driver)
    goods_list = web_driver.find_elements_by_css_selector('div#J_goodsList>ul>li')
    for i in tqdm(range(len(goods_list))):
        goods_num -= 1
        goods_price, goods_name, goods_href = get_price_and_name(goods_list[i])
        goods_info_list.append([goods_price.text, goods_name.text, goods_href])
        if goods_num == 0:
            break
    return goods_num


while goods_num != 0:
    goods_num = crawl_a_page(driver, goods_num)
    # Go to the next page of results
    driver.find_element_by_class_name('pn-next').click()
    time.sleep(1)
write_csv(goods_head, goods_info_list, goods_path)
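The main loop of the JD script keeps turning pages until 200 items have been collected, and it assumes that a next page always exists. The counting logic on its own can be sketched without a browser as below. The function collect_items and the plain lists standing in for result pages are illustrative; unlike the loop in the script, this version also stops cleanly when the pages run out.

```python
def collect_items(pages, limit):
    """Gather items page by page until limit items are collected
    or there are no more pages."""
    collected = []
    if limit <= 0:
        return collected
    for page in pages:
        for item in page:
            collected.append(item)
            # Stop in the middle of a page once the target is reached.
            if len(collected) >= limit:
                return collected
    # All pages were consumed before the limit was reached.
    return collected


if __name__ == "__main__":
    fake_pages = [[1, 2, 3], [4, 5, 6], [7]]
    print(collect_items(fake_pages, 5))    # stops partway through page two
    print(collect_items(fake_pages, 100))  # fewer items exist than requested
```

In the Selenium version, the equivalent of running out of pages is the lookup of the pn-next element failing, which raises NoSuchElementException unless it is handled.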
Result of running the JD script (screenshot).
The resulting CSV file (screenshot).
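The drop_down helper in the JD script scrolls once and then sleeps for a fixed three seconds. One possible alternative, sketched here as an assumption and not as what the script does, is to keep scrolling until the page height stops growing. The names scroll_to_bottom, pause, and max_rounds are hypothetical, and FakeDriver is only a stand-in that lets the loop run without a browser.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height no longer changes.

    Returns the final page height reported by the browser.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom has been reached
        last_height = new_height
    return last_height


class FakeDriver:
    """Stand-in for a WebDriver: the page grows by one step per scroll."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # Any other script is treated as a scroll that loads more content.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


if __name__ == "__main__":
    print(scroll_to_bottom(FakeDriver([1000, 2000, 3000]), pause=0))
```

A real WebDriver instance can be passed in place of FakeDriver, since only execute_script is used.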
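V. Summary
In this experiment I scraped information from dynamic web pages. As with static pages, you need to inspect the page structure, locate elements by id or with the relevant lookup functions, and then extract and store the information.
VI. References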
https://blog.csdn.net/weixin_40547993/article/details/100159125
https://zhuanlan.zhihu.com/p/331712873
https://blog.csdn.net/junseven164/article/details/121707162
https://blog.csdn.net/HaoZiHuang/article/details/106263207