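Many people preparing to study abroad look for consulting services from senior students, and are probably curious about what those services cost. So today, let's scrape that information.
The program has two parts. First it scrapes the URLs of the mentor detail pages, then it parses the content of each detail page.
The HTML of the site's index page does not contain the links to the mentor detail pages (it appears to be rendered with JavaScript), so the first step uses Selenium to scrape each mentor's relative URL.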
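One quick way to check whether a page needs a browser is to count the detail links in the raw HTML before and after rendering. The sketch below uses only the standard library. The `avatar` class name comes from the selector used later in this post; `AvatarLinkCounter`, `count_avatar_links`, and both sample HTML strings are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}


class AvatarLinkCounter(HTMLParser):
    """Collects href values of <a> tags nested inside class="avatar" elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside an 'avatar' element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            if tag == 'a' and attrs.get('href'):
                self.links.append(attrs['href'])
            if tag not in VOID_TAGS:
                self.depth += 1
        elif 'avatar' in (attrs.get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def count_avatar_links(html):
    parser = AvatarLinkCounter()
    parser.feed(html)
    return len(parser.links)


# Hypothetical samples: raw HTML before JavaScript runs, and after rendering.
raw_html = '<div class="list"></div><script>load()</script>'
rendered_html = ('<div class="avatar"><a href="/mentor/1/detail">'
                 '<img src="a.png"></a></div><a href="/other">x</a>')
print(count_avatar_links(raw_html), count_avatar_links(rendered_html))
```

If the count is zero for the HTML returned by `requests` but non-zero for `driver.page_source`, the links are being added by JavaScript.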
Import the modules
import requests
from selenium import webdriver
import time
from pyquery import PyQuery as pq
The while loop keeps finding and clicking the link to the next page, and reads each page's HTML through page_source.
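from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def get_page_source():
    url = 'http://www.dearmentor.com/search'
    driver = webdriver.Chrome()
    driver.maximize_window()
    driver.get(url)
    time.sleep(3)  # give the JavaScript time to render the list
    get_relative_link(driver.page_source)
    while True:
        try:
            # find_element(By.XPATH, ...) replaces find_element_by_xpath,
            # which was removed in Selenium 4
            driver.find_element(By.XPATH, "//div[@class='paging']/a[@class='next']").click()
        except NoSuchElementException:
            # assumes the last page has no "next" link
            break
        time.sleep(2)
        get_relative_link(driver.page_source)
    driver.quit()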
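A pagination loop like the one above needs a stopping condition for the last page. The control flow can be sketched and tested without a browser. Here `collect_pages` and `FakePager` are hypothetical names, `FakePager` stands in for the Selenium driver, and `max_pages` is an added safety cap that is not part of the original script.

```python
def collect_pages(get_source, click_next, max_pages=1000):
    """Collect page sources until click_next() reports there is no next page."""
    pages = [get_source()]
    while len(pages) < max_pages:
        if not click_next():
            break  # last page reached
        pages.append(get_source())
    return pages


class FakePager:
    """Hypothetical stand-in for a browser with a fixed number of pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def source(self):
        return self.pages[self.index]

    def next(self):
        # Return False instead of raising when there is no next page.
        if self.index + 1 >= len(self.pages):
            return False
        self.index += 1
        return True


pager = FakePager(['<p>1</p>', '<p>2</p>', '<p>3</p>'])
all_pages = collect_pages(pager.source, pager.next)
print(len(all_pages))
```

With a real driver, `click_next` would wrap the click in a try/except and return False when the next link cannot be found.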
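This function extracts the relative URLs and joins each one with the base URL to form an absolute URL.
def get_relative_link(info):
    base_url = 'http://www.dearmentor.com'
    doc = pq(info)
    divs = doc('.avatar').items()
    for div in divs:
        relative_link = div('a').attr.href
        if relative_link:  # skip avatars that carry no link
            abs_link = base_url + relative_link
            save_abs_link(abs_link)

def save_abs_link(abs_link):
    # links are appended back to back; part two splits them apart again
    with open('mentor_link.txt', 'a+') as f:
        f.write(abs_link)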
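String concatenation works here because every relative link starts with a slash. As a more forgiving alternative, the standard library's `urljoin` also handles links without a leading slash and links that are already absolute. This is a sketch only: `to_absolute` is a hypothetical helper and the `/mentor/123/detail` path is an assumed example, not a URL taken from the site.

```python
from urllib.parse import urljoin

base_url = 'http://www.dearmentor.com'


def to_absolute(relative_link):
    """Join a scraped href with the base URL."""
    return urljoin(base_url, relative_link)


# Assumed example paths, for illustration only.
print(to_absolute('/mentor/123/detail'))
print(to_absolute('mentor/123/detail'))
print(to_absolute('http://example.com/page'))
```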
Main entry point
if __name__ == '__main__':
    get_page_source()
End of part one.
With the URLs of all the mentor detail pages in hand, the parsing that follows is straightforward.
import requests
from pyquery import PyQuery as pq
import re
import csv
import threading
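Open the txt file saved in part one, clean it up, and convert it to a list.
with open('mentor_link.txt') as fh:
    text = fh.read()
# every link ends in 'detail', so break the text after each occurrence
text = text.replace('detail', 'detail\n')
# list containing every mentor's absolute link, with blank entries dropped
f = [link.strip() for link in text.split('\n') if link.strip()]
print(len(f))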
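Splitting on the word `detail` only works while every link ends with it. A simpler layout is one link per line, which can be read back without knowing anything about the URL shape. The sketch below shows that alternative; `save_links` and `load_links` are hypothetical helpers, the two sample URLs are made up, and a temporary directory is used so the example does not touch the real `mentor_link.txt`.

```python
import os
import tempfile


def save_links(path, links):
    """Append one link per line."""
    with open(path, 'a+', encoding='utf-8') as fh:
        for link in links:
            fh.write(link + '\n')


def load_links(path):
    """Read the links back, ignoring blank lines."""
    with open(path, encoding='utf-8') as fh:
        return [line.strip() for line in fh if line.strip()]


# Made-up sample links written to a throwaway file.
sample = ['http://www.dearmentor.com/a/detail',
          'http://www.dearmentor.com/b/detail']
path = os.path.join(tempfile.mkdtemp(), 'mentor_link.txt')
save_links(path, sample)
links = load_links(path)
print(len(links))
```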
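Next, define the parsing function.
Here we collect each mentor's code name, the services offered and their prices, education information, and past reviews.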
def get_mentor_allinfo(url):
    try:
        # the timeout keeps a stalled request from holding a slot forever
        res = requests.get(url=url, timeout=10)
        doc = pq(res.text)
        # mentor basic info
        men_basic_info = doc('head title').text()
        name = doc('.mentor-info.w1080 .name').text()
        print(name)
        # mentor services and prices
        service_total = []
        for service in doc('.server-list .li').items():
            ser_name = service('.tt').text()
            ser_des = service('.intro').text()
            price_info = service('.stat-wrap .stat').text()
            # keep what follows the price label, minus the trailing ".00"
            price = price_info.split('价 格:')
            if len(price) > 1:
                price = price[1].split('.00')[0]
            else:
                price = ''
            print(ser_name, '\n', ser_des, '\n', price_info, '\n', price)
            service_total.append((ser_name, ser_des, price_info, price))
        # education background
        edu_total = []
        for edu in doc('#educational .record-list li').items():
            date = edu('.date').text()
            # turn non-breaking spaces into normal spaces
            program = edu('.tags').text().replace('\xa0', ' ')
            print(date, program)
            edu_total.append((date, program))
        # reviews
        comment_total = []
        for comment in doc('#commentList li').items():
            com_date = comment('.date').text()
            com_service = comment('.tag-wrap').text()
            com_content = comment('.content').text()
            print(com_date, com_service, com_content)
            comment_total.append((com_date, com_service, com_content))
        save_mentor_info(name, service_total, edu_total, comment_total)
    finally:
        # release the slot even if the request or parsing fails,
        # otherwise the main loop can block forever
        semaphore.release()
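The price is pulled out of the stat text by splitting on the label and on `.00`, which breaks if a price has other decimals. A regular expression is one alternative. This is a sketch under assumptions: `extract_price` and `PRICE_RE` are hypothetical names, and the sample strings are invented to resemble the label used above rather than copied from the site.

```python
import re

# Matches the price label (with optional spacing and either colon style)
# followed by a number with optional decimals.
PRICE_RE = re.compile(r'价\s*格\s*[::]\s*(\d+(?:\.\d+)?)')


def extract_price(price_info):
    """Return the price as a float, or None if no price is present."""
    match = PRICE_RE.search(price_info)
    if match is None:
        return None
    return float(match.group(1))


# Invented sample strings, for illustration only.
print(extract_price('价 格:199.00'))
print(extract_price('no price here'))
```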
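Define the save function. Here the data is saved as CSV, although storing it in MongoDB would actually be a better fit.
write_lock = threading.Lock()  # several threads append to the same file

def save_mentor_info(name, service_total, edu_total, comment_total):
    # newline='' stops the csv module from writing blank rows on Windows
    with write_lock, open('mentor_info.csv', 'a+', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow((name, service_total, edu_total, comment_total))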
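Writing lists of tuples into a CSV cell stores their Python repr, which is awkward to read back. That nesting is the reason a document store such as MongoDB would fit better. If CSV is kept, one option is to serialize each nested field as JSON. The sketch below writes to an in-memory buffer; `write_mentor_row` is a hypothetical helper and the sample values are made up.

```python
import csv
import io
import json


def write_mentor_row(fh, name, service_total, edu_total, comment_total):
    """Write one mentor per row, with nested fields stored as JSON strings."""
    writer = csv.writer(fh)
    writer.writerow([
        name,
        json.dumps(service_total, ensure_ascii=False),
        json.dumps(edu_total, ensure_ascii=False),
        json.dumps(comment_total, ensure_ascii=False),
    ])


# Made-up sample data written to an in-memory file.
buffer = io.StringIO(newline='')
write_mentor_row(
    buffer,
    'M001',
    [('Essay review', 'One round of feedback', '价 格:199.00', '199')],
    [('2015-2019', 'BSc')],
    [],
)
buffer.seek(0)
row = next(csv.reader(buffer))
services = json.loads(row[1])  # tuples come back as lists
print(services[0][0])
```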
Define the main block; BoundedSemaphore limits the number of concurrent threads.
if __name__ == '__main__':
    semaphore = threading.BoundedSemaphore(5)
    for url in f:
        semaphore.acquire()
        t1 = threading.Thread(target=get_mentor_allinfo, args=(url,))
        t1.start()
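The same cap of five concurrent workers can also be expressed with a thread pool from the standard library, which handles the acquire and release bookkeeping itself. This is an alternative sketch, not the approach used in this post: `fetch_all` is a hypothetical wrapper and `fake_worker` stands in for the parsing function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, worker, max_workers=5):
    """Run worker(url) for every URL with at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(worker, urls))


def fake_worker(url):
    # Stand-in for the real parsing function; just measures the URL.
    return len(url)


results = fetch_all(['a', 'bb', 'ccc'], fake_worker)
print(results)
```

Leaving the `with` block waits for all workers to finish, so the caller knows when every page has been processed.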
Results