(2) Scraping Sina News with Scrapy: Creating the Project and Saving the Scraped Content to MySQL
I have recently been building a complete recommendation system and needed some raw data, so I set out to crawl Sina News with Scrapy. This covers creating the project, scraping the content, and saving it to MySQL.
1. Creating a Scrapy Crawler Project
1.1 Introduction to the Scrapy Framework
Scrapy is an asynchronous processing framework based on Twisted, implemented in pure Python. Its modules are loosely coupled, it is highly extensible, and it can flexibly accommodate a wide range of scraping needs.
1.1.1 Architecture
Scrapy is made up of the following components:
- Engine
- Item
- Scheduler
- Downloader
- Spiders
- Item Pipeline
- Downloader Middlewares
- Spider Middlewares
1.1.2 Data Flow
The Engine takes the initial requests from the Spider, queues them in the Scheduler, and sends them through the Downloader Middlewares to the Downloader; the downloaded Responses flow back through the Engine to the Spider, and the Items the Spider extracts are finally handed to the Item Pipeline for processing.
1.1.3 Project Structure
A Scrapy project is created from the command line, and the code is then written in an IDE (PyCharm, etc.). After creation, the project has the following file structure:
scrapy.cfg
project/
    __init__.py
    items.py
    pipelines.py
    settings.py
    middlewares.py
    spiders/
        __init__.py
        ***.py
1.2 Setting Up the Project
Beforehand, the Scrapy and pymysql libraries, as well as the MySQL and MongoDB databases, were already installed in an Anaconda virtual environment.
1.2.1 Creating the Project
Generate the project skeleton directly with the Scrapy command line: cd into the target folder and run scrapy startproject sina
1.2.2 Creating the Spider
A Spider is a class you define yourself; Scrapy uses it to fetch pages and parse the scraped results. The class must inherit from scrapy.Spider, define the spider's name and start requests, and provide a method for processing the scraped results.
cd sina
scrapy genspider sina2 sina.com.cn
(The spider name must differ from the project name, so it is called sina2 here, matching the spider used below.)
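For reference, the file that genspider generates looks roughly like this (a sketch; the exact template varies slightly between Scrapy versions):

import scrapy


class Sina2Spider(scrapy.Spider):
    name = 'sina2'                      # unique spider name used by `scrapy crawl`
    allowed_domains = ['sina.com.cn']   # requests outside this domain are filtered out
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        # parse the downloaded response, yield items and follow-up requests
        pass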
1.2.3 Creating the Item
An Item is a container for the scraped data and is used much like a dictionary.
To create one, inherit from scrapy.Item and define fields of type scrapy.Field. Browsing the Sina News pages shows that the fields we can extract include title, times, content, and type.
import scrapy


class SinaItem(scrapy.Item):
    pass


# Early per-channel items; the spider below only uses DataItem
class GuoneiItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    times = scrapy.Field()


class ZongyiItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    times = scrapy.Field()


class DataItem(scrapy.Item):
    title = scrapy.Field()   # headline
    desc = scrapy.Field()    # article body
    times = scrapy.Field()   # publication time
    type = scrapy.Field()    # channel: news / zongyi / film
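As a quick illustration of the dictionary-like interface (a throwaway sketch, not part of the project code):

item = DataItem()
item['title'] = 'some headline'    # fields are set by key, just like a dict
item['type'] = 'news'
print(item['title'], dict(item))   # and read back (or converted) the same way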
2. Scraping the Data
Parsing the Response. The spider below renders the dynamically loaded feed pages with Selenium, extracts each entry's title, time, and link, normalizes the timestamps, and then follows each link so that parse_namedetail can fill in the article body:
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
from selenium import webdriver
import datetime
import re

from sina.items import DataItem


class Sina2Spider(scrapy.Spider):
    name = 'sina2'
    allowed_domains = ['sina.com.cn']

    def __init__(self, *args, **kwargs):
        super(Sina2Spider, self).__init__(*args, **kwargs)
        self.pages = 2   # number of feed pages to crawl per channel
        self.start_urls = [
            'https://news.sina.com.cn/china/',
            'https://ent.sina.com.cn/zongyi/',
            'https://ent.sina.com.cn/film/',
        ]
        # Headless Chrome without sandbox and without image loading, to speed up rendering
        self.option = webdriver.ChromeOptions()
        self.option.add_argument('--headless')
        self.option.add_argument('--no-sandbox')
        self.option.add_argument('--blink-settings=imagesEnabled=false')

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # The feed is rendered by JavaScript, so the list pages are loaded with Selenium
        driver = webdriver.Chrome(options=self.option)
        driver.set_page_load_timeout(30)
        driver.get(response.url)
        for page in range(self.pages):
            # Scroll down until the pagination bar has been rendered
            while not driver.find_element_by_xpath("//div[@class='feed-card-page']").text:
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            title = driver.find_elements_by_xpath("//h2[@class='undefined']/a[@target='_blank']")
            time = driver.find_elements_by_xpath(
                "//h2[@class='undefined']/../div[@class='feed-card-a feed-card-clearfix']/div[@class='feed-card-time']")
            for i in range(len(title)):
                item = DataItem()
                # Tag the item with its channel
                if response.url == "https://news.sina.com.cn/china/":
                    item['type'] = 'news'
                if response.url == "https://ent.sina.com.cn/zongyi/":
                    item['type'] = 'zongyi'
                if response.url == "https://ent.sina.com.cn/film/":
                    item['type'] = 'film'
                item['title'] = title[i].text
                item['desc'] = ''
                eachtime = time[i].text
                href = title[i].get_attribute('href')
                # Normalize relative timestamps such as "今天 12:34" and "5分钟前" into datetime
                today = datetime.datetime.now()
                eachtime = eachtime.replace('今天', str(today.month) + '月' + str(today.day) + '日')
                if '分钟前' in eachtime:
                    minute = int(eachtime.split('分钟前')[0])
                    t = datetime.datetime.now() - datetime.timedelta(minutes=minute)
                    t2 = datetime.datetime(year=t.year, month=t.month, day=t.day, hour=t.hour, minute=t.minute)
                else:
                    if '年' not in eachtime:
                        eachtime = str(today.year) + '年' + eachtime
                    t1 = re.split('[年月日:]', eachtime)
                    t2 = datetime.datetime(year=int(t1[0]), month=int(t1[1]), day=int(t1[2]),
                                           hour=int(t1[3]), minute=int(t1[4]))
                item['times'] = t2
                # Follow the article link; the item travels along in meta and is completed there
                yield Request(url=response.urljoin(href), meta={'name': item}, callback=self.parse_namedetail)
            try:
                # Click "next page" in the feed pagination
                driver.find_element_by_xpath(
                    "//div[@class='feed-card-page']/span[@class='pagebox_next']/a").click()
            except Exception:
                break
        driver.quit()

    def parse_namedetail(self, response):
        # Extract the article body paragraphs and join them into the item's desc field
        selector = Selector(response)
        desc = selector.xpath("//div[@class='article']/p/text()").extract()
        item = response.meta['name']
        desc = list(map(str.strip, desc))
        item['desc'] = ''.join(desc)
        yield item
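The spider is then run from the project root with scrapy crawl sina2. Alternatively, it can be started from a small Python script (a sketch, assuming the spider module lives at sina/spiders/sina2.py and the script sits next to scrapy.cfg):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from sina.spiders.sina2 import Sina2Spider

process = CrawlerProcess(get_project_settings())  # loads settings.py, including ITEM_PIPELINES
process.crawl(Sina2Spider)
process.start()  # blocks until the crawl finishes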
3. Saving to MySQL
The pipeline (pipelines.py) maps each scraped item onto a SQLAlchemy model and writes it into MySQL:
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, create_engine, Text, DateTime, Integer
from sqlalchemy.orm import sessionmaker
import pymysql  # MySQL driver used by the mysql+pymysql:// engine URL

Base = declarative_base()


class Data(Base):
    __tablename__ = 'data'

    id = Column(Integer(), primary_key=True)
    times = Column(DateTime)
    title = Column(Text())
    content = Column(Text())
    type = Column(Text())


class SinaPipeline:
    def __init__(self):
        # Create the table on startup if it does not exist yet
        self.engine = create_engine('mysql+pymysql://root:123456@localhost:3306/sina2', encoding='utf-8')
        Base.metadata.create_all(self.engine)
        self.DBSession = sessionmaker(bind=self.engine)

    def process_item(self, item, spider):
        # Map the scraped item onto a Data row and commit it
        new = Data()
        new.title = item['title']
        new.times = item['times']
        new.content = item['desc']
        new.type = item['type']
        session = self.DBSession()
        session.add(new)
        session.commit()
        session.close()
        return item
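For the pipeline to actually receive items, it has to be enabled in settings.py. A minimal excerpt might look like the following (the priority value 300 is the conventional default, not taken from the original project):

# settings.py
ITEM_PIPELINES = {
    'sina.pipelines.SinaPipeline': 300,   # lower numbers run earlier in the pipeline chain
}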
Viewed in MySQL Workbench, the table named data in the sina2 database now contains the saved rows, with all of the fields defined in the Item.