Hi everyone, I'm Lao Kou.
Contents
1. Chain of Responsibility pattern
2. Decorator pattern
3. Observer pattern
4. webmagic
5. Microservice integration
6. Testing
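1. Chain of Responsibility pattern
Introduction
As the name suggests, the Chain of Responsibility pattern builds a chain of receiver objects for a request, which decouples the sender of the request from its receivers.
For example, when an employee requests leave, the length of the leave determines which level of management has to approve it. Passing the request up level by level is a chain structure.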
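The leave-request example can be sketched in a few lines of plain Java. This is only an illustration: the class names (Approver, TeamLead, Manager, Director) and the day limits are made up for the sketch and are not part of the project.

```java
import java.util.Optional;

/** One link in the approval chain; each approver signs off leave up to a day limit. */
abstract class Approver {
    private Approver next;

    /** Links the next approver and returns it so calls can be chained. */
    Approver setNext(Approver next) {
        this.next = next;
        return next;
    }

    /** Maximum number of leave days this approver may sign off (hypothetical limits). */
    protected abstract int maxDays();

    protected abstract String title();

    /** Returns the title of whoever approves, or empty if nobody in the chain can. */
    final Optional<String> approve(int days) {
        if (days <= maxDays()) {
            return Optional.of(title());
        }
        // Not our responsibility: pass the request along the chain.
        return next == null ? Optional.empty() : next.approve(days);
    }
}

class TeamLead extends Approver {
    @Override protected int maxDays() { return 1; }
    @Override protected String title() { return "team lead"; }
}

class Manager extends Approver {
    @Override protected int maxDays() { return 3; }
    @Override protected String title() { return "manager"; }
}

class Director extends Approver {
    @Override protected int maxDays() { return 10; }
    @Override protected String title() { return "director"; }
}

class LeaveChainDemo {
    /** Builds team lead -> manager -> director and returns the head of the chain. */
    static Approver buildChain() {
        Approver head = new TeamLead();
        head.setNext(new Manager()).setNext(new Director());
        return head;
    }

    public static void main(String[] args) {
        Approver chain = buildChain();
        System.out.println(chain.approve(1));  // Optional[team lead]
        System.out.println(chain.approve(7));  // Optional[director]
        System.out.println(chain.approve(30)); // Optional.empty
    }
}
```

The sender only ever talks to the head of the chain and does not need to know which approver ends up handling the request.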
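Implementation
Create an abstract class AbstractArticleHandler and two article handler classes that extend it. Each handler has its own logic: it checks the article type and, if the type matches, runs its own processing; otherwise it passes the request on to the next handler.
Step 1
Create the abstract article handler class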
import java.util.List;

public abstract class AbstractArticleHandler {
    /**
     * The next handler in the chain
     */
    private AbstractArticleHandler abstractArticleHandler;

    /**
     * Returns the article type this handler is responsible for
     * @return the article type
     */
    protected abstract ArticleTypeEnum getArticleTypeEnum();

    /**
     * Pulls the articles
     * @param uris array of links
     */
    protected abstract void articlePull(String[] uris);

    public final void handlerArticle(final List<String> links, final String articleType) {
        if (this.getArticleTypeEnum().getValue().equals(articleType)) {
            this.articlePull(links.toArray(new String[links.size()]));
        } else {
            // Not this handler's type: pass the request to the next handler, if any
            if (this.abstractArticleHandler != null) {
                this.abstractArticleHandler.handlerArticle(links, articleType);
            }
        }
    }

    public void setNext(AbstractArticleHandler abstractArticleHandler) {
        this.abstractArticleHandler = abstractArticleHandler;
    }
}
enum ArticleTypeEnum {
    CSDN("csdn"),
    BKY("bky");

    private final String value;

    ArticleTypeEnum(String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }
}
Step 2
Create the concrete article handler classes
public class CsdnArticleHandler extends AbstractArticleHandler {
    @Override
    protected ArticleTypeEnum getArticleTypeEnum() {
        return ArticleTypeEnum.CSDN;
    }

    @Autowired
    private PipelineObserver pipelineObserver;

    @Override
    protected void articlePull(String[] uris) {
        // CSDN crawling logic goes here (filled in under section 5)
    }
}

public class BkyArticleHandler extends AbstractArticleHandler {
    @Override
    protected ArticleTypeEnum getArticleTypeEnum() {
        return ArticleTypeEnum.BKY;
    }

    @Override
    protected void articlePull(String[] uris) {
        // cnblogs (bky) crawling logic goes here
    }
}
Step 3
Register the article handlers to form the chain
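import java.util.Collections;

public class ArticleService {
    public static void main(String[] args) {
        AbstractArticleHandler a1 = new CsdnArticleHandler();
        AbstractArticleHandler a2 = new BkyArticleHandler();
        a1.setNext(a2);
        // handlerArticle expects a List of links, so wrap the single link in a list
        a1.handlerArticle(Collections.singletonList("link address"), "csdn");
    }
}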
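With more than two handlers, calling setNext by hand gets repetitive. The following is a minimal sketch of linking a list of handlers in order. It uses simplified stand-in types (TypeHandler, NamedHandler, ChainBuilder are hypothetical names, not classes from the project), and its handle method returns a boolean so the caller can tell when no handler matched, which the original handlerArticle does not report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for AbstractArticleHandler. */
abstract class TypeHandler {
    private TypeHandler next;

    void setNext(TypeHandler next) {
        this.next = next;
    }

    protected abstract String type();

    protected abstract void pull(List<String> links, List<String> out);

    /** Returns true if some handler in the chain accepted the article type. */
    final boolean handle(List<String> links, String articleType, List<String> out) {
        if (type().equals(articleType)) {
            pull(links, out);
            return true;
        }
        return next != null && next.handle(links, articleType, out);
    }
}

/** Handler that just records which type handled how many links. */
class NamedHandler extends TypeHandler {
    private final String type;

    NamedHandler(String type) {
        this.type = type;
    }

    @Override
    protected String type() {
        return type;
    }

    @Override
    protected void pull(List<String> links, List<String> out) {
        out.add(type + ":" + links.size());
    }
}

class ChainBuilder {
    /** Links the handlers in list order and returns the head of the chain. */
    static TypeHandler link(List<TypeHandler> handlers) {
        if (handlers.isEmpty()) {
            throw new IllegalArgumentException("no handlers");
        }
        for (int i = 0; i < handlers.size() - 1; i++) {
            handlers.get(i).setNext(handlers.get(i + 1));
        }
        return handlers.get(0);
    }

    public static void main(String[] args) {
        TypeHandler head = link(Arrays.<TypeHandler>asList(
                new NamedHandler("csdn"), new NamedHandler("bky")));
        List<String> out = new ArrayList<>();
        System.out.println(head.handle(Arrays.asList("u1", "u2"), "bky", out)); // true
        System.out.println(out);                                                // [bky:2]
    }
}
```

In a Spring application the list of handlers could plausibly be injected rather than built by hand, but that wiring is outside this sketch.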
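2. Decorator pattern
Introduction
The Decorator pattern adds new functionality to an existing object without changing its structure.
For example, a phone works the same with or without a screen protector; applying one does not affect how the phone is used.
Implementation
Create the ProcessStrategy interface and concrete classes that implement it. Then create an abstract decorator class, ProcessHandler, that implements ProcessStrategy and holds a ProcessStrategy object as an instance variable. IteratorProcess is a concrete class that extends ProcessHandler, and the ArticleHandler class uses it to decorate a ProcessStrategy.
Step 1
Create the interface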
import us.codecraft.webmagic.Page;

/**
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 3:44 PM
 */
public interface ProcessStrategy {
    /**
     * The method that performs the actual crawling logic
     * @param page the downloaded page
     */
    void process(Page page);
}
Step 2
Create the classes that implement the interface
/**
 * Processing strategy for cnblogs (bky) articles
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 4:05 PM
 */
public class BkyArticleProcess implements ProcessStrategy {
    @Override
    public void process(Page page) {
    }
}

/**
 * Processing strategy for CSDN articles
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 4:05 PM
 */
public class CsdnArticleProcess implements ProcessStrategy {
    @Override
    public void process(Page page) {
    }
}
Step 3
Create the abstract decorator class that implements the ProcessStrategy interface
/**
 * Decorator base class; a pass-through implementation that delegates to the wrapped strategy
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 4:01 PM
 */
public abstract class ProcessHandler implements ProcessStrategy {
    protected volatile ProcessStrategy processStrategy;

    public ProcessHandler(ProcessStrategy processStrategy) {
        this.processStrategy = processStrategy;
    }

    @Override
    public void process(Page page) {
        processStrategy.process(page);
    }
}
Step 4
Create a concrete decorator class that extends ProcessHandler
/**
 * Decorator used to decorate CSDN article processing
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 4:15 PM
 */
public class IteratorProcess extends ProcessHandler {
    public IteratorProcess(ProcessStrategy processStrategy) {
        super(processStrategy);
    }
}
Step 5
Use IteratorProcess to decorate a ProcessStrategy object
public class ArticleHandler {
    public static void main(String[] args) {
        // Decorate the cnblogs (bky) strategy
        IteratorProcess process = new IteratorProcess(new BkyArticleProcess());
    }
}
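In the listing above, IteratorProcess only forwards to the wrapped strategy. The sketch below shows how a concrete decorator might add behavior before and after the delegate call. To stay runnable without webmagic it uses stand-in types: Step plays the role of ProcessStrategy, and a plain list of strings stands in for Page. All names here (Step, ExtractStep, StepDecorator, TracingStep) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for ProcessStrategy; a List<String> replaces webmagic's Page. */
interface Step {
    void process(List<String> log);
}

/** The undecorated component. */
class ExtractStep implements Step {
    @Override
    public void process(List<String> log) {
        log.add("extract");
    }
}

/** Abstract decorator: holds the wrapped Step and forwards to it by default. */
abstract class StepDecorator implements Step {
    protected final Step delegate;

    StepDecorator(Step delegate) {
        this.delegate = delegate;
    }

    @Override
    public void process(List<String> log) {
        delegate.process(log);
    }
}

/** Concrete decorator: adds work around the delegate without modifying it. */
class TracingStep extends StepDecorator {
    TracingStep(Step delegate) {
        super(delegate);
    }

    @Override
    public void process(List<String> log) {
        log.add("before");
        super.process(log);
        log.add("after");
    }
}

class DecoratorDemo {
    /** Runs a step against a fresh log and returns what was recorded. */
    static List<String> run(Step step) {
        List<String> log = new ArrayList<>();
        step.process(log);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(new ExtractStep()));                  // [extract]
        System.out.println(run(new TracingStep(new ExtractStep()))); // [before, extract, after]
    }
}
```

Because the decorator implements the same interface as the component, callers can use a decorated and an undecorated step interchangeably.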
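3. Observer pattern
Introduction
The Observer pattern is used when objects have a one-to-many relationship.
For example, when an object's data changes, the objects that depend on it are notified automatically.
Note: the JDK ships with supporting classes for the Observer pattern (java.util.Observable and java.util.Observer).
Implementation (modeled on the JDK's built-in observer classes, with extensions)
This implementation uses three kinds of types: Observable, Observer, and the concrete classes that implement them. An Observable has methods to attach observers, notify them, and detach them. We create an Observable interface, an Observer interface, and a concrete class for each.
Step 1
Create the Observable interface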
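/**
 * Modeled on java.util.Observable.
 * Declared as an interface so that concrete classes supply the logic, which is pretty neat.
 * @author Kou Shenhai
 */
public interface Observable {
    /**
     * Adds an observer
     * @param o the observer to add
     */
    void addObserver(Observer o);

    /**
     * Notifies the observers
     * @param arg the data passed to each observer
     */
    void notifyObservers(Object arg);

    /**
     * Removes an observer
     * @param o the observer to remove
     */
    void deleteObserver(Observer o);
}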
Step 2
Implement the Observable interface
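import java.util.Vector;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

// Also implements webmagic's Pipeline, because process(ResultItems, Task) overrides Pipeline#process
public class ArticlePipeline implements Pipeline, Observable {
    private Vector<Observer> obs;

    public ArticlePipeline() {
        obs = new Vector<>(1);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        // Forward every extracted result to the registered observers
        notifyObservers(resultItems.getAll());
    }

    @Override
    public synchronized void addObserver(Observer o) {
        if (o == null) {
            throw new NullPointerException();
        }
        if (!obs.contains(o)) {
            obs.addElement(o);
        }
    }

    @Override
    public synchronized void notifyObservers(Object arg) {
        // Take a snapshot so observers can be added or removed during notification
        Object[] arrLocal;
        synchronized (this) {
            arrLocal = obs.toArray();
        }
        for (int i = arrLocal.length - 1; i >= 0; i--) {
            ((Observer) arrLocal[i]).update(this, arg);
        }
    }

    @Override
    public synchronized void deleteObserver(Observer o) {
        obs.removeElement(o);
    }
}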
Step 3
Create the Observer interface
/**
 * Modeled on {@link java.util.Observer}
 * @author Kou Shenhai
 */
public interface Observer {
    /**
     * Called when the observed data changes
     * @param o the observable that sent the notification
     * @param data the changed data
     */
    void update(Observable o, Object data);
}
Step 4
Create the concrete observer class
public class PipelineObserver implements Observer {
    @Override
    public void update(Observable o, Object data) {
    }
}
Step 5
Use the Observable and the concrete observer
public class ArticleHandler {
    public static void main(String[] args) {
        Observer o = new PipelineObserver();
        Observable ob = new ArticlePipeline();
        ob.addObserver(o);
    }
}
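The main method above only registers the observer. To see a notification actually arrive, here is a minimal self-contained sketch. It is not the project's code: Listener, Publisher, and CollectingListener are hypothetical stand-ins for Observer, ArticlePipeline, and PipelineObserver, and it swaps the synchronized Vector for a CopyOnWriteArrayList, which is one common alternative when notifications are far more frequent than registrations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Stand-in for the Observer interface. */
interface Listener {
    void update(Object data);
}

/** Stand-in for the Observable implementation. */
class Publisher {
    // Iteration works on a snapshot, so no explicit locking is needed in notifyObservers.
    private final CopyOnWriteArrayList<Listener> listeners = new CopyOnWriteArrayList<>();

    void addObserver(Listener l) {
        if (l == null) {
            throw new NullPointerException("observer must not be null");
        }
        listeners.addIfAbsent(l); // ignores duplicates, like the contains() check in the original
    }

    void deleteObserver(Listener l) {
        listeners.remove(l);
    }

    void notifyObservers(Object data) {
        for (Listener l : listeners) {
            l.update(data);
        }
    }
}

/** Observer that simply remembers everything it was told. */
class CollectingListener implements Listener {
    final List<Object> received = new ArrayList<>();

    @Override
    public void update(Object data) {
        received.add(data);
    }
}

class ObserverDemo {
    public static void main(String[] args) {
        Publisher publisher = new Publisher();
        CollectingListener listener = new CollectingListener();
        publisher.addObserver(listener);
        publisher.notifyObservers("article-1");
        publisher.deleteObserver(listener);
        publisher.notifyObservers("article-2"); // no longer delivered
        System.out.println(listener.received);  // [article-1]
    }
}
```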
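4. webmagic
Official documentation
Introduction
webmagic is modeled on Scrapy, a leading crawler framework, and is built on mature Java tools such as HttpClient and Jsoup.
Architecture
WebMagic is organized into four components: Downloader (downloading), PageProcessor (processing), Scheduler (URL management), and Pipeline (persistence). Spider acts as the container that wires them together so they can interact and run as a pipeline. The overall architecture diagram is as follows.
Components
Downloader
- Downloads pages from the network for later processing. webmagic uses HttpClient by default.
PageProcessor
- Parses pages, extracts useful information, and discovers new links. Jsoup is used to parse HTML.
Scheduler
- Manages the URLs waiting to be crawled and handles de-duplication. By default webmagic manages URLs with an in-memory queue from the JDK and de-duplicates with a set; Redis-based distributed management is also supported.
Pipeline
- Handles the extracted results, including computation and persistence to files, databases, and so on.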
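To make the Scheduler's job concrete, here is a toy model of "an in-memory queue plus a set for de-duplication" in plain Java. It is only a sketch of the idea described in the component list: DedupScheduler is a hypothetical class, not one of webmagic's scheduler classes, and it ignores details such as per-task queues.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy URL scheduler: a JDK in-memory queue plus a set that remembers every URL seen. */
class DedupScheduler {
    private final Queue<String> queue = new LinkedBlockingQueue<>();
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /** Queues the URL unless it was seen before; returns true if it was queued. */
    boolean push(String url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    /** Returns the next URL to crawl, or null when the queue is empty. */
    String poll() {
        return queue.poll();
    }

    public static void main(String[] args) {
        DedupScheduler scheduler = new DedupScheduler();
        System.out.println(scheduler.push("https://example.com/a")); // true
        System.out.println(scheduler.push("https://example.com/a")); // false, duplicate
        System.out.println(scheduler.poll());                        // https://example.com/a
        System.out.println(scheduler.poll());                        // null
    }
}
```

Note that the set keeps growing for the lifetime of the crawl, which is the price of never visiting the same URL twice.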
XSoup
5. Microservice integration
Database table design
-- ----------------------------
-- Table structure for boot_link
-- ----------------------------
DROP TABLE IF EXISTS `boot_link`;
CREATE TABLE `boot_link` (
  `id` bigint(20) NOT NULL COMMENT 'id',
  `uri` varchar(400) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'article link',
  `type` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'site type',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
INSERT INTO `boot_link` VALUES ('11', 'https://www.cnblogs.com/koushenhai/p/12595630.html', 'bky');
INSERT INTO `boot_link` VALUES ('12', 'https://kcloud.blog.csdn.net/article/details/118633942', 'csdn');
INSERT INTO `boot_link` VALUES ('20', 'https://kcloud.blog.csdn.net/article/details/121491124', 'csdn');
INSERT INTO `boot_link` VALUES ('33', 'https://kcloud.blog.csdn.net/article/details/82109656', 'csdn');
INSERT INTO `boot_link` VALUES ('41', 'https://kcloud.blog.csdn.net/article/details/117769662', 'csdn');
INSERT INTO `boot_link` VALUES ('49', 'https://kcloud.blog.csdn.net/article/details/118660073', 'csdn');
INSERT INTO `boot_link` VALUES ('57', 'https://kcloud.blog.csdn.net/article/details/119720174', 'csdn');
INSERT INTO `boot_link` VALUES ('65', 'https://kcloud.blog.csdn.net/article/details/123179670', 'csdn');
INSERT INTO `boot_link` VALUES ('66', 'https://kcloud.blog.csdn.net/article/details/117635759', 'csdn');
INSERT INTO `boot_link` VALUES ('74', 'https://kcloud.blog.csdn.net/article/details/117771583', 'csdn');
INSERT INTO `boot_link` VALUES ('78', 'https://kcloud.blog.csdn.net/article/details/123039609', 'csdn');
INSERT INTO `boot_link` VALUES ('79', 'https://kcloud.blog.csdn.net/article/details/82588914', 'csdn');
INSERT INTO `boot_link` VALUES ('96', 'https://kcloud.blog.csdn.net/article/details/108021143', 'csdn');
INSERT INTO `boot_link` VALUES ('118', 'https://kcloud.blog.csdn.net/article/details/121305244', 'csdn');
INSERT INTO `boot_link` VALUES ('128', 'https://kcloud.blog.csdn.net/article/details/82110125', 'csdn');
INSERT INTO `boot_link` VALUES ('129', 'https://kcloud.blog.csdn.net/article/details/123630814', 'csdn');
INSERT INTO `boot_link` VALUES ('130', 'https://kcloud.blog.csdn.net/article/details/116420798', 'csdn');
INSERT INTO `boot_link` VALUES ('131', 'https://kcloud.blog.csdn.net/article/details/123484520', 'csdn');
INSERT INTO `boot_link` VALUES ('132', 'https://kcloud.blog.csdn.net/article/details/123013305', 'csdn');
INSERT INTO `boot_link` VALUES ('133', 'https://kcloud.blog.csdn.net/article/details/123390833', 'csdn');
INSERT INTO `boot_link` VALUES ('134', 'https://kcloud.blog.csdn.net/article/details/123311487', 'csdn');
INSERT INTO `boot_link` VALUES ('135', 'https://kcloud.blog.csdn.net/article/details/123292276', 'csdn');
INSERT INTO `boot_link` VALUES ('136', 'https://kcloud.blog.csdn.net/article/details/123123229', 'csdn');
INSERT INTO `boot_link` VALUES ('137', 'https://kcloud.blog.csdn.net/article/details/116704223', 'csdn');
INSERT INTO `boot_link` VALUES ('145', 'https://kcloud.blog.csdn.net/article/details/123739314', 'csdn');
INSERT INTO `boot_link` VALUES ('146', 'https://kcloud.blog.csdn.net/article/details/123688809', 'csdn');
INSERT INTO `boot_link` VALUES ('147', 'https://kcloud.blog.csdn.net/article/details/123673741', 'csdn');
INSERT INTO `boot_link` VALUES ('148', 'https://kcloud.blog.csdn.net/article/details/123628721', 'csdn');
INSERT INTO `boot_link` VALUES ('149', 'https://kcloud.blog.csdn.net/article/details/123599384', 'csdn');
INSERT INTO `boot_link` VALUES ('150', 'https://kcloud.blog.csdn.net/article/details/122181814', 'csdn');
INSERT INTO `boot_link` VALUES ('151', 'https://kcloud.blog.csdn.net/article/details/121557788', 'csdn');
INSERT INTO `boot_link` VALUES ('159', 'https://kcloud.blog.csdn.net/article/details/116449621', 'csdn');
INSERT INTO `boot_link` VALUES ('160', 'https://kcloud.blog.csdn.net/article/details/83623118', 'csdn');
INSERT INTO `boot_link` VALUES ('161', 'https://kcloud.blog.csdn.net/article/details/84777724', 'csdn');
INSERT INTO `boot_link` VALUES ('162', 'https://kcloud.blog.csdn.net/article/details/105587614', 'csdn');
INSERT INTO `boot_link` VALUES ('163', 'https://kcloud.blog.csdn.net/article/details/83515122', 'csdn');
INSERT INTO `boot_link` VALUES ('164', 'https://kcloud.blog.csdn.net/article/details/83451040', 'csdn');
INSERT INTO `boot_link` VALUES ('165', 'https://kcloud.blog.csdn.net/article/details/117252826', 'csdn');
INSERT INTO `boot_link` VALUES ('166', 'https://kcloud.blog.csdn.net/article/details/84826176', 'csdn');
INSERT INTO `boot_link` VALUES ('167', 'https://kcloud.blog.csdn.net/article/details/120031600', 'csdn');
INSERT INTO `boot_link` VALUES ('168', 'https://kcloud.blog.csdn.net/article/details/119685953', 'csdn');
INSERT INTO `boot_link` VALUES ('169', 'https://kcloud.blog.csdn.net/article/details/120147123', 'csdn');
INSERT INTO `boot_link` VALUES ('170', 'https://kcloud.blog.csdn.net/article/details/120245035', 'csdn');
INSERT INTO `boot_link` VALUES ('171', 'https://kcloud.blog.csdn.net/article/details/120190383', 'csdn');
INSERT INTO `boot_link` VALUES ('179', 'https://kcloud.blog.csdn.net/article/details/94590629', 'csdn');
INSERT INTO `boot_link` VALUES ('187', 'https://kcloud.blog.csdn.net/article/details/116949872', 'csdn');
INSERT INTO `boot_link` VALUES ('192', 'https://kcloud.blog.csdn.net/article/details/123789292', 'csdn');
INSERT INTO `boot_link` VALUES ('193', 'https://kcloud.blog.csdn.net/article/details/123780832', 'csdn');
INSERT INTO `boot_link` VALUES ('194', 'https://kcloud.blog.csdn.net/article/details/123771040', 'csdn');
INSERT INTO `boot_link` VALUES ('195', 'https://kcloud.blog.csdn.net/article/details/122522290', 'csdn');
INSERT INTO `boot_link` VALUES ('196', 'https://kcloud.blog.csdn.net/article/details/123833614', 'csdn');
DROP TABLE IF EXISTS `boot_article`;
CREATE TABLE `boot_article` (
  `id` bigint(20) NOT NULL COMMENT 'id',
  `title` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'article title',
  `content` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL COMMENT 'article content',
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;
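The `type` column of boot_link matches the values of ArticleTypeEnum ("csdn", "bky"), so one plausible way to connect the table to the handler chain is to group the stored links by type and hand each group to the chain. The sketch below shows only the grouping step, in plain Java. LinkRow and LinkGrouping are hypothetical names, and how the rows are loaded from the database is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical in-memory view of one boot_link row. */
class LinkRow {
    final long id;
    final String uri;
    final String type;

    LinkRow(long id, String uri, String type) {
        this.id = id;
        this.uri = uri;
        this.type = type;
    }
}

class LinkGrouping {
    /** Groups link URIs by site type, keeping the order in which types first appear. */
    static Map<String, List<String>> urisByType(List<LinkRow> rows) {
        return rows.stream()
                // Both columns are nullable in the table definition; skip incomplete rows.
                .filter(r -> r.type != null && r.uri != null)
                .collect(Collectors.groupingBy(
                        r -> r.type,
                        LinkedHashMap::new,
                        Collectors.mapping(r -> r.uri, Collectors.toList())));
    }

    public static void main(String[] args) {
        // Sample rows copied from the INSERT statements above.
        List<LinkRow> rows = Arrays.asList(
                new LinkRow(11, "https://www.cnblogs.com/koushenhai/p/12595630.html", "bky"),
                new LinkRow(12, "https://kcloud.blog.csdn.net/article/details/118633942", "csdn"),
                new LinkRow(20, "https://kcloud.blog.csdn.net/article/details/121491124", "csdn"));
        urisByType(rows).forEach((type, uris) ->
                System.out.println(type + " -> " + uris.size() + " link(s)"));
    }
}
```

Each map entry then has the shape that handlerArticle(links, articleType) expects: a list of links plus the type string.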
Microservice
Add the dependencies
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.esotericsoftware</groupId>
    <artifactId>reflectasm</artifactId>
    <version>1.11.7</version>
</dependency>
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-aspects</artifactId>
</dependency>
Code structure
Core code (using CSDN collection as the example)
Create the CsdnArticleSpider class
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.annotation.Configuration;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * Default crawler implementation
 * @author Kou Shenhai
 * @version 1.0
 * @date 2020/11/15 0015 4:40 PM
 */
@Configuration
@Slf4j
public class CsdnArticleSpider implements PageProcessor {
    private ProcessStrategy processStrategy;
    private static final int SLEEP_TIME = 3000;
    private static final int TIMEOUT = 3000;
    private static final int RETRY_TIMES = 10;
    private static final int RETRY_SLEEP_TIME = 3000;
    private static final String CHARSET = "utf-8";
    private static final String DOMAIN = "csdn.net";
    private static final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36";

    /**
     * Settings for the site being crawled, including charset, retry count, and crawl interval
     */
    private Site site = Site
            .me()
            .setRetryTimes(RETRY_TIMES)
            .setRetrySleepTime(RETRY_SLEEP_TIME)
            .setDomain(DOMAIN)
            .setSleepTime(SLEEP_TIME)
            .setTimeOut(TIMEOUT)
            .setCharset(CHARSET)
            .setUserAgent(USER_AGENT)
            .addHeader("Cookie", "");

    public void setProcessStrategy(ProcessStrategy processStrategy) {
        this.processStrategy = processStrategy;
    }

    /**
     * process is the core hook for custom crawler logic; the extraction logic goes here
     * @param page the downloaded page
     */
    @Override
    public void process(Page page) {
        if (processStrategy == null) {
            throw new NullPointerException();
        }
        // Start hook
        preProcess(page);
        // Strategy pattern: delegate extraction to the configured strategy
        processStrategy.process(page);
        // End hook
        afterProcess(page);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public Spider getSpider() {
        return Spider.create(this);
    }

    /**
     * The two methods below are extension points around process, for example to add URL iteration.
     * The main logic lives in processStrategy.
     */
    protected void preProcess(Page page) {
        log.info("Crawl started...");
    }

    protected void afterProcess(Page page) {
        log.info("Crawl finished...");
    }
}
Create CsdnArticleHandler
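import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.downloader.HttpClientDownloader;

/**
 * @author Kou Shenhai
 */
@Component
public class CsdnArticleHandler extends AbstractArticleHandler {
    @Autowired
    private CsdnArticleSpider csdnArticleSpider;
    @Autowired
    private ArticlePipeline articlePipeline;

    @Override
    protected ArticleTypeEnum getArticleTypeEnum() {
        return ArticleTypeEnum.CSDN;
    }

    @Autowired
    private PipelineObserver pipelineObserver;

    @Override
    @Async
    protected void articlePull(String[] uris) {
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        // Register the observer so it receives every result the pipeline processes
        articlePipeline.addObserver(pipelineObserver);
        // Wrap the CSDN strategy in the decorator before handing it to the spider
        csdnArticleSpider.setProcessStrategy(new IteratorProcess(new CsdnArticleProcess()));
        csdnArticleSpider.getSpider().addUrl(uris)
                .setDownloader(httpClientDownloader)
                // Number of crawler threads
                .thread(2 * Runtime.getRuntime().availableProcessors())
                .addPipeline(articlePipeline)
                // Start the crawler
                .start();
    }
}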
Create ArticlePipeline
public class ArticlePipeline implements CallablePipeline {
    private Vector<Observer> obs;

    public ArticlePipeline() {
        obs = new Vector<>(1);
    }

    @Override
    public void process(ResultItems resultItems, Task task) {
        notifyObservers(resultItems.getAll());
    }

    @Override
    public synchronized void addObserver(Observer o) {
        if (o == null) {
            throw new NullPointerException();
        }
        if (!obs.contains(o)) {
            obs.addElement(o);
        }
    }

    @Override
    public synchronized void notifyObservers(Object arg) {
        Object[] arrLocal;
        synchronized (this) {
            arrLocal = obs.toArray();
        }
        for (int i = arrLocal.length - 1; i >= 0; i--) {
            ((Observer) arrLocal[i]).update(this, arg);
        }
    }

    @Override
    public synchronized void deleteObserver(Observer o) {
        obs.removeElement(o);
    }
}
Create CsdnArticleProcess
/**
 * Processing strategy for CSDN articles
 * @author Kou Shenhai
 * @version 1.0
 * @date 2021/4/24 0024 4:05 PM
 */
public class CsdnArticleProcess implements ProcessStrategy {
    @Override
    public void process(Page page) {
        // Extract the article body and title with XPath and store them as result fields
        String content = page.getHtml().xpath("//*[@id='mainBox']/main/div[1]/article").get();
        String title = page.getHtml().xpath("//*[@id='articleContentId']/text()").get();
        page.putField("content", content);
        page.putField("title", title);
    }
}
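The two XPath expressions can be tried offline with the JDK's own XPath API, without webmagic or a network connection. The sketch below runs them against a small hand-written, well-formed snippet. The snippet is invented for illustration and is not a copy of CSDN's real page structure, and XPathSketch is a hypothetical class name. One difference to keep in mind: the JDK call below returns the text content of the matched node, so it is not a drop-in equivalent of the value webmagic stores in the content field.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {
    /** Invented sample markup that merely has the ids and nesting the expressions look for. */
    static final String SAMPLE =
            "<html><body>"
            + "<div id='mainBox'><main><div>"
            + "<h1 id='articleContentId'>Hello webmagic</h1>"
            + "<article>Body text</article>"
            + "</div></main></div>"
            + "</body></html>";

    /** Evaluates an XPath expression against well-formed XML and returns the string result. */
    static String eval(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Parsing and XPath errors are checked exceptions; wrap them for brevity.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(eval(SAMPLE, "//*[@id='articleContentId']/text()"));     // Hello webmagic
        System.out.println(eval(SAMPLE, "//*[@id='mainBox']/main/div[1]/article")); // Body text
    }
}
```

Real pages are rarely well-formed XML, which is one reason webmagic parses HTML with Jsoup instead of an XML parser.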
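6. Testing
Source code on Gitee (click here)
Data collection run overview
Reference tutorial: Runoob - Design Patterns
Reference tutorial: webmagic documentation
Reference project: Nowcoder project - web crawler
This project is for technical study and research only. Any commercial use and any behavior that harms the interests of the websites involved are prohibited.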