【Resource link】
Link: https://pan.baidu.com/s/15fuerPIQgmwV1MZEts8jEQ  Extraction code: 6psl
【Included files】
1. Background
The current project needs product-category data. After checking the Taobao and JD.com homepages, the data on JD's https://www.jd.com/allSort.aspx page turned out to be easier to scrape.
2. Implementation
2.1 Table DDL
The project uses a Greenplum database; if you are on a different database, adapt the DDL yourself 😄
CREATE TABLE "data_commodity_classification" ( 
"id" VARCHAR ( 32 ), 
"parent_id" VARCHAR ( 32 ), 
"level" int2, 
"name" VARCHAR ( 64 ), 
"merger_name" VARCHAR ( 255 ) 
);
COMMENT ON TABLE "data_commodity_classification" IS '3-level product category table';
COMMENT ON COLUMN "data_commodity_classification"."level" IS 'category level';
COMMENT ON COLUMN "data_commodity_classification"."name" IS 'category name';
COMMENT ON COLUMN "data_commodity_classification"."merger_name" IS 'concatenated category path';
2.2 Dependencies
jsoup is required. The project also pulls in mybatis-plus so that objects can be persisted with .saveBatch(), but that part is optional.
 
<dependency>
	<groupId>org.jsoup</groupId>
	<artifactId>jsoup</artifactId>
	<version>1.10.2</version>
</dependency>
<dependency>
	<groupId>com.baomidou</groupId>
	<artifactId>mybatis-plus-boot-starter</artifactId>
	<version>3.3.0</version>
</dependency>
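To sanity-check the jsoup dependency before writing the full crawler, you can parse markup from a string instead of a URL. The snippet below uses made-up markup that mimics the allSort page's class names (the sample HTML and category names are assumptions, not the live page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupSmoke {
    // Hypothetical markup shaped like the JD allSort page (not the real page).
    public static final String SAMPLE_HTML =
            "<div class=\"category-items clearfix\">"
          + "<div class=\"category-item m\"><div class=\"item-title\">家用电器</div></div>"
          + "<div class=\"category-item m\"><div class=\"item-title\">手机</div></div>"
          + "</div>";

    // Counts the top-level category blocks, using the same class lookup the crawler uses.
    public static int countTopLevel(String html) {
        Document doc = Jsoup.parse(html);
        Elements items = doc.getElementsByClass("category-item m");
        return items.size();
    }

    public static void main(String[] args) {
        System.out.println(countTopLevel(SAMPLE_HTML)); // prints 2
    }
}
```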
2.3 Entity class
Lombok is used so that objects can be assembled with a builder, which keeps the construction code short:

@Data
@EqualsAndHashCode(callSuper = false)
@Accessors(chain = true)
@ApiModel(value="DataCommodityClassification对象", description="")
@Builder
public class DataCommodityClassification implements Serializable {
    private static final long serialVersionUID = 1L;
    private String id;
    private String parentId;
    @ApiModelProperty(value = "类别等级")
    private Integer level;
    @ApiModelProperty(value = "商品分类")
    private String name;
    @ApiModelProperty(value = "商品类别组合名")
    private String mergerName;
}
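For readers who don't use Lombok: @Builder generates, in spirit, a nested builder like this hand-written sketch (the class name CategoryRow and the reduction to three fields are mine, not the project's):

```java
public class CategoryRow {
    private final String id;
    private final Integer level;
    private final String name;

    private CategoryRow(Builder b) {
        this.id = b.id;
        this.level = b.level;
        this.name = b.name;
    }

    public String getId() { return id; }
    public Integer getLevel() { return level; }
    public String getName() { return name; }

    public static Builder builder() { return new Builder(); }

    // What @Builder generates, roughly: one chainable setter per field.
    public static class Builder {
        private String id;
        private Integer level;
        private String name;

        public Builder id(String id) { this.id = id; return this; }
        public Builder level(Integer level) { this.level = level; return this; }
        public Builder name(String name) { this.name = name; return this; }
        public CategoryRow build() { return new CategoryRow(this); }
    }
}
```

Usage mirrors the crawler code: `CategoryRow.builder().id("0").level(0).name("家用电器").build()`.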
2.4 Crawler source code
(HTML structure of the page: screenshot omitted.)
Data flow: delete the historical data > crawl and assemble the latest data > save it.

	public boolean getCommodityClassificationData() throws IOException {
        // Step 1: clear out the previously crawled data.
        LambdaQueryWrapper<DataCommodityClassification> lambdaQuery = Wrappers.lambdaQuery(DataCommodityClassification.class);
        dataCommodityClassificationService.remove(lambdaQuery);
        
        // Counters for building hierarchical ids: a child id is its parent id
        // with the next counter value appended.
        AtomicInteger atomicIntegerOne = new AtomicInteger();
        AtomicInteger atomicIntegerTwo = new AtomicInteger();
        AtomicInteger atomicIntegerThree = new AtomicInteger();

        List<DataCommodityClassification> dataCommodityClassificationList = new ArrayList<>();

        // Step 2: fetch and parse the category page (timeout in milliseconds).
        String url = "https://www.jd.com/allSort.aspx";
        Document document = Jsoup.parse(new URL(url), 300000);
        
        // The whole category tree sits inside this container element.
        Element root = document.getElementsByClass("category-items clearfix").get(0);

        // Level 1: one "category-item m" block per top-level category.
        Elements levelOne = root.getElementsByClass("category-item m");
        levelOne.forEach(one -> {
            String levelOneData = one.getElementsByClass("item-title").get(0).child(2).text();
            String oneId = "" + atomicIntegerOne.getAndIncrement();
            dataCommodityClassificationList.add(DataCommodityClassification.builder().id(oneId).parentId(null).level(0).name(levelOneData).build());

            // Level 2: each <dl> holds one second-level category and its children.
            Elements levelTwo = one.getElementsByClass("items").get(0).getElementsByTag("dl");
            levelTwo.forEach(two -> {
                String levelTwoData = two.getElementsByTag("dt").text();
                String twoId = oneId + atomicIntegerTwo.getAndIncrement();
                String mergerNameTwo = levelOneData + "," + levelTwoData;
                dataCommodityClassificationList.add(DataCommodityClassification.builder().id(twoId).parentId(oneId).level(1).name(levelTwoData).mergerName(mergerNameTwo).build());

                // Level 3: the children of the <dd> are the leaf categories.
                Elements levelThree = two.getElementsByTag("dd").get(0).children();
                levelThree.forEach(three -> {
                    
                    String levelThreeData = three.text();
                    String threeId = twoId + atomicIntegerThree.getAndIncrement();
                    String mergerNameThree = mergerNameTwo + "," + levelThreeData;
                    dataCommodityClassificationList.add(DataCommodityClassification.builder().id(threeId).parentId(twoId).level(2).name(levelThreeData).mergerName(mergerNameThree).build());
                });
            });
        });

        // Step 3: batch-insert everything via mybatis-plus.
        boolean isSaveSuccess = dataCommodityClassificationService.saveBatch(dataCommodityClassificationList);
        return isSaveSuccess;
	}
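The id and merger_name scheme in the method above is plain string concatenation; this standalone sketch (helper names and sample categories are mine) reproduces it:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class IdScheme {
    // A child id is the parent id with the next counter value appended,
    // the same way the crawler builds twoId and threeId.
    public static String childId(String parentId, AtomicInteger counter) {
        return parentId + counter.getAndIncrement();
    }

    // merger_name is the parent path plus the category name, comma-separated.
    public static String mergerName(String parentPath, String name) {
        return parentPath + "," + name;
    }

    public static void main(String[] args) {
        AtomicInteger two = new AtomicInteger();
        String oneId = "0";
        System.out.println(childId(oneId, two));           // prints 00
        System.out.println(mergerName("家用电器", "电视")); // prints 家用电器,电视
    }
}
```

Note that the level-2 and level-3 counters are global rather than per-parent, and the ids carry no separator, so uniqueness is not structurally guaranteed once counters pass one digit; appending a separator (e.g. `parentId + "-" + counter`) would make the scheme unambiguous.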
3. Results
The parent_id and merger_name of the level-1 categories are left unset; whether that causes trouble depends on how the data is consumed downstream. The data is provided in both CSV and SQL format and was crawled on 2022-03-10; if you need up-to-date data, run the crawler code yourself.
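Because level-1 rows ship with null merger_name, any consumer that rebuilds category paths needs a null guard; a minimal sketch (the helper is hypothetical, not part of the project):

```java
public class CategoryPath {
    // Level-1 rows have no merger_name, so fall back to the bare category name.
    public static String displayPath(String mergerName, String name) {
        return mergerName != null ? mergerName : name;
    }

    public static void main(String[] args) {
        System.out.println(displayPath(null, "家用电器"));        // prints 家用电器
        System.out.println(displayPath("家用电器,电视", "电视")); // prints 家用电器,电视
    }
}
```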