[Python Knowledge Base] Web scraping example: downloading images, part 1 (a certain niche)


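Workflow for scraping images

1. Send a request to the target site

2. Get the data (the page source)

3. Parse the data

4. Send a request to the detail page

5. Get the data (the image bytes)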

Pick the URL and send the request

Import the requests library, then send the request

import requests

headers = {
    'User-Agent': 'sadasdsafdgdsd'
}
url = 'https://www.xxxxx.com/104212.html'
response = requests.get(url=url,headers=headers)
print(response.text)
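The request above assumes the server answers promptly with a successful status. A minimal sketch of a slightly more defensive version follows; fetch_html is a hypothetical helper name and the 10-second timeout is an arbitrary assumption, neither comes from the original code.

```python
import requests


def fetch_html(url, headers, timeout=10):
    """Return the page source for url, raising on HTTP errors.

    fetch_html and the default timeout are illustrative assumptions.
    """
    # timeout stops the call from hanging forever on an unresponsive server
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

It would be called the same way as before, for example html = fetch_html(url, headers).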

Next, import the parsing library

import parsel

Parse the data

import requests
import parsel

headers = {
    'User-Agent': 'sadasdsafdgdsd'
}
url = 'https://www.jdlingyu.com/104212.html'
response = requests.get(url=url,headers=headers)
html = response.text
select = parsel.Selector(html)
print(select)

Use a CSS selector to locate the images:

img_all = select.css('.entry-content img::attr(src)').getall()

::attr(src) extracts the value of the src attribute from each matched element, which here is the image URL.

Printed as one list the result is hard to read, so loop over it with for to get each URL on its own

for img in img_all:
    print(img)

Now that we have the image URLs, send a request to each one

img_data = requests.get(img,headers=headers).content
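requests.get needs an absolute URL. The src values on this page appear to be absolute already, but if a page used relative ones (an assumption, not something the original shows), they could be resolved against the page URL first. A small sketch with made-up example.com addresses:

```python
from urllib.parse import urljoin

# Hypothetical page URL and src values, for illustration only
page_url = 'https://example.com/albums/104212.html'
srcs = [
    'https://cdn.example.com/a/1.jpg',  # already absolute: unchanged
    '/uploads/2.jpg',                   # relative to the site root
    '3.jpg',                            # relative to the page's folder
    '//cdn.example.com/4.jpg',          # protocol-relative
]

# urljoin resolves each src against the page it was found on
absolute = [urljoin(page_url, src) for src in srcs]

for img in absolute:
    print(img)
```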

Then take the last segment of the URL, after the final forward slash, to use as the image file name

img_name = img.split('/')[-1]
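Splitting on '/' works when the URL ends with the file name. If an image URL carried a query string (an assumption; the URLs in this article may not), the query would end up in the file name. A sketch of an alternative using only the standard library, where image_name is a hypothetical helper:

```python
import posixpath
from urllib.parse import urlsplit


def image_name(img_url):
    """Return the file name part of an image URL, ignoring ?query and #fragment."""
    # urlsplit separates the path from the query string and fragment
    path = urlsplit(img_url).path
    # URL paths always use forward slashes, so posixpath is the right tool
    return posixpath.basename(path)


url_with_query = 'https://example.com/uploads/2022/07/pic.jpg?size=large'
print(url_with_query.split('/')[-1])  # keeps the query string in the name
print(image_name(url_with_query))     # only the file name
```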

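Next, create a folder named img in the working directory to hold the images (open() will not create a missing folder), then write each image into it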

    with open(f'img/{img_name}',mode='wb') as f:
        f.write(img_data)

That works, but four images is too few, so let's grab more!

Start over, this time requesting the listing page

import requests
import parsel

headers = {
    'User-Agent': 'sadasdsafdgdsd'
}
url = 'https://www.xxxx.com/tuji'
response = requests.get(url=url,headers=headers)
html = response.text
select = parsel.Selector(html)
print(select)

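The listing page contains a whole batch of albums

Use CSS selectors again to locate them

Each album's URL and title sit in the same element, so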

title_list = select.css('.post-info h2 a::text').getall()
link_list = select.css('.post-info h2 a::attr(href)').getall()

Then

for title,link in zip(title_list,link_list):

walks through them as (title, link) pairs.

Next, import the os module to create one folder per album

    # makedirs also creates the parent img folder if it is missing
    os.makedirs(f'img/{title}', exist_ok=True)
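Because the album title becomes a folder name, a title containing characters such as / or ? would break the path on some systems. Whether titles on this site contain them is an assumption; the sketch below, with the hypothetical helper safe_folder_name, shows one way to guard against it.

```python
import re


def safe_folder_name(title, fallback='untitled'):
    """Replace characters that are not allowed in folder names on Windows or Linux."""
    # \ / : * ? " < > | are the characters Windows rejects; / is also rejected on Linux
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    # Windows also dislikes trailing spaces and dots
    cleaned = cleaned.strip().rstrip('.')
    return cleaned or fallback


print(safe_folder_name('a/b: c?'))
print(safe_folder_name('   '))
```

The folder would then be created from safe_folder_name(title) rather than from title directly.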

Then run it and you're done

The complete code is as follows:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
@Project: untitled1
@File: 实例1.py
@IDE: PyCharm
@Author: 冷巷(?_?)
@Date: 2022/7/15 14:28
"""

import requests
import parsel
import os

headers = {
    'User-Agent': 'sadasdsafdgdsd'
}
url = 'https://www.jdlingyu.com/tuji'
# Fetch the listing page and collect each album's title and detail-page link
response = requests.get(url=url, headers=headers).text
select = parsel.Selector(response)
title_list = select.css('.post-info h2 a::text').getall()
link_list = select.css('.post-info h2 a::attr(href)').getall()
for title, link in zip(title_list, link_list):
    # One folder per album; makedirs also creates the parent img folder
    os.makedirs(f'img/{title}', exist_ok=True)
    # Fetch the album's detail page and collect its image URLs
    response_1 = requests.get(link, headers=headers)
    html_data = response_1.text
    detail = parsel.Selector(html_data)
    img_all = detail.css('.entry-content img::attr(src)').getall()
    for img in img_all:
        img_data = requests.get(img, headers=headers).content
        img_name = img.split('/')[-1]
        # Save into the album's own folder
        with open(f'img/{title}/{img_name}', mode='wb') as f:
            f.write(img_data)

Added: 2022-07-20 18:47:30 Updated: 2022-07-20 18:48:58
