[Python Knowledge Base] Scraping Douban Movies


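Learning web scraping

Site to scrape: the Douban movie ranking (https://movie.douban.com/top250)
Tool: VS Code
1. First install requests and lxml by running the following in a terminal: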

pip install requests
pip install lxml

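2. View the source of the Douban page and check whether the movie titles can be read directly from it: press Ctrl+F to open the search box and search for "霸王别姬" (Farewell My Concubine).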

The title is found, which confirms that the movie titles can be read directly from the page source.
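The same check can be done in code instead of with Ctrl+F. This is only a minimal sketch: `title_in_source` is a hypothetical helper name, and the HTML fragment is made up to stand in for the real page source.

```python
def title_in_source(html: str, title: str) -> bool:
    """Return True if the title appears verbatim in the raw page source."""
    return title in html


# Made-up fragment standing in for the downloaded page (e.g. resp.text)
sample_html = '<img alt="霸王别姬" src="poster.jpg">'

found = title_in_source(sample_html, "霸王别姬")
missing = title_in_source(sample_html, "不存在的电影")
print(found, missing)
```

If the check fails on the real page source, the data is probably filled in by JavaScript after loading, and fetching the HTML alone would not be enough.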

3. Write the code

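import requests
from lxml import etree

url = 'https://movie.douban.com/top250'

# Send a browser-like User-Agent; without it the site may reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.47'
}
resp = requests.get(url, headers=headers)

# print(resp.text)

# Create an etree object from the page content.
# Find the outermost XPath first, then loop over it to get each movie's details.
# Path: <ol> <li> <div class='item'> <div class='pic'> <a> <img alt>
tree = etree.HTML(resp.text)
# There are multiple <li> elements, so this returns a list
the_first = tree.xpath('/html/body/div[3]/div[1]/div/div[1]/ol/li')
for i in the_first:
    # ./ means relative to the current node, i.e. each <li> in the_first
    # xpath() returns a list, so [0] takes the value itself;
    # .strip() removes surrounding whitespace
    title = i.xpath('./div/div[1]/a/img/@alt')[0]
    score = i.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0].strip()
    # Guard in case an entry has no one-line quote
    quote = i.xpath('./div/div[2]/div[2]/p[2]/span/text()')
    comment = quote[0] if quote else ''
    print(title)
    print(score)
    print(comment)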

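The XPath paths used here can be copied directly from the browser's developer tools and then adjusted as needed.
To get content with XPath: append text() to the path to get the text inside a tag, or append @xx to get the value of the attribute xx.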
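The difference between text() and @attribute can be tried on a small fragment without touching the network. The HTML below is made up for illustration and is much simpler than the real Douban markup, so the paths differ from those in the scraper.

```python
from lxml import etree

# Made-up HTML fragment, loosely shaped like one list entry
html = """
<ol>
  <li><div class="item">
    <a href="https://example.com/m/1"><img alt="Movie A" src="a.jpg"></a>
    <span class="rating_num"> 9.6 </span>
  </div></li>
</ol>
"""

tree = etree.HTML(html)
item = tree.xpath('//ol/li')[0]          # xpath() always returns a list

title = item.xpath('./div/a/img/@alt')[0]           # attribute value via @alt
link = item.xpath('./div/a/@href')[0]               # attribute value via @href
score = item.xpath('./div/span/text()')[0].strip()  # tag text via text()

print(title, score, link)
```

Each call returns a list even when only one node matches, which is why the scraper indexes with [0].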

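4. Once the data has been retrieved, add it to the database (this uses pymysql, which can be installed with pip install pymysql)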

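import pymysql

# Change the user name and password to match your own database
def insert(value):
    db = pymysql.connect(host='localhost', user='root', password='123456', database='python')
    cursor = db.cursor()
    # The %s placeholders let the driver escape the values
    sql = "INSERT INTO moviemessage(moviename,score,comment) VALUES (%s, %s, %s)"
    try:
        cursor.execute(sql, value)
        db.commit()
        print('Insert succeeded')
    except pymysql.MySQLError as e:
        # Undo the failed statement and report why it failed
        db.rollback()
        print('Insert failed:', e)
    finally:
        db.close()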


The moviemessage table is created with:

CREATE TABLE `moviemessage` (
  `id` int NOT NULL AUTO_INCREMENT COMMENT 'id',
  `moviename` varchar(255) NOT NULL COMMENT 'movie title',
  `score` double NOT NULL COMMENT 'rating',
  `comment` varchar(255) NOT NULL COMMENT 'comment',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
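Opening a new connection for every row works, but one connection and one batch insert is usually enough. The sketch below shows that idea with the standard library's sqlite3 and an in-memory database, used here only as a stand-in so the example runs without a MySQL server; it is not the setup used in this article. Note that sqlite3 uses ? placeholders where pymysql uses %s. The helper name `insert_many`, the simplified table definition, and the rows are all made up for illustration.

```python
import sqlite3


def insert_many(conn, rows):
    """Insert all (moviename, score, comment) rows in a single transaction."""
    sql = "INSERT INTO moviemessage(moviename, score, comment) VALUES (?, ?, ?)"
    try:
        conn.executemany(sql, rows)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


# In-memory stand-in for the MySQL table (simplified column types)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moviemessage ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "moviename TEXT NOT NULL, score REAL NOT NULL, comment TEXT NOT NULL)"
)

# Made-up sample rows
rows = [("Movie A", 9.6, "sample quote"), ("Movie B", 9.1, "another quote")]
ok = insert_many(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM moviemessage").fetchone()[0]
print(ok, count)
conn.close()
```

With pymysql the same pattern would use cursor.executemany(sql, rows) with %s placeholders on a connection that is opened once and closed after the loop.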

Complete code:

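# Check the page source to see whether the wanted information is present directly in the HTML

# Goal: get each movie's title, rating, and one-line quote

import requests
from lxml import etree
import pymysql

# Change the user name and password to match your own database
def insert(value):
    db = pymysql.connect(host='localhost', user='root', password='123456', database='python')
    cursor = db.cursor()
    sql = "INSERT INTO moviemessage(moviename,score,comment) VALUES (%s, %s, %s)"
    try:
        cursor.execute(sql, value)
        db.commit()
        print('Insert succeeded')
    except pymysql.MySQLError as e:
        # Undo the failed statement and report why it failed
        db.rollback()
        print('Insert failed:', e)
    finally:
        db.close()

url = 'https://movie.douban.com/top250'

# Send a browser-like User-Agent; without it the site may reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.47'
}
resp = requests.get(url, headers=headers)

# print(resp.text)

# Create an etree object from the page content.
# Find the outermost XPath first, then loop over it to get each movie's details.
# Path: <ol> <li> <div class='item'> <div class='pic'> <a> <img alt>
tree = etree.HTML(resp.text)
# There are multiple <li> elements, so this returns a list
the_first = tree.xpath('/html/body/div[3]/div[1]/div/div[1]/ol/li')
for i in the_first:
    title = i.xpath('./div/div[1]/a/img/@alt')[0]
    score = i.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0].strip()
    # Guard in case an entry has no one-line quote
    quote = i.xpath('./div/div[2]/div[2]/p[2]/span/text()')
    comment = quote[0] if quote else ''
    print(title)
    print(score)
    print(comment)
    # Insert the row into the database
    data = (title, score, comment)
    insert(data)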
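The script above requests only the first page of the ranking. Below is a sketch of building the URLs for the remaining pages, assuming the list is paginated through a `start` query parameter in steps of 25; check this against the page links in your browser before relying on it. `page_urls` is a hypothetical helper name.

```python
BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, per_page=25):
    """Build one URL per page, assuming a `start` offset query parameter."""
    return [f'{BASE_URL}?start={offset}' for offset in range(0, total, per_page)]


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each URL could then be fetched and parsed with the same loop as above, ideally with a short time.sleep between requests to keep the load on the site low.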


Added: 2021-10-15 11:45:10 Updated: 2021-10-15 11:46:56
