
[Python Knowledge Base] Eastmoney: Scraping URLs and Item Names

Ping An Bank (000001) Capital Flow _ Data Center _ Eastmoney (eastmoney.com)

# Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

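[Step 1] Download the page's HTML

Define the function download_all_htmls to download the page's HTML.

1. Use requests.get to fetch the HTML page (it sends a request for the resource at the given URL and returns a Response object containing the server's reply).

2. r.status_code holds the HTTP status of the response: 200 means the request succeeded, while 418 indicates that the site has an anti-scraping mechanism; in that case the request needs to be sent with request headers (headers).

3. htmls.append(r.text) adds the page content for the URL to the list htmls.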
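
Point 2 mentions request headers without showing them. The following is a minimal sketch of one way to attach them, not the author's code: the function name fetch_html, the get parameter, and the User-Agent string are all assumptions made for illustration.

```python
# Assumed header values; any realistic browser User-Agent string could be used.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
}


def fetch_html(url, headers=None, timeout=10, get=None):
    """Fetch url with browser-like headers and return the page text.

    `get` is a hypothetical hook: it defaults to requests.get, but a stub
    can be passed in to exercise the function without network access.
    """
    if get is None:
        import requests  # imported lazily so a stub can replace it
        get = requests.get
    merged = {**DEFAULT_HEADERS, **(headers or {})}  # caller's headers win
    r = get(url, headers=merged, timeout=timeout)
    if r.status_code != 200:
        # Some sites answer 418 to clients they identify as crawlers.
        raise RuntimeError(f"GET {url} returned HTTP {r.status_code}")
    return r.text
```

Calling fetch_html("http://data.eastmoney.com/zjlx/000001.html") would then perform the real request with the assumed headers.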

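def download_all_htmls():
    """Download the page and return its HTML as a one-element list."""
    htmls = []
    url = "http://data.eastmoney.com/zjlx/000001.html"
    print(url)
    r = requests.get(url, timeout=10)   # timeout so a stalled request cannot hang forever
    if r.status_code != 200:
        raise Exception(f"request failed with HTTP status {r.status_code}")
    print(r)
    htmls.append(r.text)
    return htmls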

[Step 2] Parse the HTML to get the data

Define the function parse_single_html, which parses a single HTML page and extracts the data.

import re

def parse_single_html(html):
    # Build a BeautifulSoup instance (first argument: the markup to parse; second: the parser to use)
    soup = BeautifulSoup(html, 'html.parser')
    items = (
        soup.find("div", class_="main")
        .find("div", class_="main-content")
        .find("div", class_="framecontent")
        .find("div", class_="sinstock-filter-wrap")
        .find("table", class_="tab")
        .find_all("td")      # every table cell, returned as the list items
    )
    datas = []
    for item in items:
        item = str(item)   # convert each cell to a string so it can be matched with regular expressions
        links = re.findall(r'<a href="([^"]*)"', item)     # the URL; [^"]* stops at the closing quote
        titles = re.findall(r'[\u4e00-\u9fa5]+', item)     # the name: runs of Chinese characters
        if links and titles:     # keep only cells that contain both a link and a name
            datas.append({
                "网址": links[0],
                "名称": titles[0],
            })    # each record is stored as a dict, one element of the list datas
    return datas
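
The link pattern matters once an anchor tag carries more than one attribute. The check below runs a greedy (.*) capture, a [^"]* capture, and the title pattern on a made-up table cell; the HTML snippet is an assumption written for illustration, not markup copied from the Eastmoney page.

```python
import re

# Hypothetical table cell, invented for this example.
cell = '<td><a href="//data.eastmoney.com/zjlx/detail.html" target="_blank">资金流向</a></td>'

# Greedy: .* runs on to the last '">' in the string, swallowing the next attribute.
greedy = re.findall(r'<a href="(.*)">', cell)
# Bounded: [^"]* stops at the first closing quote, leaving only the URL.
strict = re.findall(r'<a href="([^"]*)"', cell)
# Runs of Chinese characters give the visible link text.
title = re.findall(r'[\u4e00-\u9fa5]+', cell)

print(greedy)  # ['//data.eastmoney.com/zjlx/detail.html" target="_blank']
print(strict)  # ['//data.eastmoney.com/zjlx/detail.html']
print(title)   # ['资金流向']
```

With a single-attribute anchor both link patterns return the same URL, so the difference only shows up on cells like the one above.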

[Step 3] Load the DataFrame into MySQL

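import sqlalchemy
from sqlalchemy import create_engine

# Create the database engine
conn = create_engine('mysql+pymysql://root:123@localhost:3306/crawl?charset=utf8')
# df is the DataFrame built from the parsed records (see the complete code).
# Write the data; 'replace' drops and recreates the table if one with the same name exists.
# index=False leaves the DataFrame index out of the table, so no index_label is needed.
df.to_sql(name='df_1', con=conn, if_exists='replace', index=False,
          dtype={'网址': sqlalchemy.types.String(length=255),   # room for a full URL
                 '名称': sqlalchemy.types.String(length=20),
                 })
print('ok')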
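
To try the write step without a running MySQL server, the sketch below builds a DataFrame from records shaped like the output of parse_single_html and writes it to an in-memory SQLite database. SQLite is a stand-in chosen for this example, and the two records are made up; only the to_sql call pattern carries over to the MySQL engine.

```python
import sqlite3

import pandas as pd

# Made-up records in the same shape parse_single_html returns.
data = [
    {"网址": "//example.com/a.html", "名称": "示例一"},
    {"网址": "//example.com/b.html", "名称": "示例二"},
]
df = pd.DataFrame(data)  # one row per dict, one column per key

con = sqlite3.connect(":memory:")  # stand-in for the MySQL engine
df.to_sql(name="df_1", con=con, if_exists="replace", index=False)
# Writing again with if_exists="replace" recreates the table instead of appending.
df.to_sql(name="df_1", con=con, if_exists="replace", index=False)

result = pd.read_sql("SELECT * FROM df_1", con)
print(result)
con.close()
```

Because the second write replaces the table, the query returns two rows rather than four.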

[Complete code]

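import requests
from bs4 import BeautifulSoup
import pandas as pd

def download_all_htmls():
    """Download the page and return its HTML as a one-element list."""
    htmls = []
    url = "http://data.eastmoney.com/zjlx/000001.html"
    print(url)
    r = requests.get(url, timeout=10)   # timeout so a stalled request cannot hang forever
    if r.status_code != 200:
        raise Exception(f"request failed with HTTP status {r.status_code}")
    print(r)
    htmls.append(r.text)
    return htmls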

htmls=download_all_htmls()

import re

def parse_single_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = (
        soup.find("div", class_="main")
        .find("div", class_="main-content")
        .find("div", class_="framecontent")
        .find("div", class_="sinstock-filter-wrap")
        .find("table", class_="tab")
        .find_all("td")
    )
    datas = []
    for item in items:
        item = str(item)
        links = re.findall(r'<a href="([^"]*)"', item)     # the URL
        titles = re.findall(r'[\u4e00-\u9fa5]+', item)     # the name
        if links and titles:     # keep only cells that contain both a link and a name
            datas.append({
                "网址": links[0],
                "名称": titles[0],
            })
    return datas

data=[]
data.extend(parse_single_html(htmls[0]))
print(data)

df=pd.DataFrame(data)

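import sqlalchemy
from sqlalchemy import create_engine

# Create the database engine
conn = create_engine('mysql+pymysql://root:123@localhost:3306/crawl?charset=utf8')
# Write the data; 'replace' drops and recreates the table if one with the same name exists.
# index=False leaves the DataFrame index out of the table, so no index_label is needed.
df.to_sql(name='df_1', con=conn, if_exists='replace', index=False,
          dtype={'网址': sqlalchemy.types.String(length=255),   # room for a full URL
                 '名称': sqlalchemy.types.String(length=20),
                 })
print('ok')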

Added: 2021-10-18 17:21:19  Updated: 2021-10-18 17:21:37
