AWS Glue in Practice
Glue Job Configuration
P.S.: You will have to work out the IAM permissions yourself; the rule of thumb is to grant whatever permission the job complains about. If this is only a test, granting everything (both action and resource set to *) settles it once and for all.
Job Parameters
Key | Value
---|---
--conf | spark.serializer=org.apache.spark.serializer.KryoSerializer
--enable-glue-datacatalog | |
Dependent jars path
s3://gavin-test2/dependency_jars/hudi/spark-avro_2.11-2.4.3.jar,s3://gavin-test2/dependency_jars/hudi/hudi-spark-bundle_2.11-0.8.0.jar
Download links for the jars:
Jar | Download link
---|---
hudi-spark-bundle_2.11-0.8.0.jar | https://search.maven.org/remotecontent?filepath=org/apache/hudi/hudi-spark-bundle_2.11/0.8.0/hudi-spark-bundle_2.11-0.8.0.jar
spark-avro_2.11-2.4.3.jar | https://search.maven.org/remotecontent?filepath=org/apache/spark/spark-avro_2.11/2.4.3/spark-avro_2.11-2.4.3.jar
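The same settings can also be applied programmatically instead of through the console. Below is a minimal sketch using boto3; the job name, IAM role, and script location are hypothetical placeholders, and the console's "Dependent jars path" field corresponds to the --extra-jars default argument.

```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="hudi-demo-job",        # hypothetical job name
    Role="GlueHudiDemoRole",     # hypothetical IAM role with the required permissions
    GlueVersion="2.0",           # Spark 2.4, matching the Scala 2.11 / Spark 2.4.3 jars above
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://gavin-test2/scripts/hudi_demo.py",  # hypothetical script path
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "--enable-glue-datacatalog": "",
        # "Dependent jars path" in the console maps to --extra-jars
        "--extra-jars": ",".join([
            "s3://gavin-test2/dependency_jars/hudi/spark-avro_2.11-2.4.3.jar",
            "s3://gavin-test2/dependency_jars/hudi/hudi-spark-bundle_2.11-0.8.0.jar",
        ]),
    },
)
```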
Writing a Non-Partitioned Table
Python Code
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Target S3 location and table/database names for the Hudi table
basePath = 's3://gavin-test2/tables/hudi/table1/'
table_name = 'table1'
database = 'default'

# Build a small test DataFrame; the third column holds a "yyyy/mm/dd" string
data = [('Alice', 1, '2022/02/28'), ('Jhone', 2, '2022/03/01')]
rdd = sc.parallelize(data)
schema = StructType(
    [
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("partitioin_path", StringType(), True),
    ]
)
src_df = spark.createDataFrame(rdd, schema)

# Hudi write options for a non-partitioned table:
# - the record key is "name"
# - the partition field is left empty, with NonpartitionedKeyGenerator /
#   NonPartitionedExtractor
# - hive_sync registers the table in the Glue Data Catalog (use_jdbc=false)
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.recordkey.field': 'name',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.hive_sync.partition_fields': '',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor'
}

src_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
```
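Before moving on to the catalog and Athena, the write can also be verified inside the same script by reading the table back through the Hudi datasource. This is only a sketch: with the Hudi 0.8.0 bundle used here a glob suffix on the load path (for example basePath + "/*") may be required, while newer Hudi releases accept the base path directly.

```python
# Quick sanity check in the same Spark session, right after the write above.
# Note: older Hudi releases may need a glob such as basePath + "/*".
check_df = spark.read.format("hudi").load(basePath)
check_df.select("name", "age", "partitioin_path").show(truncate=False)
```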
AWS Glue Catalog
Query in Athena
_hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | name | age | partitioin_path
---|---|---|---|---|---|---|---
20220307080836 | 20220307080836_0_1 | Jhone | | 8965ef34-4048-4420-8e69-562a478c3989-0_0-13-481_20220307080836.parquet | Jhone | 2 | 2022/03/01
20220307080836 | 20220307080836_0_2 | Alice | | 8965ef34-4048-4420-8e69-562a478c3989-0_0-13-481_20220307080836.parquet | Alice | 1 | 2022/02/28
Files in S3 Bucket
Writing a Partitioned Table
A partitioned table can use either a time field or a non-time field as the partition key; this article uses a time string in the form "yyyy/mm/dd" as the partition field.
Python Code
```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Same test data as before, written to a second table
basePath = 's3://gavin-test2/tables/hudi/table2/'
table_name = 'table2'
database = 'default'

data = [('Alice', 1, '2022/02/28'), ('Jhone', 2, '2022/03/01')]
rdd = sc.parallelize(data)
schema = StructType(
    [
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("partitioin_path", StringType(), True),
    ]
)
src_df = spark.createDataFrame(rdd, schema)

# Hudi write options for a partitioned table:
# - "partitioin_path" (a "yyyy/mm/dd" string) is the partition field
# - SimpleKeyGenerator builds the partition path from that single field
# - SlashEncodedDayPartitionValueExtractor parses the "yyyy/mm/dd" path during hive sync
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.recordkey.field': 'name',
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': database,
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.write.partitionpath.field': 'partitioin_path',
    'hoodie.datasource.hive_sync.partition_fields': 'partitioin_path',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor'
}

src_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
```
AWS Glue Catalog
Query in Athena
_hoodie_commit_time | _hoodie_commit_seqno | _hoodie_record_key | _hoodie_partition_path | _hoodie_file_name | name | age | partitioin_path
---|---|---|---|---|---|---|---
20220307084134 | 20220307084134_0_1 | Jhone | 2022/03/01 | afb8d8e3-5f3e-4420-b390-1eeb20a59165-0_0-8-322_20220307084134.parquet | Jhone | 2 | 2022-03-01
20220307084134 | 20220307084134_1_1 | Alice | 2022/02/28 | 351c3c19-3c69-4661-8051-11a90c031112-0_1-8-323_20220307084134.parquet | Alice | 1 | 2022-02-28
Files in S3 Bucket
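Because the record key is name, a subsequent run can update existing rows in place rather than append duplicates by switching the write operation to upsert. The sketch below reuses hudi_options and schema from the script above; the extra rows and the choice of age as the precombine field are purely illustrative assumptions.

```python
# Hypothetical follow-up batch: Alice's age changes and one new person is added.
update_data = [('Alice', 3, '2022/02/28'), ('Bob', 4, '2022/03/02')]
update_df = spark.createDataFrame(sc.parallelize(update_data), schema)

upsert_options = dict(hudi_options)
upsert_options['hoodie.datasource.write.operation'] = 'upsert'
# The precombine field decides which record wins when the same key appears twice
# in one batch; "age" is only an illustrative choice here.
upsert_options['hoodie.datasource.write.precombine.field'] = 'age'

# mode("append") keeps the existing table and lets Hudi merge rows by record key.
update_df.write.format("hudi").options(**upsert_options).mode("append").save(basePath)
```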
FAQ
ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem
Error Info
With DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY -> "false" set in Hudi, the job fails with:
java.lang.ClassNotFoundException: org.apache.calcite.rel.type.RelDataTypeSystem
This is because Hive 3 / Spark 3 removed their dependency on the calcite package.
Solution:
I took the lazy way out and dropped Spark back to 2.x (on Glue, that means choosing Glue version 2.0, which runs Spark 2.4, rather than Glue 3.0).
IllegalArgumentException: Partition path default is not in the form yyyy/mm/dd
Error Info
Caused by: java.lang.IllegalArgumentException: Partition path default is not in the form yyyy/mm/dd
at org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
at org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
... 42 more
Solution
This is because 'hoodie.datasource.hive_sync.partition_extractor_class' is set to 'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor', and that extractor requires the partition path to be exactly in the yyyy/mm/dd form; see org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor#extractPartitionValuesInPath.
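The partition path in the message is literally "default", which Hudi typically uses when the partition value is empty or null, so the write side and the hive-sync side were configured inconsistently. Two hedged fixes, sketched against the hudi_options dictionary from the examples above:

```python
# Sketch 1: the table is really non-partitioned -> keep the key generator and
# extractor consistent, exactly as in the non-partitioned example above.
hudi_options.update({
    'hoodie.datasource.write.partitionpath.field': '',
    'hoodie.datasource.hive_sync.partition_fields': '',
    'hoodie.datasource.write.keygenerator.class':
        'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.NonPartitionedExtractor',
})

# Sketch 2: the table is partitioned, but the values are not yyyy/mm/dd dates ->
# use an extractor that does not enforce a date format, e.g.
# MultiPartKeysValueExtractor (it splits the path on "/", so the number of path
# levels must match the declared partition fields).
hudi_options.update({
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
})
```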