| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed | Bundled with Hadoop | Splittable | Code changes needed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gzip | 8.3GB | 1.8GB | 17.5MB/s | 58MB/s | Yes | No | No |
| bzip2 | 8.3GB | 1.1GB | 2.4MB/s | 9.5MB/s | Yes | Yes | No |
| LZO | 8.3GB | 2.9GB | 49.3MB/s | 74.6MB/s | No | Yes | Yes |
- Input compression (Hadoop decides whether a codec applies by the file extension; configured in core-site.xml):
  org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec
- Mapper output (enterprises usually use the LZO or Snappy codec at this stage; configured in mapred-site.xml):
  com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.SnappyCodec
- Reducer output (use standard tools or codecs such as gzip and bzip2; configured in mapred-site.xml):
  org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec
PS: The LZO format is GPL-licensed and therefore cannot be distributed under the Apache license, so its Hadoop codec must be downloaded separately (see the notes on installing and compiling lzo on Linux). The LzopCodec is compatible with the lzop tool: it is essentially the LZO format with an extra header, which is exactly what we want. There is also a pure-LZO codec, LzoCodec, which uses the .lzo_deflate extension (by analogy with DEFLATE, which is gzip without the header).
1.1 Compressing and Decompressing Data Streams
// Compression: look up a codec by name (e.g. "BZip2Codec") and wrap the raw output stream.
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodecByName(method);
FileOutputStream fos = new FileOutputStream(new File(filename + codec.getDefaultExtension()));
CompressionOutputStream cos = codec.createOutputStream(fos);
// Decompression: infer the codec from the file extension and wrap the raw input stream.
CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
CompressionCodec codec = factory.getCodec(new Path(filename));
FileInputStream fis = new FileInputStream(new File(filename));
CompressionInputStream cis = codec.createInputStream(fis);
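Putting the two fragments together, here is a minimal end-to-end sketch; the file name, the buffer size, and the ".decoded" suffix are assumptions for illustration, and IOUtils.copyBytes does the actual copying:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class StreamCompressDemo {
    // Compress a local file; the output gets the codec's default extension (e.g. .bz2).
    public static void compress(String filename, String codecName) throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodecByName(codecName);
        try (FileInputStream fis = new FileInputStream(new File(filename));
             CompressionOutputStream cos = codec.createOutputStream(
                     new FileOutputStream(new File(filename + codec.getDefaultExtension())))) {
            IOUtils.copyBytes(fis, cos, 1024 * 1024);
        }
    }

    // Decompress a local file; the codec is inferred from the extension (may be null if unknown).
    public static void decompress(String filename) throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(filename));
        try (CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(filename)));
             FileOutputStream fos = new FileOutputStream(new File(filename + ".decoded"))) {
            IOUtils.copyBytes(cis, fos, 1024 * 1024);
        }
    }
}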
1.2 Compressing Map and Reduce Output
The Mapper and Reducer code stays unchanged; only the driver configuration changes.
// Compress intermediate map output (set on the Configuration before the Job is created).
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
// Compress the final reducer output (set on the Job through FileOutputFormat).
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
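For context, a possible driver showing where these four lines sit; the job name, the identity Mapper/Reducer, and the command-line paths are placeholders used only to keep the sketch self-contained, not part of the original notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output before the Job is created.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression demo");
        job.setJarByClass(CompressionDriver.class);
        job.setMapperClass(Mapper.class);      // identity mapper, placeholder only
        job.setReducerClass(Reducer.class);    // identity reducer, placeholder only
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Compress the final reducer output.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}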
2. Hadoop Optimization in Practice
2.1 Bottlenecks in MapReduce Program Efficiency
1) Hardware: CPU, memory, disk, and network. 2) I/O issues: data skew, an unreasonable number of MapTasks and ReduceTasks, too many small files, unsplittable compressed files, too many input splits, too many merge passes, and an overly long Reduce phase.
Solutions: 1) Input stage: use CombineTextInputFormat to merge large numbers of small input files. 2) Map stage: reduce the number of spills, reduce the number of merge passes, and add a Combiner; the relevant mapred-default.xml properties are listed below, followed by a per-job sketch.
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>100</value>
<description>The total amount of buffer memory to use while sorting
files, in megabytes. By default, gives each merge stream 1MB, which
should minimize seeks.</description>
</property>
<property>
<name>mapreduce.map.sort.spill.percent</name>
<value>0.80</value>
<description>The soft limit in the serialization buffer. Once reached, a
thread will begin to spill the contents to disk in the background. Note that
collection will not block if this threshold is exceeded while a spill is
already in progress, so spills may be larger than this threshold when it is
set to less than .5</description>
</property>
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>10</value>
<description>The number of streams to merge at once while sorting
files. This determines the number of open file handles.</description>
</property>
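These Map-side properties can also be overridden per job. A minimal sketch, assuming an existing Job named job and a hypothetical MyReducer used as the Combiner; the values are illustrative, not recommendations:

// Per-job overrides of the Map-side spill/merge settings shown above.
Configuration conf = job.getConfiguration();
conf.setInt("mapreduce.task.io.sort.mb", 200);              // larger sort buffer -> fewer spills
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);   // start spilling later
conf.setInt("mapreduce.task.io.sort.factor", 20);           // merge more spill files per pass
job.setCombinerClass(MyReducer.class);                      // hypothetical Combiner, often the Reducer class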
3) Reduce stage: set a reasonable number of MapTasks and ReduceTasks (too few and tasks sit waiting, too many and they compete for resources), let Map and Reduce overlap (start the reducers once the maps have made enough progress), and minimize what the reducers have to fetch (pulling map output to the reducers causes heavy network traffic); the relevant mapred-default.xml properties are listed below, followed by a per-job sketch.
<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0.05</value>
<description>Fraction of the number of maps in the job which should be
complete before reduces are scheduled for the job.
</description>
</property>
<property>
<name>mapreduce.reduce.input.buffer.percent</name>
<value>0.0</value>
<description>The percentage of memory- relative to the maximum heap size- to
retain map outputs during the reduce. When the shuffle is concluded, any
remaining map outputs in memory must consume less than this threshold before
the reduce can begin.
</description>
</property>
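Likewise, a per-job sketch for the Reduce-side settings, again assuming an existing Job named job and purely illustrative values:

// Per-job overrides of the Reduce-side settings shown above.
job.setNumReduceTasks(4);  // size the number of reducers to the data volume and the cluster
job.getConfiguration().setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f); // start reducers once 80% of maps finish
job.getConfiguration().setFloat("mapreduce.reduce.input.buffer.percent", 0.70f);        // keep more fetched map output in memory during the reduce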
4) I/O stage: use the Snappy or LZO codecs and use SequenceFile binary files (see the notes on Hive's binary storage formats, SequenceFile and RCFile). 5) Data skew: sampling and range partitioning (pre-set partition boundaries from a data sample), custom partitioners, a Combiner to shrink the data, and avoiding Reduce Joins in favour of Map Joins.
Note: a SequenceFile is a sequence of binary key/value pairs; if the key is the file name and the value is the file content, a large batch of small files can be merged into one large file.
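A minimal sketch of that idea, assuming a hypothetical local directory smallfiles and output file merged.seq; each small file becomes one record (file name as key, file bytes as value):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("merged.seq");              // hypothetical output file
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // Append each small file as one key/value record.
            for (File f : new File("smallfiles").listFiles()) {   // hypothetical input directory
                byte[] data = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(data));
            }
        }
    }
}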
3. New Hadoop Features
3.1 Recursive Data Copying Between Two Hadoop Clusters with distcp
[atguigu@hadoop102 hadoop-3.1.3]$ bin/hadoop distcp hdfs://hadoop102:9820/user/atguigu/hello.txt hdfs://hadoop105:9820/user/atguigu/hello.txt
3.2 Archiving Small Files
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop archive -archiveName input.har -p /user/atguigu/input /user/atguigu/output
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /user/atguigu/output/input.har
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls har:///user/atguigu/output/input.har
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -cp har:///user/atguigu/output/input.har/* /user/atguigu
Note: files deleted programmatically do not go through the trash; call moveToTrash() explicitly to move them into the trash.
// Move a path into the current user's trash instead of deleting it permanently.
Trash trash = new Trash(conf);
trash.moveToTrash(path);
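A slightly fuller, self-contained sketch; the NameNode address is taken from the distcp example above and the path is hypothetical. Note that the trash must be enabled (fs.trash.interval > 0), otherwise moveToTrash() returns false:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop102:9820");   // cluster address from the distcp example
        Trash trash = new Trash(conf);
        // true if the path was moved into the current user's .Trash directory
        boolean moved = trash.moveToTrash(new Path("/user/atguigu/hello.txt"));  // hypothetical path
        System.out.println("moved to trash: " + moved);
    }
}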
PS: Erasure Coding (used in distributed storage systems).
3.5 Hadoop HA (High Availability)
Related reading: JournalNode in Hadoop; the HDFS HA mechanism; HDFS NameNode HA configuration with the Quorum Journal Manager (QJM); [HDFS Part 11] HA; Hadoop 3 HA deployment for YARN; building HDFS HA and YARN HA; Hadoop 3.1.2 HA installation for YARN (ResourceManager High Availability).
HDFS HA: the JournalNodes (QJM) share the EditLog between the active and standby NameNodes, and ZooKeeper's ZKFailoverController performs automatic failover.
YARN HA: the ActiveStandbyElector inside the ResourceManager talks to ZooKeeper itself, so no separate ZKFailoverController is needed.
Limitations of the single-NameNode architecture: 1) Namespace limits (the number of objects (files + blocks) a NameNode can hold is bounded by the heap size of the JVM it runs in); 2) Lack of isolation (with only one NameNode, workloads cannot be isolated, so one job on HDFS can affect every other job on HDFS); 3) Performance bottleneck (the throughput of the whole HDFS file system is limited by the throughput of the single NameNode).