IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 大数据 -> presto查hive报错:end index must not be greater than size 问题分析和解决 -> 正文阅读

[大数据]presto查hive报错:end index must not be greater than size 问题分析和解决

tips

????presto?0.208连接hive有不少坑,请尽量不要选择这个版本。presto0.208以上的版本,jdk需要8_151+。

问题还原:

集群环境
hive?1.1.0
presto?0.208
hadoop?2.6
集群有张hive表使用hive-cli查询是OK,?但是使用presto?cli进行select?*?from?table_name?同样的sql?语法查询会报错,报错核心如下:

java.lang.RuntimeException: Error fetching next at http://10.90.50.169:28080/v1/statement/20221117_120110_00027_yetvx/2 returned an invalid response: JsonResponse{statusCode=500, statusMessage=Internal Server Error, headers={content-length=[5070], content-type=[text/plain], date=[Mon, 17 Jun 2019 12:01:10 GMT]}, hasValue=false} [Error: com.fasterxml.jackson.databind.JsonMappingException: end index (4) must not be greater than size (1) (through reference chain: com.facebook.presto.client.QueryResults["data"])  at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:379)  at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:339)  at com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:343)  at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:698)  at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)  at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:292)  at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1419)  at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:940)  at io.airlift.jaxrs.JsonMapper.writeTo(JsonMapper.java:233)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:266)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:251)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:109)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:85)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1135)  at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:662)  at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:395)  at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:385)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$3.run(ServerRuntime.java:884)  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)  at org.glassfish.jersey.internal.Errors.process(Errors.java:316)  at org.glassfish.jersey.internal.Errors.process(Errors.java:298)  at org.glassfish.jersey.internal.Errors.process(Errors.java:268)  at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:916)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:872)  at io.airlift.http.server.AsyncResponseHandler$1.onSuccess(AsyncResponseHandler.java:102)  at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1347)  at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  at java.lang.Thread.run(Thread.java:748)

????后面进一步排查,发现presto-cli中,查询的sql只要指定部分int/bigint类型的字段就会报错,对其他string类型的字段不报错。

排查思路

step?1:
hive-cli中执行show?create?table?table_name,得到表结构,并在自己的测试环境创建同样的表test_table;

step?2:
从hive表对应的hdfs文件出导出部分文件,在hive-cli中执行load?data?local?语句,导入抽样数据到test_table中;

step?3:
hive-cli和presto-cli中分别执行查询语句看问题是否复现
(hive-cli中?执行?select?*?from?test_table?limit?1,?返回结果正常,?presto-cli中执行报错,问题复现,同时发现只有查询表中的int/bigint类型字段会出问题);
step?4:
step1的时候发现表内容的存储格式是orc?,所以执行?hive?–orcfiledump?/user/hive/warehouse/test_table/part-00000-4ae99c5d-9fec-47ab-a19c-c970f4ec40be-c000.snappy.orc?看orc文件的统计信息;

step4发现异常:table中定义为int,bigint类型的字段在步骤4的类型统计部分显示全部都是string类型,所以就怀疑是orc文件的文件部分字段类型和hive?table中定义的不一致造成presto?在读取相关orc文件的时候报错。至于为什么hive-sql读取不报错,可我是hive对orc文件的处理和presto不一样吧。
????抱着这样的猜想,决定去进行一次实验,去构建一个表结构定义和orc文件字段定义类型不一致的场景,?实验过程如下:

错误复现实验

创建测试表test_string

tips:这里创建test_string表是为了将表中的orc文件拷贝到另外的test_int
#建表测试CREATE TABLE `test_string`(`server` string,`uid` string)ROW FORMAT SERDE    'org.apache.hadoop.hive.ql.io.orc.OrcSerde'STORED AS INPUTFORMAT    'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'OUTPUTFORMAT    'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
/* 或者可以简写为如下    CREATE TABLE `test_string`(        `server` string,        `uid` string)        STORED AS orc*/
#hive-cli中执行show?create?table?test_string输出如下
CREATE TABLE `test_string`(`server` string,`uid` string)ROW FORMAT SERDE'org.apache.hadoop.hive.ql.io.orc.OrcSerde'STORED AS INPUTFORMAT'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'OUTPUTFORMAT'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'LOCATION'hdfs://Ucluster/user/hive/warehouse/test_string'TBLPROPERTIES ('last_modified_by'='work','last_modified_time'='1560497239','transient_lastDdlTime'='1560772601')
#这里看到hive表的location在hdfs的/user/hive/warehouse/test_string目录
#向test_string表导入部分数据
insert?into?table?test_string?values?('server1',?'1'),('server2',?'2');
#退出hive-cli,查看test_string在hdfs上的orc文件信息概览
hadoop fs -ls /user/hive/warehouse/test_string (发现其下有一个000000_0,这是test_string的orc存储文件,输出如下)
Found 1 items-rwxrwxrwt???3?root?supergroup????????323?2022-11-17?21:14?/user/hive/warehouse/test_string/000000_0

hive?–orcfiledump?/user/hive/warehouse/test_string/000000_0?(执行这个命令查看orc文件的概览信息,?输出如下):

Processing data file /user/hive/warehouse/test_string/000000_0 [length: 323]Structure for /user/hive/warehouse/test_string/000000_0File Version: 0.12 with ORC_135Rows: 2Compression: ZLIBCompression size: 262144Type: struct<server:string,uid:string> (这里可以看到orc文件每行有两个字段,server和uid,且都是string类型)
Stripe Statistics:  Stripe 1:    Column 0: count: 2 hasNull: false    Column 1: count: 2 hasNull: false min: server1 max: server2 sum: 14    Column 2: count: 2 hasNull: false min: 1 max: 2 sum: 2
File Statistics:  Column 0: count: 2 hasNull: false  Column 1: count: 2 hasNull: false min: server1 max: server2 sum: 14  Column 2: count: 2 hasNull: false min: 1 max: 2 sum: 2
Stripes:  Stripe: offset: 3 data: 32 rows: 2 tail: 51 index: 72    Stream: column 0 section ROW_INDEX start: 3 length 11    Stream: column 1 section ROW_INDEX start: 14 length 34    Stream: column 2 section ROW_INDEX start: 48 length 27    Stream: column 1 section DATA start: 75 length 15    Stream: column 1 section LENGTH start: 90 length 6    Stream: column 2 section DATA start: 96 length 5    Stream: column 2 section LENGTH start: 101 length 6    Encoding column 0: DIRECT    Encoding column 1: DIRECT_V2    Encoding column 2: DIRECT_V2
File length: 323 bytesPadding length: 0 bytesPadding ratio: 0%

创建测试表test_int

# create table (注意,这里创建表的uid字段是int类型)CREATE TABLE `test_int`(`server`?string,`uid`?int)?STORED?AS?orc

????此时,hive-cli中执行show?create?table?test_int,输出内容内容如下:

CREATE TABLE `test_int`(`server` string,`uid` bigint)ROW FORMAT SERDE'org.apache.hadoop.hive.ql.io.orc.OrcSerde'STORED AS INPUTFORMAT'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'OUTPUTFORMAT'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'LOCATION'hdfs://Ucluster/user/hive/warehouse/test_int'

????试着将test_string表的hdfs?orc文件移动到test_int表的location下(这里是hdfs://Ucluster/user/hive/warehouse/test_int)

load?data?inpath?'hdfs://Ucluster/user/hive/warehouse/test_string/000000_0'?overwrite?into?table?test_int;

????此时退出hive-cli ,查看test_int表的orc文件概览信息

hive?–orcfiledump?/user/hive/warehouse/test_int/000000_0
Processing data file /user/hive/warehouse/test_int/000000_0 [length: 323]Structure for /user/hive/warehouse/test_int/000000_0File Version: 0.12 with ORC_135Rows: 2Compression: ZLIBCompression size: 262144Type: struct<server:string,uid:string>
Stripe Statistics:Stripe 1:    Column 0: count: 2 hasNull: false    Column 1: count: 2 hasNull: false min: server1 max: server2 sum: 14    Column 2: count: 2 hasNull: false min: 1 max: 2 sum: 2
File Statistics:Column 0: count: 2 hasNull: falseColumn 1: count: 2 hasNull: false min: server1 max: server2 sum: 14Column 2: count: 2 hasNull: false min: 1 max: 2 sum: 2
Stripes:Stripe: offset: 3 data: 32 rows: 2 tail: 51 index: 72    Stream: column 0 section ROW_INDEX start: 3 length 11    Stream: column 1 section ROW_INDEX start: 14 length 34    Stream: column 2 section ROW_INDEX start: 48 length 27    Stream: column 1 section DATA start: 75 length 15    Stream: column 1 section LENGTH start: 90 length 6    Stream: column 2 section DATA start: 96 length 5    Stream: column 2 section LENGTH start: 101 length 6    Encoding column 0: DIRECT    Encoding column 1: DIRECT_V2    Encoding column 2: DIRECT_V2
File length: 323 bytesPadding length: 0 bytesPadding ratio: 0%

hive/presto查询结果对比

????经过上面两个建表的步骤,我们就得到了一个符合预期的test_int表(表的定义中uid为int类型,?表的hdfs?orc文件中的uid的类型为string。

#?hive测试如下hive>?select?*?from?test_int;OKserver1  1server2  2Time taken: 4.186 seconds, Fetched: 2 row(s)
hive> select uid  from test_int;OK12Time taken: 0.178 seconds, Fetched: 2 row(s)
# presto测试如下这里的10.90.50.169:28080为presto coordinator的主机名和监听端口,database使用默认的default库, 从结果发现,查询test_int表的string 类型字段是ok的,查询int类型的uid字段会报错。./presto-cli?–debug?–server?10.90.50.169:28080?–catalog?hive?–schema?default?
# presto?查询string?类型的字段okpresto:default> select server  from test_int; server--------- server1 server2(2 rows

????测试重点来了!!!

# 查询int类型的字段报错,复现之前的问题。presto:default> select uid from test_int;java.lang.RuntimeException:?Error?fetching?next?at?http://10.90.50.169:28080/v1/statement/20221117_133934_00034_yetvx/2?returned?an?invalid?response:?JsonResponse{statusCode=500,?statusMessage=Internal?Server?Error,?headers={content-length=[5070],?content-type=[text/plain],?date=[Mon,?17?Jun?2019?13:39:34?GMT]},?hasValue=false}?[Error:?com.fasterxml.jackson.databind.JsonMappingException:?end?index?(8)?must?not?be?greater?than?size?(1)?(through?reference?chain:?com.facebook.presto.client.QueryResults["data"])  at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:379)  at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:339)  at com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:343)  at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:698)  at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:155)  at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:292)  at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1419)  at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:940)  at io.airlift.jaxrs.JsonMapper.writeTo(JsonMapper.java:233)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:266)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:251)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:109)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:85)  at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)  at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1135)  at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:662)  at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:395)  at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:385)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder$3.run(ServerRuntime.java:884)  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)  at org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)  at org.glassfish.jersey.internal.Errors.process(Errors.java:316)  at org.glassfish.jersey.internal.Errors.process(Errors.java:298)  at org.glassfish.jersey.internal.Errors.process(Errors.java:268)  at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:916)  at org.glassfish.jersey.server.ServerRuntime$AsyncResponder.resume(ServerRuntime.java:872)  at io.airlift.http.server.AsyncResponseHandler$1.onSuccess(AsyncResponseHandler.java:102)  at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1347)  at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)  at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.IndexOutOfBoundsException: end index (8) must not be greater than size (1)  at io.airlift.slice.Preconditions.checkPositionIndexes(Preconditions.java:81)  at io.airlift.slice.Slice.checkIndexLength(Slice.java:1267)  at io.airlift.slice.Slice.getLong(Slice.java:479)  at com.facebook.presto.spi.block.AbstractVariableWidthBlock.getLong(AbstractVariableWidthBlock.java:63)  at com.facebook.presto.spi.type.BigintType.getObjectValue(BigintType.java:38)  at com.facebook.presto.server.protocol.RowIterable$RowIterator.computeNext(RowIterable.java:77)  at com.facebook.presto.server.protocol.RowIterable$RowIterator.computeNext(RowIterable.java:50)  at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)  at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)  at com.google.common.collect.Iterators$ConcatenatedIterator.hasNext(Iterators.java:1331)  at com.google.common.collect.Iterators$1.hasNext(Iterators.java:136)  at com.fasterxml.jackson.databind.ser.std.IterableSerializer.serializeContents(IterableSerializer.java:83)  at com.fasterxml.jackson.databind.ser.std.IterableSerializer.serialize(IterableSerializer.java:74)  at com.fasterxml.jackson.databind.ser.std.IterableSerializer.serialize(IterableSerializer.java:12)  at com.fasterxml.jackson.databind.ser.std.StdDelegatingSerializer.serialize(StdDelegatingSerializer.java:170)  at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:693)  at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:690)  ... 31 more]  at com.facebook.presto.client.StatementClientV1.requestFailedException(StatementClientV1.java:436)  at com.facebook.presto.client.StatementClientV1.advance(StatementClientV1.java:383)  at com.facebook.presto.cli.StatusPrinter.printInitialStatusUpdates(StatusPrinter.java:123)  at com.facebook.presto.cli.Query.renderQueryOutput(Query.java:137)  at com.facebook.presto.cli.Query.renderOutput(Query.java:121)  at com.facebook.presto.cli.Console.process(Console.java:312)  at com.facebook.presto.cli.Console.runConsole(Console.java:256)  at com.facebook.presto.cli.Console.run(Console.java:146)  at com.facebook.presto.cli.Presto.main(Presto.java:31)Query is gone (server restarted?)

测试总结

????如果想让hive以orc格式存储的表能正常在presto中查询,最好保证orc文件的字段类型定义和create?table时的定义一致,否则,有可能会造成hive-cli中查询正常,在presto-cli?中查询异常的情况。特别是在从别处拷贝orc文件过来的时候,需要先做一下类型比对。
????ps:?为何hive-cli中查询可以兼容字段类型不一致,而presto中确不行,就需要看presto的源码了。之前从网上看文档,presto在读取hdfs?orc文件是进行了一些优化的,有了解细节的大大可以告诉我一声哈。另,测试是发现,test_int表在hive中查询uid字段,如果有非法字段,会显示为null。

  大数据 最新文章
实现Kafka至少消费一次
亚马逊云科技:还在苦于ETL?Zero ETL的时代
初探MapReduce
【SpringBoot框架篇】32.基于注解+redis实现
Elasticsearch:如何减少 Elasticsearch 集
Go redis操作
Redis面试题
专题五 Redis高并发场景
基于GBase8s和Calcite的多数据源查询
Redis——底层数据结构原理
上一篇文章      下一篇文章      查看所有文章
加:2022-12-25 11:16:02  更:2022-12-25 11:19:57 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2025年1日历 -2025/1/27 14:29:23-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码