Official API documentation: https://arrow.apache.org/docs/python/index.html
1. Verify that the server can connect to HDFS:
>hadoop fs -ls /
Found 5 items
drwxrwxrwx - hbase supergroup 0 2021-09-15 13:58 /hbase
drwxr-xr-x - root root 0 2021-12-08 09:38 /hive
drwxrwxrwx - root root 0 2021-12-02 15:15 /system
drwxrwxrwx - hdfs supergroup 0 2021-12-08 15:54 /tmp
drwxrwxrwx - hdfs supergroup 0 2022-03-09 10:28 /user
2. Install pyarrow:
pip install pyarrow
3. Read and write HDFS files (csv_2_parquet.py):
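import pyarrow as pa
import pyarrow.parquet as parquet
import pyarrow.csv as csv

if __name__ == '__main__':
    hdfs_host = 'your-hdfs-server-ip'  # replace with the address of your HDFS server
    hdfs_port = 8020
    file = '/user/data/df_raw_label.csv'
    # Legacy API: deprecated as of pyarrow 2.0.0 in favor of pyarrow.fs.HadoopFileSystem
    fs = pa.hdfs.HadoopFileSystem(host=hdfs_host, port=hdfs_port, user='hdfs')
    # Read the CSV file from HDFS into an Arrow table
    with fs.open(file, mode='rb') as f:
        arrow_table = csv.read_csv(f)
    # Write the table back to HDFS as a Parquet file
    parquet_file = '/user/data/df_raw_label.parquet'
    with fs.open(parquet_file, mode="wb") as f:
        parquet.write_table(arrow_table, f)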
Error: libhdfs.so is not found under /usr/local/lib64/python3.6/site-packages/pyarrow/:
__main__:1: FutureWarning: pyarrow.hdfs.HadoopFileSystem is deprecated as of 2.0.0, please use pyarrow.fs.HadoopFileSystem instead.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.6/site-packages/pyarrow/hdfs.py", line 49, in __init__
self._connect(host, port, user, kerb_ticket, extra_conf)
File "pyarrow/_hdfsio.pyx", line 85, in pyarrow._hdfsio.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Unable to load libhdfs: ./libhdfs.so: cannot open shared object file: No such file or directory
Solution:
(1) Locate the file on the system:
>find /opt/ -name libhdfs.so
/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/libhdfs.so
(2) Simple approach: copy libhdfs.so to the location where it is needed:
cp /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/libhdfs.so /usr/local/lib64/python3.6/site-packages/pyarrow
(3) Official approach (not yet verified to work):
Reference: https://arrow.apache.org/docs/python/filesystems_deprecated.html?highlight=arrow_libhdfs_dir
Add environment variables:
vi /etc/profile
Append to the end of the file:
export CLASSPATH=`hadoop classpath --glob`
export ARROW_LIBHDFS_DIR="/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/"
Apply the environment variables:
>source /etc/profile
>export
......
declare -x ARROW_LIBHDFS_DIR="/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib64/"
declare -x CLASSPATH="/etc/hadoop/conf:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/libexec/../../hadoop/lib/kerb-client-1.0.0.jar
......
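As an alternative to editing /etc/profile, the same two variables can be set from Python before the HDFS connection is opened. This is only a sketch under assumptions and, like approach (3), has not been verified against this cluster. The helper names find_libhdfs_dir, hadoop_classpath and configure_hdfs_env are hypothetical, and the code assumes the variables are read when pyarrow first connects, so they must be set beforehand. The Arrow documentation also lists HADOOP_HOME and JAVA_HOME among the variables it uses, which this sketch does not touch.

```python
import os
import subprocess


def find_libhdfs_dir(root):
    """Return the first directory under root that contains libhdfs.so, or None.

    Rough Python equivalent of: find <root> -name libhdfs.so
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        if "libhdfs.so" in filenames:
            return dirpath
    return None


def hadoop_classpath():
    """Return the output of `hadoop classpath --glob` (hadoop must be on PATH)."""
    result = subprocess.run(
        ["hadoop", "classpath", "--glob"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def configure_hdfs_env(libhdfs_dir, classpath, environ=os.environ):
    """Set the variables from approach (3) in the given environment mapping."""
    environ["ARROW_LIBHDFS_DIR"] = libhdfs_dir
    environ["CLASSPATH"] = classpath


# Usage on the Hadoop machine, before creating the HadoopFileSystem:
# lib_dir = find_libhdfs_dir("/opt")
# if lib_dir is not None:
#     configure_hdfs_env(lib_dir, hadoop_classpath())
```

Setting the variables in-process only affects the current Python process, which can be convenient when /etc/profile cannot be changed.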