Preface
I recently read some papers on MapReduce-based parallel training of neural networks, and my advisor asked me to implement the idea myself to get a deeper feel for the underlying principles.
MapReduce is a Java-based framework, so at first I planned to write the deep learning code in Java. But the dl4j framework turned out to be very hard to use, and most deep learning tutorials online are written in Python, so in the end I decided to implement the MapReduce-based neural network in Python. If that doesn't work out, I'll consider implementing the network in Java later.
The rough study plan is as follows:
1. Install Python 3 on Ubuntu (reference link)
2. Implement the simplest MapReduce example in Python, e.g. WordCount (reference link 1: running in PyCharm; reference link 2: running from the command line)
3. Set up the environment: PyCharm (Linux) + Hadoop + Spark (reference link)
4. Parallelize the BP algorithm with MapReduce
5. Reproduce the papers' code
I. Installing Python 3 on Ubuntu
Running python -V shows that the python command is not available:
root@ubuntu:/# python -V
Command 'python' not found, did you mean:
command 'python3' from deb python3
command 'python' from deb python-is-python3
Go into /usr/bin and run ls -l | grep python; the listing shows that python3 is a symlink to python3.8:
root@ubuntu:/usr/bin# ls -l | grep python
lrwxrwxrwx 1 root root 23 Jun 2 03:49 pdb3.8 -> ../lib/python3.8/pdb.py
lrwxrwxrwx 1 root root 31 Sep 23 2020 py3versions -> ../share/python3/py3versions.py
lrwxrwxrwx 1 root root 9 Sep 23 2020 python3 -> python3.8
-rwxr-xr-x 1 root root 5490352 Jun 2 03:49 python3.8
lrwxrwxrwx 1 root root 33 Jun 2 03:49 python3.8-config -> x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root 16 Mar 13 2020 python3-config -> python3.8-config
-rwxr-xr-x 1 root root 384 Mar 27 2020 python3-futurize
-rwxr-xr-x 1 root root 388 Mar 27 2020 python3-pasteurize
-rwxr-xr-x 1 root root 3241 Jun 2 03:49 x86_64-linux-gnu-python3.8-config
lrwxrwxrwx 1 root root 33 Mar 13 2020 x86_64-linux-gnu-python3-config -> x86_64-linux-gnu-python3.8-config
Running python3 -V shows that Ubuntu ships with Python 3, version 3.8.10:
root@ubuntu:/# python3 -V
Python 3.8.10
Run pip3 -V to check the pip version:
root@ubuntu:/# pip3 -V
pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)
Run pip3 list to see the installed packages:
root@ubuntu:/# pip3 list
Package Version
---------------------- --------------------
apturl 0.5.2
bcrypt 3.1.7
blinker 1.4
Brlapi 0.7.0
certifi 2019.11.28
chardet 3.0.4
Click 7.0
colorama 0.4.3
command-not-found 0.3
cryptography 2.8
This confirms that Ubuntu already has a working Python 3 environment.
II. Implementing WordCount in Python
1) Code
mapper.py:
#!/usr/bin/env python3
import sys

# Read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
reducer.py:
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so identical words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # If count is not a number, just skip the line
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Don't forget to emit the final word
if word == current_word:
    print("%s\t%s" % (current_word, current_count))
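The hand-rolled grouping above can also be expressed with itertools.groupby, which makes the "identical keys are adjacent" assumption explicit. A minimal sketch (this is an illustrative alternative, not the script used in this walkthrough):

```python
#!/usr/bin/env python3
# Alternative reducer sketch using itertools.groupby.
# Assumes input is sorted by key, exactly as Hadoop's shuffle guarantees.
import sys
from itertools import groupby


def parse(lines):
    # Yield (word, count) pairs, skipping lines with a non-numeric count
    for line in lines:
        word, _, count = line.strip().partition('\t')
        if count.isdigit():
            yield word, int(count)


def reduce_counts(lines):
    # groupby collapses each run of identical keys into one total
    for word, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    for word, total in reduce_counts(sys.stdin):
        print("%s\t%s" % (word, total))
```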
Note the #!/usr/bin/env python3 shebang: no python alias is configured on this machine, so the plain python command does not work and the shebang must name python3 explicitly.
2) Testing the code locally
Working in the /opt/PycharmProjects/MapReduce/WordCount directory, first grant execute permission, just to be safe:
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x mapper.py
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# chmod +x reducer.py
Test mapper.py:
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "aa bb cc dd aa cc" | python3 mapper.py
aa 1
bb 1
cc 1
dd 1
aa 1
cc 1
Test reducer.py (here sort -k1,1 stands in for Hadoop's shuffle-and-sort phase):
root@ubuntu:/opt/PycharmProjects/MapReduce/WordCount# echo "foo foo quux labs foo bar quux" | python3 mapper.py | sort -k1,1 | python3 reducer.py
bar 1
foo 3
labs 1
quux 2
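The shell pipeline above (mapper | sort | reducer) can also be mimicked end-to-end in plain Python, which is handy for unit-testing the WordCount logic without Hadoop. A hedged sketch with illustrative function names:

```python
# Simulate the mapper | sort -k1,1 | reducer pipeline in pure Python.
# Mirrors the WordCount logic above; map_phase/reduce_phase are
# illustrative names, not part of the Hadoop Streaming API.

def map_phase(text):
    # One ("word", 1) pair per word, like mapper.py
    return [(word, 1) for word in text.split()]


def reduce_phase(pairs):
    # Sorting brings identical keys together, like `sort -k1,1`
    counts = {}
    for word, n in sorted(pairs):
        counts[word] = counts.get(word, 0) + n
    return counts


if __name__ == '__main__':
    text = "foo foo quux labs foo bar quux"
    print(reduce_phase(map_phase(text)))
```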
III. Running the Python Programs on Hadoop
1) Download two text files:
root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/files/5000/5000-8.txt
root@ubuntu:/home/wuwenjing/Downloads/dataset/gutenberg# wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
2) Create a directory on HDFS and upload the two books:
root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -mkdir /user/MapReduce/input
root@ubuntu:/hadoop/hadoop-2.9.2/sbin# hdfs dfs -put /home/wuwenjing/Downloads/dataset/gutenberg/*.txt /user/MapReduce/input
3) Locate the Hadoop Streaming jar; the hadoop-streaming-*.jar file lives under the share directory:
root@ubuntu:/hadoop# find ./ -name "*streaming*.jar"
./hadoop-2.9.2/share/hadoop/tools/lib/hadoop-streaming-2.9.2.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-sources.jar
./hadoop-2.9.2/share/hadoop/tools/sources/hadoop-streaming-2.9.2-test-sources.jar
4) The full streaming invocation is too long to retype each time, so put it in a shell script named run.sh (note that newer Hadoop releases deprecate the -file option in favor of the generic -files option, as the job's first warning below also points out):
root@ubuntu:/hadoop/hadoop-2.9.2# vim run.sh
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-file /opt/PycharmProjects/MapReduce/WordCount/mapper.py -mapper /opt/PycharmProjects/MapReduce/WordCount/mapper.py \
-file /opt/PycharmProjects/MapReduce/WordCount/reducer.py -reducer /opt/PycharmProjects/MapReduce/WordCount/reducer.py \
-input /user/MapReduce/input/*.txt -output /user/MapReduce/output
root@ubuntu:/hadoop/hadoop-2.9.2# source run.sh
This run failed; the output was:
21/08/30 23:24:34 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
21/08/30 23:24:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/opt/PycharmProjects/MapReduce/WordCount/mapper.py, /opt/PycharmProjects/MapReduce/WordCount/reducer.py] [] /tmp/streamjob4294615261368991400.jar tmpDir=null
21/08/30 23:24:35 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
21/08/30 23:24:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
21/08/30 23:24:35 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
21/08/30 23:24:35 INFO mapred.FileInputFormat: Total input files to process : 2
21/08/30 23:24:35 INFO mapreduce.JobSubmitter: number of splits:2
21/08/30 23:24:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/mapper.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076183/mapper.py
21/08/30 23:24:36 INFO mapred.LocalDistributedCacheManager: Localized file:/opt/PycharmProjects/MapReduce/WordCount/reducer.py as file:/hadoop/hadoop-2.9.2/tmp/mapred/local/1630391076184/reducer.py
21/08/30 23:24:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
21/08/30 23:24:36 INFO mapreduce.Job: Running job: job_local1574514939_0001
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
21/08/30 23:24:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Waiting for map tasks
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1574514939_0001_m_000000_0
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
21/08/30 23:24:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/MapReduce/input/5000-8.txt:0+1428843
21/08/30 23:24:36 INFO mapred.MapTask: numReduceTasks: 1
21/08/30 23:24:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
21/08/30 23:24:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
21/08/30 23:24:36 INFO mapred.MapTask: soft limit at 83886080
21/08/30 23:24:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
21/08/30 23:24:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
21/08/30 23:24:36 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
21/08/30 23:24:36 INFO streaming.PipeMapRed: PipeMapRed exec [/hadoop/hadoop-2.9.2/sbin/./mapper.py]
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
21/08/30 23:24:36 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
21/08/30 23:24:36 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
21/08/30 23:24:36 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
21/08/30 23:24:36 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
21/08/30 23:24:36 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:36 INFO streaming.PipeMapRed: Records R/W=2820/1
Traceback (most recent call last):
File "/hadoop/hadoop-2.9.2/sbin/./mapper.py", line 3, in <module>
for line in sys.stdin:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 2092: invalid continuation byte
21/08/30 23:24:36 INFO streaming.PipeMapRed: MRErrorThread done
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=2820/5129/0 in:NA [rec/s] out:NA [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 HOST=null
USER=wuwenjing
HADOOP_USER=null
last tool output: |Books, 1ed 1-Have 1ement 111|
java.io.IOException: Broken pipe
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.streaming.io.TextInputWriter.writeUTF8(TextInputWriter.java:72)
at org.apache.hadoop.streaming.io.TextInputWriter.writeValue(TextInputWriter.java:51)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/08/30 23:24:36 WARN streaming.PipeMapRed: java.io.IOException: Stream closed
21/08/30 23:24:36 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:120)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/08/30 23:24:36 WARN streaming.PipeMapRed: java.io.IOException: Stream closed
21/08/30 23:24:36 INFO streaming.PipeMapRed: PipeMapRed failed!
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/08/30 23:24:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1574514939_0001_m_000001_0
21/08/30 23:24:36 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
21/08/30 23:24:36 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
21/08/30 23:24:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
21/08/30 23:24:36 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/MapReduce/input/pg20417.txt:0+674570
21/08/30 23:24:36 INFO mapred.MapTask: numReduceTasks: 1
21/08/30 23:24:36 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
21/08/30 23:24:36 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
21/08/30 23:24:36 INFO mapred.MapTask: soft limit at 83886080
21/08/30 23:24:36 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
21/08/30 23:24:36 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
21/08/30 23:24:36 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
21/08/30 23:24:36 INFO streaming.PipeMapRed: PipeMapRed exec [/hadoop/hadoop-2.9.2/sbin/./mapper.py]
21/08/30 23:24:36 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:36 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:37 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:37 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:37 INFO streaming.PipeMapRed: Records R/W=2788/1
21/08/30 23:24:37 INFO streaming.PipeMapRed: R/W/S=10000/49444/0 in:NA [rec/s] out:NA [rec/s]
21/08/30 23:24:37 INFO streaming.PipeMapRed: MRErrorThread done
21/08/30 23:24:37 INFO streaming.PipeMapRed: mapRedFinished
21/08/30 23:24:37 INFO mapred.LocalJobRunner:
21/08/30 23:24:37 INFO mapred.MapTask: Starting flush of map output
21/08/30 23:24:37 INFO mapred.MapTask: Spilling map output
21/08/30 23:24:37 INFO mapred.MapTask: bufstart = 0; bufend = 866856; bufvoid = 104857600
21/08/30 23:24:37 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25775024(103100096); length = 439373/6553600
21/08/30 23:24:37 INFO mapred.MapTask: Finished spill 0
21/08/30 23:24:37 INFO mapred.Task: Task:attempt_local1574514939_0001_m_000001_0 is done. And is in the process of committing
21/08/30 23:24:37 INFO mapred.LocalJobRunner: Records R/W=2788/1
21/08/30 23:24:37 INFO mapred.Task: Task 'attempt_local1574514939_0001_m_000001_0' done.
21/08/30 23:24:37 INFO mapred.LocalJobRunner: Finishing task: attempt_local1574514939_0001_m_000001_0
21/08/30 23:24:37 INFO mapred.LocalJobRunner: map task executor complete.
21/08/30 23:24:37 WARN mapred.LocalJobRunner: job_local1574514939_0001
java.lang.Exception: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)
Caused by: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:459)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/08/30 23:24:37 INFO mapreduce.Job: Job job_local1574514939_0001 running in uber mode : false
21/08/30 23:24:37 INFO mapreduce.Job: map 100% reduce 0%
21/08/30 23:24:37 INFO mapreduce.Job: Job job_local1574514939_0001 failed with state FAILED due to: NA
21/08/30 23:24:37 INFO mapreduce.Job: Counters: 22
File System Counters
FILE: Number of bytes read=2069
FILE: Number of bytes written=1568742
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=809738
HDFS: Number of bytes written=0
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=1
Map-Reduce Framework
Map input records=12760
Map output records=109844
Map output bytes=866856
Map output materialized bytes=1086550
Input split bytes=106
Combine input records=0
Spilled Records=109844
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=0
Total committed heap usage (bytes)=357564416
File Input Format Counters
Bytes Read=674570
21/08/30 23:24:37 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!
Reading the log carefully, the earliest error is this one:
Traceback (most recent call last):
File "/hadoop/hadoop-2.9.2/sbin/./mapper.py", line 3, in <module>
for line in sys.stdin:
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 2092: invalid continuation byte
This looks like an encoding problem. To confirm, I created a small test.txt of my own, uploaded it to HDFS, and reran the job:
root@ubuntu:/hadoop/hadoop-2.9.2/sbin
21/08/31 00:02:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
a 1
hello 2
map 4
reduce 3
world 1
This time the job ran successfully.
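The test.txt run sidesteps the original UnicodeDecodeError rather than fixing it: the "-8" in 5000-8.txt suggests an 8-bit (non-UTF-8) encoding, so Python's default UTF-8 decoder on stdin chokes on byte 0xdc. One common workaround (an assumption, not verified against this exact file) is to re-wrap stdin in mapper.py with a tolerant error handler:

```python
#!/usr/bin/env python3
# Variant of mapper.py that tolerates non-UTF-8 input.
# errors='replace' substitutes U+FFFD for undecodable bytes instead of
# raising UnicodeDecodeError; errors='ignore' would drop them silently.
import io
import sys


def tolerant_lines(byte_stream):
    # Re-decode the raw byte stream with a forgiving error handler
    return io.TextIOWrapper(byte_stream, encoding='utf-8', errors='replace')


if __name__ == '__main__':
    for line in tolerant_lines(sys.stdin.buffer):
        for word in line.split():
            print("%s\t%s" % (word, 1))
```

An alternative would be to convert the input files to UTF-8 up front (e.g. with iconv) before uploading them to HDFS.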