In video 29 (Taobao data analysis) of the course "Big Data Analysis with PySpark", the source code is as follows:
# Requirement: group by session_id and count rows to get the per-session PV (page views)
session_pv = sqlContext.sql("""
    SELECT
        session_id, COUNT(1) AS cnt
    FROM
        tmp_page_views
    GROUP BY
        session_id
    ORDER BY
        cnt DESC
    LIMIT
        10
""").map(lambda output: output.session_id + "\t" + str(output.cnt))

for result in session_pv.collect():
    print(result)
Running this in my environment throws an error:
AttributeError: 'DataFrame' object has no attribute 'map'
Cause: my environment runs Spark 2.1.1, while the instructor's code in the original lesson was written against Spark 1.6.1.
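To confirm which version your own environment is actually running, you can check it directly in the PySpark shell. A minimal sketch, assuming an active SparkContext named sc (and, on 2.x, a SparkSession named spark):

# Print the Spark version of the current environment
print(sc.version)       # e.g. '2.1.1' here, vs. '1.6.1' in the course
# On Spark 2.x the SparkSession exposes the same information:
print(spark.version)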
You can't map a DataFrame, but you can convert the DataFrame to an RDD and map that by calling spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map was an alias for spark_df.rdd.map(); with Spark 2.0, you must explicitly call .rdd first.
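A minimal standalone reproduction of this behavior difference, assuming the sqlContext from the course is available (the toy data is made up for illustration):

# A hypothetical toy DataFrame to reproduce the behavior (the data is made up)
df = sqlContext.createDataFrame([("s1", 3), ("s2", 5)], ["session_id", "cnt"])

# On Spark 2.x, df.map(...) raises:
#   AttributeError: 'DataFrame' object has no attribute 'map'
# Going through .rdd first restores the Spark 1.x behavior:
lines = df.rdd.map(lambda row: row.session_id + "\t" + str(row.cnt))
print(lines.collect())    # ['s1\t3', 's2\t5']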
So the code should be changed to:
# Requirement: group by session_id and count rows to get the per-session PV (page views)
session_pv = sqlContext.sql("""
    SELECT
        session_id, COUNT(1) AS cnt
    FROM
        tmp_page_views
    GROUP BY
        session_id
    ORDER BY
        cnt DESC
    LIMIT
        10
""").rdd.map(lambda output: output.session_id + "\t" + str(output.cnt))

for result in session_pv.collect():
    print(result)
This now runs correctly and prints the top 10 sessions by PV, one session_id and count per line.
The same error can also be caused by the situation described in this blog post: https://blog.csdn.net/weixin_49256582/article/details/108737822
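As an aside, on Spark 2.x you can also avoid the RDD round-trip entirely and stay in the DataFrame API. This is only a sketch of one possible equivalent, assuming tmp_page_views is the temp table registered earlier in the lesson:

from pyspark.sql import functions as F

# Same top-10 session PV query, expressed with DataFrame operations
session_pv = (sqlContext.table("tmp_page_views")
              .groupBy("session_id")
              .count()                              # adds a 'count' column (the PV)
              .withColumnRenamed("count", "cnt")    # rename to avoid clashing with Row.count()
              .orderBy(F.desc("cnt"))
              .limit(10))

for row in session_pv.collect():
    print(row.session_id + "\t" + str(row.cnt))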