Spark Case Study: Hands-On Practice
The data looks like the following (one record per line):

The fields parse as follows:
# Using the first line as an example
2019-07-17                              date
95                                      user ID
26070e87-1ad7-49a3-8fb3-cc741facaddf    session ID
37                                      page ID
2019-07-17 00:00:02                     action time
手机                                    search keyword; if this field is not null, the action is a search
-1                                      clicked category ID; if this field is not -1, the action is a click
-1                                      clicked product ID; if this field is not -1, the action is a click
null                                    ordered category IDs; if not null, the action is an order; multiple IDs are separated by commas
null                                    ordered product IDs; if not null, the action is an order; multiple IDs are separated by commas
null                                    paid category IDs; if not null, the action is a payment; multiple IDs are separated by commas
null                                    paid product IDs; if not null, the action is a payment; multiple IDs are separated by commas
3                                       city ID
The excerpt above is taken from the data file and represents the user behavior data of an e-commerce site. It covers four kinds of user actions: search, click, order, and payment. The data follows these rules:
Each line in the data file uses underscores to separate its fields.
Each line represents one user action, and that action is exactly one of the four types above.
If the search keyword is null, the line is not a search action.
If the clicked category ID and product ID are -1, the line is not a click action.
A single order can contain multiple products, so the order category IDs and product IDs may hold several values separated by commas; if the line is not an order action, these fields are null.
Payment actions follow the same convention as order actions.
| No. | Field name | Type | Meaning |
|---|---|---|---|
| 1 | date | String | Date of the user action |
| 2 | user_id | Long | User ID |
| 3 | session_id | String | Session ID |
| 4 | page_id | Long | ID of the visited page |
| 5 | action_time | String | Time of the action |
| 6 | search_keyword | String | Keyword the user searched for |
| 7 | click_category_id | Long | ID of the clicked product category |
| 8 | click_product_id | Long | ID of the clicked product |
| 9 | order_category_ids | String | All category IDs in one order |
| 10 | order_product_ids | String | All product IDs in one order |
| 11 | pay_category_ids | String | All category IDs in one payment |
| 12 | pay_product_ids | String | All product IDs in one payment |
| 13 | city_id | Long | City ID |
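To make the field layout concrete, here is a minimal parsing sketch (plain Scala, no Spark required; the sample line is reassembled from the field breakdown above, and the field indices match the ones used by the implementations below):

object ActionLineDemo {
  def main(args: Array[String]): Unit = {
    // Sample record, reassembled from the field-by-field example above.
    val line = "2019-07-17_95_26070e87-1ad7-49a3-8fb3-cc741facaddf_37_2019-07-17 00:00:02_手机_-1_-1_null_null_null_null_3"
    val fields = line.split("_")
    val actionType =
      if (fields(5) != "null") "search"       // search_keyword
      else if (fields(6) != "-1") "click"     // click_category_id
      else if (fields(8) != "null") "order"   // order_category_ids
      else if (fields(10) != "null") "pay"    // pay_category_ids
      else "unknown"
    println(s"action type: $actionType, city_id: ${fields(12)}")  // prints: action type: search, city_id: 3
  }
}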
Requirement 1: Top 10 popular categories
A category is a classification of products. Large e-commerce sites use multi-level category hierarchies; in this project there is only a single level. Different companies may define "popular" differently; here we rank categories by their click, order, and payment counts.
For example, for the category "shoes": its click count, order count, and payment count.
One option is a weighted composite score, e.g. composite score = clicks * 20% + orders * 30% + payments * 50%.
In this project the requirement is simplified: rank by click count first; if click counts are equal, compare order counts; if those are also equal, compare payment counts.
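This click-then-order-then-payment tie-breaking maps directly onto Scala's default element-by-element ordering for tuples, which is what the sortBy(_._2, false) calls in the implementations below rely on. A minimal sketch:

object TupleOrderingDemo {
  def main(args: Array[String]): Unit = {
    // (clicks, orders, payments) per category; sorting the Tuple3 in descending
    // order compares clicks first, then orders, then payments.
    val stats = List(
      ("shoes",    (10, 5, 3)),
      ("phones",   (10, 5, 4)),  // ties on clicks and orders, wins on payments
      ("clothing", (12, 1, 1))   // most clicks, so it ranks first
    )
    val ranked = stats.sortBy(_._2)(Ordering[(Int, Int, Int)].reverse)
    ranked.foreach(println)
    // (clothing,(12,1,1))  (phones,(10,5,4))  (shoes,(10,5,3))
  }
}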
First implementation (count clicks, orders, and payments separately, then combine them with cogroup):
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// These two imports are also needed by the later examples in this section.
object TestHostCategoryTop10T1 {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
val clickActionRdd: RDD[String] = rdd.filter(_.split("_")(6) != "-1")
val clickCountRdd: RDD[(String, Int)] = clickActionRdd.map((action: String) => (action.split("_")(6), 1)).reduceByKey(_ + _)
val orderActionRdd: RDD[String] = rdd.filter(_.split("_")(8) != "null")
val orderCountRdd: RDD[(String, Int)] = orderActionRdd.flatMap(action => {
action.split("_")(8).split(",").map((_, 1))
}).reduceByKey(_ + _)
val payActionRdd: RDD[String] = rdd.filter(_.split("_")(10) != "null")
val payCountRdd: RDD[(String, Int)] = payActionRdd.flatMap(action => {
action.split("_")(10).split(",").map((_, 1))
}).reduceByKey(_ + _)
val cogrouprdd: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] =
clickCountRdd.cogroup(orderCountRdd, payCountRdd)
val analysisRDD: RDD[(String, (Int, Int, Int))] = cogrouprdd.mapValues {
case (clickIter, orderIter, payIter) =>
// After reduceByKey each Iterable holds at most one value; take it or default to 0.
(clickIter.headOption.getOrElse(0), orderIter.headOption.getOrElse(0), payIter.headOption.getOrElse(0))
}
val resultRDD: Array[(String, (Int, Int, Int))] = analysisRDD.sortBy(_._2, false).take(10)
resultRDD.foreach(println)
sc.stop()
}
}
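The cogroup step groups, for every category ID, whatever counts exist in the click, order, and pay RDDs into three Iterables, any of which may be empty when that category never produced that action. A minimal sketch of that behaviour on toy data, which can be pasted into spark-shell (where sc is predefined):

// cogroup joins by key and yields one entry per key with three Iterables;
// a key missing from one of the RDDs simply gets an empty Iterable there.
val clicks = sc.makeRDD(List(("A", 3), ("B", 1)))
val orders = sc.makeRDD(List(("A", 2)))
val pays   = sc.makeRDD(List(("B", 5)))
clicks.cogroup(orders, pays).collect().foreach(println)
// prints something like:
// (A,(CompactBuffer(3),CompactBuffer(2),CompactBuffer()))
// (B,(CompactBuffer(1),CompactBuffer(),CompactBuffer(5)))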
Second implementation (cache the source RDD, pre-shape each count as a (click, order, pay) triple, then union + reduceByKey):
object TestHostCategoryTop10T2 {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
rdd.cache()
val clickActionRdd: RDD[String] = rdd.filter(_.split("_")(6) != "-1")
val clickCountRdd: RDD[(String, (Int, Int, Int))] = clickActionRdd.map((action: String) => (action.split("_")(6), 1))
.reduceByKey(_ + _)
.mapValues((_, 0, 0))
val orderActionRdd: RDD[String] = rdd.filter(_.split("_")(8) != "null")
val orderCountRdd: RDD[(String, (Int, Int, Int))] = orderActionRdd.flatMap(action => {
action.split("_")(8).split(",").map((_, 1))
}).reduceByKey(_ + _).mapValues((0, _, 0))
val payActionRdd: RDD[String] = rdd.filter(_.split("_")(10) != "null")
val payCountRdd: RDD[(String, (Int, Int, Int))] = payActionRdd.flatMap(action => {
action.split("_")(10).split(",").map((_, 1))
}).reduceByKey(_ + _).mapValues((0, 0, _))
val analysisRdd: RDD[(String, (Int, Int, Int))] = clickCountRdd.union(orderCountRdd).union(payCountRdd).reduceByKey(
(t1, t2) => {
(t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
}
)
val resultRDD: Array[(String, (Int, Int, Int))] = analysisRdd.sortBy(_._2, false).take(10)
resultRDD.foreach(println)
sc.stop()
}
}
Third implementation (a single flatMap that emits count triples, followed by one reduceByKey):
object TestHostCategoryTop10T3 {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
val flatRDD: RDD[(String, (Int, Int, Int))] = rdd.flatMap(
action => {
val datas = action.split("_")
if (datas(6) != "-1") {
List((datas(6), (1, 0, 0)))
} else if (datas(8) != "null") {
datas(8).split(",").map((_, (0, 1, 0)))
} else if (datas(10) != "null") {
datas(10).split(",").map((_, (0, 0, 1)))
} else {
Nil
}
}
)
val analysisRdd: RDD[(String, (Int, Int, Int))] =flatRDD.reduceByKey(
(t1, t2) => {
(t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
}
)
val resultRDD: Array[(String, (Int, Int, Int))] = analysisRdd.sortBy(_._2, false).take(10)
resultRDD.foreach(println)
sc.stop()
}
}
Fourth implementation (a custom accumulator, so the counting needs no shuffle):
import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object TestHostCategoryTop10T4 {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
val acc = new HotCategoryAccumulator
sc.register(acc, "hotCategory")
rdd.foreach(
action => {
val datas = action.split("_")
if (datas(6) != "-1") {
acc.add((datas(6), "click"))
} else if (datas(8) != "null") {
datas(8).split(",").foreach(id => acc.add((id, "order")))
} else if (datas(10) != "null") {
datas(10).split(",").foreach(id => acc.add((id, "pay")))
}
}
)
val categories: mutable.Iterable[HotCategory] = acc.value.map(_._2)
val sortList: List[HotCategory] = categories.toList.sortWith(
(left, right) => {
if (left.clickCnt > right.clickCnt) {
true
} else if (left.clickCnt == right.clickCnt) {
if (left.orderCnt > right.orderCnt) {
true
} else if (left.orderCnt == right.orderCnt) {
left.payCnt > right.payCnt
} else {
false
}
} else {
false
}
}
)
val result = sortList.take(10)
result.foreach(println)
sc.stop()
}
class HotCategoryAccumulator extends AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] {
private val hcMap = mutable.Map[String, HotCategory]()
override def isZero: Boolean = hcMap.isEmpty
override def copy(): AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] =
new HotCategoryAccumulator
override def reset(): Unit = hcMap.clear()
override def add(v: (String, String)): Unit = {
val cid: String = v._1
val actionType: String = v._2
val category: HotCategory = hcMap.getOrElse(cid, HotCategory(cid, 0, 0, 0))
if (actionType == "click") {
category.clickCnt += 1
} else if (actionType == "order") {
category.orderCnt += 1
} else if (actionType == "pay") {
category.payCnt += 1
}
hcMap.update(cid, category)
}
override def merge(other: AccumulatorV2[(String, String), mutable.Map[String, HotCategory]]): Unit = {
val map1 = this.hcMap
val map2 = other.value
map2.foreach {
case (cid, hc) => {
val category: HotCategory = map1.getOrElse(cid, HotCategory(cid, 0, 0, 0))
category.clickCnt += hc.clickCnt
category.orderCnt += hc.orderCnt
category.payCnt += hc.payCnt
map1.update(cid, category)
}
}
}
override def value: mutable.Map[String, HotCategory] = hcMap
}
case class HotCategory(var cid: String, var clickCnt: Int, var orderCnt: Int, var payCnt: Int)
}
Requirement 2: Top 10 active sessions within each of the Top 10 popular categories
Simplified requirement: building on requirement 1, additionally count the clicks per user session within each of those categories. The implementation below first computes the Top 10 categories (reusing the single-pass logic from the third implementation), keeps only click actions on those categories, counts clicks per (category, session) pair, and finally takes the ten most active sessions for each category.
object TestHostCategoryTop10Session1 {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
rdd.cache()
val top10: Array[String] = top10Category(rdd)
val filterRdd: RDD[String] = rdd.filter(
action => {
val datas = action.split("_")
if (datas(6) != "-1") {
top10.contains(datas(6))
} else {
false
}
}
)
val reduceRdd: RDD[((String, String), Int)] = filterRdd.map(
action => {
val datas = action.split("_")
((datas(6), datas(2)), 1)
}
).reduceByKey(_ + _)
val mapRDD: RDD[(String, (String, Int))] = reduceRdd.map {
case ((cid, sid), sum) => (cid, (sid, sum))
}
val groupRDD: RDD[(String, Iterable[(String, Int)])] = mapRDD.groupByKey()
val resultRDD: RDD[(String, List[(String, Int)])] = groupRDD.mapValues(
iter => {
iter.toList.sortBy(_._2).reverse.take(10)
}
)
resultRDD.collect.foreach(println)
sc.stop()
}
def top10Category (rdd: RDD[String]): Array[String] = {
rdd.flatMap(
(action: String) => {
val datas: Array[String] = action.split("_")
if (datas(6) != "-1") {
List((datas(6), (1, 0, 0)))
} else if (datas(8) != "null") {
datas(8).split(",").map((_, (0, 1, 0)))
} else if (datas(10) != "null") {
datas(10).split(",").map((_, (0, 0, 1)))
} else {
Nil
}
}
).reduceByKey(
(t1, t2) => {
(t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
}
).sortBy(_._2, false).take(10).map(_._1)
}
}
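In the filter above, the driver-side top10 array is captured in the task closure, which is fine for ten IDs; the same lookup can also go through a broadcast variable so the list is shipped to each executor only once and reused across tasks. A hedged sketch (the names top10Bc and filterRdd2 are invented here; sc, rdd, and top10 are assumed to be defined exactly as in the code above):

// needs: import org.apache.spark.broadcast.Broadcast
val top10Bc: Broadcast[Array[String]] = sc.broadcast(top10)
val filterRdd2: RDD[String] = rdd.filter(action => {
  val datas = action.split("_")
  datas(6) != "-1" && top10Bc.value.contains(datas(6))
})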
Requirement 3: Page single-jump conversion rate
What is a page single-jump conversion rate? Suppose a user visits the page sequence 3,5,7,9,10,21 within one session. Then going from page 3 directly to page 5 is one "single jump", and 7 to 9 is another; the single-jump conversion rate measures how likely users are to make each such jump.
For example, to compute the 3-5 single-jump conversion rate: let A be the number of visits (PV) to page 3 among the qualifying sessions, and let B be the number of times a qualifying session visited page 3 and then immediately visited page 5. Then B/A is the single-jump conversion rate from page 3 to page 5.
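The "single jump" pairs are simply adjacent elements of a session's visit sequence, which the implementation below builds by zipping the page-ID list with its own tail. For the example path above:

// Adjacent (from, to) pairs for one session's page sequence.
val pages = List(3L, 5L, 7L, 9L, 10L, 21L)
val jumps = pages.zip(pages.tail)
println(jumps)  // List((3,5), (5,7), (7,9), (9,10), (10,21))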
Why the single-jump conversion rate matters:
Product managers and operations directors can use it to assess how the site, the product, and each individual page are performing, and whether the page layout needs optimizing so that users are guided toward the final payment page.
Data analysts can use it as the basis for deeper computation and analysis.
Management can see how well the transitions between pages perform across the company's site and adjust business strategy or tactics accordingly.
object TestPageFlow {
def main(args: Array[String]): Unit = {
val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HostCategoryTop10")
val sc = new SparkContext(sparkConf)
val rdd: RDD[String] = sc.textFile("datas/spark-core/user_visit_action.txt")
val dataRDD: RDD[UserVisitAction] = rdd.map(
action => {
val datas = action.split("_")
UserVisitAction(
datas(0),
datas(1).toLong,
datas(2),
datas(3).toLong,
datas(4),
datas(5),
datas(6).toLong,
datas(7).toLong,
datas(8),
datas(9),
datas(10),
datas(11),
datas(12).toLong
)
}
)
dataRDD.cache()
val ids = List(1L, 2L, 3L, 4L, 5L, 6L, 7L)
val okFlowIds = ids.zip(ids.tail)
val pageIdToCountMap: Map[Long, Long] = dataRDD.filter(action => ids.init.contains(action.page_id))
.map(action => (action.page_id, 1L))
.reduceByKey(_ + _).collect.toMap
val sessionRDD: RDD[(String, Iterable[UserVisitAction])] = dataRDD.groupBy(_.session_id)
val mvRDD: RDD[(String, List[((Long, Long), Int)])] = sessionRDD.mapValues(
iter => {
val sortList: List[UserVisitAction] = iter.toList.sortBy(_.action_time)
val flowIds: List[Long] = sortList.map(_.page_id)
val pageFlowIdList: List[(Long, Long)] = flowIds.zip(flowIds.tail)
pageFlowIdList.filter(
t => okFlowIds.contains(t)
).map((_, 1))
}
)
val flatRdd: RDD[((Long, Long), Int)] = mvRDD.map(_._2).flatMap(list => list)
val reduceRDD: RDD[((Long, Long), Int)] = flatRdd.reduceByKey(_ + _)
reduceRDD.foreach {
case ((page1, page2), sum) =>
// foreach runs on the executors; with local[*] the println output still appears in this console.
val total: Long = pageIdToCountMap.getOrElse(page1, 0L)
println(s"Single-jump conversion rate from page ${page1} to page ${page2}: ${sum.toDouble / total}")
}
sc.stop()
}
case class UserVisitAction(
date: String,
user_id: Long,
session_id: String,
page_id: Long,
action_time: String,
search_keyword: String,
click_category_id: Long,
click_product_id: Long,
order_category_ids: String,
order_product_ids: String,
pay_category_ids: String,
pay_product_ids: String,
city_id: Long
)
}
Single-jump conversion rate from page 2 to page 3: 0.019949423995504357
Single-jump conversion rate from page 4 to page 5: 0.018323153803442533
Single-jump conversion rate from page 1 to page 2: 0.01510989010989011
Single-jump conversion rate from page 3 to page 4: 0.016884531590413945
Single-jump conversion rate from page 5 to page 6: 0.014594442885209093
Single-jump conversion rate from page 6 to page 7: 0.0192040077929307
Structuring the code for production
The layout below separates a job into a reusable application bootstrap (TApplication), a controller layer (TController), a service layer (TService), a data-access layer (TDao), and an EnvUtil object that keeps the SparkContext in a ThreadLocal so the lower layers can reach it without passing it around explicitly.
TApplication
trait TApplication {
def start(master: String = "local[*]", appName: String = "Application")(op: => Unit): Unit = {
val sparkConf = new SparkConf().setMaster(master).setAppName(appName)
val sc = new SparkContext(sparkConf)
EnvUtil.put(sc)
try {
op
} catch {
case ex: Exception => println(ex.getMessage)
}
sc.stop()
EnvUtil.clear()
}
}
TController
trait TController {
def dispatch(): Any
}
TService
trait TService {
def dataAnalysis(): Any
}
TDao
trait TDao {
def readFile(path: String): RDD[String] = {
EnvUtil.take().textFile(path)
}
}
EnvUtil
object EnvUtil {
private val scLocal = new ThreadLocal[SparkContext]()
def put(sc: SparkContext): Unit = {
scLocal.set(sc)
}
def take(): SparkContext = {
scLocal.get()
}
def clear(): Unit = {
scLocal.remove()
}
}
WordCountApplication
object WordCountApplication extends App with TApplication{
start() {
val controller = new WordCountController
controller.dispatch()
}
}
WordCountController
class WordCountController extends TController{
private val service = new WordCountService
def dispatch (): Unit = {
val wordCount: Array[(String, Int)] = service.dataAnalysis()
wordCount.foreach(println)
}
}
WordCountService
class WordCountService extends TService{
private val dao = new WordCountDao
def dataAnalysis(): Array[(String, Int)] = {
val lines = dao.readFile("datas/spark-core/wordCount")
val words = lines.flatMap(_.split(" "))
val wordToOne = words.map(word => (word, 1))
val wordGroup = wordToOne.groupBy(word => word._1)
wordGroup.map {
case (word, list) => {
val tuple = list.reduce((a, b) => {
(word, a._2 + b._2)
})
tuple
}
}.collect
}
}
WordCountDao
class WordCountDao extends TDao{
}
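As a hedged illustration (all of the class names below are invented for this sketch and are not part of the original code), the Top-10 category job from requirement one could be plugged into the same three layers like this, with the Dao reading the file through EnvUtil and the Service reusing the single-pass flatMap + reduceByKey logic of the third implementation:

object HotCategoryTop10Application extends App with TApplication {
  start() {
    val controller = new HotCategoryTop10Controller
    controller.dispatch()
  }
}

class HotCategoryTop10Controller extends TController {
  private val service = new HotCategoryTop10Service
  override def dispatch(): Unit = service.dataAnalysis().foreach(println)
}

class HotCategoryTop10Service extends TService {
  private val dao = new HotCategoryTop10Dao
  // Same counting logic as the third implementation above, just behind the service layer.
  override def dataAnalysis(): Array[(String, (Int, Int, Int))] = {
    val rdd = dao.readFile("datas/spark-core/user_visit_action.txt")
    rdd.flatMap(action => {
      val datas = action.split("_")
      if (datas(6) != "-1") List((datas(6), (1, 0, 0)))
      else if (datas(8) != "null") datas(8).split(",").map((_, (0, 1, 0)))
      else if (datas(10) != "null") datas(10).split(",").map((_, (0, 0, 1)))
      else Nil
    }).reduceByKey((t1, t2) => (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3))
      .sortBy(_._2, false)
      .take(10)
  }
}

class HotCategoryTop10Dao extends TDao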
