[大数据] 【调度】【公开源码】2013-Adaptive Online Scheduling in Storm

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> 【调度】【公开源码】2013-Adaptive Online Scheduling in Storm -> 正文阅读

[大数据]【调度】【公开源码】2013-Adaptive Online Scheduling in Storm

Storm模型：

 	A Storm application is modeled as a topology, i.e. a graph where nodes are operators and edges represent data flows among such operators. 
storm的应用建模为一个拓扑结构，DAG有向无环图，其中图中的节点是运算符，边代表着运算符之间的数据流

 	A Storm cluster can run topologies (Storm’s jargon for an application) made up of several processing components. Components of a topology can be either spouts, that act as event producers, or bolts that implement the processing logic. 

	Events emitted by a spout constitute a stream that can be transformed by passing through one or multiple bolts where its events are processed.
	Therefore, a topology represents a graph of stream transformations. When a topology is submitted to Storm it schedules its execution in the cluster, i.e., it assigns the execution of each spout and each bolt to one of the nodes forming the cluster.
	一个拓扑代表传输数据流的图。当拓扑提交给storm后，他会在集群中调度执行。
	即：把每个spout、bolt组件的executors分配给集群中的worker node物理节点
  A computation in Storm is represented by a topology, that is a graph where nodes are operators that encapsulate processing logic and edges model data flows among operators. In the Storm’s jargon, such a node is called a component. The unit of information that is exchanged among components is referred to as a tuple, that is a named list of values. There are two types of components: 
  在组件之间交换的信息单位称为一个元组，即一个命名的值列表。
  (i) spouts, that model event sources and usually wrap the actual generators of input events so as to provide a common mechanism to feed data into a topology, and 
  (ii) bolts, that encapsulate the specific processing logic such as filtering, transforming and correlating tuples.
  封装特定的处理逻辑，如过滤、转换和关联元组。
	The software component of nimbus in charge of deciding how to deploy a topology is called scheduler. On the basis of the topology configuration, the scheduler has to perform the deployment in two consecutive phases: (1) assign executors to workers, (2) assign workers to slots.
	nimbus中的调度模块：负责如何部署拓扑组件。在拓扑配置的基础上，调度器分为两个阶段来进行：1.分配executor到workers 2.分配workers到slots

本文中关注点：

	Requiring two distinct levels, one for tasks and one for executors, is dictated by a requirement on dynamic rebalancing that consists in giving the possibility at runtime to scale out a topology on a larger number of processes (workers) and threads (executors). 
	Changing at runtime the number of tasks for a given component would complicate the reconfiguration of the communication patterns among tasks, in particular in the case of fields grouping where each task should repartition its input stream, and possibly its state, accordingly to the new configuration. 
	Introducing the level of executors allows to keep the number of tasks fixed. The limitation of this design choice consists in the topology developer to overestimate the number of tasks in order to account for possible future rebalances. In this paper we don’t focus on this kind of rebalances, and to keep things simple we always consider topologies where the number of tasks for any component is equal to the number of executors, that is each executor includes exactly one task.
再平衡角度上讲两种不同的级别：1.任务tasks  2.executors
再平衡提供了在运行时更多的worker进程和executor线程上拓展拓扑的可能性。
引入executor级别可以保持task数量的固定。这种设计的局限性是在拓扑设计的时候要高估一些任务task的数量，因为调整executor的数量最多不能超过task的数量。这样以便未来可能的再平衡的计算。

在本文中没有关注这种再平衡，只考虑任何组件的task数量===executor的数量。即每个executor包含一个task

问题定义：

A key aspect in tuning Storm performance lies in the strategy used to deploy a topology, i.e. how Storm schedules the execution of each topology component on the available computing infrastructure.
调整storm性能的一个关键的方面在于：调度拓扑的策略
即：如何在可用的物理机器上调度每个拓扑的组件来进行执行

本文的方案：

两种调度器：

原理：
考虑到executor之间的通信，试图在同一个slot中放置高频率通信的executor。
在计算延迟时，由元组传输时间为主导的拓扑结构，减少通过网络传送和接收元组的通信代价有助于提高性能。
（因为位于同一slot的executor之间发送元组只是简单的传递一个指针，但运行在另一个slot中或者不同的worker node物理节点中的executor传送元组会有很大开销）

识别拓扑结构的热边，即有大量事件通过的边
将热边映射到worker进程中（例如，将热边所连接的bolt的executor调度到同一个集群的worker node物理节点上），但也必须考虑到每个节点的处理资源是有限的，否则拓扑的组件上的处理时间会很长。


1.离线调度器通过分析拓扑结构并调整调度，来离线工作
简单的分析拓扑结构图，通过观察他们的连接性，来识别出可放在同一个worker node物理节点的bolt组（或放在同一个是slot的executors）。

2.在线调度器通过持续的监控系统性能，并在运行时重新调度，来提高了性能
在运行时监控调度的有效性，并在合适的时候重新调度，监控是在运行时通过测量组件之间的流量，如果有新的调度策略来减少worker node物理节点之间的网络流量，就会计算调度，然后应用到集群中。

性能衡量指标：
the average event processing latency, i.e.,
how much time is needed for an event injected by a spout to traverse the topology and be thus fully processed, a fundamental metric used to evaluate the responsiveness of event
processing applications to incoming stimuli.
事件（stream）处理的平均延迟：
即：spout发出的事件（stream数据流）穿过整个拓扑而被完全处理所需要的时间。
这是一个基本度量用于评估事件处理应用对传入事件的响应性
 (but for negligible increased processing times when the schedule  is calculated计算调度而增加的处理时间可以忽略不计)

调度器设计基础：

 	In the general case, as shown in Figure 1, the custom scheduler takes as input the structure of the topology (provided by nimbus), represented as a weighted graph G(V, T), w, and set of user-defined additional parameters (α, β, ...). The custom scheduler computes a deployment plan which defines both the assignment of executors to workers and the allocation of workers to slots. 
 	Storm API provides the IScheduler interface to plug-in a custom scheduler, which has a single method schedule that requires two parameters. The first is an object containing the definitions of all the topologies currently running, including topology-specific parameters provided by who submitted the topology, which enables to provide the previously mentioned user-defined parameters. The second parameter is an object representing the physical cluster, with all the required information about worker nodes, slots and current allocations.
 自定义调度器以拓扑结构（nimbus提供）作为输入，表示为加权图G（V,T）、w以及用户定义的参数。
 自定义调度器目的：executor分配到worker，worker到slots分配的策略
 StormAPI提供IScheduler 接口，自定义调度器具有两个参数：
 	topologies对象包含了当前运行的所有的拓扑（提交拓扑的用户定义的拓扑参数，）
    cluster对象表示物理集群的对象，其中包含了有关worker node物理节点、slots以及当前分配的所有信息。
    调度器就是定期或者在提交新拓扑的时候执行。
    目前Storm没有提供任何方法来管理有状态组件的迁移，由开发人员实现特定的程序，将状态保存到存储中，并在重新调度后正确的重新加载他们的状态。

在这里插入图片描述

设计：

本文考虑的拓扑结构是不包含含有循环的拓扑结果，而是有向无环图（其中的任何输入元组从spout到结束处理的bolt的路径长度可以设置一个上限。）

在这里插入图片描述

两种调度的算法都可以使用参数α进行调整，调整每个slot所运行的executor的数量的平衡。
α影响着一个slot中可放置executors的最大数量M

在这里插入图片描述
请添加图片描述

第一种调度器：离线的基于拓扑的调度（源代码逻辑暂时还没看懂）：

检查拓扑结构来确定放置executor的最佳slot位置。
这种调度器是离线的，在拓扑执行之前执行的，所以不考虑负载、传输通信，cpu和内存的约束。
拓扑的组件之间顺序在流配置基础上得出：
 If a component ci emits tuples on a stream that is consumed by another component cj , then we have ci < cj . 
 如果一个组件ci发出的tuples被另一个组件cj处理了，那么ci<cj
 If ci < cj and cj < ck hold, then ci < ck holds by transitivity. Such order is partial because there can be pairs of components ci and cj such that neither ci > cj or ci < cj hold. 
 Since we deal with acyclic topologies, we can always determine a linearization φ of the components according to such partial order. 
 If ci < cj holds, then ci appears in φ before cj . 
 If neither ci < cj nor ci > cj hold, then they can appear in φ in any order. 
 The first element of φ is a spout of the topology. 
 e heuristic employed by the offline scheduler entails iterating φ and, for each component ci, placing its executors in the slots that already contain executors of the components that directly emit tuples towards ci. Finally, the slots are assigned to worker nodes in a round-robin fashion.

在这里插入图片描述

离线调度器将采用启发式方法对φ这个线性结构进行迭代，对φ立面的每一个组件ci，将它的executor放置在已经包含通信元组的executor 的slot中。？？
最后将slot以循环的方式分配给worker node物理节点。


问题：
不是所有的workers都会被使用，因为在算法的每个步骤，空的slot被忽视，因为它们不包含任何executor。？？？？？？
离线调度程序使用的解决方案包括在φ中的组件迭代期间强制在某个时间点使用空slot。
当开始考虑空slot时，由调优参数β控制，其值在[0,1]范围内：
在分配第i个组件的executor期间，如果i>[β·Ci]，调度程序被迫使用空slot。
例如，如果上游组件之间的流量可能更密集，那么β应该设置得足够大，以便在上游组件已经分配时使用空slot。

第二种调度器：在线的基于拓扑的调度：

目标：根据运行时观察到的executor之间的通信，减少节点间和slot之间的流量。将executor分配到worker node物理节点上。
调度器必须是在运行时运行的，使得分配更加适应负载的变化。
满足以下约束条件：
满足拓扑所需的worker数量 min（Wi，Ei）
么给worker node节点上可用的slots数量 Si
每个worker node节点可用的计算能力，必须使得节点间的流量最小化

架构设计：

其中 performance log 性能日志只是一个稳定的缓冲空间，每个slot中运行的监控线程所产生的的数据可以在nimbus上自定义调度器消耗之前放在这里。

在这里插入图片描述

衡量标准：（重新触发调度的两个约束条件）

1.worker node物理节点的计算能力：（利用CPU利用率来衡量一个节点的所承担的worker负载（上面运行的worker进程）和executor线程产生的负载）
CPU速度：
For example, if an executor is taking 10% CPU utilization on a 1GHz CPU, then migrating such executor on a node with 2GHz CPU would generate about 5% CPU utilization. For this reason, we measure the load in Hz. In the previous example, the executor generates a load of 100MHz (10% of 1GHz).
例如：一个线程在1GHz CPU占用了10% CPU，那么它会在2GHz CPU上占用5% CPU
用Hz为单位衡量负载，一个线程在1GHz CPU占用了10% CPU，即线程产生的负载为100Mhz （1g=1000mb）

在这里插入图片描述

CPU measurements have been implemented by leveraging standard Java API for retrieving at runtime the CPU time for a specific thread (getThreadCpuTime(threadID) method of ThreadMXBean class).
在运行时测量CPU可以通过(getThreadCpuTime(threadID) method of ThreadMXBean class).
用这种方法可以检测到节点CPU 因为过载而造成的不平衡。
 We can state that if a node ni exhibits a CPU utilization trend such that Li ≥ Bi for more than Xi seconds, then we trigger a rescheduling.
 定义：如果cpu的利用率趋势，Li ≥ Bi for more than Xi seconds超过x秒，那么就触发重新调度。
 Bi为容量（Hz为单位），xi是时间窗口（秒为单位）

不考虑IO带来的负载，对磁盘读写与外部系统DBMS的网络通信代价。

2.executors之间的通信元组量（为了使节点间流量最小化）

在这里插入图片描述

数学表示：负载均衡目标：

the goal of load balancing is to assign each executor to a slot of a node.
负载均衡的目标是： 将每个executor分配到节点的slot中
 The scheduling is aimed at computing 调度的目标：
 (i) an allocation A1 : E → W, which maps executors to workers, 
 (ii) an allocation A2 : W → N , which maps workers to nodes.

在这里插入图片描述

算法：

上述的问题是NP-完全问题
在运行时再平衡需要有一种快速机制来寻找分配策略，本文采用启发式算法：贪婪启发式算法
分为两个阶段：
1.每个拓扑的executor在拓扑所定义所需的worker数量中进行分配。
2.第一阶段所产生的worker必须分配到集群中的可用slot中。（需要考虑到节点间的流量最小化、节点负载能力的约束）

在这里插入图片描述

在第一阶段：
对于每个拓扑结构，按照executor之间的通信tuples的速率降序排序executors对
如果executors对中两个都没有被分配到worker中，那么就都分配到当前负载最小的worker中
π集合：负载最小的worker、该executor对中任何一个executor被分配到的worker
	All the possible assignments of these executors to these workers are checked to find the best one, that is the assignment that produces the lowest inter-worker traffic. At most, there can be 9 distinct possible assignments to check.
在π中迭代循环，找到最好的worker分配方案：能让executor对放进去在worker之间产生最小的通信

在这里插入图片描述

第二阶段：
通信的worker按照通信速率降序排序
如果通信的worker对都没有分配到任何worker node物理节点中，那么就分配到负载最小的节点中
如果有一个或者两个都被分配到其他节点中，那么就在π集合（这些节点、负载最小的节点），找到产生最小节点间流量的节点。

实验

The tests have been conducted both on a synthetic workload and on a real workload publicly released for the DEBS 2013 Grand Challenge.
测试是利用一个DEBS挑战赛的真实的工作负载进行的

1.用在一般拓扑结构上，展示算法的调整参数如何影响计算效率的
2.利用DEBS大赛数据集

性能评价指标：
1.平均延迟
2.在运行时产生的平均节点间流量


参考拓扑：
利用参考拓扑分析了调整参数如何影响调度算法的和拓扑性能的。

在这里插入图片描述

https://www.researchgate.net/publication/262320373_The_DEBS_2013_grand_challenge

将调度器应用在了2013年DEBS大挑战数据集（上述连接时是这个大挑战）中，实现了查询的一个子集
性能指标：平均处理延迟、在运行时产生的平均节点间流量
The topology includes three components: 
(i) a spout for the sensors (sensor component in the figure, replication factor 8, with a total of 32 sensors to be simulated), 
(ii) a bolt that computes the instantaneous speed and receives tuples by shuffle grouping (speed component in the figure, replication factor 4), 
(iii) a bolt that maintains players’ statistics and updates them as new tuples are received by fields grouping from the speed bolt (analysis component in the figure,replication factor 2).
设计的拓扑结构组件为：
1.传感器的spout（图中的传感器组件，并行度8，总共要模拟32个传感器），
2.计算瞬时速度的bolt，并通过洗牌分组接收元组（图中的速度组件，并行度8，共32个传感器要模拟
3.一个维护球员统计信息的bolt，并从瞬时速度bolt中通过字段分组接收元组，并行度2

在线调度器需要经过一个收集性能指数所需的初始瞬时期。