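Introductory Flink Programming Practice: Big Data Fundamentals Programming Lab 8
I. Objectives
(1) Learn the basic Flink programming approach through hands-on practice.
(2) Learn how to write Flink programs with IntelliJ IDEA.
II. Platform
(1) Server: Ubuntu 16.04.
(2) Personal computer: macOS 10.15.6.
(3) IntelliJ IDEA 2021.2.2.
(4) Flink 1.9.1.
III. Procedure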
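1. Developing the WordCount program with IntelliJ IDEA
Install IntelliJ IDEA on the macOS personal computer, use it to develop the WordCount program, package the program as a JAR file, and submit the JAR to Flink on the server. The steps are as follows.
Create a new project: File -> New -> Project.
Select Maven, and set Project SDK to the same Java version as the one installed on the server.
Enter the project name FlinkWordCount and finish the wizard.
Open a browser and go to https://mvnrepository.com to look up the required Maven dependencies. Type "flink java" into the search box and open the first result, Flink : Java.
Find the entry for the Flink version in use (1.9.1 here) and open it.
Copy the dependency snippet into pom.xml.
Repeat these steps for flink-streaming-java and flink-clients, copying their 1.9.1 dependency snippets into pom.xml. Note that all <dependency> elements must be wrapped in a single outer <dependencies> element. Once every dependency has been added, choose Reload in the Maven tool window at the far right of the IDEA window.
The final pom.xml is: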
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>FlinkWordCount</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.9.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>1.9.1</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>1.9.1</version>
        </dependency>
    </dependencies>
</project>
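Under src -> main -> java, create a new Package and enter a name for it (the code below uses WordCount).
In the new package, create a Java Class named "WordCountData".
WordCountData.java supplies the raw input data. Its contents are: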
package WordCount;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class WordCountData {
    public static final String[] WORDS = new String[]{
        "To be, or not to be,--that is the question:--",
        "Whether 'tis nobler in the mind to suffer",
        "The slings and arrows of outrageous fortune",
        "Or to take arms against a sea of troubles,",
        "And by opposing end them?--To die,--to sleep,--",
        "No more; and by a sleep to say we end",
        "The heartache, and the thousand natural shocks",
        "That flesh is heir to,--'tis a consummation",
        "Devoutly to be wish'd. To die,--to sleep;--",
        "To sleep! perchance to dream:--ay, there's the rub;",
        "For in that sleep of death what dreams may come,",
        "When we have shuffled off this mortal coil,",
        "Must give us pause: there's the respect",
        "That makes calamity of so long life;",
        "For who would bear the whips and scorns of time,",
        "The oppressor's wrong, the proud man's contumely,",
        "The pangs of despis'd love, the law's delay,",
        "The insolence of office, and the spurns",
        "That patient merit of the unworthy takes,",
        "When he himself might his quietus make",
        "With a bare bodkin? who would these fardels bear,",
        "To grunt and sweat under a weary life,",
        "But that the dread of something after death,--",
        "The undiscover'd country, from whose bourn",
        "No traveller returns,--puzzles the will,",
        "And makes us rather bear those ills we have",
        "Than fly to others that we know not of?",
        "Thus conscience does make cowards of us all;",
        "And thus the native hue of resolution",
        "Is sicklied o'er with the pale cast of thought;",
        "And enterprises of great pith and moment,",
        "With this regard, their currents turn awry,",
        "And lose the name of action.--Soft you now!",
        "The fair Ophelia!--Nymph, in thy orisons",
        "Be all my sins remember'd."
    };

    public WordCountData() {
    }

    public static DataSet<String> getDefaultTextLineDataset(ExecutionEnvironment env) {
        return env.fromElements(WORDS);
    }
}
In the same way, create a second file, WordCountTokenizer.java. WordCountTokenizer.java splits sentences into words. Its contents are:
package WordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Splits each input line into lowercase words and emits one (word, 1) pair per word.
public class WordCountTokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
        // Split on runs of non-word characters (whitespace and punctuation).
        String[] tokens = value.toLowerCase().split("\\W+");
        for (String token : tokens) {
            // A line that begins with punctuation yields a leading empty token; skip it.
            if (token.length() > 0) {
                out.collect(new Tuple2<>(token, 1));
            }
        }
    }
}
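To see what the splitting rule in WordCountTokenizer does to the sample text, the following is a minimal plain-Java sketch of the same rule, with no Flink dependency. The class name TokenizerSketch and the method tokenize are hypothetical names chosen for illustration; they are not part of the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class: mirrors the splitting rule of WordCountTokenizer
// without any Flink types, so it can be run directly.
class TokenizerSketch {
    // Lowercase the line, split on runs of non-word characters, and drop empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("To be, or not to be,--that is the question:--"));
        System.out.println(tokenize("'tis a consummation"));
    }
}
```

The first call yields the ten words to, be, or, not, to, be, that, is, the, question. The second line starts with an apostrophe, so split produces a leading empty string; this is the case the length check guards against. The same rule also breaks a contraction such as wish'd into the two tokens wish and d.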
In the same way, create a third file, WordCount.java. WordCount.java provides the main function. Its contents are:
package WordCount;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;

public class WordCount {
    public static void main(String[] args) throws Exception {
        ParameterTool params = ParameterTool.fromArgs(args);
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Register the command-line parameters as global job parameters.
        env.getConfig().setGlobalJobParameters(params);

        // Read from --input if given; otherwise use the built-in sample text.
        DataSet<String> text;
        if (params.has("input")) {
            text = env.readTextFile(params.get("input"));
        } else {
            System.out.println("Executing WordCount example with default input data set.");
            System.out.println("Use --input to specify file input.");
            text = WordCountData.getDefaultTextLineDataset(env);
        }

        // Tokenize, group by the word (field 0), and sum the counts (field 1).
        DataSet<Tuple2<String, Integer>> counts =
                text.flatMap(new WordCountTokenizer()).groupBy(0).sum(1);

        if (params.has("output")) {
            // writeAsCsv only declares a sink; execute() runs the job.
            counts.writeAsCsv(params.get("output"), "\n", " ");
            env.execute();
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            // print() triggers execution itself, so execute() is not called here.
            counts.print();
        }
    }
}
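The pipeline in WordCount.java tokenizes each line, groups the (word, 1) pairs by word, and sums the second field. As a rough illustration of what that grouping and summing computes, here is a plain-Java sketch that does the same on a small in-memory input using a map. The class GroupSumSketch and the method countWords are hypothetical names for this illustration, and the sketch runs sequentially in one JVM, unlike a Flink job.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone class: a sequential model of
// flatMap(tokenize) -> groupBy(word) -> sum(count).
class GroupSumSketch {
    static Map<String, Integer> countWords(String[] lines) {
        // TreeMap keeps the words sorted so the printed output is deterministic;
        // a Flink job gives no such ordering guarantee.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    // Add 1 to the running total for this word (the "sum" step).
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "To be, or not to be,--that is the question:--",
            "To die,--to sleep;--"
        };
        // Print one "word count" pair per line, separated by a space.
        countWords(lines).forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```

For these two lines the word to is counted four times and be twice; every other word appears once.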
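At this point the project contains the three source files.
Right-click pom.xml in the project tree on the left, choose "Maven" from the context menu, then choose "Generate Sources and Update Folders".
Right-click pom.xml again, choose "Maven", then choose "Reload Project" or "Reimport" (the name depends on the IDEA version).
Run Build -> Build Project to compile.
Open WordCount.java, right-click inside the editor area, and choose "Run WordCount.main()".
After a successful run, the word-frequency results are displayed.
Next, compile and package the code into a JAR file. This requires some preparation. Open the Settings dialog.
Adjust the configuration there.
Then open the Project Structure dialog and configure it.
Click Open Folders to the right of Main Class.
Type WordCount into the search box; the main class is found automatically. Double-click the search result.
Then set the META-INF directory.
Check "Include in project build".
Open the build menu and build the package.
After a successful build, the generated FlinkWordCount.jar file appears.
In a terminal, send the project directory to the /usr/local/IdeaProjects folder on the server.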
cd /Users/huanglijie/IdeaProjects
scp -r FlinkWordCount hadoop@101.132.242.168:/usr/local/IdeaProjects
Run the JAR file with Flink on the server. The job completes successfully.
2. Streaming word count
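Use the NC (netcat) program that ships with Linux to simulate a data stream that continuously produces words and sends them out. Write a Flink program that processes the words arriving from NC in real time, computes word frequencies, and outputs the results. The program is first developed and debugged in IntelliJ IDEA, then packaged as a JAR and deployed to Flink.
Following the workflow of the FlinkWordCount project, create a new project named "FlinkWordCount2" in IntelliJ IDEA. Create a pom.xml file with the same content as the pom.xml of the FlinkWordCount project. Create a source file WordCount.java with the following contents: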
package WordCount;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WordCount {
    public static void main(String[] args) throws Exception {
        // Read the port from --port; fall back to 9000 if it is missing or invalid.
        int port;
        try {
            ParameterTool parameterTool = ParameterTool.fromArgs(args);
            port = parameterTool.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified; using the default port 9000.");
            port = 9000;
        }

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Connect to the socket opened by NC and read newline-delimited text.
        DataStreamSource<String> text = env.socketTextStream("127.0.0.1", port, "\n");

        DataStream<WordWithCount> windowCount = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
                        // Emit one (word, 1) record per whitespace-separated word.
                        String[] splits = value.split("\\s");
                        for (String word : splits) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                // Sliding window: 2 seconds long, evaluated every 1 second.
                .timeWindow(Time.seconds(2), Time.seconds(1))
                .sum("count");

        // Print with parallelism 1 so all output goes through a single task.
        windowCount.print().setParallelism(1);
        env.execute("streaming word count");
    }

    // POJO holding a word and its count; public fields and a no-arg constructor
    // let Flink address the fields by name in keyBy("word") and sum("count").
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {
        }

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return "WordWithCount{" +
                    "word='" + word + '\'' +
                    ", count=" + count +
                    '}';
        }
    }
}
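The call timeWindow(Time.seconds(2), Time.seconds(1)) defines a sliding window that is 2 seconds long and advances every 1 second, so consecutive windows overlap and each incoming word contributes to more than one window. The sketch below is a simplified plain-Java model of that assignment, assuming non-negative millisecond timestamps and windows aligned to multiples of the slide. SlidingWindowSketch and windowStarts are hypothetical names for illustration; this is not Flink's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class: models which sliding windows contain a timestamp.
class SlidingWindowSketch {
    // Returns the start time (ms) of every window [start, start + sizeMs)
    // that contains timestampMs, newest window first.
    // Assumes timestampMs >= 0 and window starts aligned to multiples of slideMs.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampMs - (timestampMs % slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A word arriving at t = 2500 ms with a 2 s window sliding every 1 s.
        System.out.println(windowStarts(2500L, 2000L, 1000L));
    }
}
```

With a 2-second size and a 1-second slide, a word arriving at 2500 ms falls into the windows starting at 2000 ms and 1000 ms. Under this model every word is counted in two consecutive windows, which is one reason the same word can show up in two successive outputs of the streaming job.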
For this project, the dependencies in pom.xml should be:
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.9.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.9.1</version>
    <scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.12</artifactId>
    <version>1.9.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-core -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>1.9.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-runtime -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-runtime_2.12</artifactId>
    <version>1.9.1</version>
    <scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.12</artifactId>
    <version>1.9.1</version>
</dependency>
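Build the project in the same way as FlinkWordCount. The one difference is that under Project Structure -> Modules, the scope of every Maven dependency must be changed to Compile.
First open a terminal connected to the server, then start NC: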
nc -lk 9000
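While debugging in IDEA, keep typing text into that terminal; the word counts are printed as the input arrives. Note: starting the debug run before NC is running causes an error.
Package the project in the same way as FlinkWordCount and upload it to the server.
Start the FlinkWordCount2 word count program on the server.
Then, in the NC window, type "hello world" several times.
The results can now be viewed in a browser. On the personal computer, open a browser and go to "http://101.132.242.168:8081" (101.132.242.168 is the server's public IP) to reach the Flink web management page. Click "Task Managers" on the left; a new page opens on the right. Click the link on that page.
The word-frequency results are shown under Stdout.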