Flink Tutorial (16) - Flink Table and SQL (author: 杨林伟)


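Contents
01 Introduction
02 Introduction to Table API & SQL
  2.1 The Flink Table Module
  2.2 Features of Table API & SQL
  2.3 History of Table API & SQL
03 Development Setup
  3.1 Adding Dependencies
  3.2 Program Structure
  3.3 API
    3.3.1 Getting the Environment
    3.3.3 Creating Tables
    3.3.4 Querying Tables
    3.3.5 Emitting Tables
    3.3.6 Integration with DataSet/DataStream
      3.3.6.1 Creating a View from a DataStream or DataSet
      3.3.6.2 Converting a DataStream or DataSet into a Table
      3.3.6.3 Converting a Table into a DataStream
      3.3.6.4 Converting a Table into a DataSet
    3.3.7 Table API
    3.3.8 SQL
04 Key Concepts
  4.1 Dynamic Tables and Continuous Queries
  4.2 Converting Between Tables and Streams
    4.2.1 Update and Delete in Tables
    4.2.2 Encoding Table Operations
    4.2.3 Converting a Table into Streams with Three Different Encodings
      4.2.3.1 Append-only Stream
      4.2.3.2 Retract Stream
      4.2.3.3 Upsert Stream
05 Closing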

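01 Introduction

The previous posts in this series covered some of Flink's advanced APIs. Interested readers can refer to them:

Flink Tutorial (01) - Flink Knowledge Map
Flink Tutorial (02) - Getting Started with Flink
Flink Tutorial (03) - Setting Up a Flink Environment
Flink Tutorial (04) - A First Flink Example
Flink Tutorial (05) - A Brief Look at How Flink Works
Flink Tutorial (06) - Flink Unified Batch/Stream API (Source Examples)
Flink Tutorial (07) - Flink Unified Batch/Stream API (Transformation Examples)
Flink Tutorial (08) - Flink Unified Batch/Stream API (Sink Examples)
Flink Tutorial (09) - Flink Unified Batch/Stream API (Connector Examples)
Flink Tutorial (10) - Flink Unified Batch/Stream API (Miscellaneous)
Flink Tutorial (11) - Flink Advanced API (Window)
Flink Tutorial (12) - Flink Advanced API (Time and Watermark)
Flink Tutorial (13) - Flink Advanced API (State Management)
Flink Tutorial (14) - Flink Advanced API (Fault Tolerance)
Flink Tutorial (15) - Flink Advanced API (Parallelism)

With the unified batch/stream API and the advanced APIs covered, this post turns to Flink's Table API and SQL.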

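02 Introduction to Table API & SQL

2.1 The Flink Table Module

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/

Flink's Table module consists of the Table API and SQL:

- Table API: a SQL-like API that lets users operate on data as if it were a table, which is intuitive and convenient.
- SQL: a declarative language with standard syntax and conventions. Users can process data without caring about the underlying implementation, so it is easy to pick up.

About 80% of the code behind the Table API and SQL implementations is shared. Since Flink is a unified batch and stream engine, its Runtime layer is unified as well.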

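2.2 Features of Table API & SQL

Flink chose Table API & SQL as its future core API because of several important properties:

- Declarative: users state what to compute, not how to compute it.
- High performance: queries go through an optimizer, which can yield better execution performance.
- Unified batch and streaming: the same logic can run in streaming mode or in batch mode.
- Standard and stable: the semantics follow the SQL standard and rarely change.
- Easy to understand: the semantics are explicit, and what you see is what you get.

Usage example:

Table API: tab.groupBy("word").select("word, count(1) as count")
SQL:       SELECT word, COUNT(*) AS cnt FROM MyTable GROUP BY word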

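2.3 History of Table API & SQL

Starting in 2015, Alibaba evaluated open-source stream processing engines and eventually decided to build its next-generation compute engine on Flink. It optimized and improved on Flink's shortcomings and open-sourced the resulting code, known as Blink, in early 2019.

Blink's most notable contribution on top of the original Flink is its implementation of Flink SQL.

Architecture upgrade and choice of query processor: since Flink 1.11, the Blink query processor (the Blink planner) has been the default.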

03 Development Setup

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/

3.1 Adding Dependencies

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.12</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.12</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>

Explanation:

- flink-table-common: mainly contains code shared by the Flink planner and the Blink planner.
- flink-table-api-java: the API that users program against; it contains most of the API.
- flink-table-api-scala: a very thin layer that only concerns the Expression and DSL parts of the Table API.
- Two planners: flink-table-planner and flink-table-planner-blink.
- Two bridges: flink-table-api-scala-bridge and flink-table-api-java-bridge.

Both the Flink planner and the Blink planner depend on the Java API and on a specific bridge. The bridge translates API operations into the Scala DataStream/DataSet or the Java DataStream/DataSet.

3.2 Program Structure

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/common.html#structure-of-table-api-and-sql-programs

    // create a TableEnvironment for specific planner batch or streaming
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // create a Table
    tableEnv.connect(...).createTemporaryTable("table1");
    // register an output Table
    tableEnv.connect(...).createTemporaryTable("outputTable");

    // create a Table object from a Table API query
    Table tapiResult = tableEnv.from("table1").select(...);
    // create a Table object from a SQL query
    Table sqlResult = tableEnv.sqlQuery("SELECT ... FROM table1 ... ");

    // emit a Table API result Table to a TableSink, same for SQL result
    TableResult tableResult = tapiResult.executeInsert("outputTable");
    tableResult...

3.3 API

3.3.1 Getting the Environment

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/common.html#create-a-tableenvironment

    // **********************
    // FLINK STREAMING QUERY
    // **********************
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    EnvironmentSettings fsSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build();
    StreamExecutionEnvironment fsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(fsEnv, fsSettings);
    // or TableEnvironment fsTableEnv = TableEnvironment.create(fsSettings);

    // ******************
    // FLINK BATCH QUERY
    // ******************
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;

    ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
    BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);

    // **********************
    // BLINK STREAMING QUERY
    // **********************
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    StreamExecutionEnvironment bsEnv = StreamExecutionEnvironment.getExecutionEnvironment();
    EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
    StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(bsEnv, bsSettings);
    // or TableEnvironment bsTableEnv = TableEnvironment.create(bsSettings);

    // ******************
    // BLINK BATCH QUERY
    // ******************
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    EnvironmentSettings bbSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
    TableEnvironment bbTableEnv = TableEnvironment.create(bbSettings);

3.3.3 Creating Tables

    // get a TableEnvironment
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // table is the result of a simple projection query
    Table projTable = tableEnv.from("X").select(...);

    // register the Table projTable as table "projectedTable"
    tableEnv.createTemporaryView("projectedTable", projTable);

    // create a table from an external connector
    tableEnv
        .connect(...)
        .withFormat(...)
        .withSchema(...)
        .inAppendMode()
        .createTemporaryTable("MyTable");

3.3.4 Querying Tables

Table API:

    // $ comes from: import static org.apache.flink.table.api.Expressions.$;

    // get a TableEnvironment
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // register Orders table

    // scan registered Orders table
    Table orders = tableEnv.from("Orders");

    // compute revenue for all customers from France
    Table revenue = orders
        .filter($("cCountry").isEqual("FRANCE"))
        .groupBy($("cID"), $("cName"))
        .select($("cID"), $("cName"), $("revenue").sum().as("revSum"));

    // emit or convert Table
    // execute query

SQL:

    // get a TableEnvironment
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // register Orders table

    // compute revenue for all customers from France
    Table revenue = tableEnv.sqlQuery(
        "SELECT cID, cName, SUM(revenue) AS revSum " +
        "FROM Orders " +
        "WHERE cCountry = 'FRANCE' " +
        "GROUP BY cID, cName"
    );

    // emit or convert Table
    // execute query

    // get a TableEnvironment
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // register "Orders" table
    // register "RevenueFrance" output table

    // compute revenue for all customers from France and emit to "RevenueFrance"
    tableEnv.executeSql(
        "INSERT INTO RevenueFrance " +
        "SELECT cID, cName, SUM(revenue) AS revSum " +
        "FROM Orders " +
        "WHERE cCountry = 'FRANCE' " +
        "GROUP BY cID, cName"
    );

3.3.5 Emitting Tables

    // get a TableEnvironment
    TableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // create an output Table
    final Schema schema = new Schema()
        .field("a", DataTypes.INT())
        .field("b", DataTypes.STRING())
        .field("c", DataTypes.BIGINT());

    tableEnv.connect(new FileSystem().path("/path/to/file"))
        .withFormat(new Csv().fieldDelimiter('|').deriveSchema())
        .withSchema(schema)
        .createTemporaryTable("CsvSinkTable");

    // compute a result Table using Table API operators and/or SQL queries
    Table result = ...

    // emit the result Table to the registered TableSink
    result.executeInsert("CsvSinkTable");

3.3.6 Integration with DataSet/DataStream

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/common.html#integration-with-datastream-and-dataset-api

3.3.6.1 Creating a View from a DataStream or DataSet

Create a View from a DataStream or DataSet:

    // get StreamTableEnvironment
    // registration of a DataSet in a BatchTableEnvironment is equivalent
    StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    DataStream<Tuple2<Long, String>> stream = ...

    // register the DataStream as View "myTable" with fields "f0", "f1"
    tableEnv.createTemporaryView("myTable", stream);

    // register the DataStream as View "myTable2" with fields "myLong", "myString"
    tableEnv.createTemporaryView("myTable2", stream, $("myLong"), $("myString"));

3.3.6.2 Converting a DataStream or DataSet into a Table

Convert a DataStream or DataSet into a Table:

    // get StreamTableEnvironment
    // registration of a DataSet in a BatchTableEnvironment is equivalent
    StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    DataStream<Tuple2<Long, String>> stream = ...

    // Convert the DataStream into a Table with default fields "f0", "f1"
    Table table1 = tableEnv.fromDataStream(stream);

    // Convert the DataStream into a Table with fields "myLong", "myString"
    Table table2 = tableEnv.fromDataStream(stream, $("myLong"), $("myString"));

3.3.6.3 Converting a Table into a DataStream

Convert a Table into a DataStream:

- Append Mode: can only be used when the dynamic table is modified solely by INSERT changes, i.e. it is append-only and previously emitted results are never updated.
- Retract Mode: can always be used. It encodes INSERT and DELETE changes with a boolean flag.

    // get StreamTableEnvironment.
    StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

    // Table with two fields (String name, Integer age)
    Table table = ...

    // convert the Table into an append DataStream of Row by specifying the class
    DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);

    // convert the Table into an append DataStream of Tuple2<String, Integer>
    // via a TypeInformation
    TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
        Types.STRING(),
        Types.INT());
    DataStream<Tuple2<String, Integer>> dsTuple =
        tableEnv.toAppendStream(table, tupleType);

    // convert the Table into a retract DataStream of Row.
    // A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
    // The boolean field indicates the type of the change.
    // True is INSERT, false is DELETE.
    DataStream<Tuple2<Boolean, Row>> retractStream =
        tableEnv.toRetractStream(table, Row.class);

3.3.6.4 Converting a Table into a DataSet

Convert a Table into a DataSet:

    // get BatchTableEnvironment
    BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);

    // Table with two fields (String name, Integer age)
    Table table = ...

    // convert the Table into a DataSet of Row by specifying a class
    DataSet<Row> dsRow = tableEnv.toDataSet(table, Row.class);

    // convert the Table into a DataSet of Tuple2<String, Integer> via a TypeInformation
    TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
        Types.STRING(),
        Types.INT());
    DataSet<Tuple2<String, Integer>> dsTuple =
        tableEnv.toDataSet(table, tupleType);

3.3.7 Table API

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/tableApi.html

3.3.8 SQL

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/sql/

04 Key Concepts

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/dynamic_tables.html

4.1 Dynamic Tables and Continuous Queries

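Flink calls a table defined over an unbounded stream a Dynamic Table. It is the core concept of Flink's Table API and SQL and, as the name suggests, it means that the table's contents keep changing.

One way to think about it: when we create a table through the Flink API, we are really defining a logical structure that has to be mapped onto data.

A Flink source delivers data continuously, which is like appending a new row to the table each time a record arrives. Once the table has data, we can query it with SQL. Note that data in stream processing is only ever added, so rows appear to be appended to the table endlessly.

A dynamic table is still a table, and a table should be queryable. Recall how we usually query a table: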

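1. Open an editor and write a SQL statement.
2. Run the statement in a MySQL terminal.
3. Look at the result.
4. Write another SQL statement.
5. Run it in the terminal.
6. Look at the result again.
7. ... and so on, over and over.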

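With a dynamic table, data keeps flowing in from the Flink source. We define a table on top of that stream and write a SQL query to process the data. That query has to run continuously rather than just once.

Note: a SQL query over a stream never finishes after producing a single result the way a batch query does. It keeps running and keeps producing results. The official documentation calls this a Continuous Query, and the result of each evaluation keeps changing as well. The accompanying diagram shows how dynamic tables and continuous queries together implement SQL over an unbounded stream.

In the diagram, a State box sits above the Continuous Query, meaning that the query's results are kept in state. After that, Flink still processes the data as a stream.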
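To make the idea of a continuous query backed by state more concrete, here is a minimal plain-Java sketch that does not use Flink at all. It mimics the query `SELECT word, COUNT(*) AS cnt FROM MyTable GROUP BY word` from section 2.2: every arriving row updates a state map, and the current result table is produced again. The class and method names (`ContinuousQuerySketch`, `onRow`) are hypothetical, and a real Flink job keeps this state in managed state rather than in a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; this is not Flink API.
class ContinuousQuerySketch {
    // "State": the running count per word, kept between rows.
    private final Map<String, Long> state = new LinkedHashMap<>();

    // Called once per arriving row; returns a snapshot of the result table.
    Map<String, Long> onRow(String word) {
        state.merge(word, 1L, Long::sum);
        return new LinkedHashMap<>(state);
    }

    public static void main(String[] args) {
        ContinuousQuerySketch query = new ContinuousQuerySketch();
        // The "unbounded stream" is stood in for by a short array of rows.
        for (String word : new String[] {"flink", "sql", "flink"}) {
            System.out.println("row=" + word + " -> result=" + query.onRow(word));
        }
    }
}
```

The point of the sketch is that the query is never "finished": each new row produces a new version of the result, and the state is what makes that incremental.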

So we can think of Flink's Table API and SQL as a logical model that makes data processing simpler.

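4.2 Converting Between Tables and Streams

4.2.1 Update and Delete in Tables

The tables discussed so far are continuously appended to: their data only accumulates, because they are connected to a source, and a source never produces updates. But suppose we write a SQL query like this:

    SELECT `user`, SUM(money) FROM `order` GROUP BY `user`;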

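The result of executing a SQL query is itself a table, and because SQL in Flink runs as a continuous query, the data in that table keeps changing. The newly created table can therefore be subject to updates. Look closely at the following example:

- First record: Zhang San, 2000. The query result is: Zhang San, 2000
- Second record: Li Si, 1500. The query result becomes: Zhang San, 2000 | Li Si, 1500
- Third record: Zhang San, 300. The query result becomes: Zhang San, 2300 | Li Si, 1500
- ...

Notice that the result now contains an update: Zhang San started at 2000 and later became 2300.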

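Can rows also be deleted? Yes. Consider this SQL query:

    SELECT t1.`user`, SUM(t1.`money`)
    FROM t_order t1
    WHERE NOT EXISTS (
        SELECT t2.`user`
        FROM t_order t2
        WHERE t2.`user` = t1.`user`
        GROUP BY t2.`user`
        HAVING SUM(t2.`money`) > 3000)
    GROUP BY t1.`user`

- First record: Zhang San, 2000. The query result is: Zhang San, 2000
- Second record: Li Si, 1500. The query result becomes: Zhang San, 2000 | Li Si, 1500
- Third record: Zhang San, 300. The query result becomes: Zhang San, 2300 | Li Si, 1500
- Fourth record: Zhang San, 800. The query result becomes: Li Si, 1500

Because Zhang San's total spending now exceeds 3000, Zhang San is filtered out once the query is evaluated. From the data's point of view, that row has been deleted.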
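The four-record walkthrough can be replayed with a small plain-Java sketch. It is not Flink code and says nothing about how Flink evaluates the query internally; it only recomputes the visible result table after each order, using a hypothetical class `UpdateDeleteSketch` and the threshold of 3000 from the HAVING clause.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration only; this is not Flink API.
class UpdateDeleteSketch {
    private static final long LIMIT = 3000; // threshold from the HAVING clause
    // SUM(money) per user over all orders seen so far.
    private final Map<String, Long> totals = new LinkedHashMap<>();

    // Applies one incoming order and returns the visible result table.
    // Users whose total exceeds LIMIT are left out, mirroring NOT EXISTS (... HAVING SUM > 3000).
    Map<String, Long> onOrder(String user, long money) {
        totals.merge(user, money, Long::sum);
        Map<String, Long> visible = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : totals.entrySet()) {
            if (entry.getValue() <= LIMIT) {
                visible.put(entry.getKey(), entry.getValue());
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        UpdateDeleteSketch query = new UpdateDeleteSketch();
        System.out.println(query.onOrder("Zhang San", 2000)); // first row appears
        System.out.println(query.onOrder("Li Si", 1500));     // second row appears
        System.out.println(query.onOrder("Zhang San", 300));  // Zhang San's row is updated
        System.out.println(query.onOrder("Zhang San", 800));  // Zhang San's row disappears
    }
}
```

Comparing consecutive snapshots shows all three kinds of change: a row appearing, a row changing value, and a row vanishing from the result.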

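These two examples show that in Flink SQL, tables connected to a source are append-only and keep growing, whereas a table produced by a SQL query may see UPDATE and DELETE changes in addition to INSERT.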

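4.2.2 Encoding Table Operations

As mentioned earlier, a table is a logical structure while the core of Flink is still the stream, so a Table is ultimately processed as a stream. The data in that stream may then be written to an external system, for example MySQL.

We have also seen that a table may be updated and deleted from, so writing to MySQL would require UPDATE and DELETE statements. Yet, as covered earlier in this series, a DataStream cannot update or delete events.

If the operation on the table is an INSERT, things are simple: convert and emit directly, since a DataStream also only grows. But if rows in a table are updated or deleted, how can that be expressed as a stream? Because a stream is immutable, tables that allow UPDATE/DELETE need special handling.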

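Solution: represent each operation (INSERT, UPDATE, DELETE) as one or more encoded events. For example:

- UPDATE is expressed as two events, [DELETE] old row + [INSERT] new row: the previous row is removed first, then the new row is inserted.
- DELETE is encoded in the stream as a single [DELETE] event for the row.

In short, encoding the records in the stream tells the consumer downstream of the DataStream what to do: [DELETE] means issue a DELETE against MySQL to remove the row, and [INSERT] means insert a new row.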
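The encoding idea can be sketched in plain Java. This is an illustration under stated assumptions, not Flink's implementation: the class `RetractEncodingSketch`, the method `onOrder`, and the string event format are all hypothetical. It encodes changes to the per-user sum from section 4.2.1, turning an update into a [DELETE] of the old row followed by an [INSERT] of the new one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not Flink API.
class RetractEncodingSketch {
    // Last emitted SUM(money) per user.
    private final Map<String, Long> totals = new HashMap<>();

    // Encodes the effect of one order on the per-user sum as change events.
    List<String> onOrder(String user, long money) {
        List<String> events = new ArrayList<>();
        Long previous = totals.get(user);
        long current = (previous == null ? 0L : previous) + money;
        if (previous != null) {
            // UPDATE: first retract the row that was emitted earlier ...
            events.add("[DELETE] " + user + "," + previous);
        }
        // ... then insert the new row (a first-time key is a plain INSERT).
        events.add("[INSERT] " + user + "," + current);
        totals.put(user, current);
        return events;
    }

    public static void main(String[] args) {
        RetractEncodingSketch encoder = new RetractEncodingSketch();
        System.out.println(encoder.onOrder("Zhang San", 2000));
        System.out.println(encoder.onOrder("Li Si", 1500));
        System.out.println(encoder.onOrder("Zhang San", 300));
    }
}
```

A downstream writer could map each event to the matching statement, for example a [DELETE] event to a DELETE and an [INSERT] event to an INSERT against the target table.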

4.2.3 Converting a Table into Streams with Three Different Encodings

Flink's Table API and SQL support three encodings:

- Append-only stream
- Retract stream
- Upsert stream

4.2.3.1 Append-only Stream