From 255abc7d1beda8a5a243142a7d477961621f781d Mon Sep 17 00:00:00 2001 From: Xu Huang Date: Wed, 17 Jul 2024 17:57:17 +0800 Subject: [PATCH] [FLINK-34543][docs] Add document of full partition processing on non-keyed datastream Signed-off-by: Xu Huang --- .../operators/full_window_partition.md | 104 ++++++++++++++++++ .../docs/dev/datastream/operators/overview.md | 26 +++++ .../operators/full_window_partition.md | 104 ++++++++++++++++++ .../docs/dev/datastream/operators/overview.md | 28 +++++ 4 files changed, 262 insertions(+) create mode 100644 docs/content.zh/docs/dev/datastream/operators/full_window_partition.md create mode 100644 docs/content/docs/dev/datastream/operators/full_window_partition.md diff --git a/docs/content.zh/docs/dev/datastream/operators/full_window_partition.md b/docs/content.zh/docs/dev/datastream/operators/full_window_partition.md new file mode 100644 index 0000000000000..f7caeae55df10 --- /dev/null +++ b/docs/content.zh/docs/dev/datastream/operators/full_window_partition.md @@ -0,0 +1,104 @@ +--- +title: "Full Window Partition" +weight: 5 +type: docs +aliases: + - /dev/stream/operators/full_window_partition.html +--- + + +# Full Window Partition Processing on DataStream + +This page explains the use of full window partition processing API on DataStream. +Flink enables both keyed and non-keyed DataStream to directly transform into +`PartitionWindowedStream` now. +The `PartitionWindowedStream` represents collecting all records of each subtask separately +into a full window. +The `PartitionWindowedStream` support four APIs: `mapPartition`, `sortPartition`, `aggregate` +and `reduce`. + +Note: Details about the design and implementation of the full window partition processing can be +found in the proposal and design document +[FLIP-380: Support Full Partition Processing On Non-keyed DataStream](https://cwiki.apache.org/confluence/display/FLINK/FLIP-380%3A+Support+Full+Partition+Processing+On+Non-keyed+DataStream). + +## MapPartition + +`MapPartition` represents collecting all records of each subtask separately into a full window +and process them using the given `MapPartitionFunction` within each subtask. The +`MapPartitionFunction` is called at the end of inputs. + +An example of calculating the sum of the elements in each subtask is as follows: + +```java +DataStream dataStream = //... +PartitionWindowedStream partitionWindowedDataStream = dataStream.fullWindowPartition(); + +DataStream resultStream = partitionWindowedDataStream.mapPartition( + new MapPartitionFunction() { + @Override + public void mapPartition( + Iterable values, Collector out) { + int result = 0; + for (Integer value : values) { + result += value; + } + out.collect(result); + } + } +); +``` + +## SortPartition +`SortPartition` represents collecting all records of each subtask separately into a full window +and sorts them by the given record comparator in each subtask at the end of inputs. + +An example of sorting the records by the first element of tuple in each subtask is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.sortPartition(0, Order.ASCENDING); +``` + +## Aggregate +`Aggregate` represents collecting all records of each subtask separately into a full window and +applies the given `AggregateFunction` to the records of the window. The `AggregateFunction` +is called for each element, aggregating values incrementally within the window. + +An example of aggregate the records in each subtask is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.aggregate(new AggregateFunction<>{...}); +``` + +## Reduce +`Reduce` represents applies a reduce transformation on all the records in the partition. +The `ReduceFunction` will be called for every record in the window. +An example is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.aggregate(new ReduceFunction<>{...}); +``` + +{{< top >}} diff --git a/docs/content.zh/docs/dev/datastream/operators/overview.md b/docs/content.zh/docs/dev/datastream/operators/overview.md index d5f52aa1f7c93..a8fac6327f85a 100644 --- a/docs/content.zh/docs/dev/datastream/operators/overview.md +++ b/docs/content.zh/docs/dev/datastream/operators/overview.md @@ -612,6 +612,32 @@ env.execute() {{< /tab >}} {{< /tabs>}} +### Full Window Partition +#### DataStream → PartitionWindowedStream + +将所有的数据记录收集到一个Window中,然后在输入数据流结束时按 partition 进行处理。本方法特别适用于批处理场景。 +对于非 Keyed DataStream,一个 partition 包含一个并行子任务的所有数据记录。 +对于 Keyed DataStream,一个 partition 包含所有具有相同 key 的数据记录。 + +```java +DataStream dataStream = //... +PartitionWindowedStream partitionWindowedDataStream = dataStream.fullWindowPartition(); +// do full window partition processing with PartitionWindowedStream +DataStream resultStream = partitionWindowedDataStream.mapPartition( + new MapPartitionFunction() { + @Override + public void mapPartition( + Iterable values, Collector out) { + int result = 0; + for (Integer value : values) { + result += value; + } + out.collect(result); + } + } +); +``` + ## 物理分区 Flink 也提供以下方法让用户根据需要在数据转换完成后对数据分区进行更细粒度的配置。 diff --git a/docs/content/docs/dev/datastream/operators/full_window_partition.md b/docs/content/docs/dev/datastream/operators/full_window_partition.md new file mode 100644 index 0000000000000..f7caeae55df10 --- /dev/null +++ b/docs/content/docs/dev/datastream/operators/full_window_partition.md @@ -0,0 +1,104 @@ +--- +title: "Full Window Partition" +weight: 5 +type: docs +aliases: + - /dev/stream/operators/full_window_partition.html +--- + + +# Full Window Partition Processing on DataStream + +This page explains the use of full window partition processing API on DataStream. +Flink enables both keyed and non-keyed DataStream to directly transform into +`PartitionWindowedStream` now. +The `PartitionWindowedStream` represents collecting all records of each subtask separately +into a full window. +The `PartitionWindowedStream` support four APIs: `mapPartition`, `sortPartition`, `aggregate` +and `reduce`. + +Note: Details about the design and implementation of the full window partition processing can be +found in the proposal and design document +[FLIP-380: Support Full Partition Processing On Non-keyed DataStream](https://cwiki.apache.org/confluence/display/FLINK/FLIP-380%3A+Support+Full+Partition+Processing+On+Non-keyed+DataStream). + +## MapPartition + +`MapPartition` represents collecting all records of each subtask separately into a full window +and process them using the given `MapPartitionFunction` within each subtask. The +`MapPartitionFunction` is called at the end of inputs. + +An example of calculating the sum of the elements in each subtask is as follows: + +```java +DataStream dataStream = //... +PartitionWindowedStream partitionWindowedDataStream = dataStream.fullWindowPartition(); + +DataStream resultStream = partitionWindowedDataStream.mapPartition( + new MapPartitionFunction() { + @Override + public void mapPartition( + Iterable values, Collector out) { + int result = 0; + for (Integer value : values) { + result += value; + } + out.collect(result); + } + } +); +``` + +## SortPartition +`SortPartition` represents collecting all records of each subtask separately into a full window +and sorts them by the given record comparator in each subtask at the end of inputs. + +An example of sorting the records by the first element of tuple in each subtask is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.sortPartition(0, Order.ASCENDING); +``` + +## Aggregate +`Aggregate` represents collecting all records of each subtask separately into a full window and +applies the given `AggregateFunction` to the records of the window. The `AggregateFunction` +is called for each element, aggregating values incrementally within the window. + +An example of aggregate the records in each subtask is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.aggregate(new AggregateFunction<>{...}); +``` + +## Reduce +`Reduce` represents applies a reduce transformation on all the records in the partition. +The `ReduceFunction` will be called for every record in the window. +An example is as follows: + +```java +DataStream> dataStream = //... +PartitionWindowedStream> partitionWindowedDataStream = dataStream.fullWindowPartition(); +DataStream resultStream = partitionWindowedDataStream.aggregate(new ReduceFunction<>{...}); +``` + +{{< top >}} diff --git a/docs/content/docs/dev/datastream/operators/overview.md b/docs/content/docs/dev/datastream/operators/overview.md index e9834b7c8bdfa..1e5d5c625c777 100644 --- a/docs/content/docs/dev/datastream/operators/overview.md +++ b/docs/content/docs/dev/datastream/operators/overview.md @@ -617,6 +617,34 @@ env.execute() {{< /tab >}} {{< /tabs>}} +### Full Window Partition +#### DataStream → PartitionWindowedStream + +Collects all records of each partition separately into a full window and processes them. The window +emission will be triggered at the end of inputs. +This approach is primarily applicable to batch processing scenarios. +For non-keyed DataStream, a partition contains all records of a subtask. +For KeyedStream, a partition contains all records of a key. + +```java +DataStream dataStream = //... +PartitionWindowedStream partitionWindowedDataStream = dataStream.fullWindowPartition(); +// do full window partition processing with PartitionWindowedStream +DataStream resultStream = partitionWindowedDataStream.mapPartition( + new MapPartitionFunction() { + @Override + public void mapPartition( + Iterable values, Collector out) { + int result = 0; + for (Integer value : values) { + result += value; + } + out.collect(result); + } + } +); +``` + ## Physical Partitioning Flink also gives low-level control (if desired) on the exact stream partitioning after a transformation, via the following functions.