Learn Flink:Fault Tolerance

Fault Tolerance via State Snapshots 通过状态快照实现容错 State Backends 状态后端

The keyed state managed by Flink is a sort of sharded, key/value store, and the working copy of each item of keyed state is kept somewhere local to the taskmanager responsible for that key. Operator state is also local to the machine(s) that need(s) it.
Flink管理的keyed state是一种分片的键/值存储，每个keyed state的工作副本保存在负责该key的taskmanager的本地某处。
Operator state也是存储在它运行机器的本地。

This state that Flink manages is stored in a state backend. Two implementations of state backends are available – one based on RocksDB, an embedded key/value store that keeps its working state on disk, and another heap-based state backend that keeps its working state in memory, on the Java heap.
Flink管理的状态存储在状态后端。状态后端有两种实现——一种基于RocksDB，这是一种嵌入式键/值存储，将其工作状态保存在磁盘上;另一种基于堆的状态后端，将其运行状态保存在内存中，保存在Java堆上。

>Supports state larger than available memory 支持大于可用内存的状态
>Rule of thumb: 10x slower than heap-based backends 经验法则：比基于堆的后端慢10倍

When working with state kept in a heap-based state backend, accesses and updates involve reading and writing objects on the heap. But for objects kept in the EmbeddedRocksDBStateBackend, accesses and updates involve serialization and deserialization, and so are much more expensive. But the amount of state you can have with RocksDB is limited only by the size of the local disk. Note also that only the EmbeddedRocksDBStateBackend is able to do incremental snapshotting, which is a significant benefit for applications with large amounts of slowly changing state.
当使用保存在基于堆的状态后端中的状态时，访问和更新状态就涉及在堆上读取和写入对象。
但是对于保存在EmbeddedRocksDBStateBackend中的对象，访问和更新状态就涉及序列化和反序列化，因此成本要高得多。
但RocksDB的状态量仅受本地磁盘大小的限制。还请注意，只有EmbeddedRocksDBStateBackend能够进行增量快照，这对于具有大量缓慢变化状态的应用程序是一个显著的好处。

Both of these state backends are able to do asynchronous snapshotting, meaning that they can take a snapshot without impeding the ongoing stream processing.
这两个状态后端都能够进行异步快照，这意味着它们可以在不妨碍正在进行的流处理的情况下进行快照。

Checkpoint Storage 检查点存储

Flink periodically takes persistent snapshots of all the state in every operator and copies these snapshots somewhere more durable, such as a distributed file system. In the event of the failure, Flink can restore the complete state of your application and resume processing as though nothing had gone wrong.
Flink定期获取operator中所有状态的持久快照，并将这些快照复制到更持久的地方，例如分布式文件系统。
在发生故障的情况下，Flink可以恢复应用程序的完整状态并恢复处理，就好像什么都没有发生一样。

The location where these snapshots are stored is defined via the jobs checkpoint storage. Two implementations of checkpoint storage are available - one that persists its state snapshots to a distributed file system, and another that users the JobManager’s heap.
存储这些快照的位置通过该job的检查点存储定义。检查点存储有两种实现——一种将其状态快照持久化到分布式文件系统，另一种使用JobManager的堆。

>Supports very large state size 支持非常大的状态
>Highly durable 高度耐用
>Recommended for production deployments 建议用于生产部署
>Good for testing and experimentation with small state (locally) 适用于小状态（本地）的测试和实验

State Snapshots 状态快照 Definitions 定义

1.Snapshot – a generic term referring to a global, consistent image of the state of a Flink job. A snapshot includes a pointer into each of the data sources (e.g., an offset into a file or Kafka partition), as well as a copy of the state from each of the job’s stateful operators that resulted from having processed all of the events up to those positions in the sources.
快照–一个通用术语，指Flink作业状态的全局一致图像。快照包括指向每个数据源的指针（例如，文件或Kafka分区中的偏移量），以及来自每个作业带状态的operators的状态副本，该状态产生于对数据源中指针之前的所有事件的处理。

2.Checkpoint – a snapshot taken automatically by Flink for the purpose of being able to recover from faults. Checkpoints can be incremental, and are optimized for being restored quickly.
检查点–Flink为了能够从故障中恢复而自动拍摄的快照。检查点可以是增量的，并针对快速恢复进行了优化。

3.Externalized Checkpoint – normally checkpoints are not intended to be manipulated by users. Flink retains only the n-most-recent checkpoints (n being configurable) while a job is running, and deletes them when a job is cancelled. But you can configure them to be retained instead, in which case you can manually resume from them.
外部化检查点——通常情况下，检查点不会被用户操纵。Flink在作业运行时仅保留n个最近的检查点（n个可配置），并在作业取消时删除它们。但您可以将它们配置为保留，在这种情况下，您可以手动从它们恢复计算结果。

4.Savepoint – a snapshot triggered manually by a user (or an API call) for some operational purpose, such as a stateful redeploy/upgrade/rescaling operation. Savepoints are always complete, and are optimized for operational flexibility.
保存点–用户（或API调用）出于某些操作目的手动触发的快照，如有状态重新部署/升级/重新缩放操作。保存点始终完整，并针对操作灵活性进行了优化。

How does State Snapshotting Work? 状态快照是如何工作的？

Flink uses a variant of the Chandy-Lamport algorithm known as asynchronous barrier snapshotting.
Flink使用了Chandy-Lamport算法的一种变体，称为异步分界线快照。

When a task manager is instructed by the checkpoint coordinator (part of the job manager) to begin a checkpoint, it has all of the sources record their offsets and insert numbered checkpoint barriers into their streams. These barriers flow through the job graph, indicating the part of the stream before and after each checkpoint.
当检查点协调器（job manager的一部分）指示task manager开始检查点时，它会让所有数据源记录它们的偏移量，并将带编号的检查点分界线插入数据源的流中。这些分界线贯穿作业图，标识出每个检查点之前和之后的部分。

Checkpoint n will contain the state of each operator that resulted from having consumed every event before checkpoint barrier n, and none of the events after it.
检查点n将包含每个operator的状态，这些状态产生于对检查点分界线n之前的每个事件的处理，而不处理它之后的任何事件。

As each operator in the job graph receives one of these barriers, it records its state. Operators with two input streams (such as a CoProcessFunction) perform barrier alignment so that the snapshot will reflect the state resulting from consuming events from both input streams up to (but not past) both barriers.
当作业图中的operator接收到分界线时，它会记录其状态。具有两个输入流（如CoProcessFunction）的Operators会对齐分界线，因此快照将能够准确反映其状态(该状态是通过消费两个输入数据流中在两个分界线之前的事件而产生的)。

Flink’s state backends use a copy-on-write mechanism to allow stream processing to continue unimpeded while older versions of the state are being asynchronously snapshotted. Only when the snapshots have been durably persisted will these older versions of the state be garbage collected.
Flink的状态后端使用一种写时复制机制，允许流处理在异步快照旧版本的状态时不受阻碍地继续。只有当快照被持久保存时，这些旧版本的状态才会被垃圾收集。

Exactly Once Guarantees 保证精确一次

When things go wrong in a stream processing application, it is possible to have either lost, or duplicated results. With Flink, depending on the choices you make for your application and the cluster you run it on, any of these outcomes is possible:
当流处理应用程序出错时，可能会导致丢失数据或重复计算数据。使用Flink，根据您为应用程序和运行它的集群所做的设置，以下任何结果都是可能的：

1.Flink makes no effort to recover from failures (at most once)
2.Nothing is lost, but you may experience duplicated results (at least once)
3.Nothing is lost or duplicated (exactly once)
Flink不努力从故障中恢复（最多一次）
不会丢失任何内容，但您可能会对数据进行重复计算（至少一次）
没有任何东西丢失或重复（仅一次）

Given that Flink recovers from faults by rewinding and replaying the source data streams, when the ideal situation is described as exactly once this does not mean that every event will be processed exactly once. Instead, it means that every event will affect the state being managed by Flink exactly once.
鉴于Flink通过倒带和重放源数据流从故障中恢复，虽然理想情况被描述为精确一次，这并不意味着每个事件都将被精确处理一次。而是，这意味着每个事件都会影响Flink管理的状态一次。

Barrier alignment is only needed for providing exactly once guarantees. If you don’t need this, you can gain some performance by configuring Flink to use CheckpointingMode.AT_LEAST_ONCE, which has the effect of disabling barrier alignment.
分解线对齐仅用于当保证精确一次时。如果不需要，可以通过将Flink配置为使用CheckpointingMode.AT_LEAST_ONCE来获得一些性能提升。其具有禁用分解线对齐的效果。

Exactly Once End-to-end 端到端精确一次

To achieve exactly once end-to-end, so that every event from the sources affects the sinks exactly once, the following must be true:
要实现端到端的精确一次，以便source的每个事件都精确一次影响 sink，必须满足以下条件：

1.your sources must be replayable, and
2.your sinks must be transactional (or idempotent)
1.您的sources必须是可重复播放的，并且
2.sinks必须是事务性的（或幂等的）

Learn Flink:Fault Tolerance

Java相关栏目本月热门文章