Removing Double-Logging (FAST 22) 论文笔记

摘要

类似 Mysql 的 RDB 存储引擎通常是以B+树为代表的 InnoDB，而基于 LSM-Tree 为代表的存储引擎以高性能和更高效的存储空间利用，逐渐替代前者。

但 LSM-Tree 存储引擎与上层 SQL层的结合又会引来一个关键的双重日志问题：

上层 RDB 和下层存储引擎都实现了冗余的 redo-log 机制，导致了不必要的IO性能开销

本文提出了一种新的机制：PASV 被动数据持久化方案，能够有效解决双重日志问题；

具体来说，PASV 通过完全移除存储引擎层的 WAL，同时引进了新的机制，包括：

被动内存缓冲区刷新策略
基于周期的数据持久化方案
优化后的部分数据恢复过程

来实现正常运行期间的可靠性和低成本进行数据持久化，并在系统故障时快速恢复；

最后达到的性能指标：

吞吐量提高 49.9%
延迟降低 89.3%
减少IO数量达到 42.9%
系统恢复时间缩短 4.8%

1 介绍

LSM-Tree 存储层与上层 SQL 层是隔离的，存储引擎可以作为一个插件进行替换

当请求到来时，

SQL 层，需要负责 RDB 层面的逻辑，比如：SQL解析、查询优化、查询处理、事务执行，但为了系统崩溃后的恢复，通常需要一个 redo-log 机制，Mysql 中就是 Binlog
存储引擎层，需要接收来自上层的请求，持久化数据到存储设备，类似的，通过 WAL 来保证在系统崩溃等异常情况下对数据的恢复

2 Double-Logging Problem

2.1 三种日志的作用和 double-log 带来的影响

Binlog 日志记录内容：

完整的 SQL 语句或行级变更，事务的开始、提交或回滚信息、元数据时间戳等

恢复过程：重放SQL操作、整个事务

Rocksdb WAL 记录内容：

完整的键值对操作，插入、更新、删除等，以及每个操作的序列号或时间戳

恢复过程：重放键值对操作

Innodb redo-log 记录内容：

页面内容的具体更改，包含页面ID、偏移量、旧值、新值，不包含完整的行数据

恢复过程：重放页面更改操作

WAL 和 Binlog 在功能和记录的内容上重叠度更高，区别仅仅是 Binlog 保存的是原始 SQL，WAL 保存的是转换后的 KV，导致了以下影响：

空间浪费
IO次数增加，日志都是需要落盘的，重复的日志写入将导致性能的降低，如图2(b)

Mysql 事务执行与 Binlog 的两阶段提交：

（1）事务执行（一个只有 update 的事务）

从磁盘获取这条记录，SQL层将旧记录和新记录同时传递给存储引擎
开启事务，首先生成一条 undo log，用于事务后面如果要回滚
存储层收到事务转译过来的kv后，写入 redo-log（WAL）
SQL层只需要确定事务都写入存储层的WAL后，就定义已执行完成，并持久化
然后，上层把 SQL 日志写到 binlog cache（遵从每个事务串行写入）
至此，事务执行完成

（2）事务提交（两阶段）

目的：防止半成功的状态，需要保证 binlog 和 WAL 都完成持久化

由于上下层之间是互相独立的，

如果没有两阶段提交，无法保证 binlog 持久化完成和 WAL 持久化完成能同时成功：

情况1：WAL 持久化，binlog cache 未刷盘，此时宕机后恢复

恢复阶段，存储层 WAL 日志重新执行，比如 k1 = 2，但由于 binlog 里面没有记录这条记录，

导致主从架构中，从库复制了 binlog 后，最后执行的状态，存储层出现键值不一致的情况，从库可能k1=1

情况2：binlog 持久化，WAL 未刷盘，此时宕机后恢复

Binlog 默认存储层已持久化，主库恢复时，WAL 中没有 k1 = 2，但 binlog 里面有个 update k1 = 2

由于 binlog 不参与恢复重建，只有 WAL 是 cache-safe 的

主库 k1 = 1，从库 k1 = 2

MyRocks 两阶段提交：

Prepare 阶段

kv对写入到 WAL，同时在 WAL 中将对应的事务状态设置为 prepare

Commit 阶段

将 binlog cache 里面的记录刷盘，成为真正的 binlog，

接着 SQL 层调用存储层的事务提交接口，将 WAL 里面对应的事务状态置为 commit。

至此，能够保证 binlog 和 WAL 的一致性。

MyRocks Two-Phase Commit:

Prepare Phase:
1. Key-value pairs are written to WAL, the corresponding transaction status in the WAL is set to “prepare”.
Commit Phase:
1. The records in the binlog cache are flushed to disk, becoming the actual binlog.
2. Then, the SQL layer calls the storage layer’s transaction commit interface to set the corresponding transaction status in the WAL to “commit”.

The same records will be written twice in binlog and wal.

（2）再由 Binlog 回放，在重新执行 SQL 语句的过程中，SQL层先解析成 kv 对并发送给存储层，在存储层接收时，RocksDB 优先将新的上层请求放入 WAL，再将kv对写入Rocksdb（相同kv写Rocksdb第二次）

如何解决 double-log 带来的性能下降

本文提出的解决方案：移除存储层的 WAL，仅保留 SQL 层的 binlog

The main idea is to remove WAL.

However, it will cause three critical challenges.

Unwarranted data persistence.
- After removing WAL, RocksDB may only write key-value pairs to the in-memory memtable.
- If the storage engine crashes, the binlog will consider the txn has already persisted and not need replay.
Partial persistence.
- A SQL transaction may be parsed into multiple KV pair writes.
- Multiple key-value pairs may be distributed across different column families in Rocksdb.
- Some KV items have been flushed to disk while others have not.
Lost track of LSN.
- Consistent snapshots are ineffective.

但简单的移除 WAL 会带来一系列问题：三个 Critical Challenges

（1）Unwarranted data persistence. 无法保证数据的持久性

在原先的系统里，SQL层一旦收到存储层的 WAL 写入响应，就默认这个键值对已经被持久化了

即使存储引擎突然崩溃，Rocksdb 立刻重放 WAL，也能达到一致性

但移除 WAL 后，Rocksdb 可能仅仅将键值对写入内存中的 memtable，并立即返回响应

如果存储引擎崩溃，binlog 会认为已经被持久化，不需要重放，导致这个 kv 对消失

（2）Partial persistence. 造成部分持久化

部分持久化现象针对的是事务，一个 SQL 事务可能被解析成多个 kv 对的写入，

多个键值对可能分布在不同的 memtable 中，可能存在同一事务有的 kv项已刷盘，有的 kv项未刷盘

从而导致系统崩溃时造成事务的部分持久化

（3）Lost track of LSN. 丢失跟踪的LSN信息

LSN（log sequence number)，Rocksdb 每当有一个键值对的插入或更新都会分配一个新的 LSN

这里主要会影响 Rocksdb 的 MVCC 并发控制，在有 WAL 的情况下，重放的每个kv都会保留之前的 LSN

但如果去掉 WAL，重放 binlog 会为原先的kv分配新的 LSN，这将导致：

原有的一致性快照失效，因为它们基于旧的 LSN

由于并发执行，原先的操作执行顺序和恢复后的执行顺序可能不同，将无法保证读一致性

有WAL: LSN帮助确保事务的原子性，通过WAL可以确保事务要么完全应用，要么完全回滚。

删除WAL后: 无法保证事务的完整性，特别是对于跨多个MemTable的事务

对于已刷盘的数据，无法通过 LSN 知道磁盘上数据的版本，一致性快照失效

3 系统设计

PASV-mgr Layer.

Flush flag: Used to pass transaction information related to data recovery between two layers, with each CF maintaining a key for the flag.
EBP: Determine the nearest global safe point to the current time.
Partial Recovery: Determine the minimum number of KVs to be recovered for each CF.

去除 Rocksdb 的 WAL 层后，在 SQL 层与存储层之间新加了 PASV-mgr 层，里面包含三个模块：

EBP：确定离当前最近的全局安全点，能够完整恢复丢失数据
Partial Recovery：确定每个CF要恢复的最少KV数
Flush flag：用于在两层之间传递与数据恢复相关的事务信息，每个CF维护一个flag的key
- flag的value：<CF, TSN, $$LSN_{first$$, $$LSN_{last$$>

ACT (Active Flush) 是一种主动同步数据持久化的方法，用于解决移除 Write-Ahead Log (WAL) 后的数据一致性问题。以下是 ACT 的主要算法思路：

基本原理： ACT 通过在 RDB 层定期强制刷新存储引擎层的所有内存缓冲区（MemTables），来创建一个同步点。这确保了在这个点之前的所有事务都被安全地持久化到存储中。
实现机制： a. 事务计数：
1. 系统维护一个事务计数器。
2. 每当一个事务提交时，计数器增加。
b. 刷新阈值：
1. 预定义一个刷新阈值（如每 100、200、500 等事务）。
c. 强制刷新：
1. 当事务计数达到阈值时，触发 flushAll() 操作。
2. flushAll() 会同时刷新所有 LSM 树的内存缓冲区。
d. 同步操作：
1. 刷新过程是同步的，需要获取锁。
2. 在刷新期间，所有提交线程都会被阻塞。
优点：
1. 简单直接：实现相对简单，容易理解。
2. 可控的持久化点：可以通过调整阈值来控制持久化频率。
3. 确定性：每次刷新后，可以明确知道哪些事务已经被持久化。
缺点：
1. 性能开销：频繁的强制刷新会导致大量小型同步 I/O，影响性能。
2. 并行性降低：同步刷新会阻塞其他操作，降低系统并行性。
3. 内存缓冲效率降低：可能导致未满的内存缓冲被过早刷新。
4. 模块化受损：RDB 层直接控制存储引擎层的操作，破坏了模块化设计。

ACT creates a synchronization point by periodically forcing a flush of all memory buffers (MemTables) in the storage engine layer at the RDB layer.

Advantages:

Simplicity.
Determinism: After each flush, it’s clear which transactions have persisted.

Disadvantages:

Performance overhead: Frequent forced flushes lead to many small synchronous I/Os.
The flush process is synchronous and requires acquiring a lock in SQL layer.
All the commit threads will be blocked during the ACT.
More frequently happening compaction operations in the underlying LSM-trees.
Compromised modularity.

Implementation principles:

A mechanism is needed to track the progress of data persistence, while also tracking the status of transactions at the storage layer.
- PASV Flush Flag.
The key-value pairs parsed from a single transaction may be distributed across multiple different Column Families (CFs). A mechanism is needed to ensure a global data persistence point.
- Epoch-based Persistence.
For recovery and reconstruction, a mechanism needs to be designed to reallocate Log Sequence Numbers.
- Partial Recovery Module.

3.1 Flush flag 模块

作用：

标志数据持久化的进度
在不需要 WAL 的情况下也能跟踪事务的状态

Flush flag 设计可行性的前提

Binlog 串行记录着事务提交的顺序
事务解析成键值对的过程是串行的，即当一个事务在被解析和转换成kv对时，另一个事务必须等待
一个事务解析出的所有键值对按照顺序串行加入 LSM-Tree 的 memtable

由 Flush flag 定义两个恢复点

$$Flag = ，

TSN 代表当前CF memtable 收到的最后一个已传递所有kv到这个列族的事务, 设为 $$T_{p$$

$$LSN_{first$$ 代表该事务在这个 memtable 的第一个键值对的 log sequence number

$$LSN_{last$$代表最新那个键值对的 log sequence number

Flush Flag value = $$$$

Flush flag is a special KV. The key is a random 128-bit magic number distinguished from user data.
TSN is the Transaction Sequence Number of the last transaction whose KV items are inserted in the memory buffer of the column family.
$$LSN_{first$$ represents the first KV of the transaction.
$$LSN_{last}$$ represents the last persisted KV of the transaction.

Prerequisites for the feasibility of flush flag design: serial property of transaction processing.

During the transaction commit process, all transaction records are persisted to the binlog in serial.
The SQL transactions are parsed in the RDB layer and translated into KV batches in the storage engine layer in serial.
The KV items that are translated from a transaction are inserted into the LSM-trees’ memory buffers in serial.

Hence we can ensure that all KV items logically prior to the last KV item of the last transaction in a column family would never be persisted to storage later than it.

一个 flag 可以定义不同事务的数据安全点和数据持久化点

Data Save Point: 事务 $$T_{p-1}$$的数据安全点，表示的是事务 P-1 及之前的所有事务在 CF 上都已经完成持久化
Data Persistent Point: 事务 $$T_{p}$$ 的数据持久化点，由于事务 P 还可能有其他键值对在来的路上（上层向下层传递中），这里表示的是事务 P 持久化到了事务里面的一个键值对及之前这个事务的所有键值对

Based on this serial property, we can conclude that:

If the retrieved flush flag contains transaction $$T_p$$, the KV items of transaction $$T_{p-1}$$ and transactions prior to it in $$CF_i$$ must have already been persisted.
Local Data Save Point for $$T_{p-1}$$.
- It indicates that$$T_{p-1}$$ and all transactions before it have been fully persisted on the CF.
Local Data Persistent Point for $$T_{p}$$.
- Due to some kv pairs of $$T_{p}$$ not yet being sent to the storage layer, this indicates the last persisted key-value pair in $$T_{p}$$.

理解 Data Save Point

假设 P-1 事务没有完成持久化，后续还有 P-1 事务的键值对在来的路上，即下面场景：

memtable：kv1(t1), kv2(t2), kv3(t3), kv4(t2)

Sending to the storage layer: kv5(t1), kv6(t2)

由于 binlog 串行记录事务的提交顺序，即 T3 -> T1 -> T2

flag = <CF1, t2, kv2_lsn, kv4_lsn>， P = t2，P-1 = t1

这里违反了 serial property 的条件二和三：t1 的所有键值对解析和传递操作都必须先于 t2 任何操作之前进行处理

若要满足 serial property 的三个条件，顺序应为串行的：

kv1(t1), kv5(t1), kv3(t3), kv2(t2), kv4(t2), kv6(t2)

3.2 Epoch-based Persistence 模块

作用：

协调多个 CF 之间的数据持久化进度；
确定数据恢复的全局安全点；

由于每个 CF 进行 flush 的时间不同，一个事务的键值对操作可能分散在多个 CF 里面，

每个 CF 的持久化进度不同，因此，需要确定在所有 CF 里面，最新的已完全持久化到存储的事务。

Local data persistent point : 事务 $$T_{p}$$

Global data persistent point : 根据 binlog 记录的事务 commit 的顺序，比如 T3 -> T1 -> T2；

取 min{ local data persistent point }，大小顺序如上，扫描 binlog 构建。

3.3 Partial Recovery 模块

作用：

崩溃后数据恢复的具体执行过程；
重建丢失的 LSN 信息；
优化：只重放每个 CF 尚未持久化的事务，最小化需要重放的数据量；

一个简单的思路是重放全局点及之后的所有事务的键值对，但会造成冗余放大；

如右图，CF1~3的数据持续点为：t, n, s，全局点为 n，

如果 CF1 也重放事务 n 的键值对，会造成已持久化重复插入。

作者的思路也很简单，根据 global persistent point 和 binlog 确定需要重放的事务范围：

先让 binlog 根据 global persistent point 开始重放事务并转换成键值对；
每个 CF 根据 local persistent point（flag）和 binlog 事务顺序来判断某个键值对是否需要被执行；
如右图，恢复阶段 CF1 将事务t 之前的所有键值对抛弃。

3.4 恢复阶段重新构建每个键值对操作的 LSN

移除 WAL 后，每个键值对的 LSN 也会丢失

在 MVCC 中，读操作需要在某个时间点的一致性视图就需要根据 LSN 读取相应时间点的数据

重放阶段，binlog 事务一个个串行执行

A 事务解析成kv对发到各个 CF，如果该 CF flag 的 TSN > 当前事务的 TSN，丢弃（不需要重放）
由 serial property 原则，每个 CF 只会接收大于等于 TSN 事务的键值对，每个键值对新LSN = flag.last_lsn++

4 实验

Workload.

FaceBook UDB，模拟社交网络 benchmark，模拟用户图谱数据，包括节点（用户）和边（关系）
- 吞吐量：提升 49.9%
- 延迟降低：30%
- IO次数：存储引擎层降低 42%
TPC-C
- 总执行时间：减少 26%
- 吞吐：提升 35%
- IO次数：存储引擎层降低 23%
Facebook UDB, simulating social network benchmarks, models user graph data including nodes (users) and edges (relationships).
- Throughput: increased by 49.9%.
- Latency: reduced by 30%.
- I/O count: reduced by 42% at the storage engine layer.
TPC-C
- Total execution time: reduced by 26%.
- Throughput: increased by 35%.
- I/O count: reduced by 23% at the storage engine layer.

图5：

PASV 和 Myrocks baseline 在性能指标上的对比，延迟、吞吐、IO次数
PASV and Myrocks baseline comparison in performance metrics.
Including total execution time, avg latency, throughput, and I/O count.

图6：

PASV 和主动刷新方法 active flush method ACT 的性能对比
主动刷新方法 ACT：每隔一段时间将所有 CF 的 memtable 强制 flush 到磁盘
- ACT 的缺点：一般来说，Rocksdb 在memtable满了后转换成 imm，再 flush
- 但 ACT 每隔一段时间强制刷新，导致同一时间点大量小的memtable都要被flush，造成性能的下降
对比的指标：吞吐，延迟，恢复时间

Performance comparison between PASV and the Active flush method(ACT)

Active flush method: Forcibly flushes the memtables of all CFs to disk at regular intervals.
- Drawback of ACT: ACT forces flushing at regular intervals, causing many small memtables to be flushed at the same time, leading to performance degradation.
Comparison metrics: throughput, latency, recovery time.

图7：

Native：各种指标的上界，直接删除 WAL，不考虑系统崩溃的场景
PASV 在性能上能够达到极限性能上界
同时，恢复时间 Native 是必须重放所有binlog，与 PASV 相差几百倍
Native: The upper bound for various metrics, directly removing WAL without considering crash scenarios.
PASV can achieve extreme performance upper bound in terms of performance.(SOTA)
For recovery time, Native must replay all binlogs, which differs from PASV by hundreds of times.