ADOC LSM-Tree (FAST '23) Paper Notes
A study of LSM-Tree write stalls that analyzes their causes in depth and shows that some prior work on this question is flawed. Through experiments on multiple storage devices, it identifies the root source of write stalls: data overflow.
Data overflow: the rapid expansion of one or more components caused by data flowing into some component of the LSM-KV system.
The authors' approach: by balancing and coordinating the data flow between components, data overflow can be reduced, and with it write stalls.
They propose ADOC, an automatic data overflow control framework that tunes the system configuration automatically, mainly via two parameters: the number of threads and the batch size,
so as to minimize the data overflow of each component.
Figure 1: regardless of the storage device, without flow control even persistent memory (PM) cannot avoid write stalls under heavy write pressure.
Previous work
Previous studies identified three types of write stalls:
- Memtable stall: e.g., RocksDB sets the number of memtables to 2; when both are full, system input stops, causing a stall.
- L0-L1 compaction stall: when the number of L0 files reaches a configured threshold (20), the LSM-KV store slows down or even stops the input stream, causing a stall.
- Pending Input Size (PS) stall: when the bytes pending compaction reach 64 GB and 128 GB, foreground writes are slowed down and stopped, respectively. The purpose is to avoid a sudden burst of disk bandwidth usage when deep-level compaction jobs run.
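For reference, a minimal sketch of the RocksDB options behind these three triggers, pinned to the thresholds quoted in these notes; the DB path is illustrative and the values are not a tuning recommendation.

```cpp
// Sketch: RocksDB options corresponding to the three stall triggers above.
// Values mirror the thresholds quoted in these notes; for illustration only.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;

  // Memtable stall: with 2 write buffers, input stops once both are full
  // and the immutable one has not been flushed yet.
  opts.max_write_buffer_number = 2;

  // L0-L1 stall: writes are slowed once L0 holds this many files
  // (and stopped at level0_stop_writes_trigger).
  opts.level0_slowdown_writes_trigger = 20;

  // Pending Size (PS) stall: slow down at 64 GB of estimated pending
  // compaction bytes, stop at 128 GB.
  opts.soft_pending_compaction_bytes_limit = 64ull << 30;
  opts.hard_pending_compaction_bytes_limit = 128ull << 30;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/adoc_notes_db", &db);
  if (s.ok()) delete db;
  return s.ok() ? 0 : 1;
}
```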
Next, these prior conclusions are examined experimentally, focusing on two parameters:
- Number of threads running concurrently in the system
  - Affected resources: CPU time and the bandwidth they occupy
- Batch size: memtable size and count; SST file size and count
  - Affects the scheduling pattern and input size of background jobs
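As a rough mapping (this note's assumption, not necessarily the paper's implementation), the two parameters correspond to RocksDB options that can be changed at runtime: max_background_jobs for the thread count, and write_buffer_size / target_file_size_base for the batch size.

```cpp
// Sketch: the two tuned parameters expressed as runtime-adjustable RocksDB
// options. `db` is an already-open rocksdb::DB*; values are illustrative.
#include <rocksdb/db.h>
#include <cstddef>
#include <string>

rocksdb::Status ApplyKnobs(rocksdb::DB* db, int threads, size_t batch_bytes) {
  // Thread count: background flush + compaction parallelism.
  rocksdb::Status s = db->SetDBOptions(
      {{"max_background_jobs", std::to_string(threads)}});
  if (!s.ok()) return s;

  // Batch size: memtable size and SST file size.
  return db->SetOptions(
      {{"write_buffer_size", std::to_string(batch_bytes)},
       {"target_file_size_base", std::to_string(batch_bytes)}});
}
```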
Previous studies attributed the root cause of write stalls to:
(1) Resource Exhaustion.
- Limited bandwidth and contention, which leave insufficient bandwidth for background compaction.
- Limited CPU resources, where high CPU overhead is one cause of write stalls.
(2) L0-L1 Compaction Data Movement.
- L0-to-L1 compaction can only be executed serially, because the SST files in L0 may contain overlapping keys.
(3) Deep-Level Compaction Data Movement.
- The high resource consumption of deep-level compaction jobs, and their competition with other background jobs, cause write stalls.
Observations.
- For PM and NVMe SSD, the stall duration decreases as the thread count increases, up to four threads.
- Once the thread count goes beyond six, write stall occurrences start to drop while the stall duration increases.
- As the number of threads increases, CPU utilization decreases.
Conclusion.
- Limited CPU resources may be a cause of write stall in certain scenarios, but it is not universally applicable.
Observation.
- When increasing the batch size, both the frequency and duration of write stalls significantly decrease, while CPU utilization remains unchanged.
Conclusion.
- Even with high CPU utilization, write stalls can be reduced simply by increasing the batch size.
- This indicates that there isn’t a strong correlation between CPU utilization and write stalls.
Previous research suggests that:
- As the number of threads increases, disk bandwidth becomes overwhelmed, leading to write stalls.
If write stalls were caused by insufficient bandwidth:
- The device should be under sustained high write pressure.
- Average bandwidth utilization should approach the theoretical maximum.
Actual Observation.
- Write stalls occur even when the disk is idle and has spare bandwidth.
Observation.
In the early execution stages, performance valleys in NVMe SSD and PM match the occurrence of Compaction.
The correlation between performance troughs and L0-L1 merge operations weakens over time.
Observation.
When more threads are generated for Compaction jobs, the processing speed of flush operations decreases.
As the number of threads increases, the occurrence of PS stall (Pending Size Stall) caused by slow compaction decreases.
A PS stall is triggered when the volume of pending compaction data (pending bytes) exceeds a threshold.
The main purpose is to prevent excessive accumulation of redundant data in the system.
RocksDB default thresholds:
- soft limit (slows down writes): 64 GB
- hard limit (stops writes completely): 128 GB
Data overflow Definition.
- The sudden surge in data flow into one of the components leads to rapid expansion of one or multiple components.
It happens when the processing rates of different background jobs do not match each other.
Three types of data overflow:
- Memory Overflow (MMO): When the system input rate exceeds the immutable Memtable Flush rate.
- Level 0 Overflow (L0O): When the L0-L1 merge processing rate cannot match the flush rate.
- Redundancy Overflow (RDO): When the merge threads’ work efficiency cannot match the rate at which redundant data is generated.
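One crude way to watch for these three conditions from the application side is to poll RocksDB's DB properties; treating these counters as proxies for MMO/L0O/RDO, and the thresholds used below, are this note's assumptions rather than the paper's actual detector.

```cpp
// Sketch: polling RocksDB properties as rough proxies for the three
// overflow types. Thresholds are placeholders taken from these notes.
#include <rocksdb/db.h>
#include <cstdint>

struct OverflowFlags { bool mmo = false, l0o = false, rdo = false; };

OverflowFlags DetectOverflow(rocksdb::DB* db) {
  OverflowFlags f;
  uint64_t imm = 0, l0 = 0, pending = 0;

  // MMO proxy: immutable memtables still waiting to be flushed.
  db->GetIntProperty("rocksdb.num-immutable-mem-table", &imm);
  // L0O proxy: number of SST files sitting in L0.
  db->GetIntProperty("rocksdb.num-files-at-level0", &l0);
  // RDO proxy: estimated bytes awaiting compaction (redundant data).
  db->GetIntProperty("rocksdb.estimate-pending-compaction-bytes", &pending);

  f.mmo = imm >= 1;                  // flush is falling behind the input
  f.l0o = l0 >= 20;                  // L0 slowdown trigger from the notes
  f.rdo = pending >= (64ull << 30);  // 64 GB soft limit from the notes
  return f;
}
```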
Handling the three types of data overflow:
MMO (Memory Overflow):
- When the immutable memtable count reaches the threshold, ADOC reduces the thread count to raise the flush rate, which reserves more bandwidth for flush jobs.
- It also increases the batch size (memtable size) to improve the processing rate.
L0O (Level 0 Overflow):
- When the L0 file count exceeds the threshold, ADOC increases the thread count.
- This improves the chance of allocating threads to L0-L1 compaction, and lowers the flush rate to alleviate the overflow.
RDO (Redundancy Overflow):
- Triggered when the total redundant data size exceeds the threshold (RocksDB default: 64 GB).
- ADOC increases the thread count and reduces the batch size (SST file size).
- More threads raise the deep-merge rate (while also lowering the flush rate); a smaller batch size lets the scheduler generate finer-grained merge jobs, since small, dense merge jobs make redundancy reduction more efficient.
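Putting the three rules together, a minimal control-loop sketch might look like the following. It reuses the hypothetical DetectOverflow and ApplyKnobs helpers from the earlier sketches and uses arbitrary step sizes; ADOC's actual tuning logic is more involved.

```cpp
// Sketch of an ADOC-style tuning step: one adjustment of thread count and
// batch size per detected overflow type. Step sizes and bounds are arbitrary.
#include <rocksdb/db.h>
#include <algorithm>
#include <cstddef>

void TuneOnce(rocksdb::DB* db, int& threads, size_t& batch_bytes) {
  OverflowFlags f = DetectOverflow(db);

  if (f.mmo) {
    // MMO: fewer threads leave more bandwidth for flush; larger batches
    // raise the per-job processing rate.
    threads = std::max(1, threads - 1);
    batch_bytes *= 2;
  } else if (f.l0o) {
    // L0O: more threads raise the chance that L0-L1 compaction gets a
    // thread (and implicitly slow down the flush side).
    threads += 1;
  } else if (f.rdo) {
    // RDO: more threads speed up deep compaction; smaller batches let the
    // scheduler generate finer-grained, denser merge jobs.
    threads += 1;
    batch_bytes = std::max<size_t>(batch_bytes / 2, 8u << 20);
  }

  ApplyKnobs(db, threads, batch_bytes);
}
```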
MMO first pauses input when the flush rate is too low to persist incoming requests in time.
As the thread count grows, more threads are forced to share the limited bandwidth, so less bandwidth goes to the flush threads and immutable memtables cannot be flushed back in time.
In the early stages of execution, deep-level data has not yet accumulated, so an L0-L1 merge is easily assigned to a thread; this is why compaction and write stalls line up early on.
As execution continues, data accumulates and more deep-level compaction requests appear. With a limited number of threads, this reduces the chance that an L0-L1 merge gets a thread, and the correspondence breaks down.
Figure: write stall duration (bars) and occurrences (lines).
Figure: tuning actions during a one-hour execution.
The essence of a write stall: a surge of data flowing into one component makes one or more components expand rapidly.
ADOC only dynamically controls the thread count and batch size, so it cannot fully eliminate write stalls; it can only mitigate them.
Figure 1 shows that flow control can eliminate write stalls, but fixing the foreground traffic at a constant value is counterproductive.
- CruiseDB (ICDE '21) uses flow control, but its model has some flaws and leaves room for improvement;
- Designing a better flow-control feedback model is the key to solving write stalls.
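As one possible shape for such a model (not CruiseDB's actual design), a proportional controller could size the foreground write budget from the compaction backlog instead of a fixed cap; the target, gain, and bounds below are made-up values.

```cpp
// Sketch: an application-side feedback controller for foreground flow
// control. It shrinks the allowed write rate as the compaction backlog
// approaches a target and grows it when there is headroom.
#include <rocksdb/db.h>
#include <algorithm>
#include <cstdint>

class WriteRateController {
 public:
  explicit WriteRateController(rocksdb::DB* db) : db_(db) {}

  // Returns the write rate (bytes/s) the application should allow itself
  // for the next interval.
  uint64_t NextRate() {
    uint64_t pending = 0;
    db_->GetIntProperty("rocksdb.estimate-pending-compaction-bytes", &pending);

    // Proportional feedback around a backlog target of 32 GB (arbitrary).
    const double target = 32.0 * (1ull << 30);
    double error = (target - static_cast<double>(pending)) / target;
    double next = rate_ * (1.0 + 0.25 * error);  // gain = 0.25 (arbitrary)

    rate_ = static_cast<uint64_t>(
        std::clamp(next, 16.0 * (1ull << 20), 512.0 * (1ull << 20)));
    return rate_;
  }

 private:
  rocksdb::DB* db_;
  uint64_t rate_ = 128ull << 20;  // start at 128 MB/s (arbitrary)
};
```

The point is only that the admission rate is driven by a backlog signal rather than a constant; a real model would also have to account for device bandwidth and burst behavior.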
Determinants of stall duration:
- Larger operations cause longer stalls: L0-L1 > deep compaction > Flush;
- Worse parallelism means longer stalls: Flush < L0-L1 < deep compaction;
- A slower flush rate causes more MMO but less L0O;
- Higher parallelism (more compaction threads) means less RDO.