Design A Storage Engine for Distributed Relational Database from Scratch

OceanBase Database
Nov 30, 2022


This article is the transcript of a tech talk, Design A Storage Engine For Distributed Relational Database From The Beginning, given by Sam Zhao, tech leader of the storage engine team at OceanBase. Sam joined Alibaba in 2010 and has rich experience in database kernel development.

In his tech talk, Sam covered the following topics:

- Storage engine overview

- Components of the storage engine

- Future work

So without further ado, let’s dive right into the talk.

Storage Engine Overview

Before diving into the architectural details, let’s just ask one question: Which storage structure should be used? As you know, both B-tree and LSM-Tree are widely used in storage engines, but which one is better?

Let me answer this question from two perspectives: scalability and data consistency, which are two fundamental functions of a distributed storage engine.

To achieve high scalability, a table should be partitioned into many partitions. We can then distribute different partitions to different servers so that all servers share the read/write requests from different users. The load across servers may become unbalanced, so we need to split partitions and migrate some of them from one server to another. An LSM-Tree makes this easier than a B-tree: because SSTables are static and read-only, it is very easy to split or migrate an SSTable.

The Paxos protocol can guarantee strong data consistency. The problem is how to verify that. Although the Paxos protocol can be proven mathematically, a real, complex distributed system may still have bugs that lead to wrong results. An LSM-Tree has a compaction routine, which is an ideal opportunity to check data consistency between different replicas. So consistency checking is also easier with an LSM-Tree.

That being said, we chose LSM-Tree as the basic data structure of the storage engine; a B-tree-based engine such as InnoDB is not appropriate.

Then why not RocksDB? It is a very popular open-source storage engine. However, RocksDB cannot provide what we require for a distributed RDBMS.

First, schema. An RDBMS has schemas, and a row consists of many columns. In RocksDB, the schema is very simple: there are just keys and values.

Second, strong data types. Columns in an RDBMS are strongly typed: each column has a specific type. In RocksDB, values are opaque and have no column types at all.

Third, complex transaction processing. Transaction processing in an RDBMS can be very complex. A transaction may be very large, inserting or modifying millions of rows, and all of its modifications must be atomic. In RocksDB, transactions are very simple.

With all this considered, no open-source storage component can meet our requirements for a distributed RDBMS. We had to design a new storage engine from scratch.

Storage Engine Architecture

Here is the storage engine architecture.

The storage engine is based on LSM-Tree. Generally, data is divided into two parts: Memtable and SSTable. The SSTable is static and read-only. All mutations such as inserts, updates, and deletes are written to the Memtable.

In the Memtable, there are two kinds of structures: a B-tree-like structure with MVCC, which is ordered by primary key and supports table scans, and an in-memory hash index, which accelerates single-row get operations.
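
To make this concrete, here is a minimal Python sketch (not OceanBase code) of a Memtable that pairs an ordered key list, standing in for the B-tree-like structure, with a hash index for point gets; the class, its methods, and the version scheme are all illustrative assumptions.

```python
import bisect

class Memtable:
    """Illustrative Memtable: ordered keys for scans, hash index for gets."""

    def __init__(self):
        self._keys = []    # sorted primary keys (stands in for the B-tree)
        self._index = {}   # hash index: primary key -> version chain

    def write(self, key, value, version):
        if key not in self._index:
            bisect.insort(self._keys, key)         # keep keys in scan order
            self._index[key] = []
        self._index[key].append((version, value))  # MVCC: newest version last

    def get(self, key, snapshot):
        # A point get uses only the hash index, skipping the ordered structure.
        for version, value in reversed(self._index.get(key, [])):
            if version <= snapshot:
                return value
        return None

    def scan(self, lo, hi, snapshot):
        # A range scan walks keys in primary-key order.
        lo_i = bisect.bisect_left(self._keys, lo)
        hi_i = bisect.bisect_right(self._keys, hi)
        for key in self._keys[lo_i:hi_i]:
            value = self.get(key, snapshot)
            if value is not None:
                yield key, value
```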

In the SSTable, data is divided into blocks and is also ordered by primary key. When the size of the Memtable exceeds a certain threshold, an SSTable is generated by dumping the Memtable to disk. These SSTables are called mini SSTables, and this process is called mini compaction.

When the number of mini SSTables exceeds a certain threshold, these mini SSTables are compacted into a bigger SSTable, which is called a minor SSTable. And this process is called minor compaction.

Finally, major compaction compacts an old major SSTable, a minor SSTable, and multiple mini SSTables into a single new major SSTable.

When a SQL query arrives, we need to merge rows from both Memtables and SSTables and return the final result to the SQL layer. Caches are widely used to accelerate queries, including a row cache for single-row gets and a block cache for scans.
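
As a hedged sketch of this read path, the lookup below consults the Memtable first and then the SSTables from newest to oldest, so the newest version of a row wins; the dict-like components and the TOMBSTONE marker for deletes are illustrative assumptions.

```python
TOMBSTONE = object()  # illustrative marker for a deleted row

def lsm_get(key, memtable, sstables):
    """memtable and sstables are dict-like, ordered newest to oldest."""
    for component in [memtable, *sstables]:
        if key in component:
            value = component[key]
            return None if value is TOMBSTONE else value  # deleted row
    return None  # key not present in any component
```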

The Advantages of a Self-Developed Storage Engine

Designing a new storage engine is quite challenging. Compared with using open-source components, designing and developing a storage engine from scratch is much harder.

However, there are also some advantages because the storage engine is designed and implemented entirely by the team.

The first one is that we make the database much cheaper to use. Compared with a B-tree, the LSM-Tree structure is more compression-friendly and supports many general compression algorithms such as zstd and lz4. In addition to general compression, encoding is used to compress data further: because the schema information is accessible, encoding achieves a higher compression ratio. Here is an article explaining the encoding and compression technology in detail.

The second one is that we make the database much faster for both transactional and analytical workloads. The Memtable is optimized for transaction performance and the SSTable for analytical workloads.

Storage Engine Components

There are three major components in the storage engine: Compaction, DDL processing, and Backup & Recovery. Now let’s take a look at how each component works.

Compaction

Compaction is the basic operation in an LSM-Tree. But there are two major problems in compaction, which bring a lot of challenges.

1. Amplification

There are three kinds of amplification:

1) Write amplification. In the case of random modifications, an LSM-Tree brings large write amplification because entire SSTables need to be rewritten.

2) Space amplification. There may be many SSTables, and different SSTables may contain the same row, which results in space amplification.

3) Read amplification. If we read rows from an LSM-Tree, we need to read both the Memtable and the SSTables, and there may be multiple of each.

To reduce write and space amplification, we divide an SSTable into many fixed-size blocks called macroblocks.

A macroblock is 2 MB, fixed in size, and is the basic unit of a compaction write. Fixed-size macroblocks are very easy to manage, and the 2 MB size makes data writes very fast on both SSD and HDD disks.

However, 2 MB is too large for fast data reads. Therefore, a single macroblock contains many smaller blocks called microblocks. The size of a microblock is variable and can be adjusted by table parameters, ranging from 4 KB to 512 KB. A microblock contains some rows and can be compressed using the encoding and compression algorithms mentioned before. The microblock is the basic unit of SSTable reads.

In addition, unmodified macroblocks are reused directly without any extra operations, which reduces both write and space amplification.
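
The packing scheme below is an illustrative sketch of these two units: rows are grouped into variable-size microblocks, which are then grouped into 2 MB macroblocks. Only the sizes come from the article; the packing logic itself is an assumption.

```python
MACROBLOCK_SIZE = 2 * 1024 * 1024   # fixed: unit of compaction writes
MICROBLOCK_SIZE = 16 * 1024         # tunable per table, 4 KB .. 512 KB

def pack_sstable(encoded_rows):
    """Group encoded rows (bytes, already in primary-key order) into blocks."""
    microblocks, current = [], b""
    for row in encoded_rows:
        if current and len(current) + len(row) > MICROBLOCK_SIZE:
            microblocks.append(current)   # microblock: unit of SSTable reads
            current = b""
        current += row
    if current:
        microblocks.append(current)

    macroblocks, group, used = [], [], 0
    for block in microblocks:
        if group and used + len(block) > MACROBLOCK_SIZE:
            macroblocks.append(group)     # macroblock: unit of compaction writes
            group, used = [], 0
        group.append(block)
        used += len(block)
    if group:
        macroblocks.append(group)
    return macroblocks
```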

As for read amplification, the root cause is that there are too many SSTables, so reducing the number of SSTables is the remedy.

But how?

We have defined three kinds of SSTables: mini SSTable, minor SSTable, and major SSTable.

There may be several mini SSTables in an LSM-Tree. Basically, a mini SSTable is dumped from the Memtable when the size of the Memtable exceeds a threshold. Some mini SSTables may also be compacted into a new mini SSTable if the number of mini SSTables exceeds a threshold. This routine is called mini compaction, and its trigger can be adjusted by system parameters. Generally, the number of mini SSTables is 1 to 3.

When the size of the mini SSTables exceeds a given threshold, all mini SSTables are compacted into a minor SSTable. There is only ever one minor SSTable. Both mini and minor SSTables may contain uncommitted rows to support large transactions.

When the system load is low, or at a given time of day, the database performs major compaction to generate major SSTables. A major SSTable is a global data snapshot, so we can compare data checksums between different major SSTable replicas to make sure that the data of different replicas is exactly the same. We can also compare column checksums between the main user table and the index table to make sure that the index table is consistent with the main table.

The diagram below illustrates this process:

Generally, there are three levels of SSTables, corresponding to three kinds of compaction.
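
Here is a toy sketch of how the three compactions could chain together; the threshold constant and the dump/merge helpers are placeholders, since the real triggers are OceanBase system parameters.

```python
MAX_MINI_SSTABLES = 3  # illustrative; the real trigger is a system parameter

def dump(memtable):
    """Placeholder: freeze a dict-like Memtable into a sorted SSTable."""
    return sorted(memtable.items())

def merge(sstables):
    """Placeholder k-way merge; input listed oldest first, so newer rows win."""
    merged = {}
    for table in sstables:
        merged.update(dict(table))
    return sorted(merged.items())

def mini_compaction(levels, memtable):
    levels["mini"].append(dump(memtable))        # Memtable -> mini SSTable
    if len(levels["mini"]) > MAX_MINI_SSTABLES:  # too many minis: minor compaction
        levels["minor"] = [merge(levels["minor"] + levels["mini"])]
        levels["mini"] = []                      # only one minor SSTable remains

def major_compaction(levels):
    # At low load or a scheduled time, fold everything into one major SSTable,
    # which then serves as the global data snapshot.
    levels["major"] = [merge(levels["major"] + levels["minor"] + levels["mini"])]
    levels["minor"], levels["mini"] = [], []
```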

To further reduce read amplification, we have added several kinds of caches:

1) Row cache to optimize Get and MultiGet performance.

2) BloomFilter cache to optimize Insert, Get, and Scan performance.

3) Block cache to optimize all queries to reduce IO costs.
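
The sketch below shows how these three caches could short-circuit a single-row get; the cache names come from the article, but the lookup order and every method on sstable and the bloom filter are assumptions for illustration.

```python
def cached_get(key, row_cache, bloom_filter, block_cache, sstable):
    if key in row_cache:                    # 1. row cache: whole row, no IO
        return row_cache[key]
    if not bloom_filter.may_contain(key):   # 2. bloom filter: skip absent keys
        return None
    block_id = sstable.locate_block(key)    # find the microblock holding the key
    block = block_cache.get(block_id)
    if block is None:                       # 3. block cache miss -> disk IO
        block = sstable.read_block(block_id)
        block_cache[block_id] = block
    row = block.find(key)
    row_cache[key] = row                    # warm the row cache for next time
    return row
```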

2. Performance jitter

The compaction process consumes a lot of CPU and IO resources, which drags down the response time of user queries during compaction.

How to smooth the performance curve?

There are two methods to minimize the impact of compaction.

1) Cache prewarming. During compaction, we add the new microblocks generated by the compaction to the block cache, so that when the compaction completes, the block cache is not cold.

2) IO isolation. We classify IO into two categories: user IO and Sys IO. Blocking IO caused by user operations such as Insert, Update, Delete, and Select is classified as user IO. Blocking IO caused by background tasks such as compaction, migration, and backup is classified as Sys IO. We cap the IOPS of Sys IO to separate it from user IO.
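
A minimal token-bucket sketch of this "cap Sys IO IOPS" idea follows; the class and its numbers are illustrative, not OceanBase's actual IO scheduler.

```python
import time

class SysIoLimiter:
    """Token bucket refilled at iops_limit tokens per second."""

    def __init__(self, iops_limit):
        self.iops_limit = iops_limit
        self.tokens = float(iops_limit)
        self.last = time.monotonic()

    def acquire(self):
        """Block a background (Sys IO) request until an IO token is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.iops_limit,
                              self.tokens + (now - self.last) * self.iops_limit)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.iops_limit)

# User IO bypasses the limiter entirely; compaction, migration, and backup
# would call limiter.acquire() before each IO so their IOPS stay capped.
```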

DDL

Generally, DDL (Data Definition Language) operations can be categorized into two kinds depending on whether they need to touch data. DDL that does not need to touch data includes renaming a table, modifying a column name, modifying a column default value, relaxing a constraint from NOT NULL to NULL, etc. DDL that needs to touch data includes adding columns, adding indexes, adding a foreign key, modifying a column type, modifying the primary key, modifying partition rules, partition split, etc.

To deal with DDL that does not need to touch data, we only modify the inner schema table and notify all related partitions to refresh the schema, then the DDL is finished.

To handle DDL that needs to touch data, we divide it into two types: DDLs that can be delayed and DDLs that need to be executed instantly. DDLs that can be delayed include adding a column, modifying the row format, modifying the microblock size, modifying the compression algorithm, etc. DDLs that need to be executed instantly include adding an index, adding a foreign key, modifying a column type, modifying the primary key, modifying partition rules, partition split, etc.

To handle deferrable DDLs, we modify the schema instantly without touching any data and return success to the user as if the DDL had finished. Then we progressively modify the data during major compactions; for example, we rewrite only 10% of the total data during each compaction to reduce the impact of the DDL operation. When all data has been touched and modified, the DDL operation is actually finished.
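
As a sketch of this deferred rewrite, the loop below upgrades at most a fixed fraction of macroblocks per major compaction; the 10% figure is the article's example, while the block attributes and rewrite method are illustrative assumptions.

```python
REWRITE_FRACTION = 0.10  # the article's example: rewrite 10% per compaction

def progressive_ddl_step(macroblocks, target_schema_version):
    """Run during a major compaction; returns True when the DDL is done."""
    budget = max(1, int(len(macroblocks) * REWRITE_FRACTION))
    for block in macroblocks:
        if budget == 0:
            break
        if block.schema_version < target_schema_version:
            block.rewrite(target_schema_version)  # e.g. materialize a new column
            budget -= 1
    # The DDL actually finishes once every macroblock is on the new schema.
    return all(b.schema_version >= target_schema_version for b in macroblocks)
```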

It is much more complex to handle DDLs that must be executed instantly. Let's take index building as an example to explain why it is difficult.

First, index building needs to be online, not blocking other DML (Data Manipulation Language) operations.

Second, index building should be atomic; that is to say, it either succeeds completely or fails completely.

Third, index building should be highly available in a distributed system. In case of any single node failure, index building should continue to proceed.

Finally, index building should be fast. Because there are many partitions distributed across many servers, index building should be distributed. And for single large partitions, index building should be parallel to exploit multiple CPU cores.

How to make index building online and atomic?

There are three statuses for index building:

- no index

- index writable

- index available

Each status has a different schema version. Initially, there is no index.

When RootService receives the DDL to create an index, it modifies the inner schema table to record the newly created index and notifies all partition leaders of the new index schema. When a partition leader receives the new index schema, it writes the schema change to the Paxos log, which guarantees consistency among all partition replicas. Then all partition leaders start to write the index table in new transactions, and the index status changes to ‘index writable’, meaning the index is writable but not yet readable. We then wait for all old transactions using the old schema to finish. Once they have finished, we build the index based on the current data snapshot, and the index status changes to ‘index available’.
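
The status transitions above can be summarized as a small state machine; the enum mirrors the three statuses from the prose, and the transition conditions are a simplified reading of it.

```python
from enum import Enum

class IndexStatus(Enum):
    NO_INDEX = 0   # index does not exist yet
    WRITABLE = 1   # new transactions maintain the index; reads not allowed
    AVAILABLE = 2  # index fully built; queries may read it

def advance(status, leaders_logged_schema, old_txns_done, snapshot_built):
    if status is IndexStatus.NO_INDEX and leaders_logged_schema:
        return IndexStatus.WRITABLE    # schema change replicated via Paxos log
    if status is IndexStatus.WRITABLE and old_txns_done and snapshot_built:
        return IndexStatus.AVAILABLE   # index built from the data snapshot
    return status
```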

How to make index building highly available?

First, we record the index-building status in inner system tables. Inner-table transactions guarantee that this modification is ACID. In case of a node failure, RootService reschedules the index-building tasks according to the index status stored in the inner tables.

Second, we record index-building task information in inner system tables. All building tasks are reentrant and can be rescheduled by RootService in case of a node failure.

Third, we write index SSTable macroblocks to Paxos logs. Then followers can build the indexes by replaying Paxos logs.

How to make index building fast?

Similar to a SQL plan, we build an index-build plan to utilize the parallel SQL executors. But unlike normal SQL insert statements, we generate SSTable macroblocks directly instead of inserting rows into the Memtable, which significantly reduces memory and CPU costs. By utilizing SQL operators, index building is both parallel and distributed: we first sample data from the different partitions, then perform a distributed, parallel sort to generate the index SSTables.
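
Here is a hedged sketch of the "sample, then sort" idea: sampled index keys pick range boundaries, rows are routed to their ranges, and each range is sorted independently (in the real system, by different executors on different servers); all function names and parameters are illustrative.

```python
import bisect
import random

def pick_split_points(partitions, num_sorters, sample_size=1000):
    """Sample keys from every partition to choose range boundaries."""
    sample = []
    for part in partitions:
        sample.extend(random.sample(part, min(sample_size, len(part))))
    sample.sort()
    step = max(1, len(sample) // num_sorters)
    return sample[step::step][:num_sorters - 1]

def distributed_sort(partitions, split_points):
    """Route each key to its range, then sort ranges independently."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for part in partitions:
        for key in part:
            buckets[bisect.bisect_right(split_points, key)].append(key)
    # Each bucket would be sorted by a different parallel executor in practice.
    return [sorted(bucket) for bucket in buckets]
```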

Backup & Recovery

Although high availability is a native capability of a distributed database system, backup & recovery is still very useful, especially when there are unexpected misoperations on the database, such as accidentally deleting important data.

Generally, there are two kinds of backup and recovery strategies, logical backup and recovery, and physical backup and recovery.

Logical Backup & Recovery

In logical backup and recovery, we use SQL such as ‘select * from table’ to back up data to files and SQL such as ‘load data local infile xxx into table’ to recover data from files.

Advantages

The advantages of logical backup & recovery are obvious: it is rather simple and easy to develop.

Drawbacks

But it also brings many drawbacks:

1) It needs additional agents to execute these SQL statements.

2) It can only back up and recover to a single point in time.

3) A backup cannot span DDL operations.

4) Both backup and recovery are very slow.

Physical Backup & Recovery

In physical backup and recovery, we directly back up data blocks and redo logs to files, and recover data blocks from data files and redo log files.

Advantages

There are many advantages to physical backup and recovery:

1) It is very easy to use.

2) It is implemented inside the database.

3) It does not need additional agents to execute SQL.

4) It can recover the database to any given point in time, as long as there are enough backup data and log files.

5) Both backup and recovery are very fast.

Drawbacks

The only drawback of physical backup and recovery is that it is very complex to develop.

So what’s our choice? To make the database much easier to use, we developed physical backup and recovery.

Backup & Recovery on a Distributed Database

It is hard to implement physical backup and recovery. And it is even harder to implement one on a distributed database.

There are at least three requirements that need to be satisfied.

1) First, both backup and recovery need to be distributed and parallel. It needs to be distributed because there are many partitions distributed to many servers. And the backup and recovery of a single large partition need to be parallel to be more efficient.

2) Second, both backup and recovery need to be highly available. In case of a server failure, backup and recovery need to continue and complete successfully, and data that has already been recovered should be reused instead of being recovered again.

3) Third, data needs to be recovered to a global snapshot. All transactions that have not been committed need to be rolled back, and distributed transactions need to be handled very carefully to determine their final status.

So how to make backup and recovery distributed and parallel?

There are two kinds of data that need to be backed up and recovered: SSTables and redo logs. We back up the SSTables of different partitions in parallel, and for the SSTables of a single large partition, we split the work into multiple backup tasks so the backup proceeds in parallel. A redo log is written sequentially, but there are many different log streams, and different log streams can be backed up in parallel. Parallel recovery works similarly to parallel backup.
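
The sketch below turns this backup plan into independent, reentrant tasks, one per partition slice and one per log stream, run by a worker pool; the task shapes and slice size are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_backup_tasks(partitions, log_streams, slice_size=64):
    tasks = []
    for part in partitions:
        macros = part["macroblocks"]
        for i in range(0, len(macros), slice_size):   # split large partitions
            tasks.append(("sstable", part["id"], macros[i:i + slice_size]))
    for stream in log_streams:                        # one task per log stream
        tasks.append(("redo_log", stream["id"], stream["segments"]))
    return tasks

def run_backup(tasks, do_backup, workers=8):
    # Each task is reentrant, so a failed task can simply be resubmitted.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(do_backup, tasks))
```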

How to make backup and recovery highly available?

We store the backup and recovery status, as well as the backup and recovery tasks, in inner system tables. In case of a node failure, RootService can reschedule these tasks according to the status stored in the system tables. All backup and recovery tasks are designed to be reentrant, such that each task can be safely executed again and again.

How to recover to a global snapshot?

Generally, there are three steps.

In step one, we recover system tables. The SSTables of system tables are recovered to the cluster and the redo logs of system tables are replayed. Then we execute the system table upgrade operation to change the system table schema or insert some system table rows to handle upgrade compatibility.

In step two, we recover user tables. The SSTables of user tables are recovered to the cluster and the redo logs of user tables are replayed. Note that both SSTables and redo logs can be recovered in parallel. So the recovery process can be very fast. The speed mainly depends on the network bandwidth.

In step three, we do the cleanup work. We identify the status of the uncertain distributed transactions to commit or roll them back. Then we identify the status of DDL to roll back all unfinished DDL operations.

Future Work

There are still many features to be developed in the future.

Column store

Column stores are known to be better for OLAP performance. We expect better AP performance in future versions of OceanBase.

Auto Partition Split

Partitioning is the basic method to achieve scalability. Currently, table partitions need to be defined by the user, which is not very convenient. In future versions of OceanBase, we expect partition splits to be done automatically.

Online DDL

Currently, some DDLs cannot be executed online, such as modifying the data type of a column, changing the primary key of a table, and modifying the partition rules of a table. In the future, we expect these DDL operations to be executable online, which means that user DMLs will not be blocked while these DDLs are executed.

This is a brief introduction to the storage engine brought to you by Sam Zhao. If you have any questions, feel free to leave a comment below! Any technical discussion is warmly welcomed.
