OceanBase: Architecture Overview and Key Concepts

OceanBase Database
Dec 14, 2022

This article is the transcript of a tech talk, OceanBase Introduction, by Charlie Yang, CTO of OceanBase. Charlie joined the company in 2010 as the first developer of OceanBase and has been leading OceanBase for the past 12 years.

The 1st Principle of OceanBase’s Architecture Design

Traditional monolithic databases have full SQL functionality, and their single-node performance is very high. But they do not support some features that are very useful in the cloud environment, such as high scalability and high availability.

On the other hand, distributed storage systems offer features that are useful in the cloud, such as high scalability and high availability, but they do not support full SQL functionality. Many of them are just key-value stores, typically called NoSQL systems, with only limited SQL support. There are also some distributed SQL databases with full SQL functionality and high scalability, but their single-node performance is very low compared to a database like MySQL.

The design goal of OceanBase is a distributed SQL database with full SQL support and high single-node performance.

The Overall Architecture of OceanBase

The overall architecture of OceanBase

In OceanBase, each cluster consists of several zones in one or multiple regions. There is also a component called OBProxy.

OBProxy routes requests to OBServers. When it receives a SQL query from the user, OBProxy parses the query and then sends it to the appropriate OBServer according to the location info of the partition. Each OBServer is similar to a classical RDBMS: it compiles SQL statements to produce SQL execution plans.

There are three types of SQL execution plans: local plan, remote plan, and distributed plan.

One OBServer is elected to host the root service, which does the global management. This is different from other distributed storage systems, where an independent process usually holds the root service; in OceanBase, the root service is integrated into an OBServer.

Redo logs are replicated among the zones using Paxos. There are two kinds of transactions: if a transaction touches just one partition, it is executed locally; if it touches multiple partitions, no matter whether they are on one server or across servers, it is executed using 2PC (two-phase commit) to achieve an atomic distributed transaction.

Basic Concepts to Understand OceanBase

Each cluster in OceanBase has multiple zones. You can think of a zone as an availability zone. Mostly it is just a data center, and each zone has multiple OBServers.

There is a concept called a resource pool. Each resource pool has multiple resource units, and each unit is hosted on exactly one OBServer. OceanBase has a multi-tenant architecture: the cluster's resources are divided into multiple resource pools owned by tenants.

You can think of a tenant as an independent MySQL instance, like an RDS instance in the cloud. Each tenant has its own databases, each database has its own tables, each table has its own partitions, and each partition can have several replicas. So in OceanBase, resource isolation is done internally by the database, not by virtual machines, Docker, or some other mechanism at the operating-system level.
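To make the tenant and resource pool concepts concrete, here is a rough sketch of how a tenant is carved out of cluster resources. The unit, pool, and tenant names are made up, and the exact parameter names and required options vary between OceanBase versions, so treat this as illustrative only.

```sql
-- Hypothetical names; parameter syntax differs between OceanBase versions.
CREATE RESOURCE UNIT unit_4c8g MAX_CPU 4, MEMORY_SIZE '8G';

CREATE RESOURCE POOL pool_demo
  UNIT = 'unit_4c8g', UNIT_NUM = 1, ZONE_LIST = ('zone1', 'zone2', 'zone3');

CREATE TENANT tenant_demo RESOURCE_POOL_LIST = ('pool_demo');
```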

Partition

Partition in OceanBase

OceanBase supports two-level partitioning: for example, Hash partitioning as the first level and Range partitioning as the second level, or vice versa.
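As a rough illustration, the sketch below uses MySQL-compatible syntax (OceanBase's MySQL mode) to create a table that is Range-partitioned at the first level and Hash-partitioned at the second level. The table and column names are made up.

```sql
-- Hypothetical orders table: Range partitions by date, Hash subpartitions by user.
CREATE TABLE orders (
  order_id   BIGINT NOT NULL,
  user_id    BIGINT NOT NULL,
  order_date DATE   NOT NULL
)
PARTITION BY RANGE COLUMNS (order_date)
SUBPARTITION BY HASH (user_id)
SUBPARTITIONS 4 (
  PARTITION p2022 VALUES LESS THAN ('2023-01-01'),
  PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
```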

There is another concept, the partition group, in OceanBase. The corresponding partitions of the tables in the same table group form a partition group, and each partition group is located on the same server.

For example, suppose each user has items and an account. You can add the item table and the account table into one table group. Then the corresponding partitions of the two tables form a partition group and are located on the same server, which avoids distributed transactions. A meta table is used to maintain the location info of each partition replica. The underlying mechanism of the meta table is almost the same as that of a user table.
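Going back to the item/account example, the sketch below shows roughly how the two tables could be put into the same table group and partitioned the same way, so that a user's account partition and item partition end up in the same partition group. The names are made up, and the exact TABLEGROUP syntax varies between OceanBase versions.

```sql
-- Hypothetical schema; TABLEGROUP syntax differs between OceanBase versions.
CREATE TABLEGROUP tg_user;

CREATE TABLE account (
  user_id BIGINT NOT NULL,
  balance DECIMAL(16, 2) NOT NULL,
  PRIMARY KEY (user_id)
) TABLEGROUP = tg_user
  PARTITION BY HASH (user_id) PARTITIONS 16;

CREATE TABLE item (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  PRIMARY KEY (user_id, item_id)
) TABLEGROUP = tg_user
  PARTITION BY HASH (user_id) PARTITIONS 16;
```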

Because the meta table works almost like a user table, it is also scalable in OceanBase. The root service performs load balancing, such as partition movement, partition replication, and leader switching. When a server crashes, the root service detects the event by heartbeat. If the server restarts immediately, the root service simply adds it back to the cluster. But if the server stays down for a very long period of time, for example more than two hours, the root service deems that server permanently down and re-replicates its partitions to make sure there are always enough replicas for all partitions.

Paxos-based Replication

OceanBase uses Paxos for redo log replication. Paxos is a quorum-based consensus protocol: each redo log entry must be replicated to a majority of the replicas, that is, two out of three replicas or three out of five replicas. Paxos is used to achieve high availability. The Recovery Point Objective (RPO) is zero, and the Recovery Time Objective (RTO) is less than 30 seconds, which means that in case of failure, OceanBase can recover automatically in less than 30 seconds without data loss.

We also do data consistency checks. At the transaction level, OceanBase verifies the accumulative checksum of each transaction, so software bugs in the transaction processing or concurrency control module can be detected automatically. At the replica level, OceanBase verifies the checksum of each replica after major compaction, typically once a day. And at the disk level, OceanBase verifies the checksum of each block every time it reads from or writes to disk.

Distributed Transaction

Let’s dive into the distributed transaction. OceanBase uses 2PC to achieve the atomic distributed transaction.

Suppose we have a transaction: user Alice wants to transfer ten dollars to Bob. First, we start a transaction, then subtract 10 dollars from Alice, then add 10 dollars to Bob. Finally, we commit the transaction.
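In SQL, the transfer looks like the sketch below; the account table and its columns are hypothetical.

```sql
-- Hypothetical account(user_name, balance) table.
BEGIN;
UPDATE account SET balance = balance - 10 WHERE user_name = 'Alice';
UPDATE account SET balance = balance + 10 WHERE user_name = 'Bob';
COMMIT;  -- if Alice's and Bob's rows are in different partitions,
         -- this COMMIT triggers the two-phase commit described below
```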

When we commit the transaction, OceanBase will initiate a two-phase commit process in the background, that is, do Prepare as the first phase and do Commit as the second phase.

There are two important problems.

The first one is GTS (Global Timestamp Service). As you know, every database must assign each transaction a unique, ordered identifier, and in a distributed database this identifier has to be global. So in OceanBase, each OBServer obtains global timestamps from the GTS. To make sure that GTS does not become the bottleneck of the system, OBServers send GTS requests in batch mode to reduce the number of requests. The second one is that we improve the two-phase commit algorithm to reduce the latency of distributed transactions.

In traditional 2PC, after the Prepare phase we need to wait for the coordinator to write a global commit log to make sure that the global status of the transaction is persisted. But in OceanBase's 2PC, we can respond to the user immediately after the Prepare phase. The trade-off is that, in case of failure, since no global commit log has been written, we cannot get the global status of the transaction immediately; the coordinator has to collect the local status from all participants before it can decide whether the transaction should be committed or rolled back.

Isolation Level

The isolation level of OceanBase is snapshot isolation. We use Multi-Version Concurrency Control (MVCC) to make sure that read queries are not blocked by write queries. There is a famous problem in snapshot isolation called the Write Skew problem.

For example, suppose that we have two rows, A=1 and B=1, and two concurrent transactions, transaction one: A=B+1, and transaction two: B=A+1. Under snapshot isolation, both transactions may read from the old snapshot, so the final result can be A=2 and B=2. This behavior is very strange because it is not equivalent to any sequential execution of these two concurrent transactions (a serial execution would give A=2, B=3 or A=3, B=2). So in OceanBase, we use row locks to avoid this kind of Write Skew problem. A user can run a statement like “SELECT … FROM some table … FOR UPDATE” to take an exclusive row lock on the row being read, and then the Write Skew problem is solved.
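Here is a minimal sketch of the write skew scenario and the row-lock fix, using a hypothetical table kv(k, v) that starts with rows ('A', 1) and ('B', 1).

```sql
-- Transaction one (A = B + 1)         -- Transaction two (B = A + 1), concurrent
BEGIN;                                 -- BEGIN;
SELECT v FROM kv WHERE k = 'B';        -- SELECT v FROM kv WHERE k = 'A';  (both read 1)
UPDATE kv SET v = 2 WHERE k = 'A';     -- UPDATE kv SET v = 2 WHERE k = 'B';
COMMIT;                                -- COMMIT;
-- Both commit under snapshot isolation: A = 2 and B = 2, which matches no serial order.

-- Locking the row that is read breaks the write skew:
BEGIN;
SELECT v FROM kv WHERE k = 'B' FOR UPDATE;  -- exclusive row lock on B
UPDATE kv SET v = 2 WHERE k = 'A';
COMMIT;
```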

Linearizability

There is another problem, linearizability, which I think is very important in distributed SQL databases.

I will just give you an example to show what it means. Suppose Alice runs a SELECT query on some table and has not received the response yet. Then I run a query to insert one row into that table, and Bob runs another query to insert another row into that table. Both transactions are committed, and then Alice's query returns. Maybe she gets Bob's row, but not mine. This behavior is very strange, because my transaction happened before Bob's, yet Alice sees only Bob's transaction and not mine.

This problem is called a violation of linearizability. In OceanBase, we use the GTS (Global Timestamp Service) to make sure that each transaction gets a globally ordered timestamp, which avoids the linearizability problem. Google Cloud Spanner uses a very famous mechanism called TrueTime for the same purpose, but TrueTime has some cons.

Each time a transaction commits, Spanner needs to wait for about seven to ten milliseconds. That is the downside of Google Spanner. CockroachDB uses another mechanism called HLC, a Hybrid Logical Clock. You can think of HLC as a combination of a logical clock and a physical clock. The physical part is very similar to an NTP clock value, but the HLC value is still not very accurate: there may be a time deviation, which means that if the timestamps of two events are very close, we don't know which one happened first. So HLC cannot fully solve the linearizability problem. If you want more details, I think you can read the CockroachDB blog.

Storage Engine

OceanBase uses an LSM-Tree storage engine. We have a memory table (MemTable) and SSTables, and we have two types of compaction: minor compaction and major compaction. Minor compaction just dumps the memory table into an SSTable.

Major compaction merges the major SSTable and several minor SSTables, or maybe some memory tables, into one single SSTable. OceanBase maintains two indexes in the memory table, a B-Tree index and a Hash index; the Hash index is used to accelerate single-row gets. An SSTable is divided into data blocks.

We have two kinds of data blocks: macroblocks and microblocks. A macroblock is mostly 2 MB and is the unit of data writes in OceanBase. Only modified macroblocks need to be rewritten during the compaction process, which I think is very different from other LSM-Tree-based storage engines. For example, in RocksDB, the whole SSTable has to be rewritten during compaction.

A microblock is like a data page or data block in other databases; it is mostly only 8–512 KB and is the unit of data reads. OceanBase applies encoding and compression to microblocks each time it reads from or writes to disk. OceanBase has two kinds of cache, Block Cache and Row Cache.

Check the detailed explanation of the storage engine here.

SQL Engine

This is the architecture of the SQL engine. OceanBase has a Parser, Resolver, Query Transformer, Optimizer, and Code Generator, which I think is very similar to other databases. We use a System R-like cost-based optimizer to do query transformation and query optimization. There is a component called Plan Cache: each time OceanBase receives a query from a user, it tries to match an existing plan in the Plan Cache. Vectorized Execution and Parallel Execution are used for big OLAP queries.

So I will just give you an example of parallel execution for a big query. For example, we want to join two tables, t1 and t2, with a filter. From the execution plan, you can see that we scan the two tables, use an operator called the exchange operator to shuffle the data to several machines with a Hash-Hash distribution, do the join on each machine, and finally merge the results.
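A hedged sketch of such a query is shown below, using hypothetical tables t1(c1, c2) and t2(c1, c2); the PARALLEL hint asks the optimizer for a parallel plan, and the exact hint syntax and plan output differ between OceanBase versions.

```sql
-- Hypothetical big query; EXPLAIN shows the parallel plan with EXCHANGE operators
-- that hash-shuffle rows between servers before the join and the aggregation.
EXPLAIN
SELECT /*+ PARALLEL(4) */ t1.c1, COUNT(*)
FROM t1 JOIN t2 ON t1.c1 = t2.c1
WHERE t2.c2 > 100
GROUP BY t1.c1;
```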

In OceanBase's parallel execution, this very important exchange operator is used for data transfer, data shuffle, and data exchange between servers. OceanBase has implemented several kinds of distributed join methods, such as partition-wise join, partial partition-wise join, broadcast join, merge join, Bloom-filter join, and so on.

Check the detailed explanation of the SQL engine here.

Online Schema Change

In OceanBase, we support online schema change. We use multi-version schemas to allow the rollout of a new schema while the old schema is still in use. When a DBA does a DDL operation, such as adding a column or dropping a column, OceanBase makes sure it will not affect the online business.

There is another mechanism called SQL plan management. After a DDL operation, the optimizer may generate a very bad plan, so OceanBase does not switch traffic to the new plan immediately. We switch traffic to the new plan step by step: 1% at first, then 5%, 20%, and so on. Once the new plan has been proven to be better than the old plan, we replace the old plan and old baseline with the new plan and new baseline.
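As a small illustration of the online DDL described above, adding a column to the hypothetical account table used earlier publishes a new schema version without blocking reads or writes on the table.

```sql
-- Hypothetical online DDL; the table keeps serving traffic while the change rolls out.
ALTER TABLE account ADD COLUMN nickname VARCHAR(64) DEFAULT NULL;
```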

Multitenancy

OceanBase has a multi-tenant architecture in which one system tenant provides resources for system maintenance and management. There are also several ordinary tenants in OceanBase, and resource isolation is done internally to avoid resource conflicts between tenants. Three kinds of resources are isolated: CPU, memory, and disk.

“Five Data Centers in Three Cities” Solution

At Alipay, we use an architecture called “five data centers in three cities” to achieve city-level disaster recovery. We do unitization at the middleware layer: all users are divided into several units, and you can think of a unit as a cell. Each unit, or cell, can be hosted by one independent tenant in OceanBase to achieve unlimited scaling, and tenants can be hosted in different cities. If one city crashes completely, a majority of the replicas are still alive in the other two cities, so OceanBase can recover automatically using Paxos in less than 30 seconds without data loss.

If you are interested in OceanBase, feel free to leave a comment below!
