The Evolving Directions of Distributed Database Technology
Han Feng, former Oracle ACE, and MVP of Alibaba Cloud, is an experienced database administrator (DBA) and database architect. As an expert in relational databases who also dabbles in technologies related to NoSQL and big data, Han has published two books: SQL Optimization Best Practice and Efficient Database Optimization.
Emerging technologies such as cloud computing, 5G, IoT, AI, and blockchain give rise to more data usage scenarios. A direct result is the exponential growth of data volume with an increasingly complex data structure.
Why distributed databases?
Based on third-party statistics, the volume of stored data around the world is expected to see explosive growth in a few years. In addition to the data size, databases, the carriers of data, also face higher requirements on the collection, storage, transmission, presentation, analysis, and optimization of data. Because databases must meet the increasingly higher requirements of enterprises on digital asset governance, data-driven value creation, scientific decision-making, data reliability assurance, data availability assurance, and online data analysis so that the enterprises can create more value from their data. However, conventional database architectures can hardly meet the requirements of ensuring the security and real-time processing of massive amounts of highly concurrent data. The high costs of database building also go against the trend of the digital era.
New distributed databases stand out from their conventional counterparts based on the following strengths:
1. Ultra-large storage capacity
With the conventional standalone or centralized architecture, the data storage capacity of a database depends on the size of local disks or external storage spaces. Although an external storage space can be large enough to hold petabytes of data, it takes a long time to build and it still faces the issue of I/O performance bottleneck, let alone its limited scalability and high O&M costs. A database with a distributed architecture, on the contrary, naturally supports data sharding and is therefore an ideal solution to ultra-large data storage.
2. High-performance computing
CPU and memory are crucial computing resources. In a database with a conventional standalone or centralized architecture, CPU and memory resources can only be scaled up. Limited scalability makes it impossible to integrate more computing resources with the database. On the contrary, a database with a distributed architecture provides stronger computing power by integrating more computing resources and is more competitive for applications that require high concurrency and high-performance computing.
3. Enhanced data analysis
Data analysis, one of the key purposes of data usage, was often performed by using technologies such as data warehouses. To some extent, an architecture that combines an online database and an offline data warehouse is acceptable for data analysis, but it has drawbacks in terms of real timeliness, data consistency, and costs. An ideal solution is to integrate these two parts into a single system. However, this solution does not work with a conventional architecture that has limited resources. For a database with a distributed architecture, which is highly flexible, more computing resources can be integrated to produce greater computing power that not only well handles hybrid workloads but also substantially improves the efficiency of data analysis with reduced data redundancy.
4. Higher availability and security
A database with a conventional architecture is often designed with redundant hardware to improve its availability, which depends largely on that of a single server or storage device and cannot achieve sufficiently high availability due to architectural limits. To ensure the data security of a database, general solutions are primary/standby replication and backup, and recovery. However, these solutions cannot ensure the security of online data in a conventional architecture because the replicas and backups can be recovered only during specified time windows. Featuring storage/computing separation, multiple replicas, and automatic scaling, a distributed architecture can effectively improve overall system availability and data security. Users can adjust a distributed architecture as needed to enhance its availability and security.
5. Optimized cost model for on-demand scaling
A conventional database can be a money incinerator, given the fact that a conventional architecture is easy to scale up but hard to scale out or in. Generally, users must plan for the maximum capacity when designing their databases or pay huge bills for higher reliability to cope with their rapid business growth. In comparison, a distributed database architecture naturally supports flexible scaling of storage and computing resources, providing users with multi-replica solutions that deliver high availability at lower costs. These strengths allow enterprises to effectively reduce the upfront investment in building their database systems, especially those for dealing with rapidly growing and changing business scenarios.
Three types of distributed databases
Since the 2012 paper about Google Spanner, distributed databases have been evolving roughly along three technical paths and can be classified into types.
Path 1: Distributed middleware + standalone database
This path mainly improves the scalability of computing and storage resources by modifying a standalone database system. The architecture consists of two layers. Based on sharding rules, the upper layer consists of a group of stateless computing nodes that parse SQL statements, forward requests, and combine results. The lower layer is an enhanced standalone database that provides storage and execution capabilities. This architecture implements near-linear scaling of the computing and storage capabilities based on data sharding at the logical layer.
Path 2: Distributed storage
This path implements scaling by sharing distributed storage among asymmetric computing nodes and is adopted by most public cloud-based databases. Databases on this path have limited scalability and rely on distributed storage engines to ensure cross-region data consistency. The shared storage supports data reads and writes across multiple nodes. The computing layer consists of a group of stateless nodes. When a writable computing node fails, the system maintains the high availability of data writes by automatically switching a healthy readable node into a write node.
Path 3: Native distributed architecture
This path leads to native distributed databases, with each computing node providing equivalent read and write services. Databases on this path are designed based on distributed consensus protocols and are fundamentally different from conventional databases. A native distributed database is an organic combination of distributed storage, transaction processing, and computing capabilities. In a native distributed database, data is stored in multiple replicas, which are automatically distributed. Consistency of data and transaction logs between these replicas is guaranteed by using a consensus protocol. Therefore, it better supports distributed transactions and global MVCC. The distributed architecture is entirely contained as a cluster and does not affect the operation of applications.
How do distributed databases evolve?
When distributed databases bring benefits to various business scenarios, distributed database architecture also faces many challenges. These challenges show the way that leads to the future of distributed databases.
1. Integration of native designs
As mentioned above, the evolution of distributed databases is divided into different paths, giving birth to a variety of products. However, it seems that those branches converge in the long run. In other words, market players are learning from each other and trying to strengthen their own products by integrating the advantageous features of others. On the one hand, distributed databases still have drawbacks in terms of basic capabilities in comparison to their standalone or centralized counterparts. On the other hand, rising user expectations of distributed capabilities must be backed by higher scalability.
2. Enhancement of basic capabilities
A standalone or centralized architecture outperforms a distributed architecture in many aspects that closely relate to the user experience. For example, the atomicity, consistency, isolation, and durability (ACID) of transactions are easily achievable for a standalone database but impose many challenges in a distributed environment. This is because a distributed database splits a transaction into jobs for processing on different servers. Ideally, the whole process is implemented by using a global consensus protocol. However, database sharding and two-phase transaction committing tend to introduce errors under some unexpected circumstances. Another example is that distributed architectures are generally built with storage/computing separation, a feature that naturally brings inter-layer network overhead and requires effort on latency reduction.
3. Scalability improvement
When distributed databases are rolled out in production environments, using them well becomes a key concern. For example, figuring out a better solution for intelligent sharding of distributed databases is a good point to start with. To cope with highly concurrent and massive amounts of data, a distributed database splits tables into partitions by sharding and keeps the data volume of each partition below the threshold. The point is we must do more to make sharding highly efficient. In addition, considering the novel architecture of distributed databases, it is reasonable for us to take a closer look at its features, such as elastic on-demand scaling, big data processing, and fine-grained multi-replica management.
4. HTAP design
Business scenarios of enterprise applications can be roughly classified into online transaction processing (OLTP) and online analytical processing (OLAP). In view of their different characteristics, many enterprises tend to deploy multiple database products to support these two scenarios separately. These combined solutions require data transmission from one database to another. Data synchronization inevitably causes latency and introduces the risk of data inconsistency, let alone the generation of redundant data. As a result, the rising costs slow down the business growth of enterprises. Distributed databases make it possible for enterprises to solve these issues, especially with the hybrid transaction/analytical processing (HTAP) capability that became popular in these years to break down the barrier between transaction and analytical processing. The HTAP capability is a must for a future-proof distributed database to handle highly concurrent transactions while taking good care of complex analytical queries, with zero interference between computing and I/O resources. When online transaction and analytical processing tasks do not affect each other, a single database is enough to meet various requirements of enterprise applications, thus significantly reducing costs while improving decision-making efficiency.
5. Cloud and cloud-native design
According to Gartner, 75% of all databases will be deployed or migrated to a cloud platform by 2022. Cloud migration is undoubtedly the future of the database industry. How to integrate databases with the cloud environment as a part of the IT infrastructure has become an industry-wide concern. Particularly, distributed databases require a lot of resources to build. When integrating a distributed database with a cloud, we must find ways to support flexible deployment, automatic scaling, and effective resource management. On top of that, we can think about how to better use cloud resources and build a true cloud-native system. Therefore, the design of a distributed database must adapt to the cloud environment to the highest level and be compatible with more cloud technologies, so that we can provide more cloud-based capabilities for resource control, multi-mode deployment, and cloud-native resource utilization.
6. High availability and data consistency
As the basic requirements for a database, high service availability and data consistency are two major factors considered by enterprises in database selection. In the era of digital transformation, more data streams are channeled to the business circulation of enterprises, imposing higher requirements for service availability. It is either overwhelming or costly for a conventional database to offer 24/7 service with zero data loss. A distributed database provides a better way to achieve high availability, thanks to its hierarchical multi-node architecture supported by a range of components. Features such as effective fault isolation, automatic error diagnosis, and self-healing also help greatly improve service availability. In addition, data consistency is guaranteed in a distributed database by its multi-replica feature, which is a breakthrough compared with a conventional database and can achieve data consistency at a finer granularity to meet different data consistency requirements in various business scenarios.
7. Hardware-software integrated heterogeneous design
Hardware and software are complementary core components of an information system. The emergence of new hardware can bring more benefits to the development of databases. On the one hand, basic hardware such as multi-core CPUs, heterogeneous computing units (such as GPU and FPGA), persistent memory, and high-speed network provides more possibilities for distributed architectures. On the other hand, new hardware also brings more challenges for database design. Database vendors must find ways to make good use of the new hardware, which may give birth to revolutionary design models. As a key part of the information infrastructure, databases must also support the canary release of operating systems and processing units and support heterogeneous processing units in key industries and software fields. In this way, digital solutions can be more accurate to reduce risks.
8. Low-cost intensive design
Built on top of a new architecture, a distributed database can cause considerable costs to an enterprise. From the perspective of management, the distributed architecture poses new challenges to O&M engineers and good management is key to database performance. Whether the product provides complete management capabilities and eco-tools will directly affect the service quality. From the perspective of resources, given the investment in resources to build a distributed architecture, it is important to plan well and design a solution that supports tenants and other features to effectively reduce O&M costs.
9. High compatibility and easy migration
The replacement of the underlying databases is perhaps the worst pain in the neck for enterprises, because they may have built countless internal business systems over years for information-driven innovation and development. Conventional database products for enterprises provide powerful capabilities to help developers quickly and easily build applications. However, such applications rely heavily on databases. To fit these applications into new database products, developers must make drastic code modifications. No two databases are exactly the same, and this is especially the case when it comes to distributed databases, whose underlying architecture and implementation logic are bound to be different. Making databases highly compatible is a better way to significantly reduce the costs of code modification.
At present, most distributed database products are not fully compatible with the ecosystems of mainstream databases. In other words, more features of distributed databases must be compatible with the mainstream ecosystems, and the degree of compatibility must be improved.
In addition, considering the special requirements in the design of a distributed architecture, it is important to reduce R&D costs and figure out how to shield the differences without causing any noticeable impact.
Moreover, data migration from a conventional centralized database to a distributed database is a complex and huge project. From the pre-migration compatibility assessment and application modification to the in-progress business and performance testing and then to the post-migration consistency verification, every step of the process requires system-wide support, which is a vulnerability of most distributed products on the market.
Hopefully, next-generation distributed databases will provide all-around reliable migration capabilities that meet the requirements of various enterprises.