Redefining High Availability: Lessons from Data Center Outage in Singapore

OceanBase Database
5 min read · Oct 18, 2023


Written by Laura Neo from OceanBase

We’ve all heard the saying that “time is money,” but when it comes to modern businesses, especially in the digital age, that phrase couldn’t be truer. This was demonstrated on October 14, 2023, when a data center outage caused banking and network disruptions across the city-state of Singapore. Financial institutions and consumers were brought to a standstill due to this incident. The outage disrupted vital services, leaving many wondering: What’s the true cost of downtime, and how can we prevent it?

In this article, we’ll delve into the key insights from Senior Technical Expert Liu Hao’s presentation at the first OceanBase Developer Conference in Beijing. This presentation sheds light on OceanBase’s journey toward achieving a Recovery Time Objective (RTO) of less than 8 seconds and how it can redefine the concept of high availability.

OceanBase’s Senior Technical Expert, Liu Hao, presenting at the OceanBase Developer Conference

The True Value of Continuous Availability

The recent outage, which disrupted financial services in Singapore, left businesses grappling with a situation none of them can afford: downtime. In today’s world, business downtime is no longer a minor inconvenience; it’s a massive economic risk. The outage served as a stark reminder of the critical need for continuous availability:

  • Downtime translates to substantial financial losses. For some OceanBase customers, a single hour of downtime could result in losses ranging from tens of thousands to millions of dollars.
  • It also affects brand reputation. Internet-based services rely heavily on databases, and any interruption erodes trust among users.
  • Beyond financial impacts, downtime disrupts the functioning of vital IT-dependent social systems.

RPO and RTO: Understanding Recovery

Every business must grapple with two important parameters when it comes to disaster recovery: Recovery Point Objective (RPO), the maximum amount of data, measured in time, that may be lost after a failure, and Recovery Time Objective (RTO), the maximum acceptable time to restore service. The journey toward RTO < 8 seconds is a pursuit of excellence, a path filled with intricate challenges.
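To make the two parameters concrete, here is a minimal sketch that computes the data-loss window and the recovery time for a single incident and checks them against targets. The `Incident` class, its field names, and the timestamps in the usage note are hypothetical and only illustrate the arithmetic; they are not part of OceanBase.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Timeline of one outage, in seconds (hypothetical fields for illustration)."""
    last_durable_write: float  # last write that survived the failure
    failure: float             # moment the failure occurred
    service_restored: float    # moment requests succeed again


def achieved_rpo(incident: Incident) -> float:
    """Data-loss window: time between the last surviving write and the failure."""
    return max(0.0, incident.failure - incident.last_durable_write)


def achieved_rto(incident: Incident) -> float:
    """Recovery time: how long the service stayed unavailable."""
    return incident.service_restored - incident.failure


def meets_objectives(incident: Incident, rpo_target: float, rto_target: float) -> bool:
    """True if data loss is within the RPO target and recovery is faster than the RTO target."""
    return achieved_rpo(incident) <= rpo_target and achieved_rto(incident) < rto_target
```

For example, an incident with no lost writes and service back after 6.5 seconds satisfies a target of RPO = 0 and RTO < 8 seconds, while one that loses 5 seconds of writes and takes 30 seconds to recover does not.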

The Evolution of Database Disaster Recovery

Customer expectations for 24/7 service availability have never been higher. Moving from traditional mainframe-based architectures to modern architectures built on the Paxos protocol, such as OceanBase’s, has become a necessity. OceanBase’s design is meant to push the limits of high availability.

Comparison between traditional mainframe-based architecture (left) and OceanBase’s modern architecture (right)

Diverse Customer Demands for RTO

Different sectors have diverse requirements when it comes to Recovery Time Objectives:

  1. Alipay: OceanBase was designed to ensure high availability for core financial services like Alipay, eliminating data loss and minimizing downtime.
  2. Large Financial Institutions: Transitioning from mainframes to PC servers requires addressing reliability challenges to maintain high availability.
  3. Cloud Service Providers: Serving customers with varying storage capabilities across different cloud platforms necessitates consistent service continuity.

Challenges in Achieving RTO < 8s

Reducing RTO to less than 8 seconds is no small feat. It’s not just about numbers; it’s about the customer experience. The main challenges and design goals are:

  1. RTO < 8s requires addressing various scenarios such as process failures, network partitioning, and storage issues in both on-premises and cloud deployments.
  2. A business-centric approach extends beyond database recovery to include application recovery, load balancing, and other components.
  3. Zero Parameter Tuning: OceanBase aims to provide RTO < 8s without requiring extensive parameter tuning.
  4. Peer-to-Peer Nodes: Every database server within OceanBase offers multiple services, ensuring that any node failure can be handled.
  5. Multi-Cloud Deployments: Expanding capabilities beyond PC server deployments to support multi-cloud and cross-cloud scenarios.

Fast and Accurate Fault Detection

The road to RTO < 8 seconds starts with fast and accurate fault detection. OceanBase has significantly reduced fault recovery units, resulting in a more efficient recovery process through:

  • Redesign of core election and consensus protocols for stable RTO.
  • Adoption of message-driven relative time for elections.
  • Shortened election lease time.
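The idea of a lease measured in relative time can be sketched as follows. This is a simplified illustration, not OceanBase’s election protocol: a follower renews the leader’s lease each time a leader message arrives, measuring from its own monotonic clock rather than comparing wall-clock timestamps across machines. The `LeaseTracker` class and the lease length in the usage note are assumptions made for the example.

```python
import time


class LeaseTracker:
    """Follower-side view of a leader lease (hypothetical sketch).

    The lease is renewed on message receipt and measured on the local
    monotonic clock, so clock skew between nodes does not affect it.
    """

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._expires_at = None      # no lease until the first leader message

    def on_leader_message(self):
        """Renew the lease relative to the local time of receipt."""
        self._expires_at = self._clock() + self.lease_seconds

    def leader_suspected(self):
        """True if no valid lease exists, i.e. a new election may start."""
        if self._expires_at is None:
            return True
        return self._clock() >= self._expires_at
```

With a hypothetical lease of 4 seconds, a follower that stops hearing from the leader suspects it 4 seconds after the last message. A shorter lease detects failures sooner, at the cost of more frequent renewals and a higher risk of spurious elections on a slow network.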

Rapid Database Recovery

Recovering a database after an outage is a complex task. It involves real-time parallel replay of the primary node’s writes on follower nodes, intelligent fault detection mechanisms, and decentralized, dynamic lookup of location information.
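One common way to replay a log in parallel is to partition entries by key, so that writes to the same key stay in order while different keys are applied concurrently. The sketch below shows that general technique under simplifying assumptions (a key-value state, a log of `(lsn, key, value)` tuples); the function name and log format are hypothetical, and the presentation does not describe OceanBase’s actual replay engine at this level of detail.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def replay_parallel(log, workers=4):
    """Replay (lsn, key, value) entries given in commit order; return the final state.

    Entries are partitioned by key, so each key's writes are applied in
    log order by exactly one worker.
    """
    partitions = defaultdict(list)
    for lsn, key, value in log:
        partitions[hash(key) % workers].append((lsn, key, value))

    def apply(entries):
        # Apply one partition sequentially; later writes overwrite earlier ones.
        state = {}
        for _lsn, key, value in entries:
            state[key] = value
        return state

    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Partitions hold disjoint key sets, so merging cannot conflict.
        for partial in pool.map(apply, partitions.values()):
            merged.update(partial)
    return merged
```

Because a follower that replays continuously is already close to the primary’s state when a failure occurs, it has little left to catch up on before it can take over.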

Helping Business Recovery

Once data is restored, what about the business itself? There’s more to it than just the database. OceanBase has enhanced OBProxy, making it more responsive and agile, capable of making real-time decisions to ensure seamless business continuity.

  • Enhanced OBProxy with connection-based health checks to reduce false positives.
  • Real-time blacklisting and whitelisting of faulty nodes.
  • Ensuring rapid detection and reintegration of nodes into the cluster.
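The blacklisting behavior described above can be sketched roughly as follows. This is an illustration of the general pattern, not OBProxy’s implementation: a node is blacklisted only after several consecutive failed connection checks, which reduces false positives, and it is routable again as soon as a check succeeds. The `NodeHealth` class and the threshold value are assumptions made for the example.

```python
class NodeHealth:
    """Proxy-side health tracking with a blacklist (hypothetical sketch)."""

    def __init__(self, nodes, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self._failures = {node: 0 for node in nodes}  # consecutive failures per node
        self.blacklist = set()

    def report(self, node, connected):
        """Record the result of one connection-based health check."""
        if connected:
            # A successful check reintegrates the node immediately.
            self._failures[node] = 0
            self.blacklist.discard(node)
        else:
            self._failures[node] += 1
            if self._failures[node] >= self.failure_threshold:
                self.blacklist.add(node)

    def routable(self):
        """Nodes that may currently receive traffic, in registration order."""
        return [node for node in self._failures if node not in self.blacklist]
```

Requiring consecutive failures trades a little detection latency for stability: a single dropped connection does not remove a healthy node from rotation.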

Reflecting on the recent data center outage that disrupted banking and network services in Singapore, it’s clear that disaster recovery solutions and high availability remain a top priority. This incident exposed the vulnerabilities in our digital infrastructure, especially for essential services like banking.

OceanBase’s journey towards achieving an RTO < 8 seconds serves as a practical example of the continuous efforts to enhance service reliability. It’s not just about meeting a specific timeframe; it’s about minimizing disruptions and ensuring a smooth customer experience during unforeseen downtime.

This outage reinforces the importance of investing in advanced disaster recovery and business continuity strategies. In today’s interconnected world, the demand for 24/7 services is non-negotiable. Traditional mainframe-based architectures are being replaced by more flexible and resilient solutions, and businesses must adapt to ever-increasing expectations.

OceanBase, trusted by leading fintech companies such as Alipay, DANA, and GCash, remains dedicated to pushing the boundaries of high availability and delivering an unparalleled user experience. Stay tuned for more updates on how OceanBase’s innovative solutions are reshaping the world of database management to build a more resilient and reliable future.

Alipay merchant
