HULK: Building Near Real-Time Data Pipeline For Democratization of Data

Cashfree Payments has seen exponential growth in the last couple of years. 

In fact, the number of merchants on our platform has increased by 150% since 2020! What’s more, we make an average of 20 million API requests every day.

However, this growth brings a plethora of challenges.  

Now, one of those biggest challenges was making our data available to everyone in the Cashfree Payments team

And that too – with low latency and ensuring 100% consistency.

Needless to say, we had quite a big task ahead of us. But thankfully, we have a knack for these things!

This is our story of building a scalable near real-time data pipeline for the democratization of data. 

And it starts with our motivations for creating it.

P.S. We codenamed our new data warehouse pipeline HULK. Read on to know why! 

Why Did We Need A Data Warehouse?

Well, a quite a few reasons:

  • We required near real-time data analytics
  • Need to put data in the centre of decision making
  • Remove dependency on MySQL replicas (and use them for their intended purpose: read-heavy queries for services)
  • Reducing costs; for every RDS instance, we had to provision additional replicas to support our ever-growing analytics requirements.
  • Improving the availability of data for all the teams

As of then, we were working exclusively with MySQL read replicas as our primary data source for analytics and data processing.

But now that we knew what we wanted out of our upcoming data warehouse, we had a clearer idea of how to start.  

Context To Building the Data Warehouse: How We Started Off

Cashfree started as a monolith built with PHP, with all data residing in Primary RDS managed MySQL instances. 

As we scaled our teams and offerings, the components were split into several services and thus multiple database instances. 

However, we faced a lot of issues while doing that. 

  • No capability to join tables; information across multiple databases.
  • Scaling big queries on MySQL replicas is hard, given production databases might not have additional indexes, etc required for analytics queries.
  • Some analytics query could trigger full table scans. This could lead to Replica scanning billions of rows resulting in high replication lag. 
HULK: Building Near Real-Time Data Pipeline For Democratization of Data

We knew that we needed an out-of-the-box solution to meet our objectives.

Looking For A Solution: Our Experiments with Platforms

We began experimenting with CDC platforms and Kafka connectors in late 2020 to move from RDS to RedShift for warehouse and analytics across Cashfree. 

During peak times, one of our primary databases generates 1M+ updates per minute. This leads to a similar number of database events being sent. There are other systems that perform reconciliations, bank statements etc. They generate a surge of even higher updates. 

Most of the systems we experimented with were not able to flexibly handle spikes (including data migration solutions and different Kafka Sink providers.)

This resulted in high latency for data availability in RedShift, as high as 10-12 hours

Another major problem with existing solutions was this:

Ability to replay or restore batches. 

RedShift does not apply any constraints on primary keys (since it has none) on the incoming data. 

This can cause issues if batches fail, or are replayed for some reason. Since existing tools mostly apply messages from CDC, an insert event delivered twice would result in a double entry for the same record. 

This can be very risky, specifically for us at Cashfree Payments since we use the same source for financial reporting and risk analysis. 

So, here is how we decided to go about this. 

We split ourselves into two groups: 

  1. One focuses on finding the right available choice that can help us gain low latency and predictable sync.
  2. And the other group focuses on building a simple service to help us achieve our goals.

Our goals being:

  • Ability to handle millions of updates on multiple databases
  • Near real-time; we set aggressive goals of under 5 minutes for our busiest database. 
  • Keeping costs lower than existing solutions 
  • Handling schema migrations and database version changes 
  • Providing replayable and idempotent batches 

After spending a few weeks iterating on various options, we were ready with the first version of the service. 

We codenamed it HULK and let’s just say, the name was quite fitting!

Introducing HULK: How It Met Our Goals

In our current setup, we have multiple MySQL databases, connected to the CDC platform which sends the updates on Kafka cluster.

From the cluster, HULK instances read the updates, batches and send them to Redshift using the COPY command. 

data warehouse high level overview

Speaking of HULK, let’s have a look at how it works. 

And most importantly, why it worked for us.

Internal Architecture of HULK, Our Data Warehouse

Our aim was to build something that adds value, saves us the cost and time of monitoring/scaling. 

Naturally, we had to ensure that we were moving in the right direction- every step of the way.

So, we decided to stick to the following principles.  

First Principles

  • Automation: The time unit tests save and allow fast iteration is always underrated. This helped us move fast with confidence.
  • Metrics: Observability is critical, especially for systems such as HULK. We started with a dashboard first and then made sure we added similar metrics while building it. It gave us visibility of systems at runtime.
  • Fail-Fast: We didn’t want to make any assumptions. After all, it’s better to fail instead of being wrong.

We experimented with SQL libraries and updates to only realise Redshift does not work well with single statement updates. 

Hence, we focused on batches. Specifically on how to run batches efficiently on redshift

Focusing On Batches

One of the best ways of moving data into redshift is by using the COPY command. This helps to quickly move millions of records to redshift with a single command. 

However, as with everything in engineering– it comes with a trade-off. 

Every copy and subsequent actions has overhead. Hence to make things more efficient, we needed to define the right batch sizes, wait times etc to optimize operations. 

Now that we had resolved the issue of moving data at an optimized speed, our next big challenge was to handle source database migrations. 

Database Migrations

We had our fair share of challenges implementing fair migrations from MySQL.

For starters:

  • For any CDC event that is an Alter, we fetch source table configurations and compare with RedShift schema and make a decision.
  • We had to be cautious of redshift types and defaults, few examples: 
    • Default precision/scale of double is (11,0) which can cause data loss.
    • TEXT is simply translated to VARCHAR(256) instead of MAX.
  • The focus for the most part is to do in-place migrations since re-creating tables can be very expensive and time-consuming.
  • We took extra caution before applying any updates since we want batches to be replayable.

It’s important that we had the right visibility and observability of the system; especially its performance during runtime. This adheres to our tight bound SLAs.  

After all, gauging metrics was one of our primary motivations while building the system.

Alerting and Dashboards

Now, we have dashboards with real-time status for 

  • 99%ile lag in seconds from MySQL, each database has different accepted SLA based on size and alerts on top of it. 
  • Kafka lag on the CDC topics
  • Number of messages produced per minute
  • Redshift batch times, broken down into each stage of COPY command
  • Total number of batches being processed per minute. 
  • Every DDL applied and translated, while the system is very stable but we still do cross-check translated SQLs.  

With alerting in place, we had to ensure that our deployment strategy had an elastic scale so that it could adjust to runtime demands. 

Speaking of which…

Our Deployment Setup

Our deployment setup itself could take up the form of a whole other blog. 

But here are some pointers for a quick overview:

  • We set up a separate Kubernetes cluster for the data platform which hosts CDC, Kafka and HULK instances. 
  • We usually have one instance per database which helps us avoid cascading failures across multiple databases. This setup also helps us manage instance size in proportion to the updates we expect from the database. 
  • We used Prometheus for scraping the metrics and Grafana for visualization. 
  • We integrated with Pagerduty and Slack for alerts.

We have already discussed the HULK architecture and its functionalities.

It’s only fitting to discuss the results!

How HULK Helped Us Build a Near Real-Time Data Pipeline

We have seen tremendous benefits on the overall ecosystem since we have moved to HULK.  

  • Our costs are a fraction of what we had using data migration services and sink connectors (including VPC etc).
  • We have more visibility/control on data movements and fine alerting mechanisms on more contextual events/batches.
  • We have billions of rows moved into Redshift via H. This has enabled our finance, risk and product teams to have a more meaningful and connected data ecosystem built on top of it.
  • Average and Peak CPU utilization of Redshift has dropped considerably from consistently being over 80-85% to around 60-65%. This gives us more room to strategise our platform plans without worrying a lot about redshift being a bottleneck

The Road Ahead

Yes, this has been a big win for us. 

However, it’s just the first step towards our goal of building an awesome data engineering team. 

Of course, we have a long road ahead of us. 

Here are our focus points for the near future:

  • Auto VACUUM and exploring if we can run vacuum on high throughput tables in the night via HULK.
  • Alerts/visibility on whether we have the right sort and dist keys for our tables based on usage patterns.
  • Purging tables, for tables where historical data does not have a lot of context.
  • Optimizing file sizes for S3, compressions and parallel copy. 

We are excited for all our future experiments, to say the least.

Does this sound like something that would interest you? 

Well then, we have some exciting opportunities lined up for data engineers just like you!