Más contenido relacionado

Similar a Rds data lake @ Robinhood (20)


Rds data lake @ Robinhood

  1. 1 RDS Data Lake @ Robinhood Balaji Varadarajan Vikrant Goel Josh Kang
  2. Agenda ● Background & High Level Architecture ● Deep Dive - Change Data Capture (CDC) ○ Setup ○ Lessons Learned ● Deep Dive - Data Lake Ingestion ○ Setup ○ Customizations ● Future Work ● Q&A
  3. Ecosystem
  4. Prior Data Ingestion ● Daily snapshotting of tables in RDS ● Dedicated Replicas to isolate snapshot queries ● Bottlenecked by Replica I/O ● Read & Write amplifications
  5. Faster Ingestion Pipeline Unlock Data Lake for business critical applications
  6. Change Data Capture ● Table State as sequence of state changes ● Capture stream of changes from DB. ● Replay and Merge changes to data lake ● Efficient & Fast -> Capture and Apply only deltas
  7. High Level Architecture Master RDS Replica RDS Table Topic DeltaStreamer DeltaStreamer Bootstrap DATA LAKE (s3://xxx/… Update schema and partition Write incremental data and checkpoint offsets
  8. Deep Dive - CDC using Debezium
  9. Debezium - Zooming In Master Relational Database (RDS) WriteAheadLogs (WALs) 1. Enable logical-replication. All updates to the Postgres RDS database are logged into binary files called WriteAheadLogs (WALs) AVRO Schema Registry 3. Debezium updates and validates avro schemas for all tables using Kafka Schema Registry Table_1 Topic Table_2 Topic Table_n Topic 4. Debezium writes avro serialized updates into separate Kafka topics for each table 2. Debezium consumes WALs using Postgres Logical Replication plugins
  10. Why did we choose Debezium over alternatives? Debezium AWS Database Migration Service (DMS) Operational Overhead High Low Cost Free, with engineering time cost Relatively expensive, with negligible engineering time cost Speed High Not enough Customizations Yes No Community Support Debezium has a very active and helpful Gitter community. Limited to AWS support.
  11. Debezium: Lessons Learned
  12. 1. Postgres Master Dependency ONLY the Postgres Master publishes WriteAheadLogs (WALs). Disk Space: - If a consumer dies, Postgres will keep accumulating WALs to ensure Data Consistency - Can eat up all the disk space - Need proper monitoring and alerting CPU: - Each logical replication consumer uses a small amount of CPU - Postgres10+ uses pgoutput (built-in) : Lightweight Postgres9 uses wal2Json (3rd party) : Heavier - Need upgrades to Postgres10+ Debezium: Lessons Learned Postgres Master: - Publishes WALs - Record LogSequenceNumber (LSN) for each consumer Consumer-1 Consumer-n Consumer-2 LSN-2 LSN-1 LSN-n Consumer-ID LSN Consumer-1 A3CF/BC Consumer-n A3CF/41
  13. Debezium: Lessons Learned 2. Initial Table Snapshot (Bootstrapping) Need for bootstrapping: - Each table to replicate requires initial snapshot, on top of which ongoing logical updates are applied Problem with Debezium: - Slow. Debezium snapshot mode reads, encodes (AVRO), and writes all the rows to Kafka - Large tables put too much pressure on Kafka Infrastructure and Postgres master Solution using Deltastreamer: - Custom Deltastreamer bootstrapping framework using partitioned and distributed spark reads - Can use read-replicas instead of the master Master RDS Replica RDS Table Topic DeltaStreamer Bootstrap DeltaStreamer
  14. Debezium: Lessons Learned AVRO JSON JSON + Schema Throughput (Benchmarked using db.r5.24xlarge Postgres RDS instance) Up to 40K mps Up to 11K mps. JSON records are larger than AVRO. Up to 3K mps. Schema adds considerable size to JSON records. Data Types - Supports considerably high number of primitive and complex data types out of the box. - Great for type safety. Values must be one of these 6 data types: - String - Number - JSON object - Array - Boolean - Null Same as JSON Schema Registry Required by clients to deserialize the data. Optional Optional 3. AVRO vs JSON
  15. Debezium: Lessons Learned Table-1 Table-4 Table-2 Table-5 Table-3 Table-n Consumer-1 Consumer-n Consumer-2 Table-1 Table-2 Table-3 Table-4 Table-5 Table-n 4. Multiple logical replication streams for horizontal scaling Databases with multiple large tables can generate enough logs to overwhelm a single Debezium connector. Solution is to split the tables across multiple Debezium connectors. Total throughput = max_throughput_per_connector * num_connectors
  16. Debezium: Lessons Learned 5. Schema evolution and value of Freezing Schemas Failed assumption: Schema changes are infrequent and always backwards compatible. - Examples: 1. Adding non-nullable columns (Most Common 99/100) 2. Deleting columns 3. Changing data types How to handle the non backwards compatible changes? - Re-bootstrap the table - Can happen anytime during the day #always_on_call Alternatives? Freeze the schema - Debezium allows to specify the list of columns to be consumed per table. - Pros: - #not_always_on_call - Monitoring scripts to identify and batch the changes for management window - Cons: - Schema is temporarily out of sync Table-X Backwards Incompatible Schema Change Consumer-2 - Specific columns Consumer-1 - All columns
  17. Deep Dive - CDC Ingestion - Bootstrap Ingestion
  18. Data Lake Ingestion - CDC Path Schema Registry Hudi Table Hudi Metadata 1. Get Kafka checkpoint 2. Spark Kafka batch read and union from most recently committed kafka offsets 3. Deserialize using Avro schema from schema registry 4. Apply Copy-On-Write updates and update Kafka checkpoint Shard 1 Table 1 Shard 2 Table 1 Shard N Table 1 Shard 1 Table 2 Shard 2 Table 2 Shard N Table 2
  19. Data Lake Ingestion - Bootstrap Path Hudi Table AWS RDS Replica Shard 1 Table Topic Shard 2 Table Topic Shard n Table Topic Hudi Table 3. Wait for Replica to catch up latest checkpoint 2. Store offsets in Hudi metadata 1. Get latest topic offsets 4. Spark JDBC Read 5. Bulk Insert Hudi Table
  20. Common Challenges
  21. Common Challenges - How to ensure multiple Hudi jobs are not running on the same table?
  22. Common Challenges - How to partition Postgres queries for bootstrapping skewed tables?
  23. Common Challenges - How to monitor job statuses and end-to-end-latencies? - Commit durations - Number of inserts, updates, and deletes - Files added and deleted - Partitions updated
  24. Future Work - 1000s of pipelines to be managed and replications slots requires an orchestration framework for managing this - Support for Hudi’s continuous mode to achieve lower latencies - Leverage Hudi’s incremental processing to build efficient downstream pipelines - Deploy and Improve Hudi’s Flink integration to Flink based pipelines
  25. Robinhood is Growing Robinhood is hiring Engineers across its teams to keep up with our growth! We have expanded Engineering teams recently in Seattle, NYC, and continuing to Grow in Menlo Park, CA (HQ) Interested? Reach out to us directly or Apply online at
  26. Questions ?