This document provides an overview of data transfer and common ETL (extract, transform, load) processes. It discusses typical data sources and destinations for ETL pipelines, as well as common failure cases. It emphasizes writing testable code in small steps, maintaining idempotency, implementing thorough logging and monitoring, and tempering expectations since data transfer processes are inherently difficult to test and failures are common.
1. THE INS & OUTS OF DATA TRANSFER
LOS ANGELES AWS USERS GROUP
JASON DAVIS, CEO SIMON DATA
@JASONDAVIS
DRJASONDAVIS.COM
2. A TYPICAL DATA ECOSYSTEM
[Diagram: users and frontend, application, OLTP/RDS, data lake (Redshift / S3), analytics, back office ("the biz"), core tech, 3P tech / SaaS (CRM / ERP, email / push / SMS), graphs / BI]
3. A gentle introduction to data transfer & "ETL"
An overview of common failure cases
Best practices and some high-level guidance
OVERVIEW
4. SOME TYPICAL DATA TRANSFERS
WEB ANALYTICS
"BUSINESS" REPORTING
ACQUISITION / LTV ANALYSIS
EMAIL SEGMENTATION
5. Product recommendations
Extract: SKUs, purchase / browse history, profit margins
Transform: Deep learning / recommender systems
Load: user / SKU recommendations into a production DB
Inventory planning
Extract: historical sales, inventory and shipping costs
Transform: Stockage goal estimation
Load: SKU-level forecasts into an ERP system
Executive dashboard
Extract: revenue, traffic, support volume, operational data
Transform: basic aggregates
Load: pie charts, vanity metrics driven by a reporting DB
SOME MORE TYPICAL DATA TRANSFERS
6. ETL: the process of pulling data from one or more sources for use in another system
Extract data from one or more sources
Database, event streams, S3, Salesforce, email metrics
Transform data via aggregations, joins, filters, and/or predictive analysis
Parallel (Hadoop, Spark), In-core (Redshift), Scripts (Python, bash)
Load data into destination
Database / Redshift, S3, HDFS, SaaS, ERP, CRM, email platform, etc.
DATA TRANSFER IN 3 STEPS: EXTRACT-TRANSFORM-LOAD
E T L
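To make the three steps concrete, here is a minimal Python sketch with each step as a separate function. The CSV source, the sqlite3 destination, and the event_counts schema are illustrative assumptions, not specifics from the talk.

    import csv
    import sqlite3

    def extract(path):
        # Extract: read raw event rows from a source (here, a CSV file).
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: a basic aggregate -- count events per user.
        counts = {}
        for row in rows:
            counts[row["user_id"]] = counts.get(row["user_id"], 0) + 1
        return counts

    def load(counts, conn):
        # Load: write the aggregates into the destination database.
        conn.execute("CREATE TABLE IF NOT EXISTS event_counts (user_id TEXT, n INTEGER)")
        conn.executemany("INSERT INTO event_counts VALUES (?, ?)", counts.items())
        conn.commit()

    if __name__ == "__main__":
        load(transform(extract("events.csv")), sqlite3.connect("warehouse.db"))

Keeping each step a separate function pays off later: small steps are exactly what makes a pipeline testable.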
7. Extraction failures
Source unavailable
Data corrupt / incomplete - upstream error
Transform failures
Resources unavailable / exceeded: out of memory (OOM)
Broken computation: bad math / divide-by-zero (DBZ)
Load failures
Validation errors
Connectivity errors
Availability / bandwidth limitations
Failures can cascade in unexpected ways
MOVING DATA IS HARD: COMMON FAILURE CASES
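Connectivity, availability, and bandwidth failures are often transient, so a common mitigation is to retry with exponential backoff rather than fail outright. A minimal sketch; the exception types to catch depend on your client library, and the builtins below are placeholders:

    import random
    import time

    def with_retries(fn, attempts=5, base_delay=1.0):
        # Retry a flaky call with exponential backoff plus jitter.
        for attempt in range(attempts):
            try:
                return fn()
            except (ConnectionError, TimeoutError):
                if attempt == attempts - 1:
                    raise  # out of retries -- let the failure surface loudly
                time.sleep(base_delay * 2 ** attempt + random.random())

Note that retries only make sense when the wrapped operation is safe to repeat, which is exactly the idempotency point below.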
8. Maintaining state between two systems is hard
The basic problem of one-to-one syncing is hard in itself
Incremental and cursor-based extractors are all prone to failure
Failure cases are wide, varied, and data-driven
Finding them generally requires running in a real-world context for an extended period
Many times failures are silent
Ensuring correctness is hard / impossible
Run times are generally longer, which strains unit-testing best practices
FUNDAMENTAL CHALLENGES
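As an illustration of why cursor-based extractors are fragile, here is a minimal sketch of one; the users table, the JSON checkpoint file, and the process stub are hypothetical. The cursor must advance only after the batch is safely handled, and even then, rows that commit late with an earlier updated_at are silently skipped -- a classic silent failure.

    import json
    import os
    import sqlite3

    STATE_FILE = "cursor_state.json"  # hypothetical checkpoint store

    def load_cursor():
        # Resume from the last checkpoint, or from the beginning of time.
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)["last_updated_at"]
        return "1970-01-01T00:00:00"

    def save_cursor(ts):
        with open(STATE_FILE, "w") as f:
            json.dump({"last_updated_at": ts}, f)

    def process(rows):
        pass  # hand the batch off to the transform step (stub)

    def extract_increment(conn):
        cursor = load_cursor()
        rows = conn.execute(
            "SELECT id, updated_at FROM users WHERE updated_at > ? ORDER BY updated_at",
            (cursor,),
        ).fetchall()
        process(rows)                 # must succeed before the cursor advances...
        if rows:
            save_cursor(rows[-1][1])  # ...or rows will be re-read or lost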
9. Break your pipeline into small steps
Large SQL statements are hard to test
SQL in general is hard to unit test - it's a declarative language after all
Dataflow frameworks such as Spark / Cascading are easier to test
Build patterns that make it easy to test real-world inputs against expected outputs
Timeout errors and other exceptional cases are hard to test in isolation
WRITE UNIT TESTS BUT TEMPER EXPECTATIONS
DATA PIPES ARE HARD TO UNIT TEST
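One pattern that helps: keep each transform a pure function from rows in to rows out, so captured real-world inputs can be replayed against expected outputs. A minimal sketch with unittest; the dedupe_latest transform is a hypothetical example step:

    import unittest

    def dedupe_latest(rows):
        # Pure transform: keep only the most recent row per id.
        latest = {}
        for row in rows:
            if row["id"] not in latest or row["ts"] > latest[row["id"]]["ts"]:
                latest[row["id"]] = row
        return sorted(latest.values(), key=lambda r: r["id"])

    class DedupeTest(unittest.TestCase):
        def test_keeps_most_recent_row_per_id(self):
            rows = [{"id": 1, "ts": 1}, {"id": 1, "ts": 2}, {"id": 2, "ts": 1}]
            expected = [{"id": 1, "ts": 2}, {"id": 2, "ts": 1}]
            self.assertEqual(dedupe_latest(rows), expected)

    if __name__ == "__main__":
        unittest.main()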
10. Idempotent. A unary operation (or function) is idempotent if applying it twice to any value gives the same result as applying it once; i.e., ƒ(ƒ(x)) ≡ ƒ(x). For example, the absolute value function is idempotent: abs(abs(x)) ≡ abs(x).
In layman's terms: your code produces the same result whether you run it once, twice, or any number of times.
Why is this important?
Oftentimes you won't know if something was successful or not.
Solution: Idempotency allows you to "just run it again"
IDEMPOTENCY
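One common way to get "just run it again" behavior in the load step is to have each run replace its own slice of output instead of appending to it. A minimal sketch, assuming a hypothetical daily_revenue table loaded one day at a time:

    import sqlite3

    def load_day(conn, day, rows):
        # Idempotent load: delete this run's slice, then re-insert it.
        # Running this once or N times leaves the table in the same state.
        with conn:  # one transaction, so a crash can't leave a half-loaded day
            conn.execute("DELETE FROM daily_revenue WHERE day = ?", (day,))
            conn.executemany(
                "INSERT INTO daily_revenue (day, revenue) VALUES (?, ?)", rows
            )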
"THINGS DON'T ALWAYS TAKE ON THE FIRST TRY...."
11. Start with fine-grained logs
"Measure Anything, Measure Everything" - Etsy, Code as Craft
Alert on things that are mission critical or have well-known failure characteristics
VISIBILITY: LOGGING, GRAPHING, & ALERTING
OPTIMIZE FOR TIME TO DETECTION
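A minimal sketch of fine-grained step logging with Python's standard library, aimed at keeping time to detection low; the step names and where you hook in alerting are illustrative assumptions:

    import logging
    import time
    from contextlib import contextmanager

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("pipeline")

    @contextmanager
    def step(name):
        # Log every step's start, duration, and failure -- measure everything.
        start = time.monotonic()
        log.info("step=%s status=started", name)
        try:
            yield
        except Exception:
            log.exception("step=%s status=failed", name)  # alert here if mission critical
            raise
        else:
            log.info("step=%s status=ok duration=%.2fs", name, time.monotonic() - start)

    # Usage:
    # with step("extract_orders"):
    #     rows = extract("orders.csv")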