- TrueCar migrated their data warehouse from an on-premises Hadoop cluster to Amazon Redshift. They load clickstream, transactions, inventory, and lead data into Redshift for analytics and reporting.
- They use ETL tools like Talend and Hive to process data and load it into HDFS and S3, then load it into Redshift using a custom utility. The data is organized into schemas separating raw, user, and reporting data.
- Best practices for Redshift include designing tables for compression, sort keys, and distribution, managing cluster size and workloads over time, and vacuuming and analyzing tables regularly. TrueCar's migration to Redshift improved performance and reduced costs.
1. Migrate Your Data Warehouse to Amazon Redshift
Greg Khairallah, Business Development Manager, AWS
David Giffin, VP Technology, TrueCar
Sharat Nair, Director of Data, TrueCar
Blagoy Kaloferov, Data Engineer, TrueCar
September 21, 2016
2. Agenda
• Motivation for Change and Migration
• Migration Patterns and Best Practices
• AWS Database Migration Service
• Use Case – TrueCar
• Questions and Answers
3. Amazon Redshift: relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• A lot faster, a lot simpler, a lot cheaper
4. Amazon Redshift delivers performance
“[Amazon] Redshift is twenty times faster than Hive.” (5x–20x reduction in query times)
“Queries that used to take hours came back in seconds. Our analysts are orders of magnitude more productive.” (20x–40x reduction in query times)
“…[Amazon Redshift] performance has blown away everyone here (we generally see 50–100x speedup over Hive).”
“Team played with [Amazon] Redshift today and concluded it is awesome. Un-indexed complex queries returning in < 10s.”
“Did I mention it's ridiculously fast? We'll be using it immediately to provide our analysts an alternative to Hadoop.”
“We saw… 2x improvement in query times.”
“We regularly process multibillion-row datasets and we do that in a matter of hours.”
5. Amazon Redshift is cost optimized
DS2 (HDD)            Price per hour for          Effective annual price
                     DS2.XLarge single node      per TB (compressed)
On-Demand            $0.850                      $3,725
1-Year Reservation   $0.500                      $2,190
3-Year Reservation   $0.228                      $999

DC1 (SSD)            Price per hour for          Effective annual price
                     DC1.Large single node       per TB (compressed)
On-Demand            $0.250                      $13,690
1-Year Reservation   $0.161                      $8,795
3-Year Reservation   $0.100                      $5,500

Pricing is simple: number of nodes x price/hour, no charge for the leader node, no up-front costs, pay as you go. Prices shown are for US East; other regions may vary.
6. Considerations Before You Migrate
• Data is often already being loaded into another warehouse
– existing ETL processes carry investment in code and process
• The temptation is to ‘lift & shift’ the workload
• Resist that temptation. Instead, consider:
– What do I really want to do?
– What do I need?
• Some data does not lend itself to a relational schema
• A common pattern is to use Amazon EMR to:
– impose structure
– import into Amazon Redshift
7. Amazon Redshift architecture
• Leader node
– Simple SQL endpoint
– Stores metadata
– Optimizes query plan
– Coordinates query execution
• Compute nodes
– Local columnar storage
– Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Start at just $0.25/hour, grow to 2 PB (compressed)
– DC1: SSD; scale from 160 GB to 326 TB
– DS2: HDD; scale from 2 TB to 2 PB
(Diagram: JDBC/ODBC access through the leader node; 10 GigE (HPC) interconnect between compute nodes; ingestion, backup, and restore paths)
9. Amazon Redshift Migration Overview
(Diagram: data moves from the corporate data center, source databases and logs/files, into the AWS Cloud over a VPN connection, AWS Direct Connect, S3 multipart upload, or Amazon Snowball; destinations and services include Amazon S3, Amazon DynamoDB, Amazon Elastic MapReduce, Amazon RDS, Amazon Redshift, Amazon Glacier, Kinesis, AWS Lambda, AWS Data Pipeline, AWS Database Migration Service, and EC2 or on-premises hosts using SSH)
10. Uploading Files to Amazon S3
• Ensure that your data resides in the same Region as your Redshift clusters
• Split the data into multiple files to facilitate parallel processing (e.g., Client.txt becomes Client.txt.1 through Client.txt.4)
• Files should be individually compressed using GZIP or LZOP
• Optionally, encrypt your data using Amazon S3 server-side or client-side encryption
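The split-and-compress step above can be sketched in Python. This is a minimal illustration, not a production loader; the `Client.txt`-style part naming mirrors the slide, and the part count is whatever matches your slice count:

```python
import gzip


def split_and_compress(path, num_parts):
    """Split a text file into num_parts gzip-compressed pieces
    (e.g. client.txt -> client.txt.1.gz ... client.txt.4.gz) so each
    Redshift slice can ingest one file in parallel."""
    with open(path, "rb") as f:
        lines = f.readlines()
    chunk = -(-len(lines) // num_parts)  # ceiling division
    parts = []
    for i in range(num_parts):
        part_path = f"{path}.{i + 1}.gz"
        with gzip.open(part_path, "wb") as out:
            out.writelines(lines[i * chunk:(i + 1) * chunk])
        parts.append(part_path)
    return parts
```

Each part would then be uploaded to the same-Region S3 bucket (for example with `aws s3 cp`) before running COPY.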
11. Loading – Use multiple input files to maximize throughput
• Use the COPY command
• Each slice can load one file at a time
• A single input file means only one slice is ingesting data
• Instead of 100 MB/s, you’re only getting 6.25 MB/s
12. Loading – Use multiple input files to maximize throughput
• Use the COPY command
• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
13. Loading Data with Manifest Files
• Use a manifest file to load all required files
• Supply a JSON-formatted text file that lists the files to be loaded
• Can load files from different buckets or with different prefixes
{
"entries": [
{"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true},
{"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true}
]
}
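A manifest like the one above can be generated rather than hand-written. A minimal Python sketch, reusing the slide's bucket names; the `load.manifest` filename is illustrative:

```python
import json


def build_manifest(urls, mandatory=True):
    """Build a COPY manifest dict listing every S3 object to load.
    mandatory=True makes COPY fail if a listed file is missing."""
    return {"entries": [{"url": u, "mandatory": mandatory} for u in urls]}


manifest = build_manifest([
    "s3://mybucket-alpha/2013-10-04-custdata",
    "s3://mybucket-beta/2013-10-04-custdata",
])

# Write the manifest locally; it would then be uploaded to S3 and
# referenced via COPY ... MANIFEST.
with open("load.manifest", "w") as f:
    json.dump(manifest, f, indent=2)
```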
14. Redshift COPY Command
• Loads data into a table from data files in S3 or from an Amazon DynamoDB table.
• The COPY command requires only three parameters:
– Table name
– Data source
– Credentials
COPY table_name FROM 'data_source' CREDENTIALS 'aws_access_credentials';
• Optional parameters include:
– Column mapping options – mapping source to target
– Data format parameters – FORMAT, CSV, DELIMITER, FIXEDWIDTH, AVRO, JSON, BZIP2, GZIP, LZOP
– Data conversion parameters – data type conversion between source and target
– Data load operations – troubleshoot or reduce load times with parameters like COMPROWS, COMPUPDATE, MAXERROR, NOLOAD, STATUPDATE
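A small helper can assemble a full COPY statement from these parameters. The table name, bucket, and IAM role below are hypothetical placeholders, not from the deck:

```python
def build_copy(table, source, credentials, options=()):
    """Assemble a Redshift COPY statement from its three required
    parameters (table, data source, credentials) plus any optional
    parameter clauses."""
    parts = [f"COPY {table}", f"FROM '{source}'", f"CREDENTIALS '{credentials}'"]
    parts.extend(options)
    return "\n".join(parts) + ";"


sql = build_copy(
    "custdata",                         # hypothetical table
    "s3://mybucket/custdata.manifest",  # hypothetical bucket/manifest
    "aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy",
    options=["MANIFEST", "GZIP", "DELIMITER '|'", "MAXERROR 10"],
)
```

The resulting string would be executed through any JDBC/ODBC or Python client connected to the cluster's leader node.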
15. Loading JSON Data
• COPY uses a jsonpaths text file to parse JSON data
• JSONPath expressions specify the path to JSON name elements
• Each JSONPath expression corresponds to a column in the Amazon Redshift target table
Suppose you want to load the VENUE table with the following content:
{ "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 }
{ "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "CA" ], "seats": 63026 }
You would use the following jsonpaths file to parse the JSON data:
{ "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
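To see how each JSONPath expression maps to a target column, here is a toy Python resolver for the bracket-notation paths above. This is only an illustration of the mapping; COPY's real JSONPath support is implemented server-side:

```python
import json
import re

# The slide's jsonpaths: one expression per target column, in column
# order (id, name, city, state, seats).
jsonpaths = ["$['id']", "$['name']", "$['location'][0]",
             "$['location'][1]", "$['seats']"]


def extract(record, path):
    """Resolve a simple bracket-notation JSONPath ($['k'][0]...) against
    one JSON record -- a toy version of what COPY does per column."""
    value = record
    for key, idx in re.findall(r"\[(?:'([^']+)'|(\d+))\]", path):
        value = value[key] if key else value[int(idx)]
    return value


record = json.loads('{"id": 15, "name": "Gillette Stadium", '
                    '"location": ["Foxborough", "MA"], "seats": 68756}')
row = [extract(record, p) for p in jsonpaths]
# row -> [15, "Gillette Stadium", "Foxborough", "MA", 68756]
```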
16. Loading Data in Avro Format
• Avro is a data serialization protocol. An Avro source file includes a schema that defines the structure of the data. The Avro schema type must be record.
• COPY uses an avro_option to parse Avro data. Valid values for avro_option are as follows:
– 'auto' (default) – COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table.
– 's3://jsonpaths_file' – To explicitly map Avro data elements to columns, you can use a JSONPaths file.
Avro schema:
{
  "name": "person",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "guid", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "address", "type": "string"}
  ]
}
17. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift, and Elasticsearch
• Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention
(Diagram: capture and submit streaming data; Firehose loads it continuously into Amazon S3, Redshift, and Elasticsearch; analyze it with your favorite BI tools)
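On the producer side, Firehose's `PutRecordBatch` call caps a batch at 500 records or 4 MB, so clients typically batch before sending. A hedged sketch of that client-side batching; the `clickstream` delivery-stream name is hypothetical:

```python
def batch_records(records, max_count=500, max_bytes=4 * 1024 * 1024):
    """Group newline-terminated byte records into batches sized for
    Firehose put_record_batch (at most max_count records or max_bytes
    per batch)."""
    batches, current, size = [], [], 0
    for rec in records:
        data = rec if rec.endswith(b"\n") else rec + b"\n"
        if current and (len(current) >= max_count or size + len(data) > max_bytes):
            batches.append(current)
            current, size = [], 0
        current.append({"Data": data})
        size += len(data)
    if current:
        batches.append(current)
    return batches


# Sending requires AWS credentials and an existing delivery stream:
# import boto3
# firehose = boto3.client("firehose")
# for batch in batch_records([b'{"page": "/home"}']):
#     firehose.put_record_batch(DeliveryStreamName="clickstream",
#                               Records=batch)
```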
18. Best Practices for Loading Data
• Use the COPY command to load data
• Use a single COPY command per table
• Split your data into multiple files
• Compress your data files with GZIP or LZOP
• Use multi-row inserts whenever possible
• Bulk insert operations (INSERT INTO…SELECT and CREATE TABLE AS) provide high-performance data insertion
• Use Amazon Kinesis Firehose to load streaming data directly into S3 and/or Redshift
19. Best Practices for Loading Data (Continued)
• Load your data in sort key order to avoid needing to vacuum
• Organize your data as a sequence of time-series tables, where each table is identical but contains data for different time ranges
• Use staging tables to perform an upsert
• Run the VACUUM command whenever you add, delete, or modify a large number of rows, unless you load your data in sort key order
• Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count
• Run the ANALYZE command whenever you’ve made a non-trivial number of changes to your data to ensure your table statistics are current
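The staging-table upsert mentioned above is commonly implemented as a delete-then-insert inside one transaction, since Redshift at the time had no native UPSERT. A sketch that emits that SQL; the table and key names are illustrative:

```python
def staging_upsert_sql(target, staging, key):
    """Emit the staged-upsert pattern: COPY loads new rows into a
    staging table first, then delete-and-insert runs inside one
    transaction so the target never shows partial results."""
    return "\n".join([
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        f"TRUNCATE {staging};",
        "COMMIT;",
    ])


print(staging_upsert_sql("sales", "sales_staging", "sale_id"))
```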
20. Amazon Partner ETL
• Amazon Redshift is supported by a variety of ETL vendors
• Many simplify the process of data loading
• Visit http://aws.amazon.com/redshift/partners
• Many vendors offer free trials of their products, allowing you to evaluate and choose the one that suits your needs
21. AWS Database Migration Service (DMS)
Benefits:
• Start your first migration in 10 minutes or less
• Keep your apps running during the migration
• Replicate within, to, or from Amazon EC2 or RDS
• Move data from commercial database engines to open source engines
• Or… move data to the same database engine
• Consolidate databases and/or tables
22. Sources and Targets for AWS DMS
Sources:
On-premises and Amazon EC2 instance databases:
• Oracle Database 10g – 12c
• Microsoft SQL Server 2005 – 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.4 – 9.5
• SAP ASE 15.7+
RDS instance databases:
• Oracle Database 11g – 12c
• Microsoft SQL Server 2008 R2 – 2014 (CDC operations are not supported yet)
• MySQL versions 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.4 – 9.5 (CDC operations are not supported yet)
• Amazon Aurora (MySQL-compatible data source)
Targets:
On-premises and EC2 instance databases:
• Oracle Database 10g – 12c
• Microsoft SQL Server 2005 – 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.3 – 9.5
• SAP ASE 15.7+
RDS instance databases:
• Oracle Database 11g – 12c
• Microsoft SQL Server 2008 R2 - 2014
• MySQL 5.5 – 5.7
• MariaDB (MySQL-compatible data source)
• PostgreSQL 9.3 – 9.5
• Amazon Aurora (MySQL-compatible data source)
Amazon Redshift
23. AWS Database Migration Service Pricing
• T2 instances for development and periodic data migration tasks
• C4 instances for large databases and minimizing migration time
• T2 pricing starts at $0.018 per hour for T2.micro
• C4 pricing starts at $0.154 per hour for C4.large
• 50 GB GP2 storage included with T2 instances
• 100 GB GP2 storage included with C4 instances
• Data transfer inbound and within an AZ is free
• Data transfer across AZs starts at $0.01 per GB
25. Resources on the AWS Big Data Blog
• Best Practices for Micro-Batch Loading on Amazon Redshift
• Using Attunity Cloudbeam at UMUC to Replicate Data to Amazon RDS and
Amazon Redshift
• A Zero-Administration Amazon Redshift Database Loader
• Best practices references:
– Best Practices for Designing Tables
– Best Practices for Designing Queries
– Best Practices for Loading Data
26. Amazon Redshift at TrueCar
Sep 21, 2016
27. About us
● About TrueCar
● David Giffin – VP Technology
● Sharat Nair – Director of Data
● Blagoy Kaloferov – Data Engineer
28. Agenda
● Amazon Redshift use case overview
● Architecture and migration process
● Tips and lessons learned
30. Amazon Redshift at TrueCar
● Datasets that flow into Amazon Redshift: Clickstream, Transactions, Sales, Inventory, Dealer, Leads
● How we do analytics and reporting
● Redshift is our data store for BI tools and ad hoc queries
● Data that is loaded into Amazon Redshift is already processed
31. Architecture
(Diagram: Leads, Dealer, Transactions, Sales, Inventory, and Clickstream data flow through ETL (MR, Hive, Pig, Oozie, Talend) and Postgres into HDFS for staging and DWH ETL – the data processing tier)
32. Architecture
(Diagram: the same data processing tier – ETL (MR, Hive, Pig, Oozie, Talend), Postgres, and HDFS – now feeds a loading utility that moves data through S3 into Amazon Redshift, which serves MSTR and Tableau for reporting and ad hoc analysis)
33. Schema design
● Schemas
● Our datasets are in a read-only schema for ad hoc and scheduled reporting
● Ad hoc and user tables live in separate schemas
● This makes it easy to separate final data from user-created data
● Simple table naming conventions
● F_ – facts
● D_ – dimensions
● AGG_ – aggregates
● V_ – views
35. Redshift loading process
● ETL is orchestrated through Talend and Oozie
● Processing tools: Talend, Hive, Pig, and MapReduce push data into HDFS and S3
● We built our own Amazon Redshift loading utility
● It handles all loading use cases:
● Load
● TruncateLoad
● DeleteAppend
● Upsert
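The deck names the utility's four modes but not their internals. Assuming the mode names mean what they suggest (Load appends, TruncateLoad replaces, DeleteAppend deletes matching rows first, Upsert deletes on a key then inserts), a minimal dispatch sketch with illustrative table names:

```python
def load_statements(mode, target, staging, key=None, predicate=None):
    """Return the SQL statements a loading utility might run after
    COPYing new data into a staging table -- assumed semantics, not
    TrueCar's actual implementation."""
    insert = f"INSERT INTO {target} SELECT * FROM {staging};"
    if mode == "Load":
        return [insert]
    if mode == "TruncateLoad":
        return [f"TRUNCATE {target};", insert]
    if mode == "DeleteAppend":
        return [f"DELETE FROM {target} WHERE {predicate};", insert]
    if mode == "Upsert":
        return [f"DELETE FROM {target} USING {staging} "
                f"WHERE {target}.{key} = {staging}.{key};", insert]
    raise ValueError(f"unknown load mode: {mode}")
```

Centralizing the modes this way keeps every table load going through the same reviewed SQL patterns instead of ad hoc scripts.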
36. Table design considerations
● Train developers on table design and Redshift best practices
● Column compression and encodings
● Analyze compression
● It makes a significant difference in space usage
● Sort and distribution keys
● Plan a workload management strategy
● As usage of the Redshift cluster grows, you need to ensure that critical jobs get bandwidth
37. Space considerations
● Retain pre-“COPY” data in S3
● It can easily be used by other tools (Spark, Pig, MapReduce)
● Offload historical datasets into separate tables on a rolling basis
● Pre-aggregate data when possible to reduce load on the system
38. Long-term usage tips
● Have a cluster resize strategy
● Use reserved instances for cost savings
● Plan on having enough space for long-term growth
● Plan your maintenance
● Vacuuming
● System tables are your friends
● A useful collection of utilities: https://github.com/awslabs/amazon-redshift-utils/