2. Getting Connected
Slack Channel: https://DenverAWSUsersGroup.slack.com
You will need an invitation to join; please email me at david@mobile-360.com.
We are now listed on the AWS User Groups site:
https://aws.amazon.com/usergroups/americas/
We are sponsored by CloudAcademy! They have a free portal for our members at:
https://cloudacademy.com/aws-usergroup/?code=newawsugs
We are also sponsored by and a member of the official Global AWS Communities!
See them at https://awsug.support
3. What we’re going to do tonight
1. Describe Amazon Redshift
2. Talk about how it’s different from regular SQL Databases
3. Talk about storage options for Redshift
a. Standard Disk-based storage
b. Spectrum and S3 (CSV & Parquet) storage
4. Describe ways to load data
a. S3, EMR, DynamoDB or Remote Hosts
5. Compare to Athena
4. What is Redshift?
Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze
all your data using standard SQL and your existing Business Intelligence (BI) tools. It allows you to run
complex analytic queries against petabytes of structured data, using sophisticated query optimization,
columnar storage on high-performance local disks, and massively parallel query execution. Most results come
back in seconds. With Amazon Redshift, you can start small for just $0.25 per hour with no commitments and
scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional
solutions.
Amazon Redshift also includes Redshift Spectrum, allowing you to directly run SQL queries against exabytes
of unstructured data in Amazon S3. No loading or transformation is required, and you can use open data
formats, including CSV, TSV, Parquet, Sequence, and RCFile. Redshift Spectrum automatically scales query
compute capacity based on the data being retrieved, so queries against Amazon S3 run fast, regardless of
dataset size.
Recently announced 4x compression improvement in Redshift.
5. How Redshift is Different
Redshift is a column-oriented database, whereas regular SQL databases are row-oriented. This
means that Redshift stores the values of each column together rather than the values of each row. This is hugely
beneficial when a query processes many rows but only a few columns, which is typical in BI and analytical
processing. Many data warehouse databases are denormalized to reduce joins, and therefore tables
will be very wide (many columns) to provide the most value, even though individual queries will only use a
small number of columns.
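To make the row-versus-column point concrete, here is a minimal Python sketch. It is only a toy model, not how Redshift is implemented: the table, its column names, and the "values touched" counters are all made up for illustration. It sums one column out of four and counts how many values each layout would have to read.

```python
# Hypothetical three-row, four-column table used only for illustration.
rows = [
    {"order_id": 1, "customer": "a", "region": "west", "amount": 10.0},
    {"order_id": 2, "customer": "b", "region": "east", "amount": 25.5},
    {"order_id": 3, "customer": "c", "region": "west", "amount": 4.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented model: a row store reads whole rows, so every field of
# every row counts as touched even though only "amount" is needed.
row_total = sum(row["amount"] for row in rows)
row_values_touched = sum(len(row) for row in rows)

# Column-oriented model: only the "amount" column is read.
col_total = sum(columns["amount"])
col_values_touched = len(columns["amount"])

print("row layout:   ", row_total, "values touched:", row_values_touched)
print("column layout:", col_total, "values touched:", col_values_touched)
```

Both layouts give the same total, but the column layout touches 3 values instead of 12. The gap grows with table width, which is why wide, denormalized warehouse tables suit columnar storage.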
7. Storage Options
1. Local Disk Storage
a. Traditional, SSD-based storage.
b. Ties storage to compute: adding one means adding the other.
c. Must make FULL read-only copies to scale.
2. S3 - Used with Redshift Spectrum
a. Uses Amazon Athena Meta-data to understand files in S3.
b. Decouples storage from compute.
c. Still must make read-only copies, but of meta-data only, so smaller & faster to scale.
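The S3 option depends on an external schema that points Redshift at a metadata catalog, plus external tables that describe the files. Below is a rough sketch of what that DDL can look like, built as Python strings so it can be inspected without a cluster. The schema, database, table, bucket, and IAM role names are hypothetical placeholders. The statement shapes follow my understanding of Redshift's CREATE EXTERNAL SCHEMA and CREATE EXTERNAL TABLE, so check them against the AWS documentation before use.

```python
def external_schema_sql(schema, catalog_db, iam_role):
    """Build DDL for a Spectrum schema backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema, table, columns, s3_location, stored_as="PARQUET"):
    """Build DDL for an external table; only metadata is created, no data moves."""
    column_list = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{column_list}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{s3_location}';"
    )


# All names below are placeholders, not real resources.
schema_sql = external_schema_sql(
    "spectrum_demo",
    "demo_catalog_db",
    "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
)
table_sql = external_table_sql(
    "spectrum_demo",
    "orders",
    [("order_id", "INTEGER"), ("amount", "DECIMAL(10,2)")],
    "s3://example-bucket/orders/",
)

print(schema_sql)
print(table_sql)
```

The strings are built with plain interpolation to keep the sketch short, so they should only be fed trusted values. Delimited text files such as CSV would also need a ROW FORMAT clause, which this sketch leaves out.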
8. How do we load data?
Multiple ways:
1. Preferred way: Use COPY command to load data from files in one of many
formats from:
a. S3
b. EMR
c. Remote EC2 Hosts
d. DynamoDB Tables
2. Use DML (INSERT, UPDATE, DELETE) statements.
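The two loading routes can be sketched as follows. This is illustrative only: the table name, bucket path, and IAM role ARN are hypothetical, and the helper simply assembles a COPY statement as text rather than running it. Running it would need a database driver and a live cluster, both omitted here.

```python
def copy_from_s3_sql(table, s3_path, iam_role, fmt="CSV"):
    """Build a COPY statement that bulk-loads files under an S3 prefix."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {fmt};"
    )


# Preferred route: one COPY loads every file under the prefix in parallel.
# All names are placeholders, not real resources.
copy_sql = copy_from_s3_sql(
    "sales",
    "s3://example-bucket/sales/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)

# DML route: a multi-row INSERT, reasonable for a handful of rows only.
insert_sql = "INSERT INTO sales (order_id, amount) VALUES (1, 10.0), (2, 25.5);"

print(copy_sql)
print(insert_sql)
```

Passing `fmt="PARQUET"` would produce the same statement with `FORMAT AS PARQUET`. The other sources in the list (EMR, remote hosts, DynamoDB) use a different FROM target, which this sketch does not cover.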
9. How is it different from Athena?
Athena                                            | Redshift
Storage on S3                                     | Storage on attached SSD disks
Automatically scales                              | Must add more instances/change instance size
Massive parallelism                               | Only as parallel as you configure
Data can be stored in multiple formats per table  | Data can be loaded from files in multiple formats
10. Demo!
1. Create Schemas for Redshift tables
2. Load data in multiple formats from S3
3. Create Redshift Spectrum Schemas
4. Load data (really, meta-data)
5. Execute queries
6. Tableau visualization