This document summarizes a presentation given by Steve Abraham and Brian Filppu on collecting and analyzing large amounts of real-time data with zero infrastructure using AWS services. It discusses using Amazon API Gateway to ingest data, Amazon Kinesis to collect and store data, AWS Lambda to process data in real-time, and Amazon Redshift and Aurora for analytics and querying. It also provides a case study of how Zillow uses this architecture to collect and analyze mobile app metrics.
2. Who am I?
• Steve Abraham
• Solutions Architect – AWS
• Previous life
• T-Mobile
• U.S. State Department
• Hasbro
• Software company
3. What we’ll cover
• Data ingestion pipeline
• Collect 1,000,000,000 data points per month
• Varied clients
• Near real-time access to data
• High performance / high availability
• Low cost / low maintenance
• Case study – Zillow
• Brian Filppu – Director of Business Intelligence
9. Amazon API Gateway
• Integration types
• Lambda
• Proxy AWS service
• Proxy existing service
• Mock
10. Amazon API Gateway
• Deploy to stages
• Cross-origin resource sharing (CORS) support
• Automatically generates SDK
• Android
• iOS
• JavaScript
11. Amazon API Gateway
• $3.50 per 1,000,000 calls
• Data transfer in - Free
• Data transfer out - $0.05 -> $0.09 per GB
• 1,000,000,000 calls
• $3,500.00 – Gateway
• $0.00 – Data transfer out
• Total price - $3,500.00
30. Amazon Simple Storage Service
• Secure
• Encryption in flight - HTTPS
• Encryption at rest (Amazon S3 key, client key, AWS KMS)
• Durable
• Designed for 11 9’s of durability
• Scalable
• Millions of requests per second
• Trillions of objects
31. AWS Key Management Service
• Manage encryption keys
• Encrypt / decrypt data directly
• Directly integrates with
• Amazon S3
• Amazon RDS
• Amazon Redshift
• AWS Lambda integration
• Access via API
32. Amazon Simple Storage Service
• Key name distribution
• Random values
• Lifecycle policy
• Delete objects
• Move objects to Amazon Glacier
• Amazon Glacier
• Infrequently accessed data (cold storage)
• Low-cost starting at $0.007 per GB
• Secure / durable
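The "key name distribution" bullet above refers to starting object keys with random-looking values so that keys do not all share one leading string. A minimal sketch of one way to do that, assuming a short hash-derived prefix; the function name, the prefix length, and the example key layout are hypothetical and not taken from the slides.

```python
import hashlib


def distributed_key(original_key: str, prefix_len: int = 4) -> str:
    """Prepend a short hash-derived prefix to an S3 object key.

    The prefix is deterministic, so the full key can be recomputed from the
    original name whenever the object has to be read back.
    """
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{original_key}"


# Hypothetical key layout, for illustration only.
keys = [distributed_key(f"metrics/2015-10-08/{n:06d}.json") for n in range(3)]
```

A hash of the name is used here instead of a truly random value so that no lookup table is needed to find an object again.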
33. Amazon Simple Queue Service
• Simple
• Easy to set up
• Secure
• Encryption in flight - HTTPS
• Durable
• Multiple servers / data centers
• Scalable
• Automatically scales
34. Amazon S3 Pricing
• $0.0275 - $0.0408 per GB
• Tiered pricing
• Varies by region
• $0.005 - $0.007 per 1,000 PUT requests
• Varies by region
• $0.004 - $0.0056 per 10,000 GET requests
• Varies by region
• Total cost -> $3.87
35. Amazon SQS Pricing
• $0.50 per 1,000,000 requests
• First 1,000,000 requests free
• Total cost -> $0.00
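The arithmetic behind this slide can be sketched as follows. The two prices come from the slide; the function name is hypothetical, and the slide does not state the request volume, so the $0.00 total is only consistent with staying at or below the free allowance.

```python
PRICE_PER_MILLION_REQUESTS = 0.50    # USD, from the slide
FREE_REQUESTS_PER_MONTH = 1_000_000  # from the slide


def sqs_monthly_cost(requests: int) -> float:
    """Return the monthly SQS request cost in USD after the free allowance."""
    billable = max(0, requests - FREE_REQUESTS_PER_MONTH)
    return billable / 1_000_000 * PRICE_PER_MILLION_REQUESTS
```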
38. Amazon Redshift
• Fully-managed, petabyte scale data warehouse
• Fast
• Columnar storage / data compression
• Scalable
• Scale up or down
• Fault tolerant
• Data replicated across nodes / Backed up to Amazon S3
• Familiar
• Connect via ODBC / JDBC
41. Amazon Redshift
• Micro-batch loading
• Number of files = multiple of virtual cores
• Define compression type for each column in table definition
• Load data in sort key order
• Use SSD node type (dc1 instance types)
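The "number of files = multiple of virtual cores" guideline can be illustrated with a small helper that splits one micro-batch into evenly sized parts before loading. This is a sketch under assumptions: the function name and the round-robin strategy are illustrative choices, not something the slides prescribe.

```python
from typing import List, Sequence


def split_for_load(records: Sequence[str], virtual_cores: int,
                   multiple: int = 1) -> List[List[str]]:
    """Split records into virtual_cores * multiple parts of near-equal size.

    Records are dealt out round-robin, so each part keeps the relative order
    of the input (useful if the input is already in sort key order).
    """
    if virtual_cores < 1 or multiple < 1:
        raise ValueError("virtual_cores and multiple must be positive")
    file_count = virtual_cores * multiple
    parts: List[List[str]] = [[] for _ in range(file_count)]
    for index, record in enumerate(records):
        parts[index % file_count].append(record)
    return parts


# Example: 10 rows for a cluster with 4 virtual cores gives 4 parts.
parts = split_for_load([f"row-{n}" for n in range(10)], virtual_cores=4)
```

Each part would then be written as its own file so the load can proceed in parallel.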
42. Amazon Redshift
• Infinite loop
• Create 1 Amazon Kinesis stream with 1 shard
• Attach Lambda function to Amazon Kinesis stream
• Execute workload
• Put record into stream
• Create multiple shards for multiple threads
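The loop described above can be sketched as a Lambda handler that does one unit of work and then writes a record back to the stream that triggered it, which causes the next invocation. This is a minimal sketch, not the presenters' demo code: `make_handler`, the `workload` callable, the stream name, and the partition key are all hypothetical, and the slides do not say what the workload is. Only `put_record` with `StreamName`, `Data`, and `PartitionKey` is the real boto3 Kinesis call.

```python
import json
import time


def make_handler(kinesis_client, stream_name, workload):
    """Build a Lambda handler that re-triggers itself through Kinesis.

    kinesis_client is expected to behave like boto3.client("kinesis").
    """
    def handler(event, context):
        # Run the workload once per invocation, however many records arrived.
        workload()
        # Put exactly one record back so the loop continues without growing.
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps({"tick": time.time()}).encode("utf-8"),
            PartitionKey="loop",
        )
        return len(event.get("Records", []))

    return handler


# In a real deployment (assumption, not run here):
#   import boto3
#   handler = make_handler(boto3.client("kinesis"), "loop-stream", run_load)
```

Passing the client in as an argument keeps the handler easy to exercise locally with a stand-in object.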
47. Amazon Aurora
• Fully-managed relational database
• MySQL 5.6
• Wire compatible
• InnoDB storage engine
• Up to five times better performance than MySQL
• Over 500,000 SELECTs per second
• 100,000 updates per second
• Multi-AZ
• Data replicated 6 ways across 3 zones
48. Amazon Aurora or Amazon Redshift?
• Amazon Redshift
• Data warehouse workload
• Data > 64 TB
• 50 concurrent queries
• Amazon Aurora
• OLTP workload
• Data < 64 TB
• 500,000 SELECT / 100,000 UPDATES per second
49. Amazon Aurora Pricing - Compute
• db.r3.xlarge
• On Demand - $431.52 / month
• 1 Year No Upfront - $277.40 / month (34% savings)
• 1 Year Partial Upfront - $1,250.00 upfront + $131.40 / month (45% savings)
• Total compute cost -> $235.47
52. Zillow
• What is Zillow?
• Zillow is the leading real estate and home-related information
marketplace. Zillow is dedicated to empowering consumers with
data, inspiration and knowledge around the place they call
home.
• Who am I?
• Brian Filppu
• Director, Business Intelligence, Zillow
• I have been at Zillow close to 8 years
• Previous life – Spent about 6 years consulting throughout
North America
53. Zillow – Use Case
• Needed to collect a subset of mobile app metrics
• Solution needed to be delivered in under 3 weeks
• Requirement to aggregate and report metrics back to
business owners several times during the day
• We already had a number of data warehouse
processes in AWS, so we reached out to Steve, our AWS
solutions architect, for assistance
54. Zillow – What Did We Create?
• Custom URL endpoint in Amazon API Gateway
• 16,000,000+ POSTs per day – to start
• Data sent from API Gateway to Amazon Kinesis using AWS
Lambda
• Storing data encrypted with AWS KMS in Amazon S3 using
Lambda
• Analyze our data with Spark on Amazon EMR
• Run Spark jobs throughout the day with AWS Data Pipeline
• Have the ability to consume/analyze data in real time on Spark
on Amazon EMR with Amazon Kinesis if the use case arises
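The "Amazon Kinesis to Amazon S3 with AWS KMS" step in the list above could look roughly like the Lambda handler below. It is a sketch under assumptions, not Zillow's code: `make_archiver`, the bucket, the key prefix, and the KMS key ID are hypothetical. The event shape (`Records[i]["kinesis"]["data"]`, base64-encoded) and the `put_object` parameters `ServerSideEncryption="aws:kms"` and `SSEKMSKeyId` are the standard Kinesis-to-Lambda and boto3 S3 interfaces.

```python
import base64
import uuid


def make_archiver(s3_client, bucket, kms_key_id, prefix="mobile-metrics"):
    """Build a Lambda handler that writes each Kinesis batch to one S3 object.

    s3_client is expected to behave like boto3.client("s3").
    """
    def handler(event, context):
        # Kinesis delivers record payloads to Lambda base64-encoded.
        payloads = [base64.b64decode(r["kinesis"]["data"])
                    for r in event.get("Records", [])]
        if not payloads:
            return None
        key = f"{prefix}/{uuid.uuid4().hex}.jsonl"
        s3_client.put_object(
            Bucket=bucket,
            Key=key,
            Body=b"\n".join(payloads),
            ServerSideEncryption="aws:kms",  # encrypt at rest with KMS
            SSEKMSKeyId=kms_key_id,
        )
        return key

    return handler
```

Writing one object per batch rather than one per record keeps the number of S3 PUT requests well below the number of incoming POSTs.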
56. Zillow – Data Collection Costs
• Using 3 Amazon Kinesis shards costs around $1.30 a
day, which includes hourly + PUT costs.
• On AWS Lambda, we allocated 128 MB of memory per
function call. Lambda runs for under $6 a day.
• Lambda and Amazon Kinesis gave us a cost-effective
solution for storing data with little development time.
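The Amazon Kinesis figure can be roughly reproduced with a short calculation. The shard count and the 16,000,000 POSTs per day come from the slides; the two unit prices below are assumptions (they are not stated in the deck and vary by region and over time), as is the simplification that each POST counts as a single PUT payload unit.

```python
SHARD_HOUR_PRICE = 0.015            # assumed USD per shard-hour
PUT_UNIT_PRICE_PER_MILLION = 0.014  # assumed USD per 1,000,000 PUT payload units


def kinesis_daily_cost(shards: int, puts_per_day: int) -> float:
    """Estimate one day of Kinesis cost: shard-hours plus PUT payload units."""
    hourly = shards * 24 * SHARD_HOUR_PRICE
    puts = puts_per_day / 1_000_000 * PUT_UNIT_PRICE_PER_MILLION
    return hourly + puts


# 3 shards and 16,000,000 POSTs per day, as on the slides.
daily = kinesis_daily_cost(shards=3, puts_per_day=16_000_000)
```

Under these assumed prices the shard-hours dominate the total, so the cost grows mainly with shard count rather than with request volume.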
57. Zillow – Data Analysis
• Use Spark to perform ETL, cleanup, and analysis
throughout the day. ETL includes Parquet conversion,
data partitioning, etc.
• Use Presto on Amazon EMR for long-term
querying/analysis of data.
• Data is stored on Amazon S3. For all Amazon EMR
jobs, we use Amazon S3 as our HDFS.
• Currently running jobs 4+ times a day using AWS Data
Pipeline, which launches Spark jobs.
58. Zillow – What Else Does My Team Run in AWS?
• Use Amazon Redshift for fast access to data
• Big users of Spark and Presto on Amazon EMR for ETL
and ad hoc querying; other use cases involve long-term
historical data not kept in Amazon Redshift
• Amazon SQS, AWS Data Pipeline, Amazon SNS,
Amazon S3, AWS KMS, Amazon API Gateway,
Amazon EC2
59. Zillow – We are Hiring
• My team is hiring ETL data engineers and software
developers
• All open positions at Zillow can be found at
http://www.zillow.com/jobs/
62. Related Sessions
• BDT302 - Real-World Smart Applications with Amazon
Machine Learning
• BDT309 - Data Science & Best Practices for Apache
Spark on Amazon EMR
• BDT310 - Big Data Architectural Patterns and Best
Practices on AWS
65. Code used for the demo in this session is
available for download here:
http://abrstevepermalink.s3.amazonaws.com/Demo.zip
66. Amazon API Gateway Pricing
• $3.50 per 1,000,000 calls
• Data Transfer In - Free
• Data Transfer Out
• $0.09/GB for the first 10 TB
• $0.085/GB for the next 40 TB
• $0.07/GB for the next 100 TB
• $0.05/GB for the next 350 TB
• 1,000,000,000 calls / 1KB payload
• $3,500.00 – Gateway
• $85.83 – Data Transfer Out
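The two totals on this slide can be checked with a short tiered-pricing calculation. The prices and tier sizes are taken from the slide; the function names are hypothetical, and the binary units (1 KB = 1,024 bytes, 1 GB = 1,024^3 bytes, 1 TB = 1,024 GB) are an assumption, chosen because they reproduce the slide's figure.

```python
PRICE_PER_MILLION_CALLS = 3.50  # USD, from the slide

# (tier size in GB, USD per GB), from the slide; 1 TB taken as 1,024 GB.
TRANSFER_OUT_TIERS = [
    (10 * 1024, 0.09),
    (40 * 1024, 0.085),
    (100 * 1024, 0.07),
    (350 * 1024, 0.05),
]


def gateway_cost(calls: int) -> float:
    """Return the API Gateway request charge in USD."""
    return calls / 1_000_000 * PRICE_PER_MILLION_CALLS


def transfer_out_cost(gigabytes: float) -> float:
    """Return the data-transfer-out charge in USD using the slide's tiers."""
    cost, remaining = 0.0, gigabytes
    for tier_size, price in TRANSFER_OUT_TIERS:
        used = min(remaining, tier_size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            return cost
    # The slide lists no price beyond the first 500 TB.
    raise ValueError("volume exceeds the tiers listed on the slide")


calls = 1_000_000_000
payload_bytes = 1024                         # 1 KB per response (assumed binary KB)
gigabytes_out = calls * payload_bytes / 1024 ** 3
request_total = gateway_cost(calls)
transfer_total = round(transfer_out_cost(gigabytes_out), 2)
```

At this volume only the first tier applies, since one billion 1 KB responses come to a little under 1 TB.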