AWS re:Invent 2016: Disrupting Big Data with Cost-effective Compute (CMP302)

Amazon EC2 Spot instances provide acceleration, scale, and deep cost savings to run time-critical, hyper-scale workloads for rapid data analysis. In this session, AOL and Metamarkets will present lessons learned and best practices from scaling their big data workloads using popular platforms like Presto, Spark and Druid.

AOL will present how they process, store, and analyze big data securely and cost effectively using Presto. AOL achieved 70% savings by separating compute and storage, dynamically resizing clusters based on volume and complexity, and using AWS Lambda to orchestrate processing pipelines. Metamarkets, an industry leader in interactive analytics, will present how they leverage Amazon EBS to persist 185 TiB of (compressed) state to run Druid historical nodes on EC2 Spot instances. They will also cover how they run Spark for batch jobs to process 1-4 PiB of data across 200 B to 1 T events/day, saving more than 60% in costs.

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. November 30, 2016 Disrupting Big Data with Cost-effective Compute Charles Allen, Metamarkets Durga Nemani, Gaurav Agrawal, AOL Anu Sharma, Amazon EC2 CMP302
  2. 2. Amazon EC2 Spot instances • Regular EC2 instances opened to the Spot market when spare • Prices on average 70-80% lower than On-Demand • Best suited for workloads that can scale with compute • Accelerate jobs 5-10 times e.g. run faster CI/CD pipelines (case study: Yelp) • Reduce costs by 5-10 times, scale stateless web applications (case study: Mapbox, Ad-tech) • Generate better business insights from your event stream
  3. 3. In this session • Use Case: context and history • AOL: Separation of Compute and Storage using Amazon EMR and EC2 Spot instances • Architecture • Cost Optimization • Orchestration • Monitoring • Best Practices • Metamarkets: Spark and Druid on EC2 Spot instances • Architecture Overview: Real-time, Batch Jobs, Lambda • Spark on Spot instances • Druid on Spot instances • Monitoring
  4. 4. Business Intelligence Data Set • Event Data • Timestamp • Dimensions/Attributes • Measures • Total data set is huge, billions of events per day
  5. 5. Relational Databases Traditional Data Warehouse Star Schema • FACT table contains primary information and measures to aggregate • DIM tables contain additional attributes about entities • Queries involve joins between central FACT and DIM tables Performance degrades as data scales.
  6. 6. Key/Value Stores Fast writes, fast lookups • Pre-compute every possible query • As more columns are added, query space grows exponentially • Primary key is a hash of timestamp and dimensions • Value is measure to aggregate • Shuffle data from storage to computational buffer - slow • Difficult to create intelligent indexes Precomputation Range Scans
  7. 7. General Compute Engines SQL on Hadoop • Scale with compute power • Generate up to 5-10x faster business insights with cheaper compute • Or just reduce costs by 80-90%
  8. 8. Pioneers to Settlers Algorithmic Efficiency to Mundane Efficiency
  9. 9. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Separation of Compute and Storage Durga Nemani, System Architect, AOL Gaurav Agrawal, Software Engineer, AOL Big Data Processing with Amazon EMR and EC2 Spot instances
  10. 10. Architecture
  11. 11. Architecture. Data processing: AWS Lambda (orchestration), Elastic IP, Amazon EMR Hive, AWS IAM, Amazon S3 (data lake), Amazon DynamoDB (data validation), Amazon EMR Hive client, Amazon RDS (Hive metastore). Data analytics: Amazon EMR Presto, Elastic IP, Amazon EMR Presto client
  12. 12. Key features and advantages • Separation of compute and storage • Scale compute and storage independently • Separate data processing and analytics • Hive for processing, Presto for analytics • No data migration • S3 Data lake • Single source of truth • Columnar format for performance and compression • VPC design • Identified by Name Tags • AOL CIDR, VPN • Few lines of code change vs big data migration efforts
  13. 13. Cost Optimization
  14. 14. Amazon EC2 Spot Instances • Keep in mind • Availability • Spot pricing varies by • Instance type • Availability Zone • Different provisioning times • AOL requirement • Major restatement - 15-20K EC2 instances • Data for 15+ countries • Frequency: HLY, DLY, WLY, MTD, MLY, 28 days
  15. 15. EMR Deployment Setup • Set up VPC in all regions • Ensure Spot limits • Set up a hard EC2 limit per AZ • Multiple instance types • Define instance type-to-core mapping based on • Data volume • Code complexity • Pay the actual Spot price, not the bid price!
  16. 16. Deployment Logic Diagram: Data volume + code complexity → pick instance type and number of cores (= A) → walk the sorted Spot-price AZ list → count open/active instances in that AZ (= B) → if A + B < AZ limit, kick off EMR; otherwise move to the next AZ in the list
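A hedged sketch of that decision flow with boto3: size the cluster from data volume and code complexity, walk availability zones in ascending Spot-price order, and only launch where A + B stays under the per-AZ limit. The core limit, instance type choice, and the track_running_cores / kick_off_emr helpers are illustrative assumptions, not AOL's actual code.

```python
# Sketch of the AZ-selection logic described on the Deployment Logic slide.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
AZ_CORE_LIMIT = 2000          # assumed hard per-AZ core limit
INSTANCE_TYPE = "m3.xlarge"   # chosen from data volume + code complexity

def azs_by_spot_price(instance_type):
    """Return availability zones sorted by their most recent Spot price."""
    history = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    )["SpotPriceHistory"]
    latest = {}
    for entry in sorted(history, key=lambda e: e["Timestamp"]):
        latest[entry["AvailabilityZone"]] = float(entry["SpotPrice"])
    return sorted(latest, key=latest.get)

def pick_az(cores_needed, cores_in_use_by_az):
    """Walk AZs cheapest-first and return the first one with headroom."""
    for az in azs_by_spot_price(INSTANCE_TYPE):
        in_use = cores_in_use_by_az.get(az, 0)      # open/active instances (B)
        if cores_needed + in_use < AZ_CORE_LIMIT:   # A + B < AZ limit
            return az
    return None  # no AZ has headroom; retry later or pick another instance type

# Hypothetical usage:
# az = pick_az(cores_needed=400, cores_in_use_by_az=track_running_cores())
# if az: kick_off_emr(az, INSTANCE_TYPE, cores_needed)  # e.g. emr.run_job_flow
```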
  17. 17. Average Cost Saving Graphs: On-Demand vs. Static AZ Spot, ~80% savings (**m3.xlarge Sept'2016 cost)
  18. 18. Average Cost Saving Graphs: Static AZ vs. Cheap AZ, an additional 10-15% savings (**m3.xlarge Sept'2016 cost)
  19. 19. Why cheaper AZ matters? • Data transfer cost • Worst case scenario – cheaper AZ not in local region • More data => more nodes + more hours (**m3.xlarge Sept'2016 cost)
Size (GB) | Cores | Hours | Local AZ Cost | Cheaper AZ Cost | Transfer Cost | Total Cost | Cost Savings
10 | 25 | 1 | 429 | 356 | 73 | 429 | 0%
50 | 100 | 2 | 3,431 | 2,847 | 365 | 3,212 | 6%
100 | 300 | 3 | 20,586 | 17,082 | 730 | 17,812 | 13%
200 | 500 | 5 | 51,465 | 42,705 | 1,460 | 44,165 | 14%
300 | 700 | 7 | 109,792 | 91,104 | 2,190 | 93,294 | 15%
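The break-even works as in this minimal sketch, using the 50 GB row from the table above. The per-GB transfer rate is backed out of the table (365 / 50 GB), not quoted from an AWS price list.

```python
# A remote AZ's lower Spot price only wins if the compute savings exceed the
# cross-AZ/region data transfer cost.

def total_remote_cost(cheaper_az_compute, data_gb, transfer_rate_per_gb):
    """Compute cost in the cheaper AZ plus the cost of moving the data there."""
    return cheaper_az_compute + data_gb * transfer_rate_per_gb

local_cost = 3431  # Local AZ cost (cost units from the slide's table)
remote_cost = total_remote_cost(cheaper_az_compute=2847, data_gb=50,
                                transfer_rate_per_gb=7.3)  # 365 / 50 GB
savings = 1 - remote_cost / local_cost
print(f"Remote total: {remote_cost:.0f}, savings vs. local: {savings:.0%}")  # ~6%
```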
  20. 20. EMR Region Distribution (AOL DW, Sept-Oct 2016): us-east-1 20%, ap-northeast-1 1%, sa-east-1 9%, ap-southeast-1 3%, ap-southeast-2 9%, us-west-2 26%, us-west-1 4%, eu-west-1 28%. 80% of the time the cheaper AZ is not in the local region
  21. 21. Average Cost Saving Graphs: Static AZ vs. Cheap AZ vs. AZ+Data+Code, daily over September 2016, 15-22% savings (**m3.xlarge Sept'2016 cost)
  22. 22. Orchestration: AWS Lambda
  23. 23. Process Pipeline Overview • Multiple stages between Raw Data & Final Summary • Ensure dependencies • Integration with Data services • Extensible, Scalable & Reliable • Recovery Options • Notifications • Directed Acyclic Graph
  24. 24. Sample DW Workflow: a directed acyclic graph of operations (nodes a through j)
  25. 25. AOL DW Process Pipeline (stages chained via S3 events): Amazon S3 → AWS Lambda (Python Boto) → Amazon EMR → Amazon S3 → AWS Lambda (Python Boto) → Amazon EMR
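A minimal sketch of one pipeline stage under this event-driven model, assuming an S3 PUT notification wired to a Lambda function. The cluster sizing, bid price, release label, and Hive script location are illustrative placeholders, not AOL's configuration.

```python
# Lambda handler: a new object landing in the data lake triggers a transient
# EMR cluster (Spot core nodes) that runs one Hive step and auto-terminates.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        emr.run_job_flow(
            Name=f"dw-stage-{key.replace('/', '-')}",
            ReleaseLabel="emr-5.2.0",
            Applications=[{"Name": "Hive"}],
            ServiceRole="EMR_DefaultRole",
            JobFlowRole="EMR_EC2_DefaultRole",
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                     "InstanceCount": 1, "Market": "ON_DEMAND"},
                    {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                     "InstanceCount": 10, "Market": "SPOT", "BidPrice": "0.30"},
                ],
                "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster
            },
            Steps=[{
                "Name": "hive-aggregation",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["hive-script", "--run-hive-script", "--args",
                             "-f", f"s3://{bucket}/scripts/aggregate.hql",
                             "-d", f"INPUT=s3://{bucket}/{key}"],
                },
            }],
        )
```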
  26. 26. Benefits & Suggestions • Improved SLA due to event-based model • Serverless – Zero Administration • Millisecond response time • Pricing – first 1 million requests/month free • Generic utilities for Extensibility • Built-in Auto Scaling • CloudWatch Logging • Replaced ~2000 Autosys jobs
  27. 27. EMR Monitoring
  28. 28. EMR Monitoring - Prunella • Tons of clusters/day • EMR Failure causes • Network Connectivity • Bootstrap Actions • Zero OPS Hours • SLA improvement • No datacenter dependency • Notifications – Email/Slack
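A sketch of the kind of automated failure detection a tool like Prunella provides (this is not the actual tool): scan recently terminated clusters for errors and push a notification. The SNS topic ARN is a hypothetical placeholder.

```python
# Scan for EMR clusters that terminated with errors in the last hour and alert.
from datetime import datetime, timedelta, timezone
import boto3

emr = boto3.client("emr")
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:emr-alerts"  # placeholder

def check_failed_clusters():
    since = datetime.now(timezone.utc) - timedelta(hours=1)
    failed = emr.list_clusters(
        ClusterStates=["TERMINATED_WITH_ERRORS"],
        CreatedAfter=since,
    )["Clusters"]

    for cluster in failed:
        reason = cluster["Status"]["StateChangeReason"].get("Message", "unknown")
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=f"EMR cluster failed: {cluster['Name']}",
            Message=f"{cluster['Id']} terminated with errors: {reason}",
        )
```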
  29. 29. Good to have • S3 Lifecycle based on Tags • Terminate Long STARTING EMR Cluster • Python 3 Lambda Support • Lambda Code Test/Deployment • Kappa • Global EMR Dashboard • Redshift External Tables
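The "Terminate Long STARTING EMR Cluster" item above can be approximated today with a small scheduled job; a sketch, assuming a 30-minute provisioning timeout.

```python
# Reap clusters stuck in STARTING (e.g. bootstrap or network issues) so they
# stop accruing cost and the pipeline can retry elsewhere.
from datetime import datetime, timedelta, timezone
import boto3

emr = boto3.client("emr")
MAX_STARTING = timedelta(minutes=30)  # assumed threshold

def reap_stuck_clusters():
    now = datetime.now(timezone.utc)
    stuck = [
        c["Id"]
        for c in emr.list_clusters(ClusterStates=["STARTING"])["Clusters"]
        if now - c["Status"]["Timeline"]["CreationDateTime"] > MAX_STARTING
    ]
    if stuck:
        emr.terminate_job_flows(JobFlowIds=stuck)
```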
  30. 30. Recap • Transient Spot Architecture • S3 as Data Lake • Cost Optimization • Dynamic choice of Spot AZ and Number of Cores • Serverless Process Pipeline • AWS Lambda for event-driven design • Automated EMR Monitoring • Reduced manual intervention for 1000s of clusters
  31. 31. Photo Credits • Gabor Kiss - http://bit.ly/2epkQJY • AustinPixels- http://bit.ly/2eAenqr • Mike - http://bit.ly/2eqGx82
  32. 32. Related Sessions • AWS re:Invent 2015 | (BDT208) A Technical Introduction to Amazon Elastic MapReduce • https://www.youtube.com/watch?v=WnFYoiRqEHw • AWS re:Invent 2015 | (BDT210) Building Scalable Big Data Solutions: Intel & AOL • https://www.youtube.com/watch?v=2yZginBYcEo
  33. 33. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark and Druid on EC2 Spot Instances Charles Allen, Metamarkets
  34. 34. About Me Director of Platform Druid PMC @drcrallen charles.allen@metamarkets.com Special thanks to Jisoo Kim
  35. 35. Programmatic data is 100x larger than Wall Street
  36. 36. Metamarkets + Industry leader in interactive analytics for programmatic marketing + > 100B events / day + Typical peak approx 2,000,000 events / sec + Massaged, joined, HA replicated → 3M/s Move fast. Think big. Be open. Have fun.
  37. 37. Metamarkets + Event ingestion lag down to few ms + Dynamic queries + Query latency less than 1 second + Specially tailored for real-time bidding
  38. 38. Current Spot Usage
  39. 39. Current Spot Usage + Spark + Druid + Jenkins
  40. 40. Brief Architecture Overview - Real-time: Kafka → real-time indexing (Kafka / Samza) → Druid ● Very fast ● Pretty accurate ● On-time data
  41. 41. Brief Architecture Overview - Batch: Kafka → S3 → Spark → Druid Historical, a few hours later ● Highly accurate ● Deduplicated ● Late data
  42. 42. Brief Architecture Overview (Lambda): a real-time path and a batch path (Δ few hrs) feed from Kafka into Historical, combined for the user
  43. 43. Key Technologies Used + Kafka + Samza + Spark + Druid
  44. 44. Spark on Spot
  45. 45. Why Spark? The Good: + No HDFS + Good enough partial failure recovery + Native Mesos, Yarn, and Stand-alone The Bad: + Rough to configure multi-tenant
  46. 46. Spark + Between 1 and 4 PiB / day (mem bytes spilled) + Between 200B and 1T events / day + Peak days can be up to 5x baseline Think Big.
  47. 47. Cost Savings (Spark): savings vs. On-Demand >60%, approximately equal to a 3-year term
  48. 48. Tradeoff + More complex job failure handling + “Did my job die because of Me, Spark, the Data, or the Market?” + More random delays + More man-hours to manage, or automation to build
  49. 49. Druid on Spot
  50. 50. Druid on Spot Some of our Historical nodes run on Spot 185 TB (compressed) state on EBS on Spot ⅕ of a petabyte can vanish… and come back in 15 minutes
  51. 51. Druid Historical Data tiers: HOT (1 hr < EVENT_TIME < X months), COLD (X months < EVENT_TIME < Y months), ICY (Y months < EVENT_TIME < Z years)
  52. 52. Historical Tier QPS (Logscale)
  53. 53. Historical Tier QPS (Logscale) Spot can go here
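One way to express the HOT/COLD/ICY split (and keep Spot-backed historicals on the lower-QPS tiers) is with Druid coordinator load rules; a hedged sketch, where the tier names, periods, replica counts, and coordinator URL are assumptions rather than Metamarkets' actual configuration, and the rule syntax should be checked against your Druid version.

```python
# Post period-based load rules mapping data age to historical tiers.
import requests

COORDINATOR = "http://coordinator.example.com:8081"  # placeholder
DATASOURCE = "events"                                 # placeholder

rules = [
    # Most recent month stays on the HOT tier (e.g. on-demand nodes).
    {"type": "loadByPeriod", "period": "P1M", "tieredReplicants": {"hot": 2}},
    # Next few months on COLD; everything older on ICY, where Spot-backed
    # historicals can absorb the lower query rate.
    {"type": "loadByPeriod", "period": "P6M", "tieredReplicants": {"cold": 2}},
    {"type": "loadForever", "tieredReplicants": {"icy": 1}},
]

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
)
resp.raise_for_status()
```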
  54. 54. Using EBS With Druid on Spot + Define a "pool" tag on EBS volumes + If the EBS "pool" is "empty" (no unmounted volumes), create a new volume (with proper tags) and mount it + Otherwise, claim a drive from the pool + Sanity-check the volume; discard it if unrecoverable
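A minimal sketch of that claim-or-create logic with boto3; the tag scheme, volume size, and device name are assumptions, and the sanity check is omitted for brevity.

```python
# Claim an unattached pool-tagged EBS volume if one exists in this AZ,
# otherwise create a fresh one, then attach it to the instance.
import boto3

ec2 = boto3.client("ec2")
POOL_TAG = {"Key": "pool", "Value": "druid-historical"}  # assumed tag scheme

def claim_or_create_volume(instance_id, availability_zone, size_gib=2000):
    available = ec2.describe_volumes(Filters=[
        {"Name": f"tag:{POOL_TAG['Key']}", "Values": [POOL_TAG["Value"]]},
        {"Name": "status", "Values": ["available"]},
        {"Name": "availability-zone", "Values": [availability_zone]},
    ])["Volumes"]

    if available:
        volume_id = available[0]["VolumeId"]
    else:
        # Pool is empty: create a new volume carrying the pool tag.
        volume_id = ec2.create_volume(
            AvailabilityZone=availability_zone,
            Size=size_gib,
            VolumeType="gp2",
            TagSpecifications=[{"ResourceType": "volume", "Tags": [POOL_TAG]}],
        )["VolumeId"]
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id,
                      Device="/dev/xvdf")
    return volume_id
```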
  55. 55. Using EBS With Druid on Spot + Monitor spot notifications[1] to stop gracefully + If stop is detected, prepare to die gracefully + Stop applications (hook) + Unmount volume cleanly + Do not actually terminate instance; wait for death [1] https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
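A sketch of polling the termination notice from [1]; the stop hook and mount point are placeholders for whatever the host actually runs (e.g. a Druid historical service and its EBS mount).

```python
# Watch the Spot termination-time metadata path and shut down gracefully:
# stop the application, unmount the volume, then wait to be reclaimed.
import subprocess
import time
import urllib.error
import urllib.request

TERMINATION_URL = ("http://169.254.169.254/latest/meta-data/"
                   "spot/termination-time")

def termination_pending():
    try:
        # The path only returns 200 once a termination has been scheduled.
        with urllib.request.urlopen(TERMINATION_URL, timeout=2) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False

while True:
    if termination_pending():
        # Roughly two minutes of warning: stop the app, unmount cleanly,
        # then simply wait for the reclaim (do not self-terminate).
        subprocess.run(["systemctl", "stop", "druid-historical"], check=False)
        subprocess.run(["umount", "/mnt/druid"], check=False)
        time.sleep(3600)
    time.sleep(5)
```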
  56. 56. Terrifying to Boring (Originally ran without EBS reattachment) [ops] Search Alert: More than 0 results found for "DRUID - Spot Market Fluctuations" Now mundane.
  57. 57. Druid Tips + Coordinator (thing that moves state around) does better with NO tier than with a half-tier + Flapping nodes can cause backpressure; better to kill the entire tier than to have it repeatedly flap up and down + Nodes usually have a burn-in time (a few minutes) before they reach steady-state fast queries
  58. 58. Druid + Spot + EBS Accomplished by EBS re-attachment Metamarkets is proud to Open Source this tool Be Open.
  59. 59. Monitoring
  60. 60. Spot Price on the AWS Management Console If only there were some tool that allowed powerful, drill-down analytics on real-time markets…
  61. 61. x1.32xl price stability across zones
  62. 62. Final Thoughts
  63. 63. Spot Caveats + Switching from Spot to On-Demand does NOT always work + Pricing strategy tuned to value of lost work + Scaling in a Spot market must be done SLOWLY (tens of nodes at a time) + us-east-1 is crowded
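The "scale slowly" caveat can be enforced mechanically, for example by adding Spot capacity in batches of tens of nodes with a pause between requests; a sketch, with batch size, pause, and launch specification as assumptions.

```python
# Add Spot capacity in small batches instead of one large request, so each
# request's effect on the market is visible before asking for more.
import time
import boto3

ec2 = boto3.client("ec2")

def scale_up_slowly(total_nodes, launch_spec, batch_size=20, pause_s=300):
    for launched in range(0, total_nodes, batch_size):
        count = min(batch_size, total_nodes - launched)
        ec2.request_spot_instances(
            InstanceCount=count,
            Type="one-time",
            LaunchSpecification=launch_spec,
        )
        time.sleep(pause_s)  # let the market settle before the next batch
```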
  64. 64. Lessons Learned… “If I could do it all over again” + Multi-homed (at least by AZ) from the very start + us-west + More ZK quorums + Build on cluster resource framework
  65. 65. We Are Hiring! Have Fun!
  66. 66. Metamarkets and Spot + Metamarkets has great internal tooling for Spot market insight + Druid uses EBS reattachment + Spark works well with proper configuration
  67. 67. Thank you!
  68. 68. Remember to complete your evaluations!
  69. 69. Related Sessions • AWS re:Invent 2015 | (BDT208) A Technical Introduction to Amazon Elastic MapReduce • https://www.youtube.com/watch?v=WnFYoiRqEHw • AWS re:Invent 2015 | (BDT210) Building Scalable Big Data Solutions: Intel & AOL • https://www.youtube.com/watch?v=2yZginBYcEo
