DoneDeal - AWS Data Analytics Platform

  1. DoneDeal - Data Platform. April 2016. Martin Peters (martin@donedeal.ie / @martinbpeters), DoneDeal Analytics Team Manager.
  2. "If you don't understand the details of your business you are going to fail. If we can keep our competitors focused on us while we stay focused on the customer, ultimately we'll turn out all right." - Jeff Bezos, Amazon
  3. What do these companies have in common?
  4. Data is … one of our biggest assets. "With the right set of information, you can make business decisions with higher levels of confidence, as you can audit and attribute the data you used for the decision-making process." - Krish Krishnan, 2014
  5. Business Intelligence 101. For small companies the gap is often filled with custom ad hoc solutions with limited and rather static reporting capability.
  6. What and why BI? As a company grows, the Availability, Accuracy and Accessibility requirements of data increase.
  7. Some terminology: the ETL process.
     • Extraction: extract data from homogeneous or heterogeneous data sources.
     • Transformation: process, blend, merge and conform the data.
     • Loading: store the data in the proper format or structure for querying and analysis.
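     A minimal sketch of these three steps in PySpark (the deck notes the custom ETL was written in PySpark; the session setup, paths and column names below are illustrative assumptions, not DoneDeal's actual code):

        # Illustrative ETL sketch in PySpark; paths and columns are made up.
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

        # Extraction: read raw events from a source (here, JSON on S3).
        raw = spark.read.json("s3://example-bucket/events/2016/04/01/")

        # Transformation: clean, blend and conform the data.
        conformed = (raw
                     .dropDuplicates(["event_id"])
                     .withColumn("event_date", F.to_date("event_ts"))
                     .filter(F.col("event_type").isNotNull()))

        # Loading: store in a query-friendly structure for analysis.
        conformed.write.mode("overwrite").parquet("s3://example-bucket/conformed/2016/04/01/")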
  8. April 2015 - April 2016
  9. Timeline: 2014-2017 [timeline graphic]. From siloed data, manual/error-prone blending, and the value of BI/data not being understood (2014), through platform design and implementation of the storage layer, batch layer, traditional BI and serving layer, to the speed layer and real-time analytics (2017).
  10. Business Goals & Objectives
     1. Build a future-proof data analytics platform that will scale with the company over the next 5 years.
     2. Take ownership of our data. Collect more data.
     3. Replace the existing reporting tool.
     4. Provide a holistic view of our users (buyers and sellers), ads, products.
     5. Use our data in a smarter manner and provide recommendations in a timely fashion.
  11. Apollo Team: Data Engineer, Data Analyst, Architect, DevOps, BI Consultants, Solution Architect.
     • An Analytics Platform that includes Event Streaming, Data Consolidation, Cleansing & Warehousing, Data Visualisation, Business Intelligence and Data Product Delivery.
     • Apollo brings agility and flexibility to our data model; data ownership is key and lets us blend data more conveniently.
  12. Apollo Principles
     Project Principles: 1. System must scale but costs grow more slowly. 2. Occam's Razor. 3. Analytics and core platforms are independent. 4. Monitoring of the platform is key. 5. Low maintenance.
     Data Principles: 1. Accurate, Available, Accessible. 2. Ownership - Business & Technical. 3. Standardised across teams. 4. Integrity. 5. Identifiable - primary source and globally unique identifier.
  13. Apollo Architectural Principles (www.slideshare.net/AmazonWebServices/big-data-architectural-patterns-and-best-practices-on-aws)
     • Decoupled "data bus"
     • Use the right tool/service for the job ➡ data structure, latency, throughput, access patterns
     • Use Lambda architecture ideas ➡ immutable (append-only), batch, [speed, serving] layers
     • Leverage AWS Managed Services ➡ scalable/elastic, available, reliable, secure, no/low admin
     • Big data != big cost
  14. Tools/Services in Production [architecture diagram; labels: Data Science, Business Users]
  15. ETL Architecture: Custom-Built Pipeline [pipeline diagram: E, T and L stages, each emitting a Summary file]
  16. ETL: Control over complex dependencies
     • Allows control of ETL pipelines with complex dependencies
     • Easy plug-in of new data sources
     • Orchestration with Data Pipeline and common Status or Summary files
     • Idempotent pipeline (see the sketch after this slide)
     • Historical data extracted as a simulated stream
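     The "Idempotent pipeline" and "Summary files" points combine naturally: a step records a summary object in S3 when it finishes, and a rerun skips any work whose summary already exists. A minimal sketch with boto3 (the bucket, key layout and function names are illustrative assumptions):

        # Idempotency via a summary/marker object in S3 (illustrative).
        import boto3

        s3 = boto3.client("s3")

        def already_done(bucket, day_prefix):
            """A day's step is complete once its summary object exists."""
            resp = s3.list_objects_v2(Bucket=bucket, Prefix=day_prefix + "_SUMMARY")
            return resp.get("KeyCount", 0) > 0

        def run_step(bucket, day_prefix, work):
            if already_done(bucket, day_prefix):
                return  # rerunning the pipeline is a no-op
            work()  # the E, T or L work for this day
            s3.put_object(Bucket=bucket, Key=day_prefix + "_SUMMARY", Body=b"ok")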
  17. ETL: By the numbers
     • Extraction: 4000 days processed; 7 different data sources; 14 domains; 13 event types
     • Orchestration: 1200 processing days; 4GB/day; 3 environments; 15 data pipelines
     • Data Lake: 11M events streamed/day; 3 million files; 3 TB of data stored over 7 buckets
     • Redshift: 7B records in production; 6 schemas (core and aggregate); 86 tables in the core schema
  18. Kinesis Streams
     • 1 stream with 4 shards
     • Data retention of 24hrs
     • KCL on EC2 writes data to S3 ready for Spark
     • Max record (data blob) size of 1MB
     • 1,000 records/sec per shard write
     • 5 transactions/sec read, or 2MB/sec
     • Server-side API logging from 7 application servers using the Log4J appender
     • Event buffering at source [in progress]
     [chart: put-record requests]
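     The deck's application servers push events via the Kinesis Log4J appender; purely for illustration, the equivalent producer call in Python with boto3 looks like this (the stream name and event shape are assumptions):

        # Writing one event to the stream (illustrative).
        import json
        import boto3

        kinesis = boto3.client("kinesis")

        def put_event(event):
            kinesis.put_record(
                StreamName="example-events",             # 1 stream, 4 shards
                Data=json.dumps(event).encode("utf-8"),  # must stay under the 1MB record limit
                PartitionKey=str(event["user_id"]),      # spreads records across the shards
            )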
  19. S3
     • Simple Storage Service provides secure, highly scalable, durable cloud storage
     • Native support for Spark, Hive
  20. S3
     • A strongly defined naming convention
     • YYYY/MM/DD prefix used (sketch below)
     • Avro format used for OLTP data, JSON otherwise; probably the right choice (schema evolution), although we haven't taken advantage of that yet
     • Allows easy retrieval of data from a particular time period
     • Easy to maintain and browse
     • Handles the summaries from the E, T & L steps
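     A short sketch of how a YYYY/MM/DD prefix convention supports "easy retrieval of data from a particular time period" (the bucket and source names are illustrative assumptions):

        # Build a day prefix and list that day's objects (illustrative).
        from datetime import date
        import boto3

        s3 = boto3.client("s3")

        def day_prefix(source, d):
            return "{}/{:%Y/%m/%d}/".format(source, d)   # e.g. "oltp/2016/04/01/"

        resp = s3.list_objects_v2(
            Bucket="example-data-lake",
            Prefix=day_prefix("oltp", date(2016, 4, 1)),
        )
        keys = [obj["Key"] for obj in resp.get("Contents", [])]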
  21. Spark on EMR
     • AWS's managed Hadoop framework that can interact with data from S3, DynamoDB, etc.
     • Apache Spark: a fast, general-purpose engine for large-scale in-memory data processing. Runs on Hadoop/EMR and can read from S3.
     • PySpark + SparkSQL was the focus in Apollo.
     • Streaming and ML will be the focus in the months ahead.
  22. Spark on EMR
     • Spark is easy; performant Spark code is hard and time-consuming
     • DataFrame API exclusively
     • Developing Spark applications in a local environment with a limited-size dataset differs significantly from running Spark on EMR (e.g. joins, unions, etc.)
     • Don't pre-optimize
     • Naive joins are to be avoided (see the sketch after this slide)
     • The Spark UI is invaluable for testing performance (both locally and on EMR) and for understanding Spark's underlying mechanics
     • Some scaling of Spark on EMR; settled on memory-optimised r3.2xlarge instances (8 vCPUs, 61GB RAM)
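     One standard way to avoid a naive join in the DataFrame API is to broadcast the small side so the large table is never shuffled; a minimal sketch (the table paths and join key are illustrative assumptions, not the deck's code):

        # Broadcast join sketch in PySpark (illustrative).
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import broadcast

        spark = SparkSession.builder.appName("join-sketch").getOrCreate()

        events = spark.read.parquet("s3://example-data-lake/conformed/events/")
        users = spark.read.parquet("s3://example-data-lake/conformed/users/")  # small dimension

        # broadcast() ships the small table to every executor, so the large
        # events table avoids a cluster-wide shuffle.
        enriched = events.join(broadcast(users), on="user_id", how="left")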
  23. Data Pipeline + Simple Notification Service
     • Data Pipeline is a service to reliably process and move data between AWS applications (e.g. S3, EMR, DynamoDB)
     • Pipelines run on a schedule and alarms are issued with Simple Notification Service (SNS)
     • EMR/Spark used for compute and EC2 used for loading data into Redshift
     • Debugging can be a challenge
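     The SNS alarm side reduces, in code, to a single publish call; a hedged sketch with boto3 (the topic ARN and subject format are placeholders):

        # Publish a pipeline alarm to an SNS topic (illustrative).
        import boto3

        sns = boto3.client("sns")

        def alert(pipeline_name, message):
            sns.publish(
                TopicArn="arn:aws:sns:eu-west-1:123456789012:example-etl-alarms",
                Subject="[ETL] {} failed".format(pipeline_name),
                Message=message,
            )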
  24. Redshift
     • Dense Compute or Dense Storage?
       - Single ds2.xlarge instance
       - The right balance between storage/memory/compute and cost/hr
     • Strict ETL: no transformation is carried out in the DW; an append-only strategy
       - Leverage the power and scalability of EMR and the insert speed of Redshift
       - No updates in the DW; drop and recreate
     • Tuning is a time-consuming task and requires rigorous testing
     • Define sort, distribution and interleaved keys as early as possible
     • Reserved nodes will be used in future
     • Kimball star schema: conformed dimensions across all data sources
     Schemas per environment (with read permissions across them): Core: cmtest / cmdev / cmprod; Agg: agtest / agdev / agprod
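     The append-only load from S3 into Redshift is typically a bulk COPY; a minimal sketch using psycopg2 (connection details, table name, IAM role and file layout are illustrative assumptions):

        # Append-only bulk load into Redshift (illustrative).
        import psycopg2

        conn = psycopg2.connect(
            host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
            port=5439, dbname="dw", user="loader", password="example")

        with conn, conn.cursor() as cur:
            # No UPDATEs in the warehouse: just bulk appends, per the slide.
            cur.execute("""
                COPY cmprod.fact_events
                FROM 's3://example-data-lake/load/2016/04/01/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
                FORMAT AS AVRO 'auto';
            """)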
  25. Tableau on EC2
     • Tableau Server runs on EC2 (c3.2xlarge) inside the AWS environment
     • Tableau Desktop is used to develop dashboards that are published to the server
     • Connection to the Redshift Data Warehouse via a JDBC/ODBC connector
     • Map support is poor for countries outside the US
     http://www.slideshare.net/AmazonWebServices/analytics-on-the-cloud-with-tableau-on-aws
  26. Up Next?
     • Increase the number of data streams / remove dependence on OLTP
     • Traditional BI/reporting: more dashboards
     • [In progress] Data products with Spark ML/Amazon ML, DynamoDB, Lambda & API Gateway
     • Trials of Kinesis Firehose, Kinesis Analytics, QuickSight
     • Improved code deployment with CodePipeline and CodeCommit
  27. DoneDeal Image Service Upgrade
     • Image storage & transformation moved to AWS
     • Over 4.5M images migrated to S3
     • ECS + ELB used for image resizing
     • An autoscaling group enables adding new image sizes
     • We now run Docker in production thanks to ECS
     • Investigating uses for AWS Lambda in image processing
     For more info: @davidconde
  28. DoneDeal Dynamic Test Environments
     • QA can now run any feature branch of DoneDeal directly from our CI server
     • Uses Jenkins / Docker (Machine + Compose) / EC2 & Route 53
     • Enables rapid testing without server contention
     • Also used by the mobile team to develop against & test new APIs
     For more info: @davidconde
  29. Q&A Session: Nigel Creighton, CTO at DNM; Martin Peters, BI Manager at DoneDeal.

DoneDeal's AWS Data Analytics Platform, built using AWS products: EMR, Data Pipeline, S3, Kinesis, Redshift and Tableau. The custom ETL was written in PySpark.
