
Scaling ETL with Hadoop (Gwen Shapira)

Published in: Technology


  1. 1. Scaling ETL with Hadoop • Gwen Shapira, Solutions Architect • @gwenshap • gshapira@cloudera.com
  3. 3. ETL is… • Extracting data from outside sources • Transforming it to fit operational needs • Loading it into the end target • (Wikipedia: http://en.wikipedia.org/wiki/Extract,_transform,_load)
  4. 4. Hadoop Is… • HDFS – Replicated, distributed data storage • Map-Reduce – Batch-oriented data processing at scale. Parallel programming platform.
  5. 5. The Ecosystem • High level languages and abstractions • File, relational and streaming data integration • Process Orchestration and Scheduling • Libraries for data wrangling • Low latency query language
  6. 6. Why ETL with Hadoop?
  7. 7. Got unstructured data? • Traditional ETL: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, Protobuf, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java Classes • Mbox, RFC822 • AutoCAD • TrueType Parser • HDF / NetCDF
  8. 8. Replace ETL Clusters • Informatica is: • Easy to program • Complete ETL solution • Hadoop is: • Cheaper • MUCH more flexible • Faster? • More scalable? • You can have both
  9. 9. Data Warehouse Offloading • DWH resources are EXPENSIVE • Reduce storage costs • Release CPU capacity • Scale • Better tools
  10. 10. What I often see • ETL Cluster • ELT in DWH • ETL in Hadoop • ETL Cluster with some Hadoop
  12. 12. We’ll Discuss: • Technologies, Speed & Scale, Tips & Tricks • Extract, Transform, Load, Workflow
  13. 13. Extract
  14. 14. Let me count the ways • From Databases: Sqoop • Log Data: Flume + CDK • Or just write data to HDFS
  15. 15. Sqoop – The Balancing Act
  16. 16. Scale Sqoop Slowly • Balance between: • Maximizing network utilization • Minimizing database impact • Start with a smallish table (1-10 GB) • 1 mapper, 2 mappers, 4 mappers • Where’s the bottleneck? • vmstat, iostat, mpstat, netstat, iptraf
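
One way to act on "Where's the bottleneck?" is to record import throughput at each mapper count and stop doubling once the gain flattens. The following is a minimal sketch, not a method from the talk: the function name, the 1.5x gain threshold, and the sample measurements are all illustrative assumptions.

```python
def pick_mapper_count(throughput_mb_s, min_gain=1.5):
    """Return the largest mapper count that was still worth doubling to.

    throughput_mb_s maps mapper count -> measured MB/s (hypothetical data).
    min_gain is an assumed threshold: a step up in mappers must improve
    throughput by at least this factor to count as worthwhile.
    """
    counts = sorted(throughput_mb_s)
    best = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        if throughput_mb_s[cur] / throughput_mb_s[prev] < min_gain:
            break  # gain flattened: some resource is saturated
        best = cur
    return best


# Made-up measurements for a small test table
measured = {1: 20.0, 2: 38.0, 4: 70.0, 8: 75.0}
print(pick_mapper_count(measured))
```

When the gain flattens, tools such as vmstat, iostat, or netstat on the database host and the cluster nodes can show which resource is saturated.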
  17. 17. When Loading Files: Same principles apply: • Parallel Copy • Add Parallelism • Find Bottlenecks • Resolve them • Avoid Self-DDoS
  18. 18. Scaling Sqoop • Split column - match index or partitions • Compression • Drivers + Connectors + Direct mode • Incremental import
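
The options on this slide map onto Sqoop 1 command-line flags (`--split-by`, `--compress`, `--direct`, `--incremental`). Below is a hedged sketch that only assembles the argument list and does not run Sqoop; the JDBC URL, table, column names, and target directory are hypothetical placeholders.

```python
def sqoop_import_args(jdbc_url, table, target_dir, split_by, mappers,
                      check_column=None, last_value=None, direct=False):
    """Build an argument list for `sqoop import` (the command is not executed)."""
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir,
            "--split-by", split_by,        # ideally an indexed or partitioning column
            "--num-mappers", str(mappers),
            "--compress"]
    if direct:
        args.append("--direct")            # database-specific fast path, where supported
    if check_column is not None:
        # Incremental append: import only rows whose check column exceeds last_value
        args += ["--incremental", "append",
                 "--check-column", check_column,
                 "--last-value", str(last_value)]
    return args


# Hypothetical connection details, for illustration only
cmd = sqoop_import_args("jdbc:mysql://db.example.com/sales", "orders",
                        "/data/sales/orders", "order_id", 4,
                        check_column="order_id", last_value=1000)
print(" ".join(cmd))
```

Whether direct mode is available depends on the database and connector in use.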
  19. 19. Ingest Tips • Use file system tricks to ensure consistency • Directory structure: /intent/category/application (optional)/dataset/partitions/files • Examples: • /data/fraud/txs/2011-01-01/20110101-00.avro • /group/research/model-17/training-tx/part-00000.txt • /user/gshapira/scratch/surge/
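
The directory convention above can be captured in a small helper so that every ingest job builds paths the same way. This is a sketch under assumptions: the function name and parameter names are invented for illustration, and only the component order comes from the slide.

```python
import posixpath


def dataset_path(intent, category, dataset, partition, filename, application=None):
    """Build /intent/category/[application]/dataset/partition/filename."""
    parts = [intent, category]
    if application:                 # the application level is optional
        parts.append(application)
    parts += [dataset, partition, filename]
    return posixpath.join("/", *parts)


# Reproduces the first example path from the slide
print(dataset_path("data", "fraud", "txs", "2011-01-01", "20110101-00.avro"))
```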
  20. 20. Ingest Tips • External tables in Hive • Keep raw data • Trigger workflows on file arrival
  21. 21. Transform
  22. 22. Endless Possibilities • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java • Morphlines
  23. 23. Prototype
  24. 24. Partitioning • Hive • Directory Structure • Pre-filter • Adds metadata
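
Partitioning by directory acts as a pre-filter: a job reads only the directories that match its predicate and never opens the rest. A minimal sketch of that idea in plain Python, assuming date-named partition directories like the ones in the ingest examples; the function name and sample paths are illustrative.

```python
def prune_partitions(partition_dirs, start, end):
    """Keep only partition directories whose date falls in [start, end]."""
    selected = []
    for path in partition_dirs:
        date = path.rstrip("/").rsplit("/", 1)[-1]   # last path component
        if start <= date <= end:    # ISO dates compare correctly as strings
            selected.append(path)
    return selected


# Hypothetical partition listing
dirs = ["/data/fraud/txs/2011-01-01",
        "/data/fraud/txs/2011-01-02",
        "/data/fraud/txs/2011-01-03"]
print(prune_partitions(dirs, "2011-01-02", "2011-01-03"))
```

Hive applies the same principle when a query filters on a partition column.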
  25. 25. Tune Data Structures • Joins are expensive • Disk space is not • De-normalize • Store same data in multiple formats
  26. 26. Map-Reduce • Assembly language of data processing • Simple things are hard, hard things are possible • Use for: • Optimization: Do in one MR job what Hive does in 3 • Optimization: Partition the data just right • GeoSpatial • Mahout – Map/Reduce machine learning
  27. 27. Parallelism – Unit of Work • Amdahl’s Law • Small Units • That stay small • One user? • One day? • Ten square meters?
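
Amdahl's Law explains why the unit of work matters: whatever cannot be split caps the speedup, however many workers are added. A short sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers; the sample values are illustrative.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when only part of a job can run in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


# With 90% of the work parallelizable, speedup can never reach 10x
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

A single oversized unit of work (one huge user, one unusually busy day) behaves like the serial fraction, which is why the slide asks for units that stay small.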
  28. 28. Remember the Basics • X reduce output is 3X disk IO and 2X network IO • Fewer jobs = Fewer reduces = Less IO = Faster and more scalable • Know your network and disk throughput • Have a rough idea of ops-per-second
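
The 3X/2X rule of thumb is consistent with HDFS's default replication factor of 3: each output block is written to disk three times, and two of those copies travel over the network. That reading is an assumption, since the slide does not say where the numbers come from. A back-of-the-envelope sketch:

```python
def reduce_output_io(output_gb, replication=3):
    """Estimate disk and network IO caused by writing reduce output to HDFS.

    Assumes one replica is written locally and the remaining
    (replication - 1) replicas are sent to other nodes.
    """
    disk_gb = output_gb * replication
    network_gb = output_gb * (replication - 1)
    return disk_gb, network_gb


disk, network = reduce_output_io(100)
print(disk, network)  # 300 200
```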
  29. 29. Instrumentation • Optimize the right things • Right jobs • Right hardware • 90% of the time – it’s not the hardware
  30. 30. Tips • Avoid the “old data warehouse guy” syndrome
  31. 31. Tips • Slowly changing dimensions: • Load changes • Merge • And swap • Store intermediate results: • Performance • Debugging • Store Source/Derived relation
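
The load, merge, and swap pattern for slowly changing dimensions can be sketched with in-memory dictionaries standing in for tables. This is an illustration only: it overwrites changed rows (Type 1 behavior, an assumption, since the slide does not specify a type), and all names and sample rows are hypothetical.

```python
def load_merge_swap(tables, name, changes):
    """Merge changed rows into a new copy of a dimension, then swap it in.

    tables:  dict of table name -> {key: row}, standing in for a catalog
    changes: {key: row} holding only new or changed rows
    Returns the previous version so it can be kept for debugging or rollback.
    """
    previous = tables[name]
    merged = dict(previous)      # build the new version off to the side
    merged.update(changes)       # changed keys overwrite, new keys are added
    tables[name] = merged        # swap: readers see the old or the new version
    return previous


catalog = {"customers": {1: "Alice, NYC", 2: "Bob, LA"}}
old = load_merge_swap(catalog, "customers", {2: "Bob, SF", 3: "Carol, Austin"})
print(catalog["customers"])
print(old)
```

On a cluster, the swap step would typically be a directory rename or a change of table location rather than a dictionary assignment.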
  32. 32. Fault and Rebuild • Tier 0 – raw data • Tier 1 – cleaned data • Tier 2 – transformations, lookups and denormalization • Tier 3 – Aggregations
  33. 33. Never Metadata I didn’t like • Metadata is small data • That changes frequently • Solutions: • Oozie • Directory structure / Partitions • Outside of HDFS • HBase • Cloudera Navigator
  34. 34. Few words about Real Time ETL • What does it even mean? • Fast reporting? • No delay from OLTP to DWH? • Micro-batches make more sense: • Aggregation • Economy of scale • Late data happens • Near-line solutions
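
Micro-batching groups events into fixed time windows, and late data is handled by assigning each event to the window of its event time rather than its arrival time. A minimal sketch, assuming events are (epoch seconds, payload) pairs and a five-minute window; both choices are illustrative.

```python
from collections import defaultdict


def assign_to_batches(events, batch_seconds=300):
    """Group (event_time, payload) pairs by the start of their time window."""
    batches = defaultdict(list)
    for event_time, payload in events:
        window_start = event_time - (event_time % batch_seconds)
        batches[window_start].append(payload)
    return dict(batches)


# The last event arrives late but still belongs to the first window
events = [(0, "a"), (299, "b"), (300, "c"), (10, "late")]
print(assign_to_batches(events))
```

Because a late event lands in a window that may already have been aggregated, that window's aggregate has to be recomputed or corrected.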
  35. 35. Load
  36. 36. Technologies • Sqoop • Fuse-DFS • Oracle Connectors • NoSQLs
  37. 37. Scaling • Sqoop works better for some DBs • Fuse-DFS and DB tools give more control • Load in parallel • Directly to FS • Partitions • Do all formatting on Hadoop • Do you REALLY need to load that?
  38. 38. How not to Load • Most Hadoop customers don’t load data in bulk • History can stay in Hadoop • Load only aggregated data • Or computation results – recommendations, reports. • Most queries can run in Hadoop • BI tools often run in Hadoop
  39. 39. Workflow Management
  40. 40. Tools • Native Hadoop: Oozie, Azkaban • Kinda Open Source: Pentaho Kettle, Talend • Informatica
  41. 41. Scaling Challenges • Keeping track of: • Code Components • Metadata • Integrations and Adapters • Reports, results, artifacts • Scheduling and Orchestration • Cohesive System View • Life Cycle • Instrumentation, Measurement and Monitoring
  42. 42. My Toolbox • Hue + Oozie: • Scheduling + Orchestration • Cohesive system view • Process repository • Some metadata • Some instrumentation • Cloudera Manager for monitoring • … and way too many home grown scripts
  43. 43. Hue + Oozie
  44. 44. 45 — Josh Wills
  45. 45. A lot is still missing • Metadata solution • Instrumentation and monitoring solution • Lifecycle management • Lineage tracking • Contributions are welcome