Data Warehousing Patterns for Hadoop

Slides from Michelle Ufford's Data Warehousing talk at Hadoop Summit 2015.

How can we take advantage of the veritable treasure trove of data stored in Hadoop to augment our traditional data warehouses? In this session, Michelle will share her experience with migrating GoDaddy’s data warehouse to Hadoop. She’ll explore how GoDaddy has adapted traditional data warehousing methodologies to work with Hadoop and will share example ETL patterns used by her team. Topics will also include how the integration of structured and unstructured data has exposed new insights, the resulting business impact, and tips for making your own Hadoop migration project more successful.

Recording available here: https://www.youtube.com/watch?v=0AxoB-wJcZc

  1. CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED.
     DATA WAREHOUSING USING HADOOP
     MICHELLE UFFORD, DATA PLATFORM @ GODADDY
     10 JUNE 2015
  2. DATA WAREHOUSING IN HADOOP IS EXTREMELY POWERFUL
     Moving our data warehouse to Hadoop enabled greater data integration & allowed us to support all phases of analytics.
     [Diagram: Business Value vs. Data Set. Data sets: Customer, Sales, Product Usage, Web Clickstream, Social Media. Data types: Structured, Semi-Structured, Unstructured. Analytics Ascendency Model: Descriptive, Diagnostic, Predictive, Prescriptive.]
     Today’s Agenda
     • Project motivations
     • Team principles
     • Design patterns
     • Batch processing
     • New insights
     • Tips & suggestions
  3. PROJECT MOTIVATIONS OR, “ARE YOU CRAZY?!”
  4. DATA PLATFORM @ GODADDY
     [Architecture diagram. Section labels: FEEDS, CORPORATE DATA, EXTERNAL DATA SOURCES, STREAM PROCESSING, SERVING PLATFORM, DATA VISUALIZATION, APPLICATIONS & SERVICES. Component labels: Sensors, Events, Logs; MySQL, SQL Server, Teradata; Public, Partnerships, Subscriptions, Google Analytics; Storm, Spark; ElasticSearch, Cassandra; Tableau, Excel; OpenStack, Personalization, Hosting, WSB, WebPro, Search, etc.; Fast-DB Sync, Secure Ingress, DataCollector, Kafka; Pigsty; Ad Hocs.]
  5. DATA WAREHOUSING @ GODADDY OR, “THE METHOD TO OUR MADNESS”
  6. TEAM PRINCIPLES: HOW WE GET WORK DONE
     MAKE IT EASY
     • data should be easy to
       • discover
       • understand
       • consume
       • maintain
     • favor simplicity in both design & process
     • automate!
     DELIVER VALUE
     • weekly Agile sprints
     • focus incessantly on business needs & value
     • be flexible
     • deliver quickly
     ENSURE QUALITY
     • ‘Data First’ design
     • data quality is critical to adoption
     • automated data quality tests
     • visibility into failures & warnings
     • self-healing design
  7. DATA WAREHOUSE – DESIGN DECISIONS TO SUPPORT ALL TYPES OF ANALYTICS
     • Basically, a variant of Kimball
     • Wide, denormalized facts
     • Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
     • Integrated, conformed dimensions
     • Maintain data at the lowest granularity (including transactional, when available)
     • Preserve source data in full fidelity
     • Minimize derived data (e.g. store birthdate rather than age)
     • Type 4 SCD (“history table”)
     • Natural keys (instead of surrogate keys)
  8. EXAMPLE DIMENSION TABLE – TRANSACTIONAL
     DIM_CUSTOMER_TX
     | ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
     |---|---|---|---|---|---|---|---|
     | 2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A |
     | 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
     | 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US |
     | 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
     | 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
     Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
  9. EXAMPLE DIMENSION TABLE – SNAPSHOT
     DIM_CUSTOMER
     | ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code |
     |---|---|---|---|---|---|---|
     | 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US |
     | 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA |
     | 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A |
  10. TRANSACTIONAL VS. SNAPSHOT
      TRANSACTIONAL TABLE
      • Immutable, append-only
      • Unique on [Transaction Timestamp + Natural Key]
      • Records have logical deletion indicators
      • Sourced from raw imported data
      • Populated by Pig script (data engineer)
      • New data is always appended
      • Minimizes complexity
      • Provides dynamic “point-in-time” query functionality
      • Typically used for PA, ML, & SOX
      • Optimized for ETL processes
      SNAPSHOT TABLE
      • Mutable “snapshot” that rolls up transactions
      • Unique on [Natural Key]
      • May either use logical deletion or exclude deleted records
      • Sourced from the processed, transactional table
      • Populated using an automated snapshotting process
      • Replaces the prior snapshot each time it executes
      • Automates complexity
      • Provides historical visibility via “archives”
      • Default data source for most queries & reports
      • Optimized for querying
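
The relationship between the two table types can be illustrated with a minimal Python sketch. It is not the talk's Pig implementation: the in-memory rows, the field names (`tx_ts`, `cust_id`, `deleted`, `country_code`) and the `build_snapshot` helper are all assumptions made for illustration, loosely mirroring the fictitious slide data. It rolls an append-only transactional table up to one row per natural key, and reuses the same logic for a point-in-time view.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for an append-only transactional table.
# Field names are illustrative, not the talk's actual schema.
TX_ROWS = [
    {"tx_ts": datetime(2015, 6, 1, 8, 55), "cust_id": 111, "deleted": False, "country_code": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 222, "deleted": False, "country_code": "CA"},
    {"tx_ts": datetime(2015, 6, 1, 9, 14), "cust_id": 111, "deleted": False, "country_code": "US"},
    {"tx_ts": datetime(2015, 6, 1, 9, 22), "cust_id": 333, "deleted": False, "country_code": "N/A"},
    {"tx_ts": datetime(2015, 6, 1, 9, 37), "cust_id": 111, "deleted": True, "country_code": "US"},
]


def build_snapshot(tx_rows, as_of=None):
    """Roll transactions up to one row per natural key.

    With as_of=None the result is the current snapshot. With a timestamp it
    is a point-in-time view, which works because transactional rows are
    never updated in place.
    """
    latest = {}
    for row in sorted(tx_rows, key=lambda r: r["tx_ts"]):
        if as_of is not None and row["tx_ts"] > as_of:
            continue  # ignore transactions after the requested point in time
        latest[row["cust_id"]] = row  # later transactions overwrite earlier ones
    # Logical deletion: deleted customers stay in the snapshot, flagged inactive.
    return {
        cust_id: {"active": not row["deleted"], "country_code": row["country_code"]}
        for cust_id, row in latest.items()
    }


current = build_snapshot(TX_ROWS)
at_0920 = build_snapshot(TX_ROWS, as_of=datetime(2015, 6, 1, 9, 20))
```

Here `current` marks customer 111 as inactive, while `at_0920` still shows that customer as active and does not yet contain customer 333. This sketch keeps deleted records with a flag; the slide notes that a snapshot may instead exclude them.
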
  11. EXAMPLE FACT TABLE – APPEND-ONLY
      FACT_USER_EVENT
      | ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON |
      |---|---|---|---|---|---|
      | 2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579} |
      | 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25} |
      | 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333} |
      | 2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579} |
      | 2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25} |
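
Because the fact table keeps the raw event as JSON next to a few typed columns, event-specific fields can be read at query time without a schema change. The following Python sketch shows the idea only; the row layout, the key names and the `extract_field` helper are hypothetical, and the values mirror the fictitious slide data.

```python
import json

# Hypothetical rows shaped like the fact table above (subset of columns).
FACT_ROWS = [
    {"event_type": "wsb.login", "cust_id": 111,
     "event_json": '{"event":"wsb.login","cust_id":111,"wsb_id":579}'},
    {"event_type": "call.inbound", "cust_id": 222,
     "event_json": '{"event":"call.inbound","cust_id":222,"rep_id":25}'},
    {"event_type": "account.create", "cust_id": 333,
     "event_json": '{"event":"account.create","cust_id":333}'},
]


def extract_field(rows, event_type, field):
    """Return (cust_id, value) pairs for one event type.

    The value is read from the JSON payload; it is None when the payload
    does not carry that field.
    """
    out = []
    for row in rows:
        if row["event_type"] != event_type:
            continue
        payload = json.loads(row["event_json"])
        out.append((row["cust_id"], payload.get(field)))
    return out


wsb_logins = extract_field(FACT_ROWS, "wsb.login", "wsb_id")
missing_rep = extract_field(FACT_ROWS, "account.create", "rep_id")
```

The typed columns (event type, customer ID) serve common filters, while the payload stays available in full fidelity for fields that only some event types carry.
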
  12. BATCH PROCESSING OR, “MAKING IT ALL HAPPEN”
  13. DRILL-DOWN: LOGICAL DATA LAYERS
      Logical construct only! Users & processes can consume from any layer.
      [Diagram. Layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team). Components: Stage VMs, Kafka, Fast-DB Sync, HDFS, Raw Event Data, External Data, Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view). Data characteristics: Append-Only Data, Transformed Data, Integrated Data, Pre-Aggregated Data. Processing: Pig, Pig -or- Hive.]
  14. BATCH PROCESSING DRILL-DOWN: INCREMENTAL PATTERN
      Pigsty
      • next(tx_date) = $date
      • foreach destination server prep()
      • execute(script.pig)
        1. filter transactional source (tx_date=$date)
        2. store transactional to HDFS (tx_date=$date)
        3. store aggregations to HDFS (tx_date=$date)
        4. store to destination server(s)
      • execute(data_quality_tests)
      • if (tests=pass) merge(destination)
      • replace(dim_customer_snapshot)
      HDFS paths by layer (each partitioned by tx_date, e.g. tx_date=20150601, tx_date=20150602)
      • Data Ingress Layer (raw data): data-ingress / ecomm / customer_tx
      • Enterprise Data Layer (data warehouse): data-mgmt / dim_customer_tx
      • Data Consumption Layer (user / team): data-rpt / new_customers
      Snapshots: customer (snapshot), dim_customer (snapshot)
      SERVING PLATFORM: MySQL, SQL Server, Cassandra
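
The incremental steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not Pigsty or the team's Pig script: dictionaries and lists stand in for HDFS and a destination server, and `run_incremental`, the record fields and the two quality tests are hypothetical. Only the partition path names are taken from the slide. The snapshot replacement step is left out for brevity.

```python
def run_incremental(tx_date, raw, warehouse, staging, destination, quality_tests):
    """Process one tx_date partition; return True if it was merged."""
    # 1. filter the transactional source down to the partition being processed
    batch = [r for r in raw if r["tx_date"] == tx_date]
    # 2. store the transactional partition; writing by key means a rerun
    #    replaces the partition instead of duplicating it
    warehouse[f"data-mgmt/dim_customer_tx/tx_date={tx_date}"] = batch
    # 3. store an aggregation for the same partition
    warehouse[f"data-rpt/new_customers/tx_date={tx_date}"] = sum(
        1 for r in batch if r["action"] == "insert"
    )
    # 4. stage the rows for the destination server without publishing them
    staging[tx_date] = batch
    # run data quality tests; merge into the destination only if all pass
    if not all(test(batch) for test in quality_tests):
        return False
    destination.extend(staging.pop(tx_date))
    return True


# Hypothetical input records and quality tests.
raw = [
    {"tx_date": "20150601", "cust_id": 333, "action": "insert"},
    {"tx_date": "20150601", "cust_id": 111, "action": "delete"},
    {"tx_date": "20150602", "cust_id": None, "action": "insert"},
]
tests = [
    lambda batch: len(batch) > 0,                               # partition is not empty
    lambda batch: all(r["cust_id"] is not None for r in batch),  # natural key is present
]
warehouse, staging, destination = {}, {}, []
ok_day1 = run_incremental("20150601", raw, warehouse, staging, destination, tests)
ok_day2 = run_incremental("20150602", raw, warehouse, staging, destination, tests)
```

The second partition fails the natural-key test, so its rows remain in `staging` and never reach `destination`. Gating the merge on the tests is what keeps bad data away from consumers while leaving it available for inspection.
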
  15. BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
      DIM_CUSTOMER_TX – TRANSACTIONAL
      | ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
      |---|---|---|---|---|---|---|---|
      | 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
      | 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
      | 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
  16. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
      DIM_CUSTOMER_TX – TRANSACTIONAL
      | ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
      |---|---|---|---|---|---|---|---|---|
      | 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA | |
      | 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A | |
      | 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US | |
      | 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female |
      | 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male |
      | 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female |
  17. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
      DIM_CUSTOMER – SNAPSHOT
      | ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
      |---|---|---|---|---|---|---|---|
      | 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female |
      | 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female |
      | 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male |
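
The pre- and post-deployment tables suggest that a new attribute is rolled out by appending "deploy" rows rather than rewriting history. A minimal Python sketch of that reading follows. The `backfill_column` helper, the field names and the `gender_lookup` source are assumptions for illustration; the slides do not say how the new Gender value is actually derived.

```python
from datetime import datetime


def backfill_column(tx_rows, column, derive, etl_ts):
    """Append one 'deploy' row per natural key that carries the new column.

    Existing rows are left untouched, so the table stays append-only. Each
    deploy row keeps the source timestamp of the latest transaction it copies.
    """
    latest = {}
    for row in sorted(tx_rows, key=lambda r: (r["source_ts"], r["etl_ts"])):
        latest[row["cust_id"]] = row
    deploy_rows = [
        {**row, "etl_ts": etl_ts, "action": "deploy", column: derive(row)}
        for row in latest.values()
    ]
    return tx_rows + deploy_rows


# Hypothetical pre-deployment rows, mirroring the fictitious slide data.
tx = [
    {"etl_ts": datetime(2015, 6, 1, 9, 15), "source_ts": datetime(2015, 6, 1, 9, 14),
     "action": "update", "deleted": False, "cust_id": 222},
    {"etl_ts": datetime(2015, 6, 1, 9, 30), "source_ts": datetime(2015, 6, 1, 9, 22),
     "action": "insert", "deleted": False, "cust_id": 333},
    {"etl_ts": datetime(2015, 6, 1, 9, 45), "source_ts": datetime(2015, 6, 1, 9, 37),
     "action": "delete", "deleted": True, "cust_id": 111},
]
# Assumed source for the new attribute (values copied from the slide).
gender_lookup = {111: "Female", 222: "Female", 333: "Male"}

after = backfill_column(
    tx, "gender", lambda row: gender_lookup[row["cust_id"]], datetime(2015, 6, 1, 10, 0)
)
```

After the call, `after` holds the three original rows unchanged plus three deploy rows with the new `gender` value, and the delete flag for customer 111 carries over. A snapshot built from the latest row per natural key then picks up the new column, as in the post-deployment snapshot table.
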
  18. REAL BUSINESS RESULTS OR, “HADOOP: NOT JUST HYPE!”
  19. HADOOP ENABLES GREATER & QUICKER DW VALUE
      • Better use of data engineers
        • data ingress is largely automated
        • reduces (not eliminates) the traditional 75-80% of project time spent on ETL
      • Well-suited for Agile
        • full source data is preserved in full fidelity
        • minimizes permanence of design decisions
        • roll out changes weekly
      • Data integration
        • access to the other 79.7% of the company’s data
        • flexible data models using complex data types
      • Single source of data processing
      • Process once => export 0:N destinations; supports all data consumers
      • Frees up expensive database resources
      • Single enterprise solution for data quality, monitoring, etc.
  20. PROCESSED DATA HAS GREATER REACH
      • Descriptive: What has happened / is happening?
      • Diagnostic: Why did it happen?
      • Predictive: What will likely happen?
      • Prescriptive: How can we make it happen?
      The same attributes used for reporting can be inputs into PA models & ML algorithms!
      [Diagram labels for data usage: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional.]
  21. UNLOCKING NEW, ACTIONABLE INSIGHTS
      [Diagram. Axes and labels: Customer Experience, Business Value, Complex Data, Complex Analysis. Insights: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis. Data combinations: Enterprise data + Clickstream data; Enterprise data + Product data + Event data; Enterprise data + External data = New enterprise data; Enterprise data + Call transcripts.]
  22. TIPS & LESSONS LEARNED OR, “HOW TO BE AWESOME”
  23. SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
      • Standardize on technology: Pig for ETL, Hive for analysis
      • Focus on simplicity: if your data isn’t easy to use, you’ve failed
      • Embrace flexibility: don’t shy away from complex data types
      • Be predictable: use HCatalog & a consistent naming standard
      • Don’t be afraid of change: use data abstraction to minimize impact to consumers
      • Do quick prototyping: use external tables & data extractions via Hive ODBC
      • Democratize data: amazing insights can come from anywhere; embrace new data consumers
  24. QUESTIONS? THANK YOU FOR ATTENDING!
