Data Warehousing Patterns for Hadoop
1. CONFIDENTIAL. COPYRIGHT © 2015 GODADDY INC. ALL RIGHTS RESERVED.
DATA WAREHOUSING USING HADOOP
MICHELLE UFFORD
DATA PLATFORM @ GODADDY
10 JUNE 2015
2. DATA WAREHOUSING IN HADOOP IS EXTREMELY POWERFUL
Moving our data warehouse to Hadoop enabled greater data integration & allowed us to support all phases of analytics.
[Figure labels: Business Value; Data Set (Customer Sales, Product Usage, Web Clickstream, Social Media); Structured, Semi-Structured, Unstructured; Analytics Ascendency Model (Descriptive, Diagnostic, Predictive, Prescriptive)]
3. Today’s Agenda
• Project motivations
• Team principles
• Design patterns
• Batch processing
• New insights
• Tips & suggestions
4. DATA PLATFORM @ GODADDY
[Architecture diagram; labels as recovered from the slide, grouped by the heading each appears next to]
FEEDS: Sensors, Events, Logs
CORPORATE DATA: MySQL, SQL Server
EXTERNAL DATA SOURCES: Public, Partnerships, Subscriptions, Google Analytics
STREAM PROCESSING: Storm, Spark
SERVING PLATFORM: ElasticSearch, Cassandra, MySQL, SQL Server, Teradata
DATA VISUALIZATION: Tableau, Excel
APPLICATIONS & SERVICES: Personalization, Hosting, WSB, WebPro, Search, etc.
Other components: Kafka, Fast-DB Sync, Secure Ingress, DataCollector, Pigsty, Ad Hocs, OpenStack
5. DATA WAREHOUSING @ GODADDY
OR, “THE METHOD TO OUR MADNESS”
6. TEAM PRINCIPLES
HOW WE GET WORK DONE
MAKE IT EASY
• data should be easy to
  • discover
  • understand
  • consume
  • maintain
• favor simplicity in both design & process
• automate!
DELIVER VALUE
• weekly Agile sprints
• focus incessantly on business needs & value
• be flexible
• deliver quickly
• ‘Data First’ design
ENSURE QUALITY
• data quality is critical to adoption
• automated data quality tests
• visibility into failures & warnings
• self-healing design
7. DATA WAREHOUSE – DESIGN DECISIONS
TO SUPPORT ALL TYPES OF ANALYTICS
• Basically, a variant of Kimball
• Wide, denormalized facts
• Minimize “reference tables” and “flat dimensions” to reduce the need for expensive joins
• Integrated, conformed dimensions
• Maintain data at the lowest granularity (including transactional, when available)
• Preserve source data in full fidelity
• Minimize derived data (e.g., store birthdate rather than age)
• Type 4 SCD (“history table”)
• Natural keys (instead of surrogate keys)
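A minimal sketch of the "minimize derived data" decision: the warehouse stores the birthdate, and age is derived only when a query needs it. The helper name `age_on` and the sample dates are hypothetical and do not come from the deck.

```python
from datetime import date


def age_on(birthdate: date, as_of: date) -> int:
    """Derive age in whole years from a stored birthdate (hypothetical helper)."""
    # Subtract one year if the birthday has not yet occurred in the as_of year.
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)


birthdate = date(1980, 6, 15)  # illustrative value only
print(age_on(birthdate, date(2015, 6, 10)))
print(age_on(birthdate, date(2016, 6, 15)))
```

A stored age would go stale the day after it was written; the stored birthdate stays correct and can answer the question for any date.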
8. EXAMPLE DIMENSION TABLE – TRANSACTIONAL
DIM_CUSTOMER_TX
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
| 2015-06-01 9:00 AM | {ecomm, 2015-06-01 8:55 AM, update} | False | 111 | | 2007-07-02 | | N/A |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 111 | | 2007-07-02 | USA | US |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
Obligatory Disclaimer: this is fictitious data used for demonstration purposes only
9. EXAMPLE DIMENSION TABLE – SNAPSHOT
DIM_CUSTOMER
| ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code |
| 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US |
| 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA |
| 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A |
10. TRANSACTIONAL VS. SNAPSHOT
TRANSACTIONAL TABLE
• Immutable, append-only
• Unique on [Transaction Timestamp + Natural Key]
• Records have logical deletion indicators
• Sourced from raw imported data
• Populated by Pig script (data engineer)
• New data is always appended
• Minimizes complexity
• Provides dynamic “point-in-time” query functionality
• Typically used for predictive analytics (PA), machine learning (ML), & SOX
• Optimized for ETL processes
SNAPSHOT TABLE
• Mutable “snapshot” that rolls up transactions
• Unique on [Natural Key]
• May either use logical deletion or exclude deleted records
• Sourced from the processed, transactional table
• Populated using an automated snapshotting process
• Replaces the prior snapshot each time it executes
• Automates complexity
• Provides historical visibility via “archives”
• Default data source for most queries & reports
• Optimized for querying
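The relationship between the two table types can be sketched in a few lines: a snapshot is the latest transactional row per natural key, and a point-in-time query is the same roll-up with a cutoff. This is only an illustration, not the deck's Pig-based snapshotting process; the field names (`source_ts`, `deleted`, `cust_id`, `country_code`) and the `snapshot` function are assumptions, and the rows mirror the fictitious DIM_CUSTOMER_TX example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %I:%M %p"

# Rows mirror the fictitious DIM_CUSTOMER_TX example; field names are assumptions.
TX_ROWS = [
    {"source_ts": "2015-06-01 8:55 AM", "deleted": False, "cust_id": 111, "country_code": "N/A"},
    {"source_ts": "2015-06-01 9:14 AM", "deleted": False, "cust_id": 222, "country_code": "CA"},
    {"source_ts": "2015-06-01 9:14 AM", "deleted": False, "cust_id": 111, "country_code": "US"},
    {"source_ts": "2015-06-01 9:22 AM", "deleted": False, "cust_id": 333, "country_code": "N/A"},
    {"source_ts": "2015-06-01 9:37 AM", "deleted": True, "cust_id": 111, "country_code": "US"},
]


def snapshot(rows, as_of=None):
    """Roll append-only rows up to one record per natural key.

    If as_of is given, transactions after that time are ignored, which
    yields a point-in-time view from the same transactional data.
    """
    cutoff = datetime.strptime(as_of, FMT) if as_of else None
    latest = {}
    for row in rows:
        ts = datetime.strptime(row["source_ts"], FMT)
        if cutoff is not None and ts > cutoff:
            continue  # transaction happened after the requested point in time
        current = latest.get(row["cust_id"])
        if current is None or ts >= current[0]:
            latest[row["cust_id"]] = (ts, row)
    # Logical deletion: keep the record and expose an active flag instead.
    return {
        cust_id: {"active": not row["deleted"], "country_code": row["country_code"]}
        for cust_id, (_ts, row) in sorted(latest.items())
    }


print(snapshot(TX_ROWS))
print(snapshot(TX_ROWS, as_of="2015-06-01 9:20 AM"))
```

With the sample rows, customer 111 comes out inactive in the full snapshot, as in the DIM_CUSTOMER example, but is still active as of 9:20 AM, and customer 333 does not yet exist at that time.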
11. EXAMPLE FACT TABLE – APPEND-ONLY
FACT_USER_EVENT
| ETL Timestamp | Event Timestamp | Event Type | Customer ID | Location | Event JSON |
| 2015-06-01 9:00 AM | 2015-06-01 8:55 AM | wsb.login | 111 | UK | {"event":"wsb.login","cust_id":111,"wsb_id":579} |
| 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | call.inbound | 222 | IN | {"event":"call.inbound","cust_id":222,"rep_id":25} |
| 2015-06-01 9:15 AM | 2015-06-01 9:14 AM | account.create | 333 | US | {"event":"account.create","cust_id":333} |
| 2015-06-01 9:30 AM | 2015-06-01 9:22 AM | wsb.config | 111 | UK | {"event":"wsb.config","cust_id":111,"wsb_id":579} |
| 2015-06-01 9:45 AM | 2015-06-01 9:37 AM | o365.provision | 222 | IN | {"event":"o365.provision","cust_id":222,"rep_id":25} |
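As an illustration of why the fact table keeps the raw event JSON next to the typed columns, the sketch below pulls an event-specific attribute out of the payload without any schema change. The tuple layout and the `events_with` helper are assumptions made for this example; the payloads mirror the fictitious FACT_USER_EVENT rows.

```python
import json

# (event_type, customer_id, event_json) tuples mirror the fictitious FACT_USER_EVENT rows.
EVENTS = [
    ("wsb.login", 111, '{"event":"wsb.login","cust_id":111,"wsb_id":579}'),
    ("call.inbound", 222, '{"event":"call.inbound","cust_id":222,"rep_id":25}'),
    ("account.create", 333, '{"event":"account.create","cust_id":333}'),
    ("wsb.config", 111, '{"event":"wsb.config","cust_id":111,"wsb_id":579}'),
    ("o365.provision", 222, '{"event":"o365.provision","cust_id":222,"rep_id":25}'),
]


def events_with(attribute, events=EVENTS):
    """Return (event_type, value) pairs for events whose payload has the attribute."""
    found = []
    for event_type, _cust_id, payload in events:
        doc = json.loads(payload)
        if attribute in doc:  # attributes vary by event type
            found.append((event_type, doc[attribute]))
    return found


print(events_with("rep_id"))
print(events_with("wsb_id"))
```

Only the columns shared by every event (timestamps, type, customer, location) are modeled explicitly; everything event-specific stays in the payload and is read on demand.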
13. HDFS DRILL-DOWN
LOGICAL DATA LAYERS
[Diagram; labels as recovered from the slide]
Layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team)
Sources: Stage VMs, Kafka, Fast-DB Sync, Raw Event Data, External Data
Table types: Transactional Table (tx), Snapshot Table (snap), Hourly Snapshot (view)
Data characteristics: Append-Only Data, Integrated Data, Pre-Aggregated Data, Transformed Data
Processing: Pig; Pig -or- Hive
Logical construct only! Users & processes can consume from any layer.
14. BATCH PROCESSING DRILL-DOWN
INCREMENTAL PATTERN
Pigsty
• next(tx_date) = $date
• foreach destination server: prep()
• execute(script.pig)
  1. filter transactional source (tx_date=$date)
  2. store transactional to HDFS (tx_date=$date)
  3. store aggregations to HDFS (tx_date=$date)
  4. store to destination server(s)
• execute(data_quality_tests)
• if (tests=pass): merge(destination)
• replace(dim_customer_snapshot)
[Diagram; labels as recovered from the slide]
HDFS layers: Data Ingress Layer (raw data), Enterprise Data Layer (data warehouse), Data Consumption Layer (user / team)
HDFS paths:
data-ingress / ecomm / customer_tx / tx_date=20150601
data-ingress / ecomm / customer_tx / tx_date=20150602
data-mgmt / dim_customer_tx / tx_date=20150601
data-mgmt / dim_customer_tx / tx_date=20150602
data-rpt / new_customers / tx_date=20150601
data-rpt / new_customers / tx_date=20150602
Snapshots: customer (snapshot), dim_customer (snapshot)
SERVING PLATFORM: MySQL, SQL Server, Cassandra
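The incremental pattern above can be sketched as one function that processes a single tx_date partition. This is a simplified, in-memory stand-in, not Pigsty itself: the dictionaries `hdfs` and `destination` stand in for HDFS and a serving database, the "new customers" aggregation (a count of insert actions) is an assumed definition, and customer 444 is made up so that a second partition exists.

```python
def run_incremental(source_rows, hdfs, destination, tx_date, quality_tests):
    """Process one tx_date partition; merge to the destination only if tests pass."""
    # 1. Filter the transactional source down to the partition being processed.
    batch = [row for row in source_rows if row["tx_date"] == tx_date]
    # 2. Store the transactional rows under a partitioned path.
    hdfs[f"data-mgmt/dim_customer_tx/tx_date={tx_date}"] = batch
    # 3. Store an aggregation for the same partition (assumed definition).
    new_customers = sum(1 for row in batch if row["action"] == "insert")
    hdfs[f"data-rpt/new_customers/tx_date={tx_date}"] = new_customers
    # 4. Stage the rows for the destination server; nothing is visible yet.
    staged = list(batch)
    # Run the data quality tests, then merge only on success.
    if all(test(batch) for test in quality_tests):
        destination[tx_date] = staged
        return True
    return False


def not_empty(batch):
    return len(batch) > 0


def has_natural_key(batch):
    return all(row.get("cust_id") is not None for row in batch)


SOURCE = [
    {"tx_date": "20150601", "cust_id": 222, "action": "update"},
    {"tx_date": "20150601", "cust_id": 333, "action": "insert"},
    {"tx_date": "20150601", "cust_id": 111, "action": "delete"},
    {"tx_date": "20150602", "cust_id": 444, "action": "insert"},  # hypothetical row
]

hdfs, destination = {}, {}
ok = run_incremental(SOURCE, hdfs, destination, "20150601", [not_empty, has_natural_key])
print(ok, sorted(hdfs), sorted(destination))
```

Because every write is keyed by tx_date, re-running a date overwrites that partition instead of duplicating it, which is what makes a failed run safe to retry.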
15. BATCH PROCESSING EXAMPLE – PRE-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US |
16. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER_TX – TRANSACTIONAL
| ETL Timestamp | Source Transaction {Source, Timestamp, Action} | Delete Flag | Cust ID (Natural Key) | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
| 2015-06-01 9:15 AM | {ecomm, 2015-06-01 9:14 AM, update} | False | 222 | Female | 2010-06-06 | CA | CA | |
| 2015-06-01 9:30 AM | {ecomm, 2015-06-01 9:22 AM, insert} | False | 333 | Male | 2015-06-01 | blah | N/A | |
| 2015-06-01 9:45 AM | {ecomm, 2015-06-01 9:37 AM, delete} | True | 111 | | 2007-07-02 | USA | US | |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:14 AM, deploy} | False | 222 | Female | 2010-06-06 | CA | CA | Female |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:22 AM, deploy} | False | 333 | Male | 2015-06-01 | blah | N/A | Male |
| 2015-06-01 10:00 AM | {ecomm, 2015-06-01 9:37 AM, deploy} | True | 111 | | 2007-07-02 | USA | US | Female |
17. BATCH PROCESSING EXAMPLE – POST-DEPLOYMENT
DIM_CUSTOMER – SNAPSHOT
| ETL Timestamp | Customer ID (Natural Key) | Active Customer Flag | Original Gender | First Order Date | Original Country Code | Country Code | Gender |
| 2015-06-01 10:00 AM | 111 | False | | 2007-07-02 | USA | US | Female |
| 2015-06-01 10:00 AM | 222 | True | Female | 2010-06-06 | CA | CA | Female |
| 2015-06-01 10:00 AM | 333 | True | Male | 2015-06-01 | blah | N/A | Male |
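The deployment example shows a new column reaching an append-only table without any in-place update: each existing row is re-appended with a new ETL timestamp, the action "deploy", and the new attribute. The sketch below imitates that under stated assumptions: the field names and the `redeploy` function are invented for illustration, and the `GENDER_LOOKUP` dictionary is a hypothetical stand-in for whatever logic produced the Gender values on the slide.

```python
# Rows mirror the fictitious pre-deployment DIM_CUSTOMER_TX table; field names are assumptions.
PRE_DEPLOY = [
    {"etl_ts": "2015-06-01 9:15 AM", "source_ts": "2015-06-01 9:14 AM", "action": "update", "cust_id": 222},
    {"etl_ts": "2015-06-01 9:30 AM", "source_ts": "2015-06-01 9:22 AM", "action": "insert", "cust_id": 333},
    {"etl_ts": "2015-06-01 9:45 AM", "source_ts": "2015-06-01 9:37 AM", "action": "delete", "cust_id": 111},
]

# Hypothetical source for the new attribute; values match the slide's example.
GENDER_LOOKUP = {111: "Female", 222: "Female", 333: "Male"}


def redeploy(tx_rows, deploy_ts, new_column, derive):
    """Append a re-processed copy of every row; existing rows are never modified."""
    appended = []
    for row in tx_rows:
        new_row = dict(row)            # copy, so the original row stays untouched
        new_row["etl_ts"] = deploy_ts  # new ETL timestamp, same source timestamp
        new_row["action"] = "deploy"
        new_row[new_column] = derive(row)
        appended.append(new_row)
    return tx_rows + appended


post_deploy = redeploy(
    PRE_DEPLOY, "2015-06-01 10:00 AM", "gender",
    lambda row: GENDER_LOOKUP[row["cust_id"]],
)
for row in post_deploy:
    print(row["etl_ts"], row["action"], row["cust_id"], row.get("gender", ""))
```

Since the original rows survive unchanged, a snapshot rebuilt after the deployment picks up the new column, while point-in-time queries before 10:00 AM still see the data as it was.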
19. HADOOP ENABLES GREATER & QUICKER DW VALUE
• Better use of data engineers
  • data ingress is largely automated
  • reduces (but does not eliminate) the traditional 75-80% of project time spent on ETL
• Well-suited for Agile
  • source data is preserved in full fidelity
  • minimizes permanence of design decisions
  • roll out changes weekly
• Data integration
  • access to the other 79.7% of the company’s data
  • flexible data models using complex data types
• Single source of data processing
  • process once, export to 0:N destinations; supports all data consumers
  • frees up expensive database resources
  • single enterprise solution for data quality, monitoring, etc.
20. PROCESSED DATA HAS GREATER REACH
• Descriptive: What has happened or is happening?
• Diagnostic: Why did it happen?
• Predictive: What will likely happen?
• Prescriptive: How can we make it happen?
The same attributes used for reporting can be inputs into PA models & ML algorithms!
[Diagram labels: primarily uses snapshots; primarily uses transactional; uses both snapshots & transactional]
21. UNLOCKING NEW, ACTIONABLE INSIGHTS
[Figure labels: Customer Experience, Business Value, Complex Data, Complex Analysis]
Analyses: Churn Analysis, Customer Dashboard, New Attributes, Sentiment Analysis
Data combinations:
• Enterprise data + Clickstream data
• Enterprise data + Product data + Event data
• Enterprise data + External data = New enterprise data
• Enterprise data + Call transcripts
23. SUGGESTIONS TO IMPROVE YOUR HADOOP DW PROJECT
• Standardize on technology: Pig for ETL, Hive for analysis
• Focus on simplicity: if your data isn’t easy to use, you’ve failed
• Embrace flexibility: don’t shy away from complex data types
• Be predictable: use HCatalog & consistent naming standards
• Don’t be afraid of change: use data abstraction to minimize impact to consumers
• Do quick prototyping: use external tables & data extractions via Hive ODBC
• Democratize data: amazing insights can come from anywhere; embrace new data consumers