Using Hadoop to Expand Data Warehousing

Mike Peterson
VP of Platforms and Data Architecture, Neustar

Ron Bodkin
CEO and Founder, Think Big Analytics
ron.bodkin@thinkbiganalytics.com

Copyright © Think Big Analytics and Neustar Inc.
June 13, 2012

Agenda

Overview
Technology
Process
Conclusion


Big Data Highlights at Neustar
[Timeline: 2010, 2011, 2012]
»  Hadoop cluster rollout at Quova
»  Hadoop cluster rollout with UltraDNS


The Business Case
100 TB of incremental data storage
3-year cost, millions of US $
[Bar chart. Platforms on the axis: Hadoop Big Data, Netezza, Teradata, Oracle. Bar values: $0.2, $6.3, $9.6, $9.6]
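
The chart's comparison can be restated per terabyte. The Python sketch below is an illustration added for that purpose, not part of the original deck. It assumes the $0.2M bar is the Hadoop option and treats the other three values ($6.3M, $9.6M, $9.6M) as unlabeled alternatives, because the text does not tie each value to a vendor. The function and constant names are made up for this example.

```python
# Illustrative arithmetic only; figures are the 3-year costs shown on the slide.
STORAGE_TB = 100  # incremental storage covered by the business case

# Assumption: the $0.2M bar is the Hadoop option. The remaining bars are the
# alternatives, deliberately left unlabeled.
HADOOP_COST_USD = 200_000
ALTERNATIVE_COSTS_USD = [6_300_000, 9_600_000, 9_600_000]


def cost_per_tb(total_usd, storage_tb=STORAGE_TB):
    """Return the 3-year cost per terabyte in US dollars."""
    return total_usd / storage_tb


def cost_ratio(alternative_usd, baseline_usd=HADOOP_COST_USD):
    """Return how many times more an alternative costs than the baseline."""
    return alternative_usd / baseline_usd


if __name__ == "__main__":
    print(f"Hadoop: ${cost_per_tb(HADOOP_COST_USD):,.0f} per TB over 3 years")
    for cost in ALTERNATIVE_COSTS_USD:
        print(f"Alternative: ${cost_per_tb(cost):,.0f} per TB, "
              f"{cost_ratio(cost):.1f}x the Hadoop figure")
```

Under these assumptions the Hadoop figure works out to $2,000 per TB over three years, against $63,000 to $96,000 per TB for the alternatives.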

Big Data Warehouse

Challenges
•  Cost to store unstructured data
•  Poor response time to changing BI needs
•  Data Warehouse access for departments

Goals
•  Integrate unstructured data with EDW
•  Predictive analytics based on data science
•  Access to cluster for all users


Data Agility
Classic Warehouse
»  ETL
»  Pre-parse all data
»  Normalize up front
»  Feed data marts
»  New ideas = IT projects
»  Aggregate/Summarize

Big Data Warehouse
»  Store raw data
»  Parse only when proven
»  Approximate parse on demand
»  Analysis on demand
»  Provide ideas before projects to optimize


Change to Technology Focus

»  New data platforms unlock innovation
»  Not just package implementation
»  More open source technology
»  Rethink assumptions
»  Increase technology skills
»  Focus data teams


Working Together

»  Expertise in delivery
»  Trusted partner
»  Collaborative development

»  Open source leader
»  Invested in client success
»  Price/performance


Technology


Architecture Overview

[Architecture diagram. Labeled components:]
»  RSync/SCP + Scripts
»  Capture Server
»  Samples & Aggregates
»  ETL to HDFS
»  Cronacle Scheduler, Cronacle Agent
»  Cluster: Master Server, Backup Master Server, Slaves
»  Name Node, Secondary Name Node, Job Tracker, Task Tracker, HDFS
»  Hive, Hive + UDFs
»  Postgres Server (incl. Hive Metastore), Postgres + HDFS
»  Ad-Hoc queries and BI
»  Management, Monitoring
»  LDAP


Initial Hadoop Cluster Current Configuration
»  40 servers
»  Hadoop and PostgreSQL

Data Nodes
»  2 x 12 cores
»  64 GB memory
»  24 x 3TB SATA drives
»  Mixed nodes – RAID 6 storage
»  HDFS-only nodes – JBOD
»  10Gbit Ethernet


System Scale
»  Query volume – light but ramping
»  10,000 Map Reduce processes/day
»  Ingesting over 40B rows a day
»  1.5TB with 7x compression
»  Storage utilization at 45%
»  Core utilization spikes when processing machine learning algorithms
»  100% capture of multiple large product data sets
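
A back-of-the-envelope reading of these figures is sketched below in Python. It is an added illustration, not from the deck, and rests on three assumptions: the 1.5 TB is the daily volume after compression (the slide does not say so explicitly), terabytes are decimal (10^12 bytes), and the row count is exactly 40B, although the slide says "over". The names are hypothetical.

```python
# Assumed daily figures taken from the System Scale slide.
ROWS_PER_DAY = 40_000_000_000                 # slide says "over 40B", so a lower bound
COMPRESSED_BYTES_PER_DAY = 1_500_000_000_000  # 1.5 TB, assumed to be post-compression
COMPRESSION_RATIO = 7                         # "7x compression"
BYTES_PER_TB = 1_000_000_000_000              # decimal terabyte


def compressed_bytes_per_row(total_bytes=COMPRESSED_BYTES_PER_DAY, rows=ROWS_PER_DAY):
    """Average stored size of one row, in bytes."""
    return total_bytes / rows


def raw_tb_per_day(total_bytes=COMPRESSED_BYTES_PER_DAY, ratio=COMPRESSION_RATIO):
    """Estimated uncompressed daily volume, in decimal terabytes."""
    return total_bytes * ratio / BYTES_PER_TB


if __name__ == "__main__":
    print(f"Compressed bytes per row: {compressed_bytes_per_row():.1f}")
    print(f"Estimated raw volume per day: {raw_tb_per_day():.1f} TB")
```

Because the row count is a lower bound, the per-row figure is an upper bound on the average compressed row size.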


Software Choices by Layer
[Layer diagram. Legend: Platform; Application; Data Processing & Integration; Analytics; Resource Management]

»  Data Science & Algorithms
»  Application Support
»  Business Intelligence & Reporting: Hive
»  Data Ingestion: Custom Script
»  Data Transformation & Aggregation: Hive
»  Data Publication: FTP scheduler
»  Management & Monitoring: Hortonworks Data Platform, Ganglia
»  Workflow Management: Cronacle
»  Cluster Security: LDAP, Hortonworks Data Platform
»  Metadata Management
»  Low Latency Data Access: GridSQL, not Hadoop ecosystem
»  Data Governance
»  Resource Management: Fair Scheduler
»  Platform Software: Hortonworks Data Platform, Oracle JDK 6, RHEL 6
»  Networking: 10 GigE
»  Infrastructure: HP Servers
»  Cluster Provisioning: On Premise Shared Hadoop and Grid SQL

Current configuration:
»  Redwood Software
»  Hortonworks Distribution
»  Move to Oracle JDK 6
»  Move to Red Hat


Massive Binary Format Data

Query
SELECT * FROM datafile
WHERE dt='2012-06-15';

1. Binary InputFormat: parse into records
2. Binary SerDe: parse into fields lazily
3. Bean Object Inspector: fields determined by Java beans methods

Input: large partitioned binary file - 100's of TBs
compressed binary record 1
compressed binary record 2
...

»  Parse on fly: don't duplicate or lose original
»  Reused open source parser with custom extensions
»  Optimized with
    »  profiling
    »  lazy parsing minimizes object creation
»  CPU bound due to parsing compact structure
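
The lazy-parsing idea on this slide can be illustrated with a small sketch. The deck describes a Hive InputFormat, SerDe and ObjectInspector (Java components); the Python below is not that implementation. It only shows the general technique of keeping the raw bytes and decoding a field the first time it is requested. The record layout, field names and the `LazyRecord` class are hypothetical.

```python
import struct

# Hypothetical fixed-width record layout: field name -> (byte offset, struct format).
# Big-endian: 4-byte timestamp, 2-byte status code, 4 raw bytes for an IPv4 address.
LAYOUT = {
    "timestamp": (0, ">I"),
    "status": (4, ">H"),
    "client_ip": (6, ">4s"),
}


class LazyRecord:
    """Wraps one binary record and decodes fields only on first access."""

    def __init__(self, raw):
        self._raw = raw          # original bytes are kept, never copied or altered
        self._cache = {}         # decoded values, filled in on demand
        self.decode_count = 0    # number of fields actually decoded so far

    def get(self, name):
        """Return a field value, decoding it from the raw bytes if needed."""
        if name not in self._cache:
            offset, fmt = LAYOUT[name]
            (value,) = struct.unpack_from(fmt, self._raw, offset)
            self._cache[name] = value
            self.decode_count += 1
        return self._cache[name]


if __name__ == "__main__":
    raw = struct.pack(">IH4s", 1339718400, 200, bytes([10, 0, 0, 1]))
    record = LazyRecord(raw)
    print(record.get("status"))   # only this field is decoded
    print(record.decode_count)
```

A query that touches one column then pays the decoding cost for that column only, which is the sense in which lazy parsing reduces object creation.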


Disk Failures and Recovery
»  5 drives in 9 months; 3 were DOA
»  Hadoop handled failure perfectly
»  Raid 6 PostgreSQL & GridSQL working fine


Storage Policy
»  Storage still isn’t free!
»  Newer data = 3 replicas
»  Older data = 2 replicas
»  Data retention = 1 year
»  Free space = 20% reserve
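
To see what this policy implies for capacity, here is a rough Python estimate. It is an added sketch under stated assumptions, not a figure from the deck: daily ingest of 1.5 TB (the compressed volume quoted on the System Scale slide), and a hypothetical 90-day window during which data counts as "newer", since the slide does not say when data moves from 3 replicas to 2. The function name and parameters are made up for illustration.

```python
def required_capacity_tb(daily_tb, retention_days=365, hot_days=90,
                         hot_replicas=3, cold_replicas=2, reserve_fraction=0.20):
    """Estimate the raw disk capacity needed to hold a full retention window.

    hot_days is a hypothetical cutoff: data younger than this keeps
    hot_replicas copies, older data keeps cold_replicas copies.
    reserve_fraction is the share of total capacity kept free.
    """
    if not 0 <= hot_days <= retention_days:
        raise ValueError("hot_days must be between 0 and retention_days")
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    cold_days = retention_days - hot_days
    stored_tb = daily_tb * (hot_days * hot_replicas + cold_days * cold_replicas)
    # Stored data may occupy at most (1 - reserve_fraction) of the capacity.
    return stored_tb / (1 - reserve_fraction)


if __name__ == "__main__":
    print(f"{required_capacity_tb(1.5):,.1f} TB of raw capacity")
```

With these inputs the estimate comes to roughly 1,540 TB. Changing `hot_days` shows how sensitive the total is to the point at which the third replica is dropped.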


Process


The Big Data Journey

Phase 1: Enterprise Cluster Dev & Deployment
Phase 2: Comprehensive Ingestion by Service
Phase 3: Develop Big Data Capabilities
»  New Applications
»  Data Science
»  Advanced Analytics
»  Cost Savings

[Diagram labels: Data Services Strategy; Ingestion; Cost Savings; Offerings; Monetization; Data Science]


Organizational Approach
Executive Support

Roles and Organization

Data Governance

Outreach

Training

Product Definition

Data Sharing

Data Science / Analytics

Data Acquisition External

Data Acquisition Internal

Platform Build

Organizational Investment

[Role diagram. Roles shown: Data Developers; Database Administrator; Data Architect]
»  Big Data Administrator: new workloads & tools
»  Big Data Engineer: distributed development
»  Data Architect: big data modeling, varying data structures
»  Data Scientist: math, programming & analysis


Training to Match Organization

Fundamentals Track

Tools Track

Applications Track

Guiding Principles
•  Making Hadoop Big Data “relevant” at the company and job
•  Cross-train math and engineering skill bases
•  Lab Team exposure to new and emerging technology training


Building Data Science Capability


Use of Capability

Use Case Technology Selection Criteria
»  Structure
»  Compute Scale
»  Data Volume
»  Latency
»  Analysis Type

[Decision-tree diagram. Decision nodes and branches:]
»  Data Type: Only Structured / Complex Structure
»  # Calculations: Under 10 PetaFLOP / At least 10 PetaFLOP
»  Data Size: Under 10 TB / 10-100 TB / 100 TB or more
»  Latency?: Under a minute / A minute or more
»  Analysis Type?: Tightly Integrated with existing data / Other (Simple, Parallel, Complex Structural)
»  Analysis Type?: Production, Tightly Integrated with existing data / Other
»  Outcomes shown: EDW; EDW Data; existing platform
»  Use-case labels: Basic Reporting; Data Ingestion; Batch Data Processing; Fast Analytics; Data Enrichment
platform platform

Data Science

Trends
•  Compute model scores faster
•  Analyze full data sets
•  Incorporate new data
•  Build new services from data

Conclusion


Getting Value from Big Data

»  Expand warehousing capability with Hadoop
»  Enable data science to create new value
»  Organizational change is a journey


Thank You

Mike Peterson
VP of Platforms and Data Architecture, Neustar

Ron Bodkin
CEO and Founder, Think Big Analytics
ron.bodkin@thinkbiganalytics.com