Hortonworks Big Data & Hadoop

© Hortonworks Inc. 2013
Big Data, Data Science & Hadoop
Ofer Mendelevitch
San Francisco Bay Area Microsoft Business Intelligence User Group
May 2013

Who am I?
Director of Data Sciences @ Hortonworks
• Data science with Hadoop
• Professional services
Previously…
A Chess Dad

Gartner’s 3 V’s of big data:
• Volume
– Size of the data
• Velocity
– Ingest speed
– Response latency
• Variety
– Diverse sources
– Format, structure
– Data quality

What Makes Up Big Data?
Increasing data variety and complexity, by tier:
• ERP (megabytes): purchase detail, purchase record, payment record
• CRM (gigabytes): offer details, support contacts, customer touches, segmentation
• WEB (terabytes): web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
• BIG DATA (petabytes): user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
Transactions + Interactions + Observations = BIG DATA

Where is all this data coming from?
• Sensors/devices
• Online: social, forums, etc.
• Event logs
• And so on…
But also:
• Data that was previously “thrown away”

Ok… so what is big data?
I like a quote from Michael Franklin (UCB):
“Big Data is any data that is expensive to
manage and hard to extract value from”
It’s a relative term.
Today’s big data may be tomorrow’s small data.

What is a data product?
“A software system whose core
functionality depends on the
application of statistical analysis
and machine learning to data.”

Example 1: Google AdWords

Example 2: People you may know

Example 3: Spell correction

What is data science?
#1: Extracting deep meaning from data
(data mining; finding “gems” in data)

What is data science?
#2: Building data products
(Delivering gems on a regular basis)
Pre-process Build model SQL
Periodic batch processing
Online serving

Common data science tasks
Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference

A brief history of Apache Hadoop
Timeline: 2004 to 2013
• 2005: Focus on INNOVATION. Yahoo! creates team under E14 to work on Hadoop
• 2008: Focus on OPERATIONS. Yahoo team extends focus to operations to support multiple projects & growing clusters
• 2011: STABILITY. Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
Milestones marked on the timeline: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform

HDP: Enterprise-Ready Hadoop
Hortonworks Data Platform (HDP)
• Hadoop core (distributed storage & processing): HDFS, MapReduce
• Platform services (enterprise readiness): HA, DR, snapshots, security, …
• Data services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational services (manage & operate at scale): Oozie, Ambari
• Deployment options: appliance, cloud, OS / VM

Core Hadoop: HDFS & MapReduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• MapReduce: distributed computation framework that
handles the complexities of distributed programming
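To make the division of labor concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word count as the example. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and on a real cluster the framework would run mappers and reducers on many nodes and perform the shuffle itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group values by key (done by the framework on a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: combine all counts seen for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

The programmer only writes the mapper and the reducer; distribution, grouping and retry on failure are what the framework is meant to take care of.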

Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of
“commodity” hardware nodes
– Self-healing; failure handled by software
• Designed for write-once, read-many access
– There are no random writes
– Optimized for minimum seek on hard drives

Inside HDP for Windows
Hortonworks Data Platform (HDP) for Windows
• 100% open source Enterprise Hadoop
• Component and version compatible with Microsoft HDInsight
• Availability
• Beta release available now
• GA early 2Q 2013
Components shown in the diagram:
• Hadoop core (distributed storage & processing): HDFS, WebHDFS, MapReduce
• Platform services
• Data services (store, process and access data): HCatalog, Hive, Pig, Sqoop
• Operational services (manage & operate at scale): Oozie

Seamless Interoperability with Your Microsoft Tools
• Integrated with Microsoft tools
for native big data analysis
– Bi-directional connectors for SQL
Server and SQL Azure through SQOOP
– Excel ODBC integration through Hive
• Addressing demand for Hadoop
on Windows
– Ideal for Windows customers with
Hadoop operational experience
• Enables all common Hadoop
workloads
– Data refinement and ETL offload for
high-volume data landing
– Data exploration for discovery of new
business opportunities
Diagram layers:
• Applications: Microsoft applications
• Data systems: Hortonworks Data Platform for Windows
• Data sources: traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media); mobile data; OLTP, POS systems

Data Science, now with more data…

Benefits of Hadoop for data science
Benefit #1: Explore full datasets

Explore large datasets directly with Hadoop
Analysis loop: Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate
• Full dataset stored on Hadoop
• Researcher laptop: R, Matlab, SAS, etc.

Integrate Hadoop in your data analysis flow
• Full dataset resides in Hadoop
• Typical Hadoop tasks:
– Simple statistics: mean, median, correlation
– Text pre-processing: grep, regex, NLP
– Dimensionality reduction: PCA, SVD, clustering, etc.
– Random sampling: with or without replacement, by unique
– K-fold cross-validation
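One way the k-fold step might be done when the full dataset lives on a cluster is to assign each record to a fold by hashing its key, so every node reaches the same decision without coordination. The sketch below is plain Python under that assumption; `fold_of` and `split_for_fold` are hypothetical names, and MD5 is used only as a convenient stable hash, not because the talk prescribes it.

```python
import hashlib
from typing import Iterable, List, Tuple

Record = Tuple[str, object]  # (record key, payload); layout is an assumption


def fold_of(record_key: str, k: int) -> int:
    """Map a record key to a fold number in [0, k) using a stable hash."""
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % k


def split_for_fold(
    records: Iterable[Record], k: int, test_fold: int
) -> Tuple[List[Record], List[Record]]:
    """Return (train, test) where test holds the records hashed to test_fold."""
    train: List[Record] = []
    test: List[Record] = []
    for key, value in records:
        if fold_of(key, k) == test_fold:
            test.append((key, value))
        else:
            train.append((key, value))
    return train, test


if __name__ == "__main__":
    data = [("user%d" % i, i) for i in range(20)]
    for fold in range(4):
        train, test = split_for_fold(data, 4, fold)
        print(fold, len(train), len(test))
```

Because the assignment depends only on the key, the same function can run inside any mapper, and fold sizes are roughly but not exactly equal.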

Benefit #2: Mine larger datasets

More data -> better outcomes
Banko & Brill, 2001
Halevy, Norvig & Pereira, 2009

Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
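As one illustration of a distributed/parallel algorithm, the sketch below fits a one-variable linear regression with data-parallel gradient descent: each partition computes a partial gradient (the "map" step) and the partial results are summed (the "reduce" step). This is a single-process simulation in plain Python, not a Hadoop job; the choice of algorithm and the names `partial_gradient` and `train` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y)


def partial_gradient(
    partition: Sequence[Point], w: float, b: float
) -> Tuple[float, float, int]:
    """'Map' step: squared-error gradient sums over one partition of the data."""
    grad_w = 0.0
    grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def train(
    partitions: List[Sequence[Point]], steps: int = 1000, lr: float = 0.1
) -> Tuple[float, float]:
    """Gradient descent where each step combines per-partition gradients."""
    w = 0.0
    b = 0.0
    for _ in range(steps):
        # On a cluster, each partition would be processed on a different node.
        partials = [partial_gradient(p, w, b) for p in partitions]
        # 'Reduce' step: sum the partial gradients and the record counts.
        grad_w = sum(p[0] for p in partials)
        grad_b = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1
    print(train([data[:2], data[2:]]))
```

Only the small gradient sums travel between nodes, never the raw data, which is what makes this pattern workable when the dataset does not fit in one machine's memory.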

Benefit #3: Large-scale data preparation

80% of data science work is data preparation
From raw data to processed data, with steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Term normalization
• Sampling, filtering
• Joins

Hadoop is ideal for batch data preparation and
cleanup of large datasets
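A few of the preparation steps named above (stripping markup, term normalization, document vector generation) can be sketched for a single document as follows. This is a deliberately naive, standard-library-only illustration: the regex-based tag stripping and the helper names `strip_html`, `normalize_terms` and `document_vector` are assumptions for the example, and a production pipeline would use a real HTML parser and run the same function over many documents in parallel.

```python
import re
from collections import Counter
from typing import List

TAG_RE = re.compile(r"<[^>]+>")       # naive match for HTML tags
TOKEN_RE = re.compile(r"[a-z0-9]+")   # runs of lowercase letters and digits


def strip_html(raw: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(" ", raw)


def normalize_terms(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def document_vector(raw: str) -> Counter:
    """Build a term-frequency vector for one raw document."""
    return Counter(normalize_terms(strip_html(raw)))


if __name__ == "__main__":
    print(document_vector("<p>Big <b>Data</b>, big data!</p>"))
```

Each document is processed independently of the others, which is why this kind of cleanup fits a batch, map-style job well.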

Benefit #4: Accelerate data-driven innovation

Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
Timeline (schema change project):
• Start: “I need new data”
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”

“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
Timeline:
• Start: “I need new data” / “Let’s just put it in a folder on HDFS”
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
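The idea behind "schema on read" can be illustrated with a small sketch: raw records are stored as-is, and a schema is applied only when the data is read. The example below assumes the raw data is JSON lines and that a schema is a simple mapping from field name to a conversion function; both choices, and the name `read_with_schema`, are assumptions made for illustration and not how Hive or HCatalog are implemented.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, Optional


def read_with_schema(
    raw_lines: Iterable[str], schema: Dict[str, Callable]
) -> Iterator[Dict[str, Optional[object]]]:
    """Apply a schema while reading; the stored raw lines are never modified."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped at read time
        if not isinstance(record, dict):
            continue
        row: Dict[str, Optional[object]] = {}
        for field, cast in schema.items():
            value = record.get(field)
            try:
                row[field] = cast(value) if value is not None else None
            except (TypeError, ValueError):
                row[field] = None  # value present but not convertible
        yield row


if __name__ == "__main__":
    raw = ['{"user": "a", "price": "9.5", "extra": 1}', '{"user": "b"}']
    for row in read_with_schema(raw, {"user": str, "price": float}):
        print(row)
```

Adding a new field later means passing a different schema to the reader; nothing already stored has to be migrated, which is the low barrier the slide refers to.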

Quick start: Hortonworks Sandbox
• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready
Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with hands-on, step-by-step tutorials that are
updated frequently and easily
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your own data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Sign up for Training for in-depth learning
hortonworks.com/hadoop-training/

Hadoop Summit
Architecting the Future of Big Data
• June 26-27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation
Enterprise Data Platform
• 90+ Sessions and 7 Tracks
• Community Focused Event
– Sessions selected by a Conference Committee
– Community Choice allowed the public to vote for
the sessions they want to see
• Pre-event training classes
– Apache Hadoop Essentials: A Technical
Understanding for Business Users
– Understanding Microsoft HDInsight and Apache
Hadoop
– Developing Solutions with Apache Hadoop –
HDFS and MapReduce
– Applying Data Science using Apache Hadoop
• 10% discount code: 13DiscHUG10
hadoopsummit.org

Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks
We’re hiring!
