Hortonworks Big Data & Hadoop
1. © Hortonworks Inc. 2013
Big Data, Data Science & Hadoop
Ofer Mendelevitch
San Francisco Bay Area
Microsoft Business
Intelligence User Group
May 2013
2.
Who am I?
Director of Data Sciences @ Hortonworks
• Data science with Hadoop
• Professional services
Previously…
A Chess Dad
4.
Gartner’s 3 V’s of big data:
• Volume: size of the data
• Velocity: ingest speed, response latency
• Variety: diverse sources; format, structure; data quality
5.
What Makes Up Big Data?
[Figure: increasing data variety and complexity; scale runs Megabytes, Gigabytes, Terabytes, Petabytes]
• ERP: purchase detail, purchase record, payment record
• CRM: offer details, support contacts, customer touches, segmentation
• WEB: web logs, offer history, A/B testing, dynamic pricing, affiliate networks, search marketing, behavioral targeting, dynamic funnels
• BIG DATA: user generated content, mobile web, SMS/MMS, sentiment, external demographics, HD video, audio, images, speech to text, product/service logs, social interactions & feeds, business data feeds, user click stream, sensors / RFID / devices, spatial & GPS coordinates
Transactions + Interactions + Observations = BIG DATA
6.
Where is all this data coming from?
• Sensors/devices
• Online: social, forums, etc.
• Event logs
• Etc.
But also:
• Data that was “thrown away” previously
7.
Ok… so what is big data?
I like a quote from Michael Franklin (UCB):
“Big Data is any data that is expensive to manage and hard to extract value from”
It’s a relative term.
Today’s big data may be tomorrow’s small data.
9.
What is a data product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
14.
What is data science?
#1: Extracting deep meaning from data
(data mining; finding “gems” in data)
15.
What is data science?
#2: Building data products
(Delivering gems on a regular basis)
Pre-process Build model SQL
Periodic batch processing
Online serving
16.
Common data science tasks
Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
18.
A brief history of Apache Hadoop
[Timeline, 2004 to 2013. Milestones: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform]
• Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
• Focus on OPERATIONS. 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters
• STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo
19.
HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)
• Hadoop Core (distributed storage & processing): HDFS, MapReduce
• Platform Services (enterprise readiness): HA, DR, snapshots, security, …
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Operational Services (manage & operate at scale): Oozie, Ambari
• Deployment: OS / VM, cloud, appliance
20.
Core Hadoop: HDFS & MapReduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• MapReduce: distributed computation framework that handles the complexities of distributed programming
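To make the map-reduce model concrete, here is a minimal single-process sketch of the classic word count. It is plain Python and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the framework runs the mappers and reducers on many nodes and performs the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big hadoop cluster"])
print(counts)
```

The programmer writes only the mapper and the reducer; distribution, grouping and retry on failure are the parts the framework is meant to handle.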
21.
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software
• Designed for write-once, read-many access
– There are no random writes
– Optimized for minimum seek on hard drives
22.
Inside HDP for Windows
Hortonworks Data Platform (HDP) for Windows
• 100% open source Enterprise Hadoop
• Component and version compatible with Microsoft HDInsight
• Availability
– Beta release available now
– GA early 2Q 2012
[Diagram: HDP for Windows stack]
• Hadoop Core (distributed storage & processing): HDFS, WebHDFS, MapReduce
• Platform Services
• Data Services (store, process and access data): HCatalog, Hive, Pig, Sqoop
• Operational Services (manage & operate at scale): Oozie
23.
Seamless Interoperability with Your Microsoft Tools
• Integrated with Microsoft tools
for native big data analysis
– Bi-directional connectors for SQL
Server and SQL Azure through SQOOP
– Excel ODBC integration through Hive
• Addressing demand for Hadoop
on Windows
– Ideal for Windows customers with
Hadoop operational experience
• Enables all common Hadoop
workloads
– Data refinement and ETL offload for
high-volume data landing
– Data exploration for discovery of new
business opportunities
[Diagram: three layers. Applications: Microsoft applications. Data systems: Hortonworks Data Platform for Windows. Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media)]
26.
Benefits of Hadoop for data science
Benefit #1: Explore full datasets
27.
Explore large datasets directly with Hadoop
[Diagram: analysis cycle with the steps Acquire, Clean Data, Visualize/Grok, Model, Measure/Evaluate. The full dataset is stored on Hadoop; the researcher's laptop runs R, Matlab, SAS, etc.]
28.
Integrate Hadoop in your data analysis flow
• Full dataset resides in Hadoop
• Typical Hadoop tasks:
– Simple statistics: mean, median, correlation
– Text pre-processing: grep, regex, NLP
– Dimensionality reduction: PCA, SVD, clustering, etc.
– Random sampling: with or without replacement, by unique
– K-fold cross-validation
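Two of the tasks above can be sketched in a few lines of plain Python: a mean computed from small per-partition summaries, and one-pass random sampling without replacement. This is an illustration under assumptions, not Hadoop code. The helper names and the toy `partitions` list are hypothetical, and the sampler uses reservoir sampling (Algorithm R), which is one possible technique rather than the one the talk prescribes.

```python
import random


def partial_stats(partition):
    """Per-partition summary (count, sum); small enough to send to one reducer."""
    return len(partition), sum(partition)


def combine_mean(partials):
    """Merge the per-partition summaries into a global mean."""
    total_n = sum(n for n, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_n


def reservoir_sample(stream, k, seed=0):
    """Uniform sample of k items without replacement in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # randint is inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (i + 1)
    return sample


# Toy data standing in for blocks of a file spread over three nodes.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean = combine_mean([partial_stats(p) for p in partitions])
sample = reservoir_sample(range(1000), k=10)
print(mean, sample)
```

The mean works this way because counts and sums can be merged across partitions. An exact median cannot be merged the same way and usually needs a sort or an approximation.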
29.
Benefits of Hadoop for data science
Benefit #2: Mine larger datasets
30.
More data -> better outcomes
Banko & Brill, 2001
Halevy, Norvig & Pereira, 2009
31.
Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
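As one possible illustration of "distribute the data, then run a parallel algorithm", the sketch below fits a one-variable linear regression by gradient descent. Each partition computes a partial gradient on its own rows and a driver sums them. It runs in a single Python process; the `partitions` list stands in for data blocks on different nodes, and all names and hyperparameters are assumptions chosen for the example.

```python
def partial_gradient(partition, w, b):
    """Gradient of the squared error over one partition of (x, y) pairs."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)


def train(partitions, lr=0.05, steps=2000):
    """Batch gradient descent where each step is one map + reduce round."""
    w = b = 0.0
    for _ in range(steps):
        # "Map": every node computes a gradient on its local rows only.
        parts = [partial_gradient(p, w, b) for p in partitions]
        # "Reduce": sum the partial gradients and divide by the total row count.
        n = sum(count for _, _, count in parts)
        gw = sum(g for g, _, _ in parts) / n
        gb = sum(g for _, g, _ in parts) / n
        w -= lr * gw
        b -= lr * gb
    return w, b


# Toy data generated from y = 2x + 1, split across three "nodes".
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
w, b = train(partitions)
print(round(w, 4), round(b, 4))
```

Only the small gradient values move between nodes; the rows themselves never leave their partition, which is the point of co-locating computation with data.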
32.
Benefits of Hadoop for data science
Benefit #3: Large-scale data preparation
33.
80% of data science work is data preparation
[Diagram: Raw Data → Processed Data]
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
34.
Hadoop is ideal for batch data preparation and
cleanup of large datasets
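A per-document preparation step of this kind might look like the sketch below, which strips HTML, normalizes terms, and builds a simple term-count vector using only the Python standard library. The names (`TextExtractor`, `strip_html`, `normalize_terms`, `document_vector`) are hypothetical, and the normalization rule (lowercase, alphanumeric tokens only) is an assumption. In a batch job, a function like this would be applied to each record inside a mapper.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the visible text of an HTML fragment, with tags removed."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_terms(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw):
    """Bag-of-words vector: term -> number of occurrences in the document."""
    return Counter(normalize_terms(strip_html(raw)))


vec = document_vector("<p>Hadoop stores <b>BIG</b> data; big clusters.</p>")
print(vec)
```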
35.
Benefits of Hadoop for data science
Benefit #4: Accelerate data-driven innovation
36.
Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
[Timeline: Start: “I need new data” → schema change project → 6 months: “Finally, we start collecting” → 9 months: “Let me see… is it any good?”]
37.
“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
[Timeline: Start: “I need new data”, “Let’s just put it in a folder on HDFS” → 3 months: “Let me see… is it any good?” → 6 months: “My model is awesome!”]
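The idea behind "schema on read" can be sketched as follows: raw records are stored untouched, and each reader applies whatever schema it needs when it reads them. The example is plain Python over an in-memory list; `RAW_LINES`, `read_with_schema` and the field names are invented for illustration, and on a cluster the raw lines would be files in an HDFS folder.

```python
import json

# Raw lines kept exactly as they arrived; nothing was validated at write time.
RAW_LINES = [
    '{"user": "a", "clicks": "3"}',
    '{"user": "b", "clicks": "5", "referrer": "search"}',  # new field, no migration
    "not json at all",
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> converter) while reading the raw lines."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows are dropped at read time
        if not isinstance(record, dict):
            continue
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two readers, two schemas, same stored data.
v1 = list(read_with_schema(RAW_LINES, {"user": str, "clicks": int}))
v2 = list(read_with_schema(RAW_LINES, {"user": str, "referrer": str}))
print(v1)
print(v2)
```

When a new field such as `referrer` starts arriving, older readers keep working and new readers can use it immediately, with no schema change project in between.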
38.
Quick start: Hortonworks Sandbox
• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on step-by-step tutorials
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Sign up for Training for in-depth learning
hortonworks.com/hadoop-training/
39. Hadoop Summit
Architecting the Future of Big Data
• June 26-27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation
Enterprise Data Platform
• 90+ Sessions and 7 Tracks
• Community Focused Event
– Sessions selected by a Conference Committee
– Community Choice allowed the public to vote for sessions they wanted to see
• Pre-event training classes
– Apache Hadoop Essentials: A Technical
Understanding for Business Users
– Understanding Microsoft HDInsight and Apache
Hadoop
– Developing Solutions with Apache Hadoop –
HDFS and MapReduce
– Applying Data Science using Apache Hadoop
• 10% discount code: 13DiscHUG10
hadoopsummit.org
40.
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks
We’re hiring!