© Hortonworks Inc. 2013
Big Data, Data Science & Hadoop
Ofer Mendelevitch
San Francisco Bay Area Microsoft Business Intelligence User Group
May 2013
Who am I?
Director of Data Sciences @ Hortonworks
• Data science with Hadoop
• Professional services
Previously…
A Chess Dad
Gartner’s 3 V’s of big data:
• Volume: size of the data
• Velocity: ingest speed, response latency
• Variety: diverse sources; format, structure; data quality
What Makes Up Big Data?
Data volume axis: Megabytes, Gigabytes, Terabytes, Petabytes (with increasing data variety and complexity)
• ERP: Purchase detail, Purchase record, Payment record
• CRM: Offer details, Support Contacts, Customer Touches, Segmentation
• WEB: Web logs, Offer history, A/B testing, Dynamic Pricing, Affiliate Networks, Search Marketing, Behavioral Targeting, Dynamic Funnels
• BIG DATA: User Generated Content, Mobile Web, SMS/MMS, Sentiment, External Demographics, HD Video, Audio, Images, Speech to Text, Product/Service Logs, Social Interactions & Feeds, Business Data Feeds, User Click Stream, Sensors / RFID / Devices, Spatial & GPS Coordinates
Transactions + Interactions + Observations = BIG DATA
Where is all this data coming from?
• Sensors/devices
• Online: social, forums, etc.
• Event logs
• Etc.
But also:
• Data that was “thrown away” previously
OK… so what is big data?
I like a quote from Michael Franklin (UCB):
“Big Data is any data that is expensive to manage and hard to extract value from”
It’s a relative term.
Today’s big data may be tomorrow’s small data.
What is a data product?
“A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
Example 1: Google AdWords
Example 2: People you may know
Example 3: Spell correction
What is data science?
#1: Extracting deep meaning from data
(data mining; finding “gems” in data)
What is data science?
#2: Building data products
(Delivering gems on a regular basis)
Diagram labels: Pre-process; Build model; SQL; Periodic batch processing; Online serving
Common data science tasks
Descriptive
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Predictive
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
A brief history of Apache Hadoop
Timeline axis: 2004, 2006, 2008, 2010, 2012, 2013
• Focus on INNOVATION. 2005: Yahoo! creates team under E14 to work on Hadoop
• Focus on OPERATIONS. 2008: Yahoo! team extends focus to operations to support multiple projects & growing clusters
• STABILITY. 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo!
Milestones: Apache project established; Yahoo! begins to operate at scale; Enterprise Hadoop; Hortonworks Data Platform
HDP: Enterprise-Ready Hadoop
HORTONWORKS DATA PLATFORM (HDP)
• Hadoop Core: distributed storage & processing (HDFS, MapReduce)
• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)
• Data Services: store, process and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational Services: manage & operate at scale (Oozie, Ambari)
Deployment: OS / VM, Cloud, Appliance
Core Hadoop: HDFS & MapReduce
Deliver high-scale storage & processing
• HDFS: distributed, self-healing data store
• MapReduce: distributed computation framework that handles the complexities of distributed programming
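The MapReduce model described above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and only mimic the map, group-by-key and reduce stages that the framework would run across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical local driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster the mapper and reducer would be the only parts you write; splitting the input, grouping by key and retrying failed tasks are the "complexities" the framework takes on.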
Keys to Hadoop’s power
• Computation co-located with data
– Data and computation system co-designed and co-developed to work together
• Process data in parallel across thousands of “commodity” hardware nodes
– Self-healing; failure handled by software
• Designed for one write and multiple reads
– There are no random writes
– Optimized for minimum seek on hard drives
Inside HDP for Windows
Hortonworks Data Platform (HDP) for Windows
• 100% Open Source Enterprise Hadoop
• Component and version compatible with Microsoft HDInsight
• Availability
– Beta release available now
– GA early 2Q 2012
Components:
• Hadoop Core: distributed storage & processing (HDFS, WebHDFS, MapReduce)
• Platform Services
• Data Services: store, process and access data (HCatalog, Hive, Pig, Sqoop)
• Operational Services: manage & operate at scale (Oozie)
Seamless Interoperability with Your Microsoft Tools
• Integrated with Microsoft tools for native big data analysis
– Bi-directional connectors for SQL Server and SQL Azure through Sqoop
– Excel ODBC integration through Hive
• Addressing demand for Hadoop on Windows
– Ideal for Windows customers with Hadoop operational experience
• Enables all common Hadoop workloads
– Data refinement and ETL offload for high-volume data landing
– Data exploration for discovery of new business opportunities
Diagram layers:
• Applications: Microsoft applications
• Data systems: Hortonworks Data Platform for Windows
• Data sources: traditional sources (RDBMS, OLTP, OLAP); mobile data; OLTP, POS systems; new sources (web logs, email, sensor data, social media)
Data Science, now with more data…
Benefits of Hadoop for data science
Benefit #1: Explore full datasets
Explore large datasets directly with Hadoop
Analysis cycle: Acquire; Clean Data; Visualize, Grok; Model; Measure/Evaluate
Full dataset stored on Hadoop
Researcher laptop: R, Matlab, SAS, etc.
Integrate Hadoop in your data analysis flow
• Full dataset resides in Hadoop
• Typical Hadoop tasks:
– Simple statistics: mean, median, correlation
– Text pre-processing: grep, regex, NLP
– Dimensionality reduction: PCA, SVD, clustering, etc.
– Random sampling: with or without replacement, by unique
– K-fold cross-validation
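Sampling and fold assignment from the list above lend themselves to a map-only pass, because each record can decide its own fate from a hash of its key without coordinating with other nodes. The following is a rough sketch under that assumption; the helper names (`assign_fold`, `in_sample`, `split_for_fold`), the string record keys and the choice of MD5 as the hash are all illustrative and do not come from the talk.

```python
import hashlib
from typing import Iterable, List, Tuple


def _bucket(key: str, buckets: int, salt: str = "") -> int:
    """Deterministically map a record key to an integer in [0, buckets)."""
    digest = hashlib.md5((salt + key).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign_fold(key: str, k: int = 5) -> int:
    """Fold index for k-fold cross-validation; a key always gets the same fold."""
    return _bucket(key, k, salt="fold:")


def in_sample(key: str, percent: int) -> bool:
    """Keep roughly `percent` percent of keys (sampling without replacement)."""
    return _bucket(key, 100, salt="sample:") < percent


def split_for_fold(keys: Iterable[str], test_fold: int, k: int = 5) -> Tuple[List[str], List[str]]:
    """Split keys into (train, test) for one round of k-fold cross-validation."""
    train: List[str] = []
    test: List[str] = []
    for key in keys:
        (test if assign_fold(key, k) == test_fold else train).append(key)
    return train, test
```

Because the assignment depends only on the key, rerunning the job or adding nodes does not move records between folds, which keeps evaluation results comparable across runs.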
Benefits of Hadoop for data science
Benefit #2: Mine larger datasets
More data -> better outcomes
Banko & Brill, 2001
Halevy, Norvig & Pereira, 2009
Learning algorithms with large datasets…
Challenges:
• Data won’t fit in memory
• Learning takes a lot longer…
Using Hadoop:
• Distribute data across nodes in the Hadoop cluster
• Implement a distributed/parallel algorithm
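One way to picture a distributed learning algorithm is a model whose fit depends only on sums that can be computed per partition and then merged. The sketch below fits a simple least-squares line that way. It is an illustration chosen for brevity, not a method from the talk; the `Stats` class and the functions `partition_stats` and `fit_line` are hypothetical names, and the partitions are plain Python lists standing in for data blocks on different nodes.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics for simple linear regression y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        """Combine two partial results; addition makes the merge order irrelevant."""
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def partition_stats(rows: Iterable[Tuple[float, float]]) -> Stats:
    """Map side: one streaming pass over a partition, holding only five numbers."""
    n, sx, sy, sxx, sxy = 0, 0.0, 0.0, 0.0, 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return Stats(n, sx, sy, sxx, sxy)


def fit_line(partitions: Sequence[Iterable[Tuple[float, float]]]) -> Tuple[float, float]:
    """Reduce side: merge partial statistics, then solve for slope and intercept.

    Raises ZeroDivisionError if all x values are identical or there is no data.
    """
    total = reduce(Stats.merge, map(partition_stats, partitions), Stats())
    slope = (total.n * total.sxy - total.sx * total.sy) / (total.n * total.sxx - total.sx ** 2)
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept
```

This addresses both challenges listed above: no node ever holds the full dataset in memory, and the per-partition passes can run in parallel. Algorithms that need many passes over the data are harder to express this way.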
Benefits of Hadoop for data science
Benefit #3: Large-scale data preparation
80% of data science work is data preparation
From raw data to processed data:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Document vector generation
• Sampling, filtering
• Joins
• Term normalization
Hadoop is ideal for batch data preparation and cleanup of large datasets
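Some of the preparation steps mentioned in this part of the talk (stripping markup, term normalization, document vector generation) can be sketched as a per-document function, which is the shape a batch job would apply to every record. This is a deliberately crude illustration: the regular expressions stand in for a real HTML parser and tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter
from html import unescape
from typing import Dict, List

TAG_RE = re.compile(r"<[^>]+>")      # crude tag matcher, not a full HTML parser
TOKEN_RE = re.compile(r"[a-z0-9]+")  # runs of lowercase letters and digits


def strip_html(raw: str) -> str:
    """Replace tags with spaces and decode entities such as &amp;."""
    return unescape(TAG_RE.sub(" ", raw))


def normalize_terms(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics (no stemming or stop-word removal)."""
    return TOKEN_RE.findall(text.lower())


def document_vector(raw: str) -> Dict[str, int]:
    """Bag-of-words term counts for one raw document."""
    return dict(Counter(normalize_terms(strip_html(raw))))
```

Each document is processed independently of the others, which is why this kind of work parallelizes well as a batch job.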
Benefits of Hadoop for data science
Benefit #4: Accelerate data-driven innovation
Barriers to speed with traditional data architectures
• RDBMS uses “schema on write”; change is expensive
• High barrier for data-driven innovation
Timeline:
• Start: “I need new data” (schema change project)
• 6 months: “Finally, we start collecting”
• 9 months: “Let me see… is it any good?”
“Schema on read” means faster time-to-innovation
• Hadoop uses “schema on read”
• Low barrier for data-driven innovation
Timeline:
• Start: “I need new data” / “Let’s just put it in a folder on HDFS”
• 3 months: “Let me see… is it any good?”
• 6 months: “My model is awesome!”
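The "schema on read" idea can be shown in a few lines of Python: raw records are stored untouched, and each analysis applies its own field selection and type casts when it reads them. This is a sketch under assumed inputs; the JSON-lines layout, the field names and the `read_with_schema` helper are hypothetical and not taken from the talk.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Raw events are stored exactly as they arrived, one JSON object per line.
RAW_LINES = [
    '{"user": "u1", "ts": "2013-05-01", "clicks": "3"}',
    '{"user": "u2", "ts": "2013-05-02", "clicks": "5", "referrer": "search"}',
]

Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply a schema at read time: pick fields, cast types, default missing to None."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two analyses read the same raw lines with different schemas; nothing is rewritten.
clicks_view = list(read_with_schema(RAW_LINES, {"user": str, "clicks": int}))
referrer_view = list(read_with_schema(RAW_LINES, {"user": str, "referrer": str}))
```

The second record carries a `referrer` field that the first lacks. With schema on write, adding that field would be the kind of schema change project the previous slide describes; here the new field simply becomes readable as soon as it is stored.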
Quick start: Hortonworks Sandbox
• What is it
– A free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform
– A personal Hadoop environment
– An integrated learning environment with frequently and easily updatable hands-on step-by-step tutorials
• What it does
– Dramatically accelerates the process of learning Apache Hadoop
– Accelerates and validates the use of Hadoop within your unique data architecture
– Lets you use your data to explore and investigate your use cases
• ZERO to big data in 15 minutes
Download Hortonworks Sandbox
www.hortonworks.com/sandbox
Sign up for Training for in-depth learning
hortonworks.com/hadoop-training/
Hadoop Summit
Architecting the Future of Big Data
• June 26-27, 2013, San Jose Convention Center
• Co-hosted by Hortonworks & Yahoo!
• Theme: Enabling the Next Generation Enterprise Data Platform
• 90+ Sessions and 7 Tracks
• Community Focused Event
– Sessions selected by a Conference Committee
– Community Choice allowed public to vote for sessions they want to see
• Pre-event training classes
– Apache Hadoop Essentials: A Technical Understanding for Business Users
– Understanding Microsoft HDInsight and Apache Hadoop
– Developing Solutions with Apache Hadoop – HDFS and MapReduce
– Applying Data Science using Apache Hadoop
• 10% discount code: 13DiscHUG10
hadoopsummit.org
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend, @hortonworks
We’re hiring!

Hortonworks Big Data & Hadoop

  • 1. © Hortonworks Inc. 2013 Big Data, Data Science & Hadoop Ofer Mendelevitch San Francisco Bay Area Microsoft Business Intelligence User Group May 2013
  • 2. © Hortonworks Inc. 2013 Page 2 Who am I? Director of Data Sciences @ Hortonworks • Data science with Hadoop • Professional services Previously… A Chess Dad
  • 3. © Hortonworks Inc. 2013 Page 3
  • 4. © Hortonworks Inc. 2013 Page 4 Gartner’s 3 V’s of big data: Volume VelocityVariety Size of the data Ingest speed Response latency Diverse sources Format, structure Data quality
  • 5. © Hortonworks Inc. 2013 What Makes Up Big Data? Megabytes Gigabytes Terabytes Petabytes Purchase detail Purchase record Payment record ERPERP CRMCRM WEBWEB BIG DATABIG DATA Offer details Support Contacts Customer Touches Segmentation Web logs Offer history A/B testing Dynamic Pricing Affiliate Networks Search Marketing Behavioral Targeting Dynamic Funnels User Generated Content Mobile Web SMS/MMSSentiment External Demographics HD Video, Audio, Images Speech to Text Product/Service Logs Social Interactions & Feeds Business Data Feeds User Click Stream Sensors / RFID / Devices Spatial & GPS Coordinates Increasing Data Variety and Complexity Transactions + Interactions + Observations = BIG DATA Page 5
  • 6. Where is all this data coming from? Sensors/devices; online (social, forums, etc.); event logs; etc. But also: data that was “thrown away” previously.
  • 7. Ok… so what is big data? I like a quote from Michael Franklin (UCB): “Big Data is any data that is expensive to manage and hard to extract value from.” It’s a relative term: today’s big data may be tomorrow’s small data.
  • 8. (no text on this slide)
  • 9. What is a data product? “A software system whose core functionality depends on the application of statistical analysis and machine learning to data.”
  • 10. Example 1: Google AdWords
  • 11. Example 2: People you may know
  • 12. Example 3: Spell correction
  • 13. (no text on this slide)
  • 14. What is data science? #1: Extracting deep meaning from data (data mining; finding “gems” in data).
  • 15. What is data science? #2: Building data products (delivering gems on a regular basis). Diagram labels: pre-process, build model, SQL, periodic batch processing, online serving.
  • 16. Common data science tasks. Descriptive: clustering (detect natural groupings); outlier detection (detect anomalies); affinity analysis (co-occurrence patterns). Predictive: classification (predict a category); regression (predict a value); recommendation (predict a preference).
  • 17. (no text on this slide)
  • 18. A brief history of Apache Hadoop. Timeline markers: 2004, 2006, 2008, 2010, 2012, 2013. Milestones: Apache project established; Yahoo! begins to operate at scale; Hortonworks Data Platform; Enterprise Hadoop. Focus on INNOVATION, 2005: Yahoo! creates team under E14 to work on Hadoop. Focus on OPERATIONS, 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters. STABILITY, 2011: Hortonworks created to focus on “Enterprise Hadoop”; starts with 24 key Hadoop engineers from Yahoo.
  • 19. HDP: Enterprise-Ready Hadoop. HORTONWORKS DATA PLATFORM (HDP), deployable on OS / VM, cloud, or appliance. HADOOP CORE (distributed storage & processing): HDFS, MAP REDUCE. PLATFORM SERVICES (enterprise readiness): HA, DR, snapshots, security, … DATA SERVICES (store, process and access data): HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME. OPERATIONAL SERVICES (manage & operate at scale): OOZIE, AMBARI.
  • 20. Core Hadoop: HDFS & Map Reduce deliver high-scale storage & processing. HDFS: distributed, self-healing data store. Map-reduce: distributed computation framework that handles the complexities of distributed programming.
  • 21. Keys to Hadoop’s power. Computation co-located with data: data and computation system co-designed and co-developed to work together. Process data in parallel across thousands of “commodity” hardware nodes: self-healing; failure handled by software. Designed for one write and multiple reads: there are no random writes; optimized for minimum seek on hard drives.
  • 22. Inside HDP for Windows. Hortonworks Data Platform (HDP) for Windows: 100% open source enterprise Hadoop; component and version compatible with Microsoft HDInsight. Availability: beta release available now; GA early 2Q 2012. Diagram, HORTONWORKS DATA PLATFORM (HDP) for Windows: HADOOP CORE (distributed storage & processing): HDFS, WEBHDFS, MAP REDUCE. PLATFORM SERVICES. DATA SERVICES (store, process and access data): HCATALOG, HIVE, PIG, SQOOP. OPERATIONAL SERVICES (manage & operate at scale): Oozie.
  • 23. Seamless Interoperability with Your Microsoft Tools. Integrated with Microsoft tools for native big data analysis: bi-directional connectors for SQL Server and SQL Azure through SQOOP; Excel ODBC integration through Hive. Addressing demand for Hadoop on Windows: ideal for Windows customers with Hadoop operational experience. Enables all common Hadoop workloads: data refinement and ETL offload for high-volume data landing; data exploration for discovery of new business opportunities. Diagram labels: APPLICATIONS (Microsoft applications); DATA SYSTEMS (HORTONWORKS DATA PLATFORM for Windows); DATA SOURCES: traditional sources (RDBMS, OLTP, OLAP), new sources (web logs, email, sensor data, social media), mobile data, OLTP, POS systems.
  • 24. (no text on this slide)
  • 25. Data Science, now with more data…
  • 26. Benefits of Hadoop for data science. Benefit #1: Explore full datasets
  • 27. Explore large datasets directly with Hadoop. Full dataset stored on Hadoop; researcher laptop with R, Matlab, SAS, etc. Workflow cycle labels: acquire; clean data; visualize, grok; model; measure/evaluate.
  • 28. Integrate Hadoop in your data analysis flow. Full dataset resides in Hadoop. Typical Hadoop tasks: simple statistics (mean, median, correlation); text pre-processing (grep, regex, NLP); dimensionality reduction (PCA, SVD, clustering, etc.); random sampling (with or without replacement, by unique); k-fold cross-validation.
  • 29. Benefits of Hadoop for data science. Benefit #2: Mine larger datasets
  • 30. More data -> better outcomes. Banko & Brill, 2001; Halevy, Norvig & Pereira, 2009.
  • 31. Learning algorithms with large datasets… Challenges: data won’t fit in memory; learning takes a lot longer… Using Hadoop: distribute data across nodes in the Hadoop cluster; implement a distributed/parallel algorithm.
  • 32. Benefits of Hadoop for data science. Benefit #3: Large-scale data preparation
  • 33. 80% of data science work is data preparation. From raw data to processed data: strip away HTML/PDF/DOC/PPT; entity resolution; document vector generation; sampling, filtering; joins; term normalization.
  • 34. Hadoop is ideal for batch data preparation and cleanup of large datasets.
  • 35. Benefits of Hadoop for data science. Benefit #4: Accelerate data-driven innovation
  • 36. Barriers to speed with traditional data architectures. RDBMS uses “schema on write”; change is expensive. High barrier for data-driven innovation. Timeline (start, 6 months, 9 months), with a schema change project: “I need new data”; “Finally, we start collecting”; “Let me see… is it any good?”
  • 37. “Schema on read” means faster time-to-innovation. Hadoop uses “schema on read”. Low barrier for data-driven innovation. Timeline (start, 3 months, 6 months): “I need new data”; “Let’s just put it in a folder on HDFS”; “Let me see… is it any good?”; “My model is awesome!”
  • 38. Quick start: Hortonworks Sandbox. What is it: a free download of a virtualized single-node implementation of the enterprise-ready Hortonworks Data Platform; a personal Hadoop environment; an integrated learning environment with frequently and easily updatable, hands-on, step-by-step tutorials. What it does: dramatically accelerates the process of learning Apache Hadoop; accelerates and validates the use of Hadoop within your unique data architecture; lets you use your data to explore and investigate your use cases. ZERO to big data in 15 minutes. Download Hortonworks Sandbox: www.hortonworks.com/sandbox. Sign up for training for in-depth learning: hortonworks.com/hadoop-training/
  • 39. Hadoop Summit: Architecting the Future of Big Data. June 26-27, 2013, San Jose Convention Center. Co-hosted by Hortonworks & Yahoo! Theme: Enabling the Next Generation Enterprise Data Platform. 90+ sessions and 7 tracks. Community focused event: sessions selected by a conference committee; Community Choice allowed public to vote for sessions they want to see. Pre-event training classes: Apache Hadoop Essentials: A Technical Understanding for Business Users; Understanding Microsoft HDInsight and Apache Hadoop; Developing Solutions with Apache Hadoop – HDFS and MapReduce; Applying Data Science using Apache Hadoop. 10% discount code: 13DiscHUG10. hadoopsummit.org
  • 40. Thank you! Any questions? Ofer Mendelevitch, Director, Data Sciences @ Hortonworks. ofer@hortonworks.com; @ofermend, @hortonworks. We’re hiring!
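
Slides 20 and 31 describe map-reduce as a framework that distributes data across nodes and runs a parallel algorithm over it. The following is a minimal single-process sketch of that idea, not Hadoop's actual API: the function names (`map_words`, `reduce_counts`, `run_map_reduce`) and the sample input are hypothetical, and each inner list merely stands in for a block of data that would live on one cluster node.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_map_reduce(
    splits: List[List[str]],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Simulate map -> shuffle -> reduce in one process.

    On a real cluster the splits would be processed in parallel on the
    nodes that store them; here they are processed one after another.
    """
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for split in splits:
        for line in split:
            for key, value in mapper(line):
                shuffled[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in shuffled.items())


# Hypothetical input: two "nodes", each holding a few lines of text.
splits = [
    ["big data is big", "hadoop stores data"],
    ["hadoop processes data in parallel"],
]
word_counts = run_map_reduce(splits, map_words, reduce_counts)
```

The mapper and reducer never see the whole dataset at once, which is what lets a framework spread the same two functions across many machines.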
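
Slide 28 lists k-fold cross-validation among typical Hadoop tasks. One way to make fold assignment work record by record, so it could run inside a map step without coordination, is to hash each record's key. This is a sketch under that assumption, not a method taken from the slides; `fold_of`, `split_for_fold`, and the `user-N` keys are hypothetical names for illustration.

```python
import hashlib
from typing import List, Tuple


def fold_of(record_key: str, k: int = 5) -> int:
    """Assign a record to one of k folds from a stable hash of its key.

    hashlib is used instead of the built-in hash() because the built-in
    string hash varies between Python processes.
    """
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % k


def split_for_fold(keys: List[str], held_out: int, k: int = 5) -> Tuple[List[str], List[str]]:
    """Return (train, test) keys for the fold that is held out."""
    train, test = [], []
    for key in keys:
        (test if fold_of(key, k) == held_out else train).append(key)
    return train, test


# Hypothetical record keys.
keys = ["user-%d" % i for i in range(1000)]
folds = [split_for_fold(keys, held_out, k=5) for held_out in range(5)]
```

Because the fold depends only on the key, every node computes the same assignment, and each record lands in exactly one test set. Fold sizes are only approximately equal.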
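
Slides 36 and 37 contrast “schema on write” with “schema on read”. The sketch below illustrates the second idea in plain Python, assuming raw JSON lines as the stored format; the sample records, field names, and the `read_with_schema` helper are all hypothetical. The raw lines are stored untouched, and each analysis applies its own schema when it reads them.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical raw event lines, stored exactly as they arrived.
RAW_LINES = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart"}',
    '{"user": "c", "page": "/home", "ms": 80, "referrer": "ad"}',
]

# A schema maps a field name to (type conversion, default if missing).
Schema = Dict[str, Tuple[Callable[[Any], Any], Any]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Apply the schema at read time; the stored lines are never rewritten."""
    for line in lines:
        raw = json.loads(line)
        yield {
            field: cast(raw[field]) if field in raw else default
            for field, (cast, default) in schema.items()
        }


# Two analyses read the same raw data with different schemas.
latency_schema: Schema = {"user": (str, None), "ms": (int, 0)}
referrer_schema: Schema = {"user": (str, None), "referrer": (str, "unknown")}

latency_rows = list(read_with_schema(RAW_LINES, latency_schema))
referrer_rows = list(read_with_schema(RAW_LINES, referrer_schema))
```

Using the new `referrer` field required no change to the stored data, only a new schema at read time, which is the low barrier slide 37 points to.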