© 2014 MapR Technologies 1
Best Practices for Using Hadoop as an
Enterprise Data Hub
Mike Ferguson – Intelligent Business Strategies
Steve Wooledge – MapR
June 18, 2014
2
About Mike Ferguson
Mike Ferguson is Managing Director of Intelligent
Business Strategies Limited. As an analyst and
consultant he specialises in business
intelligence, data management and enterprise
business integration. With over 32 years of IT
experience, Mike has consulted for dozens of
companies, spoken at events all over the world
and written numerous articles. Formerly he was
a principal and co-founder of Codd and Date
Europe Limited – the inventors of the Relational
Model, a Chief Architect at Teradata on the
Teradata DBMS and European Managing
Director of DataBase Associates.
www.intelligentbusiness.biz
mferguson@intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
The Hadoop Data Refinery and Enterprise Data
Hub
Mike Ferguson
Managing Director
Intelligent Business Strategies
June 2014
4
Topics
!  Data warehousing and the evolution of ETL processing
!  New data and new analytical workloads
!  Big data use cases driving business agendas
!  The unprecedented demand for customer insight
!  Challenges with new big data sources
!  Beyond the data warehouse – new platforms for new analytical
workloads
!  The role of Hadoop in the modern analytical ecosystem
!  Introducing the Hadoop enterprise data hub and data refinery
!  Simplifying access to new big data insight using SQL on Hadoop
!  Integrating Hadoop into your analytical ecosystem
5
For Many Years The Traditional Data Warehouse and BI
Environment Has Been Used For Analysis & Reporting
Operational
systems
web
Portal
Employees
Partners
Customers
BI
Tools
Platform
Data
Integration/DQ
Reports &
analytics
Data warehouse
& data marts
DW
6
The Evolution of Data Integration in Data
Warehousing – From Hand Coded to ETL to ELT
Hand coded ETL programs
DW
Hand
coded
programs
ETL Servers
DW
ETL
Servers
ELT processing
Generated
SQL ELT
processing
DW
Evolution of Data Warehousing
MPP RDBMS systems
7
Sales
Product line n
Product line 4
Product line 3
Product line 2
Product/
service line 1
Marketing
Service
Credit
Verification
HR
Finance
Planning
Procurement
Supply Chain
Suppliers
Front Office Back Office
Operations
Customers
New Data Sources Have Emerged Inside And Outside
The Enterprise That Business Now Wants To Analyse
E.g. RFID tag
sensor
networks
weather data
Data volume
Data variety
Number of sources
Data volume
Data velocity
8
Popular Big Data Analytic Applications – Web Data
!  Clickstream analytics
•  Site navigation behaviour (session) analysis
–  Paths to buy, paths to abandonment, what else
they looked at
–  Improve customer experience and conversion
–  Associate clicks with customers & prospects
!  Social network influencer analysis
•  Graph analytics for influencer behavioural impact
analysis
•  ‘Target the influencer’ marketing campaign
effectiveness
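A minimal sketch of the session (path) analysis described above, assuming clicks arrive as (visitor, timestamp, page) tuples and that 30 minutes of inactivity ends a session. The threshold, page names and sample data are illustrative assumptions, not taken from the slides.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def sessionise(clicks):
    """Group (visitor_id, timestamp, page) clicks into sessions per visitor."""
    sessions = {}
    for visitor, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        visitor_sessions = sessions.setdefault(visitor, [])
        # Start a new session on the first click or after a long gap.
        if not visitor_sessions or ts - visitor_sessions[-1][-1][0] > SESSION_GAP:
            visitor_sessions.append([])
        visitor_sessions[-1].append((ts, page))
    return sessions


def paths(sessions):
    """Return each session as an ordered list of pages (the navigation path)."""
    return [[page for _, page in s] for v in sorted(sessions) for s in sessions[v]]


clicks = [
    ("v1", datetime(2014, 6, 18, 9, 0), "/home"),
    ("v1", datetime(2014, 6, 18, 9, 5), "/product"),
    ("v1", datetime(2014, 6, 18, 9, 7), "/basket"),
    ("v1", datetime(2014, 6, 18, 14, 0), "/home"),
    ("v2", datetime(2014, 6, 18, 10, 0), "/product"),
    ("v2", datetime(2014, 6, 18, 10, 2), "/checkout"),
]
session_paths = paths(sessionise(clicks))
# Paths to abandonment: reached the basket but never the checkout.
abandoned = [p for p in session_paths if "/basket" in p and "/checkout" not in p]
print(abandoned)
```

At scale the same grouping would typically run as a distributed job rather than in a single process.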
9
Popular Big Data Analytic Applications – Sensor Data
For Improving Process Efficiency and Optimisation
!  Sustainability analytics e.g. energy optimisation
!  Supply/distribution chain optimisation
!  Asset management and field service optimisation
!  Manufacturing production line optimisation
!  Location based advertising (mobile phones)
!  Grid health monitoring
•  Electricity, water, mobile phone cell network…
!  Smart metering (collect data every 15 minutes)
!  Fraud
!  Healthcare – ITC vital signs, fit bits,….
!  Traffic optimisation
" WHAT ARE YOU PREPARED TO INSTRUMENT?
E.g. RFID tag
10
Popular Big Data Analytic Applications
– Unstructured Data
!  Case management
!  Fault management and field
service optimisation
!  “Voice of the customer”
!  Sentiment analytics
!  Competitor analysis
!  Media coverage analysis
!  Improve pharma drug trials
" Unstructured content is hard to
analyse
How much is TEXT worth to
your business?
11
Big Data Analytics - Industry Use Case Examples
Industry: Use Case Examples
Financial Services: Improved risk decisions, KYC customer insight, auto programmatic trading, 360 view of financial crime, pre-trade decision support, real-time trade & corp action tagging for compliance and RT P&L, grow security services outsourcing, Reference Data Exchange
Utilities: Smart meter data analysis, pricing elasticity analysis, customer loyalty, sustainability, asset management
Telecommunications: Customer churn, network optimization analysis from device, sensor and GPS inputs, monetization of GPS and data
Manufacturing: Sensor data for next generation ‘smart’ products, production line optimisation, improved customer service and improved field service, distribution chain optimization, asset management
Insurance: “How you drive” insurance (sensors to reduce risk), broker document analysis (risk assessment)
Government: Smart cities (e.g. transportation optimisation), anti-terrorism, law enforcement
Logistics: Distribution optimisation, route optimisation,
12
More Data Is Required To Get A Deeper
Understanding of Customers
!  We now need
•  Transaction data
•  Data from touch points you own
•  Data from the touch points you don’t own
•  Interaction data
–  Need to look at Inbound interactions Vs outbound interactions
–  Social interactions
•  Master data
•  Professional data e.g. profiles on LinkedIn
•  Internal and external event data
•  Competition data…..
!  Then use analytics to understand and predict desire and
propensity e.g. propensity to churn
13
Top Priorities - Improving Customer Experience Via
Time Series Analysis of All Customer Interactions
OMNI channel – analyse all customer
interactions across all channels
identity
data
behavioural
data
social
data
Customer “DNA”
14
identity
data
behavioural
data
social
data
Customer “DNA”
Customer Experience Management - Understanding Customer
On-Line Behaviour is Mission Critical to Retention and Growth
!  Important new data sources for analysis for customer ‘DNA’
•  Clickstream data from web logs
•  Sentiment and social network influencer data
New competitors
More choice
Voice of the customer
On the web the
customer is king
On the
move
Easy to find
15
Today Both Structured And Multi-Structured Data Are
Needed For Deeper Insight
Multi-
structured
data
Click stream web log data
Customer interaction data
Social interaction data
Sensor data
Rich media data (video, audio)
External content
Documents
Internal web content
Seismic data (oil & gas)
Structured
data
OLTP system data
Data warehouse data
Personal data stores e.g.
Excel, Access
Often un-modelled and may
not be well understood
Often a schema is defined
and data is well understood
Data characteristics are changing
- Companies must deal with volume,
variety and velocity
16
Big Data Analytics Challenges Include The Analysis of
Unstructured, Semi-structured and Structured Data
{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home", "number": "0161-123-1234" },
    { "type": "mobile", "number": "07779-123234" }
  ]
}
JSON data
Text data
Image Data
Makes analysis more complex with new analytics and visualisations needed
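One reason semi-structured data is harder to analyse is that nested objects and arrays have to be flattened before they fit row-and-column tools. The sketch below shows one possible approach using only the Python standard library. The dotted-path naming convention is an assumption chosen for illustration.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a single dict keyed by dotted paths."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = value
    return flat


record = json.loads("""
{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    {"type": "home", "number": "0161-123-1234"},
    {"type": "mobile", "number": "07779-123234"}
  ]
}
""")

row = flatten(record)
for column, value in row.items():
    print(column, "=", value)
```

Each array element becomes its own set of columns here. A real pipeline might instead explode arrays into separate rows, depending on the analysis.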
17
Increased Data and Analytical Complexity Has Created
A Need For A New Role – The Data Scientist
Image source: Wikipedia
Data Science is the process of investigative / exploratory analysis of
multi-structured data to discover and produce new business insights
Image source:
www.computing.co.uk
18
People In Different Roles In The Analytical Landscape
Need To Work Together To Deliver Business Value
Exploratory analysis
Predictive / statistical
model producer
Business Analyst
Business Manager /
Operations worker /
Customer
Data Scientist
Model consumer
Data visualisation
Information Producer
• Build reports
• Build and publish
dashboards
Information consumer
Decision maker
Action taker
Strategic
Business
Objective
Priority KPI Current
KPI
Value
What is
+1%
worth?
KPI
Target
Executive
Accountable
Business
Initiatives
(projects)
Budget
Allocation
Action
Plan
1 $$$ Project
Project
Project
£ x Million
2
3
4
Business Strategy – strategic objectives and targets including sustainability targets
sandbox
19
Data Science Produces New Insights For Business Analysts
Who Produce Actionable BI For Front Office Decision Makers
Business Analyst
Marketing Manager /
Marketing, Sales and
Service workers
Data Scientist
Data Quality
Forecasting
Segmentation
Models
Customer Lifetime
Value
Social
Network
Strategy
Creation
Performance
& Effectiveness
Reporting
Direct Mail
Understand
Customer
Behavior
& Navigation
Marketing
Performance &
Reporting
Campaign
Planning
Financial
Planning
Creative
Materials
Marketing
Attribution
Operations
Management
Channel
Efficiency
Sentiment
& Influence
Dynamic
Content
Re-marketing
Web
Call Center
Live Event
Broadcast Media
Mobile/ SMS
Social
Email
Industry Specific
Big Data Analytics
Traditional DW/BI
Workflow
& Approvals
New insights Actionable BI
20
Big Data Analytics Has Taken Us Beyond The
Traditional DW – New Big Data Analytical Workloads
1.  Analysis of data in motion
2.  Complex analysis of structured data
3.  Exploratory analysis of un-modeled multi-structured data
4.  Graph analysis e.g. social networks
5.  Accelerating ETL and analytical processing of
un-modeled data to enrich data in a data warehouse or
analytical appliance
6.  The storage and re-processing of archived data
21
The Changing Landscape – We Now Have Different
Platforms Optimised For Different Analytical Workloads
Big Data workloads result in multiple platforms now being needed for
analytical processing
Streaming
data
Hadoop
data store
Data Warehouse
RDBMS
NoSQL
DBMS
EDW
DW & marts
NoSQL DB
e.g. graph DB
Advanced Analytic
(multi-structured data)
mart
DW
Appliance
Advanced Analytics
(structured data)
Analytical
RDBMS
Graph
analysis
Investigative
analysis,
Data refinery
Traditional
query,
reporting &
analysis
Real-time
stream
processing &
decision m’gmt
Data mining,
model
development
22
Hadoop Is A Key Platform In Big Data Analytics
– Data Can Be Accessed Via Multiple APIs
Java MapReduce
APIs to HDFS,
HBase, Cascading
file file file file file
file file file file file
file file
file file
webHDFS
(An HTTP
interface to
HDFS has
REST APIs)
HDFS
file
file
file
file
YARN
PIG latin
scripts
SQL
Vendor SQL on
Hadoop engine
MapReduce
Application
index
index
Index
partition
SQL
BI Tools &
Applications
Storm
Application
YARN
Tez or Spark
MapReduce
HBase
HDFS API
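As an illustration of the webHDFS REST interface mentioned above, the sketch below only builds request URLs and does not contact a cluster. The /webhdfs/v1 prefix and the LISTSTATUS and OPEN operations come from the WebHDFS REST API. The host name, port and file paths are assumptions made up for the example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and paths, for illustration only.
list_url = webhdfs_url("namenode.example.com", 50070, "/data/weblogs", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com", 50070, "/data/weblogs/access.log", "OPEN",
    offset=0, length=1024,
)
print(list_url)
print(open_url)
```

Any HTTP client can then issue a GET against these URLs, which is what makes HDFS data reachable from tools that have no Java client.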
23
De Facto Standard APIs Allow Hadoop Components To Be
Replaced e.g. Faster, More Secure File System Than HDFS
Java MapReduce
APIs to HDFS,
HBase, Cascading
webHDFS
(An HTTP
interface to
HDFS has
REST APIs) file file file file file
file file file file file
file file
file file
file
file
file
file
Vendor Specific File System (e.g. )
YARN
HDFS API
PIG latin
scripts
index
index
Index
partition
Storm
Application
YARN
MapReduce HBase
MapReduce
Application
SQL
Vendor SQL on
Hadoop engine
SQL
BI Tools &
Applications
Tez or Spark
24
Apache Hadoop Components
Component: Description
Hadoop HDFS: A distributed file system that partitions files across multiple machines for high-throughput access to application data – HDFS API allows vendors to replace HDFS with an alternative
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A programming framework for distributed batch processing of large data sets distributed across multiple servers
Avro: A serialization system that creates & reads files in a format containing both JSON data definitions & the data itself for dynamic interpretation of the data by applications
Hive: A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs
HBase: An open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
Pig: A high-level data-flow language for expressing Map/Reduce programs for processing and analysing large HDFS distributed data sets
Mahout: A scalable machine learning and data mining library
Oozie: A service for running and scheduling workflows of Hadoop jobs (including Map-Reduce, Pig, Hive, and Sqoop jobs)
Spark: A general purpose engine for large scale data processing in-memory. It supports analytical applications that wish to make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data
ZooKeeper: A high-performance coordination service for distributed applications
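To make the MapReduce programming model concrete, here is a single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It only imitates the data flow in plain Python. It does not use the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big insight", "data hub"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, the map and reduce functions run in parallel on different nodes and the shuffle moves data between them over the network.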
25
The Role of Hadoop - Data Is Arriving Faster Than We
Can Consume It – How Good Is Your Filter?
DATA
FILTER
Enterprise
Enterprise
systems
26
New Requirement
– The Managed Hadoop Enterprise Data Hub
Parse & Prepare Data in Hadoop (MapReduce)
Transform & Cleanse Data in Hadoop (MapReduce)
Discover data in Hadoop
ELT workflow
sandbox
other data
sandbox sandbox
Data Reservoir
(raw data)
Load data into Hadoop
Data
Refinery
New high
value Insights
(pub/sub)
EDW
Graph
DBMS
DW
appliance
contains clean,
high value data
XML,
JSON
Web
logs
27
What’s In An Enterprise Data Hub?
!  A managed data reservoir (raw data)
•  Organised capture of multi-structured data
•  Includes real-time data capture
•  May include operational reporting
!  A governed data refinery
•  Data integration and cleansing at scale
•  Analytical sandboxes to discover high value data
!  Published, protected and secure high value insights
!  Long-term storage of archived data from data warehouses
28
file file file
file file file
file file
file file
file
file
Real-time Data Capture – E.g. MapR Allows Web Log
Data To Be Directly Streamed/Stored in Hadoop
MapR Direct Access NFS allows
Web log files to be stored directly on
their Hadoop File System so that
click stream is captured in real-time
MapR Distribution
for Hadoop
Web Server
Direct Access NFS
web log file
web log file
# mount localhost:/mapr /mapr
HDFS
Web Server
Web Server
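Once the cluster is mounted over NFS as in the mount command above, a web server can append to its log with ordinary file I/O and the data lands in the cluster file system. The sketch below illustrates the idea. It uses a temporary directory as a stand-in for the /mapr mount point so it runs anywhere, and the relative path and log line are made up for the example.

```python
import os
import tempfile


def append_log_line(mount_root, relative_path, line):
    """Append one web log line to a file under the given mount point."""
    target = os.path.join(mount_root, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "a", encoding="utf-8") as log_file:
        log_file.write(line + "\n")
    return target


# Assumption: a temp directory stands in for the NFS mount point (e.g. /mapr).
mount_root = tempfile.mkdtemp()
log_path = append_log_line(
    mount_root, "weblogs/access.log",
    '127.0.0.1 - - [18/Jun/2014:09:00:00 +0000] "GET /home HTTP/1.1" 200 512',
)
append_log_line(
    mount_root, "weblogs/access.log",
    '127.0.0.1 - - [18/Jun/2014:09:00:05 +0000] "GET /product HTTP/1.1" 200 734',
)

with open(log_path, encoding="utf-8") as log_file:
    captured = log_file.read().splitlines()
print(len(captured), "lines captured")
```

No Hadoop client library is involved on the writing side, which is the point of the direct-access approach.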
29
High Volume Data Capture
- Column Family Databases
!  Suitable for fast capture of large amounts of sparse, volatile data
•  Very fast capture and can hold vast amounts of data
•  Billions of rows containing thousands or millions of columns
!  Provide column-centric storage and wide de-normalised big
tables can also help simplify operational reporting if used with
SQL-on-Hadoop e.g. SQL access to HBase
!  Allow you to
•  Group together related columns into column families
•  Design column families to optimize the most common queries
•  Retrieve columnar data for multiple entities by iterating through a
column family
•  Shard rows in a column family and distribute across many servers
•  Create indexes and secondary indexes
•  Support schema variance - columns in a column family can vary for
every row
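The column family ideas listed above can be pictured as a nested map: row key, then column family, then column. The in-memory sketch below is a toy model for illustration, not the HBase API. The class and method names are hypothetical.

```python
class ColumnFamilyTable:
    """Toy model: row key -> column family -> {column: value}."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns are created on write, so each row can hold different ones.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get_family(self, row_key, family):
        """Read only the requested column family for one row."""
        return dict(self.rows.get(row_key, {}).get(family, {}))

    def scan_family(self, family):
        """Iterate one column family across all rows that have data in it."""
        return {
            key: dict(fams[family])
            for key, fams in sorted(self.rows.items())
            if fams.get(family)
        }


table = ColumnFamilyTable(["profile", "portfolio"])
table.put("cust1", "profile", "name", "Ann")
table.put("cust1", "portfolio", "ACME", 100)
table.put("cust1", "portfolio", "GLOBEX", 25)
table.put("cust2", "profile", "name", "Bob")
table.put("cust2", "profile", "city", "Leeds")  # schema variance: extra column

portfolios = table.scan_family("portfolio")
print(portfolios)
```

Rows with no data in a family simply do not appear in a scan of that family, which is how sparse data stays cheap to store.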
30
NoSQL Column Family Databases - HBase
Row 1 → Column A = value
Column B = value
Column C = value
Row 2 → Column X = value
Column Y = value
Column Z = value
HBase Storage Architecture
HMaster and several HRegionServers
Regions (partitions) created automatically as tables grow
HBase allows applications to directly read and write data
31
Column Families Can Be Stored In Different Files And
Queries Will Only Retrieve The Column Family Needed
Source: Data Access for Highly-Scalable Solutions : Using SQL, NoSQL, and Polyglot Persistence, McMurtry, Oakley, Sharp, Subramanian, Zhang
Portfolio.* means all
columns in the Portfolio
column family
Data about a customer and their
stock purchases are partitioned
vertically by column family
Column
family data
can also be
compressed
32
Fast Data Capture – MapR-DB Is A High Speed
Version of HBase Built Into The MapR Data Platform
HBase API
Source: MapR
33
Enterprise Data Hub – We Need A Data Refinery To
Process And Clean Complex Data
Image source: http://www.hollyfrontier.com/navajo/
34
Evolution of Big Data Integration Is Following The
Same Cycle as it Did in Data Warehousing
Hand coded ETL programs
Hadoop
Hand
coded
programs
ETL Servers
Hadoop
ETL
Servers
ELT processing
Generated
MapReduce ELT
processing
Hadoop
Evolution of Big Data Integration
35
Data Cleansing and Integration Tool
Scaling ETL In A Data Refinery By Generating Pig, Hive or 3GL
MapReduce Code for In-Hadoop ELT Processing
Extract Parse Clean Transform Analyse Load Insights
Option 1
ETL tool generates HiveQL (HQL)
or converts generated
SQL to HQL
Option 2
ETL tool generates
Pig Latin
(compiler converts
every transform to
a map reduce job)
Note - Generating native MapReduce code instead of HiveQL or Pig Latin would
likely perform faster because there is no need to translate into MapReduce
Also HiveQL is a subset of SQL so check how ETL tools generating HiveQL do
complex transformations – HiveQL on its own may not be enough e.g. Hive UDFs?
Option 3
ETL tool generates
3GL MapReduce
code
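As a rough picture of Option 1, the sketch below renders a declarative cleansing step as a HiveQL INSERT ... SELECT string. It is a hypothetical generator, not how any particular ETL tool works, and the table and column names are invented for the example.

```python
def generate_hiveql(source, target, columns, filters=()):
    """Render a cleansing step as a HiveQL INSERT ... SELECT statement.

    columns maps each output column name to the expression that produces it.
    """
    select_list = ",\n  ".join(f"{expr} AS {alias}" for alias, expr in columns.items())
    statement = f"INSERT OVERWRITE TABLE {target}\nSELECT\n  {select_list}\nFROM {source}"
    if filters:
        statement += "\nWHERE " + " AND ".join(filters)
    return statement


# Hypothetical tables and columns, for illustration only.
statement = generate_hiveql(
    source="raw_customers",
    target="clean_customers",
    columns={
        "customer_id": "customer_id",
        "email": "lower(trim(email))",
        "country": "upper(country)",
    },
    filters=["customer_id IS NOT NULL", "email LIKE '%@%'"],
)
print(statement)
```

Transformations that cannot be expressed with built-in functions would need a Hive UDF, which is the limitation the note above points to.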
36
Need to Parse & Extract From Multi-Structured Data While
Integrating Data In A Big Data Environment
E-mail (semi-structured)
Text (unstructured)
Extract Parse Transform Load …
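A small sketch of the parse and extract steps for semi-structured e-mail, using Python's standard email module. The headers are structured and can be read directly, while the body is free text that needs pattern matching. The message content and the "order <number>" pattern are assumptions for illustration.

```python
import re
from email import message_from_string

raw = """From: customer@example.com
To: support@example.com
Subject: Order 12345 arrived damaged
Date: Wed, 18 Jun 2014 09:00:00 +0000

Hello, my order 12345 arrived damaged. Please send a replacement.
"""


def parse_email(text):
    """Parse the structured headers, then extract order ids from the body text."""
    msg = message_from_string(text)
    body = msg.get_payload()
    order_ids = re.findall(r"\border (\d+)\b", body, flags=re.IGNORECASE)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "order_ids": sorted(set(order_ids)),
        "body": body.strip(),
    }


parsed = parse_email(raw)
print(parsed["sender"], parsed["order_ids"])
```

The resulting dictionary is a structured record that can be transformed and loaded alongside other data.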
37
Sandboxes In The Data Refinery - Data Science Teams Need
To Conduct Exploratory Analysis on Multi-Structured Data
Click stream web log data
Customer interaction data
Social interaction data (e.g.
Twitter, Facebook)
Sensor data
Rich media data (video, audio)
External web content
Documents
Internal web content
Seismic data (oil & gas)
Investigative /
Exploratory
Analysis
CRUD
Asset
Customer
Product
MDM System
EDW
mart
new
business
insights
sandbox
Multi-structured
data
Historical Data
archived DW data
master data
Data Scientists
38
In-Hadoop Analytics In A Data Refinery
– Example Technologies
!  Hadoop MapReduce, Tez or Spark analytic
applications with custom analytics
•  Pig, Java, Python, Scala, Cascading…..
!  Hadoop MapReduce, Tez or Spark analytic
applications using pre-built Hadoop analytics e.g.
Mahout, Spark MLlib
•  Several analytical algorithms for use in analysis
!  Revolution Analytics RevoScaleR
!  SAS Analytics and In-Memory Statistics for Hadoop
!  … many more
Analytical
tools
Data
management
tools
39
In-Hadoop Analytics:
- Mahout Supports A Number Of Analytic Techniques
!  Collaborative Filtering
!  User and Item based recommenders
!  K-Means and Fuzzy K-Means clustering
!  Mean Shift clustering
!  Dirichlet process clustering
!  Latent Dirichlet Allocation
!  Singular value decomposition
!  Parallel Frequent Pattern mining
!  Complementary Naive Bayes classifier
!  Random forest decision tree based classifier
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Now runs
on Spark as
well as
MapReduce
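To show what one of the listed techniques does, here is a plain-Python k-means sketch. It is not Mahout code and does not use Mahout's API. The naive choice of the first k points as starting centroids and the sample points are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Cluster points by repeatedly assigning each one to its nearest centroid."""
    centroids = [points[i] for i in range(k)]  # naive deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

A distributed implementation splits the assignment step across nodes and combines partial sums to update the centroids.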
40
Expediting The Data Refinery Process On Hadoop With
Automated Analysis – From ETL to Analytical Workflows
Parse & Prepare Data in Hadoop (MapReduce)
Transform & Cleanse Data in Hadoop (MapReduce)
Discover data in Hadoop
ELT workflow
other data
Raw data
Load data into Hadoop
Data
Refinery
EDW
Graph
DBMS
DW
appliance
Automated Invocation of Custom Built & Pre-built
Analytics on Hadoop
contains clean,
high value data
New high
value Insights
(pub/sub)
41
High Value Insights Produced In A Hadoop Data Hub Can Be
Brought Into A DW to Enrich What We Already Know
Cloud Data
HDFS
Extract
DW
DI
Map/Reduce data
transformation
and analytics
applications
Transform
e.g. PIG, IBM JAQL
Cloud Data
e.g. Deriving insight from huge
volumes of social web content on
sites like Twitter, Facebook, Digg,
MySpace, TripAdvisor, LinkedIn… for
sentiment analytics
Hundreds of
terabytes up
to petabytes
new
insights
Operational
systems
42
Making New Insights Available To Business Analysts
Via SQL Access To Big Data - Options
SQL
SQL access to
big data in
Hadoop
SQL
DW
data virtualisation server
SQL access to
big data via data
virtualisation
SQL
Analytical
RDBMS
SQL access to big
data in an
analytical RDBMS
streaming
data
SQL
SQL access to
streaming data in
motion
43
Self-Service BI
BI Tool(s)
e.g. Visual Discovery tools
Business Analyst
or ‘budding’ Data
Scientist
personal &
office data
Predictive
models
community
Publish / Share
Consume /
Enhance /
Re-publish
Transaction
systems
DW
SQL Access to Hadoop Is Needed To Allow Hadoop Data To
Be Accessed By Users With Self-Service BI Tools
collaborate
HDFS / HBase / Hive
e.g. Hive interface
44
SQL access
to Big Data?
Key Questions That May Influence If SQL Access to Big
Data Is A Good Choice or What SQL Option to Take
What kind of analysis?
Text analysis, Graph analysis,
Machine Learning, reporting
What kind of data type(s)
do you need to analyse?
- structured, unstructured,
semi-structured
What kind of data volumes
do you want to analyse?
Is the data at rest or is it real-
time streaming data in motion?
What analytical functions
can you invoke on big
data from SQL?
Join with other data in
another data store?
How many concurrent users?
Performance and scalability
of complex queries and
analytical functions
(need parallelism)
Is the requirement for
interactive, exploratory,
or real-time analysis?
Data
Analytical Workload
45
SQL On Hadoop Initiatives
Key Questions
What analytic functions
are provided?
How can analytic
functions be extended
Can you join to data
outside of Hadoop?
Are these SQL on
Hadoop options
suitable for reporting
and analysis, interactive
discovery, exploratory
analysis or all of these?
Vendor: SQL on Hadoop Initiative
AMPlab (UC Berkeley): Shark (forked Hive at V0.9) or SparkSQL
Apache Hadoop: Hive
Actian: Vortex (Actian Vector on Hadoop data nodes)
CitusDB: CitusDB (uses external tables)
Cloudera: Impala / Parquet
Concurrent: Lingual (SQL on Cascading)
Hadapt: Schemaless SQL
Hortonworks: Stinger / ORC (Hive 13)
HP: Vertica on Hadoop
IBM: BigSQL (SQL on HDFS & HBase)
InfiniDB: InfiniDB on Apache Hadoop
Jethro Data: JethroData
MapR: Apache Drill
Microsoft: Hive 13
Pivotal: HAWQ (uses external tables via PXF)
Teradata: SQL-H
Splice Machine: Splice Machine (SQL engine on HBase)
Salesforce.com: Phoenix (SQL engine on HBase)
Attivio: Active Intelligence Engine (SQL access to search indexes on Hadoop data)
46
SQL on Hadoop
– Apache Drill Can Access HDFS And HBase Data
BI Tool(s)
e.g. Visual Discovery tools
Business Analyst
or Data Scientist
Drill
Analytic Application
SQL SQL
Data Scientist
HDFS HBase
MapR Distribution for Hadoop
Apache Drill does not use MapReduce
MongoDB/
Cassandra
sensors
XML,
JSON
Data
entering
HBase
47
Apache Drill Distributed Query Processing
– A Storage Independent Drillbit MPP Architecture
Each drillbit is capable of receiving queries from applications and BI tools
- there is no master in this architecture
Multiple drillbits are involved in parallel query processing on distributed data
Supports Apache HDFS, Apache HBase, MapR-FS, MapR-DB, Amazon S3
48
SQL on Hadoop Example – Apache Drill Supports
Query of Self-Describing Data Without a Schema
JSON
Source: MapR
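The idea of querying self-describing data without a schema can be sketched in a few lines: each JSON record carries its own field names, so columns are resolved per record at read time and missing fields come back as nulls. This is a plain-Python illustration of schema-on-read, not Drill itself. The helper names and sample records are hypothetical.

```python
import json

lines = [
    '{"name": "Ann", "city": "Leeds", "orders": [{"sku": "A1", "qty": 2}]}',
    '{"name": "Bob", "orders": [{"sku": "B7", "qty": 1}, {"sku": "A1", "qty": 5}]}',
    '{"name": "Cy", "city": "York", "loyalty": {"tier": "gold"}}',
]


def lookup(record, dotted_path):
    """Resolve a dotted column path against one record; None if absent."""
    value = record
    for part in dotted_path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def select(records, columns, where=lambda r: True):
    """Project the requested columns from every record that passes the filter."""
    return [tuple(lookup(r, c) for c in columns) for r in records if where(r)]


records = [json.loads(line) for line in lines]
result = select(records, ["name", "city", "loyalty.tier"])
with_orders = select(records, ["name"], where=lambda r: "orders" in r)
print(result)
```

No table definition was declared before the query, and the three records do not share the same set of fields.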
49
file
file
file
file
file
file
file
file
file
file
file
SQL on Hadoop
– What Should The Schema Look Like?
Star schema? Snowflake schema?
De-normalised schema?
Other?
50
Hadoop Storage Is Independent of Any SQL Engine Accessing
HDFS - Multiple SQL Engines Can Coexist On The Same Data
file file file file file
file file file file file
file file
file file
HDFS
file
file
file
file
YARN
Batch
(MapReduce)
Interactive
(Tez)
On-line
(HBase)
Streaming
(Storm,..)
Graph
(Giraph)
In-memory
(Spark)
HPC MPI
(OpenMPI)
Other
(Search,.)
file
file
file
file
SQL SQL SQL SQL
Storage is independent
of any SQL engine
!  Key points about Hadoop
•  It is possible to have MULTIPLE SQL engines on the same data
•  Different SQL engines run on different Hadoop frameworks (M/R, Tez,
Spark) or on no framework at all i.e. directly access HDFS or HBase data
51
Relational DBMS / Hadoop Integration – Several Vendors Have
Integrated RDBMS with Hadoop to Run Analytics
Relational DBMS
External
Polymorphic
table function(s)
HDFS / HBase / Hive
SQL, XQuery
RDBMS optimizer handles
transparent access to external
analytical platforms on behalf
of the user
RDBMS and Hadoop could
be deployed on the same
hardware cluster
(preferred) or on different
hardware clusters
Allows join across data in a
single RDBMS and Hadoop
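The join across an RDBMS and Hadoop can be pictured with a small stand-in: SQLite plays the relational DBMS, and a Python list plays rows read from Hadoop. Real integrations expose the external data through table functions rather than copying it in, so the temporary table here is a simplifying assumption, as are all table and column names.

```python
import sqlite3

# Stand-in for rows produced by an analytic job on Hadoop (hypothetical data).
hadoop_rows = [("c1", 0.8), ("c2", -0.4), ("c3", 0.1)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("c1", "Ann"), ("c2", "Bob")])

# Simplification: the external rows are loaded into a temp table so SQL can join them.
conn.execute("CREATE TEMP TABLE sentiment (customer_id TEXT, score REAL)")
conn.executemany("INSERT INTO sentiment VALUES (?, ?)", hadoop_rows)

joined = conn.execute(
    "SELECT c.name, s.score "
    "FROM customers c JOIN sentiment s ON s.customer_id = c.id "
    "ORDER BY c.name"
).fetchall()
print(joined)
conn.close()
```

The user writes one SQL statement. Where each side of the join physically lives is the optimizer's concern.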
52
Relational DBMS / Hadoop Integration Example
- HP Vertica and MapR
Source: MapR
53
Self-Service BI
Self-service Data
Discovery & Visualisation
or Dashboard Server
Business
analyst
Data Virtualization and Optimization
personal
& office
data Predictive
models
Transaction
systems
Data Management Tools (ETL, DQ, etc.)
DW
Self-Service Access To Big Data Via Data Virtualization
BUT what about optimization?
Can the data virtualisation server push
down analytics to underlying platforms
to make them do the work?
54
New Insights Can Be Added Into A Data Warehouse To Enrich
What You Already Know
DW
D
I
new
insights
Operational
systems
e.g. Deriving insight from social web sites for sentiment analytics
sandbox
Data Scientists
social
Web
logs
web cloud
ELT
55
Alternatively New Insights In Hadoop Can Be Integrated With A
DW Using Data Virtualization To Provide Enriched Information
DW
D
I
e.g. Deriving insight from social web sites for sentiment analytics
new
insights
OLTP systems
sandbox
Data Scientists
social
Web
logs
web cloud
Data Virtualisation
SQL on
Hadoop
56
Using Hadoop As A Data Archive Means Data Can Be Kept
On-line, Analysed And Still Integrated With Data In The DW
DW
D
I
OLTP systems
Data Virtualisation
SQL on
Hadoop
Archived data
Archive unused data
or data > n years old
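The "data older than n years" archive rule can be sketched as a simple date cutoff. The field name order_date, the three-year threshold and the sample rows are assumptions for illustration. Deciding which data counts as unused would need usage statistics that are not modelled here.

```python
from datetime import date


def split_for_archive(rows, today, max_age_years):
    """Split rows into (keep in DW, archive to Hadoop) by an age cutoff."""
    try:
        cutoff = today.replace(year=today.year - max_age_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to the 28th.
        cutoff = today.replace(year=today.year - max_age_years, day=28)
    keep = [r for r in rows if r["order_date"] >= cutoff]
    archive = [r for r in rows if r["order_date"] < cutoff]
    return keep, archive


rows = [
    {"order_id": 1, "order_date": date(2010, 1, 5)},
    {"order_id": 2, "order_date": date(2013, 9, 1)},
    {"order_id": 3, "order_date": date(2011, 6, 18)},
]
keep, archive = split_for_archive(rows, today=date(2014, 6, 18), max_age_years=3)
print([r["order_id"] for r in archive])
```

Archived rows stay queryable through SQL on Hadoop, which is what keeps them on-line rather than on tape.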
57
SQL on
Hadoop
Big Data Governance – Data Sources, Sandboxes,
People, Data Access Security, Results Lineage….
Graph DBMS
MPP Analytical
RDBMS
Social
graph data Unstructured / semi-
structured content
DW
RDBMS Files
clickstream
Web logs
governance
58
Issues: Siloed Analytics - Different Tools to Manage and
Integrate Data For Each Type of Analytical Data Store
Analytical
tools
Data
management
tools
EDW
mart
Structured data
CRM ERP SCM
Silo
DW & marts
Streaming data
(markets, sensors)
Analytical
models
Silo
Analytical
tools/apps
Data
management
tools
Multi-structured
data
Silo
DW
Appliance
Advanced Analytics
(structured data)
Data
management
tools
Structured data
CRM ERP SCM
Analytical
tools
Silo
Analytical
tools/apps
Data
management
tools
NoSQL DB
e.g. graph DB
Silo
Multi-structured &
structured data
59
EDW
MDM System
DW & marts
NoSQL DB
e.g. graph DB
Advanced Analytic
(multi-structured data)
mart
DW
Appliance
Advanced Analytics
(structured data)
Need to Manage The Supply of Consistent Data Across
The Entire Analytical Ecosystem
Common Enterprise Information Management Tool Suite
Stream
processing
CRUD
Prod
Asset
Cust
actions
feeds sensors
XML,
JSON
RDBMS Files office docs social Cloud
clickstream
Web logs
web services
New
CRUD
Prod
Asset
Cust
New data types need to be supported by EIM tool suites
60
BI tools platform &
data visualisation
tools
Search
based
BI tools
Custom
MapReduce
applications
Map
Reduce
BI tools
Graph
Analytics
tools
A New Architecture for Analytics - The Intelligent
Business Strategies Extended Analytical Ecosystem
Enterprise Information Management Tool Suite
feeds sensors
XML,
JSON
RDBMS Files office docs social Cloud
clickstream
Web logs
web services
Event
processing
CRUD
Prod
Asset
Cust
EDW
MDM System
DW & marts
NoSQL DB
e.g. graph DB
Advanced Analytics
(multi-structured data)
mart
DW
Appliance
Advanced Analytics
(structured data)
actions
Filtered
data
Data Virtualisation and optimization
61
Conclusions
!  Business demand for new, more complex, high volume data is driving
the need for new analytical workloads beyond the data warehouse
!  Hadoop is a low cost analytical platform capable of supporting new
analytical workloads on multi-structured data
!  A key role for Hadoop is as a data hub and data refinery
!  The data refinery process requires data integration and cleansing to
scale to handle the volume, variety and velocity of complex
multi-structured data
!  Data scientists analyse big data as part of the data refining process to
produce new insights that can be added to what you already know
!  Hadoop is part of an extended analytical ecosystem with data
management tools supplying consistent data across all data stores
!  Data scientists, business analysts and information consumers need to
work together to deliver new insight for competitive advantage
Best Practices for Production Success
HQ
WORLDWIDE HADOOP TECHNOLOGY LEADER
UNIQUELY ADDRESSES BOTH
ANALYTIC AND OPERATIONAL USE CASES
500+ PAYING CUSTOMERS
MapR:
MapR: Best Product for Customer Success
Top Ranked Exponential Growth 500+ Customers
3X bookings Q1 ‘13 – Q1 ‘14
80% of accounts expand 3X
90% software licenses
<1% lifetime churn
>$1B
in incremental revenue
generated by 1 customer
FOUNDATION
Architecture Matters for Success
FOUNDATION
High Availability &
Data Protection
High performance
Multi-tenancy
Operational &
analytical workloads
Open standards
for integration
NEW APPLICATIONS SLAs TRUSTED INFORMATION LOWER TCO
Architecture Matters for Success
The Power of the Open Source Community
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Spark
Batch
Spark
Streaming
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning
&
coordination
Savannah*
Mahout
MLLib
ML, Graph
GraphX
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governance
Tez*
Accumulo*
Hive
Impala
Shark
Drill*
SQL
Sentry* Oozie ZooKeeper Sqoop
Knox* Whirr Falcon* Flume
Data
Integration
& Access
HttpFS
Hue
* Certification/support planned for 2014
MapR Distribution for Hadoop
•  High availability
•  Data protection
•  Disaster recovery
•  Standard file
access
•  Standard database
access
•  Pluggable services
•  Broad developer
support
•  Enterprise security
authorization
•  Wire-level
authentication
•  Data governance
•  Ability to support
predictive analytics,
real-time database
operations, and
support high arrival
rate data
•  Ability to logically
divide a cluster to
support different
use cases, job
types, user groups,
and administrators
•  2X to 7X higher
performance
•  Consistent, low
latency
Enterprise-grade Security Operational Performance Multi-tenancy Interoperability
Hadoop + Data Warehouse Architecture
FORTUNE 500 TELCO
Improve data services to customers without increasing enterprise architecture costs
OBJECTIVES
•  Provide cloud, security, managed services, data center, & comms
•  Report on customer usage, profiles, billing, and sales metrics
•  Improve service: Measure service quality and repair metrics
•  Reduce customer churn – identify and address IP network hotspots
CHALLENGES
•  Cost of ETL & DW storage for growing IP and clickstream data; >3 months
•  Reliability & cost of Hadoop alternatives limited ETL & storage offload
SOLUTION
•  MapR for data staging, ETL, and storage at 1/10th the cost
•  MapR provided smallest datacenter footprint with best DR solution
•  Enterprise-grade: NFS file management, consistent snapshots & mirroring
•  Data warehouse for mission-critical reporting and analysis
Business Impact
Hadoop + Data Warehouse = New, Deeper Insights for the Business
•  Increased scale to handle network IP and clickstream data
•  Freed up processing on DW to maintain reporting SLAs to business
•  Unlocked new insights into network usage and customer preferences
Q&A
Engage with us!
@mikeferguson1 – Intelligent Business Strategies
@swooledge – MapR Technologies
•  Learn more about Hadoop in your architecture: www.mapr.com/EDH
•  Upcoming Webinar series - www.mapr.com/resources/webinars
–  6/26 Talend – ETL in/for Hadoop
–  7/09 Syncsort – comScore & mainframe optimization
–  7/17 Rick van der Lans – SQL-on-Hadoop
–  7/23 Skytree – machine learning & analytics
–  7/30 Appfluent – DW usage monitoring & optimization
–  8/14 Tableau – data exploration & analysis on Hadoop
•  Contact / follow us

 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
How Spark is Enabling the New Wave of Converged Cloud Applications
How Spark is Enabling the New Wave of Converged Cloud Applications How Spark is Enabling the New Wave of Converged Cloud Applications
How Spark is Enabling the New Wave of Converged Cloud Applications
 

Último

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Último (20)

Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Best Practices for Using Hadoop as an Enterprise Data Hub

  • 1. Best Practices for Using Hadoop as an Enterprise Data Hub. Mike Ferguson – Intelligent Business Strategies; Steve Wooledge – MapR. June 18, 2014. © 2014 MapR Technologies
  • 2. 2 About Mike Ferguson Mike Ferguson is Managing Director of Intelligent Business Strategies Limited. As an analyst and consultant he specialises in business intelligence, data management and enterprise business integration. With over 32 years of IT experience, Mike has consulted for dozens of companies, spoken at events all over the world and written numerous articles. Formerly he was a principal and co-founder of Codd and Date Europe Limited – the inventors of the Relational Model, a Chief Architect at Teradata on the Teradata DBMS and European Managing Director of DataBase Associates. www.intelligentbusiness.biz mferguson@intelligentbusiness.biz Twitter: @mikeferguson1 Tel/Fax (+44)1625 520700
  • 3. The Hadoop Data Refinery and Enterprise Data Hub Mike Ferguson Managing Director Intelligent Business Strategies June 2014
  • 4. Topics
      • Data warehousing and the evolution of ETL processing
      • New data and new analytical workloads
      • Big data use cases driving business agendas
      • The unprecedented demand for customer insight
      • Challenges with new big data sources
      • Beyond the data warehouse – new platforms for new analytical workloads
      • The role of Hadoop in the modern analytical ecosystem
      • Introducing the Hadoop enterprise data hub and data refinery
      • Simplifying access to new big data insight using SQL on Hadoop
      • Integrating Hadoop into your analytical ecosystem
  • 5. For Many Years The Traditional Data Warehouse and BI Environment Has Been Used For Analysis & Reporting [Diagram labels: Operational systems; web; Data Integration/DQ; Data warehouse & data marts (DW); BI Tools Platform; Portal; Reports & analytics; Employees, Partners, Customers]
  • 6. The Evolution of Data Integration in Data Warehousing – From Hand Coded to ETL to ELT [Diagram: Evolution of Data Warehousing. Labels: hand coded ETL programs loading the DW; ETL servers loading the DW; ELT processing with generated SQL in the DW on MPP RDBMS systems]
  • 7. New Data Sources Have Emerged Inside And Outside The Enterprise That Business Now Wants To Analyse, e.g. RFID tags, sensor networks, weather data [Diagram labels: Front Office (Sales, Marketing, Service, Credit Verification); Back Office (HR, Finance, Planning, Procurement, Supply Chain); Operations; Product/service line 1, Product line 2, Product line 3, Product line 4, Product line n; Customers; Suppliers; Data volume, Data variety, Data velocity, Number of sources]
  • 8. Popular Big Data Analytic Applications – Web Data
      • Clickstream analytics
          – Site navigation behaviour (session) analysis: paths to buy, paths to abandonment, what else they looked at
          – Improve customer experience and conversion
          – Associate clicks with customers & prospects
      • Social network influencer analysis
          – Graph analytics for influencer behavioural impact analysis
          – ‘Target the influencer’ marketing campaign effectiveness
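The session analysis mentioned on this slide can be sketched in a few lines. This is a minimal, single-machine illustration only: the 30-minute inactivity timeout, the `sessionize` function and the sample click tuples are assumptions made for the example and do not come from the deck.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the deck

# Hypothetical click records: (visitor_id, unix_timestamp, page)
clicks = [
    ("v1", 1000, "/home"),
    ("v1", 1060, "/product/42"),
    ("v1", 1120, "/basket"),
    ("v2", 1010, "/home"),
    ("v1", 5000, "/home"),
    ("v1", 5030, "/checkout"),
]


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Group clicks per visitor, starting a new session after `gap` seconds of inactivity."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in clicks:
        by_visitor[visitor].append((ts, page))

    sessions = []
    for visitor, events in by_visitor.items():
        events.sort()  # order each visitor's clicks by timestamp
        current, last_ts = [], None
        for ts, page in events:
            if last_ts is not None and ts - last_ts > gap:
                sessions.append((visitor, current))  # close the previous session
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((visitor, current))
    return sessions


sessions = sessionize(clicks)
for visitor, path in sessions:
    print(visitor, " -> ".join(path))
```

Each resulting path is one navigation session, which is the unit that "paths to buy" and "paths to abandonment" analysis would then work on.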
  • 9. Popular Big Data Analytic Applications – Sensor Data For Improving Process Efficiency and Optimisation
      • Sustainability analytics e.g. energy optimisation
      • Supply/distribution chain optimisation
      • Asset management and field service optimisation
      • Manufacturing production line optimisation
      • Location based advertising (mobile phones)
      • Grid health monitoring
          – Electricity, water, mobile phone cell network…
      • Smart metering (collect data every 15 minutes)
      • Fraud
      • Healthcare – ITC vital signs, fit bits, …
      • Traffic optimisation
      WHAT ARE YOU PREPARED TO INSTRUMENT? (E.g. RFID tag)
  • 10. Popular Big Data Analytic Applications – Unstructured Data
      • Case management
      • Fault management and field service optimisation
      • “Voice of the customer”
      • Sentiment analytics
      • Competitor analysis
      • Media coverage analysis
      • Improve pharma drug trials
      Unstructured content is hard to analyse. How much is TEXT worth to your business?
  • 11. Big Data Analytics - Industry Use Case Examples
      • Financial Services: Improved risk decisions, KYC customer insight, auto programmatic trading, 360 view of financial crime, pre-trade decision support, real-time trade & corp action tagging for compliance and RT P&L, grow security services outsourcing, Reference Data Exchange
      • Utilities: Smart meter data analysis, pricing elasticity analysis, customer loyalty, sustainability, asset management
      • Telecommunications: Customer churn, network optimization analysis from device, sensor and GPS inputs, monetization of GPS and data
      • Manufacturing: Sensor data for next generation ‘smart’ products, production line optimisation, improved customer service and improved field service, distribution chain optimization, asset management
      • Insurance: “How you drive” insurance (sensors to reduce risk), broker document analysis (risk assessment)
      • Government: Smart cities (e.g. transportation optimisation), anti-terrorism, law enforcement
      • Logistics: Distribution optimisation, route optimisation,
  • 12. More Data Is Required To Get A Deeper Understanding of Customers
      • We now need
          – Transaction data
          – Data from touch points you own
          – Data from the touch points you don’t own
          – Interaction data: need to look at inbound interactions vs outbound interactions, and social interactions
          – Master data
          – Professional data e.g. profiles on LinkedIn
          – Internal and external event data
          – Competition data…
      • Then use analytics to understand and predict desire and propensity e.g. propensity to churn
  • 13. 13 Top Priorities - Improving Customer Experience Via Time Series Analysis of All Customer Interactions OMNI channel – analyse all customer interactions across all channels identity data behavioural data social data Customer “DNA”
  • 14. Customer Experience Management - Understanding Customer On-Line Behaviour is Mission Critical to Retention and Growth
      • Important new data sources for analysis for customer ‘DNA’
          – Clickstream data from web logs
          – Sentiment and social network influencer data
      [Diagram labels: Customer “DNA” (identity data, behavioural data, social data); On the web the customer is king; New competitors; More choice; Voice of the customer; On the move; Easy to find]
  • 15. Today Both Structured And Multi-Structured Data Are Needed For Deeper Insight
      • Multi-structured data (often un-modelled and may not be well understood): click stream web log data, customer interaction data, social interaction data, sensor data, rich media data (video, audio), external content, documents, internal web content, seismic data (oil & gas)
      • Structured data (often a schema is defined and data is well understood): OLTP system data, data warehouse data, personal data stores e.g. Excel, Access
      Data characteristics are changing - companies must deal with volume, variety and velocity
  • 16. Big Data Analytics Challenges Include The Analysis of Unstructured, Semi-structured and Structured Data. JSON data example: { "firstName": "Wayne", "lastName": "Rooney", "age": 25, "address": { "streetAddress": "21 Sir Matt Busby Way", "city": "Manchester", "country": "England", "postalCode": "M1 6DY" }, "phoneNumbers": [ { "type": "home", "number": "0161-123-1234" }, { "type": "mobile", "number": "07779-123234" } ] } [Other panels: Text data; Image data] Makes analysis more complex, with new analytics and visualisations needed
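To illustrate why nested, semi-structured records such as the JSON on this slide need extra preparation before conventional analysis, here is a minimal sketch that flattens the record into column-like key/value pairs using Python's standard `json` module. The `flatten` helper and the dotted column naming are assumptions chosen for this example, not something the deck prescribes.

```python
import json

record_text = """
{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    {"type": "home", "number": "0161-123-1234"},
    {"type": "mobile", "number": "07779-123234"}
  ]
}
"""


def flatten(value, prefix=""):
    """Hypothetical helper: turn nested dicts and lists into dotted-key pairs."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = value  # leaf value: record it under its full path
    return flat


record = json.loads(record_text)
row = flatten(record)
for column, val in sorted(row.items()):
    print(column, "=", val)
```

The repeated `phoneNumbers` group becomes numbered columns here; a real pipeline might instead split it into a separate child table, which is a design choice this sketch does not address.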
  • 17. Increased Data and Analytical Complexity Has Created A Need For A New Role – The Data Scientist. Data Science is the process of investigative / exploratory analysis of multi-structured data to discover and produce new business insights. (Image sources: Wikipedia; www.computing.co.uk)
  • 18. 18 People In Different Roles In The Analytical Landscape Need To Work Together To Deliver Business Value Exploratory analysis Predictive / statistical model producer Business Analyst Business Manager / Operations worker / Customer Data Scientist Model consumer Data visualisation Information Producer • Build reports • Build and publish dashboards Information consumer Decision maker Action taker Strategic Business Objective Priority KPI Current KPI Value What is +1% worth? KPI Target Executive Accountable Business Initiatives (projects) Budget Allocation Action Plan 1 $$$ Project Project Project £ x Million 2 3 4 Business Strategy – strategic objectives and targets including sustainability targets sandbox
  • 19. 19 Data Science Produces New Insights For Business Analysts Who Produce Actionable BI For Front Office Decision Makers Business Analyst Marketing Manager / Marketing, Sales and Service workers Data Scientist Data Quality Forecasting Segmentation Models Customer Lifetime Value Social Network Strategy Creation Performance & Effectiveness Reporting Direct Mail Understand Customer Behavior & Navigation Marketing Performance & Reporting Campaign Planning Financial Planning Creative Materials Marketing Attribution Operations Management Channel Efficiency Sentiment & Influence Dynamic Content Re-marketing Web Call Center Live Event Broadcast Media Mobile/ SMS Social Email Industry Specific Big Data Analytics Traditional DW/BI Workflow & Approvals New insights Actionable BI
  • 20. Big Data Analytics Has Taken Us Beyond The Traditional DW – New Big Data Analytical Workloads
      1. Analysis of data in motion
      2. Complex analysis of structured data
      3. Exploratory analysis of un-modeled multi-structured data
      4. Graph analysis e.g. social networks
      5. Accelerating ETL and analytical processing of un-modeled data to enrich data in a data warehouse or analytical appliance
      6. The storage and re-processing of archived data
  • 21. 21 The Changing Landscape – We Now Have Different Platforms Optimised For Different Analytical Workloads Big Data workloads result in multiple platforms now being needed for analytical processing Streaming data Hadoop data store Data Warehouse RDBMS NoSQL DBMS EDW DW & marts NoSQL DB e.g. graph DB Advanced Analytic (multi-structured data) mart DW Appliance Advanced Analytics (structured data) Analytical RDBMS Graph analysis Investigative analysis, Data refinery Traditional query, reporting & analysis Real-time stream processing & decision m’gmt Data mining, model development
  • 22. 22 Hadoop Is A Key Platform In Big Data Analytics – Data Can Be Accessed Via Multiple APIs Java MapReduce APIs to HDFS, HBase, Cascading file file file file file file file file file file file file file file webHDFS (An HTTP interface to HDFS has REST APIs) HDFS file file file file YARN PIG latin scripts SQL Vendor SQL on Hadoop engine MapReduce Application index indexIndex partition SQL BI Tools & Applications Storm Application YARN Tez or SparkMapReduce HBase HDFS API
  • 23. De Facto Standard APIs Allow Hadoop Components To Be Replaced, e.g. A Faster, More Secure File System Than HDFS [Diagram labels: BI Tools & Applications; SQL; Vendor SQL on Hadoop engine; MapReduce Application; PIG latin scripts; Storm Application; Java MapReduce APIs to HDFS, HBase, Cascading; webHDFS (an HTTP interface to HDFS with REST APIs); Tez or Spark; MapReduce; YARN; HBase; index, index partition; HDFS API; Vendor Specific File System (e.g. ); files]
  • 24. Apache Hadoop Components (Component: Description)
      • Hadoop HDFS: A distributed file system that partitions files across multiple machines for high-throughput access to application data. The HDFS API allows vendors to replace HDFS with an alternative
      • Hadoop YARN: A framework for job scheduling and cluster resource management
      • Hadoop MapReduce: A programming framework for distributed batch processing of large data sets distributed across multiple servers
      • Avro: A serialization system that creates & reads files in a format containing both JSON data definitions & the data itself for dynamic interpretation of the data by applications
      • Hive: A data warehouse system for Hadoop that facilitates data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. HiveQL programs are converted into MapReduce programs
      • HBase: An open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable
      • Pig: A high-level data-flow language for expressing MapReduce programs for processing and analysing large HDFS distributed data sets
      • Mahout: A scalable machine learning and data mining library
      • Oozie: A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs)
      • Spark: A general purpose engine for large scale in-memory data processing. It supports analytical applications that wish to make use of stream processing, SQL access to columnar data and analytics on distributed in-memory data
      • Zookeeper: A high-performance coordination service for distributed applications
  • 25. The Role of Hadoop - Data Is Arriving Faster Than We Can Consume It – How Good Is Your Filter? [Diagram labels: DATA; FILTER; Enterprise; Enterprise systems]
  • 26. New Requirement – The Managed Hadoop Enterprise Data Hub [Diagram labels: sources (XML, JSON, web logs, other data); Load data into Hadoop; Data Reservoir (raw data); Data Refinery with ELT workflow: Parse & Prepare Data in Hadoop (MapReduce), Transform & Cleanse Data in Hadoop (MapReduce), Discover data in Hadoop; sandboxes; New high value insights (pub/sub); contains clean, high value data; EDW, Graph DBMS, DW appliance]
  • 27. What’s In An Enterprise Data Hub?
      • A managed data reservoir (raw data)
          – Organised capture of multi-structured data
          – Includes real-time data capture
          – May include operational reporting
      • A governed data refinery
          – Data integration and cleansing at scale
          – Analytical sandboxes to discover high value data
      • Published, protected and secure high value insights
      • Long-term storage of archived data from data warehouses
  • 28. Real-time Data Capture – E.g. MapR Allows Web Log Data To Be Directly Streamed/Stored in Hadoop. MapR Direct Access NFS allows web log files to be stored directly on their Hadoop file system so that click stream data is captured in real time. Mount command shown on the slide: # mount localhost:/mapr /mapr [Diagram labels: Web Servers; web log files; Direct Access NFS; MapR Distribution for Hadoop; HDFS; files]
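The practical consequence of an NFS-mounted cluster is that a web server can write its logs with ordinary file I/O. The sketch below shows that idea under stated assumptions: `append_log_line`, the `weblogs/access.log` path and the sample log line are hypothetical, and a temporary directory stands in for the mount point (such as the `/mapr` directory in the slide's mount command) so the example runs anywhere.

```python
import os
import tempfile


def append_log_line(mount_root, relative_path, line):
    """Append one log line to a file under `mount_root` using plain file I/O."""
    path = os.path.join(mount_root, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line.rstrip("\n") + "\n")
    return path


# A temporary directory stands in for the NFS mount point in this demo.
demo_root = tempfile.mkdtemp()
log_path = append_log_line(
    demo_root,
    "weblogs/access.log",  # hypothetical location, not from the deck
    '127.0.0.1 - - "GET /home HTTP/1.1" 200',
)

with open(log_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```

No Hadoop client library appears in the code; that is the point of the NFS approach, although behaviour under concurrent writers and failures depends on the platform and is not modelled here.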
  • 29. High Volume Data Capture - Column Family Databases
      • Suitable for fast capture of large amounts of sparse, volatile data
          – Very fast capture and can hold vast amounts of data
          – Billions of rows containing thousands or millions of columns
      • Provide column-centric storage; wide de-normalised big tables can also help simplify operational reporting if used with SQL-on-Hadoop, e.g. SQL access to HBase
      • Allow you to
          – Group together related columns into column families
          – Design column families to optimize the most common queries
          – Retrieve columnar data for multiple entities by iterating through a column family
          – Shard rows in a column family and distribute across many servers
          – Create indexes and secondary indexes
          – Support schema variance: columns in a column family can vary for every row
  • 30. NoSQL Column Family Databases - HBase. Example rows: Row 1: Column A = value, Column B = value, Column C = value; Row 2: Column X = value, Column Y = value, Column Z = value. HBase storage architecture: an HMaster and several HRegionServers. Regions (partitions) are created automatically as tables grow. HBase allows applications to directly read and write data
  • 31. 31 Column Families Can Be Stored In Different Files And Queries Will Only Retrieve The Column Family Needed Source: Data Access for Highly-Scalable Solutions : Using SQL, NoSQL, and Polyglot Persistence, McMurtry, Oakley, Sharp, Subramanian, Zhang Portfolio.* means all columns in the Portfolio column family Data about a customer and their stock purchases are partitioned vertically by column family Column family data can also be compressed
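The column family idea described on the preceding slides can be sketched with a toy in-memory model: each family is stored separately, rows may carry different columns, and a query touches only the family it needs. This is plain Python for illustration and is not the HBase client API; the `ColumnFamilyTable` class, the family names and the sample values are all assumptions.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy in-memory model of a column family store (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        # Each family is kept in its own structure: family -> row key -> {column: value}
        self._data = {family: defaultdict(dict) for family in families}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._data[family][row_key][column] = value

    def get(self, row_key, family):
        """Read one row's columns from a single family only."""
        return dict(self._data[family].get(row_key, {}))

    def scan_family(self, family):
        """Iterate over all rows of one family in row-key order."""
        for row_key in sorted(self._data[family]):
            yield row_key, dict(self._data[family][row_key])


table = ColumnFamilyTable(["profile", "portfolio"])
table.put("cust1", "profile", "name", "Ann")
table.put("cust1", "portfolio", "ACME", 100)
table.put("cust2", "portfolio", "ACME", 5)
table.put("cust2", "portfolio", "INIT", 20)  # cust2 has a column cust1 lacks

rows = list(table.scan_family("portfolio"))
print(rows)
```

Scanning `portfolio` never reads `profile` data, which mirrors the slide's point that only the column family needed is retrieved; sharding across servers, compression and persistence are left out of the sketch.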
  • 32. 32 Fast Data Capture – MapR-DB Is A High Speed Version of HBase Built Into The MapR Data Platform HBase API Source: MapR
  • 33. 33 Enterprise Data Hub – We Need A Data Refinery To Process And Clean Complex Data Image source: http://www.hollyfrontier.com/navajo/
  • 34. Evolution of Big Data Integration Is Following The Same Cycle as it Did in Data Warehousing [Diagram: Evolution of Big Data Integration. Labels: hand coded ETL programs loading Hadoop; ETL servers loading Hadoop; ELT processing with generated MapReduce in Hadoop]
  • 35. Scaling ETL In A Data Refinery By Generating Pig, Hive or 3GL MapReduce Code for In-Hadoop ELT Processing [Diagram labels: Data Cleansing and Integration Tool; Extract, Parse, Clean, Transform, Analyse, Load; Insights]
      • Option 1: ETL tool generates HQL, or converts generated SQL to HQL
      • Option 2: ETL tool generates Pig Latin (the compiler converts every transform to a MapReduce job)
      • Option 3: ETL tool generates 3GL MapReduce code
      • Note: Generating native MapReduce code instead of HiveQL or Pig Latin would likely perform faster because there is no need to translate into MapReduce
      • Also, HiveQL is a subset of SQL, so check how ETL tools that generate HiveQL handle complex transformations. HiveQL on its own may not be enough, e.g. Hive UDFs may be needed
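As a rough illustration of Option 1, the sketch below turns a small transformation specification into a HiveQL `INSERT OVERWRITE TABLE ... SELECT` statement as a string. It is not how any particular ETL tool works: `generate_hiveql`, the table names `raw_orders` and `clean_orders`, and the column expressions are hypothetical, and the generated text is only printed, not executed against Hive.

```python
def generate_hiveql(source, target, column_exprs, where=None):
    """Build a HiveQL INSERT ... SELECT statement from a simple mapping.

    column_exprs maps each target column name to the expression that computes it.
    """
    select_list = ",\n  ".join(
        f"{expr} AS {name}" for name, expr in column_exprs.items()
    )
    lines = [
        f"INSERT OVERWRITE TABLE {target}",
        f"SELECT\n  {select_list}",
        f"FROM {source}",
    ]
    if where:
        lines.append(f"WHERE {where}")
    return "\n".join(lines)


# Hypothetical cleansing rules for an orders feed.
column_exprs = {
    "order_id": "order_id",
    "email": "lower(trim(email))",
    "amount": "cast(amount AS double)",
}

statement = generate_hiveql(
    "raw_orders", "clean_orders", column_exprs, where="order_id IS NOT NULL"
)
print(statement)
```

Transformations that cannot be expressed with built-in functions like these are where the slide's caution applies: a generator would then have to fall back on UDFs or another execution option.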
  • 36. Need to Parse & Extract From Multi-Structured Data While Integrating Data In A Big Data Environment [Diagram labels: E-mail (semi-structured); Text (unstructured); Extract, Parse, Transform, Load; …]
  • 37. 37 Sandboxes In The Data Refinery - Data Science Teams Need To Conduct Exploratory Analysis on Multi-Structured Data Click stream web log data Customer interaction data Social interaction data (e.g. Twitter, Facebook) Sensor data Rich media data (video, audio) External web content Documents Internal web content Seismic data (oil & gas) Investigative / Exploratory Analysis C R U D Asset Customer Product MDM System EDW mart new business insights sandbox Multi-structured data Historical Data archived DW datamaster data Data Scientists
  • 38. 38 In-Hadoop Analytics In A Data Refinery – Example Technologies !  Hadoop MapReduce, Tez or Spark analytic applications with custom analytics •  Pig, Java, Python, Scala, Cascading….. !  Hadoop MapReduce, Tez or Spark analytic applications using pre-built Hadoop analytics e.g. Mahout, Spark MLlib •  Several analytical algorithms for use in analysis !  Revolution Analytics RevoScaleR !  SAS Analytics and In-Memory Statistics for Hadoop !  … many more Analytical tools Data management tools
  • 39. 39 In-Hadoop Analytics: - Mahout Supports A Number Of Analytic Techniques !  Collaborative Filtering !  User and Item based recommenders !  K-Means and Fuzzy K-Means clustering !  Mean Shift clustering !  Dirichlet process clustering !  Latent Dirichlet Allocation !  Singular value decomposition !  Parallel Frequent Pattern mining !  Complementary Naive Bayes classifier !  Random forest decision tree based classifier https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms Now runs on Spark as well as MapReduce
  • 40. 40 Expediting The Data Refinery Process On Hadoop With Automated Analysis – From ETL to Analytical Workflows Parse & Prepare Data in Hadoop (MapReduce) Transform & Cleanse Data in Hadoop (MapReduce) Discover data in Hadoop ELT workflow other data Raw data Load data into Hadoop Data Refinery EDW Graph DBMS DW appliance Automated Invocation of Custom Built & Pre-built Analytics on Hadoop contains clean, high value data New high value Insights (pub/sub)
  • 41. 41 High Value Insights Produced In A Hadoop Data Hub Can Be Brought Into A DW to Enrich What We Already Know Cloud Data HDFS Extract DW DI Map/Reduce data transformation and analytics applications Transform e.g. Pig, IBM JAQL Cloud Data e.g. Deriving insight from huge volumes of social web content on sites like Twitter, Facebook, Digg, MySpace, TripAdvisor, LinkedIn… for sentiment analytics Hundreds of terabytes up to petabytes new insights Operational systems
  • 42. 42 Making New Insights Available To Business Analysts Via SQL Access To Big Data - Options: SQL access to big data in Hadoop; SQL access to big data (and the DW) via a data virtualisation server; SQL access to big data in an analytical RDBMS; SQL access to streaming data in motion
  • 43. 43 SQL Access to Hadoop Is Needed To Allow Hadoop Data To Be Accessed By Users With Self-Service BI Tools Self-Service BI BI Tool(s) e.g. Visual Discovery tools Business Analyst or ‘budding’ Data Scientist personal & office data Predictive models community Publish / Share Consume / Enhance / Re-publish collaborate Transaction systems DW HDFS / HBase / Hive e.g. Hive interface
  • 44. 44 SQL access to Big Data? Key Questions That May Influence If SQL Access to Big Data Is A Good Choice or What SQL Option to Take What kind of analysis? Text analysis, Graph analysis, Machine Learning, reporting What kind of data type(s) do you need to analyse? - structured, unstructured, semi-structured What kind of data volumes do you want to analyse? Is the data at rest or is it real-time streaming data in motion? What analytical functions can you invoke on big data from SQL? Join with other data in another data store? How many concurrent users? Performance and scalability of complex queries and analytical functions (need parallelism) Is the requirement for interactive, exploratory, or real-time analysis? Data Analytical Workload
  • 45. 45 SQL On Hadoop Initiatives. Key Questions: What analytic functions are provided? How can analytic functions be extended? Can you join to data outside of Hadoop? Are these SQL on Hadoop options suitable for reporting and analysis, interactive discovery, exploratory analysis or all of these? Vendor: SQL on Hadoop Initiative. AMPLab (UC Berkeley): Shark (forked Hive at V0.9) or SparkSQL; Apache Hadoop: Hive; Actian: Vortex (Actian Vector on Hadoop data nodes); CitusDB: CitusDB (uses external tables); Cloudera: Impala / Parquet; Concurrent: Lingual (SQL on Cascading); Hadapt: Schemaless SQL; Hortonworks: Stinger / ORC (Hive 13); HP: Vertica on Hadoop; IBM: BigSQL (SQL on HDFS & HBase); InfiniDB: InfiniDB on Apache Hadoop; Jethro Data: JethroData; MapR: Apache Drill; Microsoft: Hive 13; Pivotal: HawQ (uses external tables via PXF); Teradata: SQL-H; Splice Machine: Splice Machine (SQL engine on HBase); Salesforce.com: Phoenix (SQL engine on HBase); Attivio: Active Intelligence Engine (SQL access to search indexes on Hadoop data)
  • 46. 46 SQL on Hadoop – Apache Drill Can Access HDFS And HBase Data BI Tool(s) e.g. Visual Discovery tools Business Analyst or Data Scientist Drill Analytic Application SQL SQL Data Scientist HDFS HBase MapR Distribution for Hadoop Apache Drill does not use MapReduce MongoDB / Cassandra sensors XML, JSON Data entering HBase
  • 47. 47 Apache Drill Distributed Query Processing – A Storage Independent Drillbit MPP Architecture Each drillbit is capable of receiving queries from applications and BI tools - there is no master in this architecture Multiple drillbits are involved in parallel query processing on distributed data Supports Apache HDFS, Apache HBase, MapR-FS, MapR-DB, Amazon S3
  • 48. 48 SQL on Hadoop Example – Apache Drill Supports Query of Self-Describing Data Without a Schema JSON Source: MapR
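To make the idea of querying self-describing data without a schema concrete, here is a small stand-in written in plain Python. It does not use Drill or its API: the helpers `get_path` and `select` and the sample records are hypothetical. The structure is discovered from each JSON record as it is read, so records with different fields can be queried together, and a field that is missing from a record simply comes back as `None`.

```python
import json

# Hypothetical newline-delimited JSON; the records do not share one schema.
RAW = """
{"user": {"id": 1, "name": "ann"}, "event": "click", "page": "/home"}
{"user": {"id": 2}, "event": "purchase", "amount": 19.99}
{"user": {"id": 1, "name": "ann"}, "event": "purchase", "amount": 5.0}
"""


def get_path(record, path):
    """Follow a dotted path such as 'user.name'; return None if it is absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select(lines, paths, where=None):
    """Yield one tuple per matching record, projecting the requested paths."""
    for line in lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if where is None or where(record):
            yield tuple(get_path(record, p) for p in paths)


# Roughly: SELECT user.id, user.name, amount WHERE event = 'purchase'
rows = list(
    select(
        RAW,
        ["user.id", "user.name", "amount"],
        where=lambda r: r.get("event") == "purchase",
    )
)
print(rows)
```

No table definition is declared anywhere; the projection and the filter are resolved against whatever fields each record happens to carry.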
  • 49. 49 SQL on Hadoop – What Should The Schema Look Like? Star schema? Snowflake schema? De-normalised schema? Other? (Diagram: a collection of files)
  • 50. 50 Hadoop Storage Is Independent of Any SQL Engine Accessing HDFS - Multiple SQL Engines Can Coexist On The Same Data. Diagram: several SQL engines over files in HDFS, with YARN frameworks: Batch (MapReduce), Interactive (Tez), On-line (HBase), Streaming (Storm,..), Graph (Giraph), In-memory (Spark), HPC MPI (OpenMPI), Other (Search,.). Storage is independent of any SQL engine !  Key points about Hadoop •  It is possible to have MULTIPLE SQL engines on the same data •  Different SQL engines run on different Hadoop frameworks (M/R, Tez, Spark) or on no framework at all i.e. directly access HDFS or HBase data
  • 51. 51 Relational DBMS / Hadoop Integration – Several Vendors Have Integrated RDBMS with Hadoop to Run Analytics Relational DBMS External Polymorphic table function(s) HDFS / HBase / Hive SQL, XQuery RDBMS optimizer handles transparent access to external analytical platforms on behalf of the user RDBMS and Hadoop could be deployed on the same hardware cluster (preferred) or on different hardware clusters Allows join across data in a single RDBMS and Hadoop
  • 52. 52 Relational DBMS / Hadoop Integration Example - HP Vertica and MapR Source: MapR
  • 53. 53 Self-Service BI Self-service Data Discovery & Visualisation or Dashboard Server Business analyst Data Virtualization and Optimization personal & office data Predictive models Transaction systems Data Management Tools (ETL, DQ, etc.) DW Self-Service Access To Big Data Via Data Virtualization BUT what about optimization? Can the data virtualisation server push down analytics to underlying platforms to make them do the work?
  • 54. 54 New Insights Can Be Added Into A Data Warehouse To Enrich What You Already Know DW D I new insights Operational systems e.g. Deriving insight from social web sites like for sentiment analytics sandbox Data Scientists social Web logs web cloud ELT
  • 55. 55 Alternatively New Insights In Hadoop Can Be Integrated With A DW Using Data Virtualization To Provide Enriched Information DW D I e.g. Deriving insight from social web sites like for sentiment analytics new insights OLTP systems sandbox Data Scientists social Web logs web cloud Data Virtualisation SQL on Hadoop
  • 56. 56 Using Hadoop As A Data Archive Means Data Can Be Kept On-line, Analysed And Still Integrated With Data In The DW DW D I OLTP systems Data Virtualisation SQL on Hadoop Archived data Archive unused or data > n years
  • 57. 57 Big Data Governance – Data Sources, Sandboxes, People, Data Access Security, Results Lineage…. Governance is marked across the diagram elements: SQL on Hadoop, Graph DBMS, MPP Analytical RDBMS, Social graph data, Unstructured / semi-structured content, DW, RDBMS, Files, clickstream, Web logs
  • 58. 58 Issues: Siloed Analytics - Different Tools to Manage and Integrate Data For Each Type of Analytical Data Store Analytical tools Data management tools EDW mart Structured data CRM ERP SCM Silo DW & marts Streaming data (markets, sensors) Analytical models Silo Analytical tools/apps Data management tools Multi-structured data Silo DW Appliance Advanced Analytics (structured data) Data management tools Structured data CRM ERP SCM Analytical tools Silo Analytical tools/apps Data management tools NoSQL DB e.g. graph DB Silo Multi-structured & structured data
  • 59. 59 Need to Manage The Supply of Consistent Data Across The Entire Analytical Ecosystem. Common Enterprise Information Management Tool Suite. Data sources: actions, feeds, sensors, XML, JSON, RDBMS, Files, office docs, social, Cloud, clickstream, Web logs, web services. Analytical data stores: Stream processing, MDM System (C R U D: Prod, Asset, Cust), EDW, DW & marts, NoSQL DB e.g. graph DB, Advanced Analytics (multi-structured data), mart, DW Appliance Advanced Analytics (structured data). New data types need to be supported by EIM tool suites
  • 60. 60 A New Architecture for Analytics - The Intelligent Business Strategies Extended Analytical Ecosystem. Tools: BI tools platform & data visualisation tools, Search based BI tools, Custom MapReduce applications, MapReduce BI tools, Graph Analytics tools. Data Virtualisation and optimization. Data stores: MDM System (C R U D: Prod, Asset, Cust), EDW, DW & marts, NoSQL DB e.g. graph DB, Advanced Analytics (multi-structured data), mart, DW Appliance Advanced Analytics (structured data). Event processing, actions, Filtered data. Enterprise Information Management Tool Suite. Data sources: feeds, sensors, XML, JSON, RDBMS, Files, office docs, social, Cloud, clickstream, Web logs, web services
  • 61. 61 Conclusions !  Business demand for new, more complex, high volume data is driving the need for new analytical workloads beyond the data warehouse !  Hadoop is a low cost analytical platform capable of supporting new analytical workloads on multi-structured data !  A key role for Hadoop is as a data hub and data refinery !  The data refinery process requires data integration and cleansing to scale to handle the volume, variety and velocity of complex multi-structured data !  Data scientists analyse big data as part of the data refining process to produce new insights that can be added to what you already know !  Hadoop is part of an extended analytical ecosystem with data management tools supplying consistent data across all data stores !  Data scientists, business analysts and information consumers need to work together to deliver new insight for competitive advantage
  • 62. 62 Best Practices for Production Success
  • 63. 63 HQ WORLDWIDE HADOOP TECHNOLOGY LEADER UNIQUELY ADDRESSES BOTH ANALYTIC AND OPERATIONAL USE CASES 500+ PAYING CUSTOMERS MapR:
  • 64. 64 MapR: Best Product for Customer Success Top Ranked Exponential Growth 500+ Customers 3X bookings Q1 ‘13 – Q1 ‘14 80% of accounts expand 3X 90% software licenses <1% lifetime churn >$1B in incremental revenue generated by 1 customer
  • 65. 65 FOUNDATION Architecture Matters for Success
  • 66. 66 FOUNDATION High Availability & Data Protection High performance Multi-tenancy Operational & analytical workloads Open standards for integration NEW APPLICATIONS SLAs TRUSTED INFORMATION LOWER TCO Architecture Matters for Success
  • 67. 67 The Power of the Open Source Community Management MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & coordination Savannah* Mahout MLlib ML, Graph GraphX MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Shark Drill* SQL Sentry* Oozie ZooKeeper Sqoop Knox* Whirr Falcon* Flume Data Integration & Access HttpFS Hue * Certification/support planned for 2014
  • 68. 68 MapR Distribution for Hadoop Management MapR Data Platform APACHE HADOOP AND OSS ECOSYSTEM Security YARN Pig Cascading Spark Batch Spark Streaming Storm* Streaming HBase Solr NoSQL & Search Juju Provisioning & coordination Savannah* Mahout MLlib ML, Graph GraphX MapReduce v1 & v2 EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS Workflow & Data Governance Tez* Accumulo* Hive Impala Shark Drill* SQL Sentry* Oozie ZooKeeper Sqoop Knox* Whirr Falcon* Flume Data Integration & Access HttpFS Hue * Certification/support planned for 2014 •  High availability •  Data protection •  Disaster recovery •  Standard file access •  Standard database access •  Pluggable services •  Broad developer support •  Enterprise security authorization •  Wire-level authentication •  Data governance •  Ability to support predictive analytics, real-time database operations, and support high arrival rate data •  Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators •  2X to 7X higher performance •  Consistent, low latency Enterprise-grade Security Operational Performance Multi-tenancy Interoperability
  • 69. 69 Hadoop + Data Warehouse Architecture Improve data services to customers without increasing enterprise architecture costs •  Provide cloud, security, managed services, data center, & comms •  Report on customer usage, profiles, billing, and sales metrics •  Improve service: Measure service quality and repair metrics •  Reduce customer churn – identify and address IP network hotspots •  Cost of ETL & DW storage for growing IP and clickstream data; >3 months •  Reliability & cost of Hadoop alternatives limited ETL & storage offload •  MapR for data staging, ETL, and storage at 1/10th the cost •  MapR provided smallest datacenter footprint with best DR solution •  Enterprise-grade: NFS file management, consistent snapshots & mirroring •  Data warehouse for mission-critical reporting and analysis OBJECTIVES CHALLENGES SOLUTION Hadoop + Data Warehouse = New, Deeper Insights for the Business •  Increased scale to handle network IP and clickstream data •  Freed up processing on DW to maintain reporting SLA’s to business •  Unlocked new insights into network usage and customer preferences Business Impact FORTUNE 500 TELCO
  • 70. 70 Q&A Engage with us! @mikeferguson1 – Intelligent Business Strategies @swooledge – MapR Technologies •  Learn more about Hadoop in your architecture: www.mapr.com/EDH •  Upcoming Webinar series - www.mapr.com/resources/webinars –  6/26 Talend – ETL in/for Hadoop –  7/09 Syncsort – comScore & mainframe optimization –  7/17 Rick van der Lans – SQL-on-Hadoop –  7/23 Skytree – machine learning & analytics –  7/30 Appfluent – DW usage monitoring & optimization –  8/14 Tableau – data exploration & analysis on Hadoop •  Contact / follow us