Hadoop and NoSQL joining forces by Dale Kim of MapR

© 2014 MapR Technologies
Hadoop and NoSQL Joining Forces

Topics
Big Data, Hadoop, and NoSQL
The In-Hadoop Advantage
NoSQL-on-Hadoop in Action
Other In-Hadoop Examples
Integrating with SQL

Big Data is Overwhelming Traditional Systems
[Diagram: Enterprise Data Architecture. Enterprise users and outside sources connect to operational systems and analytical systems, each with its own production requirements.]
Production requirements, operational systems:
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
Production requirements, analytical systems:
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery

Scaling on Traditional Technologies
• Data volume, velocity: Scale up to bigger, faster machines
• Data variety: Extensive data modeling and ETL
[Chart: data volume/velocity (low to high) against data variety (low to high)]

Scaling on Newer Technologies
• Data volume, velocity: Scale out with commodity hardware
• Data variety: Use the right tool for unstructured, multi-structured, semi-structured, non-relational data
[Chart: NoSQL placed along data volume/velocity (low to high) and data variety (low to high)]

Hadoop and NoSQL Relieve the Pressure from Enterprise Systems
[Diagram: Hadoop + NoSQL alongside the operational systems and analytical systems that serve enterprise users]
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
Keys for Production Success
1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics

You Already Know:
• NoSQL is a class of databases that specialize in:
– Scale-out on commodity servers – no application-level sharding
– Flexible data models – no fixed schema required
• Hadoop is a distributed platform designed for:
– Storing/processing huge volumes of data cost-effectively
– Spreading work across many servers (“divide and conquer”)
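The "divide and conquer" idea can be sketched in a few lines. This is a toy illustration only, not Hadoop's actual API: threads stand in for servers, and the function names (`split_into_chunks`, `count_words`, `merge_counts`) and sample data are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_workers):
    """Deal records round-robin into chunks; each chunk stands in for one server's data."""
    return [records[i::num_workers] for i in range(num_workers)]


def count_words(chunk):
    """Map step: a worker counts words in its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Reduce step: combine the per-worker partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def distributed_word_count(records, num_workers=3):
    chunks = split_into_chunks(records, num_workers)
    # Threads are a hypothetical stand-in for separate servers.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge_counts(partials)


logs = [
    "hadoop stores data",
    "nosql serves data",
    "hadoop and nosql join forces",
]
totals = distributed_word_count(logs)
```

Each worker touches only its own chunk, so adding workers (servers) spreads the same job over more machines without changing the logic.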
Before we continue, let’s take a quick look back.

What Would (Did) Google Do?
Google’s operational data store (BigTable) has enabled multiple revolutions within the company:
(1) The transition from batch to real-time
• 2004: Web index is batch (GFS/MapReduce)
• 2010: Web index is real-time (BigTable)
(2) The explosion in operational applications
Timeline: 2003 GFS, 2004 MapReduce, 2006 BigTable

Operations Vs. Analytics
Operations (Databases)
• Real-time
• Reads/writes/updates
• Current/recent data
• Updated regularly
• Fast inserts/updates
• Large volumes of data
Analytics (Hadoop)
• Batch
• Reports/Computations
• Historical data
• Generally non-volatile
• Fast retrievals
• Even larger volumes of data
But is the data different?

Handling Multiple Workloads
• Operational: mobile and web application servers backed by an operational NoSQL DBMS
• Analytics: Hadoop, for data exploration (SQL), a Customer 360 dashboard, and churn analysis (predictive analytics)
• Batch import/export moves data between the two systems

In-Hadoop Databases
• Single cluster
• High performance, low latency
• Large-scale analytics
• Enterprise-grade HA/DR
• Unified file and table administration
[Diagram: one cluster serving real-time and operational workloads (mobile application server, web application server, real-time ad targeting, product/service optimization and personalization) and actionable analytics (data exploration (SQL), Customer 360 dashboard, churn analysis (predictive analytics))]

Separate Clusters Versus Single Cluster
Separate Hadoop and Database
• Delays analyzing live data
• Network traffic
– Heavy bandwidth usage
– Heavy cleanup upon error
• Complexity
– Higher maintenance, risk of error
– More HA/DR administration
– Risk to SLAs
• Unnecessarily duplicated resources
Consolidated Deployment
• Real-time analysis/computation
• Data locality
– Reduced bandwidth utilization
– Efficient divide-and-conquer analysis
• Architectural simplicity
– Lower risk of error
– Lower administrative overhead
• No unnecessary data/hardware duplication (except for HA/DR)
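The bandwidth argument can be made concrete with a toy comparison. This is a sketch under assumed data: the node contents, the `(region, amount)` row layout, and both function names are hypothetical, and "items moved" is only a rough proxy for network traffic.

```python
from collections import Counter


def ship_all_rows(nodes):
    """Separate clusters: every row crosses the network before it is aggregated."""
    rows_moved = sum(len(rows) for rows in nodes)
    totals = Counter()
    for rows in nodes:
        for region, amount in rows:
            totals[region] += amount
    return totals, rows_moved


def aggregate_where_data_lives(nodes):
    """Single cluster: each node pre-aggregates locally and ships only partial sums."""
    totals = Counter()
    items_moved = 0
    for rows in nodes:
        local = Counter()
        for region, amount in rows:
            local[region] += amount
        items_moved += len(local)  # one partial sum per region seen on this node
        totals.update(local)
    return totals, items_moved


# Hypothetical sales rows stored on three database nodes.
nodes = [
    [("west", 10), ("west", 5), ("east", 7)],
    [("east", 3), ("east", 4), ("west", 1)],
    [("north", 2), ("north", 8), ("north", 6)],
]

shipped_totals, shipped_count = ship_all_rows(nodes)
local_totals, local_count = aggregate_where_data_lives(nodes)
```

Both approaches produce the same totals, but the local version moves 5 partial sums instead of 9 rows; the gap widens as rows per node grow while the number of distinct keys stays small.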

Databases on Direct Attached Storage (DAS)
Advantages
• Fast local file access
• Lower cost vs. SAN/NAS

Databases on Networked Storage (SAN/NAS)
Advantages
• Snapshot/backup
• Easy capacity expansion
• Disaster recovery
• Improved disk utilization
• Seamless maintenance
• Reliable

Databases on Hadoop (“In-Hadoop”)
Advantages
• Benefits of DAS
• Reduced complexity vs. SAN
• Lower operational cost
• Faster local file access
• Easy capacity expansion
• Dynamic storage utilization

Lambda Architecture (lambda-architecture.net)
[Diagram: a new data stream feeds two paths.]
• Batch layer (Hadoop): all data (HDFS); batch recompute; precompute views (MapReduce); produces batch views
• Speed layer (Storm): real-time data; process stream; increment views; real-time increment of partial aggregates; produces real-time views
• Serving layer: merges batch views and real-time views into a merged view (HBase)
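The three layers can be sketched with plain dictionaries. This is a minimal illustration, not code from Hadoop, Storm, or HBase: the page-view events, the `SpeedLayer` class, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def precompute_batch_view(all_events):
    """Batch layer: recompute page-view counts from the full event history."""
    view = defaultdict(int)
    for event in all_events:
        view[event["page"]] += 1
    return dict(view)


class SpeedLayer:
    """Speed layer: incremental counts for events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = defaultdict(int)

    def process(self, event):
        self.realtime_view[event["page"]] += 1

    def reset(self):
        """Called once a new batch view has absorbed these events."""
        self.realtime_view.clear()


def merged_view(batch_view, realtime_view):
    """Serving layer: merge batch and real-time partial aggregates per key."""
    pages = set(batch_view) | set(realtime_view)
    return {p: batch_view.get(p, 0) + realtime_view.get(p, 0) for p in pages}


# Hypothetical history already covered by the last batch run.
history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = precompute_batch_view(history)

# Events that arrived since that run.
speed = SpeedLayer()
for event in [{"page": "/home"}, {"page": "/docs"}]:
    speed.process(event)

view = merged_view(batch_view, speed.realtime_view)
```

The batch view is always rebuilt from all data, so errors in the speed layer are temporary: they disappear once the next batch recompute covers the same events and the real-time view is reset.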

Enterprise Data Hub Architecture
Steps: Load more data sources; Enrich data in Hadoop; Analyze; Offload / Enrich / Reload
Sources: relational, SaaS, mainframe; documents, emails; blogs, tweets, link data; log files, clickstreams
Data movement labels: parse, profile, ETL; load; replicate, CDC; streaming (high-speed streaming); cleanse, match; load
MapR Distribution including Hadoop:
• MapR Control System (MCS), Hadoop User Experience (HUE)
• Batch processing: MR, YARN, Hive, Pig, etc.
• Interactive querying: Drill, Impala, Presto, etc.
• HBase, other data stores
• MapR Data Platform: MapR-DB tables
Targets: data marts, data warehouse; BI reports and applications

Customer data, network security event data
Anomaly detection on large volumes of security event data, analytics on customer data to enable incremental sales

Industry data analysis, SaaS-based reporting
Advertising Automation Cloud
Buyers Cloud
Sales performance management data combined with fast responsiveness SaaS-delivered reports

Telecommunications Company
Customer profile data, customer behavior data
Analytics on customer behavior for better recommendations

MapR Overview
[Graphic labels: Big Data; Best Product; Business Impact; Top Ranked Hadoop; Production Success]

The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem running on the MapR Data Platform (MapR-DB, MapR-FS), with a management layer.]
Categories: execution engines (batch; streaming; NoSQL & search; SQL; ML, graph) and data governance and operations (provisioning & coordination; workflow & data governance; data integration & access; security)
Components: Spark, Cascading, Pig, MapReduce v1 & v2, YARN, Tez*, Storm*, Spark Streaming, Solr, HBase, Accumulo*, Drill*, Shark, Impala, Hive, GraphX, MLLib, Mahout, Savannah*, Juju, ZooKeeper, Oozie, Falcon*, Hue, HttpFS, Flume, Sqoop, Knox*, Sentry*
* Certification/support planned for 2014

MapR-DB: Powerful NoSQL Integrated with Hadoop
Benefit: Features
Performance
• High Performance: Over 1 million ops/sec with 10 nodes, in-memory processing
• Continuous Low Latency: No I/O storms, no compaction delays
Reliability
• 24x7 Applications: Instant recovery, online schema modification, snapshots, mirroring
• Consistency: Strong data consistency, row-level ACID transactions
Easy Administration
• Simplified Database Administration: No processes to manage, automated splits, self-tuning
• High Scalability: 1 trillion tables, trillions of rows, millions of columns
• Low TCO: Files and tables on one platform, more work with fewer nodes

MapR-DB (in MapR Enterprise Database Edition)
• NoSQL Table-Style Store
• Apache HBase API
• In-Hadoop Database
[Diagram: storage stacks compared. Other distros: HBase, JVM, HDFS, JVM, ext3/ext4, disks. MapR: MapR-DB tables/files, disks.]
Fast, scalable, reliable.
HBase API, in-memory option, Hadoop integration.
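A table-style NoSQL store of this kind maps a row key to a sparse set of "family:qualifier" cells. The sketch below is an in-memory toy, not the HBase or MapR-DB client API: `ToyTable`, its methods, and the sample rows are hypothetical, and a single lock stands in for row-level atomicity.

```python
import threading


class ToyTable:
    """In-memory stand-in for an HBase-style table: row key -> {"family:qualifier": value}."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self._rows = {}
        self._lock = threading.Lock()  # simplification: one lock for the whole table

    def put(self, row_key, cells):
        """Apply all cells for one row together, or none of them."""
        for column in cells:
            family = column.split(":", 1)[0]
            if family not in self.column_families:
                raise KeyError(f"unknown column family: {family}")
        with self._lock:
            self._rows.setdefault(row_key, {}).update(cells)

    def row(self, row_key):
        """Return a copy of one row's cells (empty dict if the row is absent)."""
        with self._lock:
            return dict(self._rows.get(row_key, {}))

    def scan(self, row_prefix=""):
        """Return rows in sorted key order, optionally limited to a key prefix."""
        with self._lock:
            keys = sorted(k for k in self._rows if k.startswith(row_prefix))
            return [(k, dict(self._rows[k])) for k in keys]


users = ToyTable(["profile", "activity"])
# Rows need not share the same columns: no fixed schema beyond the column families.
users.put("user#001", {"profile:name": "Ada", "profile:plan": "pro"})
users.put("user#002", {"profile:name": "Lin", "activity:last_login": "2014-06-01"})
```

Only the column families are declared up front; each row carries whatever qualifiers it needs, which is the flexible data model the earlier slides describe.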

Other In-Hadoop Database Technologies
• Databases in Hadoop
– Apache HBase
– Apache Accumulo
– Splice Machine
– MarkLogic
• Data Warehouses on Hadoop
– HP Vertica
– Pivotal HAWQ

What Other Trends?
• SQL query engines
– Apache Drill
– Impala
– Presto
– Etc.
• In-memory processing
– GridGain
– Apache Spark
– HAMRTech

SQL Query Engines for Hadoop and NoSQL Together
Impala

APACHE DRILL
• Pioneering Data Agility for Hadoop
• Apache open source project
• Scale-out execution engine for low-latency queries
• Unified SQL-based API for analytics and operational applications
Vibrant Community
• 40+ contributors
• 150+ years of experience building databases and distributed systems

Q&A
Engage with us!
@mapr, maprtech, mapr-technologies
dalekim@mapr.com
