6. © 2015 MapR Technologies 6
Examples of Real-Time
Images licensed under https://creativecommons.org/licenses/by/2.0/
Time image courtesy of Daniel Oldfield: https://www.flickr.com/photos/democlez/4424898002/
Air bag image courtesy of Mike Babcock: https://www.flickr.com/photos/mikebabcock/3098836311/
• Tied to clock time
• Guaranteed response time
For real-time analytics, let’s use: “no built-in delays”
So what is real-time analytics with Hadoop?
7. © 2015 MapR Technologies 7
Requirements for Real-Time Analytics with Hadoop
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
8. © 2015 MapR Technologies 8
Real-Time Data
Definition: Provide immediate access to live Hadoop data
for analysis
Requirements:
• Analysis uses live real-time data, not batch-copied data
• Business can identify insights immediately (often through
an automated process)
• Critical for use cases such as ad targeting, personalization, and network
security analysis
• System avoids complexity of separate stream processing
or messaging system for recent data
9. © 2015 MapR Technologies 9
Real-Time Data in Hadoop
For real-time:
• Log files should be written directly into the cluster or synced across remote data centers
• Operational applications should run in the same cluster, or in a separate cluster with real-time table replication
• Immediate action should be taken
– E.g., difference between fraud detection and fraud prevention
– Difference between on-demand ad bid versus missing opportunity
Existing challenges:
• Log files must be batch uploaded periodically (e.g., every 30 minutes)
– Due to HDFS limitations (not R/W, file-close semantics, no direct NFS)
• Operational applications run on a separate cluster/stack
– Data must be batch uploaded
• With batch uploads, the window to respond is missed
– Fraud, cyber attacks, matches, anomalies, etc.
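To put a number on the missed window, here is a minimal sketch of how long an event waits before a periodic batch upload makes it visible. It assumes events arrive uniformly between uploads and that the upload itself takes no time; the function name `batch_delay_minutes` is illustrative and not from the deck.

```python
def batch_delay_minutes(upload_interval: float) -> tuple:
    """Return (average, worst-case) minutes before an event becomes visible.

    Assumption: events arrive uniformly between uploads and the upload
    itself is instantaneous, so the wait ranges from 0 to one full interval.
    """
    average = upload_interval / 2
    worst_case = upload_interval
    return average, worst_case


# The slide's example: log files uploaded every 30 minutes.
average, worst_case = batch_delay_minutes(30)
print(f"average wait: {average} min, worst case: {worst_case} min")
```

Under those assumptions a 30-minute upload cycle leaves a fraud signal unseen for 15 minutes on average, which is the gap between detecting fraud after the fact and preventing it.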
10. © 2015 MapR Technologies 10
Real-Time Applications
Definition: Run operational applications in the cluster
Requirements:
• Address use cases beyond batch and interactive
analysis
• E.g., end-to-end real-time marketing and security applications directly
on Hadoop
• Eliminate separate Hadoop and NoSQL
clusters/technology stacks for apps
11. © 2015 MapR Technologies 11
Real-Time Applications in Hadoop
For real-time:
• Minimize impact of disruptive “housekeeping” tasks to enable consistent, real-time operations
– E.g., compactions, Java garbage collection, “region splits”
• Process live, operational data in Hadoop to avoid delays in batch copies
Existing challenges:
• Other in-Hadoop databases suffer disruptions, inhibiting real-time operation
– E.g., compactions can significantly slow down the system
– Garbage collection leads to unpredictable system delays
– Region splits are required to spread load, but impact responsiveness and performance
• Other in-Hadoop databases require separate clusters
12. © 2015 MapR Technologies 12
Real-Time Querying
Definition: Query any data as soon as it lands in the
cluster (self-service)
Requirements:
• Analysts can explore data immediately, no waiting
days/weeks for data prep by IT
• IT is not burdened with repeated schema management
and ETL requests
13. © 2015 MapR Technologies 13
Real-Time Querying in Hadoop
For real-time:
• Minimize time to get started on data exploration
• Leverage query engines that can query data in place
– Eliminate IT dependencies for schema preparation
Existing challenges:
• New data that lands in the cluster necessarily requires IT-built schemas
• Data exploration and analysis are contingent on IT backlog
14. © 2015 MapR Technologies 14
So How Are These Implemented?
15. © 2014 MapR Technologies 15
Case Study: Global Financial Services Firm
Analytics + Operational Applications on one platform
[Diagram: MapR Distribution for Hadoop with two areas, "Real-time Operational Applications" and "Analytics". Labels shown: online transactions, fraud detection, personalized offers, fraud model, recommendations table, clickstream analysis, fraud investigation tool, fraud investigator, interactive marketer.]
16. © 2015 MapR Technologies 16
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
17. © 2015 MapR Technologies 17
Faster/Secure NFS Access
MapR data access options:
1. HDFS API – apps written for Hadoop
2. Standard read/write NFS (POSIX) – existing file system-based apps, no code changes
3. MapR POSIX Client – advanced read/write NFS requirements, includes:
– Compression
– Parallelism
– Authentication
– Encryption
[Diagram: client node(s) reach the MapR cluster through NFS gateways, with redundant gateways for high availability. Labels shown: Hadoop applications (e.g. "hadoop fs -put") using the HDFS API (hadoop-core-*.jar); file-based apps/utils (e.g. cp, emacs) using the NFS client (included in OS); native applications using the MapR POSIX Client.]
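As an illustration of option 2 (standard read/write NFS, no code changes), the sketch below appends log lines with ordinary file I/O. Nothing in it is MapR-specific, which is the point. A temporary directory stands in for the NFS mount so the example runs anywhere; the path mentioned in the comment and the `append_log` helper are assumptions made for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def append_log(path: str, message: str) -> None:
    """Append one timestamped line using plain POSIX-style file I/O."""
    stamp = datetime.now(timezone.utc).isoformat()
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} {message}\n")


# Stand-in for the mounted cluster. On a real client this would be a
# directory under the NFS mount point (for example a hypothetical
# /mapr/<cluster-name>/logs), not a temporary directory.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

append_log(log_path, "payment accepted")
append_log(log_path, "payment declined")

with open(log_path, encoding="utf-8") as log:
    lines = log.read().splitlines()
print(len(lines), "lines written")
```

Because the application only sees a file path, pointing it at the mounted cluster instead of local disk would be a configuration change rather than a code change.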
18. © 2015 MapR Technologies 18
MapR-DB and “Other NoSQL” Throughput on YCSB
Throughput performance in operations/second/node (higher is better)

YCSB Benchmark                      MapR-DB 4.X   Other NoSQL    MapR-DB Increase
Load (10, 100)*                     27,097        14,753         1.8x
Read (75, 150)                      4,402         1,902          2.3x
50% read / 50% update (75, 100)     8,684         2,012          4.3x
95% read / 5% update (75, 100)      3,776         1,127          3.4x
Scan (32, 32)                       478           Client hangs   N/A

*Numbers in parentheses represent threads per client used in test runs for MapR-DB, other NoSQL, respectively
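The "MapR-DB Increase" figures are the ratio of the two throughput numbers for each workload. A short sketch that recomputes them from the published values follows; the dictionary layout and the `increase` helper are illustrative, and the Scan row is left out because the other system returned no result.

```python
# Throughput in operations/second/node, copied from the YCSB slide:
# workload -> (MapR-DB 4.X, other NoSQL)
THROUGHPUT = {
    "Load": (27097, 14753),
    "Read": (4402, 1902),
    "50% read / 50% update": (8684, 2012),
    "95% read / 5% update": (3776, 1127),
}


def increase(mapr_db: int, other: int) -> float:
    """Return MapR-DB throughput as a multiple of the other system's."""
    return mapr_db / other


for workload, (mapr_db, other) in THROUGHPUT.items():
    print(f"{workload}: {increase(mapr_db, other):.1f}x")
```

Note that the thread counts per client differ between the two systems in most rows, so the ratios compare each system under the configuration reported on the slide, not under identical load.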
19. © 2015 MapR Technologies 19
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
20. © 2015 MapR Technologies 20
YCSB Mixed (50% Read / 50% Put) - Compare Read Latency
[Chart: read latency for MapR-DB vs. HBase on other Hadoop distribution; lower is better]
21. © 2015 MapR Technologies 21
MapR-DB Table Replication
Multi-master (aka active/active) replication
[Diagram: end users with active read/write access to replicated clusters]
• Faster data access – minimize network latency on global data with local clusters
• Reduced risk of data loss – real-time, bi-directional replication for synchronized data across active clusters
• Application failover – upon any cluster failure, applications continue via redirection to another cluster
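The failover bullet can be pictured with a small client-side sketch. The deck does not say how redirection is actually implemented (it could equally be handled by DNS or a load balancer), so everything here is an assumption: the `ClusterUnavailable` exception, the `read_with_failover` helper, and the stand-in cluster functions are all hypothetical.

```python
from typing import Callable, Sequence


class ClusterUnavailable(Exception):
    """Raised by a cluster handle when it cannot serve a request."""


def read_with_failover(clusters: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each cluster in order and return the first successful read."""
    last_error = None
    for read in clusters:
        try:
            return read(key)
        except ClusterUnavailable as err:
            last_error = err  # this cluster is down; redirect to the next one
    raise ClusterUnavailable("no cluster could serve the request") from last_error


# Hypothetical stand-ins for two replicated clusters.
def london(key: str) -> str:
    raise ClusterUnavailable("London is offline")


def new_york(key: str) -> str:
    return f"value-for-{key}"


result = read_with_failover([london, new_york], "user42")
print(result)
```

This only works as a sketch of failover because replication is active/active: any cluster in the list is assumed to hold a current copy of the data and to accept both reads and writes.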
22. © 2015 MapR Technologies 22
MapR-DB Real-Time Analytics
Active clusters close to the end users, with real-time analytics at central cluster
[Diagram: end users have active read/write access to MapR-DB clusters in London, New York, and Singapore, plus a central MapR-DB/Hadoop cluster running Hadoop analytics]
Operational and analytical workloads combined in a single deployment
[Diagram: operationally efficient, consolidated MapR cluster running database operations and Hadoop analytics]
23. © 2015 MapR Technologies 23
REAL-TIME DATA
REAL-TIME APPLICATIONS
REAL-TIME QUERIES
24. © 2015 MapR Technologies 24
One SQL Interface for All Data Formats
ANSI SQL queries on rapidly evolving schemas
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1990 to 2020, structured data vs. unstructured data]
[Diagram: SQL options for analytics. Labels shown: Existing SQL Engines, Apache Drill, IT-Driven BI, Self-Service BI, Self-Service Data Exploration]
25. © 2015 MapR Technologies 25
Agility by Reducing Distance to Data
Short analytic life cycles with no upfront schema creation and management
FROM: Traditional Approach
Hadoop Data → Schema Design → Transformation → Data Movement → Users
Data preparation is repeated for New Business Questions
Total Time to Value: Weeks to Months
TO: New Approach
Hadoop Data → Users
New Business Questions
Total Time to Value: Minutes
Drill enables the “As It Happens” business with instant SQL analytics on complex data
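As a rough illustration of querying data in place, the sketch below prepares a SQL query against a raw JSON file for Apache Drill's REST interface, with no schema defined beforehand. The host, port, and file path are assumptions made for the example, and the request is only built, not sent, so the code runs without a Drill installation.

```python
import json
import urllib.request


def build_drill_query(base_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical file path: the query reads the JSON file directly through
# the dfs storage plugin, without a table definition or ETL step.
sql = "SELECT * FROM dfs.`/data/clickstream/2015-04-01.json` LIMIT 10"
request = build_drill_query("http://localhost:8047", sql)

print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would submit the query to a running Drillbit.
```

The design point is the missing steps: nothing between the file landing in the cluster and the first query requires a schema design or a data movement job.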
26. © 2015 MapR Technologies 26
Summary
27. © 2015 MapR Technologies 27
Batch Bottlenecks
1. Data streaming – real-time, but…
2. Further analysis is limited by batch loads into HDFS
3. Most databases must run in separate cluster, leading to batch copies
4. Append-only HDFS leads to heavy I/O for database defragmentation (“compactions”)
5. Data exploration requires IT intervention
28. © 2015 MapR Technologies 28
Removing the Batch Limitations
1. Data streaming – real-time as before, and now…
2. Further analysis is allowed with real-time loading
3. MapR-DB runs in Hadoop
4. With full read/write file system, defragmentation delays are eliminated
5. Data exploration performed in a self-service manner
[Diagram: items 1-5 grouped under Real-Time Data, Real-Time Applications, and Real-Time Querying]
29. © 2015 MapR Technologies 29
And Don’t Forget…
• Real-time analytics doesn’t help you if the other key pieces
aren’t in place
• Include security
– Interoperability with any authentication mechanism
– Fine-grained access controls
– Auditing capabilities beyond simple log files
• Also include enterprise-grade reliability
– An automated high availability configuration
– Incremental mirroring/replication for disaster recovery
– Consistent snapshots
• Talk to us about what else you should consider
30. © 2015 MapR Technologies 30
Q&A
@mapr maprtech
dalekim@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
31. Confidential © 2014 Actian Corporation31
Real-Time Analytics
Paige Roberts
April, 2015
Hadoop & Analytics Evangelist
Actian Hadoop & Analytics Center of Excellence
32. Confidential © 2014 Actian Corporation32
Agenda
About Actian
Advantages of Data-Driven Business
What Do I Mean By Real-Time?
Real-Time Challenge: ATM Fraud
How Actian Does It
33. Confidential © 2014 Actian Corporation33
A Few Words About Actian
$140M Revenues + Profitable
10,000+ Customers
Global Presence: 8 worldwide offices, 24x7 multinational support model
“Fast becoming a big data powerhouse to challenge the market.” Forrester
“Actian is now very powerfully positioned in the big data and analytics markets.” Bloor
34. Confidential © 2014 Actian Corporation34
Companies Using Big Data Strategically Outperform
…. At the Expense of Those That Don’t
• Predictive
• Real-time
• All Data
• New Insights
• Accuracy
[Charts: Revenue and EBITDA, big data companies vs. other companies, for Grocers, Online Retailers, Big Box Retailers, Casinos, Credit Cards, and Insurance. Values as extracted, pairing with categories not recoverable: Revenue 8, 9, 5, 5, -1, 6, 9, 14, 11, 9, 24, 12; EBITDA 5, -1, 1, 2, -15, 3, 14, 9, 12, 10, 22, 11]
Note: Percentage, 10 year CAGR McKinsey Report on Big Data.
35. Confidential © 2014 Actian Corporation35
What Does Real-Time Mean to Us?
Human comfortable interactivity
Streaming data processing
Sub-second response
36. Confidential © 2014 Actian Corporation36
Real-Time Analytics – Many Applications
• Solar Power Company: new customer targeting, smart meter data
• Sportswear Company: brand loyalty, wearable data
• Bank: ATM fraud, router data
37. Confidential © 2014 Actian Corporation37
Large US Bank Needs Help
• Multi-billion dollar American
bank / financial holding
company
• Provides deposit, credit,
trust, and investment
services to a broad range of
clients
• Operates nearly 1,500 retail
branches and more than
2,000 ATMs
38. Confidential © 2014 Actian Corporation38
Fraud Kept This Bank’s Execs Up at Night
[Chart axis label: Number of times faster than Impala]
39. Confidential © 2014 Actian Corporation39
What is the Worst Gotcha About ATM Fraud?
In the majority of cases, banks are required to reimburse customer losses.
https://www.tycois.com/insights-and-opinions/articles/atm-skimming-costs-banks
In spite of that, 67% of U.S. adults would switch to another institution after experiencing ATM fraud or a data breach.
http://www.harrisinteractive.com/NewsRoom/HarrisPolls/tabid/447/ctl/ReadCustom%20Default/mid/1508/ArticleId/1515/Default.aspx
42. Confidential © 2014 Actian Corporation42
Actian Vortex: The Elephant’s Best Friend
[Architecture diagram, by layer]
SOURCE DATA: Databases / Marts, Warehouses, Cloud / SaaS Applications, Structured & Unstructured Data, Enterprise Applications
DATA PLATFORM (Actian Vortex): Actian Management Console
• Elastic Data Preparation: DataFlow
• SQL Analytics: Vector in Hadoop
• Graph Analytics: SPARQLverse
• Machine Learning & Predictive Analytics: DataFlow
• Library of Analytic Blueprints
ANALYTIC APPS: Financial Services, Health Care, Other Verticals
APPLICATION DEV: Application Development and Tools (SQL; Java, C/C++, Python; powered by KNIME)
INFRASTRUCTURE: Deployment Options (@Customer)
43. Confidential © 2014 Actian Corporation43
Actian Vortex:
High Performance Analytics at Scale in Hadoop
Powered by KNIME
44. Confidential © 2014 Actian Corporation44
Stopping Fraud in Real Time
https://www.youtube.com/watch?v=u1QoHCpOUOU
45. Confidential © 2014 Actian Corporation45
Actian Vector in Hadoop: Built for Speed
• Vector Processing
– Single Instruction Multiple Data
• 2nd Gen Column Store
– Limit I/O
– Efficient real time updates
• Smarter Compression
– Maximize throughput
– Vectorized decompression
• Exploiting Chip Cache
– Process data on chip – not in RAM
• Multi-core Parallelism
– Maximize system resource utilization…
• Storage Indexes
– Quickly identify candidate data blocks
– Minimize IO
[Chart: time/cycles to process vs. data processed]
        Data processed   Time/cycles to process
DISK    40-400MB         Millions
RAM     2-3GB            150-250
CHIP    10GB             2-20
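The storage index idea (quickly identify candidate data blocks, minimize IO) can be sketched with per-block minimum and maximum values. This is a common way to build such an index, shown here as an assumption; the deck does not describe how Actian Vector implements it, and the `Block`, `build_blocks`, and `scan_greater_than` names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    values: List[int]  # column values stored in this block
    min_value: int     # smallest value in the block
    max_value: int     # largest value in the block


def build_blocks(column: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record each block's min/max."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_greater_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.max_value <= threshold:
            continue  # the index proves nothing here can match; skip the read
        blocks_read += 1
        matches.extend(v for v in block.values if v > threshold)
    return matches, blocks_read


# Hypothetical transaction amounts, stored four values per block.
amounts = [5, 12, 7, 3, 40, 38, 45, 41, 9, 2, 6, 8]
blocks = build_blocks(amounts, block_size=4)
matches, blocks_read = scan_greater_than(blocks, threshold=30)
print(f"read {blocks_read} of {len(blocks)} blocks, found {matches}")
```

In this toy data set only one of the three blocks has a maximum above the threshold, so two thirds of the reads are avoided before any value is inspected.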
48. Confidential © 2014 Actian Corporation48
What to Look For in SQL in Hadoop
Open: No vendor lock-in
• Collaborative architecture
• Open access to Actian data storage formats
• Support for other formats
• Hadoop distribution and ecosystem application support
Fast: Results you need when you need them
• Fastest data prep and ingestion
• Fastest analytic engines
• Unbridled processing power on data nodes in a Hadoop cluster
Enterprise-Grade: Proven technology advantages
• Full SQL support
• Extreme scalability
• Full security
• High Availability & Disaster Recovery
50. Confidential © 2014 Actian Corporation50
www.actian.com
facebook.com/actiancorp
@actiancorp
Thank You
Download Actian Vortex Express
Free Forever
http://bigdata.actian.com/sql-in-hadoop
52. Q & A
Dale Kim
Director of Industry Solutions
MapR
Paige Roberts
Hadoop & Analytics Evangelist
Actian
53. Please use the same URL you used to view today’s live event
for the archive event, plus we will be sending you a follow-up
email with that URL once the archive is posted!
54. Thank you for participating in
today’s roundtable web event
Just by attending this event the winner of the
$100 AmEx Gift Card is…….