What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and a biometric database are some of the examples presented in this talk.
7. “Regular Data” Goes Here: Gigabytes, maybe a few Terabytes
“Big Data” Goes Here: Terabytes to Petabytes
8. What Does Big Data Look Like?
Structured
Semi-Structured
Unstructured
Employee Id  Last Name  First Name  City
156561       John       Doe         Milano
1            Jane       Smith       London
“wooly_mammoth.jpg”:
size=2000
type=JPEG
size=960
“piano_cat.mpg”:
size=202300
type=MPEG
resolution=480
tags=kitty,lol,ican’tbelieveacatcanplaypiano
Seven a.m., waking up in the morning
Gotta be fresh, gotta go downstairs
Gotta have my bowl, gotta have cereal
Seein' everything, the time is goin'
Tickin' on and on, everybody's rushin'
Gotta get down to the bus stop
Gotta catch my bus, I see my friends (My friends)
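The three shapes above can be contrasted in a few lines of code. This is only a minimal sketch: the variable names are made up for illustration, and the unstructured sample is a placeholder sentence rather than the text on the slide.

```python
# Structured: every row has the same fields, in the same order.
COLUMNS = ("employee_id", "last_name", "first_name", "city")
rows = [
    (156561, "John", "Doe", "Milano"),
    (1, "Jane", "Smith", "London"),
]

# Semi-structured: each record names its own fields, and fields may differ.
files = {
    "wooly_mammoth.jpg": {"size": 2000, "type": "JPEG"},
    "piano_cat.mpg": {"size": 202300, "type": "MPEG", "resolution": 480},
}

# Unstructured: free text with no declared fields (placeholder sentence).
text = "free text has no schema until a program imposes one"

# Structured data can be queried by column position.
cities = [row[COLUMNS.index("city")] for row in rows]

# Semi-structured data needs a fallback for fields that may be missing.
resolutions = {name: meta.get("resolution") for name, meta in files.items()}

# Unstructured data has to be parsed before anything can be asked of it.
word_count = len(text.split())
```

The point of the contrast is that the same question ("what is the value of field X?") gets harder to answer as the structure loosens.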
9. Regular Data or BIG Data?
MapR Employee HR Records
SMALL - a few MB
The 2014 Web Index
BIG - 55PB - 8,500 Servers
Football Championship
SMALL - 10MB per year
Per-Minute Temperature History For Nest Thermostats
BIG - 230TB - 35 Servers
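A quick back-of-the-envelope check on the two BIG examples: dividing the volume by the server count gives the data held per server. This sketch assumes decimal units (1 PB = 1,000 TB) and ignores replication, which the slide does not specify; the function name is hypothetical.

```python
def tb_per_server(total_tb: float, servers: int) -> float:
    """Average data volume per server, in terabytes."""
    return total_tb / servers


# Figures taken from the slide; 55 PB converted to TB (assumption: 1 PB = 1,000 TB).
web_index = tb_per_server(55 * 1000, 8500)
nest_history = tb_per_server(230, 35)

print(f"Web index:    {web_index:.2f} TB per server")
print(f"Nest history: {nest_history:.2f} TB per server")
```

Both work out to roughly 6.5 TB per server, which suggests the cluster size grows with the data rather than the data being squeezed onto a fixed set of machines.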
14. ETL
Hadoop / MapR: Stage 1
[Diagram labels: Data Source, Export, NFS/HDFS, MapR Hadoop Cluster ($), Data Warehouse ($$$), Staging Tables, Work Table, Production, Jobs]
18. Aadhaar Project: Largest Biometric DB in the World
• Unique 12-digit number for each person in India
• Proof of identity, authenticated anytime, anywhere
• Runs on the NoSQL database MapR-DB
1.2 billion people
19. Architecture
• High Availability: “Always On”
• Latency: get an identity in less than 200 ms
• Volume: 1.2 billion people (10 to 15 TB with biometric information)
• Flexible Schema
[Diagram labels: Data Center 1, Data Center 2, NoSQL Database, Distributed File System]
24. Streaming Log : Goals
• Push Data into the Data Hub
• Track down a security breach
• Identify anomalous behaviors or other patterns in clickstream data from user interactions on a website
• Supply data to a real-time dashboard
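One of the goals above, spotting anomalous behavior in clickstream data, can be sketched as follows. This is a minimal illustration, assuming events arrive as (user, minute) pairs and that "anomalous" simply means a hit count far above the median; the threshold rule and all names are assumptions for the example, not the method used in the talk.

```python
from collections import Counter
from statistics import median


def find_anomalies(events, factor=5.0):
    """Flag (user, minute) buckets whose hit count exceeds factor * median count.

    events: iterable of (user_id, minute) tuples, one per page hit.
    """
    counts = Counter(events)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(key for key, hits in counts.items() if hits > factor * typical)


# Hypothetical sample: three ordinary users and one very busy client.
events = (
    [("alice", 0)] * 3
    + [("bob", 0)] * 4
    + [("carol", 0)] * 2
    + [("mallory", 0)] * 60
)
suspects = find_anomalies(events)
print(suspects)
```

In a real pipeline the same counting would run continuously over the incoming log stream, with the flagged buckets feeding the dashboard or a security alert.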
26. What is a Time Series?
• Stuff with timestamps
• sensor measurements
• system stats
• log files
• ….
27. Data Storage
Key                     13   43   73   103  …
…
series-uid.time-window  4.5  5.2  6.1  4.9
…
• Typical time window is one hour
• Column names are offsets in time window
• Find series-uid in separate table
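The layout above can be sketched in a few lines. This is a minimal illustration, assuming a one-hour window, offsets measured in seconds, and a plain in-memory dict standing in for the NoSQL table; the names `row_key`, `put`, and `read_window` are hypothetical and are not a MapR-DB or HBase API.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # assumption: one-hour time window, offsets in seconds


def row_key(series_uid: str, timestamp: int) -> str:
    """Build a 'series-uid.time-window' key from an epoch timestamp in seconds."""
    window_start = timestamp - (timestamp % WINDOW_SECONDS)
    return f"{series_uid}.{window_start}"


def column_name(timestamp: int) -> int:
    """The column name is the offset of the sample inside its time window."""
    return timestamp % WINDOW_SECONDS


# Stand-in for the wide table: row key -> {offset: value}
table = defaultdict(dict)


def put(series_uid: str, timestamp: int, value: float) -> None:
    """Store one sample in the row for its window, under its offset column."""
    table[row_key(series_uid, timestamp)][column_name(timestamp)] = value


def read_window(series_uid: str, timestamp: int):
    """Return (offset, value) pairs for the window containing timestamp."""
    return sorted(table.get(row_key(series_uid, timestamp), {}).items())


# Reproduce the sample row from the slide for a window starting at t=7200.
for offset, value in [(13, 4.5), (43, 5.2), (73, 6.1), (103, 4.9)]:
    put("s1", 7200 + offset, value)

window = read_window("s1", 7200)
```

Packing a whole window into one row means a single read returns an hour of samples, at the cost of having to compute the window start before every lookup.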
30. So far we have:
• Collect Data easily:
• Kafka, Flume, Sqoop, …
• A way to store “any data” :
• Distributed File System: HDFS/MapR-FS
• NoSQL : HBase/MapR-DB
• Process and Access Data:
• Spark, Drill, Hive, Pig
Let’s use this data to build new applications!
34. Hadoop Use Cases
• Evolved from “batch” to “real time”
• Store & Process “everything you want” in a file or database
• Build new types of applications:
• Continuous Analytics
• Data Hub
• Recommendation Engines
• Time Series with Predictions (maintenance, QA)
• ….
Learn Hadoop for Free : http://learn.mapr.com
Download free eBook : https://www.mapr.com/real-world-hadoop
35. MILAN 20/21.11.2015 - Tugdual Grall - @tgrall
Leave your feedback on Joind.in!
https://m.joind.in/event/codemotion-milan-2015