Everyone is awash in the new buzzword, Big Data, and it seems you can’t escape it wherever you go. But real companies with real use cases are creating real value for their businesses with big data. This talk discusses some of the more compelling current and recent projects, the architectures and systems they use, and their successful outcomes.
9. Ship the Function to the Data
[Diagram with two panels, "Distributed Computing" and "Traditional Architecture", each drawn from "function" and "data" blocks. Storage tiers labeled in the diagram: RDBMS and SAN/NAS.]
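The idea in the slide title can be illustrated with a minimal sketch. Everything here is hypothetical: the node names, the tiny integer partitions, and the helper functions `traditional_sum` and `shipped_sum` are made up for illustration and are not part of Hadoop or MapR. The sketch only counts how many values would have to move in each style.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cluster: each node holds its own partition of the data.
partitions: Dict[str, List[int]] = {
    "node-a": [3, 8, 1],
    "node-b": [7, 2],
    "node-c": [9, 4, 6],
}

def traditional_sum(parts: Dict[str, List[int]]) -> Tuple[int, int]:
    """Traditional style: copy every record to one place, then compute.

    Returns (result, number of values moved).
    """
    moved = [x for records in parts.values() for x in records]
    return sum(moved), len(moved)

def shipped_sum(parts: Dict[str, List[int]],
                fn: Callable[[List[int]], int]) -> Tuple[int, int]:
    """Distributed style: run fn next to each partition, move only results.

    Returns (result, number of values moved).
    """
    partials = [fn(records) for records in parts.values()]
    return sum(partials), len(partials)

total_a, moved_a = traditional_sum(partitions)
total_b, moved_b = shipped_sum(partitions, sum)
print(total_a, moved_a)  # same answer, every record moved
print(total_b, moved_b)  # same answer, one partial result per node moved
```

Both calls produce the same total, but the second moves one small partial result per node instead of every record. That difference grows with the size of the data, which is the point of shipping the function rather than the data.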
10. Variation: Multiple MapReduces
Example: Fraud Detection in User Transactions
[Diagram labels: MapReduce; Transaction data; LDA training; LDA scoring; G2 score; 95th-percentile LDA anomaly; HBase / MapR M7 Edition; Candidate events for analyst review]
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
11. MapR Distribution for Apache Hadoop
Complete Hadoop distribution
Comprehensive management suite
Industry-standard interfaces
Enterprise-grade dependability
Higher performance
SCRIPT: You can see from the Word Count example that MapReduce is a low-level construct. Typical applications require more complex processing, which is accomplished by running multiple stages of MapReduce. Here is an example of a Hadoop system that uses machine learning models to detect account fraud after a security breach. (*) Each step is its own MapReduce program. We’ll return to this example in more detail later.
---------------
[DON’T explain the algorithm here. Just twinkle the MR stages.]
(*) User transaction data is loaded into a distributed datastore for massive tables, such as HBase running on Hadoop, or the native tables available with MapR’s M7 distribution.
(*) There’s a training phase, which teaches the system what normal transactions look like.
(*) Later, individual user transactions are scored against the “normal behavior” pattern.
(*) Then, transactions with highly anomalous behavior are singled out as candidate events to be manually reviewed by analysts for potential fraud.
In your data flow, any place you have a group-by, join, filter, or count of occurrences typically equates to one or more MapReduce jobs.
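The chaining of stages described above can be sketched in a few lines of plain Python. This is a toy, in-memory illustration and not Hadoop code: the `map_reduce` helper, the sample transactions, and the scoring rule are all assumptions. In particular, the "model" here is just each user's mean transaction amount, a deliberately simple stand-in for the LDA training and scoring named on the slide. The cutoff uses a nearest-rank 95th percentile.

```python
import math
from collections import defaultdict
from statistics import mean

def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce stage: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

# Made-up (user, amount) transactions.
transactions = [
    ("alice", 20.0), ("alice", 22.0), ("alice", 18.0), ("alice", 400.0),
    ("bob", 50.0), ("bob", 55.0), ("bob", 45.0), ("bob", 52.0),
]

# Stage 1, "training": learn each user's typical amount (stand-in for LDA).
def train_map(txn):
    user, amount = txn
    yield user, amount

def train_reduce(user, amounts):
    yield user, mean(amounts)

profile = dict(map_reduce(transactions, train_map, train_reduce))

# Stage 2, "scoring": distance of each transaction from the user's profile.
def score_map(txn):
    user, amount = txn
    yield (user, amount), abs(amount - profile[user])

def score_reduce(txn, scores):
    yield txn, max(scores)

scored = map_reduce(transactions, score_map, score_reduce)

# Stage 3, "anomaly": keep scores at or above the 95th percentile.
def flag_map(item):
    yield "all", item  # single key, so one reducer sees every score

def flag_reduce(_key, items):
    ranked = sorted(items, key=lambda kv: kv[1])
    cutoff = ranked[math.ceil(0.95 * len(ranked)) - 1][1]  # nearest rank
    for txn, score in ranked:
        if score >= cutoff:
            yield txn, score

candidates = map_reduce(scored, flag_map, flag_reduce)
print(candidates)  # candidate events for analyst review
```

Each stage consumes the previous stage's output, which mirrors the point that a group-by, filter, or count in the data flow tends to become its own MapReduce job. On a real cluster the stages would run as separate jobs over distributed tables rather than Python lists.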
MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution, including Hive, Pig, Oozie, and Sqoop, plus additional packages such as Cascading. We have spent more than two years on a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has combined these significant updates with a dozen open source packages. All of the innovations MapR has delivered maintain 100% compatibility with the Apache Hadoop APIs. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent to one another.