PayPal today generates massive amounts of data, from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. The Data Technology team at PayPal has built a configurable engine, the Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate the entities and interactions matching those patterns. The pipeline is an ecosystem of components built on HDFS and HBase, with a data catalog and seamless connectivity to enterprise data stores. EAP's data definition, data processing, and behavioral analysis can be adapted to many business needs. By leveraging Hadoop to address the problems of size and scale, EAP promotes agility: a set of tools and metadata abstracts away the complexities of big-data technologies and lets end users control the behavior-centric processing of data. EAP presents the massive data stored on HDFS as business objects, e.g., customers and page-impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system built on HBase lets analysts define relationships between entities and extrapolate them across disparate data sources, exploring the universe of customer interactions and behaviors through a single lens.
Current business needs…
• Risk: detecting and preventing fraud
• Marketing: reaching customers with relevant offers
• Product: improving user experience for better conversion
• Merchant: providing insights into their businesses
BEHAVIORAL ANALYTICS: DATA TO USE
From: terabytes of raw data (clickstream, transactions & logs), millions of rows.
To: metadata for a business view, behaviors across channels, and processors for flows & patterns.
Example flow across channels:
• API: Login
• FRAUD: Review
• AUTH: Confirm
• TXN: Shipment
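Read as data, the flow above is just an ordered sequence of (channel, action, timestamp) events. A minimal Java sketch of that idea, with illustrative names and timestamps rather than EAP's actual model:

import java.util.List;

// Sketch only: a behavioral event pairs a channel with an action at a point
// in time, and a cross-channel flow is an ordered sequence of such events.
public class FlowSketch {
    record Event(String channel, String action, long timestampMillis) {}

    public static void main(String[] args) {
        List<Event> flow = List.of(
            new Event("API",   "Login",    1362839690709L),
            new Event("FRAUD", "Review",   1362839790719L),
            new Event("AUTH",  "Confirm",  1362839890729L),
            new Event("TXN",   "Shipment", 1362839990739L)); // illustrative data
        flow.forEach(e ->
            System.out.printf("%s %s @ %d%n", e.channel(), e.action(), e.timestampMillis()));
    }
}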
[Architecture diagram: sources feed data parsers and processors to produce events; module jobs build on the library and tags & relations; developers reach the platform through SQL and PathViz (D3) visualization.]
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
• Augment, not transform, with metadata
• Templates for analytical workflows
[The same architecture diagram, now partitioned into the INPUT and PROCESSING subsystems detailed in the following slides.]
EVENT ANALYTIC PLATFORM (EAP)
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation as sequence files (see the sketch after this list)
• Relations: pre-computed; map-side joins using HBase
• Catalog: metadata-indexed HDFS repository
• Modules: common interface for data access; simplified logic
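As a rough illustration of the "common representation as sequence files" bullet, here is a hedged sketch of writing events to a Hadoop SequenceFile keyed by timestamp; the path and the tab-separated record layout are assumptions, not EAP's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class EventWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/eap/events/2013-03-09.seq"); // hypothetical location
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // One event per record, keyed by epoch-millis timestamp;
            // value layout (assumed): event \t session_id \t customer_id
            writer.append(new LongWritable(1362839690709L),
                          new Text("pageview\t123456567\t7654321"));
        }
    }
}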
INPUT SUBSYSTEM – RAW DATA TO EVENTS
[Diagram: sources in delimited, sequence, and other formats flow through a MapReduce stage, driven by mapping & entity definitions, into events; the Data Catalog lives on HDFS, while HBase holds the reference data and entity relations.]
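To suggest what the plug-in parser contract behind this subsystem might look like, here is a hedged Java sketch; the interface and class names are illustrative, not EAP's actual API. Each source format (delimited, sequence, others) would implement the same interface, and the metadata-driven mapping then turns the named raw fields into events.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative plug-in parser contract: one implementation per source format.
interface EventParser {
    // Parse one raw record into field values keyed by source column name.
    Map<String, String> parse(String rawRecord);
}

// Example implementation for delimited sources; columns would come from metadata.
class DelimitedParser implements EventParser {
    private final List<String> columns;
    private final String delimiter;

    DelimitedParser(List<String> columns, String delimiter) {
        this.columns = columns;
        this.delimiter = delimiter;
    }

    @Override
    public Map<String, String> parse(String rawRecord) {
        String[] values = rawRecord.split(delimiter, -1);
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < columns.size() && i < values.length; i++) {
            fields.put(columns.get(i), values[i]);
        }
        return fields;
    }
}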
EVENT ANALYTIC PLATFORM (EAP)
• Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking events across channels
• Modules: logical expressions transparent to sources
• Library: business metadata as tags
ENRICHING THE DATA

Raw events (Customer ID not yet resolved):
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   ?
1362839790719  pageview  123456567   ?
1362839890729  pageview  123456567   7654321

After enrichment:
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   7654321
1362839790710  pageview  123456567   7654321
1362839890711  pageview  123456567   7654321

The Entity Resolver performs an HBase lookup against reference data and entity relations to back-fill the missing Customer IDs.
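A minimal sketch of that resolver step, assuming a session-to-customer relation table in HBase and using the current HBase client API; the table, column family, and qualifier names are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the Entity Resolver idea: back-fill a missing Customer ID by
// looking up the session's relation row in HBase. Names are hypothetical.
public class EntityResolverSketch {
    private final Table relations;

    EntityResolverSketch(Connection conn) throws IOException {
        relations = conn.getTable(TableName.valueOf("entity_relations"));
    }

    /** Returns the customer ID recorded for this session, or null if unknown. */
    String resolveCustomer(String sessionId) throws IOException {
        Result r = relations.get(new Get(Bytes.toBytes(sessionId)));
        byte[] v = r.getValue(Bytes.toBytes("rel"), Bytes.toBytes("customer_id"));
        return v == null ? null : Bytes.toString(v);
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            System.out.println(new EntityResolverSketch(conn).resolveCustomer("123456567"));
        }
    }
}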
EVENT ANALYTIC PLATFORM (EAP)
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking events across channels
• Catalog: indexed access to all the event data
• Modules: common interface for data access; simplified logic
DATA CATALOG: ACCESS TO THE EVENTS
[Diagram: the Data Catalog API, backed by process metadata in HBase, points MapReduce/Pig jobs to the event sequence files on HDFS.]
ACCESSING DATA

• PIG
REGISTER 'EventEngine.jar';
EVENTDATA = LOAD 'eap://event' USING
    com.paypal.eap.EventLoader('Time', 'Source', 'Events', 'Attributes', 'Entities');
...

• MR
// Catalog API: Set<Path> get(Calendar startDate, Calendar endDate, String type)
Set<Path> paths = catalog.get(startDate, endDate, type);
...
public class FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> {
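One way a job driver might consume the catalog call above, sketched with a stand-in interface since the real catalog client is EAP-internal; the date range, event type, and job name are illustrative.

import java.util.Calendar;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FlowJobDriver {
    // Minimal stand-in for the EAP Data Catalog client whose get() signature
    // is shown on the slide above.
    interface Catalog {
        Set<Path> get(Calendar startDate, Calendar endDate, String type);
    }

    public static Job buildJob(Catalog catalog) throws Exception {
        Calendar end = Calendar.getInstance();
        Calendar start = (Calendar) end.clone();
        start.add(Calendar.DAY_OF_MONTH, -7);        // last seven days of events

        Job job = Job.getInstance(new Configuration(), "flow-analysis");
        // job.setMapperClass(FlowMapper.class);     // the mapper from the slide
        for (Path p : catalog.get(start, end, "pageview")) {
            FileInputFormat.addInputPath(job, p);    // each catalog path becomes job input
        }
        return job;
    }
}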
EVENT ANALYTIC PLATFORM (EAP): SUMMARY
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking across channels
• Catalog: indexed access to all the event data
• Modules: simplified logic for event processing
PROCESSING SUBSYSTEM – EVENTS TO INFORMATION
[Diagram: module jobs such as Path Discovery, Pattern Matching, and Event Metrics invoke the library and load events through the Data Catalog, running as MapReduce and Pig over HDFS.]
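To illustrate what the Pattern Matching module does conceptually, here is a small self-contained sketch. In EAP the equivalent logic runs as MapReduce/Pig jobs over catalog data; the matching rule here (an in-order subsequence over a customer's time-ordered events) is just one plausible choice.

import java.util.List;

public class PatternMatchSketch {
    /** True if `pattern` occurs as an in-order (not necessarily contiguous) subsequence. */
    static boolean matches(List<String> events, List<String> pattern) {
        int i = 0;
        for (String e : events) {
            if (i < pattern.size() && e.equals(pattern.get(i))) i++;
        }
        return i == pattern.size();
    }

    public static void main(String[] args) {
        List<String> events = List.of("Login", "Browse", "Review", "Confirm", "Browse", "Shipment");
        System.out.println(matches(events, List.of("Login", "Review", "Confirm", "Shipment"))); // true
    }
}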
Each company uses data in its own ways. Here are just some of the ways in which PayPal leverages its big data.
Here’s how the system works today; the design is evolving quickly to keep up with all of the possible uses for it.
The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read the input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw fields to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for propagating keys across the data to effectively create joins at scale.
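Read literally, that note suggests something like the following hedged sketch of the object-factory MapReduce job; the column mapping and the key=value output format are assumptions, since in EAP the mapping comes from metadata.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ObjectFactoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    // raw column index -> object field name (would come from metadata in EAP)
    private static final Map<Integer, String> MAPPING =
        Map.of(0, "timestamp", 1, "event", 2, "session_id", 3, "customer_id");

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] raw = value.toString().split(",", -1);   // CSV input, per the note
        StringBuilder obj = new StringBuilder();
        for (int i = 0; i < raw.length; i++) {
            String field = MAPPING.get(i);
            if (field != null) obj.append(field).append('=').append(raw[i]).append(';');
        }
        // Key by session so downstream jobs can group a session's events together.
        ctx.write(new Text(raw.length > 2 ? raw[2] : ""), new Text(obj.toString()));
    }
}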
The processing subsystem is where we expect most usage: it’s where problem solvers across the company set up their own use cases using our modules that operate on the object library.