PayPal today generates massive amounts of data, from clickstream logs to transactions and routine business events. Analyzing customer behavior across this data can be a daunting task. The Data Technology team at PayPal has built a configurable engine, the Event Analytics Pipeline (EAP), using Hadoop to ingest and process massive amounts of customer interaction data, match business-defined behavioral patterns, and generate the entities and interactions matching those patterns. The pipeline is an ecosystem of components built on HDFS and HBase, with a data catalog and seamless connectivity to enterprise data stores. EAP's data definition, data processing, and behavioral analysis can be adapted to many business needs. By leveraging Hadoop to address the problems of size and scale, EAP promotes agility: a set of tools and metadata abstracts away the complexities of big-data technologies and lets end users control the behavior-centric processing of data. EAP presents the massive data stored on HDFS as business objects, e.g., customers and page-impression events, allowing analysts to easily extract patterns of events across billions of rows of data. The rules system built on HBase lets analysts define relationships between entities and extrapolate them across disparate data sources, exploring the universe of customer interactions and behaviors through a single lens.
Current business needs…
• Risk: detecting and preventing fraud
• Marketing: reaching customers with relevant offers
• Product: improving user experience for better conversion
• Merchant: providing insights into their businesses
BEHAVIORAL ANALYTICS: DATA TO USE
From: terabytes of raw data (clickstream, transactions & logs), millions of rows.
To: metadata for a business view, behaviors across channels, and processors for flows & patterns.
Example flow across channels:
• API: Login
• FRAUD: Review
• AUTH: Confirm
• TXN: Shipment
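Read as data, the flow above is just an ordered sequence of (channel, action, timestamp) events. A minimal Java sketch of that idea, with illustrative names and timestamps rather than EAP's actual model:

import java.util.List;

// Sketch only: a behavioral event pairs a channel with an action at a point
// in time, and a cross-channel flow is an ordered sequence of such events.
public class FlowSketch {
    record Event(String channel, String action, long timestampMillis) {}

    public static void main(String[] args) {
        List<Event> flow = List.of(
            new Event("API",   "Login",    1362839690709L),
            new Event("FRAUD", "Review",   1362839790719L),
            new Event("AUTH",  "Confirm",  1362839890729L),
            new Event("TXN",   "Shipment", 1362839990739L)); // illustrative data
        flow.forEach(e ->
            System.out.printf("%s %s @ %d%n", e.channel(), e.action(), e.timestampMillis()));
    }
}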
[Architecture diagram: sources feed data parsers and processors to produce events; module jobs build on the library and tags & relations; developers reach the platform through SQL and PathViz (D3) visualization.]
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
• Augment, not transform, with metadata
• Templates for analytical workflows
[The same architecture diagram, now partitioned into the INPUT and PROCESSING subsystems detailed in the following slides.]
EVENT ANALYTIC PLATFORM (EAP)
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation as sequence files (see the sketch after this list)
• Relations: pre-computed; map-side joins using HBase
• Catalog: metadata-indexed HDFS repository
• Modules: common interface for data access; simplified logic
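As a rough illustration of the "common representation as sequence files" bullet, here is a hedged sketch of writing events to a Hadoop SequenceFile keyed by timestamp; the path and the tab-separated record layout are assumptions, not EAP's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class EventWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/eap/events/2013-03-09.seq"); // hypothetical location
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // One event per record, keyed by epoch-millis timestamp;
            // value layout (assumed): event \t session_id \t customer_id
            writer.append(new LongWritable(1362839690709L),
                          new Text("pageview\t123456567\t7654321"));
        }
    }
}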
INPUT SUBSYSTEM – RAW DATA TO EVENTS
[Diagram: sources in delimited, sequence, and other formats flow through a MapReduce stage, driven by mapping & entity definitions, into events; the Data Catalog lives on HDFS, while HBase holds the reference data and entity relations.]
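To suggest what the plug-in parser contract behind this subsystem might look like, here is a hedged Java sketch; the interface and class names are illustrative, not EAP's actual API. Each source format (delimited, sequence, others) would implement the same interface, and the metadata-driven mapping then turns the named raw fields into events.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative plug-in parser contract: one implementation per source format.
interface EventParser {
    // Parse one raw record into field values keyed by source column name.
    Map<String, String> parse(String rawRecord);
}

// Example implementation for delimited sources; columns would come from metadata.
class DelimitedParser implements EventParser {
    private final List<String> columns;
    private final String delimiter;

    DelimitedParser(List<String> columns, String delimiter) {
        this.columns = columns;
        this.delimiter = delimiter;
    }

    @Override
    public Map<String, String> parse(String rawRecord) {
        String[] values = rawRecord.split(delimiter, -1);
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < columns.size() && i < values.length; i++) {
            fields.put(columns.get(i), values[i]);
        }
        return fields;
    }
}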
EVENT ANALYTIC PLATFORM (EAP)
• Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking events across channels
• Modules: logical expressions transparent to sources
• Library: business metadata as tags
ENRICHING THE DATA

Raw events (Customer ID not yet resolved):
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   ?
1362839790719  pageview  123456567   ?
1362839890729  pageview  123456567   7654321

After enrichment:
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   7654321
1362839790710  pageview  123456567   7654321
1362839890711  pageview  123456567   7654321

The Entity Resolver performs an HBase lookup against reference data and entity relations to back-fill the missing Customer IDs.
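A minimal sketch of that resolver step, assuming a session-to-customer relation table in HBase and using the current HBase client API; the table, column family, and qualifier names are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the Entity Resolver idea: back-fill a missing Customer ID by
// looking up the session's relation row in HBase. Names are hypothetical.
public class EntityResolverSketch {
    private final Table relations;

    EntityResolverSketch(Connection conn) throws IOException {
        relations = conn.getTable(TableName.valueOf("entity_relations"));
    }

    /** Returns the customer ID recorded for this session, or null if unknown. */
    String resolveCustomer(String sessionId) throws IOException {
        Result r = relations.get(new Get(Bytes.toBytes(sessionId)));
        byte[] v = r.getValue(Bytes.toBytes("rel"), Bytes.toBytes("customer_id"));
        return v == null ? null : Bytes.toString(v);
    }

    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            System.out.println(new EntityResolverSketch(conn).resolveCustomer("123456567"));
        }
    }
}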
EVENT ANALYTIC PLATFORM (EAP)
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking events across channels
• Catalog: indexed access to all the event data
• Modules: common interface for data access; simplified logic
DATA CATALOG: ACCESS TO THE EVENTS
[Diagram: the Data Catalog API, backed by process metadata in HBase, points MapReduce/Pig jobs to the event sequence files on HDFS.]
ACCESSING DATA

• PIG
REGISTER 'EventEngine.jar';
EVENTDATA = LOAD 'eap://event' USING
    com.paypal.eap.EventLoader('Time', 'Source', 'Events', 'Attributes', 'Entities');
...

• MR
// Catalog API: Set<Path> get(Calendar startDate, Calendar endDate, String type)
Set<Path> paths = catalog.get(startDate, endDate, type);
...
public class FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> {
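One way a job driver might consume the catalog call above, sketched with a stand-in interface since the real catalog client is EAP-internal; the date range, event type, and job name are illustrative.

import java.util.Calendar;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FlowJobDriver {
    // Minimal stand-in for the EAP Data Catalog client whose get() signature
    // is shown on the slide above.
    interface Catalog {
        Set<Path> get(Calendar startDate, Calendar endDate, String type);
    }

    public static Job buildJob(Catalog catalog) throws Exception {
        Calendar end = Calendar.getInstance();
        Calendar start = (Calendar) end.clone();
        start.add(Calendar.DAY_OF_MONTH, -7);        // last seven days of events

        Job job = Job.getInstance(new Configuration(), "flow-analysis");
        // job.setMapperClass(FlowMapper.class);     // the mapper from the slide
        for (Path p : catalog.get(start, end, "pageview")) {
            FileInputFormat.addInputPath(job, p);    // each catalog path becomes job input
        }
        return job;
    }
}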
EVENT ANALYTIC PLATFORM (EAP): SUMMARY
• Data Ingest: metadata-driven transformations; plug-in parsers
• Events: common representation stored in sequence files
• Relations: enriching the data; linking across channels
• Catalog: indexed access to all the event data
• Modules: simplified logic for event processing
PROCESSING SUBSYSTEM – EVENTS TO INFORMATION
[Diagram: module jobs such as Path Discovery, Pattern Matching, and Event Metrics invoke the library and load events through the Data Catalog, running as MapReduce and Pig over HDFS.]
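To illustrate what the Pattern Matching module does conceptually, here is a small self-contained sketch. In EAP the equivalent logic runs as MapReduce/Pig jobs over catalog data; the matching rule here (an in-order subsequence over a customer's time-ordered events) is just one plausible choice.

import java.util.List;

public class PatternMatchSketch {
    /** True if `pattern` occurs as an in-order (not necessarily contiguous) subsequence. */
    static boolean matches(List<String> events, List<String> pattern) {
        int i = 0;
        for (String e : events) {
            if (i < pattern.size() && e.equals(pattern.get(i))) i++;
        }
        return i == pattern.size();
    }

    public static void main(String[] args) {
        List<String> events = List.of("Login", "Browse", "Review", "Confirm", "Browse", "Shipment");
        System.out.println(matches(events, List.of("Login", "Review", "Confirm", "Shipment"))); // true
    }
}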
Each company uses data in its own ways. Here are just some of the ways in which PayPal leverages its big data.
Here’s how the system works today; the design is evolving quickly to keep up with all of the possible uses for it.
The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read the input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw fields to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for propagating keys across the data to effectively create joins at scale.
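Read literally, that note suggests something like the following hedged sketch of the object-factory MapReduce job; the column mapping and the key=value output format are assumptions, since in EAP the mapping comes from metadata.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ObjectFactoryMapper extends Mapper<LongWritable, Text, Text, Text> {
    // raw column index -> object field name (would come from metadata in EAP)
    private static final Map<Integer, String> MAPPING =
        Map.of(0, "timestamp", 1, "event", 2, "session_id", 3, "customer_id");

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] raw = value.toString().split(",", -1);   // CSV input, per the note
        StringBuilder obj = new StringBuilder();
        for (int i = 0; i < raw.length; i++) {
            String field = MAPPING.get(i);
            if (field != null) obj.append(field).append('=').append(raw[i]).append(';');
        }
        // Key by session so downstream jobs can group a session's events together.
        ctx.write(new Text(raw.length > 2 ? raw[2] : ""), new Text(obj.toString()));
    }
}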
The processing subsystem is where we expect most usage: it’s where problem solvers across the company set up their own use cases using our modules that operate on the object library.