27th June, 2013
Accelerating Behavioral Analytics at PayPal
@Hadoop Summit 2013
DATA | PLATFORM - EVENT ANALYTICS PLATFORM
Confidential and Proprietary
Data Platform @PayPal
• Data Anywhere, Anytime, Anyplace!
Components
• Engine
• Workspaces
• Modules
• Access
Teams
• Rahul Bhartia: Product Owner
• Alexei Vassiliev: Technical Lead
INTRODUCTION
A COMMON LANDSCAPE OF DATA
Current business needs…
• Risk: Detecting and preventing fraud
• Marketing: Reaching customers with relevant offers
• Product: Improving user experience for better conversion
• Merchant: Providing insights into their businesses
BEHAVIORAL ANALYTICS: DATA TO USE
From
• Terabytes of raw data
• Clickstream, Transactions & logs
• Millions of rows
To
• Metadata for business view
• Behaviors across channels
• Processors – Flows & Patterns
Diagram labels:
• API: Login
• FRAUD: Review
• AUTH: Confirm
• TXN: Shipment
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
Diagram labels: SOURCES, DATA PARSERS, EVENTS, DEVELOPER
EAP Event Example:
{
  "id": "Impression384923561362839690709",
  "name": "Impression",
  "type": "Clickstream",
  "subtype": "Page",
  "filestream": "FPTI",
  "sourceDataHash": "38492356",
  "timestamp": "1362839690709",
  "creationTimestamp": "1363202683771",
  "updateTimestamp": "1363202715608",
  "attributes": {
    "attr1": "val1",
    "attr2": "val2"
  },
  "entities": {
    "Customer": "12346326326",
    "Session": "3521651326"
  }
}
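The event layout above can be mirrored in a small in-memory model. This is a minimal sketch, not the platform's actual class: the name `EapEventSketch` and its fields are hypothetical, and the id rule (name + sourceDataHash + timestamp) is only an assumption inferred from the sample, where the three values concatenate to the given id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the sample event; not the platform's actual class.
class EapEventSketch {
    final String name;           // e.g. "Impression"
    final String type;           // e.g. "Clickstream"
    final String subtype;        // e.g. "Page"
    final String sourceDataHash; // hash of the raw source record
    final long timestamp;        // event time, epoch milliseconds
    final Map<String, String> attributes = new LinkedHashMap<>();
    final Map<String, String> entities = new LinkedHashMap<>();

    EapEventSketch(String name, String type, String subtype,
                   String sourceDataHash, long timestamp) {
        this.name = name;
        this.type = type;
        this.subtype = subtype;
        this.sourceDataHash = sourceDataHash;
        this.timestamp = timestamp;
    }

    // Assumed rule: id = name + sourceDataHash + timestamp (matches the sample).
    String id() {
        return name + sourceDataHash + timestamp;
    }

    // Rebuilds the sample event, leaving out the fields this sketch does not model.
    static EapEventSketch sample() {
        EapEventSketch e = new EapEventSketch(
                "Impression", "Clickstream", "Page", "38492356", 1362839690709L);
        e.attributes.put("attr1", "val1");
        e.attributes.put("attr2", "val2");
        e.entities.put("Customer", "12346326326");
        e.entities.put("Session", "3521651326");
        return e;
    }

    public static void main(String[] args) {
        EapEventSketch e = sample();
        System.out.println(e.id());
        System.out.println(e.entities);
    }
}
```

Keeping attributes and entities as open maps reflects the "common and extensible" idea: a new source can add keys without changing the event shape.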
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
• Augment, not transform, with metadata
Diagram labels: SOURCES, DATA PARSERS, EVENTS, LIBRARY (TAGS & RELATIONS), DEVELOPER
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
• Augment, not transform, with metadata
• Templates for analytical workflow
Diagram labels: SOURCES, DATA PARSERS, EVENTS, LIBRARY (TAGS & RELATIONS), PROCESSORS, MODULE JOBS, SQL, PathViz (D3), DEVELOPER
BUILDING A WORKFLOW: FINDING PATTERNS
FINDING A USER WORKFLOW
DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
• One common and extensible framework
• Augment, not transform, with metadata
• Templates for analytical workflow
Diagram labels: INPUT, PROCESSING, SOURCES, DATA PARSERS, EVENTS, LIBRARY (TAGS & RELATIONS), PROCESSORS, MODULE JOBS, SQL, PathViz (D3), DEVELOPER
EVENT ANALYTIC PLATFORM (EAP) - ARCHITECTURE
EVENT ANALYTIC PLATFORM (EAP)
Data Ingest
• Metadata driven transformations
• Plug-in parsers
Events
• Common representation as sequence files
Relations
• Pre-Computed
• Map-side joins using HBase
Catalog
• Metadata indexed HDFS repository
Modules
• Common interface for data access
• Simplified logic
INPUT SUBSYSTEM – RAW DATA TO EVENTS
Diagram labels: SOURCES (DELIMITED, SEQUENCE, OTHERS), Mapping & Entities Definition, MapReduce, EVENTS, Data Catalog, HDFS, HBase, Reference Data, Entity Relations
EVENT ANALYTIC PLATFORM (EAP)
Ingest
• Metadata driven transformations
• Plug-in parsers
Events
• Common representation stored in sequence files
Relations
• Enriching the data
• Link events across channels
Modules
• Logical expressions transparent to sources
Library
• Business metadata as tags
ENRICHING THE DATA
Before:
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   ?
1362839790719  pageview  123456567   ?
1362839890729  pageview  123456567   7654321
Entity Resolver (lookup in HBase: Reference Data, Entity Relations)
After:
Timestamp      Event     Session ID  Customer ID
1362839690709  pageview  123456567   7654321
1362839790710  pageview  123456567   7654321
1362839890711  pageview  123456567   7654321
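The enrichment step can be sketched as a two-pass lookup: learn which customer belongs to a session, then fill in the rows where the customer is missing. This is an illustration under stated assumptions only. The class `EntityResolverSketch`, its `Row` type and both methods are hypothetical, and a plain `HashMap` stands in for the Entity Relations lookup that the talk places in HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of entity resolution; a HashMap stands in for HBase.
class EntityResolverSketch {
    // One table row; customerId is null where the slide shows "?".
    static class Row {
        final long timestamp;
        final String event;
        final String sessionId;
        final String customerId;

        Row(long timestamp, String event, String sessionId, String customerId) {
            this.timestamp = timestamp;
            this.event = event;
            this.sessionId = sessionId;
            this.customerId = customerId;
        }
    }

    // Pass 1: record session -> customer links from rows that carry both ids.
    static Map<String, String> buildRelations(List<Row> rows) {
        Map<String, String> sessionToCustomer = new HashMap<>();
        for (Row r : rows) {
            if (r.customerId != null) {
                sessionToCustomer.put(r.sessionId, r.customerId);
            }
        }
        return sessionToCustomer;
    }

    // Pass 2: fill missing customer ids by session; unknown sessions stay null.
    static List<Row> enrich(List<Row> rows, Map<String, String> relations) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            String customer = r.customerId != null
                    ? r.customerId
                    : relations.get(r.sessionId);
            out.add(new Row(r.timestamp, r.event, r.sessionId, customer));
        }
        return out;
    }

    // The three "before" rows from the slide.
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row(1362839690709L, "pageview", "123456567", null),
                new Row(1362839790719L, "pageview", "123456567", null),
                new Row(1362839890729L, "pageview", "123456567", "7654321"));
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows();
        for (Row r : enrich(rows, buildRelations(rows))) {
            System.out.println(r.timestamp + " " + r.event + " "
                    + r.sessionId + " " + r.customerId);
        }
    }
}
```

Because the relation table is built ahead of time, each event can be enriched independently, which is what makes a map-side join possible.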
EVENT ANALYTIC PLATFORM (EAP)
Data Ingest
• Metadata driven transformations
• Plug-in parsers
Events
• Common representation stored in sequence files
Relations
• Enriching the data
• Link events across channels
Catalog
• Indexed access to all the Event data
Modules
• Common interface for data access
• Simplified logic
DATA CATALOG: ACCESS TO THE EVENTS
Diagram labels: Data Catalog API, HBase, METADATA, Sequence Files, HDFS, MapReduce/Pig, PROCESS
ACCESSING DATA
• PIG
REGISTER 'EventEngine.jar';
EVENTDATA = LOAD 'eap://event' USING
    com.paypal.eap.EventLoader(Time, Source, Events, Attributes, Entities);
….
• MR
// catalog.get takes (Calendar startDate, Calendar endDate, String type)
Set<Path> paths = catalog.get(startDate, endDate, type);
….
class FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> { …. }
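The catalog lookup by date range and event type can be sketched in memory. This is a sketch under assumptions, not the platform's API: only the shape of `get(startDate, endDate, type)` comes from the slide, while `CatalogSketch`, `register`, the directory layout and the use of `java.nio.file.Path` in place of Hadoop's `Path` are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory catalog; the real index is described as HBase-backed.
class CatalogSketch {
    private static class Entry {
        final Calendar day;
        final String type;
        final Path path;

        Entry(Calendar day, String type, Path path) {
            this.day = day;
            this.type = type;
            this.path = path;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Records that events of the given type for the given day live at path.
    void register(Calendar day, String type, Path path) {
        entries.add(new Entry(day, type, path));
    }

    // Returns paths whose day falls in [startDate, endDate] and whose type matches.
    Set<Path> get(final Calendar startDate, final Calendar endDate, final String type) {
        Set<Path> paths = new LinkedHashSet<>();
        for (Entry e : entries) {
            boolean inRange = !e.day.before(startDate) && !e.day.after(endDate);
            if (inRange && e.type.equals(type)) {
                paths.add(e.path);
            }
        }
        return paths;
    }

    // Sample catalog with an assumed events/<type>/<day> layout.
    static CatalogSketch sample() {
        CatalogSketch c = new CatalogSketch();
        c.register(new GregorianCalendar(2013, Calendar.JUNE, 25), "Clickstream",
                Paths.get("events", "Clickstream", "2013-06-25"));
        c.register(new GregorianCalendar(2013, Calendar.JUNE, 26), "Clickstream",
                Paths.get("events", "Clickstream", "2013-06-26"));
        c.register(new GregorianCalendar(2013, Calendar.JUNE, 26), "Transaction",
                Paths.get("events", "Transaction", "2013-06-26"));
        c.register(new GregorianCalendar(2013, Calendar.JUNE, 27), "Clickstream",
                Paths.get("events", "Clickstream", "2013-06-27"));
        return c;
    }

    public static void main(String[] args) {
        Set<Path> paths = sample().get(
                new GregorianCalendar(2013, Calendar.JUNE, 26),
                new GregorianCalendar(2013, Calendar.JUNE, 27),
                "Clickstream");
        for (Path p : paths) {
            System.out.println(p);
        }
    }
}
```

A job would pass the returned paths to its input format, so it reads only the files the catalog selects rather than scanning the whole repository.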
EVENT ANALYTIC PLATFORM (EAP): SUMMARY
Data Ingest
• Metadata driven transformations
• Plug-in parsers
Events
• Common representation stored in sequence files
Relations
• Enriching the data
• Link across channels
Catalog
• Indexed access to all the Event data
Modules
• Simplified logic for Event processing
PROCESSING SUBSYSTEM – EVENTS TO INFORMATION
Diagram labels: LIBRARY (Path Discovery, Pattern Matching, Event Metrics), Invoke, MODULE JOBS, MR, PIG, Load, Data Catalog, HDFS
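The slides name a Path Discovery module but do not describe how it works. One common way to do path discovery is to group events by session, order each group by timestamp, and count how often each resulting sequence occurs. The sketch below shows that approach only as an assumption about what such a module might do. `PathDiscoverySketch`, its `Event` type and the sample sessions are hypothetical; the event names are borrowed from the Login, Review, Confirm, Shipment labels earlier in the deck.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine sketch; the talk runs module jobs on MR or Pig.
class PathDiscoverySketch {
    static class Event {
        final String session;
        final long timestamp;
        final String name;

        Event(String session, long timestamp, String name) {
            this.session = session;
            this.timestamp = timestamp;
            this.name = name;
        }
    }

    // Returns how many sessions followed each ordered sequence of event names.
    static Map<String, Integer> countPaths(List<Event> events) {
        // Group events by session (the "shuffle" step in a MapReduce job).
        Map<String, List<Event>> bySession = new TreeMap<>();
        for (Event e : events) {
            if (!bySession.containsKey(e.session)) {
                bySession.put(e.session, new ArrayList<Event>());
            }
            bySession.get(e.session).add(e);
        }
        // Order each session by time and count the resulting path.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Event> sessionEvents : bySession.values()) {
            sessionEvents.sort(Comparator.comparingLong((Event ev) -> ev.timestamp));
            List<String> names = new ArrayList<>();
            for (Event e : sessionEvents) {
                names.add(e.name);
            }
            String path = String.join(" > ", names);
            Integer seen = counts.get(path);
            counts.put(path, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    static List<Event> sampleEvents() {
        return Arrays.asList(
                new Event("s1", 1L, "Login"), new Event("s1", 2L, "Review"),
                new Event("s1", 3L, "Confirm"), new Event("s1", 4L, "Shipment"),
                new Event("s2", 1L, "Login"), new Event("s2", 2L, "Confirm"),
                // s3 arrives out of order to show that sorting restores the flow.
                new Event("s3", 4L, "Shipment"), new Event("s3", 1L, "Login"),
                new Event("s3", 3L, "Confirm"), new Event("s3", 2L, "Review"));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : countPaths(sampleEvents()).entrySet()) {
            System.out.println(e.getValue() + "  " + e.getKey());
        }
    }
}
```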
A QUICK LOOK: NUMBERS
EVENT ANALYTICS PLATFORM (EAP): METRICS
Cluster
• Exploratory cluster: 600+ nodes
• Production cluster: 600+ nodes
Data (Current)
• Daily (2): 300+ GB
• Hourly (1): 50+ GB
• Growing every day with more sources
Processing (Sample)
• Time: 15 min
• Events: 100+ M
• Entity (HBase): 20M
Jobs (User)
• Flows: 5 min / 40+ M
• Extract: 10 min / 200+ M
THANK YOU

HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

EAP - Accelerating behavioral analytics at PayPal using Hadoop

  • 1. 27th June, 2013 Accelerating Behavioral Analytics at PayPal @Hadoop Summit 2013 DATA | PLATFORM - EVENT ANALYTICS PLATFORM
  • 2. INTRODUCTION. Data Platform @PayPal: Data Anywhere, Anytime, Anyplace! Components: Engine, Workspaces, Modules, Access. Teams: Rahul Bhartia (Product Owner), Alexei Vassiliev (Technical Lead).
  • 3. A COMMON LANDSCAPE OF DATA
  • 4. Current business needs… Risk: detecting and preventing fraud. Marketing: reaching customers with relevant offers. Product: improving user experience for better conversion. Merchant: providing insights into their businesses.
  • 5. BEHAVIORAL ANALYTICS: DATA TO USE. From: terabytes of raw data; clickstream, transactions & logs; millions of rows. To: metadata for business view; behaviors across channels; Processors – Flows & Patterns. • API Login • FRAUD Review • AUTH Confirm • TXN Shipment
  • 6. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES DEVELOPER EVENTS DATA PARSERS. Principles: One common and extensible framework.
  • 7. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  • 8. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. EAP Event Example:
      {
        "id": "Impression384923561362839690709",
        "name": "Impression",
        "type": "Clickstream",
        "subtype": "Page",
        "filestream": "FPTI",
        "sourceDataHash": "38492356",
        "timestamp": "1362839690709",
        "creationTimestamp": "1363202683771",
        "updateTimestamp": "1363202715608",
        "attributes": {
          "attr1": "val1",
          "attr2": "val2"
        },
        "entities": {
          "Customer": "12346326326",
          "Session": "3521651326"
        }
      }
  • 9. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES DEVELOPER EVENTS LIBRARY TAGS & RELATIONS DATA PARSERS. Principles: One common and extensible framework; Augment, not transform, with metadata.
  • 10. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT
  • 11. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES DEVELOPER EVENTS MODULE JOBS PathViz (D3) LIBRARY TAGS & RELATIONS DATA PARSERS PROCESSORS SQL. Principles: One common and extensible framework; Augment, not transform, with metadata; Templates for analytical workflow.
  • 12. BUILDING A WORKFLOW: FINDING PATTERNS
  • 14. DATA PLATFORM: DESIGN PRINCIPLES TO A BLUEPRINT. Diagram labels: SOURCES DEVELOPER EVENTS MODULE JOBS PathViz (D3) LIBRARY TAGS & RELATIONS INPUT PROCESSING DATA PARSERS PROCESSORS SQL. Principles: One common and extensible framework; Augment, not transform, with metadata; Templates for analytical workflow.
  • 15. EVENT ANALYTIC PLATFORM (EAP) - ARCHITECTURE
  • 16. EVENT ANALYTIC PLATFORM (EAP). EAP Data Ingest: metadata-driven transformations; plug-in parsers. Events: common representation as sequence files. Relations: pre-computed; map-side joins using HBase. Catalog: metadata-indexed HDFS repository. Modules: common interface for data access; simplified logic.
  • 17. INPUT SUBSYSTEM – RAW DATA TO EVENTS. Diagram labels: SOURCES EVENTS DELIMITED SEQUENCE OTHERS Mapping & Entities Definition MapReduce Data Catalog HDFS HBase Reference Data Entity Relations
  • 18. EVENT ANALYTIC PLATFORM (EAP). EAP Ingest: metadata-driven transformations; plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data; link events across channels. Modules: logical expressions transparent to sources. Library: business metadata as tags.
  • 19. ENRICHING THE DATA
      Timestamp      Event     Session ID  Customer ID
      1362839690709  pageview  123456567   ?
      1362839790719  pageview  123456567   ?
      1362839890729  pageview  123456567   7654321

      Timestamp      Event     Session ID  Customer ID
      1362839690709  pageview  123456567   7654321
      1362839790710  pageview  123456567   7654321
      1362839890711  pageview  123456567   7654321
      Diagram labels: Reference Data Entity Relations Lookup HBASE Entity Resolver
  • 20. EVENT ANALYTIC PLATFORM (EAP). EAP Data Ingest: metadata-driven transformations; plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data; link events across channels. Catalog: indexed access to all the Event data. Modules: common interface for data access; simplified logic.
  • 21. DATA CATALOG: ACCESS TO THE EVENTS. Diagram labels: Data Catalog API HBASE Sequence Files MapReduce/Pig HDFS PROCESS METADATA
  • 22. ACCESSING DATA
      • PIG
          REGISTER 'EventEngine.jar';
          EVENTDATA = LOAD 'eap://event' USING com.paypal.eap.EventLoader(Time, Source, Events, Attributes, Entities);
          ….
      • MR
          Set<Path> paths = catalog.get(final Calendar startDate, final Calendar endDate, final String type)
          ….
          FlowMapper extends Mapper<Key, Event, OutputKey, OutputValue> {
  • 23. EVENT ANALYTIC PLATFORM (EAP): SUMMARY. EAP Data Ingest: metadata-driven transformations; plug-in parsers. Events: common representation stored in sequence files. Relations: enriching the data; link across channels. Catalog: indexed access to all the Event data. Modules: simplified logic for Event processing.
  • 24. PROCESSING SUBSYSTEM – EVENTS TO INFORMATION. Diagram labels: MODULE JOBS LIBRARY Path Discovery Pattern Matching Event Metrics Invoke Load Data Catalog HDFS MR PIG
  • 25. A QUICK LOOK: NUMBERS
  • 26. EVENT ANALYTICS PLATFORM (EAP): METRICS. Cluster: exploratory cluster 600+ nodes; production cluster 600+ nodes. Data (Current): Daily (2): 300+ GB; Hourly (1): 50+ GB; growing every day with more sources. Processing (Sample): Time: 15 min; Events: 100+ M; Entity (HBase): 20 M. Jobs (User): Flows: 5 min / 40+ M; Extract: 10 min / 200+ M.
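
The enrichment shown on slide 19 (page views in one session that lack a customer ID receive the ID seen later in the same session) can be sketched as below. This is a minimal illustration under stated assumptions, not the platform's implementation: the names PageEvent, SessionResolver, and EnrichDemo are hypothetical, and an in-memory HashMap stands in for the HBase entity-relations lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified event holding only the columns from the slide 19 table.
final class PageEvent {
    final long timestamp;
    final String name;
    final String sessionId;
    final String customerId; // null stands for the "?" in the table

    PageEvent(long timestamp, String name, String sessionId, String customerId) {
        this.timestamp = timestamp;
        this.name = name;
        this.sessionId = sessionId;
        this.customerId = customerId;
    }

    // Returns a copy with the customer filled in; all other fields are untouched.
    PageEvent withCustomer(String resolvedCustomerId) {
        return new PageEvent(timestamp, name, sessionId, resolvedCustomerId);
    }
}

// Assumption: an in-memory map replaces the HBase entity-relations table.
final class SessionResolver {
    private final Map<String, String> sessionToCustomer = new HashMap<>();

    // Pass 1: record Session -> Customer relations from events that carry both keys.
    void learn(List<PageEvent> events) {
        for (PageEvent e : events) {
            if (e.customerId != null) {
                sessionToCustomer.put(e.sessionId, e.customerId);
            }
        }
    }

    // Pass 2: fill in a missing customer ID when the session is known.
    List<PageEvent> enrich(List<PageEvent> events) {
        List<PageEvent> out = new ArrayList<>();
        for (PageEvent e : events) {
            String known = sessionToCustomer.get(e.sessionId);
            out.add(e.customerId == null && known != null ? e.withCustomer(known) : e);
        }
        return out;
    }
}

final class EnrichDemo {
    // The three input rows from the first table on slide 19.
    static List<PageEvent> sample() {
        return Arrays.asList(
            new PageEvent(1362839690709L, "pageview", "123456567", null),
            new PageEvent(1362839790719L, "pageview", "123456567", null),
            new PageEvent(1362839890729L, "pageview", "123456567", "7654321"));
    }

    public static void main(String[] args) {
        SessionResolver resolver = new SessionResolver();
        List<PageEvent> events = sample();
        resolver.learn(events);
        for (PageEvent e : resolver.enrich(events)) {
            System.out.println(e.timestamp + " " + e.name + " " + e.sessionId + " " + e.customerId);
        }
    }
}
```

The two passes mirror the idea of pre-computed relations: the lookup is built first, so filling in the key during processing is a simple map read rather than a join.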

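The catalog call on slide 22, catalog.get(startDate, endDate, type), returns the set of paths a job should read. A rough sketch of such a lookup follows. Everything beyond that call shape is an assumption made for illustration: the EventCatalog class, the register method, the /eap/events/<type>/<yyyy>/<MM>/<dd> layout, and the use of java.nio.file.Path in place of Hadoop's Path so the example stays self-contained.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical catalog: an index of which (type, day) partitions exist.
final class EventCatalog {
    private final Set<String> indexed = new HashSet<>();

    // Records that event data of the given type exists for the given day.
    void register(String type, Calendar day) {
        indexed.add(key(type, day));
    }

    // Returns the indexed daily paths for the type within [startDate, endDate].
    Set<Path> get(final Calendar startDate, final Calendar endDate, final String type) {
        Set<Path> paths = new LinkedHashSet<>();
        Calendar day = (Calendar) startDate.clone(); // do not mutate the caller's calendar
        while (!day.after(endDate)) {
            String k = key(type, day);
            if (indexed.contains(k)) {
                paths.add(Paths.get("/eap/events", k)); // assumed directory layout
            }
            day.add(Calendar.DAY_OF_MONTH, 1);
        }
        return paths;
    }

    private static String key(String type, Calendar day) {
        return String.format(Locale.ROOT, "%s/%04d/%02d/%02d", type,
            day.get(Calendar.YEAR), day.get(Calendar.MONTH) + 1, day.get(Calendar.DAY_OF_MONTH));
    }

    public static void main(String[] args) {
        EventCatalog catalog = new EventCatalog();
        catalog.register("Clickstream", new GregorianCalendar(2013, Calendar.MARCH, 9));
        catalog.register("Clickstream", new GregorianCalendar(2013, Calendar.MARCH, 11));
        Set<Path> paths = catalog.get(
            new GregorianCalendar(2013, Calendar.MARCH, 9),
            new GregorianCalendar(2013, Calendar.MARCH, 10),
            "Clickstream");
        System.out.println(paths);
    }
}
```

Because only indexed partitions are returned, a job never has to list directories itself; it asks the catalog and receives just the inputs that exist for the requested window.
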
Editor's Notes

  1. Each company uses data in its own ways. Here are just some of the ways in which PayPal leverages its big data.
  2. Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  3. The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw data to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for promoting keys across the data to effectively create joins at scale.
  4. Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  5. Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  6. Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  7. Here’s the way the system works today, and the design is growing fast to keep up with all of the possible uses for it.
  8. The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw data to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for promoting keys across the data to effectively create joins at scale.
  9. The input subsystem is primarily focused on getting raw data into an analytical “object library”. Filestreams read input data (CSV, tabular, GoldenGate, Hadoop sequence files). The object factory is more of a concept: in practice, it’s a scheduled MapReduce job that calls a filestream reader, accepts an input mapping from raw data to defined objects, and emits the analytical data for downstream logic. Our relationship system (called the “resolver” internally) is responsible for promoting keys across the data to effectively create joins at scale.
  10. The processing subsystem is where we expect most usage. It’s where problem solvers across the company set up their own use cases using our modules, which operate on the object library.
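
The notes on the input subsystem describe a reader that takes a raw record plus an input mapping and emits a defined object. A small sketch of that idea for one delimited record follows. The names FieldMapping, EapEvent, and DelimitedParser are hypothetical, and two details are assumptions rather than documented behavior: the hash is simply derived from the raw line, and the id is built as name + sourceDataHash + timestamp, a pattern inferred from the event example on slide 8.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical mapping entry: which raw column feeds which event field.
final class FieldMapping {
    final int column;
    final String target;
    final boolean isEntity; // true: stored under entities; false: under attributes

    FieldMapping(int column, String target, boolean isEntity) {
        this.column = column;
        this.target = target;
        this.isEntity = isEntity;
    }
}

// Simplified event with the same top-level shape as the slide 8 example.
final class EapEvent {
    final String id;
    final String name;
    final long timestamp;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final Map<String, String> entities = new LinkedHashMap<>();

    EapEvent(String id, String name, long timestamp) {
        this.id = id;
        this.name = name;
        this.timestamp = timestamp;
    }
}

final class DelimitedParser {
    // Parses one delimited record into an event, driven entirely by the mapping.
    static EapEvent parse(String line, String delimiter, String eventName,
                          int timestampColumn, List<FieldMapping> mappings) {
        String[] cols = line.split(Pattern.quote(delimiter), -1);
        long timestamp = Long.parseLong(cols[timestampColumn].trim());
        // Assumption: hash of the raw record as an unsigned 32-bit number.
        String sourceDataHash = Long.toString(line.hashCode() & 0xffffffffL);
        // Assumption inferred from slide 8: id = name + sourceDataHash + timestamp.
        EapEvent event = new EapEvent(eventName + sourceDataHash + timestamp, eventName, timestamp);
        for (FieldMapping m : mappings) {
            String value = cols[m.column].trim();
            (m.isEntity ? event.entities : event.attributes).put(m.target, value);
        }
        return event;
    }

    public static void main(String[] args) {
        List<FieldMapping> mapping = Arrays.asList(
            new FieldMapping(1, "Session", true),
            new FieldMapping(2, "page", false));
        EapEvent e = parse("1362839690709,3521651326,home", ",", "Impression", 0, mapping);
        System.out.println(e.id + " " + e.entities + " " + e.attributes);
    }
}
```

Keeping the column-to-field mapping as data rather than code is what lets a new source be added by supplying a mapping and, where needed, a parser plug-in, without touching downstream modules.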