This document summarizes PayPal's risk platform architecture. It discusses how PayPal processes 6.1 billion payments per year and how its risk data access layer (DAL) service moved to an asynchronous architecture. Compared with the synchronous approach, the async solution improved latency and throughput and reduced CPU and memory usage. Future plans include further optimizing the async DAL service, RPC, and in-memory data access. The goal is to process payments at tremendous scale with low latency and low system load through an event-driven, highly reusable architecture.
2. 2017 Software Architecture Summit
AGENDA
PayPal & PayPal Risk (Platform)
Risk DAL Service Challenge
Async Solution
Async Future Plan
4. 2017 Software Architecture Summit
PayPal Overview
PayPal operates one of the largest online payment platforms in the world
• TPV/day: ~1 billion
• Payments/year: 6.1 billion
• Computation/day: ~20 billion
• Active customer accounts: 210M
• Petabytes of data: 105
• Queries/day: 250 billion
• Loss rate: 0.32%
The power of our platform
Our technology transformation enables us to:
• Process payments at tremendous scale (200+ countries & 25 currencies supported)
• Accelerate the innovation of new products
• Engage world-class developers & technologists
5. 2017 Software Architecture Summit
PayPal Risk KPI
PayPal operates one of the largest online payment platforms in the world
• TPV: +354 billion
• Payments/year: 6.1 billion
• Payments/second at peak: 1.8B
• Active customer accounts: 210M
• Petabytes of data: 73
• Database calls/quarter: 4.5T
• Loss rate: 0.32%
The power of our platform
Our technology transformation enables us to:
• Process payments at tremendous scale (200+ countries & 25 currencies supported)
• Accelerate the innovation of new products
• Engage world-class developers & technologists
Payments transactions
6. Requirement for Risk Platform
• Accuracy vs Latency
• Low Latency + Hardware Investment vs Large Throughput
7. 2017 Software Architecture Summit
PayPal Risk Platform Architecture
[Architecture diagram]
Online services: Gateway Service, Decision Service, Model + Variable Computation Service, DAL Service, Variable Rollup Service
Online data: Real-time Compute Data, Offline Generated Data
Paths: Read Path, Write Path
Logging System / ETL
Offline services: Offline Variable Simulation Platform, Model Training Platform, Offline Variable Aggregation Service
Offline data: Offline Generated Data, Simulated Real-time Data
8. 2017 Software Architecture Summit
PayPal Risk Platform Architecture
[Architecture diagram]
Online services: Gateway Service, Decision Service, Model + Variable Computation Service, DAL Service, Variable Aggregation Service
Online data: Offline Generated Data, Real-time Compute Data
Paths: Read Path, Write Path
Logging System / ETL
Offline services: Offline Variable Simulation Platform, Model Training Platform, Offline Variable Aggregation Service
Offline data: Offline Generated Data, Simulated Real-time Data
10. DAL Service Ultimate Questions
JVM-Based High Performance & ATB DAL Service
• <100ms P99.99 Latency??
• For a single instance, 20k-30k Peak TPS??
• 99.99% Availability-To-Business (ATB)??
11. DAL Service Technical Challenges
Budget Cost
• Hardware investment increases exponentially to keep pace with traffic
Performance Issue
• P99 latency differs significantly from average latency
• Too many latency spikes under traffic
• Storage cluster unavailability impacts latency
Customer Requirement
• Adopt new use cases
• Access behavior differs per colo
• Flexibility & fast-evolving use cases
• Replication
• Traffic strategy
Operational Cost
• Maintaining too many clients with multiple versions
• Too-frequent releases tied to business cases
• Standby storage cluster switch-over
[Diagram labels: Req, Tech, Value, Cost]
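The gap between P99 and average latency noted above can be illustrated with a small sketch. The sample data here is synthetic and purely illustrative (98 fast calls and 2 slow ones), and the class and method names are assumptions, not part of the DAL service. It uses the nearest-rank percentile definition.

```java
import java.util.Arrays;

final class LatencyStats {
    // Mean latency over all samples.
    static double average(long[] latenciesMs) {
        return Arrays.stream(latenciesMs).average().orElse(0.0);
    }

    // Nearest-rank percentile: smallest sample such that at least p% of samples are <= it.
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Synthetic sample (assumption): 98 calls at 5 ms, 2 spikes at 400 ms.
    static long[] sample() {
        long[] s = new long[100];
        Arrays.fill(s, 5L);
        s[98] = 400L;
        s[99] = 400L;
        return s;
    }

    public static void main(String[] args) {
        long[] s = sample();
        System.out.println("avg ms = " + average(s));
        System.out.println("p99 ms = " + percentile(s, 99.0));
    }
}
```

With this sample the average stays at 12.9 ms while P99 is 400 ms, so a handful of spikes are nearly invisible in the average yet dominate the tail.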
13. 2017 Software Architecture Summit
Async Original Benefit
• More Efficient Thread Scheduling
• Non-blocking Call
• Event-Driven Callback
• Less Context Switch
• Fault Isolation
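A minimal sketch of the non-blocking call and event-driven callback listed above, using the JDK's CompletableFuture. The asyncGet method is a hypothetical stand-in for a storage client (a timer completes the future after 20 ms), not the actual DAL client API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class AsyncCallDemo {
    // Hypothetical store call: returns immediately; the future completes later on a timer thread.
    static CompletableFuture<String> asyncGet(ScheduledExecutorService timer, String key) {
        CompletableFuture<String> result = new CompletableFuture<>();
        timer.schedule(() -> { result.complete("value-of-" + key); }, 20, TimeUnit.MILLISECONDS);
        return result;
    }

    static String lookup(String key) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> reply = asyncGet(timer, key)
                    // Callback runs when the response event arrives; no thread waits for it.
                    .thenApply(value -> key + "=" + value)
                    // Fault isolation: a failed call becomes a fallback value for this request only.
                    .exceptionally(err -> "FALLBACK");
            // join() blocks only so the demo can return a result; a real async server
            // would hand the future to its RPC layer instead.
            return reply.join();
        } finally {
            timer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup("k1"));
    }
}
```

The calling thread is free between issuing the request and the callback firing, which is where the thread-scheduling and context-switch savings come from.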
16. 2017 Software Architecture Summit
Async DAL Service KPI Comparison – Cont.
• High Throughput
• 3-10X Increase (Single Instance Comparison)
17. 2017 Software Architecture Summit
Async DAL Service KPI Comparison – Cont.
• Less CPU Usage
• 50% CPU Usage Reduction
• 66%+ Reduction for Context Switch & System Interrupts
18. 2017 Software Architecture Summit
Async DAL Service KPI Comparison – Cont.
• Fewer Threads
• 90% Reduction in Thread Count
Thread Number Comparison (Async vs Sync)
• Server RPC Thread: Async 9, Sync 200
• Operation Thread: Async 0, Sync 14
• Replication Thread: Async 0, Sync 40
• Management Thread: Async 2, Sync 2
19. Async DAL Service KPI Comparison – Cont.
• Memory Friendly
• 20% Reduction for Memory Allocation
• 100+MB Young Generation after Young GC
• 130+MB Pooled Off-heap
[Chart: GC Time / Total Time, Sync vs Async (y-axis 0.00% to 0.07%)]
[Chart: GC Count, Sync vs Async (y-axis 0 to 350)]
20. We Have ONE Async Dream
• Change the Application's Character from CPU-bound to IO-bound
• Traffic Throughput Grows (non-)linearly with CPU Usage
• While Guaranteeing Low Latency, Handle 20-30K TPS with a 500MB JVM Heap (After Young GC)
• Cloud Friendly Application
• Less Hardware Investment
• Low Operational Cost
• Easy Capacity Estimation
21. High Performance Design
E2E Async
• Non-blocking Pipeline: Async RPC + Async DataAccess
Less is More
• Shared ThreadPool over Separate ThreadPools
• Inline Execution over Execution across Multiple Thread Pools
Autonomous Memory Management
• Use Off-Heap as much as possible (inbound/outbound & [de]serialization)
• Release Inbound Memory at an earlier stage (submitRequest)
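The off-heap idea above can be sketched with pooled direct buffers from java.nio. This is an illustrative assumption about how such a pool might look, not the DAL service's implementation; the DirectBufferPool class and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

final class DirectBufferPool {
    private final ArrayBlockingQueue<ByteBuffer> free;

    // Allocate all off-heap buffers up front so the hot path never allocates.
    DirectBufferPool(int buffers, int bufferSize) {
        this.free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Borrow a buffer for an inbound or outbound payload; fail fast when exhausted.
    ByteBuffer acquire() {
        ByteBuffer buffer = free.poll();
        if (buffer == null) {
            throw new IllegalStateException("buffer pool exhausted");
        }
        buffer.clear();
        return buffer;
    }

    // Return the buffer as soon as the request bytes are no longer needed.
    void release(ByteBuffer buffer) {
        free.offer(buffer);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        DirectBufferPool pool = new DirectBufferPool(2, 64);
        ByteBuffer inbound = pool.acquire();
        inbound.put((byte) 42);
        pool.release(inbound); // released early, e.g. right after the request is submitted
        System.out.println("available = " + pool.available());
    }
}
```

Because the buffers live outside the Java heap and are reused, payload bytes add little to young-generation allocation, which is the effect the memory figures in this deck point at.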
22. High Performance Good Practice
KPI Sign-Off
• Performance Test as Critical Path for Each Commit
• [Mandatory] Continuous Performance Test for Each Commit
Inbound/Outbound Management
• Batch Consolidation
• Order Management
• Timeout Management
• Retries Happen Only on the Client Side
Programming Habit
• Fast Fail over Cascading Thrown Exceptions
• Logging & Monitoring Matter
• Thread-safe Write Operations in the Control Plane, Exception-safe Read Operations in the Data Plane
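Timeout management and fast fail, as listed above, could look roughly like the following sketch. It assumes Java 9+ for CompletableFuture.orTimeout; the TimeoutDemo class and the "TIMEOUT"/"ERROR" status strings are hypothetical. A failure is mapped to a status immediately instead of being rethrown through every layer, and the server does not retry.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutDemo {
    // Bound the wait for a store reply and convert any failure into a status value.
    static String readWithTimeout(CompletableFuture<String> storeReply, long timeoutMs) {
        return storeReply
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS) // Java 9+
                .exceptionally(err -> {
                    // Unwrap CompletionException if an intermediate stage wrapped the cause.
                    Throwable cause = (err instanceof CompletionException && err.getCause() != null)
                            ? err.getCause() : err;
                    // No server-side retry: report the outcome and let the client decide.
                    return (cause instanceof TimeoutException) ? "TIMEOUT" : "ERROR";
                })
                .join();
    }

    public static void main(String[] args) {
        // A reply that never arrives times out after 50 ms.
        System.out.println(readWithTimeout(new CompletableFuture<>(), 50));
        // A reply that is already available passes through unchanged.
        System.out.println(readWithTimeout(CompletableFuture.completedFuture("v"), 50));
    }
}
```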
23. Async High Level Architecture
[Architecture diagram]
Clients: Real Time Data Service; Data Set Clients (Data Set 1 Client ... Data Set N Client); Data Set Schema
APIs: Data Access API, Metadata API, Generic Configuration API, KV Store API
Transport: HTTP(s) RPC Client, HTTP(s) RPC Server
KV Store (Store/Cache): Metadata namespace, Data set namespace, Configuration namespace
Labels: Client, Server, Biz logic, Generic logic, Schema-less, Read, Direct access, Service access
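To make the idea of a schema-less KV Store API with separate namespaces more concrete, here is a rough sketch. The KvStoreApi interface, its method signatures, and the in-memory implementation are all hypothetical illustrations; the slides do not show the real API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical async, schema-less API: values are opaque bytes addressed by namespace + key.
interface KvStoreApi {
    CompletableFuture<Optional<byte[]>> get(String namespace, String key);

    CompletableFuture<Void> put(String namespace, String key, byte[] value);
}

// In-memory stand-in for the real store, used only to make the sketch runnable.
final class InMemoryKvStore implements KvStoreApi {
    private final Map<String, Map<String, byte[]>> namespaces = new ConcurrentHashMap<>();

    @Override
    public CompletableFuture<Optional<byte[]>> get(String namespace, String key) {
        Map<String, byte[]> ns = namespaces.get(namespace);
        return CompletableFuture.completedFuture(
                Optional.ofNullable(ns == null ? null : ns.get(key)));
    }

    @Override
    public CompletableFuture<Void> put(String namespace, String key, byte[] value) {
        namespaces.computeIfAbsent(namespace, n -> new ConcurrentHashMap<>()).put(key, value);
        return CompletableFuture.completedFuture(null);
    }
}
```

Keeping the server-side interface this generic is one way data set clients could add business logic and schemas on top without requiring a server release, in line with the split between biz logic and generic logic in the diagram.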
DAL Service:
Control Connection Pool
Centralized Control & High Reusability (easy storage migration / non-backward-compatible migration, throttling, ACL control) => Minimize Client Upgrade & Integration Effort
Seamless storage switch & upgrade
GC issue
Lock Contention (non-blocking)
Threading switch & context switch
IO Blocking
cache line refresh/cache miss
IPC (instructions per cycle)
Use case:
TTL/timeout
ACL
Replication
Traffic strategy
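Leverage OS-supported event-driven notification: Windows IOCP, Linux epoll, and macOS kqueue
Spend CPU cycles only on inbound & outbound handling
Short-lived thread tasks for better thread usage
Client threads are not blocked waiting for the downstream storage response, which reduces the impact on client system resource usage
Epoll does not perform the I/O itself: it only tells you that the channel is currently readable or writable, with the protocol read/write buffers filled, and leaves the actual reads and writes under the user's control, so at that point we can perform many additional operations. IOCP, by contrast, completes the read/write operations on the I/O channel before notifying the user, so when the I/O channel blocks or hits a similar condition we have no control over it.
Async fits the platform and framework level; business logic cannot easily adopt the async pattern
Use off-heap: schema-less for inbound & outbound
Release request memory early: retries won't happen in the DAL service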
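The readiness model described above (the OS reports that a channel is readable, and the application performs the read itself) can be sketched with java.nio's Selector, which the JDK typically backs with epoll on Linux and kqueue on macOS. The ReadinessDemo class is hypothetical, and an in-process Pipe stands in for a network connection so the example runs anywhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

final class ReadinessDemo {
    // Writes a small payload into a pipe, then reads it back only when the selector says it is readable.
    static String readWhenReady(String payload) {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink()) {
                source.configureBlocking(false);
                source.register(selector, SelectionKey.OP_READ);

                byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
                ByteBuffer out = ByteBuffer.wrap(bytes);
                while (out.hasRemaining()) {
                    sink.write(out); // small payload assumed, fits in the pipe buffer
                }

                ByteBuffer in = ByteBuffer.allocate(bytes.length);
                while (in.hasRemaining()) {
                    // The selector only reports readiness; it does not read for us.
                    if (selector.select(1000) == 0) {
                        throw new IOException("no readiness event within 1s");
                    }
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isReadable()) {
                            ((Pipe.SourceChannel) key.channel()).read(in); // application-controlled read
                        }
                    }
                    selector.selectedKeys().clear();
                }
                return new String(in.array(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWhenReady("ping"));
    }
}
```

Because the read happens in application code after the notification, the application keeps control over when and how much to read, which is the contrast with completion-style IOCP drawn in the note above.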
Aerospike:
High write performance & specific optimization for SSD => 1M TPS with P99 <1ms
DRAM/SSD Hybrid Solution
High ATB & Scalability | Local Replication & XDR
Aerospike VLDB 2016 Paper
Batch & Retry
Traffic Routing & HA
ACL & Multi-Tenancy