Scale By The Bay | 2020 | Gimel

Conquering All Stores with Gimel
Vladimir Bacvanski
Anisha Nainani
Deepak Chandramouli

About us
Vladimir Bacvanski
vbacvanski@paypal.com
Twitter: @OnSoftware
• Principal Architect, Strategic Architecture at PayPal
• In previous life: CTO of a development and consulting firm
• PhD in Computer Science from RWTH Aachen, Germany
• O’Reilly author: Courses on Big Data, Kafka
Deepak Chandramouli
dmohanakumarchan@paypal.com
LinkedIn: @deepakmc
• MT2 Software Engineer, Data Platform Services at PayPal
• Data Enthusiast
• Tech lead
• Gimel (Big Data Framework for Apache Spark)
• Unified Data Catalog – PayPal’s Enterprise Data Catalog
Anisha Nainani
annainani@paypal.com
LinkedIn: @anishanainani
• Senior Software Engineer
• Big Data
• Data Platform Services

AGENDA
• PayPal - Introduction
• Why Gimel
• Gimel + Catalog + Alluxio (Intelligent Query engine)
• Gimel Deep Dive
• What’s next?
• Questions

PayPal – Key Metrics and Analytics Ecosystem

PayPal | Q3-2020 | Key Metrics
https://investor.pypl.com/home/default.aspx

PayPal | Data Growth
• 160+ PB of data
• 200,000+ YARN jobs/day
• One of the largest Aerospike, Teradata, Hortonworks and Oracle installations
• Compute supported: Spark, Hive, MR, BigQuery
• 20+ on-premise clusters
• GPU co-located with Hadoop
• Cloud Migration Adjacencies

PayPal | Data Landscape
[Diagram: PayPal data platform layers]
• Users: Developer, Data scientist, Analyst, Operator
• User Experience and Access: Gimel SDK, Notebooks, UDC, Data API, R Studio, BI tools
• Gimel Data Platform
• Compute Framework and APIs
• Application Lifecycle Management
• Logging, Monitoring, Alerting, Security
• Infrastructure services leveraged for elasticity and redundancy: Multi-DC, Public cloud, Predictive resource allocation

Challenges | Data Access Code | Cumbersome & Fragile
[Code screenshots, for illustration only and not meant to be read: Spark read from HBase, Elasticsearch, Aerospike, and Druid]

Challenges | Data Processing Can Be Multi-Mode & Polyglot
Batch

Challenges with Data App Lifecycle
Each event below triggers the full cycle: Learn → Code → Optimize → Build → Deploy → Run
• Onboarding Big Data Apps
• Compute Version Upgraded
• Storage API Changed
• Storage Connector Upgraded
• Storage Hosts Migrated
• Storage Changed
• ...

Gimel Simplifies Data Application Lifecycle
Data Application Lifecycle - With Data API
• Onboarding Big Data Apps: Learn → Code → Optimize → Build → Deploy → Run
• Compute Version Upgraded: Run
• Storage API Changed: Run
• Storage Connector Upgraded: Run
• Storage Hosts Migrated: Run
• Storage Changed: Run
• ...: Run

Challenges | Instrumentation Required at multiple touchpoints
[Diagram labels: Data User / App; Catalog / Classification; Platform Centric Interceptors; Visibility; Security; Data Stores; sample table with masked columns id, name, address]

Challenges | Audit & Monitoring | Multifaceted
Lack of Unified View of Data Processed on Spark
[Diagram labels: User; DBQLogs; Audit Table; Cloud Audit Logs; PubSub; ***]

Platform management Complexities
[Diagram labels: Store Specific Interceptors; Operators; App Developers; Apps; Store; PubSub; Instrumentation By App Developer]

Putting it all together…
[Diagram labels: Data User; Data App; Data / SQL API; Platform Centric Interceptors; Catalog / Classification; Security; Alert; Apps; Data Stores; sample table with masked columns id, First_name, Last_name, address]

Query Routing – Design
User / App (Developer / Analyst / Data Scientist, via Notebooks) needs transaction data
• NRT (Streaming)
• 7 days (Analytics Cache)
• 2 Years (cold storage)
1. Submits query to GSQL Kernel
2. Submits query to GTS: Where txn_dt = last_7_days
Fast Access Via Cache
• Gimel looks at logical dataset in UDC
• Interprets filter criteria and routes query to appropriate storage
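The routing idea on this slide can be sketched without Spark. This is only an illustration under stated assumptions: the tier names, the `route` function, and the day thresholds are hypothetical and merely mirror the three retention windows listed above; the slides do not show how Gimel and UDC actually implement routing.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

object QueryRouting {
  // Hypothetical storage tiers, named after the three retention windows on the slide.
  sealed trait Tier
  case object Streaming extends Tier      // NRT
  case object AnalyticsCache extends Tier // last 7 days
  case object ColdStorage extends Tier    // older data, up to 2 years

  // Pick the first tier whose retention window covers the oldest
  // transaction date requested by the query's filter.
  def route(oldestTxnDate: LocalDate, today: LocalDate): Tier = {
    val ageInDays = ChronoUnit.DAYS.between(oldestTxnDate, today)
    if (ageInDays <= 0) Streaming
    else if (ageInDays <= 7) AnalyticsCache
    else ColdStorage
  }
}
```

With this sketch, a filter such as `txn_dt = last_7_days` corresponds to `QueryRouting.route(today.minusDays(7), today)`, which selects `AnalyticsCache`.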

Code Base
Docs
http://gimel.io
Code
Gimel_Notebook
GitHub
https://github.com/paypal/gimel
Gitter
https://gitter.im/paypal/gimel_data_api_community

Unified Data API & SQL Abstraction
[Code screenshots: Spark read from HBase and Spark read from Elasticsearch, compared with the same reads With Data API and With SQL]

Unified Data API & Unified Config
Unified Data API
Unified Connector Config

set gimel.catalog.provider=UDC
CatalogProvider.getDataSetProperties("dataSetName")
Metadata Services
set gimel.catalog.provider=USER
set gimel.catalog.provider=HIVE
sql> set dataSetProperties={
"key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
"auto.offset.reset":"earliest",
"gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
"gimel.storage.type":"kafka",
"gimel.kafka.whitelist.topics":"kafka_topic",
"datasetName":"test_table1",
"value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
"gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.avro.schema.source":"CSR",
"gimel.kafka.zookeeper.connection.timeout.ms":"10000",
"gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
"key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
"gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
"gimel.kafka.bootstrap.servers":"localhost:9092"
}
sql> Select * from pcatalog.test_table1;
spark.sql("set gimel.catalog.provider=USER")
val dataSetOptions = DataSetProperties(
  "KAFKA",
  Array(Field("payload", "string", true)),
  Array(),
  Map(
    "datasetName" -> "test_table1",
    "auto.offset.reset" -> "earliest",
    "gimel.kafka.bootstrap.servers" -> "localhost:9092",
    "gimel.kafka.avro.schema.source" -> "CSR",
    "gimel.kafka.avro.schema.source.url" -> "http://schema_registry:8081",
    "gimel.kafka.avro.schema.source.wrapper.key" -> "schema_registry_key",
    "gimel.kafka.checkpoint.zookeeper.host" -> "zookeeper:2181",
    "gimel.kafka.checkpoint.zookeeper.path" -> "/pcatalog/kafka_consumer/checkpoint",
    "gimel.kafka.whitelist.topics" -> "kafka_topic",
    "gimel.kafka.zookeeper.connection.timeout.ms" -> "10000",
    "gimel.storage.type" -> "kafka",
    "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer" -> "org.apache.kafka.common.serialization.ByteArraySerializer"
  )
)
dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
CREATE EXTERNAL TABLE `pcatalog`.`test_table1`
(payload string)
LOCATION 'hdfs://tmp/'
TBLPROPERTIES (
  "datasetName" = "dummy",
  "auto.offset.reset" = "earliest",
  "gimel.kafka.bootstrap.servers" = "localhost:9092",
  "gimel.kafka.avro.schema.source" = "CSR",
  "gimel.kafka.avro.schema.source.url" = "http://schema_registry:8081",
  "gimel.kafka.avro.schema.source.wrapper.key" = "schema_registry_key",
  "gimel.kafka.checkpoint.zookeeper.host" = "zookeeper:2181",
  "gimel.kafka.checkpoint.zookeeper.path" = "/pcatalog/kafka_consumer/checkpoint",
  "gimel.kafka.whitelist.topics" = "kafka_topic",
  "gimel.kafka.zookeeper.connection.timeout.ms" = "10000",
  "gimel.storage.type" = "kafka",
  "key.serializer" = "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer" = "org.apache.kafka.common.serialization.ByteArraySerializer"
);
Spark-sql> Select * from pcatalog.test_table1;
Scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
Anatomy of Catalog Provider
Metadata
set gimel.catalog.provider=YOUR_CATALOG
{
//Implement this!
}
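A pluggable catalog provider could look roughly like the sketch below. The trait shape, `InMemoryCatalog`, and `providerFor` are hypothetical names chosen for illustration; only the method name `getDataSetProperties` and the `gimel.catalog.provider` setting come from the slides, and Gimel's real interface may differ.

```scala
object CatalogSketch {
  // Hypothetical shape of a pluggable provider; Gimel's real interface may differ.
  trait CatalogProvider {
    def getDataSetProperties(dataSetName: String): Map[String, String]
  }

  // A toy in-memory catalog standing in for YOUR_CATALOG.
  class InMemoryCatalog(entries: Map[String, Map[String, String]]) extends CatalogProvider {
    def getDataSetProperties(dataSetName: String): Map[String, String] =
      entries.getOrElse(
        dataSetName,
        throw new NoSuchElementException(s"Unknown dataset: $dataSetName"))
  }

  // Resolve the provider from the value of gimel.catalog.provider (case-insensitive).
  def providerFor(name: String, registry: Map[String, CatalogProvider]): CatalogProvider =
    registry.getOrElse(
      name.toUpperCase,
      throw new IllegalArgumentException(s"No catalog provider: $name"))
}
```

Registering an implementation under the key `YOUR_CATALOG` and calling `providerFor("YOUR_CATALOG", registry)` then returns the provider whose `getDataSetProperties` supplies the connector properties for a dataset.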

Anatomy of API
dataSet.read("dataSetName", options)
dataSet.write(dataToWrite, "dataSetName", options)
dataStream.read("dataSetName", options)

gimel.dataset.factory {
  KafkaDataSet
  ElasticSearchDataSet
  DruidDataSet
  HiveDataSet
  AerospikeDataSet
  HbaseDataSet
  CassandraDataSet
  JDBCDataSet
}
gimel.datastream.factory {
  KafkaDataStream
}

val storageDataSet = getFromFactory(type = "Hive")
val storageDataStream = getFromStreamFactory(type = "kafka")

kafkaDataSet.read("dataSetName", options)
hiveDataSet.write(dataToWrite, "dataSetName", options)
storageDataStream.read("dataSetName", options)

{
  Core Connector Implementation, example – Kafka
  Combination of Open Source Connector and In-house implementations
  Open source connector such as DataStax / SHC / ES-Spark
}
Metadata Services
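The factory dispatch above can be sketched in a Spark-free form. Everything here is an assumption made for illustration: the `StorageDataSet` trait, the string rows, and the `storageType` parameter name are hypothetical, and only two of the listed connectors are stubbed out. Only the `gimel.storage.type` property key is taken from the slides.

```scala
object FactorySketch {
  // Hypothetical minimal connector interface; rows are plain strings to keep the sketch Spark-free.
  trait StorageDataSet {
    def read(dataSetName: String, options: Map[String, String]): Seq[String]
  }

  // Stub connectors that only tag the dataset name with their storage type.
  class KafkaDataSet extends StorageDataSet {
    def read(dataSetName: String, options: Map[String, String]): Seq[String] =
      Seq(s"kafka:$dataSetName")
  }

  class HiveDataSet extends StorageDataSet {
    def read(dataSetName: String, options: Map[String, String]): Seq[String] =
      Seq(s"hive:$dataSetName")
  }

  // Factory: map the storage type to a connector (case-insensitive).
  def getFromFactory(storageType: String): StorageDataSet =
    storageType.toLowerCase match {
      case "kafka" => new KafkaDataSet
      case "hive"  => new HiveDataSet
      case other   => throw new IllegalArgumentException(s"Unsupported storage type: $other")
    }

  // Unified entry point: the storage type comes from the dataset's properties,
  // not from the caller's code.
  def read(dataSetName: String, props: Map[String, String]): Seq[String] =
    getFromFactory(props("gimel.storage.type")).read(dataSetName, props)
}
```

The point of the design is that application code calls one `read` regardless of store; switching a dataset from Kafka to Hive changes its properties, not the caller.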
%%gimel
-- Establish 10 concurrent connections per Topic-Partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10 M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
, substr(commit_timestamp,1,4) as yyyy
, substr(commit_timestamp,6,2) as mm
, substr(commit_timestamp,9,2) as dd
, substr(commit_timestamp,12,2) as hh
from pcatalog.KAFKA_dataset kafka_ds
join default.geo_lkp lkp
on kafka_ds.zip = lkp.zip
where lkp.region = 'MIDWEST'

val dataSet: gimel.DataSet = DataSet(sparkSession)
val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
df1.createGlobalTempView("tmp_abc123")
val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "global_temp.tmp_abc123")
val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
dataSet.write("pcatalog.HIVE_dataset", readDf, options)
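The name-substitution step in the Scala lines above (swapping a `pcatalog` dataset for a temporary view before handing the SQL to Spark) can be sketched as plain string processing. The object and function names are hypothetical, and the regular expression assumes dataset names of the form `pcatalog.<identifier>` as seen on the slide. Note that a view registered with `createGlobalTempView` is addressed through Spark's `global_temp` database, so the replacement name carries that prefix.

```scala
import scala.util.matching.Regex

object SqlResolution {
  // Matches identifiers such as pcatalog.KAFKA_dataset (naming convention assumed from the slide).
  private val CatalogDataSet: Regex = """\bpcatalog\.[A-Za-z0-9_]+""".r

  // Distinct pcatalog datasets referenced in the SQL text, in order of first appearance.
  def referencedDataSets(sql: String): Seq[String] =
    CatalogDataSet.findAllIn(sql).toList.distinct

  // Replace each dataset name with its temp view name; unmapped names are left untouched.
  def resolve(sql: String, views: Map[String, String]): String =
    CatalogDataSet.replaceAllIn(
      sql,
      m => Regex.quoteReplacement(views.getOrElse(m.matched, m.matched)))
}
```

For example, `SqlResolution.resolve(selectSQL, Map("pcatalog.KAFKA_dataset" -> "global_temp.tmp_abc123"))` rewrites only the dataset reference and leaves `default.geo_lkp` as it is.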

Gimel Thrift Server @ PayPal

• HiveServer2
  Service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results
• Built on Apache Thrift concepts
• Spark Thrift Server
  Similar to HiveServer2, but executes on the Spark engine rather than Hive (MR / Tez)
What is GTS?
• Gimel Thrift Server
  Spark Thrift Server
  + Gimel
  + PayPal’s Unified Data Catalog
  + Security & other PayPal-specific features

Why GTS?
Scientist (PayPal Notebooks) needs to read data from Hive through SQL
1. Get a Spark Session
2. Starts Spark Session on cluster
3. Spark session started
4. Submits the query: Select * from pymtdba.wtransaction_p2
5. Reads from Store
Depending upon the cluster capacity and traffic, the user has to wait for the session
[Diagram labels: CLI Host; APP]

How does GTS Work?
Scientist (PayPal Notebooks) needs to read data from Hive
Select * from pymtdba.wtransaction_p2
1. Submits query to GSQL Kernel
2. Submits query to GTS (Gimel Thrift Server)
3. Read from Store
APP: Connect via Java JDBC / Python

What’s Next?
• Expand Catalog Provider
  • Google Data Catalog
• Cloud Support
  • BigQuery
  • PubSub
  • GCS
  • AWS Redshift
• Gimel SQL
  • Expand to Cloud Stores
• Query / Access Optimization
  • Pre-empt runaway queries
• Graph Support
  • Neo4j
• ML/NLP Support
  • ML-Lib
  • Spark-NLP

Questions?
Code base
http://gimel.io
Gitter
https://gitter.im/paypal/gimel_data_api_community

Hive, Spark, Gimel Thrift Server
• HiveServer2
  Service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results
• Built on Apache Thrift concepts
• Spark Thrift Server
  Similar to HiveServer2, but executes on the Spark engine rather than Hive (MR / Tez)
• Gimel Thrift Server
  Spark Thrift Server + Gimel + Security & other PayPal-specific features

GTS Key Features
• Out-of-box Auditing: Logging, Monitoring, Dashboards
• Alerting (beta/internal)
• Security: Apache Ranger, Teradata Proxy User
• Part of Ecosystem: Notebooks – GSQL, UDC – Datasets, SCAAS – DML/DDL
• Low Latency User Experience
• SQL to Any Store: stores supported by Gimel
• Highly available architecture: Software & Hardware
• Query via REST (work in progress)
• Query Guard: kills runaway queries
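One way to picture a query guard is a periodic sweep over running queries. The sketch below is an assumption-laden illustration, not the GTS implementation: `RunningQuery`, `runaways`, `sweep`, and the runtime threshold are all hypothetical, and the kill action is passed in by the caller. In a Spark deployment that action might, for example, call `SparkContext.cancelJobGroup` for the query's job group.

```scala
object QueryGuardSketch {
  // Hypothetical record of a running query.
  final case class RunningQuery(id: String, user: String, startedAtMs: Long)

  // Queries that have been running longer than maxRuntimeMs.
  def runaways(running: Seq[RunningQuery], nowMs: Long, maxRuntimeMs: Long): Seq[RunningQuery] =
    running.filter(q => nowMs - q.startedAtMs > maxRuntimeMs)

  // One guard pass: hand every runaway to the supplied kill action
  // and return the ids that were killed.
  def sweep(running: Seq[RunningQuery], nowMs: Long, maxRuntimeMs: Long)(
      kill: RunningQuery => Unit): Seq[String] = {
    val victims = runaways(running, nowMs, maxRuntimeMs)
    victims.foreach(kill)
    victims.map(_.id)
  }
}
```

Keeping the kill action as a parameter separates the policy (which queries count as runaway) from the mechanism (how a query is cancelled on a given engine).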
