Conquering All Stores with Gimel
Vladimir Bacvanski
Anisha Nainani
Deepak Chandramouli
About us
Vladimir Bacvanski
vbacvanski@paypal.com
Twitter: @OnSoftware
• Principal Architect, Strategic Architecture at PayPal
• In previous life: CTO of a development and
consulting firm
• PhD in Computer Science from RWTH Aachen,
Germany
• O’Reilly author: Courses on Big Data, Kafka
Deepak Chandramouli
dmohanakumarchan@paypal.com
LinkedIn: @deepakmc
• MT2 Software Engineer, Data Platform Services at
PayPal
• Data Enthusiast
• Tech lead
• Gimel (Big Data Framework for Apache Spark)
• Unified Data Catalog – PayPal’s Enterprise Data
Catalog
Anisha Nainani
annainani@paypal.com
LinkedIn: @anishanainani
• Senior Software Engineer
• Big Data
• Data Platform Services
AGENDA
• PayPal - Introduction
• Why Gimel
• Gimel + Catalog + Alluxio (Intelligent Query engine)
• Gimel Deep Dive
• What’s next?
• Questions
PayPal – Key Metrics and Analytics
Ecosystem
PayPal | Q3-2020 | Key Metrics
https://investor.pypl.com/home/default.aspx
PayPal | Data Growth
160+ PB Data
200,000+ YARN jobs/day
One of the largest
Aerospike,
Teradata,
Hortonworks
and Oracle
installations
Compute
supported:
Spark, Hive,
MR, BigQuery
20+ On-Premise
clusters
GPU co-located with
Hadoop
Cloud Migration
Adjacencies
Developer Data scientist Analyst Operator
Gimel SDK Notebooks
UDC Data API
Infrastructure services leveraged for elasticity and redundancy
Multi-DC | Public cloud | Predictive resource allocation
Logging
Monitoring
Alerting
Security
Application
Lifecycle
Management
Compute Framework and APIs
Gimel Data Platform
User Experience and Access
R Studio BI tools
PayPal | Data Landscape
Why Gimel?
Challenges | Data Access Code | Cumbersome & Fragile
Spark Read from HBase | Spark Read from Elasticsearch
Spark Read from Aerospike | Spark Read from Druid
(Code shown for illustration purposes only; not meant to be read)
Challenges | Data Processing Can Be Multi-Mode & Polyglot
Batch
Challenges with Data App Lifecycle
Onboarding Big Data Apps: Learn → Code → Optimize → Build → Deploy → Run
Compute Version Upgraded: Learn → Code → Optimize → Build → Deploy → Run
Storage API Changed: Learn → Code → Optimize → Build → Deploy → Run
Storage Connector Upgraded: Learn → Code → Optimize → Build → Deploy → Run
Storage Hosts Migrated: Learn → Code → Optimize → Build → Deploy → Run
Storage Changed: Learn → Code → Optimize → Build → Deploy → Run
*********************: Learn → Code → Optimize → Build → Deploy → Run
Gimel Simplifies Data Application Lifecycle
Data Application Lifecycle - With Data API
Onboarding Big Data Apps: Learn → Code → Optimize → Build → Deploy → Run
Compute Version Upgraded: Run
Storage API Changed: Run
Storage Connector Upgraded: Run
Storage Hosts Migrated: Run
Storage Changed: Run
*********************: Run
Challenges | Instrumentation Required at multiple touchpoints
Catalog /
Classification
Platform Centric
Interceptors
id name address
1 XXXX XXXX
2 XXXX XXXX
Visibility
Security
Data User / App
Data Stores
Challenges | Audit & Monitoring | Multifaceted
DBQLogs
Audit Table
Cloud Audit Logs
***
Lack of Unified View of Data
Processed on Spark
PubSub
User
Platform Management Complexities | Store-Specific Interceptors
PubSub
Store
Operators App
Developers
Apps
Instrumentation
By App Developer
Putting it all together…
id | First_name | Last_name | address
1  | XXXX       | XXXX      | XXXX
2  | XXXX       | XXXX      | XXXX
3  | XXXX       | XXXX      | XXXX
Data User
Data App
Data Stores
Catalog /
Classification
Alert
Platform Centric Interceptors
Security
Data / SQL API
App App App
App
…….
Query Routing – Design
Spark / Gimel Application
Developer Data scientist Analyst Operator
Gimel SDK Notebooks
UDC Data API
Logging
Monitoring
Application Lifecycle Management
Compute Framework and APIs
Gimel Data Platform
User Experience and Access
R Studio BI tools
Notebooks
Developer/Analyst/Data
Scientist
User / App needs transaction data:
• NRT (Streaming)
• 7 days (Analytics Cache)
• 2 years (cold storage)
1. Submits query to GSQL Kernel
2. Submits query to GTS: Where txn_dt = last_7_days
Fast access via cache
• Gimel looks at the logical dataset in UDC
• Interprets the filter criteria and routes the query to the appropriate storage
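The routing idea above can be sketched in a few lines: pick the cheapest storage tier whose retention window still covers the queried time range. This is an illustrative Python sketch only; the tier names, retention values, and the `route` helper are hypothetical, not Gimel's actual implementation.

```python
from datetime import date, timedelta

# Hypothetical tiers mirroring the slide: a 7-day analytics cache and
# ~2 years of cold storage (NRT streaming is out of scope for this sketch).
TIERS = [
    ("analytics_cache", timedelta(days=7)),
    ("cold_storage", timedelta(days=730)),
]

def route(query_start: date, today: date) -> str:
    """Pick the first (cheapest/fastest) tier whose retention covers the range."""
    age = today - query_start
    for store, retention in TIERS:
        if age <= retention:
            return store
    raise ValueError("requested range exceeds all retention windows")

# A query filtered to the last 7 days is served from the cache tier:
print(route(date(2020, 11, 10), today=date(2020, 11, 14)))  # analytics_cache
```

A query reaching back more than 7 days falls through to cold storage, which is exactly the decision Gimel makes from the dataset's UDC entry plus the filter criteria.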
Code & Docs: http://gimel.io
GitHub: https://github.com/paypal/gimel
Gitter: https://gitter.im/paypal/gimel_data_api_community
Gimel – Deep Dive
Unified Data API & SQL Abstraction
With Data API: Spark Read from HBase | Spark Read from Elasticsearch
With SQL
Unified Data API & Unified Config
Unified Data API
Unified Connector Config
set gimel.catalog.provider=UDC
CatalogProvider.getDataSetProperties("dataSetName")
Metadata
Services
set gimel.catalog.provider=USER
CatalogProvider.getDataSetProperties("dataSetName")
set gimel.catalog.provider=HIVE
CatalogProvider.getDataSetProperties("dataSetName")
sql> set dataSetProperties={
"key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
"auto.offset.reset":"earliest",
"gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
"gimel.storage.type":"kafka",
"gimel.kafka.whitelist.topics":"kafka_topic",
"datasetName":"test_table1",
"value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
"value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
"gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
"gimel.kafka.avro.schema.source":"CSR",
"gimel.kafka.zookeeper.connection.timeout.ms":"10000",
"gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
"key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
"gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
"gimel.kafka.bootstrap.servers":"localhost:9092"
}
sql> Select * from pcatalog.test_table1
spark.sql("set gimel.catalog.provider=USER")
val dataSetOptions = DataSetProperties(
  "KAFKA",
  Array(Field("payload", "string", true)),
  Array(),
  Map(
    "datasetName" -> "test_table1",
    "auto.offset.reset" -> "earliest",
    "gimel.kafka.bootstrap.servers" -> "localhost:9092",
    "gimel.kafka.avro.schema.source" -> "CSR",
    "gimel.kafka.avro.schema.source.url" -> "http://schema_registry:8081",
    "gimel.kafka.avro.schema.source.wrapper.key" -> "schema_registry_key",
    "gimel.kafka.checkpoint.zookeeper.host" -> "zookeeper:2181",
    "gimel.kafka.checkpoint.zookeeper.path" -> "/pcatalog/kafka_consumer/checkpoint",
    "gimel.kafka.whitelist.topics" -> "kafka_topic",
    "gimel.kafka.zookeeper.connection.timeout.ms" -> "10000",
    "gimel.storage.type" -> "kafka",
    "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer" -> "org.apache.kafka.common.serialization.ByteArraySerializer"
  )
)
dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
CREATE EXTERNAL TABLE `pcatalog.test_table1`
(payload string)
LOCATION 'hdfs://tmp/'
TBLPROPERTIES (
  "datasetName" = "dummy",
  "auto.offset.reset" = "earliest",
  "gimel.kafka.bootstrap.servers" = "localhost:9092",
  "gimel.kafka.avro.schema.source" = "CSR",
  "gimel.kafka.avro.schema.source.url" = "http://schema_registry:8081",
  "gimel.kafka.avro.schema.source.wrapper.key" = "schema_registry_key",
  "gimel.kafka.checkpoint.zookeeper.host" = "zookeeper:2181",
  "gimel.kafka.checkpoint.zookeeper.path" = "/pcatalog/kafka_consumer/checkpoint",
  "gimel.kafka.whitelist.topics" = "kafka_topic",
  "gimel.kafka.zookeeper.connection.timeout.ms" = "10000",
  "gimel.storage.type" = "kafka",
  "key.serializer" = "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer" = "org.apache.kafka.common.serialization.ByteArraySerializer"
);
Spark-sql> Select * from pcatalog.test_table1
Scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
Anatomy of Catalog Provider
Metadata
set gimel.catalog.provider=YOUR_CATALOG
CatalogProvider.getDataSetProperties("dataSetName")
{
  // Implement this!
}
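The pluggable contract above can be sketched as a small registry: `gimel.catalog.provider` selects an implementation, and that implementation resolves a dataset name to its properties. The Python class below is an illustrative stand-in for the real Scala API; the property values are borrowed from the Kafka example earlier, not fetched from a real catalog.

```python
class CatalogProvider:
    """Toy registry sketch of the catalog-provider pattern (not Gimel's code)."""
    _providers = {}

    @classmethod
    def register(cls, name, resolver):
        # resolver: dataset name -> dict of dataset properties
        cls._providers[name] = resolver

    @classmethod
    def get_dataset_properties(cls, provider_name, dataset_name):
        return cls._providers[provider_name](dataset_name)

# A USER-style provider returns properties supplied inline by the caller,
# while a UDC/HIVE-style provider would look them up in metadata services.
CatalogProvider.register(
    "USER",
    lambda name: {"datasetName": name, "gimel.storage.type": "kafka"},
)

props = CatalogProvider.get_dataset_properties("USER", "test_table1")
print(props["gimel.storage.type"])  # kafka
```

Plugging in YOUR_CATALOG then amounts to registering one more resolver; readers and writers stay unchanged.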
gimel.dataset.factory {
KafkaDataSet
ElasticSearchDataSet
DruidDataSet
HiveDataSet
AerospikeDataSet
HbaseDataSet
CassandraDataSet
JDBCDataSet
}
Metadata
Services
dataSet.read("dataSetName", options)
dataSet.write(dataToWrite, "dataSetName", options)
dataStream.read("dataSetName", options)
val storageDataSet = getFromFactory(type = "Hive")
{
  Core connector implementation, for example Kafka:
  a combination of open-source connectors and
  in-house implementations.
  Open-source connectors such as DataStax / SHC / ES-Spark.
}
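The dataset factory above can be sketched as a simple type-to-connector dispatch, so callers only ever touch `dataSet.read` / `dataSet.write` regardless of the store. The connector classes here are illustrative stand-ins, not the real Gimel implementations.

```python
class KafkaDataSet:
    """Stand-in for a Kafka connector (real one wraps open-source + in-house code)."""
    def read(self, name, options):
        return f"kafka-read:{name}"

class HiveDataSet:
    """Stand-in for a Hive connector."""
    def read(self, name, options):
        return f"hive-read:{name}"

# Sketch of gimel.dataset.factory: storage type -> connector class.
FACTORY = {"kafka": KafkaDataSet, "hive": HiveDataSet}

def get_from_factory(storage_type: str):
    """Resolve the store-specific connector from the dataset's storage type."""
    return FACTORY[storage_type.lower()]()

ds = get_from_factory("Hive")
print(ds.read("pcatalog.HIVE_dataset", {}))  # hive-read:pcatalog.HIVE_dataset
```

In Gimel the storage type comes from the catalog provider's dataset properties (`gimel.storage.type`), which is what lets the same API span Kafka, Hive, HBase, Elasticsearch, and the rest.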
Anatomy of API
gimel.datastream.factory {
  KafkaDataStream
}
CatalogProvider.getDataSetProperties("dataSetName")
val storageDataStream = getFromStreamFactory(type = "kafka")
kafkaDataSet.read("dataSetName", options)
hiveDataSet.write(dataToWrite, "dataSetName", options)
storageDataStream.read("dataSetName", options)
dataSet.write("pcatalog.HIVE_dataset", readDf, options)
val dataSet: gimel.DataSet = DataSet(sparkSession)
val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
df1.createGlobalTempView("tmp_abc123")
val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "tmp_abc123")
val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
select kafka_ds.*, gimel_load_id
  ,substr(commit_timestamp,1,4) as yyyy
  ,substr(commit_timestamp,6,2) as mm
  ,substr(commit_timestamp,9,2) as dd
  ,substr(commit_timestamp,12,2) as hh
from pcatalog.KAFKA_dataset kafka_ds
join default.geo_lkp lkp
  on kafka_ds.zip = lkp.zip
where lkp.region = 'MIDWEST'
%%gimel
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
-- Establish 10 concurrent connections per Topic-Partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
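Conceptually, `maxRecordsPerPartition` caps how far each Kafka partition's offset range may advance in a single batch. The sketch below illustrates that capping with hypothetical offset bookkeeping; the function name and shapes are illustrative, not Gimel's actual code.

```python
def capped_offset_ranges(latest_offsets, committed_offsets, max_per_partition):
    """Return per-partition (from, to) offset ranges, capped by the throttle limit.

    latest_offsets: partition -> newest available offset in Kafka
    committed_offsets: partition -> last offset already processed
    """
    ranges = {}
    for partition, latest in latest_offsets.items():
        start = committed_offsets.get(partition, 0)
        # Never fetch past the throttle limit, and never past what exists.
        ranges[partition] = (start, min(latest, start + max_per_partition))
    return ranges

# Partition 0 has 20M unread records, so it is capped at 10M this batch;
# partition 1 has fewer available than the cap, so it reads to the end.
ranges = capped_offset_ranges(
    {0: 25_000_000, 1: 400_000},
    {0: 5_000_000, 1: 100_000},
    10_000_000,
)
print(ranges)
```

The remaining records on partition 0 are picked up by subsequent batches, which is what keeps a single insert-into-Hive run from pulling an unbounded amount of Kafka data.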
Gimel Thrift Server @ PayPal
• HiveServer2
A service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results. BLOG
• Built on Apache Thrift concepts
• Spark Thrift Server
Similar to HiveServer2, but executes on the Spark engine rather than Hive (MR/Tez)
What is GTS?
• Gimel Thrift Server =
Spark Thrift Server
+ Gimel
+ PayPal’s Unified Data Catalog
+ Security & other PayPal-specific features
Depending on cluster capacity and traffic, the user has to wait for a session
Why GTS?
Developer Data scientist Analyst Operator
Gimel SDK Notebooks
UDC Data API
Logging
Monitoring
Alerting
Application
Lifecycle
Management
Compute Framework and APIs
Gimel Data Platform
User Experience and Access
R Studio BI tools
Needs to read data from Hive through SQL
PayPal Notebooks
Developer/Analyst/Data Scientist
1. Get a Spark Session
2. Starts Spark Session on cluster
3. Spark session started
4. Submits the query: Select * from pymtdba.wtransaction_p2
5. Reads from Store
CLI Host
How does GTS Work?
Gimel Thrift Server
PayPal Notebooks
Developer/Analyst/Data Scientist
Needs to read data from Hive: Select * from pymtdba.wtransaction_p2
1. Submits query to GSQL Kernel
2. Submits query to GTS
3. Reads from Store
Connect via Java JDBC / Python
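Since GTS is built on the Spark Thrift Server, clients reach it over the standard HiveServer2 JDBC protocol. The helper and hostname below are illustrative sketches (PayPal-internal endpoints and security details are omitted).

```python
def thrift_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2-style JDBC URL, the format Spark Thrift Server speaks."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# Hypothetical GTS endpoint for illustration:
url = thrift_jdbc_url("gts.example.com")
print(url)  # jdbc:hive2://gts.example.com:10000/default
```

From Python, the same endpoint is typically reached with a HiveServer2 client library such as pyhive (sketch only, no connection attempted here): `hive.connect(host="gts.example.com", port=10000)`.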
Next Steps
What’s Next?
• Expand Catalog Provider
• Google Data Catalog
• Cloud Support
• BigQuery
• PubSub
• GCS
• AWS Redshift
• Gimel SQL
• Expand to Cloud Stores
• Query / Access
Optimization
• Pre-empt runaway queries
• Graph Support
• Neo4j
• ML/NLP Support
• ML-Lib
• Spark-NLP
Questions?
Code base
http://gimel.io
Gitter
https://gitter.im/paypal/gimel_data_api_community
Thank You!
Appendix
• HiveServer2
A service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results. BLOG
• Built on Apache Thrift concepts
• Spark Thrift Server
Similar to HiveServer2, but executes on the Spark engine rather than Hive (MR/Tez)
• Gimel Thrift Server
Spark Thrift Server + Gimel + Security & other PayPal-specific features
Hive, Spark, Gimel Thrift Server
GTS Key Features
Out-of-box Auditing: Logging, Monitoring, Dashboards
Alerting (beta/internal)
Security: Apache Ranger, Teradata Proxy User
Part of Ecosystem: Notebooks – GSQL, UDC – Datasets, SCAAS – DML/DDL
Low Latency: User Experience
SQL to Any Store: Stores supported by Gimel
Highly Available Architecture: Software & Hardware
Query via REST (work in progress)
Query Guard: Kills runaway queries
Scale By The Bay | 2020 | Gimel

  • 1. Conquering All Stores with Gimel Vladimir Bacvanski Anisha Nainani Deepak Chandramouli
  • 2. About us Vladimir Bacvanski vbacvanski@paypal.com Twitter: @OnSoftware • Principal Architect, Strategic Architecture at PayPal • In previous life: CTO of a development and consulting firm • PhD in Computer Science from RWTH Aachen, Germany • O’Reilly author: Courses on Big Data, Kafka Deepak Chandramouli dmohanakumarchan@paypal.com LinkedIn: @deepakmc • MT2 Software Engineer, Data Platform Services at PayPal • Data Enthusiast • Tech lead • Gimel (Big Data Framework for Apache Spark) • Unified Data Catalog – PayPal’s Enterprise Data Catalog Anisha Nainani annainani@paypal.com LinkedIn: @anishanainani • Senior Software Engineer • Big Data • Data Platform Services
  • 3. AGENDA: PayPal introduction; Why Gimel; Gimel + Catalog + Alluxio (intelligent query engine); Gimel deep dive; What's next?; Questions
  • 4. PayPal – Key Metrics and Analytics Ecosystem
  • 5. PayPal | Q3-2020 | Key Metrics (https://investor.pypl.com/home/default.aspx)
  • 6. PayPal | Data Growth: 160+ PB data; 200,000+ YARN jobs/day; one of the largest Aerospike, Teradata, Hortonworks, and Oracle installations; compute supported: Spark, Hive, MR, BigQuery; 20+ on-premise clusters; GPU co-located with Hadoop; cloud migration; adjacencies
  • 7. PayPal | Data Landscape: user experience and access (Gimel SDK, Notebooks, R Studio, BI tools) for developers, data scientists, analysts, and operators; Gimel Data Platform (UDC, Data API); compute framework and APIs; application lifecycle management; logging, monitoring, alerting, security; infrastructure services leveraged for elasticity and redundancy (multi-DC, public cloud, predictive resource allocation)
  • 9. Challenges | Data Access Code | Cumbersome & Fragile: Spark reads from HBase, Elasticsearch, Aerospike, and Druid each require different connector code (shown for illustration purposes only, not meant to be read)
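To make the slide's point concrete, here is a hypothetical sketch (in Python, with made-up option keys modeled on real Spark connectors) of why per-store access code is cumbersome and fragile: every backend demands its own format string and option vocabulary, so the calling code diverges for each store.

```python
# Hypothetical illustration of per-store Spark connector fragmentation.
# The format strings and option keys below are modeled on real connectors,
# but this is a sketch of the *problem*, not any real Gimel API.

hbase_options = {
    "format": "org.apache.hadoop.hbase.spark",   # HBase connector format
    "hbase.table": "txn",
    "hbase.columns.mapping": "id STRING :key, amt STRING cf:amt",
}

es_options = {
    "format": "org.elasticsearch.spark.sql",     # different connector, different keys
    "es.nodes": "es-host:9200",
    "es.resource": "txn/doc",
}

def read(spark_reader, options):
    """The caller must know each backend's format and option keys."""
    return spark_reader(options["format"], options)

# Any change of store (or connector version) forces the caller to change --
# exactly the fragility a unified Data API is meant to remove.
```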
  • 10. Challenges | Data Processing Can Be Multi-Mode & Polyglot: Batch
  • 11. Challenges with Data App Lifecycle: the full cycle (Learn, Code, Optimize, Build, Deploy, Run) repeats for every change: onboarding big data apps, compute version upgraded, storage API changed, storage connector upgraded, storage hosts migrated, storage changed
  • 12. Gimel Simplifies Data Application Lifecycle: with the Data API, onboarding a big data app takes one Learn, Code, Optimize, Build, Deploy, Run cycle; after that, compute version upgrades, storage API changes, connector upgrades, host migrations, and storage changes only require Run
  • 13. Challenges | Instrumentation Required at Multiple Touchpoints: catalog / classification, platform-centric interceptors, visibility, and security sit between data users / apps and data stores
  • 14. Challenges | Audit & Monitoring | Multifaceted: DBQLogs, audit tables, cloud audit logs, PubSub; lack of a unified view of data processed on Spark
  • 15. Platform Management Complexities: store-specific interceptors (PubSub, store operators) plus instrumentation written by app developers in every app
  • 16. Putting it all together: data users and data apps reach data stores through a single Data / SQL API layer that provides catalog / classification, alerting, security, and platform-centric interceptors
  • 17. Query Routing – Design: a user / app needs transaction data that is NRT (streaming), up to 7 days old (analytics cache), or up to 2 years old (cold storage). 1. The developer / analyst / data scientist submits a query to the GSQL kernel; 2. the kernel submits it to GTS; 3. Gimel looks up the logical dataset in UDC, interprets the filter criteria (e.g. where txn_dt = last_7_days), and routes the query to the appropriate storage, giving fast access via cache
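The routing step above can be sketched as a small function that inspects the query's date filter and picks a physical store for the logical dataset. The store names and thresholds below come from the slide (NRT, 7 days, 2 years); the function itself is an illustrative assumption, not Gimel's actual routing logic.

```python
from datetime import date, timedelta

# Illustrative sketch of filter-based query routing: choose a backing store
# for the logical "transaction" dataset from the query's date predicate.

def route(filter_start: date, today: date) -> str:
    age = today - filter_start
    if age <= timedelta(0):
        return "streaming"        # NRT data
    if age <= timedelta(days=7):
        return "analytics_cache"  # last 7 days served from cache (e.g. Alluxio)
    return "cold_storage"         # older history, up to ~2 years

today = date(2020, 11, 12)
assert route(today - timedelta(days=3), today) == "analytics_cache"
assert route(today - timedelta(days=400), today) == "cold_storage"
```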
  • 18. Code
  • 20. Gimel – Deep Dive 20
  • 21. Unified Data API & SQL Abstraction: Spark reads from HBase and Elasticsearch collapse into one call with the Data API, or a single statement with SQL
  • 22. Unified Data API & Unified Config Unified Data API Unified Connector Config
  • 23. Anatomy of Catalog Provider: set gimel.catalog.provider=UDC | USER | HIVE | YOUR_CATALOG, then CatalogProvider.getDataSetProperties("dataSetName") resolves the dataset's metadata from Metadata Services (UDC), user-supplied properties, Hive table properties, or your own implementation ("Implement this!").
    With USER, properties are supplied inline:
    sql> set dataSetProperties={
      "gimel.storage.type":"kafka",
      "datasetName":"test_table1",
      "gimel.kafka.bootstrap.servers":"localhost:9092",
      "gimel.kafka.whitelist.topics":"kafka_topic",
      "gimel.kafka.avro.schema.source":"CSR",
      "gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
      "gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
      "gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
      "gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
      "gimel.kafka.zookeeper.connection.timeout.ms":"10000",
      "auto.offset.reset":"earliest",
      "key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
      "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
      "value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
      "value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer"
    }
    sql> select * from pcatalog.test_table1
    The same dataset can be read from Scala:
    spark.sql("set gimel.catalog.provider=USER")
    val dataSetOptions = DataSetProperties("KAFKA", Array(Field("payload", "string", true)), Array(), Map(/* same properties as above */))
    dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
    With HIVE, the properties live in the table definition (same properties as above, with datasetName "dummy"):
    CREATE EXTERNAL TABLE `pcatalog.test_table1` (payload string) LOCATION 'hdfs://tmp/' TBLPROPERTIES (...);
    spark-sql> select * from pcatalog.test_table1
    scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))
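The pluggable catalog-provider pattern can be sketched in a few lines. This is a minimal Python sketch, assuming an interface with one `get_dataset_properties` method and a registry keyed by the `gimel.catalog.provider` setting; the provider classes are illustrative stand-ins, not Gimel's implementations.

```python
# Sketch of the catalog-provider plug point: the provider setting selects
# where dataset properties come from (USER-supplied, UDC, or your own).

class CatalogProvider:
    """Interface: resolve a logical dataset name to connector properties."""
    def get_dataset_properties(self, name: str) -> dict:
        raise NotImplementedError  # "Implement this!" for YOUR_CATALOG

class UserCatalogProvider(CatalogProvider):
    def __init__(self, user_props: dict):
        self._props = user_props          # supplied via set dataSetProperties={...}
    def get_dataset_properties(self, name: str) -> dict:
        return self._props[name]

class UdcCatalogProvider(CatalogProvider):
    def get_dataset_properties(self, name: str) -> dict:
        # A real provider would call the UDC metadata service here.
        return {"datasetName": name, "gimel.storage.type": "kafka"}

def provider_for(setting: str, **kwargs) -> CatalogProvider:
    registry = {"USER": UserCatalogProvider, "UDC": UdcCatalogProvider}
    return registry[setting](**kwargs)
```

The payoff is that application code only ever calls `get_dataset_properties`, so switching from user-supplied metadata to the enterprise catalog is a one-line configuration change.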
  • 24. Anatomy of API: gimel.dataset.factory { KafkaDataSet, ElasticSearchDataSet, DruidDataSet, HiveDataSet, AerospikeDataSet, HbaseDataSet, CassandraDataSet, JDBCDataSet } and gimel.datastream.factory { KafkaDataStream }.
    dataSet.read("dataSetName", options) / dataSet.write(dataToWrite, "dataSetName", options) / dataStream.read("dataSetName", options) first call CatalogProvider.getDataSetProperties("dataSetName") (Metadata Services); then val storageDataSet = getFromFactory(type = "Hive") (or val storageDataStream = getFromStreamFactory(type = "kafka")) returns the core connector implementation: a combination of open-source connectors (e.g. DataStax / SHC / ES-Spark) and in-house implementations, so the call becomes kafkaDataSet.read(...), hiveDataSet.write(...), or storageDataStream.read(...).
    SQL support works by rewriting dataset references into temp views:
    val dataSet: gimel.DataSet = DataSet(sparkSession)
    val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
    df1.createGlobalTempView("tmp_abc123")
    val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "tmp_abc123")
    val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
    dataSet.write("pcatalog.HIVE_dataset", readDf, options)
    Example:
    %%gimel
    -- Establish 10 concurrent connections per topic-partition
    set gimel.kafka.throttle.batch.parallelsPerPartition=10;
    -- Fetch at most 10M messages from each partition
    set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
    insert into pcatalog.HIVE_dataset partition(yyyy,mm,dd,hh,mi)
    select kafka_ds.*, gimel_load_id,
      substr(commit_timestamp,1,4) as yyyy,
      substr(commit_timestamp,6,2) as mm,
      substr(commit_timestamp,9,2) as dd,
      substr(commit_timestamp,12,2) as hh
    from pcatalog.KAFKA_dataset kafka_ds
    join default.geo_lkp lkp on kafka_ds.zip = geo_lkp.zip
    where geo_lkp.region = 'MIDWEST'
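The factory dispatch described above can be reduced to a sketch: the storage type resolved from the catalog selects a connector class. The connector classes below are placeholders for the real Kafka/Hive/HBase/... DataSets; this is an illustration of the pattern, not Gimel's code.

```python
# Sketch of the dataset factory: storage type -> connector implementation.

class KafkaDataSet:
    def read(self, name: str, options: dict) -> str:
        return f"kafka-read:{name}"       # placeholder for the real connector

class HiveDataSet:
    def write(self, data, name: str, options: dict) -> str:
        return f"hive-write:{name}"       # placeholder for the real connector

FACTORY = {"kafka": KafkaDataSet, "hive": HiveDataSet}

def get_from_factory(storage_type: str):
    """Mirrors getFromFactory(type = "Hive") from the slide."""
    return FACTORY[storage_type.lower()]()
```

New stores are onboarded by registering one more entry in the factory; callers keep using the same `read`/`write` surface.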
  • 25. Gimel Thrift Server @ PayPal
  • 26. What is GTS? HiveServer2 is a service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results (BLOG); it is built on Apache Thrift concepts. Spark Thrift Server is similar to HiveServer2 but executes in the Spark engine rather than Hive (MR / Tez). Gimel Thrift Server = Spark Thrift Server + Gimel + PayPal's Unified Data Catalog + security & other PayPal-specific features
  • 27. Why GTS? Depending upon the cluster capacity and traffic, the user has to wait for a session. Without GTS, a developer / analyst / data scientist who needs to read data from Hive through SQL in PayPal Notebooks must: 1. get a Spark session; 2. start the Spark session on the cluster; 3. wait for the session to start; 4. submit the query (select * from pymtdba.wtransaction_p2); 5. read from the store
  • 28. How does GTS Work? The Gimel Thrift Server holds the Spark session, so a developer / analyst / data scientist in PayPal Notebooks who needs to read data from Hive (select * from pymtdba.wtransaction_p2) simply: 1. submits the query to the GSQL kernel; 2. the kernel submits it to GTS; 3. GTS reads from the store. Apps can also connect via Java JDBC / Python
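Since GTS is built on Spark Thrift Server, it speaks the HiveServer2 thrift protocol, so standard Hive clients work against it. A sketch of building the JDBC-style connection URL a Java client would use; host, port, and database below are placeholder values, and a Python client would typically use a HiveServer2 library instead.

```python
# Sketch: GTS exposes the HiveServer2 protocol, so clients connect with a
# standard hive2 JDBC URL. Host/port/database here are placeholders.

def gts_jdbc_url(host: str, port: int, database: str = "default") -> str:
    return f"jdbc:hive2://{host}:{port}/{database}"

url = gts_jdbc_url("gts-host.example.com", 10000)
# -> "jdbc:hive2://gts-host.example.com:10000/default"
```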
  • 30. What’s Next? • Expand Catalog Provider • Google Data Catalog • Cloud Support • BigQuery • PubSub • GCS • AWS Redshift • Gimel SQL • Expand to Cloud Stores • Query / Access Optimization • Pre-empt runaway queries • Graph Support • Neo4j • ML/NLP Support • ML-Lib • Spark-NLP
  • 34. Hive, Spark, Gimel Thrift Server: HiveServer2 is a service that allows a remote client to submit requests to Hive using a variety of programming languages (C++, Java, Python) and retrieve results (BLOG); it is built on Apache Thrift concepts. Spark Thrift Server is similar to HiveServer2 but executes in the Spark engine rather than Hive (MR / Tez). Gimel Thrift Server = Spark Thrift Server + Gimel + security & other PayPal-specific features
  • 35. GTS Key Features: out-of-box auditing (logging, monitoring, dashboards) and alerting (beta/internal); security (Apache Ranger, Teradata proxy user); part of the ecosystem (Notebooks – GSQL, UDC – datasets, SCAAS – DML/DDL); low latency; user experience: SQL to any store supported by Gimel; highly available architecture (software & hardware); query via REST (work in progress); REST Query Guard kills runaway queries

Editor's notes

  1. The Hive thrift server accepts SQL queries and responds with results, but it is limited to the Hive database only. From a security standpoint, it does impersonation (runs as the logged-in user). From the query engine perspective, it is not the Spark engine, so it runs MR jobs, which are very slow compared to Spark.