Building Operational Data Lake using Spark and SequoiaDB
Yang Peng, VP of Solutions
pengyang@sequoiadb.com
Top Distributed Database Vendor in China
SequoiaDB has more enterprise customers in China than MongoDB and Couchbase combined: over 200 enterprise customers, 10 of them Global Fortune 500 companies.
Backed by Silicon Valley based VCs: Qiming Venture's Series A investment and DCM's $10 million Series B, the largest investment among Chinese distributed database companies so far.
SequoiaDB has been certified as one of the 14 Databricks (Spark) global distributors.
SequoiaDB is listed by Firstmark in the "Global Big Data Landscape" and is the only Chinese company ever selected.
SequoiaDB's products:
• New Generation Distributed Content Management Software
• New Generation Distributed Multi-Model Database
Covering structured, semi-structured, and unstructured data: SQL, NoSQL, and object files.
Operational Data Lake with SequoiaDB + Spark
Client Data Scale
• Around 700M user accounts in total
• 2PB data volume
• 10K RPS at peak
• Queries across 10B rows of data
• Responses within 100ms
Highlights of Operational Data Lake
• Real-time, high-concurrency queries
• Full-scale historical data storage
• Multi-model data management
Banking Operational Data Lake Use Case
Requirements:
Provide a data query platform for legal and fraud investigations, with centralized storage of all historical data from the core systems, unified data query interfaces, and ad-hoc query services.
• Centralized Data Storage and Data Cleaning
  – T+1 data synchronization from core systems, such as the credit card and e-banking systems, over a unified data channel
  – Unified ETL, including data cleaning
• Unified Interfaces for Query
  – Query API and interfaces for operators
  – Ad-hoc queries
Initial Idea
[Architecture diagram: production-environment systems (core system, credit card system, e-banking, payment) feed an ODS through middleware; history data is initialized and then synced T+1 into SDB. On top of SDB, the SDB API, Spark SQL, and PG SQL serve history data queries. Consumers include the investigation system, counter, gateway, open platform, big data platform, debt collection, and access platform, with FTP import for some sources, organized into data acquisition, data processing, and data displaying layers.]
Historical Data Ratio
Data ratio in the data management platform:
• Legacy Credit Card System: 32%
• Debt Collection: 22%
• Legacy Core System: 21%
• New Credit Card System: 20%
• New Core System: 4%
• Online customer…
Data services in the history data platform:
1. Legacy Core System
2. Core System
3. Legacy Credit Card System
4. Credit Card System
5. Credit Card Collection
6. Credit Card Customer Services
7. Administration
Initial Architecture
[Architecture diagram: source data from the core system (TB1, TB4, …), credit card system (TB2, TB5, …), e-banking (TB3, TB6, …), and other systems flows through ODS and FTP into the SDB data storage area, where each SDB cluster holds TB1–TB4. TB4 = TB1 + TB2 + TB3 and stores the modified data.]
• TB1, TB2, TB3, the source data, are stored directly into SDB.
• Source data TB1, TB2, TB3 is imported using the SDB API.
• Spark and PGSQL modify the data and store the result into TB4.
• Web applications sit on top; each group of data runs with 1 master and 2 slaves.
• The SDB data processing area combines MySQL, Python, Spark SQL, and the SDB database cluster.
Challenges
• Main repository data is used as a replacement for tape
• The main repository contains all the business source data in different schemas, which may not be query-friendly
• I/O and computing activity on those boxes must be minimized
• Reading data directly from the main repository is NOT allowed
• Lack of isolation in source data management
• Scaling issues
• Lack of unified management tools for data, system resources, and queries
• Lack of a standard interface for external systems
• Performance issues when querying data in the main repository
Solutions
• Manage the query data and computing resources in an isolated area
  – Data is cleaned and reconstructed in the cache region based on the query request
  – Data can be replayed and reconstructed at any time
[Diagram: a query such as "select * from T1, T2, T3" triggers a dynamic load from the main repository into the cache region.]
Operational Data Lake: Overall Business Architecture
[Diagram: the Operational Data Lake on SDB clusters comprises a main repository area, a data scheduling and processing area, an online query area, an ad-hoc query area, and a data-lab area (sandbox), under data lifecycle management and exposed as data services.]
• Main repository: all the original data from all the business systems.
• Online query: real-time queries that require fast response.
• Ad-hoc query: must be tested in the sandbox testing area before querying the business database; the sandbox tests the query instructions.
ODL: Detailed Technology Architecture
[Diagram: original systems (core systems, credit card systems, e-banking system, …) feed the platform through ODS, NAS, FTP, and CDC channels into a server layer and the data scheduling and processing area. SequoiaDB clusters hold the main repository (e-banking data cluster, credit card data clusters, core data clusters, …) plus data replicas for the online query data services, the ad-hoc query testing sandbox, and ad-hoc query execution. An ECM platform manages online, warm, and archived content (images, docs). Management functions cover online trading, archiving, mission, source, users, and monitoring management.]
• CDC fetches database logs to synchronize online data in real time.
• Main repository data is imported from the latest T+1 data via ODS + FTP.
• The online query area holds the data defined by the specific online query business.
• The sandbox keeps replicas of one week of data for testing ad-hoc query instructions; separate data replicas serve the ad-hoc query business.
ODL: Online Query
Query Process:
1. Data is imported into the main repository from external systems.
2. Data is reconstructed by the scheduling and processing layer.
3. Modified data is imported into the query layer.
4. Users execute queries.
5. The middleware layer queries the data through SQL or the SDB APIs.
6. The query result is returned to the system.
[Diagram: source data from core systems and other systems flows into the main repository data storage, through the data process & schedule layer into the modified data and data replica stores, and is served by query services via the query middleware.]
ODL: Ad-hoc Query
Ad-hoc Query Process:
1. Data is imported into the main repository.
2. Data is imported into the ad-hoc query layer by the scheduling and processing layer.
3. The sandbox layer fetches replicas of sample data from the ad-hoc query layer.
4. Users enter the query instructions.
5. The testing area forwards the instructions to the sandbox layer for testing.
6. The sandbox executes the instructions and returns the result.
7. If the preview test passes, the query is sent to the executing area.
8. The executing area requests the query result from the ad-hoc query data storage layer.
9. The result of the ad-hoc query is stored in a temporary space.
10. The query result is returned.
Results of ad-hoc queries are stored in a temporary space, and the processing result is returned from there.
[Diagram: data from the core system and other systems flows into the main repository data storage, through the data process & schedule layer into structured data storage with original data replicas; the sandbox keeps replicas of one month of data for testing free query instructions, and results land in a temporary result space exposed through the external data service.]
Spark SequoiaDB Connector
SparkSQL + SequoiaDB Connector
• SparkSQL is very useful as an ETL tool
• Users can write standard SQL to join data from multiple tables and load the result into a target table
• Spark can connect to any external data source as long as a connector is provided
Requirements: SequoiaDB 2.8.1+, Spark 2.0+, JDK 7+, Scala 2.11.x
SequoiaDB Connector – Spark SQL
1. Spark SQL
• Create Table or View
create [temporary] <table|view> <name>[(schema)]
using com.sequoiadb.spark options (<options>);
• Insert Data in SQL
insert into table <name1> select * from <name2>;
Options:
• Host – SequoiaDB address and port
• CollectionSpace – The collection space for the table
• Collection – The collection for the table
• Username – Authentication username
• Password – Authentication password
• SamplingRatio – Sampling rate for schema generation
• SamplingNum – Max number of records to sample
• SamplingWithID – Include the "_id" column in the schema
• SamplingSingle – Take the sample in a single partition
• BulkSize – Batch job size of bulk insert
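Putting the two statements together, here is a minimal ETL sketch in Scala via spark.sql. The endpoint server2:11810, collection space sample, the employee/department/report collections, and the column names are placeholders for illustration, not taken from the deck:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SequoiaDB ETL sketch")
  .master("local[*]") // for a local test; drop when submitting to a cluster
  .getOrCreate()

// Map two SequoiaDB collections into Spark SQL as temporary views
spark.sql("""create temporary view employee
  using com.sequoiadb.spark
  options (host 'server2:11810', collectionspace 'sample', collection 'employee')""")
spark.sql("""create temporary view department
  using com.sequoiadb.spark
  options (host 'server2:11810', collectionspace 'sample', collection 'department')""")

// Map the target collection, then load it with a standard SQL join
spark.sql("""create temporary view report
  using com.sequoiadb.spark
  options (host 'server2:11810', collectionspace 'sample',
           collection 'report', bulksize '512')""")
spark.sql("""insert into table report
  select e.name, d.dept_name
  from employee e join department d on e.dept_id = d.dept_id""")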
SequoiaDB Connector - RDD
2. RDD Usage
• Create

import org.apache.spark._
import com.sequoiadb.spark._

val conf = new SparkConf()
  .setAppName("SequoiaDB RDD example") // Spark requires an app name
  .setMaster("spark://server1:7077")
conf.set("sequoiadb.host", "server2:11810")
val spark = new SparkContext(conf)
// loadFromSequoiadb / saveToSequoiadb are added by the com.sequoiadb.spark._ imports
val rdd = spark.loadFromSequoiadb("sample", "Employee")
println("count = " + rdd.count())
rdd.saveToSequoiadb("server3:11810", "sample", "newEmployee")
Spark Connector Architecture
• DefaultSource
• SdbRelation
• SdbRDD
• SdbFilter
• SdbPartitioner
• SdbRDDIterator
• SdbCursor
• SdbSchemaSampler
• SdbWriter
[UML class diagram: DefaultSource implements the RelationProvider, SchemaRelationProvider, and CreatableRelationProvider interfaces (createRelation()) and DataSourceRegister (shortName()). SdbRelation extends BaseRelation (unhandledFilters()) and implements TableScan, PrunedScan, PrunedFilteredScan (buildScan()), and InsertableRelation (insert()). SdbRDD extends RDD (compute(), getPartitions(), getPreferredLocations()) with SdbBsonRDD and SdbRowRDD subclasses, partitioned by SdbPartition. SdbPartitioner (computePartitions()) has SdbSinglePartitioner, SdbShardingPartitioner, and SdbDatablockPartitioner implementations. SdbRDDIterator (hasNext(), next()) has SdbRowRDDIterator and SdbBsonRDDIterator variants. SdbCursor (hasNext(), next(), close()) has SdbNormalCursor and SdbFastCursor implementations. Supporting classes: SdbFilter, SdbConfig, SdbSchemaSampler, SdbWriter.]
SequoiaDB Connector Data Type Compatibility
SequoiaDB Type SparkSQL Type SQL Type
Int32 IntegerType Int
Int64 LongType Bigint
Double DoubleType Double
Decimal DecimalType Decimal
String StringType String
ObjectId StringType String
Boolean BooleanType Boolean
Date DateType Date
Timestamp TimestampType Timestamp
Binary BinaryType Binary
Null NullType Null
Object StructType Struct<field:type>
Array ArrayType Array<type>
MinKey StringType String
MaxKey StringType String
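To make the Object and Array rows concrete, here is a small illustration (the document shape is invented for this example, not taken from the deck): a SequoiaDB document such as { "name": "Alice", "scores": [90, 85], "address": { "city": "Guangzhou" } } would, per the table above, be sampled into the Spark SQL schema struct<name:string, scores:array<int>, address:struct<city:string>>.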
Spark SequoiaDB Scheduler
What the scheduler does:
• Dynamically load data from the main repository to the cache region, based on the tables and query predicates (see the sketch below)
  – Ex: select * from T1, T2 where T1.c = T2.c and T1.date between '2017-01-01' and '2017-02-01'
  – This query loads the first month of T1 data and the full T2 data into the cache region
  – Data for most tables is both sharded by PK and partitioned by date
  – Loading data from the main repository to the cache is simply SFTPing the data files
  – A 10Gb network is good enough for a fast copy
• Remove expired cache entries (LRU) when they are no longer needed
• Return the data from the cache region instead of reading from the main repository
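A minimal sketch of that behavior, assuming monthly date partitions and an SFTP-style file copy. CacheScheduler, copyPartition, and evictPartition are illustrative names, not part of the product:

import java.time.YearMonth
import scala.collection.mutable

// Decide which monthly partitions of a table must be copied into the
// cache region for a date-range predicate, with LRU eviction.
class CacheScheduler(capacity: Int) {
  // Most-recently-used partitions at the end; key = (table, month)
  private val cache = mutable.LinkedHashMap[(String, YearMonth), Unit]()

  def ensureLoaded(table: String, from: YearMonth, to: YearMonth): Unit = {
    val months = Iterator.iterate(from)(_.plusMonths(1)).takeWhile(!_.isAfter(to))
    for (m <- months) {
      val key = (table, m)
      if (cache.remove(key).isEmpty) copyPartition(table, m) // cache miss: copy from main repo
      cache.put(key, ())                                     // mark as most recently used
      if (cache.size > capacity) {
        val (victim, _) = cache.head                         // least recently used entry
        cache.remove(victim)
        evictPartition(victim._1, victim._2)
      }
    }
  }

  // Stand-ins for SFTPing the partition's data files into / out of the cache region
  private def copyPartition(table: String, month: YearMonth): Unit =
    println(s"copying $table/$month from main repository to cache region")
  private def evictPartition(table: String, month: YearMonth): Unit =
    println(s"evicting expired cache $table/$month")
}

// Usage for the example query above: one month of T1 is loaded
// (T2, having no date predicate, would be loaded in full elsewhere).
val scheduler = new CacheScheduler(capacity = 64)
scheduler.ensureLoaded("T1", YearMonth.of(2017, 1), YearMonth.of(2017, 1))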
Scheduler Architecture
[Diagram: a business application calls the data scheduler over REST; the data scheduler comprises a message processor, a partition manager, a task manager, and metadata storage, and sits between Spark (SequoiaDB connector, Spark RDD) on one side and the SequoiaDB cluster plus SequoiaDB standalone nodes on the other.]
• Message processing
• Metadata management
• Partition management
• Task management
• RESTful API
Data Scheduler: Explain Data Flow
[Sequence diagram across the Spark connector, message processor, partition manager, task manager, metadata, SequoiaDB cluster, and standalones: explain(query) is forwarded from the Spark connector through the message processor to the partition manager; the partition manager calls listReplicaGroups on the SequoiaDB cluster and queries partitions from metadata; it computes the unloaded partitions and creates a copy task; the task manager inserts the copy task, the standalones copy the partitions, and the task is marked finished; the partitions are loaded and saved in metadata; finally the explain is generated and returned to the connector.]
Data Scheduler Highlights
• Override SdbPartitioner (see the sketch below)
  – Locate the related data and load it into the cache region
  – Get partition information from the cache region instead of the main repository
• Stateless scheduler service
  – Performs the data copy tasks before returning the explain to the connector
  – Removes expired cache entries when they are no longer needed
  – Supports HA and load balancing
• Data safety
  – Ad-hoc queries run in the cache region instead of the main repository
  – Spark workers are installed on the cache region servers
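A hedged sketch of the override idea: SdbPartitioner and computePartitions() appear in the connector class diagram above, but the simplified trait, SdbPartition fields, and CachePartitioner below are illustrative assumptions, not the real interfaces:

// Simplified stand-ins for the connector types named in the class diagram
case class SdbPartition(host: String, collectionSpace: String, collection: String)

trait SdbPartitioner {
  def computePartitions(): Seq[SdbPartition]
}

// Illustrative override: resolve partitions against the cache region's
// metadata instead of the main repository, triggering a load on a miss.
class CachePartitioner(
    table: String,
    cacheMeta: String => Option[Seq[SdbPartition]], // lookup in cache-region metadata
    loadIntoCache: String => Seq[SdbPartition]      // copy from main repo, return new partitions
) extends SdbPartitioner {

  override def computePartitions(): Seq[SdbPartition] =
    cacheMeta(table) match {
      case Some(parts) => parts                 // already cached: serve from cache region
      case None        => loadIntoCache(table)  // not cached: load first, then serve
    }
}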
Project Achievement
• There are about 40 different application systems.
• OLTP features are used to make sure the data in SequoiaDB stays consistent with the production database.
• This is an online OLTP application in the financial industry running on a distributed database.
• SequoiaDB has more than 107 physical nodes deployed at this banking customer as a unified Operational Data Lake.
SequoiaDB
www.sequoiadb.com
sales_support@sequoiadb.com
(086)400-8038-339