Serhii Kholodniuk: What You Need to Know Before Migrating a Data Platform to GCP (Google Cloud Platform)
AI & BigData Online Day 2022
Website: https://aiconf.com.ua
Youtube: https://www.youtube.com/startuplviv
FB: https://www.facebook.com/aiconf
1. WHAT YOU NEED TO KNOW BEFORE MIGRATING A DATA PLATFORM TO GCP
by
SERHII KHOLODNIUK
2. Serhii Kholodniuk
Senior Big Data Engineer
Sigma Software, Kyiv office, Ukraine
My interests and goals:
• designing and developing data platforms for the needs of business intelligence and machine learning
• constantly looking for opportunities to simplify and optimize solutions, their implementation, and their maintenance
• client-value oriented
Mastering GCP:
• currently building a data platform in GCP
• migrating data pipelines into GCP infrastructure
• optimizing the data warehouse structure
4. WHY IS GCP BECOMING POPULAR?
Cloud infrastructure
• Network
• Cloud sustainability
Data cloud
• Security out of the box (data encrypted at rest and in transit)
• Powerful BigQuery features with ergonomic design
• Cloud infrastructure for all data needs
Customized solutions for different industries
• Best-practice industry solutions
Artificial intelligence solutions
• Prebuilt ML model APIs
• Custom model building with SQL in BigQuery ML
• Custom model building with Cloud AutoML
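To give the BigQuery ML point some shape: training a custom model really is a single SQL statement. A minimal sketch via the BigQuery Python client, where the dataset, table, column names, and model type are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Hypothetical dataset/table/columns; the label and model type are illustrative.
    sql = """
    CREATE OR REPLACE MODEL `my_dataset.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_months, monthly_charges
    FROM `my_dataset.customers`
    """
    client.query(sql).result()  # blocks until training completes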
5. CLOUD INFRASTRUCTURE
Network
• 29 regions
• 88 availability zones
• 146 edge locations
Cloud sustainability
• 100% renewable energy for all cloud regions
• 81% of waste diverted from landfills
• 2x more efficient than a typical enterprise data center
6. DATA CLOUD
• Security out of the box (data encrypted at rest and in transit)
• Cloud infrastructure for all data needs
• Powerful BigQuery features with ergonomic design
7. CUSTOMIZED SOLUTIONS FOR DIFFERENT INDUSTRIES
Best-practice industry solutions:
• Retail
• Consumer packaged goods
• Manufacturing
• Automotive
• Supply chain and logistics
• Energy
• Healthcare and life sciences
• Media and entertainment
• Gaming
• Telecommunications
• Financial services: financial services, capital markets
• Government and public sector: government, state and local government, federal government
• Education
9. MIGRATION PHASES
1. Pre-migration phase
• complete an inventory of the workloads and assets to be migrated
• calculate the total cost of ownership and the future business value
• build a use-case backlog
• select use cases for the iteration
2. Migration phase
• schema migration
• pipeline migration
• data migration
3. Post-migration phase
• cost and performance optimization
• schema denormalization for BigQuery
• removing nested and repeated schema fields
• clustering and partitioning (see the sketch after this list)
• slot reservations for BigQuery
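As a concrete example of the clustering and partitioning step, a minimal sketch that rebuilds a table with a partition and clustering spec using BigQuery DDL through the Python client; the table and column names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical table/columns: partition by event date, cluster by customer.
    ddl = """
    CREATE OR REPLACE TABLE `analytics.events_optimized`
    PARTITION BY DATE(event_ts)
    CLUSTER BY customer_id AS
    SELECT * FROM `analytics.events_raw`
    """
    client.query(ddl).result()
    # Queries filtering on event_ts and customer_id now scan far less data.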
10. ITERATIVE APPROACH IN AN AGILE WAY
Prioritize the use-case backlog -> select use cases for the iteration -> execution -> release -> next iteration
Execution steps:
1. Setup and data governance
2. Migrate schema and data
3. Translate queries
4. Migrate services and apps
5. Migrate data pipelines
6. Optimize performance
7. Verify and validate
12. DATAFLOW vs DATAPROC
Cloud Dataproc
• Recommended for: existing Hadoop/Spark applications, the machine learning / data science ecosystem, large batch jobs, preemptible VMs
• Fully managed: no
• Managed by: DevOps
• Auto-scaling: yes, based on cluster utilization (reactive)
• Expertise: Hadoop, Hive, Pig, the Apache big data ecosystem, Spark, Flink, Presto, Druid
Cloud Dataflow
• Recommended for: new data processing pipelines, unified batch and streaming
• Fully managed: yes
• Managed by: serverless
• Auto-scaling: yes, transform-by-transform (adaptive)
• Expertise: Apache Beam
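Since the Dataflow column boils down to "write Apache Beam", a minimal sketch of a Beam batch pipeline in Python; the bucket paths and word-count logic are purely illustrative:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runs locally by default; passing --runner=DataflowRunner (plus project,
    # region, and temp_location) executes the same pipeline on Cloud Dataflow.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda word, n: f"{word}\t{n}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
        )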
13. DATAFLOW vs SPARK SERVERLESS
Spark Serverless
• Recommended for: new data processing pipelines, unified batch, existing Spark applications (from Spark 3.2), the machine learning / data science ecosystem, large batch jobs
• Fully managed: yes
• Managed by: serverless
• Auto-scaling: yes, transform-by-transform (adaptive)
• Expertise: PySpark, Spark SQL, SparkR, Spark Java/Scala
Cloud Dataflow
• Recommended for: new data processing pipelines, unified batch and streaming
• Fully managed: yes
• Managed by: serverless
• Auto-scaling: yes, transform-by-transform (adaptive)
• Expertise: Apache Beam
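For contrast, the Spark Serverless route keeps plain PySpark code; a minimal word-count sketch with hypothetical bucket paths, which would typically be submitted as a Dataproc Serverless batch (e.g. with gcloud dataproc batches submit pyspark):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Dataproc Serverless supplies the Spark runtime; no cluster to create or size.
    spark = SparkSession.builder.appName("word-count").getOrCreate()

    lines = spark.read.text("gs://my-bucket/input/")  # hypothetical bucket
    counts = (
        lines
        .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
        .groupBy("word")
        .count()
    )
    counts.write.mode("overwrite").parquet("gs://my-bucket/output/counts")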
14. SCHEMA AND DATA MIGRATION
• Database Migration Service helps migrate MySQL and PostgreSQL to Cloud SQL
• BigQuery Data Transfer Service moves data into BigQuery on a managed schedule
• Google recommends loading large data volumes through the Storage Transfer Service, and prefers Avro, Parquet, or ORC formats over CSV or JSON (see the sketch below)
• Migration strategies for Oracle workloads: rehost (via Bare Metal Solution), replatform, rewrite
• HBase to Bigtable migration path: HDFS -> Cloud Storage -> Storage Transfer Service -> Bigtable
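To make the bulk-load recommendation concrete, a minimal sketch of loading Parquet files from Cloud Storage into BigQuery with the Python client; the URIs and table ID are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Parquet is self-describing, so BigQuery infers the schema from the files.
    job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

    load_job = client.load_table_from_uri(
        "gs://my-bucket/export/events-*.parquet",  # hypothetical export location
        "my-project.analytics.events",             # hypothetical destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish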
15. DATA STORES FOR DIFFERENT USE CASES
Data
• Unstructured -> Cloud Storage
• Structured
  • Data analytics workloads
    • millisecond latency -> Cloud Bigtable
    • latency in seconds -> BigQuery
  • Transactional workloads
    • NoSQL -> Firestore
    • SQL
      • one database is enough -> Cloud SQL
      • horizontal scalability -> Cloud Spanner
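The same decision tree, encoded as a small helper purely for readability; the flags and return values mirror the slide and nothing more:

    def pick_data_store(structured: bool,
                        analytics: bool = False,
                        needs_sql: bool = True,
                        millisecond_latency: bool = False,
                        horizontal_scale: bool = False) -> str:
        """Mirror of the slide's decision tree; illustrative only."""
        if not structured:
            return "Cloud Storage"
        if analytics:
            return "Cloud Bigtable" if millisecond_latency else "BigQuery"
        if not needs_sql:
            return "Firestore"
        return "Cloud Spanner" if horizontal_scale else "Cloud SQL"

    # e.g. structured transactional data that must scale horizontally:
    # pick_data_store(structured=True, horizontal_scale=True) -> "Cloud Spanner"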