Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

Alluxio, Inc.
Dec 12, 2020

  1. DATA ORCHESTRATION SUMMIT 2020 SuperDB Modernizing Global Shared Data Analytics Platform and our Alluxio Journey Sandipan Chakraborty | Director Engineering
  2. Topics • Brief About Rakuten • SuperDB Journey • Our Data Landscape • Challenges • Approach • Journey with Alluxio
  3. 70+ Services • Japan’s Largest e-commerce company • Internet Services • Fintech Services • Communications & Contents
  4. SuperDB: Centralized Data Platform for Rakuten Ecosystem • 43 Services*1 • 700+ TeraBytes Normalized Data sets • 6,500+ Users*2 from 40+ Businesses • 70+ Services in Ecosystem • PetaBytes of Data • *1 Excluding small services and common services • *2 # of weekly active users
  5. Our Journey
     • 2007 – 2012 • 1 • 5 • Traditional EDW • Teradata • On-Premise • BI Reporting, Ad-Hoc Analysis
     • 2013 – 2018 • 8 • 25+ • Teradata + Hadoop Big Data Stack • Presto • Mesos DC/OS • On-Premise & GCP • Hadoop Cluster • GCS • Click Stream Data • Recommendation, Personalization • Support ML
     • 2019 – 2020 • 30+ services (2019) • 40+ services (2020) • Multi-Cloud (GCP + Azure) • Presto + Alluxio (POC) • Mesos DC/OS • Kubernetes • Starburst Presto • Alluxio (Prod.) • Cloud Storage • Hybrid Compute • Hadoop Clusters • Object Storage • Optimize Analytics • Optimize AI / ML • Teradata + Hadoop Big Data Stack
  6. Our Data Landscape
     • Sources: Web / RAT • User Transaction • IoT / Device • Apps • Click Stream • On-premise
     • Business Generated (Data Producers): Real-Time & Batch (Containers) • Common Schema • Business DWH / DL
     • SuperDB (Enterprise Repository): SINGLE VERSION OF TRUTH! • Super DB on-premise (JP) • SuperDB Cloud (US, Japan) • Auto-Sync • Normalized • Transaction / aggregated
     • Insights & Data Science (Data Consumers): Data Projections and Feature Sets • Virtual Data Mart (On-Prem / GCP Cloud) • On-Demand Scale (Cloud + Containers) • Common Schema • Cloud Bursting (Containers) • AWS • Azure • GCP • Query Layer
     • Data challenges: Diverse data from diverse sources, growing rapidly
     • Easier Data Management: Based on Personas, Gives Transparency & Better Cost Control, Standardized and Automated
     • Faster & Better Insight & AI: Start Analysis by ability to connect and Run anywhere
     • Faster Business Insight • Faster time to Analysis • Quick Experimentation • Cloud Native & Hybrid Architecture • Granular Access control • Data encryption (End to End) • Multi-Factor Authentication
  7. Support for Different Personas
     • Business Analysts: Ad hoc query capacity; discovery, fast and easy access, OLAP and low latency; BI support and reporting.
     • Data Scientists: Ad hoc query capacity (OLAP, low latency); running workloads on large-scale computing; data science platform and tools for ML / AI workloads.
     • Applications: Ability to integrate with APIs; support for data sync to different clusters.
     • Data Engineers: Query, data ingestion and transformation; scalable processing and long-running jobs; real-time and batch support; data QC support.
     • Governance, Audit & Security: Secured access layer; ability to create audit reports; data lineage and traceability.
     • System Admin & Operators: Maintaining the data system infrastructure; workload tuning; data pipeline maintenance; data QC.
     • Sand-Box Users: Creates, joins, ad hoc reports, KPIs; experiments and quick analysis; support for various marketing activities.
  8. Our Challenges
     • System Scalability: Compute elasticity for experiments. Adding capacity was a time-consuming process.
     • Data Availability: Unable to address / optimize for different personas. Legacy code and limited processing power resulted in job delays.
     • Data Freshness: Too many data copy pipelines needed to be built, delaying access to data. Managing data copy pipelines to different clusters became an operational overhead.
     • Analytics Agility: Data had to be moved before any analysis could be done, and not all data is present in the DWH for analysis. Quick analysis cannot be done across the data silos of different businesses.
  9. Our Approach
     Challenges:
     • System Scalability: Compute elasticity for experiments. Adding capacity was a time-consuming process.
     • Data Availability: Unable to address / optimize for different personas. Legacy code and limited processing power resulted in job delays.
     • Data Freshness: Data sync cannot be done between different clusters in DCs. Too many data copies.
     • Analytics Agility: Cannot join transaction and behavior data. Needs a lot of data movement. Quick analysis cannot be done.
     Approach:
     • Hybrid & Cloud-native architecture: On-demand compute with public cloud; separate storage and processing; containerization and cloud native.
     • Data Sync & Orchestration (Alluxio): Data sync across DCs and cloud; data processing cache layer.
     • Query Layer (Starburst Presto): Start analytics by connecting to different stores on multi-cloud and on-prem before any data movement; common security layer with Ranger.
  10. One Major Challenge
     • Diagram labels: Data Sources • Teradata • Legacy HDFS • New HDFS • GCS • PwC • ODIN (Legacy) • Python (Legacy) • Spark Pipeline (New) • Pipeline X • Pipeline Y • Pipeline Z • copy (×4)
     ❖ ODIN is a homebrewed data ingestion system.
     ❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.
  11. Data Sync (Released in Production)
     • Diagram labels: Source Data • Alluxio Ingest • Alluxio X • Alluxio Y • Alluxio Z • HDFS Cluster • HDFS Cluster • GCS • Rakuten DC1 • Rakuten DC2 • GCP
     ❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
     ❖ Consumption tools cache data from different DCs to improve performance and enable DR.
  12. Data Caching for Consumption
     • GKE (GCP) & AKS (Azure), 2020 Production: Alluxio master • Alluxio worker (×3) • Presto Coordinator • Presto worker (×3) • Mem Cache (×4)
     • On-Prem Bare Metal (POC): Physical box (×3) • HDFS: DC local • HDFS: DC remote 1 • HDFS: DC remote 2 • Alluxio master • Alluxio worker (×3) • Presto Coordinator • Presto worker (×3)
  13. Consumption in Production Today
     • Physical box (×3) • HDFS DC1 • HDFS DC2 • Alluxio master • Alluxio worker (×3) • Presto Coordinator • Presto worker (×3)
     • On-Prem Bare Metal (2019 - Early 2020)
     • Bare Metal (K8 Cluster): Present Production
  14. Distributed Cache (Presently under POC) • TensorFlow / Caffe • Spark Compute (Transformation) • Spark Compute Aggregations • Distributed Cache • Kubernetes, Kubeflow • Linux • Rakuten OneCloud • Bare Metal • GPU • CPU • HDFS • Object Store • NAS • Libfuse • AlluxioFUSE • Alluxio JVM
  15. Our Journey with Alluxio
     • 2017: Started using Presto Open source (On-Prem)
     • 2018: Started using Presto Open source (GCP)
     • 2019: POC with Presto + Alluxio (GCP)
     • 2020: Presto + Alluxio (GCP, Azure) • POC: Distributed Cache with Alluxio for ML & Data Pipeline Jobs • Data Sync with Alluxio (On-Prem)
     • 2021: Planned: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
  16. Overview: Wrap-up • RDB • NoSQL • Files • events • Pipeline Service • Hadoop • Discovery Service • Consumption Service • Transformations • Landing zone • Common Schema mapping • Common Marts • Data Orchestration Layer • Presto • BI tools • AI / ML • Data Exploring • Downstream pipelines • Spark • Schema management • Data ACL • Classification • Auditing • Changelogs • Changelogs • Cloud
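
The Data Sync and Data Caching slides (11 and 12) describe two ideas: an ingest layer that persists each write to several backing stores, and a consumption layer that caches remote data near the query engine. The sketch below is a toy illustration of those two ideas only. It does not use the Alluxio or Presto APIs, and every class name, store name, and path in it (`UnderStore`, `IngestLayer`, `CachingReader`, `hdfs-dc1`, `/clickstream/part-0`) is hypothetical.

```python
from typing import Dict, List


class UnderStore:
    """Hypothetical stand-in for a backing store (an HDFS cluster, a GCS bucket)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, bytes] = {}
        self.reads = 0  # how many times the (slow) backing store was read

    def put(self, path: str, data: bytes) -> None:
        self.objects[path] = data

    def get(self, path: str) -> bytes:
        self.reads += 1
        return self.objects[path]  # raises KeyError if the path is missing


class IngestLayer:
    """Toy ingest layer: every write is persisted to all destinations."""

    def __init__(self, destinations: List[UnderStore]) -> None:
        self.destinations = destinations

    def write(self, path: str, data: bytes) -> None:
        for store in self.destinations:
            store.put(path, data)


class CachingReader:
    """Toy consumption layer: read-through cache in front of one store."""

    def __init__(self, store: UnderStore) -> None:
        self.store = store
        self.cache: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        if path not in self.cache:
            # Cache miss: fetch once from the backing store, then keep it.
            self.cache[path] = self.store.get(path)
        return self.cache[path]


# Assumed topology for illustration: two on-prem HDFS clusters and one bucket.
hdfs_dc1 = UnderStore("hdfs-dc1")
hdfs_dc2 = UnderStore("hdfs-dc2")
gcs = UnderStore("gcs")

ingest = IngestLayer([hdfs_dc1, hdfs_dc2, gcs])
ingest.write("/clickstream/part-0", b"event-1")

reader = CachingReader(hdfs_dc2)
first = reader.read("/clickstream/part-0")   # miss: goes to hdfs-dc2
second = reader.read("/clickstream/part-0")  # hit: served from the cache
print(first == second, hdfs_dc2.reads)
```

A single write reaches all three stores without a separate copy pipeline per destination, and the second read never touches the backing store. A real deployment would also need eviction, consistency handling, and failure handling, none of which this sketch attempts.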