Data Platform Architecture Principles and Evaluation Criteria


  1. Data Platform Architecture Principles and Evaluation Criteria
     Pooja Kelgaonkar, Senior Data Architect at Rackspace Technology
  2. Pooja Kelgaonkar
     ■ Senior Data Architect - GCP, Snowflake
     ■ Specialist in Data Modernization Implementations
     ■ Expertise in the “Data” domain
     ■ Learner, Tech Blogger, Tech Evangelist
     ■ Reading, listening to Indian classical music
  3. Agenda
     ■ Architecture Principles
     ■ Data Modernization
     ■ Data Platform Offerings
     ■ Evaluation Criteria
     ■ Sample Use Case - Evaluation Comparison
  4. Data Architecture Principles
  5. Framework Pillars
     05 Operational Excellence
        ● Serviceability - Easy Operations & Maintenance
        ● Maintenance - Data Pipeline Maintenance
        ● Reduced Ops Activities
     04 Cost Efficiency
        ● Performance Efficiency
        ● Cost Efficiency - Cost Optimized
     03 Availability
        ● Reliability
        ● Resiliency of System
        ● Availability - System Uptime
     02 Scalability
        ● Horizontal Scaling
        ● Vertical Scaling
        ● Auto Scaling
     01 Security
        ● Access Management & Controls
        ● Data Protection - Encryption, Data Masking
        ● Compliance - ISO, HIPAA, PCI DSS
        ● Data Governance
  6. Data Modernization Journey
     ■ Cloud Migration / Adoption - 5 Rs of transformation
     ■ Rehost, Refactor, Revise, Rebuild and Replace
     01 Data Discovery
        Analysis of the existing data architecture and system design; evaluation of the needs and requirements of the new data system.
     02 Data Architecture & Assessment
        Designing the new data platform; assessment of data modelling, data governance and security.
     03 Data Architecture & Engineering
        Data platform implementation, data pipeline development and enhancement POCs. Designing the end-to-end cycle.
     04 Data Migration & Pipeline Development/Conversion
        Actual pipeline development and conversion on the new platform. Implementing, testing and validating pipelines and data on the new platform.
     05 Go Live & DataOps
        Soft launch / early cut-off to integrate with other systems and obtain sign-off from business users. Implementing operations for the new platform and the modified pipelines.
  7. Design Framework Pillars & Considerations
     Who? When? How? What?
     ■ Teams: Architects, Engineering, Operations
     ■ Business Drive, Technology Drive, Management & Engagement Drive
     ■ End User SLAs; Assessment & SLA Setting; System Assessment & Technology Evaluations; Signed Up Services vs Open Source vs Hybrid Evaluations
     ■ Business Assessment; Technology Evangelist; Sign Up for Services; Business Teams
  8. Data Platform Offerings
  9. Cloud Native
     Public cloud providers offer various services for implementing a data platform: DB, DW, Data Lake, Data Mesh, SQL, NoSQL, etc.
     ■ AWS
        ● AWS Glue
        ● EMR
        ● Kinesis, OpenSearch
        ● RDS, Aurora
        ● Redshift
        ● DynamoDB, DocumentDB
     ■ Azure
        ● Azure Data Factory
        ● HDInsight
        ● Azure Stream Analytics
        ● Azure SQL, Managed SQL
        ● SQL Server, PostgreSQL
        ● MariaDB, CosmosDB, Managed Cassandra
     ■ GCP
        ● Dataflow, Data Fusion
        ● Dataproc
        ● Pub/Sub, Stream
        ● Cloud SQL, Cloud Spanner
        ● BigQuery, BigLake
        ● Bigtable, Firestore, Memorystore
  10. Data on Cloud - Common Offerings
      There are a variety of offerings for implementing a data platform and designing data pipelines using native and open source services.
  11. Data Platform - Evaluation Criteria (Assessment Phase)
  12. Evaluation - Pre-Requisites
      Evaluation Criteria
      ■ Existing/Cross Application Platform to be Evaluated
         ● Platform Offerings: Existing Support Tier / Billing Plans
         ● Managed/Native Services / BYOL Services: Existing System Licenses, Integrators - BI, Ops Tools
      ■ New Platforms to be Explored
         ● Platform Offerings: Probable Platform to be Evaluated, Cost Comparisons Done?
         ● Managed/Native Services / BYOL Services: Existing System Licenses, Integrators - BI, Ops Tools
      Specific Evaluation or Open Evaluation to Select Best Fit for Given Use Case
  13. Evaluation - Inputs
      ■ Capex vs Opex %
      ■ % of Data Storage vs Scan vs Processed
      ■ Compute vs Storage Utilization %
      ■ Data Challenges
      ■ System Challenges
  14. Evaluation - Checkpoints
      1 Data Operations & Business Critical Requirements
         ● Data Pipeline Management - Monitoring & Operations
         ● Business Requirements - 24x7 Monitoring vs SLAs
         ● Critical Applications - Availability & SLAs
      3 Business Checkpoint
         ● Data Availability - SLAs
         ● End User Agreements
         ● Business Requirements - Specific to Tooling
         ● Existing Cost Utilization
         ● Performance Ratio - Current vs Expected
         ● Modernization Drive
      5 Data Platform Checkpoint
         ● Type of Data - Structured, Semi-Structured, Unstructured
         ● Sources of Data - Files, DBs, IoT, Devices, APIs
         ● Consumers of Data - Users vs System
         ● Frequency of Data - Batch, Real-Time
         ● Data Storage - Active vs Passive
         ● Data Modelling - Schema, Tables, DB Objects
      2 Data Analytics Checkpoint
         ● Data Analytics - BI Tooling
         ● Predictive Analytics - Algorithms, Tools, Libraries Used
         ● AI/ML Use Cases - Customer Facing vs In-House
         ● Enterprise vs Cloud Native
      4 Data Processing Checkpoint
         ● Target Systems Integrations
         ● Data Usage - Hot Data vs Cold Data
         ● Data Stored vs Data Processed vs Reads
         ● Data Pipelines - Batch vs Streaming
         ● Data Pipeline Complexity - S/M/C/VC
         ● Data Pipeline Scheduling - Tools, Cron Jobs, Native Schedulers, Event Based
         ● ETL vs ELT Requirements
  15. Evaluation - Metrics (Checkpoint / Category / Metrics)
      ■ Data Checkpoint
         ● Data Integrators: No. of Sources, No. of Targets, No. of Specific Systems, Total Storage Volume, Daily Delta Volume
         ● Data Modelling: Frequency of Schema Evolution, No. of Objects, % of NoSQL Objects, % of PL/SQL Objects
      ■ Data Processing
         ● Data Pipelines: No. of S/M/C Jobs, No. of External Functions Integrated (Java/Python/SQL), No. of ETL Jobs (Tool Based), No. of Compute Intense Jobs, No. of Storage Intense Jobs
      ■ Business Checkpoint
         ● Operations: No. of Times SLA Challenged, No. of End Users Affected
         ● Reliability: No. of Times Data Compromised, No. of DR Activities, No. of End Users Impacted
         ● Performance Efficiency: Total Batch Time, No. of Times Batch SLA Impacted, No. of End User Reports, No. of End Users/Consumers, No. of Poor Performing Reports/Queries
         ● Cost Utilization: Overall Billing (Capex), Total Operations and Maintenance Cost
      ■ Data Operations
         ● Monitoring: No. of Support Team Members, No. of Monitoring Dashboards
      ■ Data Analytics
         ● Analytics: No. of ML Jobs/Algorithms, ML Integrators
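A few of the metrics on this slide can be recorded and turned into simple ratios during the assessment. The following is a minimal sketch only: the class name, field names, derived ratios and sample counts are all illustrative assumptions, not part of the original deck.

```python
from dataclasses import dataclass


@dataclass
class EvaluationMetrics:
    """Hypothetical record for a handful of the slide's metrics."""
    simple_jobs: int              # No. of S jobs
    medium_jobs: int              # No. of M jobs
    complex_jobs: int             # No. of C jobs
    end_user_reports: int         # No. of End User Reports
    poor_performing_reports: int  # No. of Poor Performing Reports/Queries

    def complexity_mix(self) -> dict:
        """Percentage share of S/M/C jobs (assumed derived metric)."""
        total = self.simple_jobs + self.medium_jobs + self.complex_jobs
        if total == 0:
            return {"S": 0.0, "M": 0.0, "C": 0.0}
        return {
            "S": 100 * self.simple_jobs / total,
            "M": 100 * self.medium_jobs / total,
            "C": 100 * self.complex_jobs / total,
        }

    def poor_report_pct(self) -> float:
        """Share of end user reports flagged as poor performing."""
        if self.end_user_reports == 0:
            return 0.0
        return 100 * self.poor_performing_reports / self.end_user_reports


# Made-up counts, for illustration only.
sample = EvaluationMetrics(50, 30, 20, 200, 30)
print(sample.complexity_mix(), sample.poor_report_pct())
```

Ratios like these make it easier to compare the current system against a candidate platform than raw counts alone.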
  16. Data Platform - Evaluation Use Case
  17. Evaluation - Pre-Requisites
      Evaluation Criteria
      ■ Existing/Cross Application Platform to be Evaluated
         ● Platform Offerings: Existing Support Tier / Billing Plans
         ● Managed/Native Services / BYOL Services: Existing System Licenses, Integrators - BI, Ops Tools
      ■ New Platforms to be Explored
         ● Platform Offerings: Probable Platform to be Evaluated, Cost Comparisons Done?
         ● Managed/Native Services / BYOL Services: Existing System Licenses, Integrators - BI, Ops Tools
      Specific Evaluation or Open Evaluation to Select Best Fit for Given Use Case
  18. Evaluation - Inputs
      ■ Domain - Retail, DW - Teradata, ETL - DataStage
      ■ Platform - Recently Signed Up for Google Cloud Platform
      ■ Data Platform - Evaluate GCP Services to Set Up Data Warehouse Platform
      ■ DW Size - 120 TB (70 TB Active + 50 TB Passive)
      ■ Daily Volume - 1 TB (80% Batch + 20% Streaming)
      ■ Data - Structured & Semi-Structured (JSON, XML)
      ■ Data Pipelines - Mostly ELT - DataStage to Teradata (Landing Layer), Teradata SQL to Transform Data
      ■ Data Analytics - Tableau Reports - Customer Reports
      ■ Enterprise Scheduler - Control-M, Ticketing Tool - JIRA, Alerting via Slack, Email
      ■ Monitoring Dashboards, 24x7 Support Team
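The sizing figures on this slide can be broken down with some quick arithmetic. The sketch below uses only the numbers stated above (70 TB active, 50 TB passive, 1 TB daily, 80/20 batch/streaming split). The function and key names are illustrative, and it assumes decimal units (1 TB = 1000 GB).

```python
def sizing_summary(active_tb, passive_tb, daily_tb, batch_share):
    """Summarise warehouse size and daily volume split (1 TB = 1000 GB assumed)."""
    total_tb = active_tb + passive_tb
    daily_gb = daily_tb * 1000
    return {
        "total_tb": total_tb,
        "active_pct": round(100 * active_tb / total_tb, 1),
        "passive_pct": round(100 * passive_tb / total_tb, 1),
        "daily_batch_gb": round(daily_gb * batch_share),
        "daily_streaming_gb": round(daily_gb * (1 - batch_share)),
    }


# Figures from the slide: 70 TB active, 50 TB passive, 1 TB/day, 80% batch.
summary = sizing_summary(active_tb=70, passive_tb=50, daily_tb=1, batch_share=0.80)
print(summary)
```

With these inputs roughly 58% of the warehouse is active data, and about 800 GB of the daily volume arrives in batch with about 200 GB streaming.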
  19. DW - Google BigQuery vs Azure Synapse
      BigQuery | Synapse
      Observations
      1 Data Platform Checkpoint
         ● Supports More Than 90% of Requirements
         ● SaaS Offering, Cloud Managed
         ● Very Well Integrated
      2 Data Processing Checkpoint
         ● Native Drivers to Support Batch & Stream
         ● Highest Data Processing Speed
         ● Storage vs Compute - Scaling In and Out
         ● Automatic Scaling, Performance Efficient
      3 Data Analytics Checkpoint
         ● Can Be Integrated With Any BI Tools
         ● Supports AI/ML Libraries and Jobs
         ● Performance Efficient - Data Processing, Scanning
      5 Data Operations
         ● Customized Logging & Monitoring
         ● Native vs Customized Dashboards
         ● Integration With Various Alerting, Messaging Tools
      4 Business Checkpoint
         ● High Availability
         ● Automatic Failover, No DR Required
         ● Performance & Cost Efficient
         ● Pay as You Go vs Commitment Comparison Based on Overall Usage
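The deck does not say how the per-checkpoint observations are combined into an overall result. One common option is a weighted scoring matrix, sketched below under stated assumptions: the checkpoint keys, the 0-5 scale, the weights and the scores are all hypothetical, and the example deliberately uses a placeholder platform rather than assigning invented scores to BigQuery or Synapse.

```python
# Checkpoint keys mirror the five checkpoints on the slide (names are illustrative).
CHECKPOINTS = [
    "data_platform",
    "data_processing",
    "data_analytics",
    "business",
    "data_operations",
]


def weighted_score(scores, weights):
    """Weighted average of per-checkpoint scores (assumed 0-5 scale)."""
    missing = [c for c in CHECKPOINTS if c not in scores or c not in weights]
    if missing:
        raise ValueError(f"missing checkpoints: {missing}")
    total_weight = sum(weights[c] for c in CHECKPOINTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in CHECKPOINTS) / total_weight


# Hypothetical weights and scores for a placeholder candidate platform.
weights = {"data_platform": 3, "data_processing": 2, "data_analytics": 1,
           "business": 2, "data_operations": 2}
platform_a = {"data_platform": 5, "data_processing": 4, "data_analytics": 4,
              "business": 3, "data_operations": 4}
print(weighted_score(platform_a, weights))
```

Running the same function over each candidate with one shared set of weights keeps the comparison consistent, and the weights make business priorities explicit.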
  20. Evaluation - Final Report
      ■ DW
         ● Approach 1: BigQuery
         ● Approach 2: BigQuery
      ■ ETL + ELT Pipelines
         ● Approach 1: Modify DS jobs to use the BQ connector to load data into the BQ landing layer.
         ● Approach 2: Convert DS load jobs to BQ load jobs that pull data from the source and load it into BQ (this depends on the types of source systems and the integration complexity).
      ■ Data Storage
         ● Both approaches: Store active data in BQ native tables with roll-up policies and store passive datasets on the GCS layer, depending on table usage. External tables can be built on GCS datasets.
      ■ Data Analytics
         ● Both approaches: Tableau connections can be replaced with BQ connections.
      ■ Data Pipeline Scheduler & Maintenance
         ● Both approaches: Control-M can be used to trigger the pipelines; orchestration can be done using Composer. Existing ticketing and alerting tools can be used as is.
      BigQuery was selected after the evaluation, a choice based entirely on the initial sign-up to GCP and on the ratio of active to passive storage. Azure Synapse can offer the same capabilities; however, the choices here are business and enterprise driven.
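The storage recommendation above (active data in BQ native tables, passive data on GCS behind external tables) implies a per-table placement rule. The sketch below shows one possible rule. The 90-day activity window, the function name, the table names and the tier labels are all assumptions, since the deck only says placement depends on table usage.

```python
def placement(days_since_last_query, active_window_days=90):
    """Pick a storage tier from table usage (the threshold is an assumed value)."""
    if days_since_last_query < 0:
        raise ValueError("days_since_last_query must be non-negative")
    if days_since_last_query <= active_window_days:
        return "bigquery_native"  # active data stays in BQ native tables
    return "gcs_external"         # passive data moves to GCS, read via external tables


# Hypothetical tables mapped to the number of days since they were last queried.
usage = {"sales_daily": 1, "customer_dim": 14, "orders_2015_archive": 400}
plan = {table: placement(days) for table, days in usage.items()}
print(plan)
```

In practice the usage figures would come from the warehouse's query logs, and the threshold would be tuned against the active/passive split measured during the assessment.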
  21. Thank You - Stay in Touch
      Pooja Kelgaonkar
      poojakelgaonkar@gmail.com & pooja.kelgaonkar@rackspace.com
      www.linkedin.com/in/poojakelgaonkar
      poojakelgaonkar.medium.com
