Flipkart Data Platform @ Scale
Arya Ketan, Rishabh Dua
Engineers @ Flipkart Tech
In God we trust. All others must bring data!
Flipkart confidential - For Internal use only. Not to be shared externally.
Agenda
1. Data @ Flipkart
2. Data platform architecture
3. Challenges @ Scale
4. Operating
5. Storage & Compute Optimizations
6. Data Governance
Data @ Flipkart
Who are the users?
“Torture the data, and it will confess to anything.”
Big Data - no longer just a buzzword
● 80% of data is less than 2 years old
● 15+ PB of HDFS files
● 3 billion+ events ingested daily
● 400 billion+ container hours daily
● 30+ TB ingested daily
Data Platform Architecture
Architecture
Challenges @ Scale
Operating
● Predictability
● Reliability
Operating data platform
Challenges in batch processing
Classic Batch pattern
● Fixed window cycles
● Repeated every window
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Challenges in batch processing
● Breaks down when used with sophisticated windowing strategies, such as sessions
● Businesses crave more timely data
● Uneven workload spread
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Stream processing patterns
● Stream
○ Low latency but approximate results
○ Unordered data of varying event-time skew
● Event time: the time at which events actually occurred.
● Processing time: the time at which events are observed in the system.
Lambda Architecture
Semantics for unbounded data
● Time-agnostic
● Approximation
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
Semantics for unbounded data
Images from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
● Windowing by processing time
● Windowing by event time
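The difference between the two windowing choices can be illustrated with a minimal sketch. The events, the 60-second tumbling window, and the function name below are assumptions made for illustration; they are not taken from the talk.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size

# Hypothetical events as (event_time, processing_time) pairs, in seconds.
# The second event occurred at t=50 but only reached the system at t=130.
events = [(10, 12), (50, 130), (70, 75), (125, 128)]


def tumbling_counts(timestamps, size=WINDOW_SECONDS):
    """Count timestamps per fixed window, keyed by window start."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[(ts // size) * size] += 1
    return dict(counts)


# Same events, bucketed by when they happened vs. when they were observed.
by_event_time = tumbling_counts(e for e, _ in events)
by_processing_time = tumbling_counts(p for _, p in events)

print(by_event_time)
print(by_processing_time)
```

The late event is counted in the first window when bucketing by event time, but in the last window when bucketing by processing time, so the two views disagree whenever data arrives with skew.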
Batch to fStream
● Streaming applications
○ f-SQL (ANSI SQL compliant)
Batch to fStream
● Streaming applications
○ Materialized time windows: time-partitioned aggregates stored in HBase
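One way to picture materialized time windows is as aggregates that are updated incrementally and keyed by metric and window start. The sketch below uses an in-memory dictionary as a stand-in for an HBase table; the row-key layout, the 5-minute window, and the class name are hypothetical and are not fStream's actual design.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # assumed 5-minute windows


class WindowStore:
    """In-memory stand-in for a time-partitioned aggregate table."""

    def __init__(self):
        self.rows = defaultdict(lambda: {"count": 0, "sum": 0.0})

    @staticmethod
    def row_key(metric, event_time):
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        # Zero-padded so that keys for one metric sort by time.
        return f"{metric}#{window_start:010d}"

    def add(self, metric, event_time, value):
        """Fold one event into the aggregate of its window."""
        row = self.rows[self.row_key(metric, event_time)]
        row["count"] += 1
        row["sum"] += value

    def scan(self, metric, start, end):
        """Return aggregates for windows whose key lies in [start, end)."""
        lo = f"{metric}#{start:010d}"
        hi = f"{metric}#{end:010d}"
        return {k: v for k, v in sorted(self.rows.items()) if lo <= k < hi}


store = WindowStore()
store.add("orders", 20, 100.0)
store.add("orders", 250, 50.0)
store.add("orders", 310, 75.0)
print(store.scan("orders", 0, 600))
```

Because each window is stored as its own row, a query over a time range becomes a key-range scan rather than a recomputation over raw events.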
Improvements
● Fresher data at lower latency
○ User Insight prediction
○ Trust and Safety Interventions
● Newer features for data-science
○ User sessionization
● Lower resource consumption
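User sessionization, listed above as a new data-science feature, is commonly done with gap-based session windows. The following is a minimal sketch under assumptions: the 30-minute inactivity gap, the function name, and the sample click times are illustrative and do not describe Flipkart's implementation.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Group event times into sessions split by gaps longer than gap_seconds."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if the event is close enough to the last one.
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical click times for one user, in seconds.
clicks = [0, 600, 900, 4000, 4100, 9000]
sessions = sessionize(clicks)
print(sessions)
```

Session boundaries depend on the data itself rather than on a fixed clock, which is why this kind of window is awkward to compute with fixed batch cycles.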
Optimizing the data platform to improve predictability
Overload @ constant capacity
● More users, more use cases, more jobs, more resources
○ 100x increase in compute hours
● Hardware is not available in the data center to scale at the same rate
○ 1.1x increase in machine instances
Job analysis
● Problems in jobs are not obvious
● Many possible configurations: Hive, Hadoop, HDFS, Spark, JVM
● Inter-related settings
● Information & metrics are scattered
Optimizing compute usage
Dr. Elephant to the rescue
● Automated performance monitoring and tuning tool
● Indicates best practices and tuning tips
● Goal: best performance for every job
http://github.com/linkedin/dr-elephant/
Dr. Elephant - Dashboard
Dr Elephant - Heuristics & Severity
Optimizing compute - Tez vs MapReduce
● Tez creates a DAG of tasks. Compared to MR:
○ No intermediate data written to HDFS between stages
○ Larger memory footprint
● No one size fits all
● Assigner chooses compute engine
○ Container hours
○ Resources used
○ Configuration tweaking
[Diagram: job to be scheduled → Assigner → compute engine chosen (Tez or MR)]
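The slide does not describe the Assigner's rules, so the following is only a hypothetical sketch of how such a chooser might weigh container hours and memory use from earlier runs. The `JobHistory` fields, the 8 GB budget, and the decision order are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobHistory:
    """Hypothetical per-job stats collected from earlier runs on each engine."""
    mr_container_hours: Optional[float] = None
    tez_container_hours: Optional[float] = None
    tez_peak_memory_gb: Optional[float] = None


def choose_engine(history: JobHistory, memory_budget_gb: float = 8.0) -> str:
    """Pick "TEZ" or "MR" for the next run of a job (illustrative rules only)."""
    # No Tez run recorded yet: stay on MapReduce.
    if history.tez_container_hours is None:
        return "MR"
    # Tez's larger memory footprint may not fit the assumed budget.
    if (history.tez_peak_memory_gb is not None
            and history.tez_peak_memory_gb > memory_budget_gb):
        return "MR"
    # Tez history exists but no MapReduce baseline to compare against.
    if history.mr_container_hours is None:
        return "TEZ"
    # Otherwise prefer whichever engine used fewer container hours.
    if history.tez_container_hours < history.mr_container_hours:
        return "TEZ"
    return "MR"


print(choose_engine(JobHistory(mr_container_hours=10.0,
                               tez_container_hours=4.0,
                               tez_peak_memory_gb=6.0)))
```

Keeping the choice per job, rather than cluster-wide, is what "no one size fits all" amounts to in practice.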
Optimizing storage
Many storage formats: JSON, Avro, ORC
Which storage format?
ORC vs Avro vs Parquet vs JSON
● ORC and Parquet score over Avro and JSON
○ Encoding, dictionaries, indexes, projection pushdown, predicate pushdown
● Choose Parquet for highly nested structures.
○ Note: We are working on a feature in ORC + Hive to support predicate pushdown and projection pushdown.
Optimized storage format
● Columnar format
● Integrated compression, indexes and stats
● Predicate pushdown & projection pushdown
● Run-length encoding
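Run-length encoding is one reason columnar formats compress well: values within a single column often repeat in long runs. The sketch below shows the general idea on a hypothetical column of country codes; it is not ORC's actual on-disk encoding.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical column values, stored contiguously as in a columnar layout.
column = ["IN", "IN", "IN", "US", "US", "IN"]
encoded = rle_encode(column)
print(encoded)
```

In a row-oriented layout the same values would be interleaved with other fields, so the runs, and the savings, would mostly disappear.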
Improvements
● ORC
○ ~80% savings in storage, ~60% savings in compute
● Dr. Elephant
○ 2000+ jobs improved
○ ~70% savings in compute
● Tez
○ 10-100x improvement in processing speed
Data Governance
With great power comes great responsibility.
- Uncle Ben
Unreliability due to data issues
● What is the source of truth for “Order Item Information”?
-- No way to annotate a data asset as blessed
● Why is this “Id” not in the Data Platform?
-- Referential integrity constraints & validations are not supported
● Why does Account-Id have invalid characters such as “%@#21323213”?
-- The column is an “account id”, not just a String.
● Why does my data table have yesterday’s data?
-- RCA of the dependencies is hard
Missing guard-rails & attribution
● Unrestricted usage of data assets in the platform
● No minimum guarantees on compute for job execution
Lineage
● Data Assets Lineage
○ Easier RCA
○ Enables Reuse
○ Strategies to improve data quality
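Lineage makes RCA easier because the upstream dependencies of a stale or broken asset can be listed mechanically. A minimal sketch, assuming lineage is available as a mapping from each asset to the assets it is built from; the asset names and the function name are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "daily_sales_report": ["order_item_fact"],
    "order_item_fact": ["orders_raw", "items_raw"],
    "orders_raw": [],
    "items_raw": [],
}


def upstream_assets(asset, lineage=UPSTREAM):
    """Breadth-first walk returning every transitive upstream dependency."""
    seen = set()
    order = []
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


print(upstream_assets("daily_sales_report"))
```

When a report shows yesterday's data, this list is the set of candidates to check, nearest dependency first.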
What is Data Governance?
1. Catalog of data assets: schema & dependency definition
2. Classify and govern these assets: attributes, tagging & security policies
3. Collaboration capabilities around these data assets: ownership, accountability, subscriptions
Schema Tightening
● Why? Identify data issues before they enter the system
[Diagram: MicroService1 and MicroService2 send records through Data Platform ingestion. AccountId: ABC21312321333 passes through to the Data Platform; AccountId: FOO%%1231233 is rejected with ERROR.]
Schema Tightening
How?
● Business types, e.g. AccountId, Price, OrderId
● Validations via JSON Schema
● Migrating to schema-tightened entities
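A business type such as AccountId can be expressed as a JSON Schema fragment with a `pattern` constraint. The sketch below is illustrative only: the pattern is a guess at what a valid AccountId might look like, and the tiny `validate` helper checks just the `type: string` and `pattern` keywords. A real pipeline would use a full JSON Schema validator.

```python
import re

# Hypothetical business type: three uppercase letters followed by digits.
ACCOUNT_ID_SCHEMA = {"type": "string", "pattern": "^[A-Z]{3}[0-9]+$"}


def validate(value, schema):
    """Check only the 'type: string' and 'pattern' keywords of a schema."""
    if schema.get("type") == "string" and not isinstance(value, str):
        return False
    pattern = schema.get("pattern")
    # JSON Schema patterns are unanchored searches, hence re.search.
    return pattern is None or re.search(pattern, value) is not None


for account_id in ["ABC21312321333", "FOO%%1231233"]:
    status = "OK" if validate(account_id, ACCOUNT_ID_SCHEMA) else "ERROR"
    print(account_id, status)
```

Rejecting the second value at ingestion is what keeps the invalid AccountId from reaching downstream tables.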
Data Quality Asserts
● Support for multiple constraints, e.g. NULL check, variance, referential change, custom query
● Auto-triggered when a fact is finished
● Anyone can subscribe to an assert rule
● Jira & email integration
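Two of the constraint types above can be sketched as simple predicates run after a fact table finishes loading. This is an assumption-laden illustration: "variance" is interpreted here as day-over-day change in row count, and the 20% threshold, function names, and sample rows are hypothetical.

```python
def null_check(rows, column):
    """Pass only if no row has a missing or None value in the column."""
    return all(row.get(column) is not None for row in rows)


def variance_check(today_count, yesterday_count, max_change=0.2):
    """Pass only if the row count moved by at most max_change day over day."""
    if yesterday_count == 0:
        return today_count == 0
    change = abs(today_count - yesterday_count) / yesterday_count
    return change <= max_change


# Hypothetical rows from a freshly loaded fact table.
rows = [{"order_id": "O1", "account_id": "ABC1"},
        {"order_id": "O2", "account_id": None}]

results = {
    "order_id_not_null": null_check(rows, "order_id"),
    "account_id_not_null": null_check(rows, "account_id"),
    "row_count_variance": variance_check(110, 100),
}
print(results)
```

A failed assert would then be routed to subscribers, for example through the Jira and email integration the slide mentions.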
Org Queues
Why Org Queues?
Introduce fairness in allocation of Data Platform’s compute resources.
Optimize usage of an already overloaded cluster, ensuring rogue jobs are preempted.
Features of Org Queue
● Guaranteed Minimum Compute
● Burstability & Pre-emption
● Sub queues of different sizes to improve reliability of P0 jobs
● Org Admins to manage the Users & Jobs in the queue
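Guaranteed minimum compute with burstability can be pictured as a two-step allocation: each queue first receives up to its guarantee, and leftover capacity is then lent to queues that still have demand. The sketch below is a simplified illustration, not the cluster scheduler's algorithm. It assumes the guarantees sum to no more than the capacity, does not model pre-emption, and uses hypothetical queue names and numbers.

```python
def allocate(capacity, queues):
    """Allocate capacity given {name: (guaranteed, demand)} per queue."""
    # Step 1: every queue gets its demand, capped at its guaranteed minimum.
    alloc = {name: min(guaranteed, demand)
             for name, (guaranteed, demand) in queues.items()}
    spare = capacity - sum(alloc.values())
    # Step 2 (burst): lend unused capacity to queues that still want more,
    # largest guarantee first.
    for name in sorted(queues, key=lambda n: -queues[n][0]):
        extra = min(spare, queues[name][1] - alloc[name])
        alloc[name] += extra
        spare -= extra
    return alloc


# Hypothetical org queues: (guaranteed containers, current demand).
queues = {"search": (40, 70), "ads": (40, 10), "ml": (20, 50)}
allocation = allocate(100, queues)
print(allocation)
```

Here the "ads" queue is idle, so "search" bursts into the unused share while "ml" still keeps its guaranteed minimum.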
Summary
Challenges @ Scale
● Overloaded cluster @ constant capacity
● Batch processing patterns
● Data quality issues
● Missing guard-rails
Features & Optimizations
● fStream
● Dr. Elephant - Job Analysis
● Tez - Compute Engine
● ORC - Storage Format
Data Governance
● Dependency Lineage
● Schema Tightening
● DQ Asserts
● Org Queues
Q & A
“Without big data, you are blind and deaf and in the middle of Outer Ring Road.”
THANKS

Flipkart Data Platform @ Scale - slash n 2018 reprise
