Data Engineering Challenges
DSE Days - 10 Sept 2015
Structure
1. Data Engineering
2. Data Pipeline
3. Data Engineering Challenges
4. Closing
1. Data Engineering
All those buzzwords...
- Data explosion, big data
- Data scientist
- IoT
- Data-driven company
Who is a Data Engineer?
“The role of data engineer is now used throughout industry to describe the highly specialized software engineers who create and maintain these robust big data pipelines.” - Insight Data Engineering
Basically, we are software engineers.
2. Data Pipeline
Data Pipeline
INGESTION: Take it
DATA MANAGEMENT: Manage them
PROCESSING: Process it
STORAGE: Store it
RETRIEVAL: Use it
Lambda Architecture
INGESTION: Take it
DATA MANAGEMENT: Manage them
BATCH PROCESSING: Process it
STREAM PROCESSING: Process it NOW
STORAGE: Store it
RETRIEVAL: Use it
Big Data Pipeline
3. Data Engineering Challenges
Challenges - Ingestion
Throughput, availability, scalability
Challenges - Ingestion
Sample Problem:
Facebook page views: ~1 trillion per month
≈ 385,802 log writes (inserts) per second
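The per-second figure follows from dividing the monthly volume by the number of seconds in a month. A quick check, assuming a 30-day month (the slide does not say which month length it used):

```python
# Back-of-the-envelope check of the ingestion rate quoted above.
# Assumption: a 30-day month; the slide does not state the month length.
PAGE_VIEWS_PER_MONTH = 1_000_000_000_000  # ~1 trillion, from the slide
SECONDS_PER_MONTH = 30 * 24 * 60 * 60     # 2,592,000 seconds

writes_per_second = PAGE_VIEWS_PER_MONTH // SECONDS_PER_MONTH
print(f"{writes_per_second:,} writes per second")
```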
Sample Solution:
Kafka: 2 million writes/s (on 3 cheap machines)
- Simple append-only log → Throughput, O(1) appends
- Partitioning → Scalability
- Replication → Availability
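The three properties listed above can be illustrated with a toy in-memory model. This is only a sketch of the idea, not Kafka's implementation or API: the names `ToyLog`, `append`, and `read` are hypothetical, and real replication happens across machines rather than across Python lists.

```python
import zlib
from typing import Tuple


class ToyLog:
    """Toy partitioned, replicated, append-only log (illustration only)."""

    def __init__(self, num_partitions: int = 3, replication_factor: int = 2):
        # partitions[p][r] is replica r of partition p; each replica is a list.
        self.partitions = [
            [[] for _ in range(replication_factor)] for _ in range(num_partitions)
        ]

    def _partition_for(self, key: str) -> int:
        # A stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append to the end of one partition; returns (partition, offset)."""
        p = self._partition_for(key)
        for replica in self.partitions[p]:
            replica.append(value)  # appending to a list is amortized O(1)
        return p, len(self.partitions[p][0]) - 1

    def read(self, partition: int, offset: int, replica: int = 0) -> str:
        # Any replica can serve the read, which keeps the data available
        # when one copy is lost.
        return self.partitions[partition][replica][offset]


log = ToyLog()
p, off = log.append("user-42", "page_view:/home")
print(log.read(p, off, replica=1))
```

Adding partitions spreads writes over more lists (or, in a real system, more machines), while the replication factor controls how many copies survive a failure.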
Challenges - Ingestion
Challenge 1 - Wiring to Main App
● Ingestion may require changes to the main application
Challenge 2 - Failure isolation
● A failure while logging should not cause a failure in the application
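One common way to approach failure isolation is to hand events to a bounded in-memory queue and let a background thread ship them, so that a slow or failing logging backend never blocks or crashes the request path. The sketch below assumes that dropping events under pressure is acceptable; `SafeEventLogger` and the `send` callable are hypothetical names, not part of any particular library.

```python
import queue
import threading
from typing import Callable


class SafeEventLogger:
    """Ships events in the background; never raises into the caller."""

    def __init__(self, send: Callable[[str], None], max_pending: int = 1000):
        self._send = send
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # approximate counter, good enough for a sketch
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, event: str) -> bool:
        """Called from the application. Returns False if the event was dropped."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # shed load instead of blocking the application
            return False

    def _run(self) -> None:
        while True:
            event = self._pending.get()
            try:
                self._send(event)
            except Exception:
                # A failing backend must not kill the worker or the application.
                self.dropped += 1
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        """Block until everything queued so far has been attempted."""
        self._pending.join()


def flaky_send(event: str) -> None:
    # Hypothetical backend that is always down.
    raise ConnectionError("logging backend is down")


logger = SafeEventLogger(flaky_send)
accepted = logger.log("page_view:/home")  # returns immediately
logger.flush()
print(accepted, logger.dropped)
```

The application only ever calls `log`, which cannot raise and cannot block, so the worst outcome of a logging outage is lost events rather than failed requests.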
Challenges - Processing
Integrity, Dependency, Performance
Challenges - Processing
Sample Problem:
How many page views came from Indonesia in Aug 2015?
~10 PB of data to scan, if each record is 10 KB and the month holds ~1 trillion page views (10^12 × 10 KB = 10^16 bytes)
Sample Solution:
● Spark/Hadoop for computation
● HDFS for storage, with Avro as the file format
● Oozie for workflow management
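The sample question maps naturally onto the map/reduce style of computation that Spark and Hadoop distribute across a cluster. The following single-machine sketch shows only the shape of the computation; the record fields (`country`, `timestamp`) and the sample data are made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical page-view records; a real job would read these from HDFS.
page_views = [
    {"country": "ID", "timestamp": "2015-08-01T10:00:00"},
    {"country": "ID", "timestamp": "2015-08-17T09:30:00"},
    {"country": "SG", "timestamp": "2015-08-02T12:00:00"},
    {"country": "ID", "timestamp": "2015-09-01T00:00:01"},
]


def map_phase(record):
    """Emit a (key, 1) pair for each record that matches the question."""
    if record["country"] == "ID" and record["timestamp"].startswith("2015-08"):
        yield ("ID/2015-08", 1)


def reduce_phase(totals, pair):
    """Sum the values for each key."""
    key, value = pair
    totals[key] += value
    return totals


pairs = (pair for record in page_views for pair in map_phase(record))
result = reduce(reduce_phase, pairs, Counter())
print(result["ID/2015-08"])
```

On a cluster, the map phase runs in parallel on each block of the input and only the small (key, count) pairs travel over the network to the reducers.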
Challenges - Processing
Challenge 1 - Learning Curve
● A new way of thinking about data processing: MapReduce
● New technologies and new operational concerns
Challenge 2 - Putting it All Together
● Incompatible release versions
● Minimal documentation
Challenges - Storage
Efficiency, Performance
Challenges - Storage
Sample Problems:
1. We want the number of daily page views from Indonesia for the last 7 days.
2. We want to retrieve each user’s latest transaction to personalize search results better.
Sample Solution:
1. You might need a columnar store for OLAP queries.
2. You might need a key-value store, since the data is retrieved per user ID.
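The two access patterns can be contrasted with plain Python structures. This is only a sketch of why the layouts differ, not a model of any specific database; all data and names below are invented.

```python
from datetime import date

# Column-oriented layout: one list per column. An aggregate touches only the
# columns it needs (here: day and country), which suits OLAP-style queries.
columns = {
    "day": [date(2015, 9, 8), date(2015, 9, 8), date(2015, 9, 9), date(2015, 9, 9)],
    "country": ["ID", "SG", "ID", "ID"],
    "url": ["/home", "/home", "/search", "/home"],
}


def daily_views(country: str) -> dict:
    """Count page views per day for one country, reading two columns only."""
    counts = {}
    for day, c in zip(columns["day"], columns["country"]):
        if c == country:
            counts[day] = counts.get(day, 0) + 1
    return counts


# Key-value layout: one entry per user id, fetched with a single lookup.
latest_transaction = {
    "user-1": {"item": "flight CGK-DPS", "at": "2015-09-09T08:15:00"},
    "user-2": {"item": "hotel Bandung", "at": "2015-09-08T21:40:00"},
}

print(daily_views("ID"))
print(latest_transaction["user-1"]["item"])
```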
Challenges - Storage
Challenge 1 - Choosing the right storage
● There are many kinds of databases nowadays. Pick wisely, so that the choice supports your use cases best.
Challenge 2 - Develop the right model
● Each database models data differently, and a relational model might not be appropriate. We need to understand how the database works.
Challenges - Retrieval
Ease of Use, Reusability, Adaptiveness
Challenges - Retrieval
Sample Problem:
● We want to visualize the number of daily page views from Indonesia for the last 7 days
● We also want to support ad hoc queries and reporting
Sample Solution:
● Create a backend service to run queries and an application to visualize the results
Challenges - Retrieval
Challenge 1 - Ease of Use, Reusability
● Retrieval is a user-facing product, so it must be easy to use. Data products have to be reusable and discoverable across data users.
Challenge 2 - Adaptiveness
● Because there are many kinds of databases now, the query service needs to be extensible and adaptive so that data from various sources can be used.
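One way to keep a query service extensible is to put each data source behind a small common interface, so that adding a new database means adding one adapter rather than changing callers. The sketch below is an assumption about how such a service could look; `DataSource`, `QueryService`, and the in-memory adapter are hypothetical.

```python
from typing import Dict, List, Protocol


class DataSource(Protocol):
    """Minimal interface every backend adapter has to provide."""

    def daily_page_views(self, country: str, days: int) -> List[int]: ...


class InMemorySource:
    """Stand-in adapter; a real one would wrap a columnar or key-value store."""

    def __init__(self, data: Dict[str, List[int]]):
        self._data = data

    def daily_page_views(self, country: str, days: int) -> List[int]:
        # Return the most recent `days` daily counts for the country.
        return self._data.get(country, [])[-days:]


class QueryService:
    """Routes a query to whichever registered source the caller names."""

    def __init__(self) -> None:
        self._sources: Dict[str, DataSource] = {}

    def register(self, name: str, source: DataSource) -> None:
        self._sources[name] = source

    def daily_page_views(self, source: str, country: str, days: int = 7) -> List[int]:
        return self._sources[source].daily_page_views(country, days)


service = QueryService()
service.register("warehouse", InMemorySource({"ID": [5, 7, 6, 9, 8, 10, 12, 11]}))
print(service.daily_page_views("warehouse", "ID"))
```

A visualization application would call only `QueryService`, so swapping or adding a backend does not touch the user-facing code.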
Challenges - Data Management
Challenges - Data Management
Challenge 1 - Centralized Metadata
● Data lives in various places, with various schemas (and is sometimes schemaless), and has to be managed from one place.
Challenge 2 - Security, Access Control
● Most of these tools are newly developed, and security is usually the last thing we consider.
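A centralized catalog can be as simple as one registry that records where each dataset lives, what schema it has (if any), and who may read it. The following is a minimal sketch under those assumptions; `DatasetEntry`, `Catalog`, and the example location are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DatasetEntry:
    location: str                            # where the data physically lives
    schema: Optional[Dict[str, str]] = None  # None marks a schemaless dataset
    readers: Set[str] = field(default_factory=set)


class Catalog:
    """Single place to look up datasets and check read access."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, name: str, entry: DatasetEntry) -> None:
        self._entries[name] = entry

    def locate(self, name: str, user: str) -> str:
        """Return the dataset location, but only to users allowed to read it."""
        entry = self._entries[name]
        if user not in entry.readers:
            raise PermissionError(f"{user} may not read {name}")
        return entry.location


catalog = Catalog()
catalog.register(
    "page_views",
    DatasetEntry(
        location="hdfs:///logs/page_views",
        schema={"country": "string", "timestamp": "string"},
        readers={"analyst"},
    ),
)
print(catalog.locate("page_views", "analyst"))
```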
4. Closing
Takeaway Points
● Think critically
○ Be wise and don’t get carried away. Do not use something just because it is cool; make sure you are using what you need.
● Stay curious
○ New technologies arrive every day, and one of them might save your day.
What is it like to be a Data Engineer?
● Exhilarating
○ You are in a critical position: you handle large volumes of data, act as the nerve center of the company, and have to make sure the pipeline is robust.
● Challenging
○ You have to be a DBA, data architect, big data programmer, software engineer, and data analyst at the same time!
● Fun
○ You always need to learn new technologies and new ways to solve problems.
● High demand
○ Data engineering is one of the most in-demand job roles at today’s leading companies.
Q&A
References
● http://insightdataengineering.com/blog/The-Data-Engineering-Ecosystem-An-Interactive-Map.html
● http://insightdataengineering.com/Insight_Data_Engineering_White_Paper.pdf

Data Engineering Challenges - DSE Day at Bandung Institute of Technology

  • 2. Structure 1. Data Engineering 2. Data Pipeline 3. Data Engineering Challenges 4. Closing
  • 4. All those buzzwords... - Data explosion, big data - Data scientist - IoT - Data driven company
  • 5. Who is a Data Engineer? “The role of data engineer is now used throughout industry to describe the highly specialized software engineers who create and maintain these robust big data pipelines.” - Insight Data Engineering Basically, we are software engineers.
  • 7. Data Pipeline INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 8. Lambda Architecture INGESTION Take it DATA MANAGEMENT Manage them BATCH PROCESSING Process it STORAGE Store it RETRIEVAL Use it STREAM PROCESSING Process it NOW
  • 11. Challenges - Ingestion Throughput, availability, scalability INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 12. Challenges - Ingestion Sample Problem: Facebook page views ~ 1 trillion/month, i.e. about 385,802 logs or inserts per second Sample Solution: Kafka, 2 million writes/s (on 3 cheap machines) - Simple (Log) → Throughput, O(1) - Partitioning → Scalability - Replication → Availability
  • 13. Challenges - Ingestion Challenge 1 - Wiring to Main App ● May require some changes in the application Challenge 2 - Failure isolation ● Keep logging failures from causing failures in the application
  • 14. Challenges - Processing Integrity, Dependency, Performance INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 15. Challenges - Processing Sample Problem: How many page views are from Indonesia in Aug 2015? ~100 PB of data at 10 KB/datum Sample Solution: ● Spark/Hadoop for computing ● HDFS for storage, with Avro as the file format ● Oozie for workflow management
  • 16. Challenges - Processing Challenge 1 - Learning Curve ● New way of thinking about processing data: MapReduce ● New technology and operational concerns Challenge 2 - Putting it All Together ● Incompatible release versions ● Minimal documentation
  • 17. Challenges - Storage Efficiency, Performance INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 18. Challenges - Storage Sample Problems: 1. We want to get the number of daily page views from Indonesia for the last 7 days 2. We want to retrieve a user’s latest transaction to better personalize search results Sample Solution: 1. You might need a columnar store for OLAP queries 2. You might need a key-value store, since the data will be retrieved per user ID
  • 19. Challenges - Storage Challenge 1 - Choosing the right storage ● There are many kinds of databases nowadays. Pick wisely to best support your use cases. Challenge 2 - Develop the right model ● Each database has a different way to model data. A relational model might not be appropriate. We need to understand how the database works.
  • 20. Challenges - Retrieval Ease of Use, Reusability, Adaptiveness INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 21. Challenges - Retrieval Sample Problem: ● We want to visualize the number of daily page views from Indonesia for the last 7 days ● and solve other problems like ad hoc queries and reporting Sample Solution: ● Create a backend service to run queries and an application to visualize the query results
  • 22. Challenges - Retrieval Challenge 1 - Ease of Use, Reusability ● Ease of use is very important, since retrieval is a user-facing product. Data products have to be reusable and discoverable across data users. Challenge 2 - Adaptiveness ● As there are many kinds of databases now, the query service needs to be extensible and adaptive to enable use of data from various sources.
  • 23. Challenges - Data Management INGESTION Take it DATA MANAGEMENT Manage them PROCESSING Process it STORAGE Store it RETRIEVAL Use it
  • 24. Challenges - Data Management Challenge 1 - Centralized Metadata ● Manage data in various places, with various schemas (sometimes schemaless). Challenge 2 - Security, Access Control ● Most of these tools are newly developed, and security is usually the last thing we consider.
  • 26. Takeaway Points ● Think critically ○ Be wise, don’t get carried away, do not use something just because it is cool, make sure you are using what you need. ● Stay curious ○ New technologies arrive every day; one of them might save your day
  • 27. What is it like to be a Data Engineer? ● Exhilarating ○ You are in a critical position, handle big volumes of data, act as the nerve of the company, and have to make sure the pipeline is robust. ● Challenging ○ You have to be a DBA, data architect, big data programmer, software engineer, and data analyst at the same time! ● Fun ○ You always need to learn new technologies and new ways to solve things ● High Demand ○ Data engineer is one of the most in-demand job roles at today’s leading companies.
  • 28. Q&A
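
The ingestion rate quoted on slide 12 can be checked with a few lines of arithmetic. The sketch below assumes a 30-day month and decimal units (1 KB = 1,000 bytes); the function names are illustrative and not from the talk. Under those assumptions, one month of 1 trillion page views at 10 KB each comes to about 10 PB, so the ~100 PB figure on slide 15 would correspond to roughly ten months of data at that rate, or to different assumptions than the ones used here.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def events_per_second(events_per_month: int) -> float:
    """Average event rate implied by a monthly total."""
    return events_per_month / SECONDS_PER_MONTH


def monthly_volume_pb(events_per_month: int, bytes_per_event: int) -> float:
    """Raw monthly data volume in petabytes (assumption: 1 PB = 10**15 bytes)."""
    return events_per_month * bytes_per_event / 1e15


rate = events_per_second(10**12)            # 1 trillion page views per month
volume = monthly_volume_pb(10**12, 10_000)  # assumption: 10 KB = 10,000 bytes

print(f"{rate:,.0f} events/s")     # 385,802 events/s
print(f"{volume:.0f} PB/month")    # 10 PB/month
```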
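
Failure isolation (slide 13, Challenge 2) can be sketched as a bounded in-memory buffer between the application and the log sink. This is a minimal illustration using only the Python standard library, not the approach used in the talk: `IsolatedEventLogger`, `flaky_send`, and the buffer size are hypothetical, and a real deployment would hand events to a broker client such as a Kafka producer instead of a local function.

```python
import queue
import threading


class IsolatedEventLogger:
    """Hypothetical helper: the caller never blocks and never sees sink errors."""

    def __init__(self, send, max_buffered=1000):
        self._send = send                      # callable that ships one event
        self._buffer = queue.Queue(maxsize=max_buffered)
        self.dropped = 0                       # events rejected because the buffer was full
        self.failed = 0                        # events the sink could not accept
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, event):
        """Enqueue without blocking; drop the event if the buffer is full."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Runs in a background thread so slow or failing sinks never stall the app.
        while True:
            event = self._buffer.get()
            try:
                self._send(event)
            except Exception:
                self.failed += 1               # swallow the error, keep draining
            finally:
                self._buffer.task_done()

    def flush(self):
        """Wait until every buffered event has been attempted."""
        self._buffer.join()


delivered = []


def flaky_send(event):
    # Stand-in for a real sink; fails for one page to simulate an outage.
    if event["page"] == "/broken":
        raise ConnectionError("sink unavailable")
    delivered.append(event)


logger = IsolatedEventLogger(flaky_send)
for page in ["/home", "/broken", "/search"]:
    logger.log({"page": page})
logger.flush()
print(len(delivered), "delivered,", logger.failed, "failed,", logger.dropped, "dropped")
```

The trade-off in this design is that events can be lost when the buffer fills or the sink fails; the application stays up at the cost of some log completeness.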
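
The two storage problems on slide 18 lead to different data layouts. The toy sketch below uses plain Python structures to contrast them; it is an illustration of the access patterns only, not a model of any particular database, and all names and sample values are made up.

```python
from collections import defaultdict
from datetime import date

# Column-oriented layout: one list per field, so an aggregate query
# scans only the columns it needs (here: day and country).
page_views = {
    "day": [date(2015, 8, 1), date(2015, 8, 1), date(2015, 8, 2), date(2015, 8, 2)],
    "country": ["ID", "SG", "ID", "ID"],
}


def daily_views(columns, country):
    """Count page views per day for one country (an OLAP-style aggregate)."""
    counts = defaultdict(int)
    for day, view_country in zip(columns["day"], columns["country"]):
        if view_country == country:
            counts[day] += 1
    return dict(counts)


# Key-value layout: one entry per user ID, so a lookup by user is a single read.
latest_transaction = {}


def record_transaction(user_id, transaction):
    """Overwrite the stored value so the key always holds the latest transaction."""
    latest_transaction[user_id] = transaction


record_transaction("user-1", {"item": "flight", "amount": 120})
record_transaction("user-1", {"item": "hotel", "amount": 80})

id_views = daily_views(page_views, "ID")
print(id_views)
print(latest_transaction["user-1"])
```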