5. Who is a Data Engineer?
“The role of data engineer is now used throughout industry to describe the highly specialized software engineers who create and maintain these robust big data pipelines.”
- Insight Data Engineering
Basically we are software engineers.
11. Challenges - Ingestion
Throughput, availability, scalability
[Pipeline diagram] Ingestion: take it. Data Management: manage it. Processing: process it. Storage: store it. Retrieval: use it.
12. Challenges - Ingestion
Sample Problem:
Facebook page views: ~1 trillion/month
≈ 385,802 log writes or inserts per second
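The per-second figure above follows from dividing the monthly volume by the number of seconds in a month. A quick check, assuming a 30-day month (the slide does not state the month length):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month

page_views_per_month = 1_000_000_000_000  # ~1 trillion, from the slide

writes_per_second = page_views_per_month / SECONDS_PER_MONTH
print(f"{writes_per_second:,.0f} writes per second")  # 385,802 writes per second
```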
Sample Solution:
Kafka: 2 million writes/s (on 3 cheap machines)
- Simple append-only log → Throughput, O(1) writes
- Partitioning → Scalability
- Replication → Availability
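The three ideas above (append-only log, partitioning, replication) can be sketched in a few lines. This is a toy in-memory model, not Kafka's API; `Partition`, `ToyLog`, and all parameter names are hypothetical and exist only for illustration.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Partition:
    """An append-only log; appending to the end of a list is amortized O(1)."""
    records: List[bytes] = field(default_factory=list)

    def append(self, record: bytes) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record


class ToyLog:
    """Hypothetical toy topic: N partitions, each copied to R in-memory replicas."""

    def __init__(self, num_partitions: int = 3, replication_factor: int = 2):
        # replicas[p][r] is replica r of partition p
        self.replicas = [
            [Partition() for _ in range(replication_factor)]
            for _ in range(num_partitions)
        ]

    def partition_for(self, key: str) -> int:
        # Stable hash, so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.replicas)

    def write(self, key: str, record: bytes) -> Tuple[int, int]:
        p = self.partition_for(key)
        # Every replica of the partition receives the same record.
        offsets = [replica.append(record) for replica in self.replicas[p]]
        return p, offsets[0]

    def read(self, partition: int, offset: int, replica: int = 0) -> bytes:
        return self.replicas[partition][replica].records[offset]


log = ToyLog()
partition, offset = log.write("user-42", b"page_view:/home")
# Any replica can serve the record, which keeps it available if one copy is lost.
print(log.read(partition, offset, replica=1))
```

Adding partitions spreads writes across more machines (scalability), while each extra replica is another copy that can take over after a failure (availability).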
13. Challenges - Ingestion
Challenge 1 - Wiring to Main App
● May require changes to the main application
Challenge 2 - Failure isolation
● Keep logging failures from causing failures in the application
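One common way to isolate failures is to put a bounded, non-blocking buffer between the application and the log transport. The sketch below assumes that dropping events is acceptable when the buffer is full; `IsolatedEventLogger` and the `send` callable are hypothetical names, not part of any library.

```python
import queue


class IsolatedEventLogger:
    """Hypothetical wrapper that never lets logging break the caller."""

    def __init__(self, max_pending: int = 1000):
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def log(self, event: dict) -> bool:
        """Queue the event without blocking; drop it if the buffer is full."""
        try:
            self.pending.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # count the loss instead of raising
            return False

    def flush(self, send) -> int:
        """Drain the buffer through `send`; a failing send is counted, not raised."""
        sent = 0
        while True:
            try:
                event = self.pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                send(event)
                sent += 1
            except Exception:
                self.dropped += 1


logger = IsolatedEventLogger(max_pending=2)
results = [logger.log({"page": f"/p{i}"}) for i in range(3)]
print(results, logger.dropped)  # [True, True, False] 1
```

In a real service, `flush` would typically run on a background thread so that a slow or unavailable log backend never blocks request handling.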
14. Challenges - Processing
Integrity, Dependency, Performance
15. Challenges - Processing
Sample Problem:
How many page views are from Indonesia in Aug 2015?
~10 PB of data for 1 trillion page views at 10 KB/datum
Sample Solution:
● Spark/Hadoop for computing
● HDFS for storage, with Avro as the file format
● Oozie for workflow management
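The sample problem maps naturally onto the MapReduce model: emit a count of 1 per page view, group by key, then sum. The sketch below runs the three phases in plain Python on a handful of made-up records; it is not Spark or Hadoop code, and the record fields (`country`, `timestamp`) are assumptions for illustration.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records; real ones would be read from files in HDFS.
page_views = [
    {"country": "ID", "timestamp": "2015-08-03T10:15:00"},
    {"country": "ID", "timestamp": "2015-08-17T22:01:00"},
    {"country": "SG", "timestamp": "2015-08-17T22:05:00"},
    {"country": "ID", "timestamp": "2015-09-01T00:00:00"},
]


def map_phase(record):
    """Emit (key, 1) for each page view; the key is (country, year-month)."""
    yield (record["country"], record["timestamp"][:7]), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Sum the counts for one key."""
    return reduce(lambda a, b: a + b, values, 0)


mapped = (pair for record in page_views for pair in map_phase(record))
counts = {key: reduce_phase(values) for key, values in shuffle(mapped).items()}
print(counts[("ID", "2015-08")])  # 2
```

Because each map call looks at one record and each reduce call looks at one key, both phases can be spread across many machines, which is what makes the model suitable for data at this scale.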
16. Challenges - Processing
Challenge 1 - Learning Curve
● A new way of thinking about data processing: MapReduce
● New technologies and new operational concerns
Challenge 2 - Putting it All Together
● Incompatible release versions
● Minimal documentation
17. Challenges - Storage
Efficiency, Performance
18. Challenges - Storage
Sample Problems:
1. We want to get the number of daily page views from
Indonesia for the last 7 days
2. We want to retrieve a user’s latest transaction to better
personalize search results
Sample Solution:
1. You might need a columnar store for OLAP queries
2. You might need a key-value store, since the data is retrieved per user ID
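The difference between the two access patterns can be illustrated with plain Python structures. This is only a sketch of the data layouts, not of any particular database; the data, field names, and helper functions are hypothetical.

```python
from collections import Counter
from datetime import date

# Column-oriented layout: one list per column, so an aggregate scans only
# the columns it needs. Hypothetical data.
page_views = {
    "day": [date(2015, 8, 1), date(2015, 8, 1), date(2015, 8, 2), date(2015, 8, 2)],
    "country": ["ID", "SG", "ID", "ID"],
}


def daily_views(columns, country):
    """Count page views per day for one country (an OLAP-style aggregate)."""
    counts = Counter()
    for day, c in zip(columns["day"], columns["country"]):
        if c == country:
            counts[day] += 1
    return dict(counts)


# Key-value layout: one lookup by user ID returns the latest transaction.
latest_transaction = {}


def record_transaction(user_id, transaction):
    latest_transaction[user_id] = transaction  # overwrite keeps only the latest


record_transaction("user-42", {"item": "flight", "amount": 120})
record_transaction("user-42", {"item": "hotel", "amount": 80})

print(daily_views(page_views, "ID"))
print(latest_transaction["user-42"])
```

The aggregate touches every row but only two columns, while the per-user lookup touches exactly one key; that contrast is why the two problems tend to favor different stores.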
19. Challenges - Storage
Challenge 1 - Choosing the right storage
● There are many kinds of databases nowadays. Pick the one
that best supports your use cases.
Challenge 2 - Develop the right model
● Each database models data in a different way.
A relational model might not be appropriate. We need to
understand how the database works.
20. Challenges - Retrieval
Ease of Use, Reusability, Adaptiveness
21. Challenges - Retrieval
Sample Problem:
● We want to visualize the number of daily page views from
Indonesia for the last 7 days
● and handle other needs such as ad hoc queries and reporting
Sample Solution:
● Create a backend service to run queries and an application to
visualize the query results
22. Challenges - Retrieval
Challenge 1 - Ease of Use, Reusability
● Ease of use is very important, since retrieval is a
user-facing product. Data products have to be
reusable and discoverable across data users.
Challenge 2 - Adaptiveness
● As there are many kinds of databases now, the query
service needs to be extensible and adaptive so that
data from various sources can be used.
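One way to keep a query service extensible is to hide each data source behind a small adapter with a common signature, so that supporting a new database means registering one more adapter. The sketch below is a minimal illustration under that assumption; `QueryService`, `Adapter`, and `in_memory_page_views` are hypothetical names.

```python
from typing import Callable, Dict, List

# A data source adapter is any callable that takes query parameters and
# returns rows. All names here are hypothetical.
Adapter = Callable[[dict], List[dict]]


class QueryService:
    def __init__(self):
        self._adapters: Dict[str, Adapter] = {}

    def register(self, source: str, adapter: Adapter) -> None:
        """Plug in a new data source without changing the service itself."""
        self._adapters[source] = adapter

    def sources(self) -> List[str]:
        """List registered sources, which helps make data discoverable."""
        return sorted(self._adapters)

    def query(self, source: str, params: dict) -> List[dict]:
        if source not in self._adapters:
            raise KeyError(f"unknown data source: {source}")
        return self._adapters[source](params)


def in_memory_page_views(params: dict) -> List[dict]:
    """Stand-in adapter; a real one would query a columnar or key-value store."""
    rows = [
        {"day": "2015-08-01", "country": "ID", "views": 10},
        {"day": "2015-08-01", "country": "SG", "views": 4},
        {"day": "2015-08-02", "country": "ID", "views": 12},
    ]
    return [row for row in rows if row["country"] == params["country"]]


service = QueryService()
service.register("page_views", in_memory_page_views)
rows = service.query("page_views", {"country": "ID"})
print([row["views"] for row in rows])  # [10, 12]
```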
23. Challenges - Data Management
24. Challenges - Data Management
Challenge 1 - Centralized Metadata
● Manage data in various places, with various schemas
(sometimes with no schema at all).
Challenge 2 - Security, Access Control
● Most of these tools are newly developed, and security
is usually the last thing we consider.
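Both challenges can be pictured as one catalog that records where each dataset lives, what its schema is (if any), and who may read it. The following is a minimal sketch under that assumption; the class names, dataset names, locations, and roles are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DatasetEntry:
    location: str                     # where the data lives
    schema: Optional[Dict[str, str]]  # None marks a schemaless dataset
    readers: Set[str] = field(default_factory=set)


class MetadataCatalog:
    """Hypothetical single registry for datasets stored in different systems."""

    def __init__(self):
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, name: str, entry: DatasetEntry) -> None:
        self._entries[name] = entry

    def describe(self, name: str, user: str) -> DatasetEntry:
        """Return the metadata only if the user is allowed to read the dataset."""
        entry = self._entries[name]
        if user not in entry.readers:
            raise PermissionError(f"{user} may not read {name}")
        return entry


catalog = MetadataCatalog()
catalog.register(
    "page_views",
    DatasetEntry(
        location="hdfs:///logs/page_views",  # hypothetical path
        schema={"country": "string", "timestamp": "string"},
        readers={"analyst"},
    ),
)
catalog.register(
    "raw_events",
    DatasetEntry(location="kv-store/raw_events", schema=None, readers={"engineer"}),
)

print(catalog.describe("page_views", "analyst").location)
```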
26. Takeaway Points
● Think critically
○ Be wise and don’t get carried away: do not use
something just because it is cool; make sure you are
using what you need.
● Stay curious
○ New technologies arrive every day; one of them
might save your day
27. What is it like to be a Data Engineer?
● Exhilarating
○ You are in a critical position: you handle big volumes of data, act as the nerve
center of the company, and have to make sure the pipeline is robust.
● Challenging
○ You have to be a DBA, data architect, big data programmer, software
engineer, and data analyst at the same time!
● Fun
○ You always need to learn new technologies and new ways to solve problems
● High Demand
○ Data engineering is one of the most in-demand job roles at today’s
leading companies.