● Data engineers design and build pipelines that transform and transport data so that, by the time it reaches data scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect it into a single warehouse that represents the data uniformly as a single source of truth (a minimal pipeline sketch follows this list).
● Designing, building and scaling systems that organize data for analytics.
● Data engineers prepare the big data infrastructure so that the data can be analyzed by data scientists.
● Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats.
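A minimal sketch of what such a pipeline can look like in Python, showing the extract → transform → load shape. The file names, column names, and the SQLite file standing in for a warehouse are illustrative assumptions, not part of the original notes.

import csv
import json
import sqlite3

# Extract: pull raw records from two disparate sources (illustrative file names).
def extract_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_json(path):
    with open(path) as f:
        return json.load(f)

# Transform: normalize both sources into one uniform record shape.
def transform(records, source):
    for r in records:
        yield {
            "user_id": str(r.get("user_id") or r.get("id")),
            "amount": float(r.get("amount", 0)),
            "source": source,
        }

# Load: write the uniform records into a single warehouse table
# (SQLite here as a stand-in for a real warehouse).
def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (user_id TEXT, amount REAL, source TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:user_id, :amount, :source)", list(rows)
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract_csv("orders.csv"), "web"), conn)
    load(transform(extract_json("orders.json"), "mobile"), conn)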
ROLES
Data Engineer:
● Data engineers work in a variety of settings to build systems that collect, manage, and convert raw
data into usable information for data scientists and business analysts to interpret.
Data Scientist:
● They use linear algebra and multivariable calculus to create new insights from existing data.
Business Analyst:
● Analysis and exploration of historical data → identify trends, patterns & understand the information →
drive business change
V’s of BIG DATA
Volume
◾ How much data you have
Velocity
◾ How fast data is getting to you
Variety
◾ How different your data is
Veracity
◾ How reliable your data is
TYPES
Unstructured/Raw data
● Unprocessed data in the format used at the source: text, CSV, images, video, etc.
● High latency
● No schema applied
● Stored in object stores such as Google Cloud Storage or AWS S3
● Tools like Snowflake and MongoDB provide their own ways to query unstructured data
Structured/Processed data
● Raw data with a schema applied (see the sketch after this list)
● Stored in event tables/destinations in pipelines
● Analytics query language: ideally SQL-like
● Low latency data ingestion
● Read focus over large portion of data
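A small sketch of the difference: the raw form is just lines of text with no schema, while the processed form applies an explicit schema with typed fields so it can be queried. The field names and dataclass below are illustrative assumptions.

from dataclasses import dataclass
from datetime import datetime

# Raw/unstructured: lines of text, no schema applied.
raw_lines = [
    "2024-03-01T12:00:00,alice,19.99",
    "2024-03-01T12:05:00,bob,5.00",
]

# Structured/processed: the same data with an explicit schema (typed fields).
@dataclass
class Event:
    ts: datetime
    user: str
    amount: float

def apply_schema(line: str) -> Event:
    ts, user, amount = line.split(",")
    return Event(datetime.fromisoformat(ts), user, float(amount))

events = [apply_schema(line) for line in raw_lines]
# Once typed, SQL-like questions ("total amount") become simple operations.
total = sum(e.amount for e in events)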
Batch vs Stream
Batch Processing
◾ Data scope: processing over all or most of the data set
◾ Data size: large batches of data
◾ Latency: minutes to hours
Stream Processing
◾ Data scope: processing over a rolling window or the most recent data record
◾ Data size: individual records or micro-batches of a few records
◾ Latency: on the order of seconds or milliseconds
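A hedged illustration of the two models in Python: the batch function scans the whole dataset at once, while the stream function keeps a small rolling window and emits a result per incoming record. The function names, window size, and sample data are assumptions for the example.

from collections import deque

# Batch: one computation over all (or most) of the data set, run periodically.
def batch_average(all_values):
    return sum(all_values) / len(all_values)

# Stream: process each record as it arrives, over a rolling window of recent values.
def stream_averages(values, window_size=3):
    window = deque(maxlen=window_size)
    for v in values:  # in practice this would be an unbounded source (e.g. a queue)
        window.append(v)
        yield sum(window) / len(window)

data = [10, 20, 30, 40, 50]
print(batch_average(data))          # one answer for the whole batch
print(list(stream_averages(data)))  # one answer per incoming record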
MAP REDUCE
● MapReduce is a processing technique and programming model for distributed computing.
● The algorithm contains two important tasks: Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
● The Reduce task takes the output of a Map as input and combines those data tuples into a smaller set of tuples. As the order of the name MapReduce implies, the Reduce task is always performed after the Map job.
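A minimal single-process sketch of the MapReduce pattern, using the classic word-count example. Real frameworks such as Hadoop distribute the map, shuffle, and reduce phases across machines; here everything runs in memory purely to show the shape of the key/value flow, and the sample documents are made up.

from collections import defaultdict

# Map: break each input record into (key, value) tuples.
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

# Shuffle: group all values by key (handled by the framework in a real cluster).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce: combine each key's values into a smaller set of tuples.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

docs = ["data engineering moves data", "data pipelines move data"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'data': 4, 'engineering': 1, 'moves': 1, 'pipelines': 1, 'move': 1}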
Connect:
● Ketan (LinkedIn)
○ Computer Science ‘24 Grad @ Michigan Tech
○ Ex-Data Engineer @ Abzooba: one of the top 50 best data science firms in India to work for, focused on developing the highest-quality analytics products and services using expertise in Big Data, Cloud, AI, and ML.
○ A constant learner