Self-Study Approach for Data Science – 2022
A Project-Based Approach to Learning Data Science
As someone who holds a Master’s degree in Computer Science, I am truly passionate about
this field and decided to experiment with building my own curriculum to learn data science in
my spare time. I would like to share my experience, and I hope it brings some insights if you
want to take the same journey.
Project-based learning is a good starting point for people who already have some technical
background but want to dive deeper into the building blocks of data science. A typical
data science/machine learning project follows a lifecycle: from defining the objectives,
through data preprocessing, exploratory data analysis, feature engineering and model
implementation, to model evaluation. Each phase requires different skillsets, including
statistics, programming, SQL, data visualization, mathematics and business knowledge.
I highly recommend Kaggle as the platform to experiment with your data science projects and
Medium as the platform to gain data science knowledge from professionals. With plenty of
interesting datasets and a cloud-based programming environment, Kaggle gives you free access
to data sources, code and notebooks. Meanwhile, several popular data science publications
on Medium (e.g. Towards Data Science, Analytics Vidhya) allow you to learn from others’
work and share your own projects, all in the same place.
Why Project Based Approach?
1. It is practical and gives us a sense of achievement that we are doing something
real!
2. It highlights the rationale for learning each piece of content. A goal-oriented
approach provides a bird’s-eye view of how each little piece ties together to form
the big picture.
3. It allows us to actively retrieve the information as we are learning. Active Recall is
proven to significantly enhance information retention, compared to conventional
learning mechanisms that only require passively consuming knowledge.
Let’s break down the data science lifecycle into the following 5 steps and we will see how
each step connects to various knowledge domains.
1. Business Problem & Data Science Solution
The first step of a data science project is to identify the business problem and define the
objectives of an experiment design or model deployment.
Skillset I — Business Knowledge
This stage doesn’t require technical skills yet, but it does demand business understanding to
identify the problem and define the objectives. The first step is to understand the
domain-specific terminology that appears in the dataset, and then to translate a business
requirement into a technical solution. It takes years of experience in the field to build up
this knowledge. Here I can only recommend some websites that increase your exposure to
various business domains, for example Harvard Business Review, Hubspot, Investopedia and
TechCrunch.
Skillset II — Statistics (Experimental Design)
After defining the problem, the next step is to frame it as a data science solution. This
starts with knowledge of Experimental Design, such as:
Hypothesis Testing
Sampling
Bias / Variance Trade-off
Different Types of Classification Errors
Overfitting / Underfitting
There are various types of hypothesis tests to explore: the t-test, ANOVA, the Chi-Square
test, etc.
Machine learning can fundamentally be viewed as a hypothesis testing process, where we
search the hypothesis space for the model that best fits our observed data and allows us to
make predictions on unobserved data.
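As a small sketch of hypothesis testing in practice, the snippet below runs a two-sample t-test with SciPy on synthetic data (the group means and sample sizes are made up for illustration); the null hypothesis is that both groups share the same mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic samples, e.g. a metric measured under two experiment variants.
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=11.0, scale=2.0, size=200)

# Two-sample t-test: H0 says the two groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Reject the null hypothesis at the 5% significance level.
reject_h0 = p_value < 0.05
```

With a true mean difference of 1.0 and 200 samples per group, the test should reject the null hypothesis comfortably.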
Useful Resources:
Khan Academy: Study Design
A Gentle Introduction to Statistical Hypothesis Testing
2. Data Extraction & Data Preprocessing
The second step is to collect data from various sources and transform the raw data into a
digestible format. This process is known as Data Ingestion.
Skillset III — SQL
SQL is a powerful language for communicating with and extracting data from structured
databases. Learning SQL also helps you build a mental model for generating insights through
data querying techniques, such as grouping, filtering, sorting and joining. You will find
similar logic in other tools and languages, such as Pandas and SAS.
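To make the querying techniques above concrete, here is a minimal sketch using Python’s built-in sqlite3 module on an in-memory database; the tables and values are invented purely for illustration.

```python
import sqlite3

# In-memory database with two toy tables to demonstrate JOIN + GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 30.0), (3, 2, 20.0), (4, 3, 10.0);
""")

# Join orders to customers, then aggregate revenue per region, sorted descending.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY revenue DESC;
""").fetchall()
# rows -> [('East', 90.0), ('West', 20.0)]
```

The same join-group-aggregate pattern maps directly onto Pandas `merge` and `groupby`.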
Useful Resources:
“Get Started with SQL Joins”
Datacamp: SQL fundamentals
Dataquest: SQL Basics
Skillset IV — Python (Pandas)
It is essential to get comfortable with a programming language while learning data science.
The simple syntax makes Python a relatively easy language to start with. Here is a great video
tutorial if you are new to Python: Python for Beginners — Learn Python in 1 Hour.
After gaining a basic understanding, it’s worth spending some time learning the Pandas
library. Pandas is almost unavoidable if you use Python for data extraction. It transforms a
database into a dataframe, a table-like format that we are most familiar with. Pandas also
plays an important role in data preprocessing, when we need to examine and handle the
following data quality issues:
Address missing data
Transform inconsistent data types
Remove duplicated values
Treat outliers
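As a sketch of how Pandas handles these four quality issues, the snippet below cleans a tiny invented dataset; the column names and the specific strategies (median imputation, clipping at the 95th percentile) are illustrative choices, not the only valid ones.

```python
import pandas as pd

# Toy dataset exhibiting the four quality issues listed above.
df = pd.DataFrame({
    "age": ["25", "31", None, "40", "40", "250"],   # string type, missing value, outlier
    "city": ["NY", "LA", "NY", "SF", "SF", "NY"],   # one duplicated row ("40", "SF")
})

df = df.drop_duplicates()                                    # remove duplicated values
df["age"] = pd.to_numeric(df["age"])                         # fix inconsistent data type
df["age"] = df["age"].fillna(df["age"].median())             # address missing data
df["age"] = df["age"].clip(upper=df["age"].quantile(0.95))   # treat outliers by clipping
```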
Useful Resources:
Python Pandas Tutorial: A Complete Introduction for Beginners
W3schools: Pandas Tutorial
3. Data Exploration & Feature Engineering
The third step is Data Exploration, also known as EDA (Exploratory Data Analysis) which
reveals hidden characteristics and patterns in a dataset. It is usually achieved by data
visualization, and followed by feature engineering to transform data based on the outcome of
data exploration.
Skillset V — Statistics (Descriptive Statistics)
Data exploration uses descriptive statistics to summarize the characteristics of the dataset:
Mean, Median, Mode (Measures of Central Tendency)
Range, Variance, Standard Deviation (Measures of Dispersion)
Correlation, Covariance
Skewness, Distribution
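Pandas computes all of these measures in one or two lines each. As a quick sketch on a small made-up sample:

```python
import pandas as pd

# A small numeric sample to summarize.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = s.mean()      # measure of central tendency
median = s.median()  # robust central tendency
mode = s.mode()[0]   # most frequent value
std = s.std(ddof=0)  # population standard deviation (measure of dispersion)
skew = s.skew()      # positive value indicates a right-skewed sample
```

`s.describe()` reports several of these at once, and `df.corr()` / `df.cov()` cover correlation and covariance for a whole dataframe.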
After building a solid understanding of the dataset’s characteristics, we can apply the most
appropriate feature engineering techniques accordingly. For instance, use a log
transformation for right-skewed data and clipping methods to deal with outliers.
Here are some of the most common and popular feature engineering techniques:
Categorical Encoding
Scaling
Log Transformation
Imputation
Feature Selection
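Three of the techniques above can be sketched in a few lines of Pandas and NumPy; the column names and values below are invented for illustration, and min-max scaling stands in for the broader family of scaling methods.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20_000.0, 35_000.0, 50_000.0, 1_000_000.0],  # right-skewed feature
    "segment": ["a", "b", "a", "c"],                        # categorical feature
})

# Log transformation compresses the long right tail.
df["log_income"] = np.log1p(df["income"])

# Min-max scaling rescales a feature into the [0, 1] range.
col = df["log_income"]
df["scaled_income"] = (col - col.min()) / (col.max() - col.min())

# Categorical (one-hot) encoding turns categories into indicator columns.
df = pd.get_dummies(df, columns=["segment"])
```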
Useful Resources:
3 Common Techniques for Data Transformation
Fundamental Techniques of Feature Engineering for Machine Learning
Feature Selection and EDA in Machine Learning
Skillset VI — Data Visualization
Combining statistics and data visualization allows us to understand the data through
appropriate visual representations. Whether you prefer visualization packages such
as seaborn or matplotlib in Python and ggplot2 in R, or visualization tools like Tableau and
Power BI, it’s essential to distinguish the use cases of common chart types:
Bar Chart
Histogram
Box Plot
Heatmap
Scatter Plot
Line Chart
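As a minimal matplotlib sketch of two of these chart types, the snippet below draws a histogram (distribution of one numeric variable) and a box plot (median, quartiles and outliers) side by side on synthetic data; the non-interactive Agg backend is used so it runs headless.

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)  # synthetic numeric variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: shape of the distribution of a single variable.
ax1.hist(values, bins=20)
ax1.set_title("Histogram")

# Box plot: median, quartiles and outliers at a glance.
ax2.boxplot(values)
ax2.set_title("Box Plot")

fig.savefig("eda_charts.png")
```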
4. Model Implementation
After all of the preparation so far, it’s finally the time to dive deeper into machine learning
algorithms.
Skillset VII — Machine Learning
scikit-learn is a powerful Python library that allows beginners to get started with machine
learning easily. It offers plenty of built-in packages, and we can implement a model in just
a few lines of code. Although it does the hard work for us, it is still crucial to
understand how the algorithms operate behind the scenes and to be able to distinguish the
best use case for each. Generally, machine learning algorithms are categorized into
supervised learning and unsupervised learning. Below are some of the most popular algorithms:
Supervised Learning:
Linear Regression
Logistic Regression
Neural Network
Decision Tree
Support Vector Machine
Unsupervised Learning:
Clustering
Dimensionality Reduction (e.g. PCA)
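As a sketch of how little code a supervised model takes in scikit-learn, the snippet below fits a logistic regression classifier on the library’s built-in iris dataset and scores it on a held-out test split; the split ratio and hyperparameters are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in toy dataset: 150 iris flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a supervised model and evaluate it on unseen data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Swapping in `DecisionTreeClassifier` or `SVC` requires changing only the model line, which is exactly why it pays to understand when each algorithm is the right choice.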
Useful Resources:
scikit-learn website
Coursera: Machine Learning with Python
Skillset VIII — Mathematics
Many beginners, myself included, may wonder why we need to learn math for data science.
As a beginner, math knowledge mainly helps with understanding the underlying theory behind
the algorithms. Further on, when we no longer rely on built-in libraries for machine
learning models, it allows us to develop and optimize customized algorithms. Additionally,
hyperparameter tuning requires advanced math knowledge for searching for the model that
minimizes the cost function.
This is when more complicated math topics come into place:
Calculus
Linear Algebra
Optimization problem
Gradient Descent
Searching Algorithms
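Gradient descent, the workhorse behind much of this list, fits in a few lines of NumPy. The sketch below minimizes a one-parameter least-squares cost on made-up data; the learning rate and iteration count are illustrative.

```python
import numpy as np

# Find the slope w that minimizes the cost J(w) = mean((w * x - y)^2).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x  # the true slope is 2

w = 0.0    # initial guess
lr = 0.01  # learning rate
for _ in range(500):
    grad = np.mean(2 * (w * x - y) * x)  # derivative dJ/dw
    w -= lr * grad                       # step downhill along the gradient

# w converges toward the true slope of 2.0
```

Each step moves `w` against the gradient of the cost, which is exactly the calculus-plus-optimization machinery the resources below build up.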
Useful Resources:
3Blue1Brown: Essence of Linear Algebra
3Blue1Brown: Essence of Calculus
3Blue1Brown: Gradient Descent
5. Model Evaluation
Skillset IX — Statistics (Inferential Statistics)
Inferential statistics is particularly useful for making model predictions and evaluating
model performance. As opposed to descriptive statistics, inferential statistics focuses on
generalizing patterns observed in the sample data to a wider population. It provides
evidence of which features have high importance in making inferences, and it determines
model performance based on evaluation metrics.
For example, for classification problems, where the outputs are discrete categories, some
common metrics are:
Confusion matrix
Type 1 error / Type 2 error
Accuracy
ROC / AUC
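The confusion matrix cells and accuracy can be computed by hand in a few lines; the sketch below uses invented binary labels and predictions to show how Type 1 and Type 2 errors map onto the matrix.

```python
import numpy as np

# Toy binary labels and predictions.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives (Type 1 error)
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives (Type 2 error)

accuracy = (tp + tn) / len(y_true)
```

In practice scikit-learn's `confusion_matrix` and `accuracy_score` compute the same quantities directly.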
Whereas for regression problems, where the outputs are continuous numbers, some common
metrics are:
R Squared
Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared
Error (MSE)
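All four regression metrics follow directly from their definitions; the sketch below computes them with NumPy on a few made-up predictions.

```python
import numpy as np

# Toy regression targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error
mae = np.mean(np.abs(errors))   # Mean Absolute Error

# R Squared: fraction of the target's variance explained by the model.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```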
Useful Resources:
Khan Academy: Statistics and Probability
Metrics to Evaluate your Machine Learning Algorithm