Self-Study Approach for Data Science – 2022
A Project-Based Approach to Learning Data Science
As someone who holds a Master’s degree in Computer Science, I am passionate about this field,
so I decided to experiment with building my own curriculum to learn data science in my spare
time. I would like to share my experience, and I hope it offers some insight if you are on the
same journey.
Project-based learning is a good starting point for people who already have some technical
background but want to dive deeper into the building blocks of data science. A typical
data science or machine learning project spans a full lifecycle: defining the objectives,
data preprocessing, exploratory data analysis, feature engineering, model implementation and
model evaluation. Each phase requires different skill sets, including statistics, programming,
SQL, data visualization, mathematics and business knowledge.
I highly recommend Kaggle as the platform for experimenting with your data science projects and
Medium as the platform for learning from professionals. Kaggle offers plenty of interesting
datasets and a cloud-based programming environment, so you can get data sources, code and
notebooks for free. Meanwhile, several popular data science publications on Medium (e.g.
Towards Data Science, Analytics Vidhya) allow you to learn from others’ work and share your
own projects in the same place.
Why a Project-Based Approach?
1. It is practical and gives us a sense of achievement that we are doing something
real!
2. It highlights the rationale for learning each piece of content. A goal-oriented
approach provides a bird’s-eye view of how the little pieces tie together to form
the big picture.
3. It allows us to actively retrieve information as we learn. Active recall has been
shown to significantly improve information retention compared with conventional
learning, which only requires passively consuming knowledge.
Let’s break down the data science lifecycle into the following 5 steps and we will see how
each step connects to various knowledge domains.
1. Business Problem & Data Science Solution
The first step of a data science project is to identify the business problem and define the
objectives of an experiment design or model deployment.
Skillset I — Business Knowledge
This stage does not need technical skills yet, but it demands business understanding to identify
the problem and define the objectives. The first step is to understand the domain-specific
terminology that appears in the dataset, and then to translate the business requirement into a
technical solution. Building up this knowledge takes years of experience in the field.
Here I can only recommend some websites that increase your exposure to various business
domains, for example Harvard Business Review, HubSpot, Investopedia and TechCrunch.
Skillset II — Statistics (Experimental Design)
After defining the problem, the next task is to frame it as a data science solution. This starts
with knowledge of experimental design, such as:
• Hypothesis Testing
• Sampling
• Bias / Variance Trade-off
• Different types of Classification Errors
• Overfitting / Underfitting
There are various types of hypothesis tests to explore, such as the t-test, ANOVA and the
chi-square test.
Machine learning can be framed in similar terms, as a search through a hypothesis space: we
look for the model that best fits our observed data and allows us to make predictions on
unobserved data.
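As a minimal sketch of one of these tests, the snippet below computes Welch's two-sample
t statistic using only Python's standard library. The two samples (`control`, `variant`) and
the helper name `welch_t_statistic` are made up for illustration and do not come from a real
experiment.

```python
import math
import statistics


def welch_t_statistic(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 denominator), each divided by its sample size.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof


# Hypothetical measurements for two groups (illustrative values only).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

t_stat, dof = welch_t_statistic(variant, control)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

A large absolute t value relative to the critical value for the computed degrees of freedom
(about 2.2 to 2.3 at the two-sided 5% level for roughly ten degrees of freedom) suggests the
difference in means is unlikely to be due to chance alone. In practice a statistics library
would typically be used to obtain the p-value as well.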
Useful Resources:
• Khan Academy: Study Design
• A Gentle Introduction to Statistical Hypothesis Testing
2. Data Extraction & Data Preprocessing
The second step is to collect data from various sources and transform the raw data into a
digestible format. This process is known as data ingestion.
Skillset III — SQL
SQL is a powerful language for communicating with and extracting data from structured
databases. Learning SQL also helps you build a mental model for generating insights through
data querying techniques such as grouping, filtering, sorting and joining.
You will find similar logic in other tools and languages, such as Pandas and
SAS.
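The four querying techniques can be sketched in a single query. The example below uses
Python's built-in `sqlite3` module with an in-memory database; the `customers` and `orders`
tables and their rows are hypothetical and exist only for this illustration.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South'), (3, 'North');
    INSERT INTO orders VALUES
        (1, 1, 50.0), (2, 1, 20.0), (3, 2, 70.0), (4, 3, 5.0), (5, 3, 40.0);
    """
)

query = """
SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id   -- joining
WHERE o.amount >= 10                          -- filtering
GROUP BY c.region                             -- grouping
ORDER BY total_amount DESC                    -- sorting
"""

rows = conn.execute(query).fetchall()
for region, n_orders, total_amount in rows:
    print(region, n_orders, total_amount)
```

The same join, filter, group and sort steps map closely onto `merge`, boolean indexing,
`groupby` and `sort_values` in Pandas.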
Useful Resources:
• “Get Started with SQL Joins”
• Datacamp: SQL fundamentals
• Dataquest: SQL Basics
Skillset IV — Python (Pandas)
It is essential to get comfortable with a programming language while learning data science.
Its simple syntax makes Python a relatively easy language to start with. Here is a great video
tutorial if you are new to Python: Python for Beginners - Learn Python in 1 Hour.
Once you have a basic understanding, it is worth spending some time learning the Pandas
library. Pandas is almost unavoidable if you use Python for data extraction. It loads data (for
example, a database table) into a DataFrame, a table-like format that most of us are familiar
with. Pandas also plays an important role in data preprocessing, where we need to examine and
handle data quality issues such as:
• Addressing missing data
• Transforming inconsistent data types
• Removing duplicated values
• Treating outliers
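Here is a minimal sketch of these four preprocessing steps in Pandas. The DataFrame `raw`, its
column names and its values are hypothetical, and the choices made here (median imputation,
clipping at 1.5 times the interquartile range) are only one common option among several.

```python
import pandas as pd

# Made-up records with a duplicate row, text-typed ages, gaps and an outlier.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4, 5],
        "age": ["34", "45", "45", None, "29", "41"],
        "income": [52000.0, 61000.0, 61000.0, 58000.0, None, 950000.0],
    }
)

# Remove duplicated rows.
clean = raw.drop_duplicates().copy()

# Transform the inconsistent data type: ages stored as text become numbers.
clean["age"] = pd.to_numeric(clean["age"])

# Address missing data by filling gaps with each column's median.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

# Treat outliers by clipping income to 1.5 * IQR beyond the quartiles.
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["income"] = clean["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(clean)
```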
Useful Resources:
• Python Pandas Tutorial: A Complete Introduction for Beginners
• W3schools: Pandas Tutorial
3. Data Exploration & Feature Engineering
The third step is data exploration, also known as EDA (Exploratory Data Analysis), which
reveals hidden characteristics and patterns in a dataset. It is usually carried out through
data visualization and followed by feature engineering, which transforms the data based on the
outcome of the exploration.
Skillset V — Statistics (Descriptive Statistics)
Data exploration uses descriptive statistics to summarize the characteristics of the dataset:
• Mean, Median, Mode (Measures of Central Tendency)
• Range, Variance, Standard Deviation (Measures of Dispersion)
• Correlation, Covariance
• Skewness, Distribution
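These summaries can be computed with Python's standard `statistics` module, as in the sketch
below. The sample values are invented, and the `covariance` and `correlation` helpers are
written out by hand here purely for illustration.

```python
import statistics

# Hypothetical sample with one large value, which makes it right-skewed.
values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 20]

# Measures of central tendency.
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# Measures of dispersion.
value_range = max(values) - min(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)

# Rough skewness indicator (Pearson's second coefficient): positive when the
# mean is pulled above the median by a long right tail.
skew_indicator = 3 * (mean - median) / std_dev


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


# Made-up paired observations.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

print(mean, median, mode, value_range, round(std_dev, 2), round(skew_indicator, 2))
print(covariance(hours, scores), round(correlation(hours, scores), 3))
```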
After gaining a solid understanding of the dataset’s characteristics, we need to apply the most
appropriate feature engineering techniques. For instance, use a log transformation
for right-skewed data and clipping to deal with outliers.
Here are some of the most common feature engineering techniques:
• Categorical Encoding
• Scaling
• Log Transformation
• Imputation
• Feature Selection
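To make several of these techniques concrete, here is a small sketch in plain Python covering
imputation, clipping, log transformation, scaling and one-hot encoding. The helper functions
and the sample data are hypothetical; libraries such as Pandas and scikit-learn provide their
own versions of these operations.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip(values, lower, upper):
    """Limit every value to the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def log_transform(values):
    """Apply log(1 + x), which compresses a long right tail."""
    return [math.log1p(v) for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot_encode(labels):
    """Return the sorted category levels and one 0/1 row per label."""
    levels = sorted(set(labels))
    return levels, [[int(label == level) for level in levels] for label in labels]


# Made-up incomes with a missing entry and one extreme value.
incomes = [30_000, None, 52_000, 61_000, 950_000]

filled = impute_median(incomes)
clipped = clip(filled, 0, 100_000)
logged = log_transform(filled)
scaled = min_max_scale(clipped)
levels, encoded = one_hot_encode(["red", "blue", "red"])

print(filled, clipped, scaled, levels, encoded, sep="\n")
```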
Useful Resources:
• 3 Common Techniques for Data Transformation
• Fundamental Techniques of Feature Engineering for Machine Learning
• Feature Selection and EDA in Machine Learning
Skillset VI — Data Visualization
Combining statistics and data visualization allows us to understand the data through
appropriate visual representations. Whether you prefer visualization packages such
as seaborn or matplotlib in Python and ggplot2 in R, or visualization tools like Tableau and
Power BI, it is essential to distinguish the use cases of the common chart types:
• Bar Chart
• Histogram
• Box Plot
• Heatmap
• Scatter Plot
• Line Chart
4. Model Implementation
After all of the preparation so far, it is finally time to dive deeper into machine learning
algorithms.
Skillset VII — Machine Learning
scikit-learn is a powerful Python library that allows beginners to get started in machine
learning easily. It offers plenty of built-in modules, and we can implement a model in just
a few lines of code. Although the library does the hard work for us, it is still crucial to
understand how the algorithms operate behind the scenes and to be able to distinguish the best
use case for each. Broadly, machine learning algorithms are categorized into supervised
learning and unsupervised learning. Below are some of the most popular algorithms:
Supervised Learning:
• Linear Regression
• Logistic Regression
• Neural Network
• Decision Tree
• Support Vector Machine
Unsupervised Learning:
• Clustering
• Dimensionality Reduction (e.g. PCA)
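As a rough illustration of how little code a first model needs, the sketch below fits one
supervised model (logistic regression) and one unsupervised model (k-means clustering) with
scikit-learn. The toy data is hypothetical and deliberately well separated, and the sketch
assumes scikit-learn is installed; real projects would also split the data into training and
test sets.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up toy data: two features per sample, two clearly separated groups.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
y = [0, 0, 0, 1, 1, 1]

# Supervised learning: the labels y guide the fit.
classifier = LogisticRegression()
classifier.fit(X, y)
predictions = classifier.predict([[0.9, 1.1], [4.2, 4.1]])

# Unsupervised learning: no labels, the algorithm groups similar points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(predictions, cluster_labels)
```

Note that the cluster numbers assigned by k-means are arbitrary: the first group may be called
0 or 1, so only the grouping itself is meaningful.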
Useful Resources:
• scikit-learn website
• Coursera: Machine Learning with Python
Skillset VIII — Mathematics
Many beginners, including me, may wonder why we need to learn math for data
science.
For a beginner, math knowledge mainly helps in understanding the theory behind
the algorithms. Later on, when we no longer rely on built-in libraries for machine
learning models, it allows us to develop and optimize customized algorithms. Additionally,
model training and hyperparameter tuning draw on more advanced math to search for the model
that minimizes the cost function.
This is when more complicated math topics come into play:
• Calculus
• Linear Algebra
• Optimization Problems
• Gradient Descent
• Search Algorithms
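To show how calculus and optimization meet in practice, here is a minimal sketch of gradient
descent fitting a straight line by minimizing the mean squared error. The function name, the
learning rate, the number of steps and the data are all assumptions chosen for this example.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1/n) * sum(error^2).
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Step against the gradient to reduce the cost.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line_gradient_descent(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

If the learning rate is set too high the updates overshoot and diverge, and if it is too low
convergence becomes slow, which is one reason the underlying math is worth understanding.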
Useful Resources:
• 3Blue1Brown: Essence of Linear Algebra
• 3Blue1Brown: Essence of Calculus
• 3Blue1Brown: Gradient Descent
5. Model Evaluation
Skillset IX — Statistics (Inferential Statistics)
Inferential statistics is particularly useful when making model predictions and evaluating
model performance. As opposed to descriptive statistics, inferential statistics focuses on
generalizing the patterns observed in the sample data to a wider population. It provides
evidence of which features carry high importance when making inferences, and it underpins the
evaluation metrics used to judge model performance.
For example, for classification problems, where the outputs are discrete categories, some
common metrics are:
• Confusion matrix
• Type 1 error (false positive) / Type 2 error (false negative)
• Accuracy
• ROC / AUC
For regression problems, where the outputs are continuous numbers, some common
metrics are:
• R Squared
• Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared
Error (MSE)
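The sketch below computes several of these metrics by hand in plain Python so the definitions
are visible. The labels, predictions and helper names are hypothetical; scikit-learn's
`sklearn.metrics` module offers ready-made equivalents.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type 1 errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type 2 errors
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R squared for continuous predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    total_ss = sum((t - mean_true) ** 2 for t in y_true)
    r_squared = 1 - sum(e ** 2 for e in errors) / total_ss
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r_squared}


# Made-up classification results.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(labels, predicted_labels)
accuracy = (tp + tn) / len(labels)

# Made-up regression results.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]
metrics = regression_metrics(actual, predicted)

print((tp, fp, fn, tn), accuracy, metrics)
```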
Useful Resources:
• Khan Academy: Statistics and Probability
• Metrics to Evaluate your Machine Learning Algorithm

Self-Study Approach for Data Science

  • 1. Self-Study Approach for Data Science – 2022
    Project Based Approach to learn Data Science
    As someone who holds a Master’s degree in Computer Science, I am truly passionate about this field, and I decided to experiment with building my own curriculum to learn data science in my spare time. I would like to share my experience and hope it offers some insight if you are on the same journey.
    Project-based learning is a good starting point for people who already have some technical background but also want to dive deeper into the building blocks of data science. A typical data science/machine learning project spans a lifecycle that runs from defining the objectives, through data preprocessing, exploratory data analysis, feature engineering and model implementation, to
  • 2. model evaluation. Each phase requires different skill sets, including statistics, programming, SQL, data visualization, mathematics and business knowledge.
    I highly recommend Kaggle as the platform to experiment with your data science projects and Medium as the platform to gain data science knowledge from professionals. With plenty of interesting datasets and a cloud-based programming environment, Kaggle gives you data sources, code and notebooks for free. Several popular data science publications on Medium (e.g. Towards Data Science, Analytics Vidhya) let you learn from others’ work and share your own projects in the same place.
    Why Project Based Approach?
    1. It is practical and gives us a sense of achievement that we are doing something real!
    2. It highlights the rationale for learning each piece of content. A goal-oriented approach provides a bird’s-eye view of how the little pieces tie together to form the big picture.
    3. It allows us to actively retrieve information as we learn. Active recall has been shown to improve information retention compared with conventional learning that only requires passively consuming knowledge.
    Let’s break down the data science lifecycle into the following 5 steps and see how each step connects to various knowledge domains.
    1. Business Problem & Data Science Solution
    The first step of a data science project is to identify the business problem and define the objectives of an experiment design or model deployment.
  • 3. Skillset I: Business Knowledge
    This stage does not call for technical skills yet. It demands business understanding to identify the problem and define the objectives. The first step is to understand the domain-specific terminology that appears in the dataset, and then to translate a business requirement into a technical solution. Building this knowledge takes years of experience in the field. Here I can only recommend some websites that increase your exposure to business domains, for example Harvard Business Review, HubSpot, Investopedia and TechCrunch.
    Skillset II: Statistics (Experimental Design)
    After defining the problem, the next task is to frame it as a data science solution. This starts with knowledge of Experimental Design, such as:
      • Hypothesis Testing
      • Sampling
      • Bias / Variance Trade-off
      • Different types of Classification Errors
      • Overfitting / Underfitting
    There are various types of hypothesis tests to explore: t-test, ANOVA, chi-square test, etc. Machine learning can be viewed in a related way, as a search through a hypothesis space for the model that best fits the observed data and allows us to make predictions on unobserved data.
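To make the idea of hypothesis testing concrete, here is a minimal sketch of a two-sided permutation test for a difference in means, using only the Python standard library. It is one of many possible tests, not one the list above specifically prescribes, and the function name `permutation_test` and the page-layout data are hypothetical, chosen only for illustration.

```python
import random


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so we reshuffle them and recompute the difference.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: minutes spent on a page under two layouts.
layout_a = [12.1, 11.8, 13.0, 12.6, 12.9, 13.4, 12.2, 12.7]
layout_b = [10.9, 11.2, 10.5, 11.6, 11.0, 10.8, 11.4, 11.1]

observed_diff, p_value = permutation_test(layout_a, layout_b)
print(f"observed difference in means: {observed_diff:.2f}")
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unusual if the two layouts really behaved the same. The threshold for "small" (often 0.05) is a convention to decide on before running the experiment.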
  • 4. Useful Resources:
      • Khan Academy: Study Design
      • A Gentle Introduction to Statistical Hypothesis Testing
    2. Data Extraction & Data Preprocessing
    The second step is to collect data from various sources and transform the raw data into a digestible format. This process is known as Data Ingestion.
    Skillset III: SQL
    SQL is a powerful language for communicating with and extracting data from structured databases. Learning SQL also helps you build a mental model for generating insights through data querying techniques such as grouping, filtering, sorting and joining. You will find similar logic in other tools and languages, such as Pandas and SAS.
    Useful Resources:
      • “Get Started with SQL Joins”
      • DataCamp: SQL fundamentals
      • Dataquest: SQL Basics
    Skillset IV: Python (Pandas)
    It is essential to get comfortable with a programming language while learning data science. Its simple syntax makes Python a relatively easy language to start with. Here is a great video tutorial if you are new to Python: Python for Beginners - Learn Python in 1 Hour.
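The four SQL querying techniques mentioned above (joining, filtering, grouping and sorting) can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    INSERT INTO customers VALUES
        (1, 'Ana', 'North'), (2, 'Ben', 'South'), (3, 'Cho', 'North');
    INSERT INTO orders VALUES
        (1, 1, 120.0), (2, 1, 80.0), (3, 2, 40.0), (4, 3, 300.0), (5, 3, 15.0);
    """
)

query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id   -- joining
    WHERE o.amount >= 20                          -- filtering
    GROUP BY c.region                             -- grouping
    ORDER BY total_amount DESC                    -- sorting
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total_amount in rows:
    print(region, n_orders, total_amount)

conn.close()
```

The same join, filter, group and sort pattern carries over to Pandas through `merge`, boolean indexing, `groupby` and `sort_values`.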
  • 5. Once you have a basic understanding, it’s worth spending some time learning the Pandas library. Pandas is almost unavoidable if you use Python for data extraction. It loads data into a DataFrame, a table-like format that most of us are familiar with. Pandas also plays an important role in data preprocessing, where we need to examine and handle data quality issues such as:
      • Addressing missing data
      • Converting inconsistent data types
      • Removing duplicated values
      • Treating outliers
    Useful Resources:
      • Python Pandas Tutorial: A Complete Introduction for Beginners
      • W3Schools: Pandas Tutorial
    3. Data Exploration & Feature Engineering
    The third step is Data Exploration, also known as EDA (Exploratory Data Analysis), which reveals hidden characteristics and patterns in a dataset. It is usually carried out through data visualization and followed by feature engineering, which transforms the data based on what the exploration reveals.
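As a rough illustration of the four data quality issues listed in the Pandas discussion above, the following sketch cleans a tiny, made-up DataFrame. It assumes Pandas is installed. The column names, the median imputation and the interquartile-range rule for outliers are illustrative choices, not the only reasonable ones.

```python
import pandas as pd

# Hypothetical raw data containing all four issues.
raw = pd.DataFrame(
    {
        "customer": ["Ana", "Ben", "Ben", "Cho", "Dee", "Eli"],
        "age": ["34", "29", "29", None, "41", "38"],       # numbers stored as text
        "spend": [120.0, 80.0, 80.0, 95.0, None, 5000.0],  # missing value, outlier
    }
)

clean = raw.drop_duplicates().copy()        # remove the duplicated row
clean["age"] = pd.to_numeric(clean["age"])  # convert text to numbers

# Address missing data by filling with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["spend"] = clean["spend"].fillna(clean["spend"].median())

# Treat outliers by clipping to 1.5 * IQR beyond the quartiles.
q1, q3 = clean["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["spend"] = clean["spend"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(clean)
```

Whether to impute, drop or clip depends on the dataset and the business question, so treat each step as a decision to justify rather than a fixed recipe.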
  • 6. Skillset V: Statistics (Descriptive Statistics)
    Data exploration uses descriptive statistics to summarize the characteristics of the dataset:
      • Mean, Median, Mode (Measures of Central Tendency)
      • Range, Variance, Standard Deviation (Measures of Dispersion)
      • Correlation, Covariance
      • Skewness, Distribution
    With a solid understanding of the dataset’s characteristics, we can apply the most appropriate feature engineering techniques. For instance, use a log transformation for right-skewed data and clipping to deal with outliers. Here are some of the most common feature engineering techniques:
      • Categorical Encoding
      • Scaling
      • Log Transformation
      • Imputation
      • Feature Selection
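The descriptive statistics and the two transformations mentioned above (log transformation for right-skewed data, clipping for outliers) can be sketched with the standard library alone. The sample values, the `skewness` helper and the clipping bound of 100 are assumptions made for this example.

```python
import math
import statistics

# Hypothetical right-skewed feature, e.g. order values with one extreme entry.
values = [12, 15, 15, 18, 22, 30, 45, 80, 400]

print("mean:  ", round(statistics.mean(values), 2))
print("median:", statistics.median(values))
print("mode:  ", statistics.mode(values))
print("range: ", max(values) - min(values))
print("stdev: ", round(statistics.stdev(values), 2))


def skewness(xs):
    """Population skewness: the third standardized moment."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return sum(((x - mu) / sigma) ** 3 for x in xs) / len(xs)


# Log transformation: log1p compresses large values and reduces right skew.
logged = [math.log1p(x) for x in values]

# Clipping: cap values at a chosen upper bound (an arbitrary 100 here).
clipped = [min(x, 100) for x in values]

print("skewness before log:", round(skewness(values), 2))
print("skewness after log: ", round(skewness(logged), 2))
print("max after clipping: ", max(clipped))
```

Comparing the mean with the median is a quick first check for skew: in a right-skewed feature the mean is pulled toward the long tail.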
  • 7. Useful Resources:
      • 3 Common Techniques for Data Transformation
      • Fundamental Techniques of Feature Engineering for Machine Learning
      • Feature Selection and EDA in Machine Learning
    Skillset VI: Data Visualization
    Combining statistics with data visualization allows us to understand the data through appropriate visual representations. Whether you prefer visualization packages such as seaborn or matplotlib in Python and ggplot2 in R, or visualization tools like Tableau and Power BI, it’s essential to distinguish the use cases of common chart types:
      • Bar Chart
      • Histogram
      • Box Plot
      • Heatmap
      • Scatter Plot
      • Line Chart
  • 8. 4. Model Implementation
    After all of the preparation so far, it’s finally time to dive deeper into machine learning algorithms.
    Skillset VII: Machine Learning
    scikit-learn is a powerful Python library that makes it easy for beginners to get started with machine learning. It offers plenty of built-in modules, and we can implement a model in just a few lines of code. Although it does the hard work for us, it is still crucial to understand how the algorithms operate behind the scenes and to be able to identify the best use case for each. Broadly, machine learning algorithms are categorized into supervised learning and unsupervised learning. Below are some of the most popular algorithms:
    Supervised Learning:
      • Linear Regression
      • Logistic Regression
      • Neural Network
      • Decision Tree
      • Support Vector Machine
    Unsupervised Learning:
      • Clustering
      • PCA
      • Dimensionality Reduction
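To show what "a few lines of code" can look like in scikit-learn, here is a minimal supervised learning sketch. It assumes scikit-learn is installed. The choice of the bundled iris dataset, logistic regression and a 25% hold-out split are illustrative assumptions rather than recommendations from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small classification dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)  # extra iterations to help convergence
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators share this `fit` / `predict` interface, so swapping in another algorithm from the list above, such as a decision tree, usually means changing only the import and the line that creates the model.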
  • 9. Useful Resources:
      • scikit-learn website
      • Coursera: Machine Learning with Python
    Skillset VIII: Mathematics
    Many beginners, including me, may wonder why we need to learn math for data science. For a beginner, math knowledge mainly helps in understanding the theory behind the algorithms. Later on, when we no longer rely on built-in libraries for machine learning models, it allows us to develop and optimize customized algorithms. Additionally, hyperparameter tuning draws on more advanced math to search for the best model, the one that minimizes the cost function. This is where more complicated math topics come into play:
      • Calculus
      • Linear Algebra
      • Optimization problem
      • Gradient Descent
      • Searching Algorithms
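Gradient descent, one of the topics listed above, is easier to appreciate once written out by hand. The sketch below fits a straight line by repeatedly stepping against the gradient of the mean squared error cost. The function name `fit_line`, the learning rate and the step count are assumptions picked so that this toy example converges.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost (1/n) * sum(error^2).
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step downhill, against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the ideal answer is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Calculus supplies the gradients, and linear algebra lets the same update be written compactly for many features at once. A learning rate that is too large makes the updates diverge, which is worth trying on this example.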
  • 10. Useful Resources:
      • 3Blue1Brown: Essence of Linear Algebra
      • 3Blue1Brown: Essence of Calculus
      • 3Blue1Brown: Gradient Descent
    5. Model Evaluation
    Skillset IX: Statistics (Inferential Statistics)
    Inferential statistics is particularly useful when making model predictions and evaluating model performance. As opposed to descriptive statistics, inferential statistics focuses on generalizing the patterns observed in sample data to a wider population. It provides evidence of which features matter most when making inferences, and it helps assess model performance through evaluation metrics. For example, for classification problems, where the outputs are discrete categories, some common metrics are:
      • Confusion matrix
      • Type I error / Type II error
      • Accuracy
      • ROC / AUC
  • 11. For regression problems, where the outputs are continuous numbers, some common metrics are:
      • R Squared
      • Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE)
    Useful Resources:
      • Khan Academy: Statistics and Probability
      • Metrics to Evaluate your Machine Learning Algorithm
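Several of the classification and regression metrics named above can be computed by hand, which helps clarify what library functions report. This is a minimal standard-library sketch. The helper names and the small label and prediction lists are hypothetical, and ROC / AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Type II errors
    return tp, tn, fp, fn


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R squared for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_total
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical classification results.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(labels, predictions)
accuracy = (tp + tn) / len(labels)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy:", accuracy)

# Hypothetical regression results.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 8.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Accuracy alone can mislead on imbalanced classes, which is one reason the confusion matrix and the two error types are listed separately.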