17. Regression
Regression analysis is a statistical method that helps us analyze and
understand the relationship between two or more variables of interest.
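As a minimal sketch of the idea, the snippet below fits a straight line to two variables with ordinary least squares in plain Python. The data (hours studied versus exam score) and the function name fit_line are made up for illustration and are not part of the original slides.

```python
from statistics import mean

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
print(slope, intercept)
```

The slope describes how much the second variable changes, on average, for a one-unit change in the first.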
24. Learn from mistakes
Reinforcement learning is a machine learning training method based on
rewarding desired behaviors and/or punishing undesired ones.
31. Goal of Feature Handling
Preparing a proper input dataset that is compatible with the requirements
of the machine learning algorithm.
32. According to a survey, data scientists spend 60% of their
time on data preparation.
33. In Feature Handling, you will learn...
Handling categorical data
● Nominal variables
● Ordinal variables
● One hot encoding
● Label/ordinal/integer encoding
Handling missing or invalid values
● Mean method
● Median method
● Mode method
34. Categorical Variables
Before we move further: a categorical variable is a variable whose values
are one or more categories.
35. Nominal Variables
A nominal variable comprises a finite set of discrete values with no
relationship between those values.
The values of such a variable cannot be placed in any meaningful order.
36. Ordinal Variables
An ordinal variable comprises a finite set of discrete values with a ranked
ordering between values.
These are variables whose values have a definite order or rank relative to
one another.
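38. One Hot Encoding
For nominal variables, forcing an ordinal relationship through an ordinal
encoding lets the model assume a natural ordering between categories that
does not exist, which may result in poor performance or unexpected results.
One-hot encoding avoids this by representing each category as its own
binary (0/1) column.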
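One-hot encoding gives every category its own 0/1 column, so no ordering is implied. The snippet below is a minimal plain-Python sketch; the helper one_hot_encode and the sample port codes are assumptions made for illustration.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Hypothetical nominal feature: port of embarkation
ports = ["S", "C", "Q", "S"]

categories, one_hot_rows = one_hot_encode(ports)
for port, row in zip(ports, one_hot_rows):
    print(port, row)
```

Each row contains exactly one 1, in the column of its category, so the model cannot read "S" as being greater or smaller than "C".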
39. Ordinal Encoding
In ordinal encoding, each unique category value is assigned an integer
value.
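A minimal sketch of ordinal encoding in plain Python follows. It assumes the ranking of the categories is known and passed in explicitly; the helper ordinal_encode and the ticket-class values are hypothetical and chosen only for illustration.

```python
def ordinal_encode(values, order):
    """Map each category to its integer rank in the given order."""
    mapping = {category: rank for rank, category in enumerate(order)}
    return [mapping[value] for value in values]

# Hypothetical ordinal feature with a known ranking
ticket_class = ["third", "first", "second", "third"]
class_order = ["first", "second", "third"]

encoded_class = ordinal_encode(ticket_class, class_order)
print(encoded_class)
```

Passing the order explicitly keeps the integers aligned with the real ranking instead of with alphabetical or first-seen order.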
41. Looking at a real-life dataset
Consider a dataset that gives you information about multiple people aboard
the Titanic, such as their age, sex, sibling count, port of embarkation,
and whether or not they survived the disaster.
Based on this, you have to predict whether an arbitrary passenger on the
Titanic would survive the sinking.
43. Real-life datasets almost always have
missing values
For example, not every passenger’s age will have been recorded.
There are multiple reasons why this could happen.
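Before choosing a strategy, it helps to count how many values are missing in each column. The sketch below does this in plain Python, assuming missing entries are stored as None; the passenger records are invented for illustration and are not taken from the real Titanic data.

```python
def count_missing(records):
    """Count None entries per field across a list of dict records."""
    counts = {}
    for row in records:
        for field, value in row.items():
            counts.setdefault(field, 0)
            if value is None:
                counts[field] += 1
    return counts

# Hypothetical passenger records; None marks a missing value
passengers = [
    {"age": 22, "sex": "male", "embarked": "S"},
    {"age": None, "sex": "female", "embarked": "C"},
    {"age": 35, "sex": "female", "embarked": None},
    {"age": None, "sex": "male", "embarked": "S"},
]

missing = count_missing(passengers)
print(missing)
```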
44. Reasons
● Simply put, it’s difficult to collect data.
● Sometimes data is lost.
● Data can also be corrupted.
● People may not be comfortable with sharing data.
47. Mean
In this method, any missing values in a column are replaced with the mean
of that column.
Assume we have a dataset of patients in which the age attribute has some
missing values. We have to handle these gaps before training, or else they
are a recipe for disaster.
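Mean imputation can be sketched in a few lines of plain Python. The example assumes missing ages are stored as None; the helper impute_mean and the age values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in values if value is not None]
    fill = mean(observed)
    return [fill if value is None else value for value in values]

# Hypothetical patient ages; None marks a missing value
patient_ages = [20, None, 30, 40, None]

imputed_ages = impute_mean(patient_ages)
print(imputed_ages)
```

Only the observed values contribute to the mean, and every gap in the column receives the same fill value.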
49. Cons of using this method
● This method is extremely sensitive to outliers present in the dataset.
● A fill value distorted by an outlier is a major threat to any machine
learning model and may severely degrade its results.
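The outlier problem is easy to see with a small experiment. In the sketch below, a single implausible age (an assumed data-entry error, invented for illustration) drags the mean, and therefore the value used to fill every gap, far away from the typical ages.

```python
from statistics import mean

# Hypothetical observed ages
ages = [22, 25, 27, 30, 31]
# Same ages plus one assumed data-entry error
ages_with_outlier = ages + [300]

clean_fill = mean(ages)
skewed_fill = mean(ages_with_outlier)
print(clean_fill, skewed_fill)
```

With the outlier present, every missing age would be filled with a value that no real passenger or patient in the sample comes close to.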
51. Another technique is median imputation, in which the missing values are
replaced with the median value of the entire feature column.
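A minimal sketch of median imputation in plain Python, again assuming missing entries are stored as None. The helper impute_median and the sample ages, including one deliberately extreme value, are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [value for value in values if value is not None]
    fill = median(observed)
    return [fill if value is None else value for value in values]

# Hypothetical ages with one missing value and one extreme value
ages_with_gap = [22, 25, None, 27, 30, 31, 300]

median_imputed = impute_median(ages_with_gap)
print(median_imputed)
```

Because the median depends on the middle of the sorted values rather than their sum, the extreme value has little effect on the fill value.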
52. Cons of using this method
● Doesn’t factor in the correlations between features. It only works at the
column level.
● Will give poor results on encoded categorical features (do NOT use it
on categorical features).
54. Another technique is mode imputation, in which the missing values are
replaced with the mode (the most frequent value) of the entire
feature column.
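Mode imputation also works for categorical columns, since it only needs value counts. The plain-Python sketch below assumes missing entries are stored as None; the helper impute_mode and the port codes are invented for illustration.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [value for value in values if value is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if value is None else value for value in values]

# Hypothetical port-of-embarkation column; None marks a missing value
embarked = ["S", "C", None, "S", "Q", None]

mode_imputed = impute_mode(embarked)
print(mode_imputed)
```

Every gap is filled with the most common category, which makes that category even more dominant in the column.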
55. Cons of using this method
● It also doesn’t factor in the correlations between features.
● It can introduce bias in the data.