TECHWEEKENDS Presents:
De-cluttering Machine Learning, in collaboration with IEEE GGSIPU
Are you clueless when you hear people say words like Unsupervised Learning and Regression? Worry not! GDSC USICT is here for you!
We are organizing a session on Machine Learning where you will learn the basics of machine learning while developing a hands-on project from scratch and seeing the results in real time. You will also learn about different algorithms and models, and various data preparation techniques.
3. Pandas
● Pandas is a Python module that helps us read and manipulate data.
● With the help of the Pandas library you can take in data and view it as a table that's human readable, but it can also be interpreted numerically so that you can do lots of computations with it.
● We call the table of data a DataFrame.
4. Let's Start Pandas
● Let's start by importing pandas. It's standard practice to nickname it pd so that it's faster to type later on.
import pandas as pd
● We will be working with a dataset of Titanic passengers. For each passenger, we'll have some data on them as well as whether or not they survived the sinking.
● Our data is stored as a CSV (comma-separated values) file.
● We're going to pull the data into pandas so we can view it as a DataFrame.
● The read_csv function takes a file in csv format and converts it to a Pandas DataFrame.
df = pd.read_csv('titanic.csv')
● The object df is now our pandas dataframe with the Titanic dataset. Now we can use the head method to look at the
data.
The head method returns the first 5 rows of the DataFrame.
print(df.head())
5. Summarize the Data
● Usually our data is much too big for us to be able to display it all.
Looking at the first few rows is the first step to understanding our data, but then we want to look at some
summary statistics.
● In pandas, we can use the describe method. It returns a table of statistics about the columns.
print(df.describe())
● For each column we see a few statistics. Note that it only gives statistics for the numerical columns.
● Let's review what each of these statistics means:
Count: This is the number of rows that have a value. In our case, every passenger has a value for each of the
columns, so the value is 887 (the total number of passengers).
Mean: Recall that the mean is the standard average.
Std: This is short for standard deviation. This is a measure of how dispersed the data is.
Min: The smallest value
25%: The 25th percentile
50%: The 50th percentile, also known as the median.
75%: The 75th percentile
Max: The largest value
6. Selecting a single Column
● We often will only want to deal with some of the columns that we have in our dataset. To select
a single column, we use the square brackets and the column name.
● In this example, we're selecting just the column with the passenger fares.
col = df['Fare']
print(col)
● The result is what we call a Pandas Series.
A series is like a DataFrame, but it's just a single column.
7. Selecting Multiple Columns
● We can also select multiple columns from our original DataFrame, creating a smaller
DataFrame.
We're going to select just the Age, Sex, and Survived columns from our original DataFrame.
● We put these values in a list as follows:
['Age', 'Sex', 'Survived']
● Now we use that list inside of the bracket notation df[...]. When printing a large DataFrame that's too big to display, you can use the head method to print just the first 5 rows.
small_df = df[['Age', 'Sex', 'Survived']]
print(small_df.head())
8. Creating a Column
● We often want our data in a slightly different format than it originally comes in. For example,
our data has the sex of the passenger as a string ("male" or "female"). This is easy for a human
to read, but when we do computations on our data later on, we’ll want it as Boolean values
(Trues and Falses).
● We create a Pandas Series that will be a series of Trues and Falses (True if the passenger
is male and False if the passenger is female).
● df['Sex'] == 'male'
● Now we want to create a column with this result. To create a new column, we use the same
bracket syntax (df['male']) and then assign this new value to it.
df['male'] = df['Sex'] == 'male'
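● As a quick check (assuming the df loaded from titanic.csv above), we can look at the original and the new column side by side:
print(df[['Sex', 'male']].head())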
9. Numpy
● Numpy is a Python package for manipulating lists and tables of numerical data. We can use it to
do a lot of statistical calculations. We call the list or table of data a numpy array.
● We often will take the data from our pandas DataFrame and put it in numpy arrays.
● Pandas DataFrames are great because we have the column names and other text data that makes
it human readable.
● A DataFrame, while easy for a human to read, is not the ideal format for doing calculations. The
numpy arrays are generally less human readable, but are in a format that enables the necessary
computation.
10. Let's Start NumPy
● Here too, let's start by importing NumPy. It's standard practice to nickname it np so that it's faster to type later on.
import numpy as np
● Now let's create an array (vector) in NumPy:
○ Row-wise:
vector_row = np.array([1,2,3])
print(vector_row)
○ Column-wise:
vector_col = np.array([[1],[2],[3]])
print(vector_col)
● Now let's create a matrix (2-D array):
matrix = np.array([[1,2,3],[4,5,6]])
(This creates a matrix of shape 2×3.)
11. ● When you need to select one or more elements in a vector or matrix:
#Create a vector as a Row
vector_row = np.array([ 1,2,3,4,5,6 ])
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
#Select 3rd element of Vector
print(vector_row[2])
#Select 2nd row 2nd column
print(matrix[1,1])
#Select all elements of a vector
print(vector_row[:])
12. #Select everything up to and including the 3rd element
print(vector_row[:3])
#Select everything after the 3rd element
print(vector_row[3:])
#Select the last element
print(vector_row[-1])
#Select the first 2 rows and all the columns of the matrix
print(matrix[:2,:])
#Select all rows and the 2nd column of the matrix
print(matrix[:,1:2])
13. #Create a Matrix
matrix =np.array([[1,2,3],[4,5,6],[7,8,9]])
#View the Number of Rows and Columns
print(matrix.shape)
#View the number of elements (rows*columns)
print(matrix.size)
#View the number of Dimensions(2 in this case)
print(matrix.ndim)
14. ● Finding Minimum and Maximum
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Return the max element
print(np.max(matrix))
#Return the min element
print(np.min(matrix))
#To find the max element in each column
print(np.max(matrix,axis=0))
#To find the max element in each row
print(np.max(matrix,axis=1))
15. ● Calculating Mean, Variance and Standard Deviation
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Mean
print(np.mean(matrix))
#Standard Deviation.
print(np.std(matrix))
#Variance
print(np.var(matrix))
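● Like np.max and np.min above, these functions also take an axis argument; for example, a column-wise and row-wise mean:
#Mean of each column
print(np.mean(matrix,axis=0))
#Mean of each row
print(np.mean(matrix,axis=1))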
16. MATPLOTLIB
● We use the matplotlib library to plot our data. Plotting the data can often help us
build intuition about our data.
● We first need to import matplotlib. It’s standard practice to nickname it plt.
import matplotlib.pyplot as plt
● We use the scatter function to plot our data. The first argument of the scatter function is the x-
axis (horizontal direction) and the second argument is the y-axis (vertical direction).
plt.scatter(df['Age'], df['Fare'])
This plots the Age on the x-axis and the Fare on the y-axis.
17. ● To make it easier to interpret, we can add x and y labels.
plt.xlabel('Age')
plt.ylabel('Fare')
● We can also use our data to color code our scatter plot. This will give each of the 3 classes a
different color. We add the c parameter and give it a Pandas series. In this case, our Pandas series
has 3 possible values (1st, 2nd, and 3rd class), so we'll see our datapoints each get one of three
colors.
plt.scatter(df['Age'], df['Fare'], c=df['Pclass'])
● The purple dots are first class, the green dots are second class, and the yellow dots are third class.
● A scatter plot is used to show all the values from your data on a graph. In order to get a visual
representation of our data, we have to limit our data to two features.
18. ● Now that we can put individual datapoints on a plot, let's see how to draw the line.
The plot function does just that. The following draws a line to approximately separate the 1st
class from the 2nd and 3rd class. From eyeballing, we’ll put the line from (0, 85) to (80, 5). Our
syntax below has a list of the x values and a list of the y values.
plt.plot([0, 80], [85, 5])
● In matplotlib, we use the scatter function to create a scatter plot and the plot function for a line.
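● Putting this section together (a sketch assuming the same titanic.csv loaded earlier):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('titanic.csv')
plt.scatter(df['Age'], df['Fare'], c=df['Pclass'])
plt.xlabel('Age')
plt.ylabel('Fare')
plt.plot([0, 80], [85, 5])
plt.show()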
19. Before proceeding, we will be talking about the following:
❏ Machine learning definition and types
❏ Linear Regression(both single and multivariate)
❏ Decision Tree and Random Forest
20. What is Machine Learning?
Tom Mitchell provides a widely used definition: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad
classifications:
supervised learning
OR
unsupervised learning.
21. Supervised Learning
In supervised learning, we are given a data set and already know what our correct
output should look like, having the idea that there is a relationship between the input
and the output.
Supervised learning problems are categorized into "regression" and "classification"
problems. In a regression problem, we are trying to predict results within a continuous
output, meaning that we are trying to map input variables to some continuous function.
In a classification problem, we are instead trying to predict results in a discrete output.
(a) Regression - Given a picture of a person, we have to predict his/her age on the basis of the given picture (a continuous output).
(b) Classification - A bank has to decide whether or not to give someone a loan on the basis of their credit history (a discrete output).
22. Linear Regression with One Variable
Univariate linear regression is used when you want to predict a single output value y
from a single input value x. We're doing supervised learning and regression here, so that
means we already have an idea about what the input/output cause and effect should
be.
24. The Hypothesis Function
Our hypothesis function has the general form:
$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
Note that this is the equation of a straight line. We choose values for $\theta_0$ and $\theta_1$ so that $h_\theta(x)$ gives our estimated output $\hat{y}$.
In other words, we are trying to create a function $h_\theta$ that maps our input data (the x's) to our output data (the y's).
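As a tiny Python sketch of this hypothesis (the function and argument names here are illustrative, not from the original slides):
def h(theta0, theta1, x):
    # return the estimated output y-hat for input x
    return theta0 + theta1 * x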
25. Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This
takes an average (actually a fancier version of an average) of all the results of the
hypothesis with inputs from x's compared to the actual output y's.
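For reference, the standard squared-error cost function this describes (with $m$ training examples) is:
$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$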
27. Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient
descent equation can be derived. We can substitute our actual cost function and our
actual hypothesis function and modify the equation accordingly (the derivation of these
formulas is out of the scope of this session):
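For reference, the resulting update rules, repeated until convergence (with learning rate $\alpha$ and $m$ training examples), are:
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$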
38. Information gain
Information gain quantifies how much information a feature provides about the
classification; it is based on the notion of entropy.
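For reference, the standard definitions: for a set $S$ whose examples fall into $c$ classes with proportions $p_i$, entropy is
$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$
and the information gain of splitting $S$ on a feature $A$ is
$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} H(S_v)$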
39. Gini index
The Gini index is determined by subtracting the sum of the squared probabilities of each
class from one. Mathematically, the Gini index can be expressed as:
$\text{Gini Index} = 1 - \sum_{i=1}^{c} (p_i)^2$
where $p_i$ is the proportion of samples belonging to class $i$ and $c$ is the number of classes.
44. Confusion matrix
A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
Classification accuracy alone can be misleading if you have an unequal number of
observations in each class or if you have more than two classes in your dataset.
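A toy example with scikit-learn (the values here are illustrative, not from the session's dataset):
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]
print(confusion_matrix(y_true, y_pred))
Each row of the result corresponds to a true class and each column to a predicted class.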
47. What is sklearn?
Scikit-learn (also known as sklearn) is a free machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms such as linear regression, k-NN, and random forests.
1. Loading the data
2. Splitting the data
3. Generating the model
4. Evaluating the accuracy
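A minimal sketch of these four steps, mirroring the full notebook walkthrough below (using the iris dataset):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# 1. Load the data
iris = datasets.load_iris()
# 2. Split the data
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.7, random_state=1)
# 3. Generate the model
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
# 4. Evaluate the accuracy
print(clf.score(x_test, y_test))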
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
(506, 13)
(506,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88
506 rows × 13 columns
(354, 26)
(354,)
(152, 26)
(152,)
LinearRegression()
The coefficient of determination, $R^2$.
The score can be calculated using the following formula:
$R^2 = 1 - \frac{\sum_i (y^T_i - y^P_i)^2}{\sum_i (y^T_i - y^T_{mean})^2}$
T : true/actual value.
P : predicted value.
$R^2 = 0$ represents that we are predicting no better than the mean value. (Meaning, all the predictions are the mean value of the data.)
To prove this, let us put $y^P_i = y^T_{mean}$.
This will give the score as 0. (A model can do even worse than this, in which case the score becomes negative.)
0.8787006807787356
Decision Tree
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
(105, 4)
(105,)
(45, 4)
(45,)
DecisionTreeClassifier()
Graphviz - Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information such as diagrams of abstract graphs and networks.
export_graphviz - This function generates a GraphViz representation of the decision tree in dot format, which is then written into an output file ('out_file').
Pydotplus - PyDotPlus provides a Python Interface to Graphviz’s Dot language.
0.9555555555555556
array([[14, 0, 0],
[ 0, 17, 1],
[ 0, 1, 12]], dtype=int64)
The algorithm uses the Gini Index for constructing the decision tree.
A feature with a lower Gini index is chosen for a split.
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX ... CHAS*CHAS NOX*NOX RM*RM AGE*AGE DIS*DIS RAD*RAD TAX*TAX PTRATIO*PTRATIO B
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 ... 0.0 0.289444 43.230625 4251.04 16.728100 1.0 87616.0 234.09 157529.61
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 ... 0.0 0.219961 41.229241 6225.21 24.672082 4.0 58564.0 316.84 157529.61
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 ... 0.0 0.219961 51.624225 3733.21 24.672082 4.0 58564.0 316.84 154315.40
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 ... 0.0 0.209764 48.972004 2097.64 36.750269 9.0 49284.0 349.69 155732.83
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 ... 0.0 0.209764 51.079609 2937.64 36.750269 9.0 49284.0 349.69 157529.61
5 rows × 26 columns
In [46]:
from sklearn import datasets
boston=datasets.load_boston()
In [78]:
print(boston.DESCR)
In [79]:
x=boston.data
y=boston.target
In [80]:
print(x.shape)
print(y.shape)
In [81]:
print(type(x))
print(type(y))
In [95]:
import pandas as pd
boston_data=pd.DataFrame(x,columns=boston.feature_names)
boston_data
Out[95]:
In [96]:
#Add a squared version of each feature as a new column
for i in boston.feature_names:
    new_f=i+'*'+i
    boston_data[new_f]=boston_data[i]**2
boston_data.head()
Out[96]:
In [98]:
x=boston_data.values
In [99]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7,random_state=1)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
In [88]:
from sklearn.linear_model import LinearRegression
In [100…
reg=LinearRegression()
In [101…
reg.fit(x_train,y_train)
Out[101…
In [102…
y_pred=reg.predict(x_test)
$R^2 = 1 - \frac{\sum_i (y^T_i - y^P_i)^2}{\sum_i (y^T_i - y^T_{mean})^2}$ (setting $y^P_i = y^T_{mean}$ for all $i$ gives a score of 0)
In [103…
reg.score(x_test,y_test)
Out[103…
In [104…
import matplotlib.pyplot as plt
plt.scatter(y_pred,y_test)
plt.grid()
plt.xlabel("Predictions")
plt.ylabel("Testing Data")
plt.axis([0,40,0,40])
plt.show()
In [105…
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
In [106…
iris=datasets.load_iris()
In [107…
print(iris.DESCR)
In [122…
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,train_size=0.7,random_state=1)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
In [123…
clf=DecisionTreeClassifier()
In [124…
clf.fit(x_train,y_train)
Out[124…
In [135…
y_pred=clf.predict(x_test)
In [131…
from sklearn.tree import export_graphviz
dot_data = export_graphviz(clf, feature_names=iris.feature_names,
                           class_names=iris.target_names, filled=True, rounded=True)
import pydotplus
from IPython.display import Image
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
Out[131…
In [137…
clf.score(x_test,y_test)
Out[137…
In [138…
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)
Out[138…
$\text{Gini Index} = 1 - \sum_{i=1}^{c} (p_i)^2$