Introduction to Pandas,
NumPy & Matplotlib
Pandas
● Pandas is a Python module that helps us read and manipulate data.
● With the help of the pandas library, you can take in data and view it as a table that's human-readable, but it can also be interpreted numerically so that you can do lots of computations
with it.
● We call the table of data a DataFrame.
Let's Start Pandas
● Let's start by importing pandas. It's standard practice to nickname it pd so that it's faster to type later on.
import pandas as pd
● We will be working with a dataset of Titanic passengers. For each passenger, we’ll have some data on them as well
as whether or not they survived the sinking.
● Our data is stored as a CSV (comma-separated values) file.
● We're going to pull the data into pandas so we can view it as a DataFrame.
● The read_csv function takes a file in csv format and converts it to a Pandas DataFrame.
df = pd.read_csv('titanic.csv')
● The object df is now our pandas DataFrame with the Titanic dataset. Now we can use the head method to look at the
data.
The head method returns the first 5 rows of the DataFrame.
print(df.head())
Summarize the Data
● Usually our data is much too big for us to be able to display it all.
Looking at the first few rows is the first step to understanding our data, but then we want to look at some
summary statistics.
● In pandas, we can use the describe method. It returns a table of statistics about the columns.
print(df.describe())
● For each column we see a few statistics. Note that it only gives statistics for the numerical columns.
● Let's review what each of these statistics means:
Count: This is the number of rows that have a value. In our case, every passenger has a value for each of the
columns, so the value is 887 (the total number of passengers).
Mean: Recall that the mean is the standard average.
Std: This is short for standard deviation. This is a measure of how dispersed the data is.
Min: The smallest value
25%: The 25th percentile
50%: The 50th percentile, also known as the median.
75%: The 75th percentile
Max: The largest value
Selecting a single Column
● We often will only want to deal with some of the columns that we have in our dataset. To select
a single column, we use the square brackets and the column name.
● In this example, we're selecting just the column with the passenger fares.
col = df['Fare']
print(col)
● The result is what we call a Pandas Series.
A Series is like a DataFrame, but it's just a single column.
Selecting Multiple Columns
● We can also select multiple columns from our original DataFrame, creating a smaller
DataFrame.
We're going to select just the Age, Sex, and Survived columns from our original DataFrame.
● We put these values in a list as follows:
['Age', 'Sex', 'Survived']
● Now we use that list inside of the bracket notation df[...]. When printing a large DataFrame
that’s too big to display, you can use the head method to print just the first 5 rows.
small_df = df[['Age', 'Sex', 'Survived']]
print(small_df.head())
Creating a Column
● We often want our data in a slightly different format than it originally comes in. For example,
our data has the sex of the passenger as a string ("male" or "female"). This is easy for a human
to read, but when we do computations on our data later on, we’ll want it as Boolean values
(Trues and Falses).
● We create a Pandas Series that will be a series of Trues and Falses (True if the passenger
is male and False if the passenger is female).
● df['Sex'] == 'male'
● Now we want to create a column with this result. To create a new column, we use the same
bracket syntax (df['male']) and then assign this new value to it.
df['male'] = df['Sex'] == 'male'
NumPy
● NumPy is a Python package for manipulating lists and tables of numerical data. We can use it to
do a lot of statistical calculations. We call the list or table of data a NumPy array.
● We often will take the data from our pandas DataFrame and put it in numpy arrays.
● Pandas DataFrames are great because we have the column names and other text data that makes
it human readable.
● A DataFrame, while easy for a human to read, is not the ideal format for doing calculations. The
numpy arrays are generally less human readable, but are in a format that enables the necessary
computation.
Let's Start NumPy
● Again, let's start by importing NumPy. It's standard practice to nickname it np so that it's
faster to type later on.
import numpy as np
● Now let's create an array (vector) in NumPy:
○ Row-wise:
vector_row = np.array([1,2,3])
print(vector_row)
○ Column-wise:
vector_col = np.array([[1],[2],[3]])
print(vector_col)
● Now let's create a matrix (2-D array):
matrix = np.array([[1,2,3],[4,5,6]])
(This creates a matrix of size 2×3.)
● When you need to select one or more elements in a vector or matrix:
#Create a vector as a Row
vector_row = np.array([ 1,2,3,4,5,6 ])
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
#Select 3rd element of Vector
print(vector_row[2])
#Select 2nd row 2nd column
print(matrix[1,1])
#Select all elements of a vector
print(vector_row[:])
#Select everything up to and including the 3rd element
print(vector_row[:3])
#Select everything after the 3rd element
print(vector_row[3:])
#Select the last element
print(vector_row[-1])
#Select the first 2 rows and all the columns of the matrix
print(matrix[:2,:])
#Select all rows and the 2nd column of the matrix
print(matrix[:,1:2])
#Create a Matrix
matrix =np.array([[1,2,3],[4,5,6],[7,8,9]])
#View the Number of Rows and Columns
print(matrix.shape)
#View the number of elements (rows*columns)
print(matrix.size)
#View the number of Dimensions(2 in this case)
print(matrix.ndim)
● Finding Minimum and Maximum
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Return the max element
print(np.max(matrix))
#Return the min element
print(np.min(matrix))
#To find the max element in each column
print(np.max(matrix,axis=0))
#To find the max element in each row
print(np.max(matrix,axis=1))
● Calculating Mean, Variance and Standard Deviation
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Mean
print(np.mean(matrix))
#Standard Deviation.
print(np.std(matrix))
#Variance
print(np.var(matrix))
Matplotlib
● We use the matplotlib library to plot our data. Plotting the data can often help us
build intuition about it.
● We first need to import matplotlib. It’s standard practice to nickname it plt.
import matplotlib.pyplot as plt
● We use the scatter function to plot our data. The first argument of the scatter function gives the values for the
x-axis (horizontal direction) and the second argument gives the values for the y-axis (vertical direction).
plt.scatter(df['Age'], df['Fare'])
This plots the Age on the x-axis and the Fare on the y-axis.
● To make it easier to interpret, we can add x and y labels.
plt.xlabel('Age')
plt.ylabel('Fare')
● We can also use our data to color code our scatter plot. This will give each of the 3 classes a
different color. We add the c parameter and give it a Pandas series. In this case, our Pandas series
has 3 possible values (1st, 2nd, and 3rd class), so we'll see our datapoints each get one of three
colors.
plt.scatter(df['Age'], df['Fare'], c=df['Pclass'])
● The purple dots are first class, the green dots are second class, and the yellow dots are third class.
● A scatter plot is used to show all the values from your data on a graph. In order to get a visual
representation of our data, we have to limit our data to two features.
● Now that we can put individual datapoints on a plot, let's see how to draw the line.
The plot function does just that. The following draws a line to approximately separate the 1st
class from the 2nd and 3rd class. From eyeballing, we’ll put the line from (0, 85) to (80, 5). Our
syntax below has a list of the x values and a list of the y values.
plt.plot([0, 80], [85, 5])
● In matplotlib, we use the scatter function to create a scatter plot and the plot function for a line.
Before proceeding, we will be talking about the
following:
❏ Machine learning definition and types
❏ Linear Regression (both single and multivariate)
❏ Decision Tree and Random Forest
What is Machine Learning?
Tom Mitchell provides a more modern definition: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad
classifications:
supervised learning
OR
unsupervised learning.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct
output should look like, having the idea that there is a relationship between the input
and the output.
Supervised learning problems are categorized into "regression" and "classification"
problems. In a regression problem, we are trying to predict results within a continuous
output, meaning that we are trying to map input variables to some continuous function.
In a classification problem, we are instead trying to predict results in a discrete output.
(a) Regression - Given a picture of a person, we have to predict their age on the
basis of the given picture.
(b) Classification - Banks have to decide whether or not to give a loan to someone on the
basis of their credit history.
Linear Regression with One Variable
Univariate linear regression is used when you want to predict a single output value y
from a single input value x. We're doing supervised learning and regression here, so that
means we already have an idea about what the input/output cause and effect should
be.
The Hypothesis Function
Our hypothesis function has the general form:
$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
Note that this is like the equation of a straight line. We give $h_\theta(x)$ values for $\theta_0$ and
$\theta_1$ to get our estimated output $\hat{y}$.
In other words, we are trying to create a function called hθ that is trying to map our
input data (the x's) to our output data (the y's).
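As an illustrative sketch (not from the original slides; the θ values here are made up for the example), the hypothesis is just a line evaluated at x:
import numpy as np

def h(theta0, theta1, x):
    # Univariate hypothesis: y-hat = theta0 + theta1 * x
    return theta0 + theta1 * x

x = np.array([1.0, 2.0, 3.0])
print(h(2.0, 0.5, x))  # [2.5 3.  3.5]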
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This
takes an average (actually a fancier version of an average) of all the results of the
hypothesis with inputs from x's compared to the actual output y's.
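Presumably the cost function being described is the standard mean-squared-error cost:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
where $m$ is the number of training examples; the factor $\frac{1}{2m}$ (rather than $\frac{1}{m}$) is the "fancier version of an average" and simplifies the derivative.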
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient
descent equation can be derived. We can substitute our actual cost function and our
actual hypothesis function and modify the equation to (the derivation of the formulas
are out of the scope of this course, but a really great one can be found here):
Linear regression (2): Gradient descent by Alexander Ihler
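Presumably the modified equations are the standard gradient descent update rules for linear regression, repeated until convergence:
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
where $\alpha$ is the learning rate.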
Linear Regression with Multiple Variables
We now introduce notation for equations where we can have any number of input
variables.
Using the definition of matrix multiplication, our multivariable hypothesis function can
be concisely represented as:
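The standard concise form (the original equation image is assumed here) is:
$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n, \quad \text{with } x_0 = 1.$$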
Decision Tree
Entropy
Entropy is the measure of randomness of the system
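For a system whose $c$ classes occur with proportions $p_i$, the usual formula (an assumption, since no equation appears on the slide) is:
$$H = -\sum_{i=1}^{c} p_i \log_2 p_i$$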
Information gain
Information Gain is applied to quantify which feature provides maximal information
about the classification based on the notion of entropy
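The standard definition, splitting a set $S$ on a feature $A$ (again an assumption, as the slide shows no formula):
$$IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)$$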
Gini index
The Gini Index is determined by deducting the sum of squared of probabilities of each
class from one, mathematically, Gini Index can be expressed as:
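That is (this formula also appears in the notebook at the end of this document):
$$\text{Gini Index} = 1 - \sum_{i=1}^{c} (p_i)^2$$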
Some other concepts
R² or R-squared
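In brief, with $y^T$ the true values and $y^P$ the predictions (the formula is worked through in the notebook section below):
$$R^2 = 1 - \frac{\sum_i (y^T_i - y^P_i)^2}{\sum_i (y^T_i - y^T_{\mathrm{mean}})^2}$$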
Confusion matrix
A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
Classification accuracy alone can be misleading if you have an unequal number of
observations in each class or if you have more than two classes in your dataset.
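The accuracy example below uses 165 predictions; a plausible confusion matrix behind those numbers (the slide image itself is missing, so the FP/FN cells are assumed from the classic version of this example) is:

                Predicted: No   Predicted: Yes
Actual: No      TN = 50         FP = 10
Actual: Yes     FN = 5          TP = 100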
Accuracy
Overall, how often is the classifier correct?
○ (TP+TN)/total = (100+50)/165 = 0.91
Now let's code
What is sklearn?
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms like linear regression, k-NN, random forests, etc.
1. Loading the Data
2. Splitting the Data
3. Generate The Model
4. Evaluate The accuracy
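A minimal sketch of those four steps, mirroring the Iris notebook later in this document (defaults assumed for all hyperparameters):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Loading the data
iris = datasets.load_iris()

# 2. Splitting the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.7, random_state=1)

# 3. Generating (fitting) the model
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)

# 4. Evaluating the accuracy on the held-out test set
print(clf.score(x_test, y_test))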
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
    - CRIM: per capita crime rate by town
    - ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS: proportion of non-retail business acres per town
    - CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX: nitric oxides concentration (parts per 10 million)
    - RM: average number of rooms per dwelling
    - AGE: proportion of owner-occupied units built prior to 1940
    - DIS: weighted distances to five Boston employment centres
    - RAD: index of accessibility to radial highways
    - TAX: full-value property-tax rate per $10,000
    - PTRATIO: pupil-teacher ratio by town
    - B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT: % lower status of the population
    - MEDV: Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of the UCI ML housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter. The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

    - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
    - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings of the Tenth International Conference on Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

(506, 13)

(506,)

<class 'numpy.ndarray'>

<class 'numpy.ndarray'>

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88
506 rows × 13 columns
(354, 26)

(354,)

(152, 26)

(152,)

LinearRegression()
The score here is the coefficient of determination, $R^2$, calculated using the following formula:

$$R^2 = 1 - \frac{\sum_i (y^T_i - y^P_i)^2}{\sum_i (y^T_i - y^T_{\mathrm{mean}})^2}$$

where T is the true/actual value and P is the predicted value. An $R^2$ of 0 means we are predicting no better than the mean value of the data (i.e., every answer is the mean of the data). To prove this, let us put $y^P_i = y^T_{\mathrm{mean}}$ for every $i$: the numerator equals the denominator, and the score is 0. (Note that 0 is not the worst possible score; a model can do worse than the mean and score below 0.)
0.8787006807787356
Decision Tree
.. _iris_dataset:

Iris plants dataset
-------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class: Iris-Setosa, Iris-Versicolour, Iris-Virginica

:Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max  Mean    SD    Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9  5.84    0.83   0.7826
    sepal width:    2.0  4.4  3.05    0.43  -0.4194
    petal length:   1.0  6.9  3.76    1.76   0.9490 (high!)
    petal width:    0.1  2.5  1.20    0.76   0.9565 (high!)
    ============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's paper. Note that it's the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

.. topic:: References

    - Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
    - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
    - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
    - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.
    - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II conceptual clustering system finds 3 classes in the data.
    - Many, many more ...

(105, 4)

(105,)

(45, 4)

(45,)
DecisionTreeClassifier()
Graphviz - Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information such as diagrams of abstract graphs and networks.
export_graphviz - This function generates a GraphViz representation of the decision tree in dot format, which is then written into an output file ('out_file').
Pydotplus - PyDotPlus provides a Python Interface to Graphviz’s Dot language.
0.9555555555555556
array([[14,  0,  0],
       [ 0, 17,  1],
       [ 0,  1, 12]], dtype=int64)
The algorithm uses the Gini Index for constructing the decision tree.
A feature with a lower Gini index is chosen for a split.
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX ... CHAS*CHAS NOX*NOX RM*RM AGE*AGE DIS*DIS RAD*RAD TAX*TAX PTRATIO*PTRATIO B
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 ... 0.0 0.289444 43.230625 4251.04 16.728100 1.0 87616.0 234.09 157529.61
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 ... 0.0 0.219961 41.229241 6225.21 24.672082 4.0 58564.0 316.84 157529.61
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 ... 0.0 0.219961 51.624225 3733.21 24.672082 4.0 58564.0 316.84 154315.40
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 ... 0.0 0.209764 48.972004 2097.64 36.750269 9.0 49284.0 349.69 155732.83
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 ... 0.0 0.209764 51.079609 2937.64 36.750269 9.0 49284.0 349.69 157529.61
5 rows × 26 columns
In [46]:
from sklearn import datasets

boston=datasets.load_boston()

In [78]:
print(boston.DESCR)

In [79]:
x=boston.data

y=boston.target

In [80]:
print(x.shape)

print(y.shape)

In [81]:
print(type(x))

print(type(y))

In [95]:
import pandas as pd

boston_data=pd.DataFrame(x,columns=boston.feature_names)

boston_data
Out[95]:
In [96]:
for i in boston.feature_names:
    new_f = i + '*' + i
    boston_data[new_f] = boston_data[i]**2

boston_data.head()

Out[96]:
In [98]:
x=boston_data.values

In [99]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7,random_state=1)

print(x_train.shape)

print(y_train.shape)

print(x_test.shape)

print(y_test.shape)

In [88]:
from sklearn.linear_model import LinearRegression

In [100]:
reg=LinearRegression()

In [101]:
reg.fit(x_train,y_train)

Out[101]:
In [102]:
y_pred=reg.predict(x_test)

$$R^2 = 1 - \frac{\sum_i \left(y^T_i - y^P_i\right)^2}{\sum_i \left(y^T_i - y^T_{\mathrm{mean}}\right)^2}$$

where setting $y^P_i = y^T_{\mathrm{mean}}$ for every $i$ gives $R^2 = 0$.
In [103]:
reg.score(x_test,y_test)

Out[103]:
In [104]:
import matplotlib.pyplot as plt

plt.scatter(y_pred,y_test)

plt.grid()

plt.xlabel("Predictions")

plt.ylabel("Testing Data")

plt.axis([0,40,0,40])

plt.show()

In [105]:
from sklearn import datasets

from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

In [106]:
iris=datasets.load_iris()

In [107]:
print(iris.DESCR)

In [122]:
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,train_size=0.7,random_state=1)

print(x_train.shape)

print(y_train.shape)

print(x_test.shape)

print(y_test.shape)

In [123]:
clf=DecisionTreeClassifier()

In [124]:
clf.fit(x_train,y_train)

Out[124]:
In [135]:
y_pred=clf.predict(x_test)

In [131]:
from sklearn.tree import export_graphviz



dot_data = export_graphviz(clf,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True)



import pydotplus

from IPython.display import Image



graph = pydotplus.graph_from_dot_data(dot_data)

Image(graph.create_png())

Out[131]:
In [137]:
clf.score(x_test,y_test)

Out[137]:
In [138]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_pred,y_test)

Out[138]:
$$\text{Gini Index} = 1 - \sum_{i=1}^{c} (p_i)^2$$