Introduction to Pandas,
NumPy & Matplotlib
Pandas
● Pandas is a Python library that helps us read and manipulate data.
● With the pandas library you can take in data and view it as a human-readable table,
but the data can also be interpreted numerically so that you can do lots of computations
with it.
● We call the table of data a DataFrame.
Let's Start Pandas
● Let's start by importing pandas. It's standard practice to nickname it pd so that it's faster to type later on.
import pandas as pd
● We will be working with a dataset of Titanic passengers. For each passenger, we'll have some data on them as well
as whether or not they survived the sinking.
● Our data is stored as a CSV (comma-separated values) file.
● We're going to pull the data into pandas so we can view it as a DataFrame.
● The read_csv function takes a file in csv format and converts it to a Pandas DataFrame.
df = pd.read_csv('titanic.csv')
● The object df is now our pandas DataFrame with the Titanic dataset. Now we can use the head method to look at the
data. The head method returns the first 5 rows of the DataFrame.
print(df.head())
Summarize the Data
● Usually our data is much too big for us to be able to display it all.
Looking at the first few rows is the first step to understanding our data, but then we want to look at some
summary statistics.
● In pandas, we can use the describe method. It returns a table of statistics about the columns.
print(df.describe())
● For each column we see a few statistics. Note that it only gives statistics for the numerical columns.
● Let's review what each of these statistics means:
Count: This is the number of rows that have a value. In our case, every passenger has a value for each of the
columns, so the value is 887 (the total number of passengers).
Mean: Recall that the mean is the standard average.
Std: This is short for standard deviation. This is a measure of how dispersed the data is.
Min: The smallest value
25%: The 25th percentile
50%: The 50th percentile, also known as the median.
75%: The 75th percentile
Max: The largest value
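As a minimal sketch of what head and describe return, the example below uses a small made-up DataFrame instead of titanic.csv. The name toy_df and all of its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic file (values are made up).
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

first_rows = toy_df.head()     # first 5 rows by default
summary = toy_df.describe()    # statistics for the numerical columns only

print(first_rows)
print(summary)
```

Note that the Sex column does not appear in the summary, because describe skips non-numerical columns by default.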
Selecting a single Column
● We often will only want to deal with some of the columns that we have in our dataset. To select
a single column, we use the square brackets and the column name.
● In this example, we're selecting just the column with the passenger fares.
col = df['Fare']
print(col)
● The result is what we call a Pandas Series.
A series is like a DataFrame, but it's just a single column.
Selecting Multiple Columns
● We can also select multiple columns from our original DataFrame, creating a smaller
DataFrame. We're going to select just the Age, Sex, and Survived columns from our original DataFrame.
● We put these values in a list as follows:
['Age', 'Sex', 'Survived']
● Now we use that list inside of the bracket notation df[...]. When printing a large DataFrame
that's too big to display, you can use the head method to print just the first 5 rows.
small_df = df[['Age', 'Sex', 'Survived']]
print(small_df.head())
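The difference between single and double brackets can be checked directly. This is a small sketch on made-up data (toy_df and its values are assumptions): a single column name gives a Series, while a list of names gives a DataFrame.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.92],
})

fare = toy_df["Fare"]              # single column name -> pandas Series
subset = toy_df[["Age", "Sex"]]    # list of column names -> smaller DataFrame

print(type(fare).__name__)
print(type(subset).__name__)
print(subset.shape)
```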
Creating a Column
● We often want our data in a slightly different format than it originally comes in. For example,
our data has the sex of the passenger as a string ("male" or "female"). This is easy for a human
to read, but when we do computations on our data later on, we’ll want it as Boolean values
(Trues and Falses).
● We create a Pandas Series that will be a series of Trues and Falses (True if the passenger
is male and False if the passenger is female).
df['Sex'] == 'male'
● Now we want to create a column with this result. To create a new column, we use the same
bracket syntax (df['male']) and then assign this new value to it.
df['male'] = df['Sex'] == 'male'
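Here is a minimal sketch of the same step on a made-up three-row DataFrame (toy_df is an assumption). It also shows why Boolean columns are handy for computation: True counts as 1 when summed.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({"Sex": ["male", "female", "male"]})

# The comparison produces a Boolean Series, which we store as a new column.
toy_df["male"] = toy_df["Sex"] == "male"

male_count = toy_df["male"].sum()   # True is treated as 1, False as 0
print(toy_df)
print(male_count)
```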
Numpy
● Numpy is a Python package for manipulating lists and tables of numerical data. We can use it to
do a lot of statistical calculations. We call the list or table of data a numpy array.
● We often will take the data from our pandas DataFrame and put it in numpy arrays.
● Pandas DataFrames are great because we have the column names and other text data that makes
it human readable.
● A DataFrame, while easy for a human to read, is not the ideal format for doing calculations. The
numpy arrays are generally less human readable, but are in a format that enables the necessary
computation.
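Moving data from a DataFrame into a numpy array can be sketched as follows. The toy data is made up, and to_numpy is one common way to do the conversion (the older values attribute gives the same array).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy_df = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28]})

# Convert the selected columns into a 2-D numpy array (column names are dropped).
arr = toy_df[["Age", "Fare"]].to_numpy()

print(type(arr).__name__)
print(arr.shape)
```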
Let's Start Numpy
● Here too, let's start by importing numpy. It's standard practice to nickname it np so that it's
faster to type later on.
import numpy as np
● Now let's create an array (vector) in numpy
○ Row-wise:
vector_row = np.array([1,2,3])
print(vector_row)
○ Column-wise:
vector_col = np.array([[1],[2],[3]])
print(vector_col)
● Now let's create a matrix (2-D array)
matrix = np.array([[1,2,3],[4,5,6]])
print(matrix)
(This will print a matrix of size 2×3)
● When you need to select one or more elements in a vector or matrix
#Create a vector as a Row
vector_row = np.array([ 1,2,3,4,5,6 ])
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
#Select 3rd element of Vector
print(vector_row[2])
#Select 2nd row 2nd column
print(matrix[1,1])
#Select all elements of a vector
print(vector_row[:])
#Select everything up to and including the 3rd element
print(vector_row[:3])
#Select everything after the 3rd element
print(vector_row[3:])
#Select the last element
print(vector_row[-1])
#Select the first 2 rows and all the columns of the matrix
print(matrix[:2,:])
#Select all rows and the 2nd column of the matrix
print(matrix[:,1:2])
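One subtlety of column selection is worth a short sketch: indexing with a plain integer drops a dimension, while a slice keeps the result two-dimensional. The variable names below are chosen for this illustration.

```python
import numpy as np

vector_row = np.array([1, 2, 3, 4, 5, 6])
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_three = vector_row[:3]   # elements at positions 0, 1, 2
after_third = vector_row[3:]   # elements from position 3 onwards
col_1d = matrix[:, 1]          # integer index: 1-D array of the 2nd column
col_2d = matrix[:, 1:2]        # slice: still a 2-D array with a single column

print(first_three, after_third)
print(col_1d.shape, col_2d.shape)
```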
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
#View the Number of Rows and Columns
print(matrix.shape)
#View the number of elements (rows*columns)
print(matrix.size)
#View the number of Dimensions (2 in this case)
print(matrix.ndim)
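● Finding Minimum and Maximum
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix)
#Return the max element
print(np.max(matrix))
#Return the min element
print(np.min(matrix))
#To find the max element in each column
print(np.max(matrix,axis=0))
#To find the max element in each row
print(np.max(matrix,axis=1))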
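The axis argument can be confusing at first, so here is a short sketch with the results kept in named variables. axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row).

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_max = np.max(matrix, axis=0)    # one value per column
row_max = np.max(matrix, axis=1)    # one value per row
col_mean = np.mean(matrix, axis=0)  # the same idea works for other reductions

print(col_max, row_max, col_mean)
```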
● Calculating Mean, Variance and Standard Deviation
#Create a Matrix
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
#Mean
print(np.mean(matrix))
#Standard Deviation
print(np.std(matrix))
#Variance
print(np.var(matrix))
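One detail that often causes confusion: np.var and np.std divide by N by default (ddof=0), whereas pandas uses N - 1 (ddof=1) for its std, including the std row of describe. The sketch below compares the two on the same nine numbers.

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mean_all = np.mean(matrix)
var_pop = np.var(matrix)                       # population variance, divides by N
std_pop = np.std(matrix)                       # square root of var_pop
var_sample = np.var(matrix, ddof=1)            # sample variance, divides by N - 1
std_pandas = pd.Series(matrix.ravel()).std()   # pandas defaults to ddof=1

print(mean_all, var_pop, std_pop)
print(var_sample, std_pandas)
```

Passing ddof=1 to the numpy functions makes the two libraries agree.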
Matplotlib
● We use the matplotlib library to plot our data. Plotting the data can often help us
build intuition about our data.
● We first need to import matplotlib. It's standard practice to nickname it plt.
import matplotlib.pyplot as plt
● We use the scatter function to plot our data. The first argument of the scatter function gives the
x-axis values (horizontal direction) and the second argument gives the y-axis values (vertical direction).
plt.scatter(df['Age'], df['Fare'])
This plots the Age on the x-axis and the Fare on the y-axis.
● To make it easier to interpret, we can add x and y labels.
plt.xlabel('Age')
plt.ylabel('Fare')
● We can also use our data to color code our scatter plot. This will give each of the 3 classes a
different color. We add the c parameter and give it a Pandas Series. In this case, our Pandas Series
has 3 possible values (1st, 2nd, and 3rd class), so we'll see our datapoints each get one of three
colors.
plt.scatter(df['Age'], df['Fare'], c=df['Pclass'])
● With the default colormap, the purple dots are first class, the green dots are second class, and the yellow dots are third class.
● A scatter plot is used to show all the values from your data on a graph. In order to get a visual
representation of our data, we have to limit our data to two features.
● Now that we can put individual datapoints on a plot, let's see how to draw a line.
The plot function does just that. The following draws a line to approximately separate the 1st
class from the 2nd and 3rd class. From eyeballing, we’ll put the line from (0, 85) to (80, 5). Our
syntax below has a list of the x values and a list of the y values.
plt.plot([0, 80], [85, 5])
● In matplotlib, we use the scatter function to create a scatter plot and the plot function for a line.
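Putting the plotting steps together, here is a minimal self-contained sketch. The five data points are made up (they are not taken from the Titanic file), and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs anywhere
import matplotlib.pyplot as plt

# Hypothetical toy values standing in for df['Age'], df['Fare'] and df['Pclass'].
ages = [22, 38, 26, 35, 54]
fares = [7.25, 71.28, 7.92, 53.10, 51.86]
pclass = [3, 1, 3, 1, 1]

plt.scatter(ages, fares, c=pclass)        # one dot per passenger, colored by class
lines = plt.plot([0, 80], [85, 5])        # straight line from (0, 85) to (80, 5)
plt.xlabel("Age")
plt.ylabel("Fare")
```

In a notebook the figure is displayed automatically; in a script, plt.show() or plt.savefig would be the usual next step.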
Before proceeding, we will talk about the
following:
❏ Machine learning definition and types
❏ Linear Regression (both single and multivariate)
❏ Decision Tree and Random Forest
What is Machine Learning?
Tom Mitchell provides a more modern definition: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad
classifications:
supervised learning
or unsupervised learning.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct
output should look like, having the idea that there is a relationship between the input
and the output.
Supervised learning problems are categorized into "regression" and "classification"
problems. In a regression problem, we are trying to predict results within a continuous
output, meaning that we are trying to map input variables to some continuous function.
In a classification problem, we are instead trying to predict results in a discrete output.
(a) Regression - Given a picture of a person, we have to predict their age on the
basis of the picture.
(b) Classification - A bank has to decide whether or not to give someone a loan on the
basis of their credit history.
Linear Regression with One Variable
Univariate linear regression is used when you want to predict a single output value y
from a single input value x. We're doing supervised learning and regression here, so that
means we already have an idea about what the input/output cause and effect should
be.
The Hypothesis Function
Our hypothesis function has the general form:
$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$
Note that this is like the equation of a straight line. We give $h_\theta(x)$ values for $\theta_0$ and
$\theta_1$ to get our estimated output $\hat{y}$.
In other words, we are trying to create a function called $h_\theta$ that maps our
input data (the x's) to our output data (the y's).
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This
takes an average (actually a fancier version of an average) of all the results of the
hypothesis with inputs from x's compared to the actual output y's.
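The "fancier version of an average" is commonly taken to be half the mean squared error, J(θ0, θ1) = (1/(2m)) Σ (hθ(x_i) − y_i)². The sketch below assumes that convention (the factor 1/2 and the function names are assumptions, not taken from the slides) and evaluates it on three made-up points.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """Straight-line hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """Half the mean squared error between predictions and actual outputs."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return float((errors ** 2).sum() / (2 * m))

# Toy data lying exactly on the line y = 2x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

perfect_fit = cost(0.0, 2.0, x, y)   # the line y = 2x matches the data exactly
poor_fit = cost(0.0, 1.0, x, y)      # the line y = x misses every point
print(perfect_fit, poor_fit)
```

A lower cost means the hypothesis matches the data more closely, so training amounts to searching for the θ values that minimize it.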
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient
descent equation can be derived. We can substitute our actual cost function and our
actual hypothesis function and modify the update equation accordingly. The derivation of the
formulas is out of the scope of this course, but a really great one can be found in
"Linear regression (2): Gradient descent" by Alexander Ihler.
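As a rough sketch of gradient descent for one-variable linear regression, the code below assumes the half mean squared error cost, which gives the updates θ0 := θ0 − α·(1/m)·Σ(hθ(x_i) − y_i) and θ1 := θ1 − α·(1/m)·Σ(hθ(x_i) − y_i)·x_i. The data, the learning rate alpha, and the iteration count are all made-up choices for illustration.

```python
import numpy as np

# Toy data generated from the line y = 1 + 2x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

theta0, theta1 = 0.0, 0.0   # starting guess
alpha = 0.05                # learning rate (assumed value)
m = len(x)

for _ in range(5000):
    error = (theta0 + theta1 * x) - y       # h(x_i) - y_i for every point
    grad0 = error.sum() / m                 # partial derivative w.r.t. theta0
    grad1 = (error * x).sum() / m           # partial derivative w.r.t. theta1
    # Update both parameters simultaneously.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

print(theta0, theta1)
```

If alpha is too large the updates overshoot and diverge; if it is too small, many more iterations are needed.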
Linear Regression with Multiple Variables
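We now introduce notation for equations where we can have any number of input
variables. With n input variables $x_1, \dots, x_n$, the hypothesis becomes
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$
Using the definition of matrix multiplication, our multivariable hypothesis function can
be concisely represented as:
$h_\theta(x) = \theta^T x$, where $x_0 = 1$ by convention.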
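In code, the matrix form is a single matrix-vector product. This sketch assumes the common convention of adding a leading column of ones to the inputs so that θ0 acts as the intercept; the numbers are made up.

```python
import numpy as np

# Two examples with two input variables each (made-up values).
features = np.array([[1.0, 1.0],
                     [2.0, 0.0]])

# Prepend a column of ones (x0 = 1) so theta[0] plays the role of the intercept.
X = np.hstack([np.ones((features.shape[0], 1)), features])

theta = np.array([1.0, 2.0, 3.0])   # theta0, theta1, theta2

predictions = X @ theta             # one prediction per row of X
print(predictions)
```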
Decision Tree
Entropy
Entropy is a measure of the randomness (impurity) of a system.
Information gain
Information Gain is applied to quantify which feature provides maximal information
about the classification based on the notion of entropy
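To make entropy and information gain concrete, here is a small sketch assuming the usual definitions: entropy H = −Σ p_i·log2(p_i) over the class proportions, and information gain as the parent's entropy minus the size-weighted entropy of the child groups. The function names and labels are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [1, 1, 0, 0]                                        # evenly mixed
clean_split = information_gain(parent, [[1, 1], [0, 0]])     # both children pure
useless_split = information_gain(parent, [[1, 0], [1, 0]])   # children as mixed as parent
print(entropy(parent), clean_split, useless_split)
```

A split that separates the classes perfectly recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent gains nothing.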
Gini index
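The Gini Index is determined by subtracting the sum of the squared probabilities of each
class from one. Mathematically, the Gini Index can be expressed as:
$\text{Gini Index} = 1 - \sum_{i=1}^{n} p_i^2$, where $p_i$ is the probability of class $i$.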
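The verbal definition above (one minus the sum of squared class probabilities) translates directly into a few lines of code. The function name and the example labels are made up for illustration.

```python
import numpy as np

def gini_index(labels):
    """One minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

pure = gini_index([0, 0, 0, 0])              # a single class: no impurity
two_class = gini_index([0, 0, 1, 1])         # two equally likely classes
three_class = gini_index([0, 1, 2, 0, 1, 2]) # three equally likely classes
print(pure, two_class, three_class)
```

A pure node has a Gini index of 0, which is why splits that produce lower values are preferred.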
Some other concepts
R2 or R-squared
Confusion matrix
A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
Classification accuracy alone can be misleading if you have an unequal number of
observations in each class or if you have more than two classes in your dataset.
Accuracy
Overall, how often is the classifier correct?
○ (TP+TN)/total = (100+50)/165 ≈ 0.91
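The accuracy calculation can be reproduced in a few lines. TP = 100, TN = 50 and the total of 165 come from the example above; the split of the remaining 15 errors into 10 false positives and 5 false negatives is an assumption made only so that the counts add up.

```python
# Counts from the example: TP and TN are given, FP and FN are assumed (they must sum to 15).
tp, tn = 100, 50
fp, fn = 10, 5   # hypothetical split of the 15 misclassified cases

total = tp + tn + fp + fn
accuracy = (tp + tn) / total

print(total, round(accuracy, 2))
```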
Now let's code
What is sklearn?
Scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms such as linear regression, k-NN and random forests.
The typical workflow has four steps:
1. Loading the data
2. Splitting the data
3. Generating the model
4. Evaluating the accuracy
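The four steps can be sketched end to end as follows. Because load_boston was removed in scikit-learn 1.2, this sketch uses load_diabetes as a stand-in dataset; that substitution is an assumption made so the example runs on current releases, not the dataset used in the slides.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Loading the data (load_diabetes is a stand-in chosen for this sketch)
diabetes = datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# 2. Splitting the data: 70% for training, the rest for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, random_state=1)

# 3. Generating the model
reg = LinearRegression()
reg.fit(x_train, y_train)

# 4. Evaluating the accuracy (for regressors, score returns R^2)
y_pred = reg.predict(x_test)
score = reg.score(x_test, y_test)
print(x_train.shape, x_test.shape, score)
```

Fixing random_state makes the split reproducible, so repeated runs give the same score.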
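The coefficient of determination, $R^2$
The score can be calculated using the following formula:
$R^2 = 1 - \frac{\sum_i (T_i - P_i)^2}{\sum_i (T_i - \bar{T})^2}$
T : true/actual value.
P : predicted value.
$\bar{T}$ : mean of the true values.
A score of 0 represents that we are predicting no better than the mean value (meaning, all the answers are the mean value of the
data).
To see this, let us put $P_i = \bar{T}$ for every i.
This will give the score as 0. The best possible score is 1, and a model that predicts worse than the mean gets a negative score.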
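As a quick numerical check of the coefficient of determination, the sketch below assumes the usual definition R² = 1 − Σ(T − P)² / Σ(T − mean(T))² and evaluates it for three made-up sets of predictions.

```python
import numpy as np

def r2_score_manual(true, pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    ss_res = ((true - pred) ** 2).sum()
    ss_tot = ((true - true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

T = [3.0, 5.0, 7.0, 9.0]                          # made-up true values, mean is 6

perfect = r2_score_manual(T, T)                   # predictions equal the truth
mean_only = r2_score_manual(T, [6.0] * 4)         # always predict the mean
worse = r2_score_manual(T, [9.0, 7.0, 5.0, 3.0])  # predictions in reverse order
print(perfect, mean_only, worse)
```

The third case shows that the score is not bounded below by 0: predictions that are worse than the mean give a negative value.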
De-Cluttering-ML | TechWeekends

.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive.
:Creator: Harrison, D. and Rubinfeld, D.L.
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

(506, 13)
(506,)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

(354, 26)
(152, 26)

LinearRegression()

0.8787006807787356

Decision Tree

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's paper. Note that it's the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.

(105, 4)
(45, 4)

DecisionTreeClassifier()

Graphviz - Graphviz is open source graph visualization software. Graph visualization is a way of representing structural information such as diagrams of abstract graphs and networks.
export_graphviz - This function generates a GraphViz representation of the decision tree in dot format, which is then written into an output file ('out_file').
Pydotplus - PyDotPlus provides a Python Interface to Graphviz's Dot language.

0.9555555555555556
array([[14, 0, 0],
[ 0, 17, 1],
[ 0, 1, 12]], dtype=int64)

The algorithm uses the Gini Index for constructing the decision tree.
A feature with a lower Gini index is chosen for a split.

from sklearn import datasets
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this line needs an older release.
boston=datasets.load_boston()
# The cell that built x and y did not survive extraction. The lines below are a
# reconstruction (an assumption) based on the (354, 26) and (152, 26) shapes shown earlier:
# the 13 original features plus their squares give 26 columns.
# The '_sq' column names are hypothetical.
x=pd.DataFrame(boston.data,columns=boston.feature_names)
for i in boston.feature_names:
    x[i+'_sq']=x[i]**2
y=boston.target

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7,random_state=1)

from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(x_train,y_train)
y_pred=reg.predict(x_test)
reg.score(x_test,y_test)

import matplotlib.pyplot as plt
plt.scatter(y_pred,y_test)
plt.grid()
plt.xlabel("Predictions")
plt.ylabel("Testing Data")

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
iris=datasets.load_iris()
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,train_size=0.7,random_state=1)
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)

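from sklearn.tree import export_graphviz
dot_data = export_graphviz(clf, feature_names=iris.feature_names,
                           class_names=iris.target_names, filled=True, rounded=True)
import pydotplus
from IPython.display import Image
graph = pydotplus.graph_from_dot_data(dot_data)
# Render the tree as a PNG in the notebook (requires Graphviz to be installed).
# This display call is assumed; the original line did not survive extraction.
Image(graph.create_png())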

clf.score(x_test,y_test)

from sklearn.metrics import confusion_matrix
# confusion_matrix expects the true labels first, then the predictions
confusion_matrix(y_test,y_pred)