Predicting Model for Prices of Used Cars
September 2, 2021
1. Project Motivation
2. Dataset Used
3. Analysis Plan
4. EDA
5. EDA Summary
6. Model Development
7. Result/Conclusion
8. References
Project Motivation
The price of a new car is generally fixed by the manufacturer, with some additional or hidden
costs imposed by the government, so customers buying a new car can be reasonably sure that the
price reflects its worth. However, because new cars are increasingly expensive and many customers
lack the capital to buy them, used car sales are increasing globally. There is therefore a need
for a used car price prediction system that estimates the worth of a car from a variety of
features. This project builds a model to predict the sale price of used BMW cars from the
available data.
Dataset Used
The dataset is taken from the datacamp/careerhub-data GitHub repository. It contains 8
input variables (features) and 1 output variable (target). The target variable is the price column.
Below is the description of the dataset.
Features information:
model (Categorical) - model of the car
year (Numerical) - the year the car was first driven
mileage (Numerical) - total miles the car has been driven
mpg (Numerical) - fuel economy in miles per gallon
fuelType (Categorical) - type of fuel used
transmission (Categorical) - type of gear transmission
tax (Numerical) - tax on the car
engineSize (Numerical) - size of the engine
Target information:
price (Numerical) - price of the car in US dollars
Analysis Plan
Since the dataset contains a known target (price), predicting it is a supervised machine learning task.
The following tasks are carried out in this project:
* Understand the problem statement
* Inspect the dataset
* Perform exploratory data analysis to get insights such as outliers, relationships between
variables, and incorrect values in the dataset
* Prepare the dataset for model building
* Fit the model (train, tune and evaluate the model)
* Report results and findings
Exploratory Data Analysis
In this section, we apply several analysis techniques to get to know the dataset: loading the
data, computing summary statistics, inspecting the feature types, and visualizing how the
features relate to each other.
[1]: # importing necessary libraries for EDA
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
[2]: car_df = pd.read_csv('bmw.txt') # reading the data set for analysis
car_df.info() # to check the already assigned data types for the variables
print('Dimension of dataset::',car_df.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10781 non-null object
1 year 10781 non-null int64
2 price 10781 non-null int64
3 transmission 10781 non-null object
4 mileage 10781 non-null int64
5 fuelType 10781 non-null object
6 tax 10781 non-null int64
7 mpg 10781 non-null float64
8 engineSize 10781 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB
Dimension of dataset:: (10781, 9)
The dataset has 10781 rows and 9 columns.
There are 3 categorical variables, stored with the object dtype.
There are no null values, since the non-null count of every column equals the number of rows.
[3]: car_df.head(3) #to get the glimpse of the dataset
[3]:      model  year  price transmission  mileage fuelType  tax   mpg  engineSize
0   5 Series  2014  11200    Automatic    67068   Diesel  125  57.6         2.0
1   6 Series  2018  27000    Automatic    14827   Petrol  145  42.8         2.0
2   5 Series  2016  16000    Automatic    62794   Diesel  160  51.4         3.0
[4]: print('Number of Duplicated rows in the dataset::',car_df.duplicated().sum())
Number of Duplicated rows in the dataset:: 117
[5]: car_df.describe()# to check summary statistics on numerical features
[5]:           year          price        mileage           tax           mpg    engineSize
count  10781.000000   10781.000000   10781.000000  10781.000000  10781.000000  10781.000000
mean    2017.078935   22733.408867   25496.986550    131.702068     56.399035      2.167767
std        2.349038   11415.528189   25143.192559     61.510755     31.336958      0.552054
min     1996.000000    1200.000000       1.000000      0.000000      5.500000      0.000000
25%     2016.000000   14950.000000    5529.000000    135.000000     45.600000      2.000000
50%     2017.000000   20462.000000   18347.000000    145.000000     53.300000      2.000000
75%     2019.000000   27940.000000   38206.000000    145.000000     62.800000      2.000000
max     2020.000000  123456.000000  214000.000000    580.000000    470.800000      6.600000
From the table above, the minimum values of the engineSize and tax features are zero, but the
engine size of a vehicle cannot plausibly be zero.
The affected rows can be dropped or imputed, depending on how many rows are affected.
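The two options mentioned above (dropping or imputing) can be sketched on a small toy frame. This is only an illustration: the values below are hypothetical and not taken from the BMW dataset, and the median is just one possible fill value.

```python
import pandas as pd

# Toy data with hypothetical values (not rows from bmw.txt)
toy = pd.DataFrame({
    "engineSize": [2.0, 0.0, 3.0, 1.5, 0.0, 2.0],
    "price": [11200, 9000, 16000, 12750, 8000, 27000],
})

invalid = toy["engineSize"] == 0          # rows with an implausible engine size
share_invalid = invalid.mean() * 100      # percentage of affected rows

# Option 1: drop the affected rows
dropped = toy.loc[~invalid]

# Option 2: impute with the median of the valid values
median_size = toy.loc[~invalid, "engineSize"].median()
imputed = toy.copy()
imputed.loc[invalid, "engineSize"] = median_size

print(f"{share_invalid:.1f}% of rows affected")
print(len(dropped), "rows remain after dropping")
```

Dropping is the simpler choice when the affected share is small; imputing keeps the rows but introduces values that were never observed.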
[6]: print('percent of values in which EngineSize is zero ::',
      len(car_df.loc[car_df['engineSize'] == 0]) / len(car_df) * 100)
print('percent of values in which tax is zero ::',
      len(car_df.loc[car_df['tax'] == 0]) / len(car_df) * 100)
percent of values in which EngineSize is zero :: 0.4359521380205918
percent of values in which tax is zero :: 3.15369631759577
[7]: # check the categorical values for typos or duplicated labels
for i in car_df.columns:
    if car_df[i].dtypes == 'object':
        print(car_df[i].unique(), '\n')
[' 5 Series' ' 6 Series' ' 1 Series' ' 7 Series' ' 2 Series' ' 4 Series'
' X3' ' 3 Series' ' X5' ' X4' ' i3' ' X1' ' M4' ' X2' ' X6' ' 8 Series'
' Z4' ' X7' ' M5' ' i8' ' M2' ' M3' ' M6' ' Z3']
['Automatic' 'Manual' 'Semi-Auto']
['Diesel' 'Petrol' 'Other' 'Hybrid' 'Electric']
The categorical features above appear to contain no typos, although every model value has a leading space.
[8]: plt.rcParams['figure.figsize'] = (20, 8)
# correlation heatmap of the numerical columns; this helps to identify which features are important for further analysis
sns.heatmap(car_df.select_dtypes('number').corr(), annot=True)
plt.show()
From the heatmap above:
* Price and year are strongly positively correlated
* Price and mileage are strongly negatively correlated
* Price and engineSize are moderately positively correlated
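The numbers in such a heatmap are Pearson correlation coefficients. As a minimal sketch of what that computation does, the helper below (`pearson_r`, a name introduced here for illustration) calculates it by hand on the first five rows of the dataset. Five rows are far too few to reproduce the heatmap values; they only show the direction of each relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()                      # centre both variables
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# First five rows of the dataset (year, mileage, price)
year = [2014, 2018, 2016, 2017, 2014]
mileage = [67068, 14827, 62794, 26676, 39554]
price = [11200, 27000, 16000, 12750, 14500]

r_year = pearson_r(year, price)
r_mileage = pearson_r(mileage, price)
print("year vs price   :", round(r_year, 2))
print("mileage vs price:", round(r_mileage, 2))
```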
[9]: plt.rcParams['figure.figsize'] = (20, 8)
# to check the price trend by year, one panel per transmission type
sns.relplot(x='year', y='price', data=car_df, kind='line', ci=False, col='transmission')
plt.show()
[10]: plt.rcParams['figure.figsize'] = (20, 8)
# to check the price trend by year, one line per transmission type
sns.relplot(x='year', y='price', data=car_df, kind='line', ci=False, hue='transmission')
plt.xlabel('year')
plt.ylabel('price of car')
plt.title('Variation of price by year and transmission category')
plt.show()
From the line plot above, the price of cars increases with the year: older cars have lower
prices and newer cars have higher prices.
[11]: # to check the price trend according to mileage
plt.rcParams['figure.figsize'] = (20, 8)
sns.relplot(x='mileage', y='price', data=car_df, kind='scatter')
plt.xlabel('mileage of the car')
plt.ylabel('price of car')
plt.title('Variation of price based on mileage')
plt.show()
The trend above shows that prices drop as the mileage of the car increases.
[12]: # Number of vehicles sold based on transmission
plt.figure(figsize=(10, 10))  # create the figure before plotting so the size applies to the plot
sns.countplot(x='transmission', data=car_df)
plt.title('Number of vehicles sold based on Transmission type')
plt.ylabel('Number of Vehicles')
plt.xlabel('Type of Transmission')
plt.show()
From the plot above, Semi-Auto vehicles appear to be in the greatest demand.
[13]: sns.countplot(x='fuelType', data=car_df)
plt.title('Number of vehicles sold based on fuel type')
plt.ylabel('Number of Vehicles')
plt.xlabel('Type of Fuel')
plt.show()
From the plot above, Diesel and Petrol variants far outnumber the other fuel categories.
Exploratory Data Analysis Summary
There are some rows in which engineSize or tax is zero. They could be imputed with a value such
as the mean, but they can also be dropped because they represent a very small percentage of the data.
The object columns transmission and fuelType each contain several distinct labels. They should be
encoded as numerical values for model building.
There are no missing values.
There are some duplicated rows in the dataset. They should be removed.
Data preparation
[14]: # Duplicated rows are removed
car_df = car_df.drop_duplicates()
[15]: car_df = car_df.loc[car_df['engineSize'] != 0]  # removing rows with engineSize = 0
car_df = car_df.loc[car_df['tax'] != 0]  # removing rows with tax = 0
car_df.shape
[15]: (10302, 9)
Converting the categorical columns into numerical values using one-hot encoding
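[16]: # one-hot encode the two categorical columns that are kept for modelling
transmission = pd.get_dummies(car_df['transmission'])
fuelType = pd.get_dummies(car_df['fuelType'])
# drop the original categorical columns; 'model' is not used as a feature
car_df = car_df.drop(['fuelType', 'transmission', 'model'], axis=1)
car_df = car_df.join(fuelType).join(transmission)
# drop one level of each encoded variable, which then serves as the reference category
car_df = car_df.drop(['Other', 'Semi-Auto'], axis=1)
car_df.head()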
[16]:  year  price  mileage  tax   mpg  engineSize  Diesel  Hybrid  Petrol  Automatic  Manual
0   2014  11200    67068  125  57.6         2.0       1       0       0          1       0
1   2018  27000    14827  145  42.8         2.0       0       0       1          1       0
2   2016  16000    62794  160  51.4         3.0       1       0       0          1       0
3   2017  12750    26676  145  72.4         1.5       1       0       0          1       0
4   2014  14500    39554  160  50.4         3.0       1       0       0          1       0
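Why one level of each encoded variable can be dropped is easiest to see on a toy column. The sketch below uses hypothetical rows and the `dtype=int` option of `pd.get_dummies` for readable output; it shows that the dummy columns of one variable always sum to 1 per row, so the dropped level is still recoverable as the all-zeros case.

```python
import pandas as pd

# Hypothetical rows, one per transmission label plus a repeat
toy = pd.DataFrame({"transmission": ["Automatic", "Manual", "Semi-Auto", "Manual"]})

dummies = pd.get_dummies(toy["transmission"], dtype=int)
row_sums = dummies.sum(axis=1)            # always 1: the columns are redundant as a set

# Dropping one level keeps all the information:
# a row with zeros in both remaining columns must be 'Semi-Auto'
reduced = dummies.drop(columns=["Semi-Auto"])

print(dummies)
print(reduced)
```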
Model Development
As this is a supervised regression problem, a Random Forest Regressor is a reasonable choice.
The feature columns are on very different scales, which would call for rescaling with many
algorithms, but tree-based models such as random forests split on one feature at a time and are
insensitive to feature scale, so there is no need to rescale the data. The categorical values
have already been encoded as numerical values for the machine learning algorithm.
The metrics used are the R2 score, which measures how much of the variability in the target is
explained by the model, and the mean squared error (MSE), which measures the size of the
prediction errors.
R Square measures how much of the variability in the dependent variable can be explained by the
model. In simple linear regression it equals the square of the correlation coefficient (R), which
is why it is called R Square. The R Square value is at most 1, and a larger value indicates a
better fit between predicted and actual values. On held-out data it can be negative when the
model performs worse than always predicting the mean.
MSE is calculated by summing the squared prediction errors (actual output minus predicted output)
and dividing by the number of data points. It gives an absolute measure of how much the predicted
results deviate from the actual values. Root Mean Square Error (RMSE) is the square root of MSE.
Because MSE is in squared units, taking the square root brings it back to the units of the target
and makes it easier to interpret.
A Random Forest is an ensemble technique capable of performing both regression and classification
tasks using multiple decision trees and a technique called bootstrap aggregation, commonly known
as bagging. The basic idea is to combine multiple decision trees in determining the final output
rather than relying on individual decision trees.
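The metric definitions above can be checked by computing them by hand. The helper `regression_metrics` and the three example prices below are made up for illustration; in the notebook itself the scikit-learn functions `r2_score` and `mean_squared_error` are used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred                         # actual minus predicted
    mse = float(np.mean(residuals ** 2))                # mean of squared errors
    rmse = float(np.sqrt(mse))                          # back in the units of the target
    ss_res = float(np.sum(residuals ** 2))              # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical actual and predicted prices
actual = [10000, 20000, 30000]
predicted = [12000, 19000, 29000]
metrics = regression_metrics(actual, predicted)
print(metrics)

# A predictor that is worse than the mean gives a negative R2
worse_than_mean = regression_metrics([1, 2, 3], [3, 2, 1])
print(worse_than_mean["r2"])
```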
[17]: # importing the necessary modules
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (r2_score, mean_squared_error as mse,
                             mean_absolute_error as mae,
                             mean_absolute_percentage_error as mapr)
[18]: X = car_df.drop(['price'],axis = 1)
y = car_df['price']  # 1-D target, which avoids a column-vector warning when fitting
[19]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, shuffle=True)
[20]: rfr = RandomForestRegressor(random_state = 123)
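The bagging idea described earlier can be sketched from scratch with very simple base learners. This is not how scikit-learn implements RandomForestRegressor: the one-split trees ("stumps"), the function names and the toy step data are assumptions made for illustration, and a real random forest additionally samples a subset of features at each split.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree on a single feature by minimising squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:                 # candidate split points
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    if best is None:                                    # all x identical: predict the mean
        return (np.inf, y.mean(), y.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

def bagged_stumps(x, y, n_estimators=25, seed=0):
    """Bootstrap: fit each stump on a sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(x), size=len(x))
        stumps.append(fit_stump(x[idx], y[idx]))
    return stumps

def predict_bagged(stumps, x):
    """Aggregate: average the predictions of all stumps."""
    return np.mean([predict_stump(s, x) for s in stumps], axis=0)

# Toy step-shaped data (hypothetical)
x = np.arange(20, dtype=float)
y = np.where(x < 10, 1.0, 3.0)

stumps = bagged_stumps(x, y, n_estimators=25, seed=0)
pred = predict_bagged(stumps, x)
ensemble_mse = float(np.mean((y - pred) ** 2))
print("ensemble MSE:", round(ensemble_mse, 4))
```

Averaging over bootstrap samples smooths out the quirks of any single tree, which is the main reason an ensemble is usually more stable than one decision tree.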
Tuning the Model with GridSearchCV
In machine learning, a hyperparameter is a parameter whose value is used to control the learning
process. By contrast, the values of other parameters are derived via training (for a random
forest, the split features and thresholds of each tree).
Hyperparameters such as the number of trees or the maximum tree depth cannot be inferred while
fitting the model to the training set, because they belong to the model selection task and must
be chosen beforehand.
GridSearchCV is a class in scikit-learn's model_selection module. It loops through every
combination of the predefined hyperparameter values and fits the estimator (model) on the training
set with cross-validation, so that in the end the best combination of the listed hyperparameters
can be selected.
[21]: number_of_trees = [i for i in range(100, 500, 100)]  # number of base estimators (trees)
depth_tree = [i for i in range(4, 10, 1)]  # maximum depth of each tree
# note: 'auto' was removed for RandomForestRegressor in scikit-learn 1.3; 1.0 (all features) is its equivalent there
max_feat = ['auto', 'sqrt', 'log2']
# Defining the parameter grid
parameters = dict(n_estimators=number_of_trees, max_depth=depth_tree, max_features=max_feat)
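A grid search fits one model per hyperparameter combination and per cross-validation fold, so it is worth counting the fits before running it. The sketch below enumerates the grid defined above with `itertools.product`; the fold count of 10 is taken from the `cv` argument used in this notebook's grid search.

```python
from itertools import product

number_of_trees = list(range(100, 500, 100))   # 100, 200, 300, 400
depth_tree = list(range(4, 10))                # 4 .. 9
max_feat = ["auto", "sqrt", "log2"]

# Every combination GridSearchCV would try
candidates = list(product(number_of_trees, depth_tree, max_feat))
n_candidates = len(candidates)

cv_folds = 10                                  # value of cv in the notebook
total_fits = n_candidates * cv_folds           # excludes the final refit on all training data

print(n_candidates, "candidates,", total_fits, "cross-validation fits")
```

Because the number of fits grows multiplicatively with every added hyperparameter value, a grid is best kept small or replaced by a randomized search when training is slow.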
[22]: # fitting the GridSearchCV on training data to get the best possible parameters
grid_search_model = GridSearchCV(estimator=rfr, param_grid=parameters, return_train_score=True, cv=10)
grid_search_model.fit(X_train, y_train)
best_model = grid_search_model.best_estimator_
print("parameters of the best model ::")
grid_search_model.best_params_
parameters of the best model ::
[22]: {'max_depth': 9, 'max_features': 'auto', 'n_estimators': 400}
[23]: # fitting the best estimator on the training data (GridSearchCV already refits it by default, so this step is optional)
best_model.fit(X_train,y_train)
[23]: RandomForestRegressor(max_depth=9, n_estimators=400, random_state=123)
[24]: # score() returns the R2 score for a regressor, not a classification accuracy
print('Accuracy on Training Data :: ', round(best_model.score(X_train, y_train), 2))
Accuracy on Training Data :: 0.95
[25]: feature_importances = best_model.feature_importances_
feature_list = list(X)
relative_importance = pd.DataFrame(index=feature_list, data=feature_importances, columns=['importance'])
relative_importance.sort_values(by='importance', ascending=True).plot(kind='barh')
plt.title('Feature importances of the features used')
plt.show()
Model Evaluation
The best estimator returned by the grid search is evaluated on the testing dataset using the R2 score and RMSE.
[26]: y_pred = best_model.predict(X_test)
print("the R2_score on testing data ::", round(r2_score(y_test, y_pred), 2))
print('RMSE - root mean squared error ::', round(np.sqrt(mse(y_test, y_pred)), 2))
the R2_score on testing data :: 0.92
RMSE - root mean squared error :: 3225.6
The Random Forest Regressor achieves an R2 score of 0.92 on the testing data, which is good.
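An RMSE is easier to judge relative to typical prices. The rough calculation below divides the reported test RMSE by the mean price from the describe() output. That mean was computed on the full dataset before cleaning, not on the test split, so the ratio is only an approximate indication.

```python
rmse = 3225.6                 # test RMSE reported above
mean_price = 22733.408867     # mean price from describe(), full dataset before cleaning

relative_rmse = rmse / mean_price
print(f"RMSE is about {relative_rmse:.0%} of the mean price")
```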
Result / Conclusion
This project aimed to develop a model to predict the prices of used cars. The results of the
model help to determine which factors are important when predicting used car prices.
Using the available dataset, a random forest model was developed that achieved an R2 score of
0.92 on the testing dataset. The training score (0.95) is somewhat higher than the testing score,
so the model shows some overfitting.
The importance of each feature was determined from the feature importances of the random forest
regressor. Some of the important features that help in deciding the price are year, engineSize,
mpg and mileage.
More features in the dataset, such as Car_insured, Convertible or SUV/Sedan, would help in
investigating prices further. This could also help customers in deciding their needs.
References
www.datacamp.com
www.coursera.com
https://github.com/datacamp/careerhub-data
https://pandas.pydata.org/docs/reference/frame.html
https://scikit-learn.org/stable/index.html