Predicting Model for Prices of Used Cars
September 2, 2021
1. Project Motivation
2. Dataset Used
3. Analysis Plan
4. EDA
5. EDA Summary
6. Model Development
7. Result/Conclusion
8. References
Project Motivation
The price of a new car is generally fixed by the manufacturer, with additional costs such as
taxes and fees imposed by the government. Customers buying a new car can therefore be reasonably
confident that the price they pay is fair. However, because new-car prices have risen and many
customers lack the capital to buy new, used car sales are increasing globally. A used car price
prediction system is needed to estimate the worth of a car from a variety of its features. This
project builds a model that predicts the sale price of used BMW cars from the available information.
Dataset Used
The dataset is taken from the datacamp/careerhub-data repository on GitHub. It contains 8
input variables (features) and 1 output variable (target). The target variable is the price column.
A description of the dataset follows.
Feature information:
model (Categorical) - model of the car
year (Numerical) - year the car was first driven
mileage (Numerical) - total miles the car has been driven
mpg (Numerical) - fuel economy in miles per gallon
fuelType (Categorical) - type of fuel used
transmission (Categorical) - type of gear transmission
tax (Numerical) - tax on the car
engineSize (Numerical) - size of the engine
Target information:
price (Numerical) - price of the car in US dollars
Analysis Plan
Looking at the dataset, this is a supervised machine learning task (regression) to predict the price.
The following tasks are carried out in this project:
* Understand the problem statement
* Inspect the dataset
* Perform exploratory data analysis to get insights such as outliers, relationships between
variables and invalid values in the dataset
* Prepare the dataset for model building
* Fit the model (train, tune and evaluate)
* Report results and findings
Exploratory Data Analysis
In this section we apply several analysis techniques to get to know the dataset: loading the
data, computing summary statistics, inspecting the feature types, and visualizing how the
features relate to each other.
[1]: # importing necessary libraries for EDA
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
[2]: car_df = pd.read_csv('bmw.txt') # reading the data set for analysis
car_df.info() # to check the already assigned data types for the variables
print('Dimension of dataset::',car_df.shape)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10781 non-null object
1 year 10781 non-null int64
2 price 10781 non-null int64
3 transmission 10781 non-null object
4 mileage 10781 non-null int64
5 fuelType 10781 non-null object
6 tax 10781 non-null int64
7 mpg 10781 non-null float64
8 engineSize 10781 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB
Dimension of dataset:: (10781, 9)
The dataset has 10781 rows and 9 columns.
There are 3 categorical variables, stored with the object dtype.
There are no null values, as the non-null count of every column equals the number of rows.
[3]: car_df.head(3) #to get the glimpse of the dataset
[3]:     model  year  price transmission  mileage fuelType  tax   mpg  engineSize
0     5 Series  2014  11200    Automatic    67068   Diesel  125  57.6         2.0
1     6 Series  2018  27000    Automatic    14827   Petrol  145  42.8         2.0
2     5 Series  2016  16000    Automatic    62794   Diesel  160  51.4         3.0
[4]: print('Number of Duplicated rows in the dataset::',car_df.duplicated().sum())
Number of Duplicated rows in the dataset:: 117
[5]: car_df.describe()# to check summary statistics on numerical features
[5]:            year          price        mileage            tax            mpg     engineSize
count   10781.000000   10781.000000   10781.000000   10781.000000   10781.000000   10781.000000
mean     2017.078935   22733.408867   25496.986550     131.702068      56.399035       2.167767
std         2.349038   11415.528189   25143.192559      61.510755      31.336958       0.552054
min      1996.000000    1200.000000       1.000000       0.000000       5.500000       0.000000
25%      2016.000000   14950.000000    5529.000000     135.000000      45.600000       2.000000
50%      2017.000000   20462.000000   18347.000000     145.000000      53.300000       2.000000
75%      2019.000000   27940.000000   38206.000000     145.000000      62.800000       2.000000
max      2020.000000  123456.000000  214000.000000     580.000000     470.800000       6.600000
In the table above, the minimum values of the engineSize and tax features are zero, yet an engine
size of zero is implausible for a vehicle.
These rows can either be dropped or have the zeros imputed, depending on how many rows are
affected.
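As a minimal sketch of the imputation option, assuming a zero is treated as a missing measurement, the snippet below replaces zeros in a column with the median of the non-zero values. The toy frame and the helper name `impute_zeros_with_median` are hypothetical and only illustrate the idea; the notebook itself drops these rows in the data preparation step.

```python
import pandas as pd


def impute_zeros_with_median(df, column):
    """Return a copy of df with zeros in `column` replaced by the median of its non-zero values."""
    out = df.copy()
    nonzero = out.loc[out[column] != 0, column]
    out.loc[out[column] == 0, column] = nonzero.median()
    return out


# Hypothetical toy data standing in for car_df (not taken from the real dataset)
toy = pd.DataFrame({"engineSize": [2.0, 0.0, 3.0, 1.5, 0.0]})
imputed = impute_zeros_with_median(toy, "engineSize")
print(imputed["engineSize"].tolist())
```

The median is used here rather than the mean because it is less sensitive to extreme values; either choice is a judgment call.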
[6]: print('percent of values in which EngineSize is zero ::',
    len(car_df.loc[car_df['engineSize'] == 0]) / len(car_df) * 100)
print('percent of values in which tax is zero ::',
    len(car_df.loc[car_df['tax'] == 0]) / len(car_df) * 100)
percent of values in which EngineSize is zero :: 0.4359521380205918
percent of values in which tax is zero :: 3.15369631759577
[7]: for i in car_df.columns:
    if car_df[i].dtypes == 'object':
        # check the categorical values for typos or duplicate labels
        print(car_df[i].unique(), '\n')
[' 5 Series' ' 6 Series' ' 1 Series' ' 7 Series' ' 2 Series' ' 4 Series'
' X3' ' 3 Series' ' X5' ' X4' ' i3' ' X1' ' M4' ' X2' ' X6' ' 8 Series'
' Z4' ' X7' ' M5' ' i8' ' M2' ' M3' ' M6' ' Z3']
['Automatic' 'Manual' 'Semi-Auto']
['Diesel' 'Petrol' 'Other' 'Hybrid' 'Electric']
The categorical features above show no typos or duplicate labels, although the model labels carry
a leading space.
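A minimal sketch of how the leading space visible in the printed model labels (for example ' 5 Series') could be stripped with pandas string methods. The toy Series below stands in for car_df['model'] and is an assumption for illustration; the notebook does not perform this step because it later drops the model column.

```python
import pandas as pd

# Hypothetical stand-in for car_df['model'], using labels of the same shape as the printed ones
models = pd.Series([" 5 Series", " X3", " 5 Series", " i8"], name="model")

# str.strip() removes leading and trailing whitespace from every label
cleaned = models.str.strip()
print(cleaned.unique().tolist())
```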
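[8]: plt.rcParams['figure.figsize'] = (20, 8)
# correlations between the numerical features help identify which ones matter for further analysis;
# the object columns are excluded explicitly because corr() only applies to numeric data
sns.heatmap(car_df.select_dtypes('number').corr(), annot=True)
plt.show()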
From the heatmap above:
* Price and year are strongly positively correlated
* Price and mileage are strongly negatively correlated
* Price and engineSize are moderately positively correlated
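The same information can be read as numbers instead of colors. The sketch below ranks features by their correlation with price on a small made-up frame; the values are illustrative only and are not taken from the BMW dataset.

```python
import pandas as pd

# Hypothetical toy data: newer cars with lower mileage are priced higher
toy = pd.DataFrame({
    "year":    [2014, 2016, 2018, 2019, 2020],
    "mileage": [67000, 62000, 15000, 9000, 1000],
    "price":   [11000, 16000, 27000, 30000, 35000],
})

# Pearson correlation of every feature with the target, most negative first
corr_with_price = toy.corr()["price"].drop("price").sort_values()
print(corr_with_price)
```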
[9]: plt.rcParams['figure.figsize'] = (20, 8)
# price trend by year, one panel per transmission type
sns.relplot(x='year', y='price', data=car_df, kind='line', ci=False,
    col='transmission')
plt.show()
[10]: plt.rcParams['figure.figsize'] = (20, 8)
# price trend by year, one line per transmission type
sns.relplot(x='year', y='price', data=car_df, kind='line', ci=False,
    hue='transmission')
plt.xlabel('years passed')
plt.ylabel('price of car')
plt.title('Variation of price based on years passed on transmission Category')
plt.show()
The line plots above show that price rises with the year of the car: older cars have lower
prices and newer cars have higher prices.
[11]: # to check the price trend according to mileage
plt.rcParams['figure.figsize'] = (20, 8)
# a scatter plot draws no confidence interval, so the ci argument is not needed
sns.relplot(x='mileage', y='price', data=car_df, kind='scatter')
plt.xlabel('mileage of the car')
plt.ylabel('price of car')
plt.title('Variation of price based on mileage')
plt.show()
The trend above shows that prices drop as the mileage of the car increases.
[12]: # Number of vehicles by transmission type
plt.figure(figsize=(10, 10))  # create the figure before plotting so the size applies to this plot
sns.countplot(x='transmission', data=car_df)
plt.title('Number of vehicles sold based on Transmission type')
plt.ylabel('Number of Vehicles')
plt.xlabel('Type of Transmission')
plt.show()
From the plot above, Semi-Auto vehicles appear to be in great demand.
[13]: sns.countplot(x='fuelType', data=car_df)
plt.title('Number of vehicles sold based on fuel type')
plt.ylabel('Number of Vehicles')
plt.xlabel('Type of Fuel')
plt.show()
From the plot above, Diesel and Petrol variants far outnumber the other fuel categories.
Exploratory Data Analysis Summary
Some rows have an engineSize or tax of zero. These values could be imputed (for example with the
mean), but since they represent a very small percentage of the data (about 0.4% and 3.2%
respectively), they can also be dropped.
The object columns transmission and fuelType each hold several labels. They should be encoded as
numerical values for model building (the third object column, model, is dropped in the data
preparation step).
There are no missing values.
There are 117 duplicated rows in the dataset. They should be removed.
Data preparation
[14]: # Duplicated rows are removed
car_df = car_df.drop_duplicates()
[15]: car_df = car_df.loc[car_df['engineSize'] != 0]  # Removing rows with engineSize = 0
car_df = car_df.loc[car_df['tax'] != 0]  # Removing rows with tax = 0
car_df.shape
[15]: (10302, 9)
Converting the categorical columns into numerical values using one-hot encoding
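[16]: # one-hot encode the two categorical columns
transmission = pd.get_dummies(car_df['transmission'])
fuelType = pd.get_dummies(car_df['fuelType'])
# drop the original categorical columns; 'model' is not used as a feature
car_df = car_df.drop(['fuelType', 'transmission', 'model'], axis=1)
car_df = car_df.join(fuelType).join(transmission)
# drop one dummy column per category ('Other', 'Semi-Auto') to serve as the reference level
car_df = car_df.drop(['Other', 'Semi-Auto'], axis=1)
car_df.head()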
[16]: year  price  mileage  tax   mpg  engineSize  Diesel  Hybrid  Petrol  Automatic  Manual
0     2014  11200    67068  125  57.6         2.0       1       0       0          1       0
1     2018  27000    14827  145  42.8         2.0       0       0       1          1       0
2     2016  16000    62794  160  51.4         3.0       1       0       0          1       0
3     2017  12750    26676  145  72.4         1.5       1       0       0          1       0
4     2014  14500    39554  160  50.4         3.0       1       0       0          1       0
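The encoding step can also be written as a single `pd.get_dummies` call on the whole frame. The sketch below is an alternative formulation on a hypothetical three-row frame, not the notebook's code; note that it produces prefixed column names such as `fuelType_Diesel`, unlike the unprefixed columns shown above.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the dataset
toy = pd.DataFrame({
    "price": [11200, 27000, 16000],
    "fuelType": ["Diesel", "Petrol", "Other"],
    "transmission": ["Automatic", "Semi-Auto", "Manual"],
})

# One-hot encode both categorical columns at once; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["fuelType", "transmission"], dtype=int)

# Drop one level per category so it acts as the reference level, as the notebook does
encoded = encoded.drop(columns=["fuelType_Other", "transmission_Semi-Auto"])
print(encoded.columns.tolist())
```

A row whose fuel type is the reference level ('Other') is then represented by zeros in all remaining fuelType columns.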
Model Development
This is a supervised learning problem that calls for regression, and a Random Forest Regressor is
a reasonable choice. The columns are on very different scales, which would normally call for
rescaling, but tree-based models such as random forests split on thresholds of individual features
and are insensitive to feature scale, so no rescaling is needed. The categorical values have
already been encoded as numerical values for the machine learning algorithm.
The metrics used are the R2 score, which measures how much of the variance in the target is
explained by the model, and the mean squared error (MSE), which quantifies the size of the
prediction errors.
R Square measures how much of the variability in the dependent variable can be explained by the
model. In simple linear regression it equals the square of the correlation coefficient (R), which
is why it is called R Square. Its value is at most 1 and is typically between 0 and 1, although it
can be negative on held-out data when the model does worse than always predicting the mean. A
bigger value indicates a better fit between prediction and actual value.
MSE is calculated by summing the squared prediction errors (actual output minus predicted output)
and dividing by the number of data points. It gives an absolute measure of how much the predicted
results deviate from the actual values. Root Mean Square Error (RMSE) is the square root of MSE.
Because MSE is built from squared errors, taking the square root brings it back to the same units
as the target, which makes it easier to interpret.
A Random Forest is an ensemble technique capable of performing both regression and classification
tasks. It uses multiple decision trees and a technique called bootstrap aggregation, commonly
known as bagging. The basic idea is to combine multiple decision trees in determining the final
output rather than relying on an individual decision tree.
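The metric definitions above can be made concrete with a few lines of NumPy. This is a sketch on made-up numbers, and the helper name `regression_metrics` is hypothetical; the notebook itself relies on scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred                        # actual minus predicted
    mse = np.mean(errors ** 2)                      # mean of the squared errors
    rmse = np.sqrt(mse)                             # back in the units of the target
    ss_res = np.sum(errors ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical values, not taken from the dataset
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Always predicting the mean of y_true gives R2 = 0 by construction
baseline = regression_metrics(y_true, [25.0] * 4)
```

Here the squared errors are 4, 4, 9 and 9, so MSE is 6.5 and R2 is 1 - 26/500 = 0.948.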
[17]: # importing the necessary modules
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import (r2_score, mean_squared_error as mse,
    mean_absolute_error as mae, mean_absolute_percentage_error as mapr)
[18]: X = car_df.drop(['price'],axis = 1)
y = car_df['price']  # 1-D target; a single-column DataFrame triggers a shape warning when fitting
[19]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
    random_state=123, shuffle=True)
[20]: rfr = RandomForestRegressor(random_state = 123)
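To illustrate the bagging idea behind the regressor instantiated above, here is a minimal hand-rolled sketch: each tree is trained on a bootstrap sample and the predictions are averaged. The synthetic data and all variable names are assumptions for illustration. A real RandomForestRegressor additionally considers a random subset of features at each split, which this sketch omits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): a noisy linear relationship between one feature and the target
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=200)

trees = []
for seed in range(25):
    # Bootstrap sample: draw as many rows as the data has, with replacement
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor(max_depth=4, random_state=seed)
    trees.append(tree.fit(X_toy[idx], y_toy[idx]))

# Aggregation: the ensemble prediction is the mean of the individual tree predictions
X_new = np.array([[2.0], [5.0], [8.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, n_points)
bagged = per_tree.mean(axis=0)
print(bagged)
```

Averaging over many trees reduces the variance of any single tree's prediction, which is the motivation for the ensemble.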
Tuning the Model's Hyperparameters using GridSearchCV
In machine learning, a hyperparameter is a parameter whose value is used to control the learning
process. By contrast, the values of other parameters (node weights in a neural network, or split
thresholds in a decision tree) are derived via training.
Model hyperparameters, such as the number of trees or their maximum depth, cannot be inferred
while fitting the model to the training set, because they belong to the model selection task.
GridSearchCV is a class in scikit-learn's model_selection module. It loops through every
combination of the predefined hyperparameter values, fits the estimator (model) on the training
set with cross-validation, and lets you select the best-scoring combination from the listed values.
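To make the cost of an exhaustive search concrete, the sketch below enumerates a grid using only the standard library. The grid values mirror the ones defined in the next cell and the fold count matches the `cv=10` used in the GridSearchCV call; the variable names are illustrative.

```python
from itertools import product

# Same value ranges as the notebook's parameter grid
param_grid = {
    "n_estimators": list(range(100, 500, 100)),  # 100, 200, 300, 400
    "max_depth": list(range(4, 10)),             # 4 .. 9
    "max_features": ["auto", "sqrt", "log2"],
}
cv_folds = 10

# Every combination of one value per hyperparameter is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # → 72 720
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full training set after the search.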
[21]: number_of_trees = [i for i in range(100, 500, 100)]  # number of base estimators (trees)
depth_tree = [i for i in range(4, 10, 1)]  # maximum depth of each tree
# note: 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (use 1.0 there)
max_feat = ['auto', 'sqrt', 'log2']
# Defining the parameter grid
parameters = dict(n_estimators=number_of_trees, max_depth=depth_tree,
    max_features=max_feat)
[22]: # fitting the GridSearchCV on training data to get the best possible parameters
grid_search_model = GridSearchCV(estimator=rfr, param_grid=parameters,
    return_train_score=True, cv=10)
grid_search_model.fit(X_train, y_train)
best_model = grid_search_model.best_estimator_
print("parameters of the best model ::")
grid_search_model.best_params_
parameters of the best model ::
[22]: {'max_depth': 9, 'max_features': 'auto', 'n_estimators': 400}
[23]: ## fitting the model using best estimator returned using Hyper Parameter tuning
best_model.fit(X_train,y_train)
[23]: RandomForestRegressor(max_depth=9, n_estimators=400, random_state=123)
[24]: # for regressors, score() returns the R2 coefficient of determination
print('Accuracy on Training Data :: ', round(best_model.score(X_train, y_train), 2))
Accuracy on Training Data :: 0.95
[25]: feature_importances = best_model.feature_importances_
feature_list = list(X)
relative_importance = pd.DataFrame(index=feature_list, data=feature_importances,
    columns=['importance'])
relative_importance.sort_values(by='importance', ascending=True).plot(kind='barh')
plt.title('Feature importances of the features used')
plt.show()
Model Evaluation
The best estimator returned by the grid search is evaluated on the testing dataset using the R2
score and the RMSE.
[26]: y_pred = best_model.predict(X_test)
print("the R2_score on testing data ::",round(r2_score(y_test,y_pred),2))
print('RMSE - root mean squared error ::',round(np.sqrt(mse(y_test,y_pred)),2))
the R2_score on testing data :: 0.92
RMSE - root mean squared error :: 3225.6
The Random Forest Regressor achieves an R2 score of 0.92 on the held-out testing data, which is
good.
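The notebook imports mean_absolute_error and mean_absolute_percentage_error but does not report them. As a hedged illustration of what those two metrics compute, here is a NumPy sketch on made-up prices (not the notebook's data); the function names are hypothetical.

```python
import numpy as np


def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))


def mean_abs_pct_error(y_true, y_pred):
    """Average absolute error relative to the actual value, as a fraction (0.1 means 10%)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true))


# Hypothetical prices and predictions
prices = [10000.0, 20000.0, 40000.0]
preds = [11000.0, 18000.0, 40000.0]

mae_value = mean_abs_error(prices, preds)
mape_value = mean_abs_pct_error(prices, preds)
print(mae_value, mape_value)
```

Unlike RMSE, the mean absolute error does not weight large errors more heavily, and the percentage version makes errors comparable across cheap and expensive cars.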
Result / Conclusion
This project aimed to develop a model to predict the prices of used cars. The results of the
model help to identify the factors that matter most when predicting used car prices.
Using the available dataset, a random forest model was developed that achieved an R2 score of
0.92 on the testing dataset, compared with 0.95 on the training data, so the model overfits
slightly.
The feature importances were obtained from the fitted random forest regressor. Some of the
important features that help in deciding the price are year, engineSize, mpg and mileage.
The dataset would benefit from more features, such as whether the car is insured, whether it is a
convertible, and its body type (SUV/Sedan); these would help in investigating the prices further.
This could also help customers in deciding their needs.
References
www.datacamp.com
www.coursera.com
https://github.com/datacamp/careerhub-data
https://pandas.pydata.org/docs/reference/frame.html
https://scikit-learn.org/stable/index.html
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 

Último (20)

High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
