SlideShare una empresa de Scribd logo
1 de 49
Exploratory Data Analysis in
R
By Ms. Tahera Shaikh
Factor in R: Categorical & Continuous Variables
• What is Factor in R?
• Factors are variables in R which take on a limited number of different values;
such variables are often referred to as categorical variables.
• In a dataset, we can distinguish two types of
variables: categorical and continuous.
• In a categorical variable, the value is limited and usually based on a particular
finite group. For example, a categorical variable can be countries, year,
gender, occupation.
• A continuous variable, however, can take any values, from integer to decimal.
For example, we can have the revenue, price of a share, etc..
Categorical Variables
• R stores categorical variables into a factor. Let's check the code below to
convert a character variable into a factor variable. Characters are not
supported in machine learning algorithm, and the only way is to convert a
string to an integer.
# Create gender vector
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
class(gender_vector)
# Convert gender_vector to a factor
factor_gender_vector <-factor(gender_vector)
class(factor_gender_vector)
Continuous Variables
• Continuous class variables are the default value in R. They are stored
as numeric or integer. We can see it from the dataset below. mtcars is
a built-in dataset. It gathers information on different types of car. We
can import it by using mtcars and check the class of the variable mpg,
mile per gallon. It returns a numeric value, indicating a continuous
variable.
dataset <- mtcars
class(dataset$mpg)
R Charts & Graphs
• A pie-chart is a representation of values as slices of a circle with
different colors. The slices are labeled and the numbers
corresponding to each slice is also represented in the chart.
• In R the pie chart is created using the pie() function which takes
positive numbers as a vector input. The additional parameters are
used to control labels, color, title etc.
• Syntax
pie(x, labels, radius, main, col, clockwise)
• Syntax
Following is the description of the parameters used −
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give description to the slices.
• radius indicates the radius of the circle of the pie chart.(value between −1 and
+1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating if the slices are drawn clockwise or anti
clockwise.
pie(x, labels, radius, main, col, clockwise)
• Example
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
Pie Chart Title and Colors
Live Demo
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city_title_colours.jpg")
# Plot the chart with title and rainbow color pallet.
pie(x, labels, main = "City pie chart", col = rainbow(length(x)))
# Save the file.
dev.off()
Bar Charts
• A bar chart represents data in rectangular bars with length of the bar
proportional to the value of the variable. R uses the
function barplot() to create bar charts. R can draw both vertical and
Horizontal bars in the bar chart. In bar chart each of the bars can be
given different colors.
• Syntax
• The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
• Following is the description of the parameters used −
• H is a vector or matrix containing numeric values used in bar chart.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
Live Demo
# Create the data for the chart
H <- c(7,12,28,3,41)
# Give the chart file a name
png(file = "barchart.png")
# Plot the bar chart
barplot(H)
# Save the file
dev.off()
Bar Chart Labels, Title and Colors
# Create the data for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")
# Give the chart file a name
png(file = "barchart_months_revenue.png")
# Plot the bar chart
barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue",
main="Revenue chart",border="red")
# Save the file
dev.off()
Boxplots
• Boxplots are a measure of how well distributed is the data in a data
set. It divides the data set into three quartiles. This graph represents
the minimum, maximum, median, first quartile and third quartile in
the data set. It is also useful in comparing the distribution of data
across data sets by drawing boxplots for each of them.
• Boxplots are created in R by using the boxplot() function.
• Syntax
• The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
• Following is the description of the parameters used −
• x is a vector or a formula.
• data is the data frame.
• notch is a logical value. Set as TRUE to draw a notch.
• varwidth is a logical value. Set as true to draw width of the box proportionate
to the sample size.
• names are the group labels which will be printed under each boxplot.
• main is used to give a title to the graph.
Example
• We use the data set "mtcars" available in the R environment to create
a basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.
input <- mtcars[,c('mpg','cyl')]
print(head(input))
Creating the Boxplot
Live Demo
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
Histograms
• A histogram represents the frequencies of values of a variable
bucketed into ranges. Histogram is similar to bar chat but the
difference is it groups the values into continuous ranges. Each bar in
histogram represents the height of the number of values present in
that range.
• R creates histogram using hist() function. This function takes a vector
as an input and uses some more parameters to plot histograms.
• Syntax
• The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
• Following is the description of the parameters used −
• v is a vector containing numeric values used in histogram.
• main indicates title of the chart.
• col is used to set color of the bars.
• border is used to set border color of each bar.
• xlab is used to give description of x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks is used to mention the width of each bar.
Example
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Give the chart file a name.
png(file = "histogram.png")
# Create the histogram.
hist(v,xlab = "Weight",col = "yellow",border = "blue")
# Save the file.
dev.off()
Line Graphs
• A line chart is a graph that connects a series of points by drawing line
segments between them. These points are ordered in one of their
coordinate (usually the x-coordinate) value. Line charts are usually
used in identifying the trends in data.
• The plot() function in R is used to create the line graph.
• Syntax
• The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
• Following is the description of the parameters used −
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to
draw both points and lines.
• xlab is the label for x axis.
• ylab is the label for y axis.
• main is the Title of the chart.
• col is used to give colors to both the points and lines.
Example
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart.jpg")
# Plot the bar chart.
plot(v,type = "o")
# Save the file.
dev.off()
Line Chart Title, Color and Labels
Live Demo
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart_label_colored.jpg")
# Plot the bar chart.
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
# Save the file.
dev.off()
Scatterplots
• Scatterplots show many points plotted in the Cartesian plane. Each
point represents the values of two variables. One variable is chosen in
the horizontal axis and another in the vertical axis.
• The simple scatterplot is created using the plot() function.
• Syntax
• The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
• Following is the description of the parameters used −
• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the tile of the graph.
• xlab is the label in the horizontal axis.
• ylab is the label in the vertical axis.
• xlim is the limits of the values of x used for plotting.
• ylim is the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.
Example- The below script will create a scatterplot graph for
the relation between wt(weight) and mpg(miles per gallon).
Live Demo
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Give the chart file a name.
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Milage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Milage"
)
# Save the file.
dev.off()
Dealing with Missing Values
Dealing with Missing Values
• A common task in data analysis is dealing with missing values. In R,
missing values are often represented by NA or some other value that
represents missing values (i.e. 99). We can easily work with missing
values and in this section you will learn how to:
• Test for missing values
• Recode missing values
• Exclude missing values
Test for missing values
• To identify missing values use is.na() which returns a logical vector
with TRUE in the element locations that contain missing values
represented by NA. is.na() will work on vectors, lists, matrices, and
data frames.
• x <- c(1:4, NA, 6:7, NA)
• is.na(x)
• # data frame with missing data
• df <- data.frame(col1 = c(1:3, NA),
• col2 = c("this", NA,"is", "text"),
• col3 = c(TRUE, FALSE, TRUE, TRUE),
• col4 = c(2.5, 4.2, 3.2, NA),
• stringsAsFactors = FALSE)
• # identify NAs in full data frame
• is.na(df)
• # identify NAs in specific data frame column
• is.na(df$col4)
• To identify the location or the number of NAs we can leverage the
which() and sum() functions:
# identify location of NAs in vector
which(is.na(x))
## [1] 5 8
# identify count of NAs in data frame
sum(is.na(df))
## [1] 3
• For data frames, a convenient shortcut to compute the total missing
values in each column is to use colSums():
colSums(is.na(df))
## col1 col2 col3 col4
## 1 1 0 1
Recode missing values
• To recode missing values; or recode specific indicators that represent
missing values, we can use normal subsetting and assignment
operations.
• For example, we can recode missing values in vector x with the mean
values in x by first subsetting the vector to identify NAs and then
assign these elements a value.
• Similarly, if missing values are represented by another value (i.e. 99)
we can simply subset the data for the elements that contain that
value and then assign a desired value to those elements.
# recode missing values with the mean
# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
## [1] 1 2 3 4 NA 6 7 NA
x[is.na(x)] <- mean(x, na.rm = TRUE)
round(x, 2)
## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83
# data frame that codes missing values as 99
df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))
# change 99s to NAs
df[df == 99] <- NA
df
## col1 col2
## 1 1 2.5
## 2 2 4.2
## 3 3 NA
## 4 NA 3.2
• If we want to recode missing values in a single data frame variable we can
subset for the missing value in that specific variable of interest and then
assign it the replacement value. For example, here we recode the missing
value in col4 with the mean value of col4.
# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
col2 = c("this", NA,"is", "text"),
col3 = c(TRUE, FALSE, TRUE, TRUE),
col4 = c(2.5, 4.2, 3.2, NA),
stringsAsFactors = FALSE)
df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)
Exclude missing values
• We can exclude missing values in a couple different ways. First, if we
want to exclude missing values from mathematical operations use the
na.rm = TRUE argument. If you do not exclude these values most
functions will return an NA.
# A vector with missing values
x <- c(1:4, NA, 6:7, NA)
# including NA values will produce an NA output
mean(x)
## [1] NA
# excluding NA values will calculate the mathematical operation for all non-
missing values
mean(x, na.rm = TRUE)
## [1] 3.833333
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
col2 = c("this", NA,"is", "text"),
col3 = c(TRUE, FALSE, TRUE, TRUE),
col4 = c(2.5, 4.2, 3.2, NA),
stringsAsFactors = FALSE)
df
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 2 2 <NA> FALSE 4.2
## 3 3 is TRUE 3.2
## 4 NA text TRUE NA
Exercises
• How many missing values are in the built-in data set airquality?
• Which variables are the missing values concentrated in?
• How would you impute the mean or median for these values?
• How would you omit all rows containing missing values?

Más contenido relacionado

La actualidad más candente

Descriptive Statistics with R
Descriptive Statistics with RDescriptive Statistics with R
Descriptive Statistics with R
Kazuki Yoshida
 

La actualidad más candente (20)

Data analytics using R programming
Data analytics using R programmingData analytics using R programming
Data analytics using R programming
 
R basics
R basicsR basics
R basics
 
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...Statistics For Data Science | Statistics Using R Programming Language | Hypot...
Statistics For Data Science | Statistics Using R Programming Language | Hypot...
 
Univariate Analysis
Univariate AnalysisUnivariate Analysis
Univariate Analysis
 
Step By Step Guide to Learn R
Step By Step Guide to Learn RStep By Step Guide to Learn R
Step By Step Guide to Learn R
 
Introduction to R programming
Introduction to R programmingIntroduction to R programming
Introduction to R programming
 
Review & Hypothesis Testing
Review & Hypothesis TestingReview & Hypothesis Testing
Review & Hypothesis Testing
 
Data analysis with R
Data analysis with RData analysis with R
Data analysis with R
 
R Programming Language
R Programming LanguageR Programming Language
R Programming Language
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Descriptive Statistics with R
Descriptive Statistics with RDescriptive Statistics with R
Descriptive Statistics with R
 
Introduction to statistical software R
Introduction to statistical software RIntroduction to statistical software R
Introduction to statistical software R
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Chart and graphs in R programming language
Chart and graphs in R programming language Chart and graphs in R programming language
Chart and graphs in R programming language
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Bivariate
BivariateBivariate
Bivariate
 
Estimation and hypothesis
Estimation and hypothesisEstimation and hypothesis
Estimation and hypothesis
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
Basic statistics
Basic statisticsBasic statistics
Basic statistics
 

Similar a Exploratory data analysis using r

Week-3 – System RSupplemental material1Recap •.docx
Week-3 – System RSupplemental material1Recap •.docxWeek-3 – System RSupplemental material1Recap •.docx
Week-3 – System RSupplemental material1Recap •.docx
helzerpatrina
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
Chia-Chi Chang
 
CIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & CourseworkCIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & Coursework
TUOS-Sam
 
Lectures r-graphics
Lectures r-graphicsLectures r-graphics
Lectures r-graphics
etyca
 

Similar a Exploratory data analysis using r (20)

R training5
R training5R training5
R training5
 
Exploratory Data Analysis
Exploratory Data AnalysisExploratory Data Analysis
Exploratory Data Analysis
 
Linear Regression.pptx
Linear Regression.pptxLinear Regression.pptx
Linear Regression.pptx
 
Time Series.pptx
Time Series.pptxTime Series.pptx
Time Series.pptx
 
Introduction to matlab
Introduction to matlabIntroduction to matlab
Introduction to matlab
 
MATLAB PLOT.pdf
MATLAB PLOT.pdfMATLAB PLOT.pdf
MATLAB PLOT.pdf
 
Language R
Language RLanguage R
Language R
 
Lecture 02 visualization and programming
Lecture 02   visualization and programmingLecture 02   visualization and programming
Lecture 02 visualization and programming
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
Week-3 – System RSupplemental material1Recap •.docx
Week-3 – System RSupplemental material1Recap •.docxWeek-3 – System RSupplemental material1Recap •.docx
Week-3 – System RSupplemental material1Recap •.docx
 
R language introduction
R language introductionR language introduction
R language introduction
 
R graphics
R graphicsR graphics
R graphics
 
Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)Learning notes of r for python programmer (Temp1)
Learning notes of r for python programmer (Temp1)
 
Presentation: Plotting Systems in R
Presentation: Plotting Systems in RPresentation: Plotting Systems in R
Presentation: Plotting Systems in R
 
statistical computation using R- an intro..
statistical computation using R- an intro..statistical computation using R- an intro..
statistical computation using R- an intro..
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
CIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & CourseworkCIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & Coursework
 
Lecture_3.pptx
Lecture_3.pptxLecture_3.pptx
Lecture_3.pptx
 
Lectures r-graphics
Lectures r-graphicsLectures r-graphics
Lectures r-graphics
 
R Functions.pptx
R Functions.pptxR Functions.pptx
R Functions.pptx
 

Último

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
dollysharma2066
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 

Exploratory data analysis using r

  • 1. Exploratory Data Analysis in R By Ms. Tahera Shaikh
  • 2. Factor in R: Categorical & Continuous Variables • What is Factor in R? • Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables. • In a dataset, we can distinguish two types of variables: categorical and continuous. • In a categorical variable, the value is limited and usually based on a particular finite group. For example, a categorical variable can be countries, year, gender, occupation. • A continuous variable, however, can take any values, from integer to decimal. For example, we can have the revenue, price of a share, etc..
  • 3. Categorical Variables • R stores categorical variables into a factor. Let's check the code below to convert a character variable into a factor variable. Characters are not supported in machine learning algorithm, and the only way is to convert a string to an integer. # Create gender vector gender_vector <- c("Male", "Female", "Female", "Male", "Male") class(gender_vector) # Convert gender_vector to a factor factor_gender_vector <-factor(gender_vector) class(factor_gender_vector)
  • 4. Continuous Variables • Continuous class variables are the default value in R. They are stored as numeric or integer. We can see it from the dataset below. mtcars is a built-in dataset. It gathers information on different types of car. We can import it by using mtcars and check the class of the variable mpg, mile per gallon. It returns a numeric value, indicating a continuous variable. dataset <- mtcars class(dataset$mpg)
  • 5. R Charts & Graphs • A pie-chart is a representation of values as slices of a circle with different colors. The slices are labeled and the numbers corresponding to each slice is also represented in the chart. • In R the pie chart is created using the pie() function which takes positive numbers as a vector input. The additional parameters are used to control labels, color, title etc. • Syntax pie(x, labels, radius, main, col, clockwise)
  • 6. • Syntax Following is the description of the parameters used − • x is a vector containing the numeric values used in the pie chart. • labels is used to give description to the slices. • radius indicates the radius of the circle of the pie chart.(value between −1 and +1). • main indicates the title of the chart. • col indicates the color palette. • clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise. pie(x, labels, radius, main, col, clockwise)
  • 7. • Example # Create data for the graph. x <- c(21, 62, 10, 53) labels <- c("London", "New York", "Singapore", "Mumbai") # Give the chart file a name. png(file = "city.png") # Plot the chart. pie(x,labels) # Save the file. dev.off()
  • 8.
  • 9. Pie Chart Title and Colors Live Demo # Create data for the graph. x <- c(21, 62, 10, 53) labels <- c("London", "New York", "Singapore", "Mumbai") # Give the chart file a name. png(file = "city_title_colours.jpg") # Plot the chart with title and rainbow color pallet. pie(x, labels, main = "City pie chart", col = rainbow(length(x))) # Save the file. dev.off()
  • 10.
  • 11. Bar Charts • A bar chart represents data in rectangular bars with length of the bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and Horizontal bars in the bar chart. In bar chart each of the bars can be given different colors.
  • 12. • Syntax • The basic syntax to create a bar-chart in R is − barplot(H,xlab,ylab,main, names.arg,col) • Following is the description of the parameters used − • H is a vector or matrix containing numeric values used in bar chart. • xlab is the label for x axis. • ylab is the label for y axis. • main is the title of the bar chart. • names.arg is a vector of names appearing under each bar. • col is used to give colors to the bars in the graph.
  • 13. Live Demo # Create the data for the chart H <- c(7,12,28,3,41) # Give the chart file a name png(file = "barchart.png") # Plot the bar chart barplot(H) # Save the file dev.off()
  • 14.
  • 15. Bar Chart Labels, Title and Colors # Create the data for the chart H <- c(7,12,28,3,41) M <- c("Mar","Apr","May","Jun","Jul") # Give the chart file a name png(file = "barchart_months_revenue.png") # Plot the bar chart barplot(H,names.arg=M,xlab="Month",ylab="Revenue",col="blue", main="Revenue chart",border="red") # Save the file dev.off()
  • 16.
  • 17. Boxplots • Boxplots are a measure of how well distributed is the data in a data set. It divides the data set into three quartiles. This graph represents the minimum, maximum, median, first quartile and third quartile in the data set. It is also useful in comparing the distribution of data across data sets by drawing boxplots for each of them. • Boxplots are created in R by using the boxplot() function.
  • 18. • Syntax • The basic syntax to create a boxplot in R is − boxplot(x, data, notch, varwidth, names, main) • Following is the description of the parameters used − • x is a vector or a formula. • data is the data frame. • notch is a logical value. Set as TRUE to draw a notch. • varwidth is a logical value. Set as true to draw width of the box proportionate to the sample size. • names are the group labels which will be printed under each boxplot. • main is used to give a title to the graph.
  • 19. Example • We use the data set "mtcars" available in the R environment to create a basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars. input <- mtcars[,c('mpg','cyl')] print(head(input))
  • 20. Creating the Boxplot Live Demo # Give the chart file a name. png(file = "boxplot.png") # Plot the chart. boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders", ylab = "Miles Per Gallon", main = "Mileage Data") # Save the file. dev.off()
  • 21.
  • 22. Histograms • A histogram represents the frequencies of values of a variable bucketed into ranges. Histogram is similar to bar chat but the difference is it groups the values into continuous ranges. Each bar in histogram represents the height of the number of values present in that range. • R creates histogram using hist() function. This function takes a vector as an input and uses some more parameters to plot histograms.
  • 23. • Syntax • The basic syntax for creating a histogram using R is − hist(v,main,xlab,xlim,ylim,breaks,col,border) • Following is the description of the parameters used − • v is a vector containing numeric values used in histogram. • main indicates title of the chart. • col is used to set color of the bars. • border is used to set border color of each bar. • xlab is used to give description of x-axis. • xlim is used to specify the range of values on the x-axis. • ylim is used to specify the range of values on the y-axis. • breaks is used to mention the width of each bar.
  • 24. Example # Create data for the graph. v <- c(9,13,21,8,36,22,12,41,31,33,19) # Give the chart file a name. png(file = "histogram.png") # Create the histogram. hist(v,xlab = "Weight",col = "yellow",border = "blue") # Save the file. dev.off()
  • 25.
  • 26. Line Graphs • A line chart is a graph that connects a series of points by drawing line segments between them. These points are ordered in one of their coordinate (usually the x-coordinate) value. Line charts are usually used in identifying the trends in data. • The plot() function in R is used to create the line graph.
  • 27. • Syntax • The basic syntax to create a line chart in R is − plot(v,type,col,xlab,ylab) • Following is the description of the parameters used − • v is a vector containing the numeric values. • type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines. • xlab is the label for x axis. • ylab is the label for y axis. • main is the Title of the chart. • col is used to give colors to both the points and lines.
  • 28. Example # Create the data for the chart. v <- c(7,12,28,3,41) # Give the chart file a name. png(file = "line_chart.jpg") # Plot the bar chart. plot(v,type = "o") # Save the file. dev.off()
  • 29.
  • 30. Line Chart Title, Color and Labels Live Demo # Create the data for the chart. v <- c(7,12,28,3,41) # Give the chart file a name. png(file = "line_chart_label_colored.jpg") # Plot the bar chart. plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart") # Save the file. dev.off()
  • 31.
  • 32. Scatterplots • Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables. One variable is chosen in the horizontal axis and another in the vertical axis. • The simple scatterplot is created using the plot() function.
  • 33. • Syntax • The basic syntax for creating scatterplot in R is − plot(x, y, main, xlab, ylab, xlim, ylim, axes) • Following is the description of the parameters used − • x is the data set whose values are the horizontal coordinates. • y is the data set whose values are the vertical coordinates. • main is the tile of the graph. • xlab is the label in the horizontal axis. • ylab is the label in the vertical axis. • xlim is the limits of the values of x used for plotting. • ylim is the limits of the values of y used for plotting. • axes indicates whether both axes should be drawn on the plot.
  • 34. Example- The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles per gallon). Live Demo # Get the input values. input <- mtcars[,c('wt','mpg')] # Give the chart file a name. png(file = "scatterplot.png") # Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30. plot(x = input$wt,y = input$mpg, xlab = "Weight", ylab = "Milage", xlim = c(2.5,5), ylim = c(15,30), main = "Weight vs Milage" ) # Save the file. dev.off()
  • 35.
  • 37. Dealing with Missing Values • A common task in data analysis is dealing with missing values. In R, missing values are often represented by NA or some other value that represents missing values (i.e. 99). We can easily work with missing values and in this section you will learn how to: • Test for missing values • Recode missing values • Exclude missing values
  • 38. Test for missing values • To identify missing values use is.na() which returns a logical vector with TRUE in the element locations that contain missing values represented by NA. is.na() will work on vectors, lists, matrices, and data frames.
  • 39. • x <- c(1:4, NA, 6:7, NA) • is.na(x) • # data frame with missing data • df <- data.frame(col1 = c(1:3, NA), • col2 = c("this", NA,"is", "text"), • col3 = c(TRUE, FALSE, TRUE, TRUE), • col4 = c(2.5, 4.2, 3.2, NA), • stringsAsFactors = FALSE) • # identify NAs in full data frame • is.na(df) • # identify NAs in specific data frame column • is.na(df$col4)
  • 40. • To identify the location or the number of NAs we can leverage the which() and sum() functions: # identify location of NAs in vector which(is.na(x)) ## [1] 5 8 # identify count of NAs in data frame sum(is.na(df)) ## [1] 3
  • 41. • For data frames, a convenient shortcut to compute the total missing values in each column is to use colSums(): colSums(is.na(df)) ## col1 col2 col3 col4 ## 1 1 0 1
  • 42. Recode missing values • To recode missing values; or recode specific indicators that represent missing values, we can use normal subsetting and assignment operations. • For example, we can recode missing values in vector x with the mean values in x by first subsetting the vector to identify NAs and then assign these elements a value. • Similarly, if missing values are represented by another value (i.e. 99) we can simply subset the data for the elements that contain that value and then assign a desired value to those elements.
  • 43. # recode missing values with the mean # vector with missing data x <- c(1:4, NA, 6:7, NA) x ## [1] 1 2 3 4 NA 6 7 NA x[is.na(x)] <- mean(x, na.rm = TRUE) round(x, 2) ## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83
  • 44. # data frame that codes missing values as 99 df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2)) # change 99s to NAs df[df == 99] <- NA df ## col1 col2 ## 1 1 2.5 ## 2 2 4.2 ## 3 3 NA ## 4 NA 3.2
  • 45. • If we want to recode missing values in a single data frame variable we can subset for the missing value in that specific variable of interest and then assign it the replacement value. For example, here we recode the missing value in col4 with the mean value of col4. # data frame with missing data df <- data.frame(col1 = c(1:3, NA), col2 = c("this", NA,"is", "text"), col3 = c(TRUE, FALSE, TRUE, TRUE), col4 = c(2.5, 4.2, 3.2, NA), stringsAsFactors = FALSE) df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)
  • 46. Exclude missing values • We can exclude missing values in a couple different ways. First, if we want to exclude missing values from mathematical operations use the na.rm = TRUE argument. If you do not exclude these values most functions will return an NA.
  • 47. # A vector with missing values x <- c(1:4, NA, 6:7, NA) # including NA values will produce an NA output mean(x) ## [1] NA # excluding NA values will calculate the mathematical operation for all non- missing values mean(x, na.rm = TRUE) ## [1] 3.833333
  • 48. # data frame with missing values df <- data.frame(col1 = c(1:3, NA), col2 = c("this", NA,"is", "text"), col3 = c(TRUE, FALSE, TRUE, TRUE), col4 = c(2.5, 4.2, 3.2, NA), stringsAsFactors = FALSE) df ## col1 col2 col3 col4 ## 1 1 this TRUE 2.5 ## 2 2 <NA> FALSE 4.2 ## 3 3 is TRUE 3.2 ## 4 NA text TRUE NA
  • 49. Exercises • How many missing values are in the built-in data set airquality? • Which variables are the missing values concentrated in? • How would you impute the mean or median for these values? • How would you omit all rows containing missing values?

Notas del editor

  1. It is important to transform a string into factor when we perform Machine Learning task.
  2. We can expand the features of the chart by adding more parameters to the function. We will use parameter main to add a title to the chart and another parameter is col which will make use of rainbow colour pallet while drawing the chart. The length of the pallet should be same as the number of values we have for the chart. Hence we use length(x).
  3. The features of the bar chart can be expanded by adding more parameters. The main parameter is used to add title. The col parameter is used to add colors to the bars. The args.name is a vector having same number of values as the input vector to describe the meaning of each bar.
  4. mpg cyl Mazda RX4 21.0 6 Mazda RX4 Wag 21.0 6 Datsun 710 22.8 4 Hornet 4 Drive 21.4 6 Hornet Sportabout 18.7 8 Valiant 18.1 6
  5. A simple line chart is created using the input vector and the type parameter as "O". The below script will create and save a line chart in the current R working directory.
  6. The features of the line chart can be expanded by using additional parameters. We add color to the points and lines, give a title to the chart and add labels to the axes.
  7. We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars. input <- mtcars[,c('wt','mpg')] print(head(input))
  8. Reorder in column - assignment
  9. We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data. We can do this a few different ways.