2.
> x=11
> print(x)
[1] 11
> x
[1] 11
> X
Error: object 'X' not found
> y<-7
> y
[1] 7
> y<-9
> y
[1] 9
> ls()
[1] "x" "y"
> rm(y)
> y
Error: object 'y' not found
> y<-9
> x.1<-14
> x.1
[1] 14
> 1x<-22
Error: unexpected symbol in "1x"
3.
4. Entering data with c
• c function for small datasets – combines or concatenates
terms together
Example: we have the following counts of typing mistakes in a
Word document:
0 2 1 3 2 0 1 1
To enter this into an R session, we type:
> typo=c(0,2,1,3,2,0,1,1)
> typo
[1] 0 2 1 3 2 0 1 1
5. Learning Objectives
• What is statistics?
• Become aware of the varied applications of statistics in
business.
• Differentiate between descriptive and inferential statistics.
• Identify types of variables.
6. Statistics in Business
• Accounting — auditing and cost estimation
• Economics — local, regional, national, and international
economic performance
• Finance — investments and portfolio management
• Management — human resources, compensation, and quality
management
• Management Information Systems — performance of systems
which gather, summarize, and disseminate information to
various managerial levels
• Marketing — market analysis and consumer research
• International Business — market and demographic analysis
7. What is Statistics?
• Science dealing with
collection, analysis, interpretation and
presentation of data (with a view to making
inferences)
• Branches of statistics:
– Descriptive – graphical or numerical summaries of
data
– Inferential – making a decision based on data
9. Population Versus Sample
• Population — the whole
– a collection of all persons, objects, or items under
study
• Census — gathering data from the entire population
• Sample — gathering data on a subset of the population
– Use information about the sample to infer about the
population
11. Population and Census Data
Identifier  Color  MPG
RD1         Red    12
RD2         Red    10
RD3         Red    13
RD4         Red    10
RD5         Red    13
BL1         Blue   27
BL2         Blue   24
GR1         Green  35
GR2         Green  35
GY1         Gray   15
GY2         Gray   18
GY3         Gray   17
12. Sample and Sample Data
Identifier  Color  MPG
RD2         Red    10
RD5         Red    13
GR1         Green  35
GY2         Gray   18
14. Parameter vs. Statistic
• Parameter — descriptive measure of the population
– Usually represented by Greek letters
μ denotes population mean
σ² denotes population variance
σ denotes population standard deviation
• Statistic — descriptive measure of a sample
– Usually represented by Roman letters
x̄ denotes sample mean
s² denotes sample variance
s denotes sample standard deviation
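As a rough illustration in R, the MPG values listed on slides 11 and 12 can be used to contrast a parameter with a statistic. The variable names below are assumptions made for this sketch. R's var() and sd() divide by n - 1 (the sample formulas), so the population variance is computed by hand here.

```r
# MPG for all 12 cars (the population) and for the 4 sampled cars
pop_mpg    <- c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
sample_mpg <- c(10, 13, 35, 18)          # cars RD2, RD5, GR1, GY2

# Parameters: computed from the whole population
mu     <- mean(pop_mpg)                  # population mean
sigma2 <- mean((pop_mpg - mu)^2)         # population variance (divide by N)
sigma  <- sqrt(sigma2)                   # population standard deviation

# Statistics: computed from the sample only
xbar <- mean(sample_mpg)                 # sample mean
s2   <- var(sample_mpg)                  # sample variance (divide by n - 1)
s    <- sd(sample_mpg)                   # sample standard deviation

c(mu = mu, xbar = xbar)
```

The sample mean will generally differ from the population mean; inference is about how much can be said about the parameter from the statistic.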
15. Statistics in Business
• Inferences about parameters made under
conditions of uncertainty (which are always
present in statistics)
– Uncertainty can be caused by
• Randomness in selection of a sample
• Lack of knowledge about the source of the
inferences
• Change in conditions not accounted for
16. Variables and Data
Variable: a characteristic of any entity being studied that is
capable of taking on different values that can be used for
analysis
e.g. stock price, ROI, market share, age of worker, income of a
family, total sales, advertising cost, etc.
Measurement: takes place when a standard process is used
to assign numbers to particular characteristics of a
variable; the process may be obvious or may need to be defined
e.g. age is obvious, but ROI or labour productivity must be defined
Sampling unit: the source of each measurement
Data: recorded measurements
17. Levels of Data Measurement
What are 40 and 80? They may represent:
• Weights of two objects being shipped
• Ratings received in a consumer test by two
different products
• Football jersey numbers of a fullback and a centre forward
Appropriateness of data analysis depends
on the level of measurement of the data gathered
18. Levels of Data Measurement
• Nominal - Qualitative data; numbers
are used only to classify or categorize the
attribute. It is still useful to retain the original
verbal descriptions of the categories
– 1 for “male” and 2 for “female”
– Employee identification number
– Religion, geographic location, PIN code, place of
birth
– Demographic questions in a survey, etc.
19. Levels of Data Measurement
• Ordinal - A variable is measured at the ordinal level if
ranking or ordering is possible for values of
the variable.
– For example, a gold medal reflects superior
performance to a silver or bronze medal in the
Olympics. But can you say a gold and a bronze
medal average out to a silver medal?
– Preference scales are typically ordinal – how much
do you like this cereal? Like it a lot, somewhat like
it, neutral, somewhat dislike it, dislike it a lot.
20. Levels of Data Measurement
• Interval - In interval measurement the
distance between attributes does have
meaning.
– Numerical data typically fall into this category
– For example, when measuring temperature (in
Fahrenheit), the distance from 30 to 40 is the same as
the distance from 70 to 80. The interval between
values is interpretable.
21. Levels of Data Measurement
• Ratio - In ratio measurement there is always
a meaningful reference point (a true zero)
– This means that you can construct a meaningful
fraction (or ratio) with a ratio variable.
– In applied social research most "count" variables
are ratio, for example, the number of clients in
the past six months.
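One way to see the four levels in R is through the data types used to store them. The sketch below is illustrative only; the variable names and example values are assumptions, not part of the slides.

```r
# Nominal: categories with no inherent order
colour <- factor(c("Red", "Blue", "Green", "Red"))
table(colour)                    # counting categories is meaningful

# Ordinal: categories with a meaningful order but no meaningful distances
medal <- factor(c("bronze", "gold", "silver"),
                levels = c("bronze", "silver", "gold"),
                ordered = TRUE)
medal[2] > medal[3]              # TRUE: gold ranks above silver

# Interval and ratio: ordinary numeric vectors
weights <- c(40, 80)             # weights have a true zero, so ratios make sense
weights[2] / weights[1]          # the second object is twice as heavy
```

Storing ordinal data as an ordered factor allows comparisons while keeping the original verbal labels.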
22. Visualizing the data
• Construct a frequency distribution
– For both grouped and ungrouped data
• Construct graphical summaries of qualitative
data
• Construct graphical summaries of quantitative
data
• Construct graphical summaries of two
variables
23. Ungrouped vs. Grouped Data
• Ungrouped data
– have not been summarized in any way
– are also called raw data
• Grouped data
– logical groupings of data exist
• e.g. age ranges (20-29, 30-39, etc.)
– have been organized into a frequency distribution
24. Example of Ungrouped Data
Ages of a sample of managers from urban child care centres in the US:
42 26 32 34 57 30 58 37 50 30
53 40 30 47 49 50 40 32 31 40
52 28 23 35 25 30 36 32 26 50
55 30 58 64 52 49 33 43 46 32
61 31 30 40 60 74 37 29 43 54
25. Frequency Distribution
• Frequency Distribution – summary of data
presented in the form of class intervals and
frequencies
– Vary in shape and design
– Constructed according to the individual
researcher's preferences
26. Frequency Distribution
• Steps in Frequency Distribution
– Step 1 – Determine the range of the frequency distribution
• Range is the difference between the highest and the lowest
numbers
– Step 2 – Determine the number of classes
• Do not use too many or too few classes
– Step 3 – Determine the width of the class interval
• Approx. class width can be calculated by dividing the range
by the number of classes
• Each value must fit into exactly one class
27. Frequency Distribution of Child
Care Managers’ Ages
Class Interval   Frequency
20-under 30          6
30-under 40         18
40-under 50         11
50-under 60         11
60-under 70          3
70-under 80          1
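The frequency distribution above can be reproduced in R from the raw ages on slide 24. This is a minimal sketch using cut() and table(); the variable names are assumptions made for illustration.

```r
# Raw (ungrouped) ages of the 50 child care managers
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

# Class boundaries 20, 30, ..., 80. right = FALSE closes each class on the
# left, which matches the "20-under 30" convention: [20,30), [30,40), ...
classes <- cut(ages, breaks = seq(20, 80, by = 10), right = FALSE)

# Count how many ages fall into each class
freq <- table(classes)
freq
```

Because every class is closed on the left and open on the right, each age falls into exactly one class.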
28. Relative Frequency
Relative frequency is the proportion of the total frequency that
is in any given class interval in a frequency distribution.
Class Interval   Frequency   Relative Frequency
20-under 30          6        6/50 = .12
30-under 40         18       18/50 = .36
40-under 50         11       .22
50-under 60         11       .22
60-under 70          3       .06
70-under 80          1       .02
Total               50       1.00
29. Cumulative Frequency
Cumulative frequency is a running total of frequencies through
the classes of a frequency distribution.
Class Interval   Frequency   Cumulative Frequency
20-under 30          6        6
30-under 40         18       24  (18 + 6)
40-under 50         11       35  (11 + 24)
50-under 60         11       46
60-under 70          3       49
70-under 80          1       50
Total               50
30. Cumulative Relative Frequencies
Cumulative relative frequency is a running total of the relative
frequencies through the classes of a frequency distribution.
Class Interval   Frequency   Relative    Cumulative   Cumulative Relative
                             Frequency   Frequency    Frequency
20-under 30          6         .12          6            .12
30-under 40         18         .36         24            .48
40-under 50         11         .22         35            .70
50-under 60         11         .22         46            .92
60-under 70          3         .06         49            .98
70-under 80          1         .02         50           1.00
Total               50        1.00
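The relative, cumulative, and cumulative relative frequencies can all be derived from the class frequencies with a few vector operations. The sketch below starts from the frequencies on slide 27; the object names are assumptions.

```r
# Class frequencies from the age distribution
freq <- c("20-under 30" = 6, "30-under 40" = 18, "40-under 50" = 11,
          "50-under 60" = 11, "60-under 70" = 3, "70-under 80" = 1)

rel_freq     <- freq / sum(freq)     # proportion of the total in each class
cum_freq     <- cumsum(freq)         # running total of frequencies
cum_rel_freq <- cumsum(rel_freq)     # running total of relative frequencies

# Collect the columns into one table
data.frame(Frequency = freq,
           Relative = rel_freq,
           Cumulative = cum_freq,
           CumulativeRelative = cum_rel_freq)
```

The last cumulative frequency equals the total count, and the last cumulative relative frequency equals 1.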
31. Common Statistical Graphs
– Quantitative Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Dot Plots – each data value is plotted
• Stem and Leaf Plot -- Like a histogram, but
shows individual data values. Useful for small
data sets.
32. Histogram
• A histogram is a graphical summary of a
frequency distribution
• Constructed by labeling the x-axis with class endpoints and the y-axis
with frequencies, then drawing a horizontal line
between the two class endpoints at each frequency
value
• The number and location of rectangles (bars)
should be determined based on the sample
size and the range of the data
34. Number of Classes
and Class Width
• The number of classes should be between 5 and 15.
– Fewer than 5 classes cause excessive summarization.
– More than 15 classes leave too much detail.
• Class Width
– Divide the range by the number of classes for an
approximate class width
– Round up to a convenient number
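Applied to the ages on slide 24, the class-width rule might look like the sketch below. The choice of 6 classes and of 20 as the starting point are assumptions that happen to reproduce the slide's classes; hist() is called with plot = FALSE so that only the counts and midpoints are computed.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

data_range   <- max(ages) - min(ages)       # 74 - 23 = 51
n_classes    <- 6                           # between 5 and 15
approx_width <- data_range / n_classes      # 8.5
class_width  <- 10                          # rounded up to a convenient number

# Histogram bins built from the chosen width; right = FALSE gives [20,30), ...
h <- hist(ages, breaks = seq(20, 80, by = class_width),
          right = FALSE, plot = FALSE)
h$counts    # class frequencies
h$mids      # class midpoints (class marks)
```

Calling plot(h) afterwards draws the histogram itself.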
35. Class midpoint or Class mark
The midpoint of each class interval is called the
class midpoint or the class mark.
36. Midpoints for Age Classes
Class Interval   Frequency   Midpoint   Relative    Cumulative
                                        Frequency   Frequency
20-under 30          6          25        .12           6
30-under 40         18          35        .36          24
40-under 50         11          45        .22          35
50-under 60         11          55        .22          46
60-under 70          3          65        .06          49
70-under 80          1          75        .02          50
Total               50                   1.00
43. Histogram vs. Stem and Leaf?
• So, which one should you use?
• A Stem and Leaf plot is useful for small data
sets. It shows the values of the data points.
• A histogram forgoes showing the individual
values of the data for the bigger picture of the
distribution of the data
• The purpose of these graphs is to summarize a
set of data. As long as that need is met, either
one is okay to use.
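Both displays are available in base R. The sketch below uses the first ten ages from slide 24 as a small data set; that subset and the variable names are assumptions chosen for illustration.

```r
# A small data set: the first ten ages from the ungrouped data
small <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30)

# Stem-and-leaf plot: printed as text, individual values remain visible
stem(small)

# Histogram summary: only the count per class survives
h <- hist(small, breaks = seq(20, 60, by = 10), right = FALSE, plot = FALSE)
h$counts
```

The stem-and-leaf output lets the reader recover each value, while the histogram counts show only how many values fall in each class.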
44. Common Statistical Graphs
– Qualitative Data
• Pie Chart -- proportional representation for
categories of a whole
• Bar Chart – frequency or relative frequency of
one or more categorical variables
45. Complaints by Amtrak Passengers
COMPLAINT           NUMBER   PROPORTION   DEGREES
Stations, etc.      28,000      .40        144.0
Train Performance   14,700      .21         75.6
Equipment           10,500      .15         54.0
Personnel            9,800      .14         50.4
Schedules, etc.      7,000      .10         36.0
Total               70,000     1.00        360.0
47. Second Quarter U.S. Truck Production
Second Quarter Truck Production in the U.S. (Hypothetical values)
Company   2d Quarter Truck Production
A               357,411
B               354,936
C               160,997
D                34,099
E                12,747
Totals          920,190
49. Pie Chart Calculations
for Company A
Company   2d Quarter Truck Production   Proportion   Degrees
A               357,411                   .388         140
B               354,936                   .386         139
C               160,997                   .175          63
D                34,099                   .037          13
E                12,747                   .014           5
Totals          920,190                  1.000         360
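The proportion and degree columns follow directly from the production figures: each proportion is the company's share of the total, and each angle is that share of 360 degrees. A minimal sketch in R, with assumed variable names:

```r
# Hypothetical second-quarter truck production by company
production <- c(A = 357411, B = 354936, C = 160997, D = 34099, E = 12747)

proportion <- production / sum(production)   # share of the whole
degrees    <- proportion * 360               # slice angle in a pie chart

round(proportion, 3)
round(degrees)
```

Passing the same vector to pie(production) draws the chart; R computes the slice angles internally in the same way.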
50. Vertical Bar Graphs or Column Charts
[Figure: column chart comparing Kolkata, Mumbai, and Chennai for the years 2010 to 2013; vertical axis from 0 to 6]
52. Pareto Chart
A Pareto chart is a bar chart, sorted from the most frequent to the
least frequent, overlaid with a cumulative line graph (like an ogive).
These data present the most common types of defects.
[Figure: Pareto chart of defect types (Poor Wiring, Short in Coil, Defective Plug, Other); left axis: Frequency, 0 to 100; right axis: cumulative percentage, 0% to 100%]
54. Common Statistical Graphs –
Comparing Two Variables
• Scatter Plot -- type of display using Cartesian
coordinates to display values for two variables for
a set of data.
– The data is displayed as a collection of points, each
having the value of one variable determining the
position on the horizontal axis and the value of the
other variable determining the position on the vertical
axis.
– A scatter plot is also called a scatter chart, scatter
diagram, or scatter graph.
55. Measures of Central Tendency
& Dispersion:
Learning Objectives
• Distinguish between measures of central
tendency, measures of variability, measures of
shape, and measures of association.
• Understand the meanings of mean, median, mode, quartile,
percentile, and range.
• Compute mean, median, mode, percentile, quartile, range,
variance, standard deviation, and mean absolute
deviation on ungrouped data.
• Differentiate between sample and population
variance and standard deviation.
56. Measures of Central Tendency
& Dispersion:
Learning Objectives - continued
• Understand the meaning of standard deviation as
it is applied by using the empirical rule and
Chebyshev’s theorem.
• Compute the mean, median, standard
deviation, and variance on grouped data.
• Understand box and whisker plots, skewness, and
kurtosis.
• Compute a coefficient of correlation and interpret
it.
57. Measures of Central Tendency:
Ungrouped Data
• Measures of central tendency yield information
about “the centre, or middle part, of a group of
numbers.”
• Measures of central tendency do not focus on the
span of the data set or how far values are from the
middle numbers
• Common Measures of Location
– Mode
– Median
– Mean
– Percentiles
– Quartiles
58. Mode
• Mode - the most frequently occurring value in a
data set
– Applicable to all levels of data measurement
(nominal, ordinal, interval, and ratio)
– Can be used to determine what categories occur most
frequently
– Sometimes, no mode exists (no duplicates)
• Bimodal – In a tie for the most frequently
occurring value, two modes are listed
• Multimodal -- Data sets that contain more than
two modes
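Base R has no function for the statistical mode (the built-in mode() reports an object's storage mode instead), so a small helper is needed. The function name modes() below is an assumption made for this sketch; it returns every value tied for the highest frequency, as character labels.

```r
# Return all values that occur most often (as character labels)
modes <- function(x) {
  counts <- table(x)                                  # frequency of each value
  names(counts)[as.vector(counts) == max(counts)]     # keep the most frequent
}

# Typing-mistake counts from slide 4: the value 1 occurs three times
modes(c(0, 2, 1, 3, 2, 0, 1, 1))

# Works for nominal data too; here two categories tie (bimodal)
modes(c("Red", "Blue", "Red", "Gray", "Blue"))
```

If every value occurs exactly once, the helper returns all values, which corresponds to the case where no mode exists.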
59. Median
• Median - middle value in an ordered array of
numbers.
– Half the data are above it, half the data are below it
– Mathematically, it is the (n+1)/2 th ordered
observation
• For an array with an odd number of terms, the median is
the middle number
– n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation
• For an array with an even number of terms the median is
the average of the middle two numbers
– n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th
ordered observation
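The (n+1)/2 position rule can be written out directly in R and compared with the built-in median(). The helper name and the two example vectors are assumptions made for illustration.

```r
# Median located by position in the ordered data
median_by_position <- function(x) {
  x   <- sort(x)
  pos <- (length(x) + 1) / 2
  # Odd n: floor and ceiling give the same position.
  # Even n: they give the two middle positions, which are averaged.
  mean(x[c(floor(pos), ceiling(pos))])
}

odd_n  <- c(3, 9, 4, 1, 7)        # n = 5, position 3
even_n <- c(3, 9, 4, 1, 7, 8)     # n = 6, position 3.5

median_by_position(odd_n)         # 4
median_by_position(even_n)        # 5.5, the average of 4 and 7
```

Both results agree with median(odd_n) and median(even_n).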
60. Arithmetic Mean
• Mean is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data
set, including extreme values
• Computed by summing all values in the data
set and dividing the sum by the number of
values in the data set
61. Demonstration Problem
The number of U.S. cars in service by top car rental
companies in a recent year according to Auto Rental
News follows.
Company / Number of Cars in Service
Enterprise 643,000; Hertz 327,000; National/Alamo
233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget
144,000; Advantage 20,000; U-Save 12,000; Payless
10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000;
Triangle 6,000
Compute the mode, the median, and the mean.
62. Demonstration Problem
Solution
Mode: 9,000 (two companies with 9,000 cars in
service)
Median: With 13 different companies in this
group, N = 13. The median is located at the (13
+1)/2 = 7th position. Because the data are
already ordered, median is the 7th term, which is
20,000.
Mean: μ = ∑x/N = (1,791,000/13) = 137,769.23
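The same three answers can be checked in R. The vector name is an assumption; the values are the cars-in-service figures from the problem statement.

```r
# Cars in service for the 13 rental companies
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
                     12000, 10000, 9000, 9000, 7000, 6000)

# Mode: the value with the highest frequency
counts     <- table(cars_in_service)
mode_value <- as.numeric(names(counts)[as.vector(counts) == max(counts)])

mode_value                  # 9000
median(cars_in_service)     # 20000
mean(cars_in_service)       # 1,791,000 / 13, about 137,769.23
```

The mean is far above the median here because a few very large companies pull it upward.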
63. Percentile
• Percentile - measures of central tendency that divide a
group of data into 100 parts
• At least n% of the data lie at or below the nth
percentile, and at most (100 - n)% of the data lie
above the nth percentile
• Example: the 90th percentile indicates that at least 90% of the
data are equal to or less than it, and at most 10% of the data
lie above it
64. Calculating Percentiles
• To calculate the pth percentile,
– Order the data
– Calculate i = N (p/100)
– Determine the percentile
• If i is a whole number, then use the average of the
ith and (i+1)th ordered observation
• Otherwise, round i up to the next whole
number and use the observation at that position
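The steps above can be sketched as a short R function. The name percentile() and the example scores are hypothetical; the function follows the location rule on this slide and is intended for p strictly between 0 and 100.

```r
# pth percentile using the slide's location rule
percentile <- function(x, p) {
  x <- sort(x)                      # step 1: order the data
  i <- length(x) * p / 100          # step 2: location index i = N(p/100)
  if (i == floor(i)) {
    mean(x[c(i, i + 1)])            # whole number: average ith and (i+1)th
  } else {
    x[ceiling(i)]                   # otherwise: round i up to that position
  }
}

scores <- c(14, 12, 19, 23, 5, 13, 28, 17)   # hypothetical data, N = 8

percentile(scores, 30)   # i = 2.4, so the 3rd ordered value: 13
percentile(scores, 50)   # i = 4, so the average of the 4th and 5th: 15.5
```

Other textbooks and software use slightly different location rules, so results can differ from this one on small data sets.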
65. Quartiles
• Quartile - measures of central tendency that divide a
group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
[Figure: ordered data divided by Q1, Q2, and Q3 into four segments of 25% each]
66. Quartiles for Demonstration Problem
For the cars in service data, n=13, so
Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation
Q1 = 9,000
Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation
Q3 = 204,000
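R's quantile() function supports several location rules through its type argument. As far as the definitions go, type = 2 (average at whole-number positions, otherwise the next ordered observation) corresponds to the rule used here; the default type interpolates and may give different values for other data. The vector name is an assumption.

```r
# Cars in service for the 13 rental companies
cars_in_service <- c(643000, 327000, 233000, 204000, 167000, 144000, 20000,
                     12000, 10000, 9000, 9000, 7000, 6000)

# First and third quartiles with the "round up or average" rule
q <- quantile(cars_in_service, probs = c(0.25, 0.75), type = 2)
unname(q)                          # 9000 and 204000

# Interquartile range under the same rule
unname(q[2] - q[1])
```

For this data set the result matches the hand calculation: the 4th and 10th ordered observations.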
67. Which Measure Do I Use?
• Which measure of central tendency is most
appropriate?
– In general, the mean is preferred, since it has nice
mathematical properties that we shall discuss later
– The median and quartiles are resistant to outliers
• Consider the following three datasets
– 1, 2, 3 (median=2, mean=2)
– 1, 2, 6 (median=2, mean=3)
– 1, 2, 30 (median=2, mean=11)
– All have median=2, but the mean is sensitive to the outliers
• In general, if there are outliers, the median is preferred
to the mean
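The three datasets can be compared side by side in R; the list name is an assumption.

```r
# Three datasets that differ only in their largest value
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

sapply(datasets, median)   # 2 2 2: unchanged by the extreme value
sapply(datasets, mean)     # 2 3 11: pulled toward the extreme value
```

Replacing the largest value changes the mean each time, while the median stays at the middle observation.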