Statistics with R

> x=11
> print(x)
[1] 11
> x
[1] 11
> X
Error: object 'X' not found
> y<-7
> y
[1] 7
> y<-9
> y
[1] 9

> ls()
[1] "x" "y"
> rm(y)
> y
Error: object 'y' not found
> y<-9
> x.1<-14
> x.1
[1] 14
> 1x<-22
Error: unexpected symbol in "1x"

Entering data with c
• c function for small datasets – combines or concatenates
terms together
Example: we have counts of the number of typing mistakes in a
word document:
0 2 1 3 2 0 1 1
To enter this into an R session, we type:
> typo=c(0,2,1,3,2,0,1,1)
> typo
[1] 0 2 1 3 2 0 1 1

Learning Objectives
• What is statistics?
• Become aware of the varied applications of statistics in
business.
• Differentiate between descriptive and inferential statistics.
• Identify types of variables.

Statistics in Business
• Accounting — auditing and cost estimation
• Economics — local, regional, national, and international
economic performance
• Finance — investments and portfolio management
• Management — human resources, compensation, and quality
management
• Management Information Systems — performance of systems
which gather, summarize, and disseminate information to
various managerial levels
• Marketing — market analysis and consumer research
• International Business — market and demographic analysis

What is Statistics?
• Science dealing with
collection, analysis, interpretation and
presentation of data (with a view to making
inferences)
• Branches of statistics:
– Descriptive – graphical or numerical summaries of
data
– Inferential – making a decision based on data

What is Statistics?

Statistics in business is the study of VARIATIONS

Population Versus Sample
• Population — the whole
– a collection of all persons, objects, or items under
study
• Census — gathering data from the entire population
• Sample — gathering data on a subset of the population
– Use information about the sample to infer about the
population

Population and Census Data
Identifier   Color   MPG
RD1          Red     12
RD2          Red     10
RD3          Red     13
RD4          Red     10
RD5          Red     13
BL1          Blue    27
BL2          Blue    24
GR1          Green   35
GR2          Green   35
GY1          Gray    15
GY2          Gray    18
GY3          Gray    17

Sample and Sample Data
Identifier   Color   MPG
RD2          Red     10
RD5          Red     13
GR1          Green   35
GY2          Gray    18

Population Versus Sample

[Figure: a random sample is selected from the population]
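Drawing a random sample can be sketched in R with `sample()`. This is only an illustration: the data frame name `cars_pop` and the seed are assumptions, and a random draw will generally not reproduce the four cars listed in the sample table.

```r
# Population of 12 cars from the census table (cars_pop is a hypothetical name)
cars_pop <- data.frame(
  id    = c("RD1", "RD2", "RD3", "RD4", "RD5", "BL1", "BL2",
            "GR1", "GR2", "GY1", "GY2", "GY3"),
  color = c(rep("Red", 5), rep("Blue", 2), rep("Green", 2), rep("Gray", 3)),
  mpg   = c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
)

set.seed(1)                               # arbitrary seed, only for repeatability
rows <- sample(nrow(cars_pop), size = 4)  # 4 row numbers drawn without replacement
cars_sample <- cars_pop[rows, ]
cars_sample
```

Using all 12 rows of `cars_pop` corresponds to a census; `cars_sample` is the subset used to infer about the population.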

Parameter vs. Statistic
• Parameter — descriptive measure of the population
– Usually represented by Greek letters
μ denotes population mean
σ² denotes population variance
σ denotes population standard deviation

• Statistic — descriptive measure of a sample
– Usually represented by Roman letters
x̄ denotes sample mean
s² denotes sample variance
s denotes sample standard deviation
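As a rough illustration of the distinction, the sketch below computes the parameters from the MPG values of the 12-car population and the corresponding statistics from the 4-car sample. The variable names are assumptions made for this example. R's `var()` and `sd()` divide by n - 1, so the population versions are computed by hand with divisor N.

```r
# MPG for the whole population of 12 cars and for the 4 sampled cars
pop_mpg    <- c(12, 10, 13, 10, 13, 27, 24, 35, 35, 15, 18, 17)
sample_mpg <- c(10, 13, 35, 18)

# Parameters (population): divide by N
N      <- length(pop_mpg)
mu     <- sum(pop_mpg) / N
sigma2 <- sum((pop_mpg - mu)^2) / N
sigma  <- sqrt(sigma2)

# Statistics (sample): var() and sd() divide by n - 1
x_bar <- mean(sample_mpg)
s2    <- var(sample_mpg)
s     <- sd(sample_mpg)

c(mu = mu, sigma2 = sigma2, sigma = sigma)
c(x_bar = x_bar, s2 = s2, s = s)
```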

Statistics in Business
• Inferences about parameters made under
conditions of uncertainty (which are always
present in statistics)
– Uncertainty can be caused by
• Randomness in selection of a sample
• Lack of knowledge about the source of the
inferences
• Change in conditions not accounted for

Variables and Data
Variable: a characteristic of any entity being studied that is
capable of taking on different values that can be used for
analysis
e.g. stock price, ROI, market share, age of worker, income of a
family, total sales, advertising cost, etc.

Measurement: done when a standard process is used
to assign numbers to particular characteristics of a
variable; the process may be obvious or defined
e.g. age is obvious, but ROI or labour productivity is defined

The source of each measurement is called a sampling unit
Data: recorded measurements

Levels of Data Measurement
What are 40 and 80? They may represent:
• Weights of two objects being shipped
• Ratings received in a consumer test by two
different products
• Football jersey numbers of a fullback and a centre forward
Appropriateness of data analysis depends
on the level of measurement of the data gathered

• Nominal - Qualitative data. Numbers are typically
used only to classify or categorize the
attribute; it is still useful to retain the original
verbal descriptions of the categories
– 1 for “male” and 2 for “female”
– Employee identification number
– Religion, geographic location, PIN code, place of
birth
– Demographic questions in a survey, etc.

• Ordinal - A variable is measured on an ordinal scale if
ranking or ordering is possible for values of
the variable.
– For example, a gold medal reflects superior
performance to a silver or bronze medal in the
Olympics. But can you say a gold and a bronze
medal average out to a silver medal?
– Preference scales are typically ordinal – how much
do you like this cereal? Like it a lot, somewhat like
it, neutral, somewhat dislike it, dislike it a lot.

• Interval - In interval measurement the
distance between attributes does have
meaning.
– Numerical data typically fall into this category
– For example, when measuring temperature (in
Fahrenheit), the distance from 30 to 40 is the same as
the distance from 70 to 80. The interval between
values is interpretable.

• Ratio - In ratio measurement there is always
a reference point that is meaningful (either 0
for rates or 1 for ratios)
– This means that you can construct a meaningful
fraction (or ratio) with a ratio variable.
– In applied social research most "count" variables
are ratio, for example, the number of clients in
the past six months.
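The four levels can be mirrored in R's data types. The sketch below is one possible mapping, not the only one; the variable names and the two client counts are hypothetical values chosen for illustration.

```r
# Nominal: categories with no order (labels kept instead of bare codes)
gender <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("male", "female"))

# Ordinal: ordered categories; ranking is meaningful, averaging is not
medal <- factor(c("gold", "bronze", "silver"),
                levels = c("bronze", "silver", "gold"), ordered = TRUE)

# Interval and ratio: ordinary numeric vectors
temp_f  <- c(30, 40, 70, 80)   # interval: differences are meaningful
clients <- c(12, 24)           # ratio (hypothetical counts): 24 is twice 12

table(gender)                  # counting categories is valid for nominal data
medal[1] > medal[2]            # gold ranks above bronze
diff(temp_f)[c(1, 3)]          # 30 to 40 and 70 to 80 are both 10 degrees
clients[2] / clients[1]        # a meaningful ratio
```

Calling `mean(medal)` returns `NA` with a warning, which matches the point that a gold and a bronze do not average out to a silver.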

Visualizing the data
• Construct a frequency distribution
– For both grouped and ungrouped data

• Construct graphical summaries of qualitative
data
• Construct graphical summaries of quantitative
data
• Construct graphical summaries of two
variables

Ungrouped vs. Grouped Data
• Ungrouped data
– have not been summarized in any way
– are also called raw data

• Grouped data
– logical groupings of data exist
• e.g. age ranges (20-29, 30-39, etc.)
– have been organized into a frequency distribution

Example of Ungrouped Data
42  26  32  34  57  30  58  37  50  30
53  40  30  47  49  50  40  32  31  40
52  28  23  35  25  30  36  32  26  50
55  30  58  64  52  49  33  43  46  32
61  31  30  40  60  74  37  29  43  54

Ages of a sample of managers from urban child care centres in the US

Frequency Distribution
• Frequency Distribution – summary of data
presented in the form of class intervals and
frequencies
– Vary in shape and design
– Constructed according to the individual
researcher's preferences

Frequency Distribution
• Steps in Frequency Distribution
– Step 1 – Determine range of frequency distribution
• Range is the difference between the highest and the lowest
numbers

– Step 2 – Determine the number of classes
• Do not use too many or too few classes

– Step 3 – Determine the width of the class interval
• Approx. class width can be calculated by dividing the range
by the number of classes
• Values fit into only one class

Frequency Distribution of Child
Care Managers' Ages

Class Interval   Frequency
20-under 30       6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70       3
70-under 80       1
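The frequency distribution can be reproduced from the raw ages with `cut()` and `table()`. This is a minimal sketch; the names `ages`, `breaks`, `classes`, and `freq` are chosen for this example.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

breaks <- seq(20, 80, by = 10)

# right = FALSE gives classes [20,30), [30,40), ..., i.e. "20-under 30"
classes <- cut(ages, breaks = breaks, right = FALSE)
freq <- table(classes)
freq
```

With `right = FALSE` a value such as 30 falls in the 30-under 40 class, so every value fits into exactly one class.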

Relative Frequency
Relative frequency is the proportion of the total frequency that
is in any given class interval in a frequency distribution.

Class Interval   Frequency   Relative Frequency
20-under 30       6           .12   (6/50)
30-under 40      18           .36   (18/50)
40-under 50      11           .22
50-under 60      11           .22
60-under 70       3           .06
70-under 80       1           .02
Total            50          1.00

Cumulative Frequency
Cumulative frequency is a running total of frequencies through
the classes of a frequency distribution.

Class Interval   Frequency   Cumulative Frequency
20-under 30       6            6
30-under 40      18           24   (18 + 6)
40-under 50      11           35   (11 + 24)
50-under 60      11           46
60-under 70       3           49
70-under 80       1           50
Total            50

Cumulative Relative Frequencies
Cumulative relative frequency is a running total of the relative
frequencies through the classes of a frequency distribution.

                             Relative    Cumulative   Cumulative Relative
Class Interval   Frequency   Frequency   Frequency    Frequency
20-under 30       6           .12         6            .12
30-under 40      18           .36        24            .48
40-under 50      11           .22        35            .70
50-under 60      11           .22        46            .92
60-under 70       3           .06        49            .98
70-under 80       1           .02        50           1.00
Total            50          1.00
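The relative, cumulative, and cumulative relative frequency columns can be derived from the class frequencies alone. A minimal sketch, with vector names chosen for this example:

```r
freq <- c("20-under 30" = 6, "30-under 40" = 18, "40-under 50" = 11,
          "50-under 60" = 11, "60-under 70" = 3, "70-under 80" = 1)

rel_freq     <- freq / sum(freq)   # proportion of the total in each class
cum_freq     <- cumsum(freq)       # running total of frequencies
cum_rel_freq <- cumsum(rel_freq)   # running total of relative frequencies

data.frame(freq, rel_freq, cum_freq, cum_rel_freq)
```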

Common Statistical Graphs
– Quantitative Data
• Histogram -- vertical bar chart of frequencies
• Frequency Polygon -- line graph of frequencies
• Ogive -- line graph of cumulative frequencies
• Dot Plots -- each data value is plotted
• Stem and Leaf Plot -- like a histogram, but
shows individual data values. Useful for small
data sets.

Histogram
• A histogram is a graphical summary of a
frequency distribution
• Constructed by labeling the x-axis with class endpoints and the
y-axis with frequencies, then drawing a horizontal line
between the two class endpoints at each frequency
value
• The number and location of rectangles (bars)
should be determined based on the sample
size and the range of the data

Data Range
42  26  32  34  57  30  58  37  50  30
53  40  30  47  49  50  40  32  31  40
52  28  23  35  25  30  36  32  26  50
55  30  58  64  52  49  33  43  46  32
61  31  30  40  60  74  37  29  43  54

Smallest = 23
Largest = 74

Range = Largest - Smallest
= 74 - 23
= 51

Number of Classes and Class Width
• The number of classes should be between 5 and 15.
– Fewer than 5 classes cause excessive summarization.
– More than 15 classes leave too much detail.

• Class Width
– Divide the range by the number of classes for an
approximate class width
– Round up to a convenient number
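The class-width rule can be worked through for the ages data. In this sketch the choice of 6 classes and the rounded width of 10 are assumptions made to match the age classes used in these slides; other choices between 5 and 15 classes would also be acceptable.

```r
largest  <- 74
smallest <- 23
data_range <- largest - smallest          # 51

n_classes    <- 6                         # assumed number of classes
approx_width <- data_range / n_classes    # 8.5
class_width  <- 10                        # rounded up to a convenient number

# Class endpoints, starting at a convenient value at or below the minimum
start     <- floor(smallest / class_width) * class_width
endpoints <- seq(start, by = class_width, length.out = n_classes + 1)
endpoints
```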

Class midpoint or Class mark
The midpoint of each class interval is called the
class midpoint or the class mark.

Midpoints for Age Classes

                                        Relative    Cumulative
Class Interval   Frequency   Midpoint   Frequency   Frequency
20-under 30       6          25          .12         6
30-under 40      18          35          .36        24
40-under 50      11          45          .22        35
50-under 60      11          55          .22        46
60-under 70       3          65          .06        49
70-under 80       1          75          .02        50
Total            50                     1.00

Histogram
A graphical display of class frequencies

Class Interval   Frequency
20-under 30       6
30-under 40      18
40-under 50      11
50-under 60      11
60-under 70       3
70-under 80       1

[Figure: histogram of the class frequencies; x-axis: Years, y-axis: Frequency]

Frequency Polygon
[Figure: frequency polygon of the same class frequencies; x-axis: Years (0 to 80), y-axis: Frequency]
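Both graphs can be drawn in base R from the raw ages. This is a sketch under the same class choices as the table (width 10, classes closed on the left); the axis labels and title are assumptions.

```r
ages <- c(42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
          53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
          52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
          55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
          61, 31, 30, 40, 60, 74, 37, 29, 43, 54)

# Histogram with classes 20-under 30, ..., 70-under 80
h <- hist(ages, breaks = seq(20, 80, by = 10), right = FALSE,
          xlab = "Years", ylab = "Frequency", main = "Ages of managers")
h$counts   # class frequencies
h$mids     # class midpoints

# Frequency polygon: frequencies plotted at the class midpoints and joined by lines
plot(h$mids, h$counts, type = "o", xlab = "Years", ylab = "Frequency",
     main = "Frequency Polygon")
```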

Relative Frequency Ogive
Class Interval   Cumulative Relative Frequency
20-under 30       .12
30-under 40       .48
40-under 50       .70
50-under 60       .92
60-under 70       .98
70-under 80      1.00
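An ogive for these values could be drawn as below. The sketch assumes the usual convention of plotting each cumulative value at the upper endpoint of its class and starting the curve at zero at the lower endpoint of the first class.

```r
upper_endpoint <- seq(30, 80, by = 10)
cum_rel_freq   <- c(.12, .48, .70, .92, .98, 1.00)

# Start at the lower endpoint of the first class with cumulative value 0
x <- c(20, upper_endpoint)
y <- c(0, cum_rel_freq)

plot(x, y, type = "o", xlab = "Years",
     ylab = "Cumulative relative frequency",
     main = "Relative Frequency Ogive")
```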

Stem and Leaf plot:
Safety Examination Scores for Plant Trainees

Raw Data
86  77  91  60  55
76  92  47  88  67
23  59  72  75  83
77  68  82  97  89
81  75  74  39  67
79  83  70  78  91
68  49  56  94  81

Stem   Leaf
2      3
3      9
4      79
5      569
6      07788
7      0245567789
8      11233689
9      11247
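Base R provides `stem()` for this kind of display. The sketch below also builds the stems and leaves by hand so the construction is visible; the name `scores` is chosen for this example, and the layout printed by `stem()` may differ from the table above depending on its default scale.

```r
scores <- c(86, 77, 91, 60, 55, 76, 92, 47, 88, 67, 23, 59, 72, 75, 83,
            77, 68, 82, 97, 89, 81, 75, 74, 39, 67, 79, 83, 70, 78, 91,
            68, 49, 56, 94, 81)

stem(scores)

# The same display built by hand: tens digit as stem, units digit as leaf
stems   <- scores %/% 10
leaves  <- scores %% 10
display <- tapply(leaves, stems, function(l) paste(sort(l), collapse = ""))
display
```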

Stem and Leaf plot: Construction
[Figure: the raw data and the stem-and-leaf display for the safety examination scores, with callouts marking the stem and the leaf. The tens digit of each score is its stem and the units digit is its leaf; for example, 86 has stem 8 and leaf 6.]

Histogram vs. Stem and Leaf?
• So, which one should you use?
• A Stem and Leaf plot is useful for small data
sets. It shows the values of the data points.
• A histogram forgoes seeing the individual
values of the data for the bigger picture of the
distribution of the data
• The purpose of these graphs is to summarize a
set of data. As long as that need is met, either
one is okay to use.

Common Statistical Graphs
– Qualitative Data
• Pie Chart -- proportional representation for
categories of a whole
• Bar Chart – frequency or relative frequency of
one or more categorical variables

Complaints by Amtrak Passengers
COMPLAINT           NUMBER   PROPORTION   DEGREES
Stations, etc.      28,000      .40        144.0
Train Performance   14,700      .21         75.6
Equipment           10,500      .15         54.0
Personnel            9,800      .14         50.4
Schedules, etc.      7,000      .10         36.0
Total               70,000     1.00        360.0

[Figure: pie chart of complaints by Amtrak passengers]
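The proportion and degrees columns follow directly from the counts: each proportion is the count divided by the total, and each angle is the proportion times 360. A minimal sketch, with names chosen for this example:

```r
complaints <- c("Stations, etc." = 28000, "Train Performance" = 14700,
                "Equipment" = 10500, "Personnel" = 9800,
                "Schedules, etc." = 7000)

proportion <- complaints / sum(complaints)
degrees    <- proportion * 360      # slice angle for each category

data.frame(number = complaints, proportion, degrees)

pie(complaints, main = "Complaints by Amtrak Passengers")
```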

Second Quarter U.S. Truck Production
Second Quarter Truck Production in the U.S.
(Hypothetical values)

Company   2d Quarter Truck Production
A         357,411
B         354,936
C         160,997
D          34,099
E          12,747
Totals    920,190

[Figure: pie chart of second quarter U.S. truck production]

Pie Chart Calculations
for Company A

Company   2d Quarter Truck Production   Proportion   Degrees
A         357,411                        .388        140
B         354,936                        .386        139
C         160,997                        .175         63
D          34,099                        .037         13
E          12,747                        .014          5
Totals    920,190                       1.000        360

Vertical Bar Graphs or Column Charts
[Figure: column chart with series Kolkata, Mumbai, and Chennai for the years 2010 to 2013; vertical axis from 0 to 6]

Horizontal Bar Chart
[Figure: horizontal bar chart with series Kolkata, Mumbai, and Chennai for the years 2010 to 2013; horizontal axis from 0 to 6]

Pareto Chart
A Pareto chart is a bar chart, sorted from the most frequent to the
least frequent, overlaid with a cumulative line graph (like an ogive).
These data present the most common types of defects.

[Figure: Pareto chart of defect types (Poor Wiring, Short in Coil, Defective Plug, Other); left axis: Frequency (0 to 100), right axis: cumulative percentage (0% to 100%)]

Scatter Plot
Registered Vehicles   Gasoline Sales
(1000's)              (1000's of Gallons)
 5                     60
15                    120
 9                     90
15                    140
 7                     60

Common Statistical Graphs –
Comparing Two Variables
• Scatter Plot -- a type of display that uses Cartesian
coordinates to show the values of two variables for
a set of data.
– The data are displayed as a collection of points, each
having the value of one variable determining the
position on the horizontal axis and the value of the
other variable determining the position on the vertical
axis.
– A scatter plot is also called a scatter chart, scatter
diagram, or scatter graph.
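The registered vehicles and gasoline sales data can be plotted with base R's `plot()`. In this sketch the vector names are assumptions, and vehicles are placed on the horizontal axis.

```r
vehicles <- c(5, 15, 9, 15, 7)          # registered vehicles (1000's)
gasoline <- c(60, 120, 90, 140, 60)     # gasoline sales (1000's of gallons)

# Each point is one (vehicles, gasoline) pair
plot(vehicles, gasoline,
     xlab = "Registered Vehicles (1000's)",
     ylab = "Gasoline Sales (1000's of Gallons)",
     main = "Scatter Plot")
```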

Measures of Central Tendency
& Dispersion:
Learning Objectives

• Distinguish between measures of central
tendency, measures of variability, measures of
shape, and measures of association.
• Understand the meanings of
mean, median, mode, quartile, percentile, and
range.
• Compute
mean, median, mode, percentile, quartile, range,
variance, standard deviation, and mean absolute
deviation on ungrouped data.
• Differentiate between sample and population
variance and standard deviation.

Measures of Central Tendency
& Dispersion:
Learning Objectives - continued

• Understand the meaning of standard deviation as
it is applied by using the empirical rule and
Chebyshev’s theorem.
• Compute the mean, median, standard
deviation, and variance on grouped data.
• Understand box and whisker plots, skewness, and
kurtosis.
• Compute a coefficient of correlation and interpret
it.

Measures of Central Tendency:
Ungrouped Data
• Measures of central tendency yield information
about “the centre, or middle part, of a group of
numbers.”
• Measures of central tendency do not focus on the
span of the data set or how far values are from the
middle numbers
• Common Measures of Location
– Mode
– Median
– Mean
– Percentiles
– Quartiles

Mode
• Mode - the most frequently occurring value in a
data set
– Applicable to all levels of data measurement
(nominal, ordinal, interval, and ratio)
– Can be used to determine what categories occur most
frequently
– Sometimes, no mode exists (no duplicates)

• Bimodal – In a tie for the most frequently
occurring value, two modes are listed
• Multimodal -- Data sets that contain more than
two modes
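Base R has no built-in function for the statistical mode (its `mode()` reports an object's storage type), so a small helper is needed. `stat_mode` below is a hypothetical helper written for this example; it returns the modal values as character labels and `NULL` when no value repeats.

```r
# stat_mode is a hypothetical helper, not part of base R
stat_mode <- function(x) {
  counts <- table(x)
  if (max(counts) == 1) return(NULL)      # no duplicates: no mode exists
  names(counts)[counts == max(counts)]    # every value tied for most frequent
}

typo <- c(0, 2, 1, 3, 2, 0, 1, 1)
stat_mode(typo)                      # one mode
stat_mode(c(1, 2, 2, 3, 3))          # a tie: two modes (bimodal)
stat_mode(c("red", "blue", "red"))   # also works for nominal data
stat_mode(c(1, 2, 3))                # NULL: no mode
```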

Median
• Median - middle value in an ordered array of
numbers.
– Half the data are above it, half the data are below it
– Mathematically, it is the (n+1)/2 th ordered
observation
• For an array with an odd number of terms, the median is
the middle number
– n=11 => (n+1)/2 th = 12/2 th = 6th ordered observation

• For an array with an even number of terms the median is
the average of the middle two numbers
– n=10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th
ordered observation

Arithmetic Mean
• Mean is the average of a group of numbers
• Applicable for interval and ratio data
• Not applicable for nominal or ordinal data
• Affected by each value in the data
set, including extreme values
• Computed by summing all values in the data
set and dividing the sum by the number of
values in the data set

Demonstration Problem
The number of U.S. cars in service by top car rental
companies in a recent year according to Auto Rental
News follows.
Company / Number of Cars in Service
Enterprise 643,000; Hertz 327,000; National/Alamo
233,000; Avis 204,000; Dollar/Thrifty 167,000; Budget
144,000; Advantage 20,000; U-Save 12,000; Payless
10,000; ACE 9,000; Fox 9,000; Rent-A-Wreck 7,000;
Triangle 6,000
Compute the mode, the median, and the mean.

Demonstration Problem
Solution

Mode: 9,000 (two companies with 9,000 cars in
service)

Median: With 13 different companies in this
group, N = 13. The median is located at the
(13 + 1)/2 = 7th position. Because the data are
already ordered, the median is the 7th term, which is
20,000.
Mean: μ = ∑x/N = 1,791,000/13 = 137,769.23
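The solution can be checked in R. This is a minimal sketch; the vector name `cars` is an assumption, and the mode is found by tabulating the values because base R has no statistical mode function.

```r
cars <- c(Enterprise = 643000, Hertz = 327000, "National/Alamo" = 233000,
          Avis = 204000, "Dollar/Thrifty" = 167000, Budget = 144000,
          Advantage = 20000, "U-Save" = 12000, Payless = 10000,
          ACE = 9000, Fox = 9000, "Rent-A-Wreck" = 7000, Triangle = 6000)

N      <- length(cars)        # 13 companies
sorted <- sort(cars)
pos    <- (N + 1) / 2         # position of the median in the ordered data
sorted[pos]                   # the 7th ordered observation

median(cars)
mean(cars)                    # sum(cars) / N

counts <- table(cars)
names(counts)[counts == max(counts)]   # most frequent value
```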

Percentile
• Percentile - measures of central tendency that divide a
group of data into 100 parts
• At least n% of the data lie at or below the nth
percentile, and at most (100 - n)% of the data lie
above the nth percentile
• Example: the 90th percentile indicates that at least 90% of the
data are equal to or less than it, and at most 10% of the data
lie above it

Calculating Percentiles
• To calculate the pth percentile,
– Order the data from smallest to largest
– Calculate i = N (p/100)
– Determine the percentile
• If i is a whole number, then use the average of the
ith and (i+1)th ordered observations
• Otherwise, round i up to the next whole
number and use that ordered observation
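The location rule can be written as a short function. `pctl` is a hypothetical helper for this example and assumes 0 < p < 100; it is applied here to the typing-mistake counts entered earlier with `c()`.

```r
# pctl is a hypothetical helper implementing the rule on this slide
pctl <- function(x, p) {
  stopifnot(p > 0, p < 100)
  x <- sort(x)                      # order the data
  i <- length(x) * p / 100          # i = N (p/100)
  if (i == floor(i)) {
    mean(x[c(i, i + 1)])            # whole number: average ith and (i+1)th
  } else {
    x[ceiling(i)]                   # otherwise round i up and take that value
  }
}

typo <- c(0, 2, 1, 3, 2, 0, 1, 1)   # ordered: 0 0 1 1 1 2 2 3
pctl(typo, 30)    # i = 2.4, so the 3rd ordered observation
pctl(typo, 75)    # i = 6, so the average of the 6th and 7th
pctl(typo, 90)    # i = 7.2, so the 8th ordered observation
```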

Quartiles
• Quartile - measures of central tendency that divide a
group of data into four subgroups
• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile

[Figure: ordered data divided by Q1, Q2, and Q3 into four parts, each containing 25% of the data]

Quartiles for Demonstration Problem

For the cars in service data, n=13, so
Q1: i = 13 (25/100) = 3.25, so use the 4th ordered observation
Q1 = 9,000
Q3: i = 13 (75/100) = 9.75, so use the 10th ordered observation
Q3 = 204,000
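The same quartile locations can be computed step by step in R. The names below are chosen for this example. R's built-in `quantile()` supports several definitions; `type = 2` is assumed here to be the closest match to the rule used on these slides, and it is shown for comparison only.

```r
cars <- c(643, 327, 233, 204, 167, 144, 20, 12, 10, 9, 9, 7, 6) * 1000

sorted <- sort(cars)          # ascending order
n      <- length(sorted)      # 13

i_q1 <- n * 25 / 100          # 3.25, not whole, so round up to 4
i_q3 <- n * 75 / 100          # 9.75, not whole, so round up to 10

q1 <- sorted[ceiling(i_q1)]
q3 <- sorted[ceiling(i_q3)]
c(Q1 = q1, Q3 = q3)

# For comparison only (assumption: type = 2 mirrors the slide's rule)
quantile(cars, c(0.25, 0.75), type = 2)
```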

Which Measure Do I Use?
• Which measure of central tendency is most
appropriate?
– In general, the mean is preferred, since it has nice
mathematical properties that we shall discuss later
– The median and quartiles are resistant to outliers

• Consider the following three datasets
– 1, 2, 3 (median=2, mean=2)
– 1, 2, 6 (median=2, mean=3)
– 1, 2, 30 (median=2, mean=11)
– All have median=2, but the mean is sensitive to the outliers

• In general, if there are outliers, the median is preferred
to the mean
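The three datasets can be compared directly in R; the list name `datasets` is chosen for this example.

```r
datasets <- list(c(1, 2, 3), c(1, 2, 6), c(1, 2, 30))

medians <- sapply(datasets, median)   # unchanged by the largest value
means   <- sapply(datasets, mean)     # pulled upward as the largest value grows

rbind(median = medians, mean = means)
```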
To be continued
