Book001(statweb.blogspot.com)
1. Fundamentals of some Basic Statistical Definitions
Basic Definitions

Statistical Inference

Statistical inference makes use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken.

Experiment

An experiment is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Example
Before introducing a new drug treatment to reduce high blood pressure, the manufacturer carries out an experiment to compare the effectiveness of the new drug with that of one currently prescribed. Newly diagnosed subjects are recruited from a group of local general practices. Half of them are chosen at random to receive the new drug, the remainder receiving the present one. So, the researcher has control over the type of subject recruited and the way in which they are allocated to treatment.

Experimental (or Sampling) Unit

A unit is a person, animal, plant or thing which is actually studied by a researcher; the basic objects upon which the study or experiment is carried out. For example, a person; a monkey; a sample of soil; a pot of seedlings; a postcode area; a doctor's practice.

Sample

A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.

A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling. Also, before collecting the sample, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

Example
The population for a study of infant health might be all children born in the UK in the 1980's. The sample might be all babies born on 7th May in any of the years.

Parameter

A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

Within a population, a parameter is a fixed value which does not vary. Each sample drawn from the population has its own value of any statistic that is used to estimate this parameter. For example, the mean of the data in a sample is used to give information about the overall mean in the population from which that sample was drawn.

Parameters are often assigned Greek letters (e.g. σ), whereas statistics are assigned Roman letters (e.g. s).
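As a rough illustration of the point that a parameter is fixed while each sample has its own value of the statistic, the sketch below builds a made-up population and draws three simple random samples from it. The population (10,000 simulated blood pressure readings), the sample size of 50 and all variable names are assumptions chosen for the example, not part of the definitions above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated systolic blood pressure readings (mmHg).
population = [random.gauss(120, 15) for _ in range(10_000)]

# The parameter: a single fixed value for the whole population.
population_mean = statistics.mean(population)

# Three simple random samples of 50 units each, drawn without replacement.
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(3)]

print(f"population mean (parameter): {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic):  {m:.2f}")
```

Each run of `random.sample` gives a different sample mean, all scattered around the one population mean.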
K.MANOJ.M.Sc.,M.phil.,D.C.A., Page 1
Statistic

A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

It is possible to draw more than one sample from the same population and the value of a statistic will in general vary from sample to sample. For example, the average value in a sample is a statistic. The average values in more than one sample, drawn from the same population, will not necessarily be equal.

Statistics are often assigned Roman letters (e.g. m and s), whereas the equivalent unknown values in the population (parameters) are assigned Greek letters (e.g. µ and σ).

Sampling Distribution

The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.

The sampling distribution is the probability distribution or probability density function of the statistic.

Derivation of the sampling distribution is the first step in calculating a confidence interval or carrying out a hypothesis test for a parameter.

Example
Suppose that x1, ..., xn are a simple random sample from a normally distributed population with expected value µ and known variance σ². Then the sample mean x̄ is a statistic used to give information about the population parameter µ; x̄ is normally distributed with expected value µ and variance σ²/n.

Estimate

An estimate is an indication of the value of an unknown quantity based on observed data.

More formally, an estimate is the particular value of an estimator that is obtained from a particular sample of data and used to indicate the value of a parameter.

Example
Suppose the manager of a shop wanted to know the mean expenditure of customers in her shop in the last year. She could calculate the average expenditure of the hundreds (or perhaps thousands) of customers who bought goods in her shop, that is, the population mean. Instead she could use an estimate of this population mean by calculating the mean of a representative sample of customers. If this value was found to be £25, then £25 would be her estimate.

Estimator

An estimator is any quantity calculated from the sample data which is used to give information about an unknown quantity in the population. For example, the sample mean is an estimator of the population mean.

Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example,
σ = true population standard deviation
σ̂ = estimated (from a sample) population standard deviation

Example
The usual estimator of the population mean is

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where n is the size of the sample and X1, X2, X3, ..., Xn are the values of the sample.

If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.

Estimation

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population.

Results of estimation can be expressed as a single value, known as a point estimate, or a range of values, known as a confidence interval.

Discrete Data

A set of data is said to be discrete if the values / observations belonging to it are distinct and separate, i.e. they can be counted (1, 2, 3, ...). Examples might include the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Compare continuous data.

Categorical Data

A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to colour: the characteristic 'colour' can have non-overlapping categories 'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender' with categories 'male' and 'female'.

Categories should be chosen carefully since a bad choice can prejudice the outcome of an investigation. Every value should belong to one and only one category, and there should be no doubt as to which one.

Nominal Data

A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

Ordinal Data

A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached. You can count and order, but not measure, ordinal data.
The categories for an ordinal set of data have a natural order. For example, suppose a group of people were asked to taste varieties of biscuit and classify each biscuit on a rating scale of 1 to 5, representing strongly dislike, dislike, neutral, like, strongly like. A rating of 5 indicates more enjoyment than a rating of 4, so such data are ordinal.

However, the distinction between neighbouring points on the scale is not necessarily always the same. For instance, the difference in enjoyment expressed by giving a rating of 2 rather than 1 might be much less than the difference in enjoyment expressed by giving a rating of 4 rather than 3.

Interval Scale

An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval scales include the heights of tides, and the measurement of longitude.

Continuous Data

A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example height, weight, temperature, the amount of sugar in an orange, the time required to run a mile.

Compare discrete data.

Frequency Table

A frequency table is a way of summarising a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category.

A frequency table is used to summarise categorical, nominal, and ordinal data. It may also be used to summarise continuous data once the data set has been divided up into sensible groups.

When we have more than one categorical variable in our data set, a frequency table is sometimes called a contingency table because the figures found in the rows are contingent upon (dependent upon) those found in the columns.

Example
Suppose that in thirty shots at a target, a marksman makes the following scores:

5 2 2 3 4   4 3 2 0 3   0 3 2 1 5
1 3 1 5 5   2 4 0 0 4   5 4 4 5 5

The frequencies of the different scores can be summarised as:
Score   Frequency   Frequency (%)
0       4           13%
1       3           10%
2       5           17%
3       5           17%
4       6           20%
5       7           23%

Pie Chart

A pie chart is a way of summarising a set of categorical data. It is a circle which is divided into segments. Each segment represents a particular category. The area of each segment is proportional to the number of cases in that category.

Example
Suppose that, in the last year, a sportswear manufacturer has spent 6.5 million pounds on advertising its products; 3 million has been spent on television adverts, 2 million on sponsorship, 1 million on newspaper adverts, and a half million on posters. This spending can be summarised using a pie chart:

[Figure: pie chart of advertising spending, not reproduced]

Bar Chart

A bar chart is a way of summarising a set of categorical data. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It displays the data using a number of rectangles, of the same width, each of which represents a particular category. The length (and hence area) of each rectangle is proportional to the number of cases in the category it represents, for example, age group, religious affiliation.

Bar charts are used to summarise nominal or ordinal data.

Bar charts can be displayed horizontally or vertically and they are usually drawn with a gap between the bars (rectangles), whereas the bars of a histogram are drawn immediately next to each other.
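The frequency table for the marksman's thirty scores can be rebuilt by counting how often each score occurs. The sketch below is one possible way to do it with Python's standard library; the variable names are illustrative only.

```python
from collections import Counter

# The thirty scores from the marksman example, in the groups of five as given.
groups = ["52234", "43203", "03215", "13155", "24004", "54455"]
scores = [int(ch) for group in groups for ch in group]

counts = Counter(scores)   # score -> number of occurrences
total = len(scores)

print("Score  Frequency  Frequency (%)")
for score in sorted(counts):
    pct = 100 * counts[score] / total
    print(f"{score:>5}  {counts[score]:>9}  {pct:>12.0f}%")
```

The same counts are what a bar chart or pie chart of these scores would display, as bar lengths or segment areas.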
Dot Plot

A dot plot is a way of summarising data, often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form.

For nominal or ordinal data, a dot plot is similar to a bar chart, with the bars replaced by a series of dots. Each dot represents a fixed number of individuals. For continuous data, the dot plot is similar to a histogram, with the rectangles replaced by dots.

A dot plot can also help detect any unusual observations (outliers), or any gaps in the data set.

Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

Compare bar chart.

Stem and Leaf Plot

A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.

A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures, e.g. in the skinfold thickness example below:

[Figure: stem and leaf plot of skinfold thickness data, not reproduced]
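Since the skinfold thickness figure is not available here, the following is a minimal sketch of how a stem and leaf plot can be built. It assumes non-negative whole numbers, uses the tens digit as the stem and the units digit as the leaf, and borrows the twenty values from the even-sized median example elsewhere in this document rather than the skinfold data. The function name `stem_and_leaf` is made up for the illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows (stem = tens, leaf = units) for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest so that empty stems (gaps) stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows

data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
        71, 30, 91, 6, 47, 50, 65, 43, 41, 7]

for line in stem_and_leaf(data):
    print(line)
```

Reading the leaves row by row gives the data in order of magnitude, and the empty stem for the tens shows a gap in the data.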
We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.

Box and Whisker Plot (or Boxplot)

A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

5-Number Summary

A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.

Outlier

An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others.
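The 5-number summary described above can be computed directly, and comparing the extremes with the quartiles is one quick way to spot values that sit far from the rest. The sketch below uses `statistics.quantiles` with the "inclusive" method; quartile conventions differ between textbooks and packages, so that choice is an assumption of this example. The data are the 21 values from the odd-sized median example elsewhere in this document.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
        4, 6, 13, 34, 74, 65, 42, 28, 54, 69]

summary = five_number_summary(data)
print("min, Q1, median, Q3, max =", summary)
```

These five values are exactly what a box and whisker plot draws: the whisker ends, the box edges and the line inside the box.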
An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.

If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.

Symmetry

Symmetry is implied when data values are distributed in the same way above and below the middle of the sample.

Symmetrical data sets:

a. are easily interpreted;
b. allow a balanced attitude to outliers, that is, those above and below the middle value (median) can be considered by the same criteria;
c. allow comparisons of spread or dispersion with similar data sets.

Many standard statistical techniques are appropriate only for a symmetric distributional form. For this reason, attempts are often made to transform skewed data so that they become roughly symmetric.

Skewness

Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the distribution tend to be further from the 'middle' than values on the other side.

For skewed data, the usual measures of location will give different values, for example, mode < median < mean would indicate positive (or right) skewness.

Positive (or right) skewness is more common than negative (or left) skewness.

If there is evidence of skewness in the data, we can apply transformations, for example, taking logarithms of positive skew data.

Compare symmetry.

Transformation to Normality

If there is evidence of marked non-normality then we may be able to remedy this by applying suitable transformations.

The more commonly used transformations which are appropriate for data which are skewed to the right (positive skew) are, in order of increasing strength, sqrt(x), log(x) and 1/x, where the x's are the data values.

The more commonly used transformations which are appropriate for data which are skewed to the left (negative skew) are, in order of increasing strength, squaring, cubing, and exp(x).
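To see the effect of such transformations on right-skewed data, the sketch below compares skewness before and after taking square roots and logarithms. The text does not specify a numerical measure of skewness, so the moment coefficient used here (third central moment divided by the 1.5 power of the second) is an assumption of this example, as are the made-up data values.

```python
import math

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (zero for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: all values positive, with a long upper tail.
x = [1, 2, 3, 4, 6, 9, 14, 22, 35, 60, 110]

skew_raw = skewness(x)
skew_sqrt = skewness([math.sqrt(v) for v in x])
skew_log = skewness([math.log(v) for v in x])

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  log: {skew_log:.2f}")
```

For these data the square root reduces the positive skew and the logarithm reduces it further, which matches the ordering of the transformations by strength.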
Scatter Plot

A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line. It gives a good visual picture of the relationship between the two variables, and aids the interpretation of the correlation coefficient or regression model.

Each unit contributes one point to the scatterplot, on which points are plotted but not joined. The resulting pattern indicates the type and strength of the relationship between the two variables.

Illustrations

a. The more the points tend to cluster around a straight line, the stronger the linear relationship between the two variables (the higher the correlation).
b. If the line around which the points tend to cluster runs from lower left to upper right, the relationship between the two variables is positive (direct).
c. If the line around which the points tend to cluster runs from upper left to lower right, the relationship between the two variables is negative (inverse).
d. If there exists a random scatter of points, there is no relationship between the two variables (very low or zero correlation).
e. Very low or zero correlation could result from a non-linear relationship between the variables. If the relationship is in fact non-linear (points clustering around a curve, not a straight line), the correlation coefficient will not be a good measure of the strength.

A scatterplot will also show up a non-linear relationship between the two variables and whether or not there exist any outliers in the data.

More information can be added to a two-dimensional scatterplot - for example, we might label points with a code to indicate the level of a third variable.

If we are dealing with many variables in a data set, a way of presenting all possible scatter plots of two variables at a time is in a scatterplot matrix.

Sample Mean

The sample mean is an estimator available for estimating the population mean µ. It is a measure of location, commonly called the average, often symbolised x̄.

Its value depends equally on all of the data which may include outliers. It may not appear representative of the central region for skewed data sets.
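A small sketch of how strongly a single outlier can pull the sample mean. The numbers are made up for illustration, and the median (defined later in this glossary) is printed alongside only as a point of comparison.

```python
import statistics

# Hypothetical measurements, symmetric around 23.
clean = [21, 22, 23, 24, 25]
# The same data with one extreme value appended (an outlier).
with_outlier = clean + [250]

for label, data in [("without outlier", clean), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {statistics.mean(data):.2f}, "
          f"median = {statistics.median(data):.2f}")
```

With the outlier included, the mean moves far above every one of the original five values, while the middle of the data has barely changed.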
It is especially useful as being representative of the whole sample for use in subsequent calculations.

Example
Let's say our data set is: 5 3 54 93 83 22 17 19.

The sample mean is calculated by taking the sum of all the data values and dividing by the total number of data values:

$$\bar{x} = \frac{5 + 3 + 54 + 93 + 83 + 22 + 17 + 19}{8} = \frac{296}{8} = 37$$

Median

The median is the value halfway through the ordered data set, below and above which there lies an equal number of data values.

It is generally a good descriptive measure of the location which works well for skewed data, or data with outliers.

The median is the 0.5 quantile.

Example
With an odd number of data values, for example 21, we have:

Data:          96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69
Ordered data:  4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99
Median:        48, leaving ten values below and ten values above

With an even number of data values, for example 20, we have:

Data:          57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50 65 43 41 7
Ordered data:  2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65 71 85 91 94
Median:        Halfway between the two 'middle' data points - in this case halfway between 47 and 49, and so the median is 48

Mode

The mode is the most frequently occurring value in a set of discrete data. There can be more than one mode if two or more values are equally common.

Example
Suppose the results of an end of term Statistics exam were distributed as follows:

Student   Score
1         94
2         81
3         56
4         90
5         70
6         65
7         90
8         90
9         30

Then the mode (most common score) is 90, and the median (middle score) is 81.
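The worked examples for the sample mean, median and mode can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative, and the last list is a made-up data set added to show the case of more than one mode.

```python
import statistics

mean_data = [5, 3, 54, 93, 83, 22, 17, 19]
odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
            4, 6, 13, 34, 74, 65, 42, 28, 54, 69]
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
             71, 30, 91, 6, 47, 50, 65, 43, 41, 7]
exam_scores = [94, 81, 56, 90, 70, 65, 90, 90, 30]

print(statistics.mean(mean_data))      # 37   (296 / 8)
print(statistics.median(odd_data))     # 48   (the 11th of 21 ordered values)
print(statistics.median(even_data))    # 48.0 (halfway between 47 and 49)
print(statistics.mode(exam_scores))    # 90
print(statistics.median(exam_scores))  # 81

# Hypothetical data with two equally common values: both are modes.
tied = [1, 1, 2, 2, 3]
print(statistics.multimode(tied))      # [1, 2]
```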