TARUN GEHLOT (B.E, CIVIL) (HONOURS)
Describing and Exploring Data
Once data have been collected, the raw numbers must be summarized in
some way to make them more informative. Several options are available,
including plotting the data and calculating descriptive statistics.
Plotting Data
Often, the first thing one does with a set of raw data is to plot a frequency distribution.
Usually this is done by first creating a table of the frequencies broken down by values of
the relevant variable; the frequencies in the table are then plotted in a histogram.
Example: Your age as estimated by the questionnaire from the first class
TABLE 2.1
Age Frequency
18 3
19 10
20 14
21 10
22 5
23 2
24 1
25 1
26 2
Note: The frequencies in the above table were calculated by simply counting the number
of subjects having the specified value for the age variable
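As a rough illustration of the counting step, the Python sketch below rebuilds a list of ages from Table 2.1 and tallies it again. The raw questionnaire responses are not reproduced in these notes, so the `ages` list is an assumption reconstructed from the table, and all variable names are illustrative.

```python
from collections import Counter

# Frequencies from Table 2.1 (age -> number of students).
table_2_1 = {18: 3, 19: 10, 20: 14, 21: 10, 22: 5, 23: 2, 24: 1, 25: 1, 26: 2}

# Assumption: rebuild a raw data list from the table, since the original
# questionnaire responses are not reproduced in these notes.
ages = [age for age, count in table_2_1.items() for _ in range(count)]

# Counting how many subjects have each value gives the frequency table back.
frequencies = Counter(ages)

# A text histogram: one '#' per subject.
for age in sorted(frequencies):
    print(f"{age:>3} {'#' * frequencies[age]}")
```

Each row of `#` characters plays the role of one bar in the histogram.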
Histogram
Grouping Data:
Plotting is easy when the variable of interest has a relatively small number of values (as
our age variable did). However, the values of a variable are sometimes more continuous,
resulting in uninformative frequency plots if done in the above manner. For example, our
weight variable ranges from 100 lb. to 200 lb. If we used the previously described
technique, we would end up with about 100 bars, most of them with a frequency below 2 or
3 (and many with a frequency of zero). We can get around this problem by grouping our
values into bins. Aim for around 10 bins with natural split points (here, bins 10 lb. wide).
Example: Binning our weight variable
TABLE 2.2
Weight Bin Midpoint Frequency
100-109 104.5 6
110-119 114.5 10
120-129 124.5 6
130-139 134.5 10
140-149 144.5 5
150-159 154.5 3
160-169 164.5 4
170-179 174.5 1
180-189 184.5 0
190-199 194.5 2
200-209 204.5 1
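The binning in Table 2.2 can be sketched in a few lines of Python. This is only an illustration: the individual weights are an assumption, read off the stem & leaf plot given later in these notes, and `bin_start` is a hypothetical helper name.

```python
from collections import Counter

# Assumption: individual weights (lb) read off the stem & leaf plot in these notes.
weights = [
    100, 105, 107, 107, 108, 108,
    110, 110, 110, 111, 112, 113, 115, 115, 115, 118,
    120, 120, 121, 125, 125, 125,
    130, 130, 130, 132, 132, 134, 134, 135, 135, 135,
    140, 140, 140, 140, 145,
    150, 150, 155,
    162, 162, 165, 165,
    170,
    190, 195,
    200,
]

BIN_WIDTH = 10

def bin_start(value, width=BIN_WIDTH):
    """Lower edge of the bin containing value (e.g. 134 -> 130)."""
    return (value // width) * width

counts = Counter(bin_start(w) for w in weights)

# Include empty bins so that gaps in the distribution stay visible.
for start in range(100, 210, BIN_WIDTH):
    end = start + BIN_WIDTH - 1
    midpoint = (start + end) / 2
    print(f"{start}-{end} {midpoint:>6} {counts[start]:>3}")
```

Printing the empty 180-189 bin as well keeps the horizontal axis evenly spaced, which matters when the table is turned into a histogram.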
Histogram
Stem & Leaf Plots
If values of a variable must be grouped prior to creating a frequency plot, then the
information about the specific values is lost in the process (i.e., the resulting
graph depicts only the frequencies associated with the grouped values). However,
it is possible to obtain the graphical advantage of grouping and still keep all of the
information if stem & leaf plots are used.
These plots are created by splitting each data point into the part associated with the "group"
(the stem) and the part associated with the individual point (the leaf). For example, the numbers 180, 180, 181,
182, 185, 186, 187, 187, 189 could be represented as:
18 001256779
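The splitting rule can be sketched in Python as follows. This is a minimal version that assumes non-negative whole numbers with a single-digit leaf; the function name `stem_and_leaf` is illustrative, and stems with no data are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (value // 10) to a string of its leaves (value % 10)."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(str(value % 10))
    return {stem: "".join(digits) for stem, digits in leaves.items()}

data = [180, 180, 181, 182, 185, 186, 187, 187, 189]
for stem, leaf in stem_and_leaf(data).items():
    print(stem, leaf)  # prints: 18 001256779
```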
Thus, we could represent our weight data in the following stem & leaf plot:
Stem & Leaf
10 057788
11 0001235558
12 001555
13 0002244555
14 00005
15 005
16 2255
17 0
18
19 05
20 0
Stem & leaf plots are especially nice for comparing distributions:
Males   Stem   Females
    8    10    05778
         11    0001235558
         12    001555
 5440    13    002255
   00    14    005
   00    15    5
  522    16    5
    0    17
         18
   50    19
    0    20
Terminology Related to Distributions:
Often, frequency histograms have a roughly symmetrical bell shape; such
distributions are called normal or Gaussian.
Example: Our height distribution
Sometimes, the bell shape is not symmetrical.
The term positive skew refers to the situation where the "tail" of the distribution is to the
right; negative skew is when the "tail" is to the left.
Example: Pizza Data
See the text for other terminology
NOTATION
Variables
When we describe a set of data corresponding to the values of some variable, we will
refer to that set using an uppercase letter such as $X$ or $Y$
When we want to talk about specific data points within that set, we specify those points
by adding a subscript to the uppercase letter, as in $X_1$
For example:
5, 8, 12, 3, 6, 8, 7
$X_1, X_2, X_3, X_4, X_5, X_6, X_7$
Summation
The Greek letter sigma, which looks like $\Sigma$, means "add up" or "sum" whatever follows
it. Thus, $\sum X_i$ means "add up all the $X_i$'s". If we use the $X_i$s from the previous example,
$\sum X_i = 49$ (or just $\sum X$)
Note that sometimes the $\Sigma$ has numbers above and below it. These numbers specify the
range over which to sum. For example, if we again use the $X_i$s from the previous
example, but now limit the summation to the first five values: $\sum_{i=1}^{5} X_i = 5 + 8 + 12 + 3 + 6 = 34$
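The two sums can be checked with a short Python sketch. It assumes the limited summation runs over the first five values, which is the only contiguous range of this dataset that reproduces the total of 34 quoted above.

```python
# The seven data points from the example above; X[0] plays the role of X1.
X = [5, 8, 12, 3, 6, 8, 7]

# Sum over every data point: the unrestricted sigma.
total = sum(X)

# Sum with limits i = 1 to 5. Python slices are zero-based and exclude the
# end index, so X[0:5] covers X1 through X5.
partial = sum(X[0:5])

print(total)    # 49
print(partial)  # 34
```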
Nasty Example:
Antic Real
TABLE 2.3
Student Mark #1 Mark #2
- X Y
1 82 84
2 66 51
3 70 72
4 81 56
5 61 73
Double Subscripts
Things become more complicated when capital letters (e.g., X) are
used to refer to entire datasets (as opposed to single variables) and multiple
subscripts are used to identify specific data points
TABLE 2.4
Student Week 1 Week 2 Week 3 Week 4 Week 5
1 7 6 4 2 2
2 3 4 4 3 4
3 3 4 5 4 6
$X_{24} = 3$ (student 2, week 4)
$\sum X$ or $\sum\sum X_{ij} = 61$
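A small Python sketch of the double-subscript notation, using Table 2.4 stored as a list of rows. The helper `x(i, j)` is a hypothetical name that mirrors the 1-based subscripts used in the notes (first subscript for the student, second for the week).

```python
# Table 2.4 as a list of rows: one row per student, one column per week.
X = [
    [7, 6, 4, 2, 2],  # student 1
    [3, 4, 4, 3, 4],  # student 2
    [3, 4, 5, 4, 6],  # student 3
]

def x(i, j):
    """Return X_ij using the 1-based subscripts of the notes (student i, week j)."""
    return X[i - 1][j - 1]

# Summing over both subscripts adds up every cell in the table.
grand_total = sum(sum(row) for row in X)

print(x(2, 4))      # 3
print(grand_total)  # 61
```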
Measures of Central Tendency
While distributions provide an overall picture of some dataset, it is sometimes desirable
to represent the entire dataset using descriptive statistics
The first descriptive statistics we will discuss are those used to indicate where the
centre of the distribution lies
There are, in fact, three different measures of central tendency
The first of these is called the mode
The mode is simply the value of the relevant variable that occurs most often (i.e., has the
highest frequency) in the sample
Note that if you have done a frequency histogram, you can often identify the mode
simply by finding the value with the highest bar
However, that will not work when grouping was performed prior to plotting the histogram
(although you can still use the histogram to identify the modal group, just not the modal
value).
Finding the mode:
Create a nongrouped frequency table as described previously, then identify the
value with the greatest frequency
For Example: Class Height
TABLE 2.5
Value Frequency Value Frequency
61 3 69 3
62 4 70 2
63 4 71 4
64 4 72 4
65 3 73 0
66 7 74 0
67 5 75 0
68 4 76 1
A second measure of central tendency is called the median
The median is the point corresponding to the score that lies in the middle of the
distribution (i.e., there are as many data points above the median as there are below the
median)
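To find the median, the data points must first be sorted into either ascending or
descending numerical order
The position of the median value can then be calculated using the following formula:
$\text{median location} = \frac{N + 1}{2}$
For example:
• If there is an odd number of data points:
(1, 3, 3, 4, 4, 5, 6, 7, 12)
Here $N = 9$, so the median location is $(9 + 1)/2 = 5$
The median is the item in the fifth position of the ordered dataset, therefore the median
is 4
• If there is an even number of data points:
The median location falls halfway between the two middle positions, and the median is the
average of the two values in those positions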
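The procedure can be sketched in Python as below, assuming the usual (N + 1) / 2 rule for the median location and averaging the two middle values when N is even. The function names are illustrative.

```python
import math

def median_location(n):
    """1-based position of the median in a sorted list of n values."""
    return (n + 1) / 2

def median(values):
    """Median via the location rule; averages the two middle values if n is even."""
    ordered = sorted(values)
    location = median_location(len(ordered))
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]  # same item as lower when n is odd
    return (lower + upper) / 2

print(median_location(9))                    # 5.0
print(median([1, 3, 3, 4, 4, 5, 6, 7, 12]))  # 4.0
```

For an odd number of points the location is a whole number, so `lower` and `upper` pick the same item; for an even number they pick the two items on either side of the location.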
Finally, the most commonly used measure of central tendency is called
the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population)
The mean is the same as what most of us call the average, and it is calculated in the
following manner:
$\bar{X} = \frac{\sum X}{N}$
For example, given the dataset that we used to calculate the median (odd number
example), the corresponding mean would be:
$\bar{X} = \frac{1 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 12}{9} = \frac{45}{9} = 5$
Similarly, the mean height of our class, as indicated by our sample, is:
$\bar{X} = \frac{3198}{48} \approx 66.6$ (from the frequencies in the class height table)
Mode vs. Median vs. Mean
In our height example, the mode and median were the
same, and the mean was fairly close to the mode and median. This was the case
because the height distribution was fairly symmetrical. However, when the underlying
distribution is not symmetrical, the three measures of central tendency can be quite
different
This raises the question of which measure is best
Example: Pizza Eating
TABLE 2.6
Value Frequency Value Frequency
0 4 8 5
1 2 10 2
2 8 15 1
3 6 16 1
4 6 20 1
5 6 40 1
6 5 - -
Mode = 2 slices per week
Median = 4 slices per week
Mean = 5.7 slices per week
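As a check on these three values, the sketch below expands the pizza-eating table into one entry per student and uses Python's standard `statistics` module. With the table as transcribed here, the mode and median match the values above, while the mean comes out near 5.65, close to but not exactly the 5.7 quoted, so the table as reproduced may differ slightly from the original data.

```python
from statistics import mean, median, multimode

# Pizza-eating table (slices per week -> number of students).
pizza = {0: 4, 1: 2, 2: 8, 3: 6, 4: 6, 5: 6, 6: 5,
         8: 5, 10: 2, 15: 1, 16: 1, 20: 1, 40: 1}

# Expand the table into one entry per student.
slices = [value for value, count in pizza.items() for _ in range(count)]

print(multimode(slices))       # most frequent value(s)
print(median(slices))          # middle of the ordered data
print(round(mean(slices), 2))  # arithmetic average
```

The few students eating 15 to 40 slices pull the mean upward while leaving the mode and median untouched, which is the positive skew described earlier.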
Note that if you were calculating these values, you would show all your steps (it's good
to be prof!)
Measures of Variability
In addition to knowing where the centre of the distribution is, it is often helpful to know
the degree to which individual values cluster around the centre. This is known as
variability. There are various measures of variability, the most straightforward
being the range of the sample:
Highest value minus lowest value
While the range provides a good first pass at variability, it is not the best measure because of
its sensitivity to extreme scores (see section in text)
The Average Deviation
Another approach to estimating variability is to directly measure the degree to which
individual data points differ from the mean and then average those deviations
That is:
$\frac{\sum (X - \bar{X})}{N}$
However, if we try to do this with real data, the result will always be zero:
Example: (2,3,4,4,4,5,6,12)
The mean is $40/8 = 5$, so the deviations are $-3, -2, -1, -1, -1, 0, 1, 7$, which sum to zero
The Mean Absolute Deviation (MAD)
One way to get around the problem with the average deviation is to use the absolute
value of the differences, instead of the differences themselves
The absolute value of some number is just the number without any sign:
For example, $|-3| = 3$
Thus, we could re-write and solve our average deviation question as follows:
$\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 1 + 1 + 1 + 0 + 1 + 7}{8} = \frac{16}{8} = 2$
The dataset in question has a mean of 5 and a mean absolute deviation of 2
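Both calculations can be reproduced with a few lines of Python. This is a plain sketch on the example dataset; the variable names are illustrative.

```python
data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)
mean = sum(data) / n  # 5.0

deviations = [x - mean for x in data]

# Positive and negative deviations cancel, so their average is zero.
average_deviation = sum(deviations) / n

# Taking absolute values removes the signs before averaging.
mean_absolute_deviation = sum(abs(d) for d in deviations) / n

print(average_deviation)        # 0.0
print(mean_absolute_deviation)  # 2.0
```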
The Variance
Although the MAD is an acceptable measure of variability, the most commonly used
measure is the variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square
root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population)
The computation of variance is also based on the basic notion of the average deviation;
however, instead of getting around the "zero problem" by using absolute deviations (as
in MAD), the "zero problem" is eliminated by squaring the differences from the mean
Specifically:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$ and $s = \sqrt{s^2}$
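Example: Same old numbers
(2,3,4,4,4,5,6,12)
$s^2 = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8 - 1} = \frac{66}{7} \approx 9.43$, so $s = \sqrt{9.43} \approx 3.07$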
Alternate formula for $s^2$ and $s$
The definitional formula of variance just presented was:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
An equivalent formula that is easier to work with when calculating variances by hand is:
$s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$
Although this second formula may look more intimidating, a few examples will show you
that it is actually easier to work with (as you'll see in assignment 2)
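To see that the two formulas agree, the sketch below applies both to the example dataset, assuming the sample variance with N - 1 in the denominator as discussed later in these notes. Variable names are illustrative.

```python
import math

data = [2, 3, 4, 4, 4, 5, 6, 12]
n = len(data)
mean = sum(data) / n

# Definitional formula: sum of squared deviations from the mean, over N - 1.
s2_definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

# Computational formula: only the sum of X and the sum of X squared are needed.
sum_x = sum(data)
sum_x2 = sum(x ** 2 for x in data)
s2_computational = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_definitional)

print(round(s2_definitional, 2), round(s2_computational, 2), round(s, 2))
```

The computational version never needs the individual deviations, which is why it is convenient for hand calculation: one pass over the data to accumulate the two sums is enough.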
Estimating Population Parameters
So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly
used to represent the data points of some sample
The real reason that they are the preferred measures of central tendency and variability
is that they have certain properties as estimators of their corresponding
population parameters, $\mu$ and $\sigma^2$
Four properties are considered desirable in a population estimator: sufficiency,
unbiasedness, efficiency, & resistance
Both the mean and the variance are the best estimators in their class in terms of the first
three of these four properties
Sufficiency
A sufficient statistic is one that makes use of all of the information in the sample to
estimate its corresponding parameter
Unbiasedness
A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the
statistic across a large number of samples) is equal to the population parameter
• Explanation of N-1 in the $s^2$ formula
Efficiency
The efficiency of a statistic is reflected in the variance that is observed when one
examines that statistic (e.g., the means) across a number of independently chosen samples; the smaller that variance, the more efficient the statistic
Assessing the Bias of an Estimator
The bias of a statistic as an estimator of some population parameter can be assessed by
• defining some population with measurable parameters
• taking a number of independent samples from that population
• calculating the relevant statistic for each sample
• averaging that statistic across the samples
• comparing the "average of the sample statistics" with the population parameter
Using this procedure, the mean can be shown to be an unbiased estimator (see p. 47)
However, if the more intuitive formula for $s^2$ is used:
$s^2 = \frac{\sum (X - \bar{X})^2}{N}$
it turns out to underestimate $\sigma^2$
This bias to underestimate is caused by the act of sampling and it can be shown that this
bias can be eliminated if N-1 is used in the denominator instead of N
Note that this is only true when calculating $s^2$; if you have a measurable population and
you want to calculate $\sigma^2$, you use N in the denominator, not N-1
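The assessment procedure above can be sketched as a small simulation. Everything specific here is an assumption chosen for illustration: the population (the integers 1 to 100), the sample size of 5, the 20,000 samples, and the fixed random seed.

```python
import random

def variance(sample, denominator):
    """Sum of squared deviations from the sample mean, divided by `denominator`."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / denominator

# Hypothetical population with a measurable parameter: the integers 1..100.
population = list(range(1, 101))
sigma2 = variance(population, len(population))  # population variance uses N

rng = random.Random(0)  # fixed seed so the run is repeatable
n, n_samples = 5, 20000

biased, unbiased = [], []
for _ in range(n_samples):
    sample = rng.choices(population, k=n)     # independent draws with replacement
    biased.append(variance(sample, n))        # "intuitive" formula, N in denominator
    unbiased.append(variance(sample, n - 1))  # N - 1 in denominator

avg_biased = sum(biased) / n_samples
avg_unbiased = sum(unbiased) / n_samples
print(sigma2, round(avg_biased, 1), round(avg_unbiased, 1))
```

Under these assumptions the average of the N-denominator estimates should sit well below the population variance, while the average of the N-1 estimates should land close to it.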
Degrees of Freedom
The mean of 6, 8, & 10 is 8
If I allow you to change as many of these numbers as you want BUT the mean must stay
8, how many of the numbers are you free to vary?
The point of this exercise is that when the mean is fixed, it removes a degree of freedom
from your sample; this is like actually subtracting 1 from the number of observations in
your sample
It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$
(i.e., the calculation requires that the mean be fixed first, which effectively fixes
one of the data points)
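The exercise can be made concrete with a short Python sketch: once the mean and all but one of the values are chosen, the last value is forced. The function name `last_value` is hypothetical.

```python
def last_value(free_values, mean, n):
    """Value the final observation must take so that n observations average to `mean`."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    return n * mean - sum(free_values)

# Original data 6, 8, 10 with mean 8: the third value is forced.
print(last_value([6, 8], mean=8, n=3))   # 10
# Change the two free values to anything; the third must compensate.
print(last_value([1, 20], mean=8, n=3))  # 3
```

Only two of the three numbers are free to vary, which is the N - 1 degrees of freedom referred to above.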
Resistance
The resistance of an estimator refers to the degree to which that estimate is affected by
extreme values
As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values
Despite this, they are still the most commonly used estimates of the corresponding
population parameters, mostly because of their superiority over other measures in terms
of sufficiency, unbiasedness, & efficiency
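As a rough illustration of resistance, the sketch below recomputes the mean, median, and sample variance of the earlier example dataset with and without its extreme value of 12, using Python's standard `statistics` module. Treating 12 as the extreme value is an assumption made for the illustration.

```python
from statistics import mean, median, variance

data = [2, 3, 4, 4, 4, 5, 6, 12]
trimmed = [x for x in data if x != 12]  # same data without the extreme value

# Mean and variance shift noticeably; the median does not move.
print(mean(data), median(data), round(variance(data), 2))
print(mean(trimmed), median(trimmed), round(variance(trimmed), 2))
```

Removing a single point changes the mean from 5 to 4 and cuts the variance sharply, while the median stays at 4, which is what low resistance of the mean and variance looks like in practice.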

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 

Describing and exploring data

  • 1. TARUN GEHLOT (B.E, CIVIL) (HONOURS)

    Describing and Exploring Data

    Once a bunch of data has been collected, the raw numbers must be manipulated in some fashion to make them more informative. Several options are available, including plotting the data or calculating descriptive statistics.

    Plotting Data

    Often, the first thing one does with a set of raw data is to plot frequency distributions. Usually this is done by first creating a table of the frequencies broken down by values of the relevant variable; the frequencies in the table are then plotted in a histogram.

    Example: Your age as estimated by the questionnaire from the first class

    TABLE 2.1
    Age  Frequency
    18   3
    19   10
    20   14
    21   10
    22   5
    23   2
    24   1
    25   1
    26   2

    Note: The frequencies in the above table were calculated by simply counting the number of subjects having the specified value for the age variable.

    Histogram
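The counting step behind Table 2.1 can be sketched in a few lines of Python. This is only an illustration: the raw questionnaire responses are not part of these notes, so the list `ages` is rebuilt from the table itself, and the helper name `frequency_table` is made up for the example.

```python
from collections import Counter

# Table 2.1 as given in the notes (age -> frequency).
table_2_1 = {18: 3, 19: 10, 20: 14, 21: 10, 22: 5, 23: 2, 24: 1, 25: 1, 26: 2}

# Assumption: the raw responses are not available, so we expand the table
# back into one value per subject to have something to count.
ages = [age for age, n in table_2_1.items() for _ in range(n)]

def frequency_table(values):
    """Count how many subjects have each value, sorted by value."""
    return dict(sorted(Counter(values).items()))

freq = frequency_table(ages)

# A crude text histogram: one '#' per subject.
for value, count in freq.items():
    print(f"{value:>3} {'#' * count} ({count})")
```

With real data, `ages` would simply be the list of questionnaire responses and the rest would stay the same.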
  • 2. Grouping Data

    Plotting is easy when the variable of interest has a relatively small number of values (like our age variable did). However, the values of a variable are sometimes more continuous, resulting in uninformative frequency plots if done in the above manner. For example, our weight variable ranges from 100 lb. to 200 lb. If we used the previously described technique, we would end up with about 100 bars, most of which would have a frequency of less than 2 or 3 (and many a frequency of zero). We can get around this problem by grouping our values into bins. Try for around 10 bins with natural splits.

    Example: Binning our weight variable

    TABLE 2.2
    Weight Bin  Midpoint  Frequency
    100-109     104.5     6
    110-119     114.5     10
    120-129     124.5     6
    130-139     134.5     10
    140-149     144.5     5
    150-159     154.5     3
    160-169     164.5     4
    170-179     174.5     1
  • 3. 180-189     184.5     0
    190-199     194.5     2
    200-209     204.5     1

    Histogram

    Here's a live demonstration of binning

    Stem & Leaf Plots

    If values of a variable must be grouped prior to creating a frequency plot, then the information related to the specific values is lost in the process (i.e., the resulting graph depicts only the frequencies associated with the grouped values). However, it is possible to obtain the graphical advantage of grouping and still keep all of the information if stem & leaf plots are used.

    These plots are created by splitting a data point into the part associated with the "group" and the part associated with the individual point. For example, the numbers 180, 180, 181, 182, 185, 186, 187, 187, 189 could be represented as:

    18  001256779

    Thus, we could represent our weight data in the following stem & leaf plot:

    Stem  Leaf
    10    057788
  • 4. 11    0001235558
    12    001555
    13    0002244555
    14    00005
    15    005
    16    2255
    17    0
    18
    19    05
    20    0

    Stem & leaf plots are especially nice for comparing distributions:

    Males  Stem  Females
        8  10    05778
           11    0001235558
           12    001555
     5440  13    002255
       00  14    005
       00  15    5
      522  16    5
        0  17
           18
       50  19
        0  20

    Terminology Related to Distributions

    Often, frequency histograms tend to have a roughly symmetrical bell shape; such distributions are called normal or Gaussian.

    Example: Our height distribution
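Returning to the weight data for a moment: both the bin frequencies of Table 2.2 and the stem & leaf plot can be produced from the same list of values. The sketch below assumes the weights are exactly those readable off the stem & leaf plot; the function name `stem_and_leaf` is illustrative, not taken from any library.

```python
from collections import defaultdict

# Weights (lb) read back off the stem & leaf plot in these notes.
weights = [
    100, 105, 107, 107, 108, 108,
    110, 110, 110, 111, 112, 113, 115, 115, 115, 118,
    120, 120, 121, 125, 125, 125,
    130, 130, 130, 132, 132, 134, 134, 135, 135, 135,
    140, 140, 140, 140, 145,
    150, 150, 155,
    162, 162, 165, 165,
    170,
    190, 195,
    200,
]

def stem_and_leaf(values, leaf_unit=10):
    """Map each stem (value // leaf_unit) to its sorted leaves as a string."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // leaf_unit].append(v % leaf_unit)
    lo, hi = min(leaves), max(leaves)
    # Include empty stems so gaps in the distribution stay visible.
    return {s: "".join(str(d) for d in leaves.get(s, [])) for s in range(lo, hi + 1)}

plot = stem_and_leaf(weights)

# The bin frequency is just the number of leaves on each stem.
bins = {f"{s * 10}-{s * 10 + 9}": len(leaf) for s, leaf in plot.items()}

for stem, leaf in plot.items():
    print(f"{stem:>3}  {leaf}")
```

The stem & leaf plot carries strictly more information than the binned table: the bin counts can be recovered from the plot, but not the other way round.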
  • 5. Sometimes, the bell shape is not symmetrical. The term positive skew refers to the situation where the "tail" of the distribution is to the right; negative skew is when the "tail" is to the left.

    Example: Pizza Data

    See the text for other terminology.

    NOTATION

    Variables
  • 6. When we describe a set of data corresponding to the values of some variable, we will refer to that set using an uppercase letter such as X or Y. When we want to talk about specific data points within that set, we specify those points by adding a subscript to the uppercase letter, like $X_1$.

    For example:

    5,    8,    12,   3,    6,    8,    7
    $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, $X_6$, $X_7$

    Summation

    The Greek letter sigma, which looks like $\Sigma$, means "add up" or "sum" whatever follows it. Thus, $\sum X_i$ means "add up all the $X_i$'s". If we use the $X_i$'s from the previous example, $\sum X_i = 49$ (or just $\sum X$).

    Note that sometimes the $\Sigma$ has numbers above and below it. These numbers specify the range over which to sum. For example, if we again use the $X_i$'s from the previous example, but now limit the summation:

    $\sum_{i=1}^{5} X_i = 34$

    Nasty Example: Antic Real

    TABLE 2.3
    Student  Mark #1  Mark #2
    -        X        Y
    1        82       84
    2        66       51
    3        70       72
    4        81       56
    5        61       73

    Double Subscripts

    Sometimes things are made more complicated because capital letters (e.g., X) are sometimes used to refer to entire datasets (as opposed to single variables), and multiple subscripts are used to specify specific data points.

    TABLE 2.4
    Student  Week 1  Week 2  Week 3  Week 4  Week 5
  • 7. 1        7       6       4       2       2
    2        3       4       4       3       4
    3        3       4       5       4       6

    $X_{24} = 3$

    $\sum X$ or $\sum\sum X_{ij} = 61$

    Measures of Central Tendency

    While distributions provide an overall picture of some dataset, it is sometimes desirable to represent the entire dataset using descriptive statistics. The first descriptive statistics we will discuss are those used to indicate where the centre of the distribution lies. There are, in fact, three different measures of central tendency.

    The first of these is called the mode. The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample.

    Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar. However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value).

    Finding the mode: Create a nongrouped frequency table as described previously, then identify the value with the greatest frequency.

    For Example: Class Height
  • 8. TABLE 2.5
    Value  Frequency  Value  Frequency
    61     3          69     3
    62     4          70     2
    63     4          71     4
    64     4          72     4
    65     3          73     0
    66     7          74     0
    67     5          75     0
    68     4          76     1

    A second measure of central tendency is called the median. The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median).

    To find the median, the data points must first be sorted into either ascending or descending numerical order. The position of the median value can then be calculated using the following formula:

    Median location $= \frac{N + 1}{2}$

    For Examples:

      • If there is an odd number of data points: (1, 3, 3, 4, 4, 5, 6, 7, 12). The median location is $(9 + 1)/2 = 5$; the median is the item in the fifth position of the ordered dataset, therefore the median is 4.
      • If there is an even number of data points: the median location falls halfway between the two middle positions, and the median is the average of the two values in those positions.

    Finally, the most commonly used measure of central tendency is called the mean (denoted $\bar{X}$ for a sample, and $\mu$ for a population). The mean is the same as what most of us call the average, and it is calculated in the following manner:

    $\bar{X} = \frac{\sum X}{N}$
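As a rough check on the three measures of central tendency, the sketch below rebuilds the class height sample from Table 2.5 and computes each one with Python's standard `statistics` module. The helper `median_location` is a made-up name, and it assumes the usual $(N + 1)/2$ rule for the position of the median.

```python
from statistics import mean, median, multimode

# Table 2.5 (class height): value -> frequency.
table_2_5 = {
    61: 3, 62: 4, 63: 4, 64: 4, 65: 3, 66: 7, 67: 5, 68: 4,
    69: 3, 70: 2, 71: 4, 72: 4, 73: 0, 74: 0, 75: 0, 76: 1,
}

# Expand the table into one value per student (already in ascending order).
heights = [value for value, n in table_2_5.items() for _ in range(n)]

def median_location(n):
    """Position of the median in the ordered data, assuming the (N + 1) / 2 rule."""
    return (n + 1) / 2

modes = multimode(heights)        # every value tied for the highest frequency
middle = median(heights)          # averages the two middle values when N is even
average = mean(heights)           # sum of the values divided by N

print("N =", len(heights))
print("mode(s) =", modes)
print("median location =", median_location(len(heights)))
print("median =", middle)
print("mean =", average)
```

`multimode` is used rather than `mode` so that a tie, as in the small example (1, 3, 3, 4, 4, 5, 6, 7, 12), is reported as two modes instead of silently picking one.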
  • 9. For example, given the dataset that we used to calculate the median (odd number example), the corresponding mean would be:

    $\bar{X} = \frac{1 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 12}{9} = \frac{45}{9} = 5$

    Similarly, the mean height of our class, as indicated by our sample, is:

    $\bar{X} = \frac{3198}{48} \approx 66.6$

    Mode vs. Median vs. Mean

    In our height example, the mode and median were the same, and the mean was fairly close to the mode and median. This was the case because the height distribution was fairly symmetrical. However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different. This raises the question of which measure is best.

    Example: Pizza Eating

    TABLE 2.5
    Value  Frequency  Value  Frequency
    0      4          8      5
    1      2          10     2
    2      8          15     1
    3      6          16     1
    4      6          20     1
    5      6          40     1
    6      5          -      -

    Mode = 2 slices per week
    Median = 4 slices per week
    Mean = 5.7 slices per week

    Note that if you were calculating these values, you would show all your steps (it's good to be prof!)

    Measures of Variability

    In addition to knowing where the centre of the distribution is, it is often helpful to know the degree to which individual values cluster around the centre. This is known as variability. There are various measures of variability, the most straightforward being the range of the sample:

    Range = highest value minus lowest value

    While range provides a good first pass at variance, it is not the best measure because of its sensitivity to extreme scores (see section in text).

    The Average Deviation
  • 10. Another approach to estimating variance is to directly measure the degree to which individual data points differ from the mean and then average those deviations. That is:

    $\frac{\sum (X - \bar{X})}{N}$

    However, if we try to do this with real data, the result will always be zero:

    Example: (2, 3, 4, 4, 4, 5, 6, 12)

    $\bar{X} = 5$, and $\frac{(-3) + (-2) + (-1) + (-1) + (-1) + 0 + 1 + 7}{8} = \frac{0}{8} = 0$

    The Mean Absolute Deviation (MAD)

    One way to get around the problem with the average deviation is to use the absolute value of the differences, instead of the differences themselves. The absolute value of some number is just the number without any sign. For example, $|-3| = |3| = 3$.

    Thus, we could re-write and solve our average deviation question as follows:

    $\text{MAD} = \frac{\sum |X - \bar{X}|}{N} = \frac{3 + 2 + 1 + 1 + 1 + 0 + 1 + 7}{8} = \frac{16}{8} = 2$

    The dataset in question has a mean of 5 and a mean absolute deviation of 2.

    The Variance

    Although the MAD is an acceptable measure of variability, the most commonly used measure is the variance (denoted $s^2$ for a sample and $\sigma^2$ for a population) and its square root, termed the standard deviation (denoted $s$ for a sample and $\sigma$ for a population).

    The computation of variance is also based on the basic notion of the average deviation; however, instead of getting around the "zero problem" by using absolute deviations (as in MAD), the "zero problem" is eliminated by squaring the differences from the mean. Specifically:

    $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$
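A minimal sketch of the average deviation, the MAD, and the sample variance for the dataset (2, 3, 4, 4, 4, 5, 6, 12) follows. The function names are illustrative, and the variance uses $N - 1$ in the denominator, as these notes do for $s^2$.

```python
from math import sqrt

data = [2, 3, 4, 4, 4, 5, 6, 12]

def mean(xs):
    return sum(xs) / len(xs)

def average_deviation(xs):
    """Mean of the signed deviations from the mean; positive and negative cancel."""
    m = mean(xs)
    return sum(x - m for x in xs) / len(xs)

def mean_absolute_deviation(xs):
    """Mean of the absolute deviations from the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

print("mean               =", mean(data))
print("average deviation  =", average_deviation(data))
print("MAD                =", mean_absolute_deviation(data))
print("variance (s^2)     =", round(sample_variance(data), 2))
print("std deviation (s)  =", round(sqrt(sample_variance(data)), 2))
```

The signed deviations cancel exactly, which is why the average deviation is useless as a measure of spread, while the absolute and squared versions are not.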
  • 11. Example: Same old numbers (2, 3, 4, 4, 4, 5, 6, 12)

    $s^2 = \frac{9 + 4 + 1 + 1 + 1 + 0 + 1 + 49}{8 - 1} = \frac{66}{7} \approx 9.43$, so $s \approx 3.07$

    Alternate formula for $s^2$ and $s$

    The definitional formula of variance just presented was:

    $s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$

    An equivalent formula that is easier to work with when calculating variances by hand is:

    $s^2 = \frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}$

    Although this second formula may look more intimidating, a few examples will show you that it is actually easier to work with (as you'll see in assignment 2). For the numbers above, $\sum X^2 = 266$ and $\sum X = 40$, so $s^2 = \frac{266 - 1600/8}{7} = \frac{66}{7}$, the same result.

    Estimating Population Parameters

    So, the mean ($\bar{X}$) and variance ($s^2$) are the descriptive statistics that are most commonly used to represent the data points of some sample.

    The real reason that they are the preferred measures of central tendency and variance
  • 12. is because of certain properties they have as estimators of their corresponding population parameters, $\mu$ and $\sigma^2$.

    Four properties are considered desirable in a population estimator: sufficiency, unbiasedness, efficiency, and resistance. Both the mean and the variance are the best estimators in their class in terms of the first three of these four properties.

    Sufficiency

    A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter.

    Unbiasedness

    A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of a number of sample means) is equal to the population parameter.

      • Explanation of N-1 in the $s^2$ formula

    Efficiency

    The efficiency of a statistic is reflected in the variance that is observed when one examines the means of a bunch of independently chosen samples.

    Assessing the Bias of an Estimator

    The bias of a statistic as an estimator of some population parameter can be assessed by

      • defining some population with measurable parameters
      • taking a number of independent samples from that population
      • calculating the relevant statistic for each sample
      • averaging that statistic across the samples
      • comparing the "average of the sample statistics" with the population parameter

    Using this procedure, the mean can be shown to be an unbiased estimator (see p. 47). However, if the more intuitive formula for $s^2$ is used:

    $s^2 = \frac{\sum (X - \bar{X})^2}{N}$

    it turns out to underestimate $\sigma^2$. This bias to underestimate is caused by the act of sampling, and it can be shown that this bias can be eliminated if N-1 is used in the denominator instead of N.

    Note that this is only true when calculating $s^2$; if you have a measurable population and you want to calculate $\sigma^2$, you use N in the denominator, not N-1.

    Degrees of Freedom
  • 13. The mean of 6, 8, & 10 is 8. If I allow you to change as many of these numbers as you want BUT the mean must stay 8, how many of the numbers are you free to vary? Only two: once any two of the numbers are chosen, the third is determined by the requirement that the mean be 8.

    The point of this exercise is that when the mean is fixed, it removes a degree of freedom from your sample -- this is like actually subtracting 1 from the number of observations in your sample. It is for exactly this reason that we use N-1 in the denominator when we calculate $s^2$ (i.e., the calculation requires that the mean be fixed first, which effectively removes -- fixes -- one of the data points).

    Resistance

    The resistance of an estimator refers to the degree to which that estimate is affected by extreme values. As mentioned previously, both $\bar{X}$ and $s^2$ are highly sensitive to extreme values.

    Despite this, they are still the most commonly used estimates of the corresponding population parameters, mostly because of their superiority over other measures in terms of sufficiency, unbiasedness, & efficiency.
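The bias-assessment procedure described earlier (define a population, sample from it, compute the statistic for each sample, average, compare) can be tried on a toy case. The sketch below is an illustration under stated assumptions, not the demonstration from the text: it uses a hypothetical three-value population (6, 8, 10), and instead of drawing random samples it enumerates every possible sample of size 2 drawn with replacement, so the averages are exact rather than approximate.

```python
from fractions import Fraction
from itertools import product

population = [6, 8, 10]   # hypothetical population with known parameters
n = 2                     # sample size (an arbitrary choice for the demo)

def mean(xs):
    # Exact arithmetic so the comparison below is not blurred by rounding.
    return Fraction(sum(xs)) / len(xs)

def variance(xs, denominator):
    """Sum of squared deviations from the mean, divided by the given denominator."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / denominator

# Step 1: population parameters (N in the denominator for sigma squared).
mu = mean(population)
sigma_squared = variance(population, len(population))

# Step 2: every possible sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Steps 3-4: compute each statistic per sample, then average across samples.
avg_of_means = mean([mean(s) for s in samples])
avg_var_n = mean([variance(s, n) for s in samples])            # divide by N
avg_var_n_minus_1 = mean([variance(s, n - 1) for s in samples])  # divide by N - 1

# Step 5: compare with the population parameters.
print("mu =", mu, "| average of sample means =", avg_of_means)
print("sigma^2 =", sigma_squared)
print("average s^2 with N     =", avg_var_n)
print("average s^2 with N - 1 =", avg_var_n_minus_1)
```

In this toy case the sample mean and the $N - 1$ version of $s^2$ average out to the population values, while the version with $N$ in the denominator comes out too small, which is the pattern the notes describe.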