Introduction to Sampling
Situo Liu
Spry, Inc.
10/25/2013
Ways to deal with Big Data
• Big Analytics - use distributed database systems
(Hadoop) and parallel programming
(MapReduce)
• Sampling - use a representative sample to
estimate the population
• Sampling in Hadoop
• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data then
play with it locally (R or Excel)
• Pig has a handy SAMPLE keyword
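As a rough local analogue of sampling a fraction of rows, the sketch below keeps each row independently with a fixed probability (Bernoulli sampling), which is the behavior Pig's SAMPLE is generally understood to have. The function name `bernoulli_sample` and the `seed` parameter are illustrative assumptions and are not part of Pig or Hadoop.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration; the sample size is random,
    so it is only approximately fraction * len(rows).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    rows = list(range(1000))
    subset = bernoulli_sample(rows, 0.1, seed=42)
    # The count varies from run to run unless the seed is fixed.
    print(len(subset))
```

Because each row is decided independently, this works in a single pass over data of unknown length, at the cost of not controlling the exact sample size.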
Elements of a Sample
• Sample - a subset of individuals within a statistical population, used to
estimate characteristics of the whole population.
• Target Population - collection of observations we want to study
• Sampled Population - all possible observation units that might
have been sampled
• Sampling Frame - list of all sampling units (student roster, list of
phone numbers)
• Sampling Unit - unit we actually sample (e.g. household)
• Observational Unit - element to be measured (e.g. individual
people in the household)
Sampling Techniques (1)
• Probability Sampling
• Every unit in the population has a chance (greater than zero) of
being selected in the sample, and this probability can be
accurately determined.
• Not every observational unit has to have the same probability of
selection but every observational unit’s probability is known.

• Nonprobability Sampling
• Some elements of the population have no chance of selection
(these are sometimes referred to as 'out of coverage'), or
the probability of selection can't be accurately determined.
• Because the selection of elements is nonrandom, nonprobability
sampling does not allow the estimation of sampling errors.
Sampling Techniques (2)
• Probability Sampling
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster or Multistage Sampling
• Probability Proportional to Size Sampling
• Panel sampling

• Nonprobability Sampling
• Accidental sampling / Convenience sampling / Haphazard sampling
• Quota sampling
• Purposive sampling / Judgmental sampling
• Capture-Recapture sampling (determine population size)
• Line-intercept sampling
http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelhoed.svg
Simple Random Sampling - SRS
• Definition: for a size n simple random sample, every possible
subset of n units in the population has the same chance of
being in the sample
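• Requirement: a unique identifier for every unit is needed for
implementation
• Advantage: easy to understand and implement
• Disadvantage: uses no auxiliary information, so estimates typically have higher
variance (lower precision) than a well-designed stratified sample of the same size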
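A minimal sketch of simple random sampling without replacement, assuming the population fits in memory as a list of unit identifiers. The helper name `simple_random_sample` and the `seed` argument are illustrative choices; `random.sample` from the Python standard library gives every subset of size n the same chance of being drawn.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Draw n distinct units; every size-n subset is equally likely."""
    rng = random.Random(seed)
    # random.sample raises ValueError if n exceeds the population size.
    return rng.sample(units, n)


if __name__ == "__main__":
    population = [f"id-{i}" for i in range(500)]  # hypothetical identifiers
    print(simple_random_sample(population, 5, seed=0))
```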
Systematic Sampling
• Definition: Systematic sampling relies on arranging the study
population according to some ordering scheme and then
selecting elements at regular intervals through that ordered
list. Systematic sampling involves a random start and then
proceeds with the selection of every kth (k=population
size/sample size) element from then onwards.
• Requirement: Ordering scheme for population
• Advantage: easy to implement, very efficient
• Disadvantage: vulnerable to periodicities
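The random-start, every-kth-element procedure can be sketched as follows. This is an illustration under simplifying assumptions: the function name `systematic_sample` is hypothetical, and the interval is taken as the integer k = N // n, so when N is not a multiple of n the trailing units beyond index n*k - 1 can never be selected.

```python
import random


def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start in [0, k) and then every k-th unit."""
    size = len(ordered_units)
    if not 0 < n <= size:
        raise ValueError("n must be between 1 and the population size")
    k = size // n                 # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start within the first interval
    return [ordered_units[start + i * k] for i in range(n)]


if __name__ == "__main__":
    roster = list(range(100))     # already ordered population
    print(systematic_sample(roster, 10, seed=1))
```

If the ordering has a period that matches k (for example, every 10th record is a weekly total), all selected units share that position in the cycle, which is the periodicity risk noted above.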
Stratified sampling (1)
• Definition: Where the population embraces a number of
distinct categories, the frame can be organized by these
categories into separate "strata." Each stratum is then
sampled as an independent sub-population, out of which
individual elements can be randomly selected.
• Requirement: the population can be divided into distinct,
independent strata, with strata chosen based
upon relevance to the criterion in question
• Variability within strata is minimized
• Variability between strata is maximized
• The variables upon which the population is stratified are
strongly correlated with the desired dependent variable.
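A minimal sketch of stratified sampling with proportional allocation: units are grouped by stratum and a simple random sample is drawn independently inside each group. The names `stratified_sample` and `stratum_of`, and the rule of taking at least one unit per stratum, are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed=None):
    """Sample the same fraction independently within every stratum.

    stratum_of: function mapping a unit to its stratum label.
    Returns a dict {stratum label: list of sampled units}.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for label, members in strata.items():
        # Proportional allocation, but never fewer than one unit per stratum.
        size = max(1, round(fraction * len(members)))
        sample[label] = rng.sample(members, size)
    return sample


if __name__ == "__main__":
    people = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
    picked = stratified_sample(people, lambda p: p[0], 0.1, seed=0)
    print({label: len(rows) for label, rows in picked.items()})
```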
Stratified sampling (2)
• Advantage:
• Inferences can be made about specific subgroups
• Very likely more efficient statistical estimates
• Will never be less efficient than SRS, provided that each
stratum's sample size is proportional to the stratum's size in the population.

• Data may be more readily available for individual pre-existing strata within
a population than for the overall population
• Because strata are independent, different sampling approaches can be used for
different subgroups

• Disadvantage:
• Complexity in implementation and estimation
• Multiple criteria can be tricky
• Requires a specified minimum sample size per group
Cluster Sampling (1)
• Definition: the entire population is divided into
groups, or clusters, and a random sample of these clusters is
selected. All observations in the selected clusters are included
in the sample.
• Requirement: does not require a complete list of every unit in
the population, only a sampling frame at the cluster level
• Variability within clusters is maximized
• Variability between clusters is minimized
• The variables upon which the population is divided into
clusters are not strongly correlated with the desired
dependent variable.
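One-stage cluster sampling can be sketched as drawing a simple random sample of cluster identifiers and then keeping every unit in the chosen clusters. The function name `cluster_sample` and the dictionary layout of the frame are illustrative assumptions.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and keep all of their units.

    clusters: dict mapping a cluster id to the list of units it contains.
    Only the cluster-level frame (the dict keys) is used for selection.
    """
    rng = random.Random(seed)
    chosen = rng.sample(list(clusters), n_clusters)
    return {cid: list(clusters[cid]) for cid in chosen}


if __name__ == "__main__":
    schools = {                      # hypothetical cluster frame
        "school-a": ["amy", "bo", "cy"],
        "school-b": ["di", "ed"],
        "school-c": ["flo"],
        "school-d": ["gus", "hal"],
    }
    print(cluster_sample(schools, 2, seed=5))
```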
Cluster Sampling (2)
• Advantages:
• Easy to implement
• Cost-effective

• Disadvantages:
• Complexity in estimation
• May not reflect the diversity of clusters
• Provides less information per observation than SRS, because units in the
same cluster tend to carry redundant information

• Standard errors may be higher than other sampling designs
Probability Proportional to Size Sampling - PPS
• Definition: Where the selection probability for each element is
set to be proportional to its size measure.
• The techniques above were presented as equal probability of selection (EPS) designs

• Requirement: auxiliary variable / size measure, correlated to
the variable of interest
• Advantage:
• May improve accuracy for a given sample size by concentrating
sample on large elements that have the greatest impact on
estimation
• Used in business and auditing as monetary unit sampling (MUS)

• Disadvantage:
• Complexity for implementation and estimation
• Different portions of the population may be over- or under-represented
due to the variation in selection probabilities
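A minimal sketch of PPS selection, assuming draws are made with replacement so that each draw picks unit i with probability size_i divided by the total size. The helper names `selection_probabilities` and `pps_sample` are hypothetical; `random.choices` with `weights` is the standard-library call that performs the weighted draws. PPS without replacement needs a more careful scheme and is not shown.

```python
import random


def selection_probabilities(sizes):
    """Per-draw selection probability of each unit: size_i / sum(sizes)."""
    total = sum(sizes)
    if total <= 0 or any(s < 0 for s in sizes):
        raise ValueError("sizes must be non-negative with a positive total")
    return [s / total for s in sizes]


def pps_sample(units, sizes, n, seed=None):
    """Draw n units WITH replacement, probability proportional to size."""
    if len(units) != len(sizes):
        raise ValueError("units and sizes must have the same length")
    selection_probabilities(sizes)  # reuse the validation above
    rng = random.Random(seed)
    return rng.choices(units, weights=sizes, k=n)


if __name__ == "__main__":
    invoices = ["inv-1", "inv-2", "inv-3"]       # hypothetical audit items
    amounts = [100, 900, 9000]                   # size measure (e.g. dollars)
    print(selection_probabilities(amounts))
    print(pps_sample(invoices, amounts, 5, seed=2))
```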
Representativeness of the sample
• Match between target population and
sampled population
• Method of drawing sample
Two kinds of Errors
• Non-sampling error - can be reduced by careful design of the survey
• Selection bias - part of target population is not in sampled population
(target population may not have a natural frame, the mode of data
collection may restrict frame)
• Coverage Error - the extent to which the Sampling Frame does not cover
the Target population

• Measurement bias - measuring instrument has tendency to differ
from true value in one direction
• Measurement error (Errors of Observation)
• Deviations of measurement
• Inaccurate measurement
• Item nonresponse (didn’t understand, didn’t see, or refused question)
• Unit nonresponse (not home, not approached by interviewer, refused call)

• Sampling error - results from taking a sample instead of whole
population, can be quantified by statistics, reduced by increasing
sample size
Sample Size Calculation
• In order to know what our sample size needs to be, we must
decide in advance the maximum estimation error we are
willing to tolerate.
• Determine the nature of the estimate - a proportion or a mean
• Choose the confidence level of the estimate (equivalently, the significance level)
Proportion (1)
• Proportion: $\hat{p} = X/n$
• where X is the number of 'positive' observations and n is the sample size

• When the observations are independent, X has a
binomial distribution with variance $np(1-p)$, so $\hat{p}$ has variance $p(1-p)/n$
• The maximum variance of $\hat{p}$ is $0.25/n$, attained when
$p = 0.5$
• For sufficiently large n, the distribution of $\hat{p}$ will be closely
approximated by a normal distribution. Around 95% of this
distribution's probability lies within 2 standard deviations of
the mean.
• $\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$ will form a 95% confidence interval for the true proportion.
Proportion (2)
• If this interval needs to be no more than W units wide, the
equation $4\sqrt{0.25/n} = W$
• can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the
error bound on the estimate
• i.e., the estimate is usually given as within ± B. So,
• for B = 10% one requires n = 100,
• for B = 5% one needs n = 400,
• for B = 3% the requirement approximates to n = 1000,
• while for B = 1% a sample size of n = 10000 is required.
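The rule n = 1/B^2 can be checked with a few lines of Python. This is a sketch under the slide's own assumptions (worst case p = 0.5 and the "2 standard deviations" approximation for 95% confidence); the function name `sample_size_for_proportion` is hypothetical, and exact fractions are used only to avoid floating-point rounding before taking the ceiling.

```python
import math
from fractions import Fraction


def sample_size_for_proportion(error_bound):
    """Smallest n with 2 * sqrt(0.25 / n) <= error_bound, i.e. n >= 1 / B^2."""
    b = Fraction(str(error_bound))      # e.g. 0.05 -> 1/20, exactly
    if not 0 < b < 1:
        raise ValueError("error_bound must be strictly between 0 and 1")
    return math.ceil(1 / b ** 2)


if __name__ == "__main__":
    for bound in (0.10, 0.05, 0.03, 0.01):
        print(bound, sample_size_for_proportion(bound))
```

For B = 3% the formula gives 1/0.0009, about 1111, so the function returns 1112; the figure of 1000 quoted above is a round-number approximation.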
Mean (1)
• A proportion is a special case of a mean. When estimating the
population mean using an independent and identically
distributed (iid) sample of size n, where each data value has
variance $\sigma^2$, the standard error of the sample mean is $\sigma/\sqrt{n}$
• This expression describes quantitatively how the estimate
becomes more precise as the sample size increases. Using the
central limit theorem to justify approximating the sample
mean with a normal distribution yields an approximate 95%
confidence interval of the form $\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$
Mean (2)
• If we wish to have a confidence interval that is W units in
width, we would solve $4\sigma/\sqrt{n} = W$
• for n, yielding the sample size $n = 16\sigma^2/W^2$.
• e.g., if we are interested in estimating the amount by which a
drug lowers a subject's blood pressure with a confidence
interval that is 6 units wide, and we know that the standard
deviation of blood pressure in the population is 15, then the
required sample size is $16 \cdot 15^2 / 6^2 = 100$
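The blood-pressure example can be reproduced with a small helper. This is a sketch assuming a known population standard deviation and the same "2 standard errors" approximation for a 95% interval; the name `sample_size_for_mean` is hypothetical.

```python
import math
from fractions import Fraction


def sample_size_for_mean(sigma, width):
    """Smallest n with 4 * sigma / sqrt(n) <= width, i.e. n >= 16 sigma^2 / W^2."""
    s = Fraction(str(sigma))
    w = Fraction(str(width))
    if s <= 0 or w <= 0:
        raise ValueError("sigma and width must be positive")
    return math.ceil(16 * s ** 2 / w ** 2)


if __name__ == "__main__":
    # Standard deviation 15, interval 6 units wide (the slide's example).
    print(sample_size_for_mean(15, 6))
```

Halving the interval width quadruples the required sample size, because n grows with 1/W^2.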
Stratified Sample Size (1)
• The sample can often be split up into sub-samples. Typically, if
there are k such sub-samples (from k different strata) then
each of them will have a sample size $n_i$, i = 1, 2, ..., k. These $n_i$
must conform to the rule that $n_1 + n_2 + \dots + n_k = n$ (i.e. that
the total sample size is given by the sum of the sub-sample
sizes). Selecting these $n_i$ optimally can be done in various
ways, using (for example) Neyman's optimal allocation.
• There are many reasons to use stratified sampling: to
decrease variances of sample estimates, to use partly
non-random methods, or to study strata individually. A
useful, partly non-random method would be to sample
individuals where easily accessible, but, where not, sample
clusters to save travel costs.
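Neyman's optimal allocation assigns each stratum a share of the total sample proportional to the product of its population size and its standard deviation. The sketch below assumes those per-stratum standard deviations are known or estimated in advance; the function name `neyman_allocation` is hypothetical, and it returns unrounded values so the caller can decide how to round while keeping the total equal to n.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Return n_h = n * N_h * S_h / sum_k(N_k * S_k) for every stratum h."""
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("need one standard deviation per stratum")
    products = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one stratum must have positive N_h * S_h")
    return [total_n * p / total for p in products]


if __name__ == "__main__":
    # Two equally large strata; the second is three times as variable.
    print(neyman_allocation(100, [1000, 1000], [1.0, 3.0]))
```

More variable or larger strata receive more of the sample; when all strata have the same standard deviation the result reduces to proportional allocation.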
Stratified Sample Size (2)
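• In general, for H strata, a weighted sample mean is $\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$, where $\bar{x}_h$ is the sample mean of stratum h and $W_h$ is its weight (often $W_h = N_h/N$, the stratum's share of the population)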
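A stratified estimate of the population mean combines the per-stratum sample means using stratum weights. The sketch below assumes the common choice of weights equal to each stratum's share of the population, N_h / N; the function name `stratified_mean` is hypothetical.

```python
def stratified_mean(stratum_sizes, stratum_means):
    """Weighted mean with weights W_h = N_h / N (an assumed weighting)."""
    if len(stratum_sizes) != len(stratum_means):
        raise ValueError("need one sample mean per stratum")
    total = sum(stratum_sizes)
    if total <= 0:
        raise ValueError("population size must be positive")
    weighted_sum = sum(size * mean for size, mean in zip(stratum_sizes, stratum_means))
    return weighted_sum / total


if __name__ == "__main__":
    # 80% of the population in a stratum with mean 10, 20% with mean 20.
    print(stratified_mean([80, 20], [10.0, 20.0]))
```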
Thank You
sliu@spryinc.com
www.spryinc.com

Más contenido relacionado

La actualidad más candente

Stratified sampling
Stratified samplingStratified sampling
Stratified samplingsuncil0071
 
Sampling methods PPT
Sampling methods PPTSampling methods PPT
Sampling methods PPTVijay Mehta
 
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...Alam Nuzhathalam
 
Population vs sample
Population vs samplePopulation vs sample
Population vs sample5829591
 
probability and non-probability samplings
probability and non-probability samplingsprobability and non-probability samplings
probability and non-probability samplingsn1a2g3a4j5a6i7
 
sampling error.pptx
sampling error.pptxsampling error.pptx
sampling error.pptxtesfkeb
 
CLUSTER SAMPLING PPT
CLUSTER SAMPLING PPTCLUSTER SAMPLING PPT
CLUSTER SAMPLING PPTkpsilpa
 
Probability sampling
Probability samplingProbability sampling
Probability samplingtanzil irfan
 
Systematic ranom sampling for slide share
Systematic ranom sampling for slide shareSystematic ranom sampling for slide share
Systematic ranom sampling for slide shareIVenkatReddyGaaru
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inferenceBhavik A Shah
 
Probability sampling
Probability samplingProbability sampling
Probability samplingBhanu Teja
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniqueschetan1923
 
sampling simple random sampling
sampling simple random samplingsampling simple random sampling
sampling simple random samplingDENNY VARGHESE
 
Non-Probability Sampling Method
Non-Probability Sampling Method Non-Probability Sampling Method
Non-Probability Sampling Method Sundar B N
 

La actualidad más candente (20)

Sampling and its types
Sampling and its typesSampling and its types
Sampling and its types
 
Stratified sampling
Stratified samplingStratified sampling
Stratified sampling
 
sampling
samplingsampling
sampling
 
Sampling methods PPT
Sampling methods PPTSampling methods PPT
Sampling methods PPT
 
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...
 
Population vs sample
Population vs samplePopulation vs sample
Population vs sample
 
probability and non-probability samplings
probability and non-probability samplingsprobability and non-probability samplings
probability and non-probability samplings
 
sampling error.pptx
sampling error.pptxsampling error.pptx
sampling error.pptx
 
CLUSTER SAMPLING PPT
CLUSTER SAMPLING PPTCLUSTER SAMPLING PPT
CLUSTER SAMPLING PPT
 
Sample Surveys
Sample SurveysSample Surveys
Sample Surveys
 
Probability sampling
Probability samplingProbability sampling
Probability sampling
 
Systematic ranom sampling for slide share
Systematic ranom sampling for slide shareSystematic ranom sampling for slide share
Systematic ranom sampling for slide share
 
Sampling Methods
Sampling MethodsSampling Methods
Sampling Methods
 
Sample design
Sample designSample design
Sample design
 
Sampling & Its Types
Sampling & Its TypesSampling & Its Types
Sampling & Its Types
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inference
 
Probability sampling
Probability samplingProbability sampling
Probability sampling
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniques
 
sampling simple random sampling
sampling simple random samplingsampling simple random sampling
sampling simple random sampling
 
Non-Probability Sampling Method
Non-Probability Sampling Method Non-Probability Sampling Method
Non-Probability Sampling Method
 

Destacado

PROBABILITY SAMPLING TECHNIQUES
PROBABILITY SAMPLING TECHNIQUESPROBABILITY SAMPLING TECHNIQUES
PROBABILITY SAMPLING TECHNIQUESAzam Ghaffar
 
Sampling & surveying ppt
Sampling & surveying pptSampling & surveying ppt
Sampling & surveying pptivisdude82
 
Simple random sampling
Simple random samplingSimple random sampling
Simple random samplingsuncil0071
 
Sampling types, size and eroors
Sampling types, size and eroorsSampling types, size and eroors
Sampling types, size and eroorsAdil Arif
 
Case study research by maureann o keefe
Case study research by maureann o keefeCase study research by maureann o keefe
Case study research by maureann o keefewawaaa789
 
Ch. 12 Sampling Methods
Ch. 12 Sampling MethodsCh. 12 Sampling Methods
Ch. 12 Sampling Methodschristjt
 
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study final
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study finalMM Bagali, HR, MBA, HRM, HRD, Research ,Case study final
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study finaldr m m bagali, phd in hr
 
IT3010 Lecture on Case Study Research
IT3010 Lecture on Case Study ResearchIT3010 Lecture on Case Study Research
IT3010 Lecture on Case Study ResearchBabakFarshchian
 
Sampling methods 16
Sampling methods   16Sampling methods   16
Sampling methods 16Raj Selvam
 

Destacado (20)

sampling ppt
sampling pptsampling ppt
sampling ppt
 
PROBABILITY SAMPLING TECHNIQUES
PROBABILITY SAMPLING TECHNIQUESPROBABILITY SAMPLING TECHNIQUES
PROBABILITY SAMPLING TECHNIQUES
 
Chapter 8-SAMPLE & SAMPLING TECHNIQUES
Chapter 8-SAMPLE & SAMPLING TECHNIQUESChapter 8-SAMPLE & SAMPLING TECHNIQUES
Chapter 8-SAMPLE & SAMPLING TECHNIQUES
 
Sampling & surveying ppt
Sampling & surveying pptSampling & surveying ppt
Sampling & surveying ppt
 
EPA Water Sampling Guide
EPA Water Sampling GuideEPA Water Sampling Guide
EPA Water Sampling Guide
 
Sample
SampleSample
Sample
 
Simple random sampling
Simple random samplingSimple random sampling
Simple random sampling
 
Simple random sampling
Simple random samplingSimple random sampling
Simple random sampling
 
Sampling types, size and eroors
Sampling types, size and eroorsSampling types, size and eroors
Sampling types, size and eroors
 
Case study research by maureann o keefe
Case study research by maureann o keefeCase study research by maureann o keefe
Case study research by maureann o keefe
 
Ch. 12 Sampling Methods
Ch. 12 Sampling MethodsCh. 12 Sampling Methods
Ch. 12 Sampling Methods
 
Data sampling and probability
Data sampling and probabilityData sampling and probability
Data sampling and probability
 
sampling
samplingsampling
sampling
 
T5 sampling
T5 samplingT5 sampling
T5 sampling
 
SET FORM 4 (3.1.1-3.1.3)
SET FORM 4 (3.1.1-3.1.3)SET FORM 4 (3.1.1-3.1.3)
SET FORM 4 (3.1.1-3.1.3)
 
Sampling techniques
Sampling techniquesSampling techniques
Sampling techniques
 
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study final
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study finalMM Bagali, HR, MBA, HRM, HRD, Research ,Case study final
MM Bagali, HR, MBA, HRM, HRD, Research ,Case study final
 
Sampling
SamplingSampling
Sampling
 
IT3010 Lecture on Case Study Research
IT3010 Lecture on Case Study ResearchIT3010 Lecture on Case Study Research
IT3010 Lecture on Case Study Research
 
Sampling methods 16
Sampling methods   16Sampling methods   16
Sampling methods 16
 

Similar a Introduction to sampling

Similar a Introduction to sampling (20)

Sampling....
Sampling....Sampling....
Sampling....
 
Chapter_2_Sampling.pptx
Chapter_2_Sampling.pptxChapter_2_Sampling.pptx
Chapter_2_Sampling.pptx
 
5. sampling design
5. sampling design5. sampling design
5. sampling design
 
Res701 research methodology lecture 7 8-devaprakasam
Res701 research methodology lecture 7 8-devaprakasamRes701 research methodology lecture 7 8-devaprakasam
Res701 research methodology lecture 7 8-devaprakasam
 
day9.ppt
day9.pptday9.ppt
day9.ppt
 
Statr sessions 11 to 12
Statr sessions 11 to 12Statr sessions 11 to 12
Statr sessions 11 to 12
 
Sampling design, sampling errors, sample size determination
Sampling design, sampling errors, sample size determinationSampling design, sampling errors, sample size determination
Sampling design, sampling errors, sample size determination
 
Maneesh (economics)
Maneesh (economics)Maneesh (economics)
Maneesh (economics)
 
SAMPLE SIZE DETERMINATION.ppt
SAMPLE SIZE DETERMINATION.pptSAMPLE SIZE DETERMINATION.ppt
SAMPLE SIZE DETERMINATION.ppt
 
samplesizedetermination-221008120007-0081a5b4.ppt
samplesizedetermination-221008120007-0081a5b4.pptsamplesizedetermination-221008120007-0081a5b4.ppt
samplesizedetermination-221008120007-0081a5b4.ppt
 
Methods.pdf
Methods.pdfMethods.pdf
Methods.pdf
 
Sampling
SamplingSampling
Sampling
 
Brm chap-4 present-updated
Brm chap-4 present-updatedBrm chap-4 present-updated
Brm chap-4 present-updated
 
Sampling by Mr Peng Kungkea
Sampling  by Mr Peng KungkeaSampling  by Mr Peng Kungkea
Sampling by Mr Peng Kungkea
 
Sampling
SamplingSampling
Sampling
 
Sampling methodologies in research mrhod
Sampling methodologies in research mrhodSampling methodologies in research mrhod
Sampling methodologies in research mrhod
 
Sample Size Determination
Sample Size DeterminationSample Size Determination
Sample Size Determination
 
8 sampling & sample size (Dr. Mai,2014)
8  sampling & sample size (Dr. Mai,2014)8  sampling & sample size (Dr. Mai,2014)
8 sampling & sample size (Dr. Mai,2014)
 
2RM2 PPT.pptx
2RM2 PPT.pptx2RM2 PPT.pptx
2RM2 PPT.pptx
 
How to do sampling?
How to do sampling?How to do sampling?
How to do sampling?
 

Último

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Último (20)

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Introduction to sampling

  • 2. Ways to deal with Big Data • Big Analytics - use distributed database systems (hadoop) and parallel programming (MapReduce) • Sampling - use the representative sample estimate the population • Sampling in Hadoop • Hadoop isn’t the king of interactive analysis • Sampling is a good way to grab a set of data then play with it locally (R or Excel) • Pig has a handy SAMPLE keyword
  • 3. Elements of a Sample • Sample - a subset of individuals within a statistical population to estimate characteristics of the whole population. • Target Population - collection of observations we want to study • Sampled Population - all possible observation units that might have been sampled • Sampling Frame - list of all sampling units (student roster, list of phone number) • Sampling Unit - unit we actually sample (e.g. household) • Observational Unit - element to be measured (e.g. individual people in the household)
  • 4. Sampling Techniques (1) • Probability Sampling • Every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. • Not every observational unit has to have the same probability of selection but every observational unit’s probability is known. • Nonprobability Sampling • Some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'), or where the probability of selection can't be accurately determined. • Because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors.
  • 5. Sampling Techniques (2) • Probability Sampling • • • • • • Simple Random Sampling Systematic Sampling Stratified Sampling Cluster or Multistage Sampling Probability Proportional to Size Sampling Panel sampling • Nonprobability Sampling • • • • • Accidental sampling / Convenience sampling / Haphazard Quota sampling Purposive sampling / Judgmental sampling Capture-Recapture sampling (determine population size) Line-intercept sampling http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelho ed.svg
  • 6. Simple Random Sampling - SRS • Definition: for a size n simple random sample, every possible subset of n units in the population has the same chance of being in the sample • Requirement: One unique identifier is needed for implementation • Advantage: easy to understand and implement • Disadvantage: biggest variance, least accuracy
  • 7. Systematic Sampling • Definition: Systematic sampling relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth (k=population size/sample size) element from then onwards. • Requirement: Ordering scheme for population • Advantage: easy to implement, very efficient • Disadvantage: vulnerable to periodicities
  • 8. Stratified sampling (1) • Definition: Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate "strata." Each stratum is then sampled as an independent sub-population, out of which individual elements can be randomly selected. • Requirement: the population can be divided into distinct, non-overlapping strata that are chosen for their relevance to the criterion in question • Variability within strata is minimized • Variability between strata is maximized • The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
  • 9. Stratified sampling (2) • Advantage: • Inferences can be made about specific subgroups • Very likely to give more efficient statistical estimates • Will never result in less efficiency than SRS, provided that each stratum's sample size is proportional to the group's size in the population • Data may be more readily available for individual pre-existing strata within a population than for the overall population • Because strata are independent, different sampling approaches can be used for different subgroups • Disadvantage: • Complexity in implementation and estimation • Stratifying on multiple criteria can be tricky • A minimum sample size must be specified per group
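Stratified sampling with proportional allocation can be sketched as follows. The function name `stratified_sample`, the `stratum_of` callback, and the north/south data are assumptions for illustration; each stratum is sampled independently by SRS, with at least one unit per stratum.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)   # organize the frame by stratum
    sample = {}
    for name, members in strata.items():
        n_h = max(1, round(fraction * len(members)))  # at least one per stratum
        sample[name] = rng.sample(members, n_h)       # independent SRS per stratum
    return sample

# Hypothetical frame: 600 units in "north", 400 in "south".
people = [("north", i) for i in range(600)] + [("south", i) for i in range(400)]
by_region = stratified_sample(people, stratum_of=lambda p: p[0], fraction=0.1, seed=7)
print({name: len(chosen) for name, chosen in by_region.items()})
```

Returning the sample keyed by stratum keeps subgroup inference straightforward, which is one of the advantages listed above.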
  • 10. Cluster Sampling (1) • Definition: the entire population is divided into groups, or clusters, and a random sample of these clusters is selected. All observations in the selected clusters are included in the sample. • Requirement: does not require a complete list of every unit in the population, only a sampling frame at the cluster level • Variability within clusters is maximized • Variability between clusters is minimized • The variables upon which the population is divided into clusters are not strongly correlated with the desired dependent variable.
  • 11. Cluster Sampling (2) • Advantages: • Easy to implement • Cost-effective • Disadvantages: • Complexity in estimation • May not reflect the diversity of clusters • Provides less information per observation than SRS • Units in the same cluster tend to carry redundant information • Standard errors may be higher than for other sampling designs
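A one-stage cluster sample, as defined above, needs only a cluster-level frame. In this sketch the function name `cluster_sample` and the block/household identifiers are hypothetical; clusters are chosen by SRS and every unit in a chosen cluster is kept.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m clusters at random and include all of their units.

    clusters: dict mapping a cluster id to the list of units it contains.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)   # SRS over cluster ids only
    return {cid: list(clusters[cid]) for cid in chosen}

# Hypothetical frame: 20 city blocks with 5 households each.
households = {f"block-{b}": [f"hh-{b}-{i}" for i in range(5)] for b in range(20)}
selected = cluster_sample(households, 4, seed=3)
print(sorted(selected), sum(len(units) for units in selected.values()))
```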
  • 12. Probability Proportional to Size sampling - PPS • Definition: the selection probability for each element is set to be proportional to its size measure. • Every technique covered before was an equal probability of selection (EPS) design • Requirement: an auxiliary variable / size measure that is correlated with the variable of interest • Advantage: • May improve accuracy for a given sample size by concentrating the sample on large elements that have the greatest impact on estimation • Used in business and auditing as monetary unit sampling (MUS) • Disadvantage: • Complexity in implementation and estimation • Different portions of the population may be over- or under-represented because selection probabilities vary
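One simple way to realize PPS is to draw with replacement using the size measure as the weight. This sketch assumes that with-replacement variant (a unit can be drawn more than once); the function name and the account balances are hypothetical.

```python
import random

def pps_sample_with_replacement(sizes, n, seed=None):
    """Draw n units, each draw picking a unit with probability size / total size.

    sizes: dict mapping a unit id to its positive size measure.
    """
    rng = random.Random(seed)
    units = list(sizes)
    weights = [sizes[u] for u in units]
    return rng.choices(units, weights=weights, k=n)

# Hypothetical size measure: account balances, as in monetary unit sampling.
account_balances = {"A": 100, "B": 300, "C": 600}
draws = pps_sample_with_replacement(account_balances, 10_000, seed=0)
share_c = draws.count("C") / len(draws)
print(f"share of draws landing on C: {share_c:.3f} (selection probability 0.6)")
```

Because selection probabilities differ between units, estimates from such a sample have to weight each observation by the inverse of its selection probability.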
  • 13. Representativeness of the sample • Match between target population and sampled population • Method of drawing sample
  • 14. Two kinds of Errors • Non-sampling error - can be reduced by careful design of the survey • Selection bias - part of the target population is not in the sampled population (the target population may not have a natural frame, or the mode of data collection may restrict the frame) • Coverage error - the extent to which the sampling frame does not cover the target population • Measurement bias - the measuring instrument has a tendency to differ from the true value in one direction • Measurement error (Errors of Observation) • Deviations of measurement • Inaccurate measurement • Item nonresponse (didn’t understand, didn’t see, or refused question) • Unit nonresponse (not home, not approached by interviewer, refused call) • Sampling error - results from taking a sample instead of the whole population; can be quantified by statistics and reduced by increasing the sample size
  • 15. Sample Size Calculation • In order to know what our sample size needs to be, we must decide in advance the maximum estimation error we are willing to tolerate. • Determine the nature of the estimate: proportion or mean • Choose the confidence level of the estimate (equivalently, the significance level)
  • 16. Proportion (1) • Proportion: $\hat{p} = X/n$ • where X is the number of 'positive' observations and n is the sample size • When the observations are independent, X has a binomial distribution with variance $np(1-p)$, so $\hat{p}$ has variance $p(1-p)/n$ • The maximum variance of $\hat{p}$ is $0.25/n$, when p = 0.5 • For sufficiently large n, the distribution of $\hat{p}$ will be closely approximated by a normal distribution; around 95% of this distribution's probability lies within 2 standard deviations of the mean. • $\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$ will form a 95% confidence interval for the true proportion.
  • 17. Proportion (2) • If this interval needs to be no more than W units wide, the equation • $4\sqrt{0.25/n} = W$ • can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the error bound on the estimate • i.e., the estimate is usually given as within ± B. So, • for B = 10% one requires n = 100, • for B = 5% one needs n = 400, • for B = 3% the requirement is n ≈ 1111 (often rounded to about 1000), • while for B = 1% a sample size of n = 10000 is required.
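The rule n = 1/B² can be checked with a few lines of Python. This is a sketch under the slide's assumptions (worst case p = 0.5, interval of ±2 standard errors); the function name is hypothetical, and the result is rounded up to a whole unit.

```python
import math
from fractions import Fraction

def sample_size_for_proportion(error_bound):
    """Smallest n with 1 / B^2 <= n, i.e. worst-case n for a bound of ±B.

    Fraction(str(...)) does the arithmetic exactly, so floating-point
    round-off cannot push the ceiling up by one.
    """
    b = Fraction(str(error_bound))
    return math.ceil(1 / b ** 2)

for bound in (0.10, 0.05, 0.03, 0.01):
    print(f"B = {bound:.0%}: n = {sample_size_for_proportion(bound)}")
```

For B = 3% the exact value is 1111.1, so rounding up gives 1112; halving the error bound always quadruples the required sample size.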
  • 18. Mean (1) • A proportion is a special case of a mean. When estimating the population mean using an independent and identically distributed (iid) sample of size n, where each data value has variance $\sigma^2$, the standard error of the sample mean is: • $\sigma/\sqrt{n}$ • This expression describes quantitatively how the estimate becomes more precise as the sample size increases. Using the central limit theorem to justify approximating the sample mean with a normal distribution yields an approximate 95% confidence interval of the form • $\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$
  • 19. Mean (2) • If we wish to have a confidence interval that is W units in width, we would solve • $4\sigma/\sqrt{n} = W$ • for n, yielding the sample size $n = 16\sigma^2/W^2$. • For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is 6 units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is $16 \cdot 15^2 / 6^2 = 100$.
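The blood-pressure example can be reproduced directly from n = 16σ²/W². This is a minimal sketch assuming a known population standard deviation and an interval of ±2 standard errors; the function name is hypothetical.

```python
import math

def sample_size_for_mean(sigma, width):
    """Sample size so that an approximate 95% interval is `width` units wide.

    Solves 4 * sigma / sqrt(n) = width for n and rounds up to a whole unit.
    """
    return math.ceil(16 * sigma ** 2 / width ** 2)

n_required = sample_size_for_mean(sigma=15, width=6)
print(n_required)  # 16 * 225 / 36 = 100
```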
  • 20. Stratified Sample Size (1) • The sample can often be split up into sub-samples. Typically, if there are k such sub-samples (from k different strata) then each of them will have a sample size $n_i$, i = 1, 2, ..., k. These $n_i$ must conform to the rule that $n_1 + n_2 + \dots + n_k = n$ (i.e. that the total sample size is given by the sum of the sub-sample sizes). Selecting these $n_i$ optimally can be done in various ways, using (for example) Neyman's optimal allocation. • There are many reasons to use stratified sampling:[7] to decrease variances of sample estimates, to use partly nonrandom methods, or to study strata individually. A useful, partly non-random method would be to sample individuals where easily accessible, but, where not, sample clusters to save travel costs.
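Neyman's optimal allocation, mentioned above, assigns each stratum a share of the total sample proportional to its size times its standard deviation. The sketch below uses that textbook form; the function name and the urban/rural figures are hypothetical, and the results are left unrounded because rounding each stratum separately can change the total.

```python
def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata.

    n_h = n * N_h * sigma_h / sum_j (N_j * sigma_j), where N_h is the
    stratum size and sigma_h the stratum standard deviation.
    """
    products = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(products.values())
    return {h: n * products[h] / total for h in products}

# Hypothetical strata: the smaller but more variable stratum gets more sample.
allocation = neyman_allocation(
    n=90,
    stratum_sizes={"urban": 6000, "rural": 4000},
    stratum_sds={"urban": 10, "rural": 30},
)
print(allocation)
```

When every stratum has the same standard deviation, this reduces to proportional allocation.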
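  • 21. Stratified Sample Size (2) • In general, for H strata, a weighted sample mean is • $\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$, where $W_h$ is the weight of stratum h and $\bar{x}_h$ is the sample mean within that stratum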