09 Respondent Driven Sampling and Network Sampling with Memory (2016)
1. Respondent Driven Sampling &
Network Sampling with Memory
(time permitting…)
M. Giovanna Merli
Sanford School of Public Policy &
Duke Population Research Institute (DUPRI)
Duke University
2. Funding Acknowledgements
• RDS Data Collection in China (2009-2010)
– “Place-RDS Comparison Study”
• USAID under the terms of cooperative agreements GPO-A-00-03-00003-00 and
GPO-A-00-09-00003-0 (Weir, PI)
• China National Center for STD Control (Chen, PI)
• Duke CFAR AI064518 (Merli, PI)
– “Partnership for Social Science Research on HIV/AIDS in China”
• NICHD R24 HD056670 (Henderson, PI)
• RDS Data Analyses and Simulations (2011-2015)
– “Using Multiple Data Sources to Improve RDS Estimation”
• NICHD R01HD068523 (Merli, PI)
• NSM Data Collection in Tanzania
– PFirst Award/DGHI (Merli, PI)
3. Problems with the study of hidden
populations
Female sex workers, men who have sex with men, injecting drug users,
people who are homeless, and undocumented migrants are examples of hidden populations
For these populations we typically want to:
• Obtain accurate and precise estimates of disease prevalence
• Discern impact on larger population health dynamics
• Identify gaps in HIV/STD prevention
Collecting data from hidden populations that support population-level inference is
difficult because there is no sampling frame: their members are hard
to identify
– Stigma
– Non response
– Lack of trust
– Rarity
4. Problems with the study of hidden
populations
• Convenience samples, clinic-based inquiries,
and sampling frames with limited coverage
(e.g. venue-based sampling) lack a basis for
inferring representativeness
5. Respondent Driven Sampling (RDS)
Heckathorn 1997, 2002; Salganik and Heckathorn 2004;
Volz and Heckathorn 2008
• Most popular solution to
problems of sampling
hidden populations
– 450+ studies
– 624+ papers, 10k+ citations
– Over $185 million from NIH
• Compare to “ego centric”
– 167 studies funded
– $42 million since 1990
6. How RDS works
• RDS primarily used to estimate population proportions of binary
nodal covariates (e.g. gender, infection status, tier of sex work, etc.)
• Leverages social network of respondents to recruit other
respondents
• Chain referral / peer recruitment / link tracing sampling strategy
– “Seed” participants (selected by convenience) receive coupons (2)
– Recruit 2-3 new participants each
– Each new respondent given 2-3 coupons to recruit others
– Recruitment incentives for participating and for successful recruitment
– No one participates more than once
– Process continues until desired sample size is obtained
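The recruitment steps above can be sketched as a small simulation. This is a minimal illustration, not the protocol used in any study cited here: the function name `simulate_rds`, the toy ring-shaped network, and the choice to pick recruits uniformly at random among eligible contacts are all assumptions made for the example.

```python
import random
from collections import deque

def simulate_rds(network, seeds, target_n, coupons=3, rng=None):
    """Simulate chain-referral recruitment over an undirected network.

    network: dict mapping each person to a list of social contacts.
    Returns a list of (recruit, recruiter) pairs; seeds have recruiter None.
    """
    rng = rng or random.Random()
    sampled = set(seeds)                      # no one participates more than once
    recruitments = [(s, None) for s in seeds]
    queue = deque(seeds)                      # respondents still holding coupons
    while queue and len(recruitments) < target_n:
        recruiter = queue.popleft()
        # Only contacts who have not yet participated can redeem a coupon.
        eligible = [c for c in network[recruiter] if c not in sampled]
        rng.shuffle(eligible)                 # assumption: non-preferential choice
        for recruit in eligible[:coupons]:
            if len(recruitments) >= target_n:
                break
            sampled.add(recruit)
            recruitments.append((recruit, recruiter))
            queue.append(recruit)
    return recruitments

# Hypothetical toy network: 100 people on a ring, each tied to 4 neighbors.
network = {i: [(i + k) % 100 for k in (-2, -1, 1, 2)] for i in range(100)}
chain = simulate_rds(network, seeds=[0, 50], target_n=60, coupons=2,
                     rng=random.Random(1))
print(len(chain), "respondents recruited")
```

Recruitment stops either when the target sample size is reached or when no coupon holder has an eligible contact left, so the returned sample can be smaller than `target_n`.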
15. Problems with estimation in link tracing
sampling designs of hidden populations
• Sampling frame
unavailable
• Sample inclusion
probabilities are not
known (hence sampling
weights unknown)
• Researchers have limited
control of the sampling
process
• Seed respondents not
chosen at random
16. RDS solution
• Sampling probabilities computed under an approximation of
the true sampling process
– RDS assumes non-seed participants are Sampled with Probability
Proportional to self-reported Degree (SPPD)
– Provable in a random walk on most graphs of interest
– Sampling probabilities approximated by degree, hence sampling
weight = 1/degree
• Weighting/estimation can yield asymptotically unbiased
estimates of the population mean
• SPPD assumption underpins much of RDS estimation claims
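The link between a random walk and degree-proportional sampling can be checked numerically. The sketch below is an illustration under simplifying assumptions: a hypothetical five-node connected, non-bipartite network, and a walk that samples with replacement. It compares each node's share of total degree with the share of steps a simple random walk spends there.

```python
import random

def visit_shares(network, steps, start, rng):
    """Share of steps a simple random walk spends at each node."""
    counts = {node: 0 for node in network}
    node = start
    for _ in range(steps):
        node = rng.choice(network[node])  # move to a uniformly chosen neighbor
        counts[node] += 1
    return {v: c / steps for v, c in counts.items()}

# Hypothetical undirected network (adjacency lists are symmetric).
network = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [3]}
total_degree = sum(len(nbrs) for nbrs in network.values())
degree_share = {v: len(nbrs) / total_degree for v, nbrs in network.items()}

shares = visit_shares(network, steps=200_000, start=4, rng=random.Random(0))
for v in network:
    # Columns: node, degree share, observed visit share, implied weight 1/degree
    print(v, round(degree_share[v], 3), round(shares[v], 3),
          round(1 / len(network[v]), 3))
```

The two shares should agree closely regardless of the starting node, which is why 1/degree serves as the sampling weight when the SPPD approximation holds.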
17. RDS estimators
Estimator: proportion equation, followed by notes
• Naïve: $\hat{p} = \left(\sum_{i \in \chi} x_i\right) n^{-1}$
– $x_i$ is the value of the focal variable for respondent $i$; $n$ is the sample size
• RDS1-SH: $\hat{p} = S_{0,1}\, d_0 \left(S_{0,1}\, d_0 + S_{1,0}\, d_1\right)^{-1}$
– $S_{a,b}$ is the estimated proportion of recruitments from group $a$ to $b$; $d_a$ is the estimated average degree in each group (Salganik and Heckathorn 2004)
• RDS1-LEN: $\hat{p} = S^{ego}_{0,1}\, d_0 \left(S^{ego}_{0,1}\, d_0 + S^{ego}_{1,0}\, d_1\right)^{-1}$
– $S^{ego}_{a,b}$ is the estimated proportion of network ties from group $a$ to $b$ based on ego network reports (Lu 2013)
• RDS2-VH: $\hat{p} = \left(\sum_{i \in \chi} x_i\, d_i^{-1}\right) \left(\sum_{i \in \chi} d_i^{-1}\right)^{-1}$
– $d_i^{-1}$ is the inverse of self-reported degree for person $i$ (Volz and Heckathorn 2008)
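The estimators on this slide translate directly into a few lines of code. The sketch below is illustrative only: the function names and the toy inputs are assumptions, and the recruitment proportions and group degrees passed to `rds1_estimate` are taken as given rather than estimated from data. RDS1-LEN has the same form as RDS1-SH with ego-network tie proportions substituted for recruitment proportions.

```python
def naive_estimate(x):
    """Naive estimator: unweighted sample mean of the binary variable."""
    return sum(x) / len(x)

def rds1_estimate(s01, s10, d0, d1):
    """RDS1-SH form: s01 and s10 are cross-group recruitment proportions,
    d0 and d1 are average degrees in groups 0 and 1."""
    return (s01 * d0) / (s01 * d0 + s10 * d1)

def rds2_vh_estimate(x, degree):
    """RDS2-VH: mean of x weighted by the inverse of self-reported degree."""
    weights = [1.0 / d for d in degree]
    return sum(w * xi for w, xi in zip(weights, x)) / sum(weights)

# Hypothetical respondents: x is the focal variable, degree is self-reported.
x = [1, 0, 1, 0]
degree = [2, 4, 2, 4]
print(naive_estimate(x))
print(rds2_vh_estimate(x, degree))
print(rds1_estimate(s01=0.2, s10=0.4, d0=10, d1=5))
```

In the toy data the respondents with x = 1 report lower degree, so inverse-degree weighting raises the RDS2-VH estimate above the naive mean.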
18. In RDS, all approximations are subject to critical
assumptions that are often not met in the field
• About the unobserved sample recruitment process (most crucial)
– Respondent gives a coupon to a friend
– Respondents recruit new participants non-preferentially from amongst their
social contacts (each friend has an equal chance of being picked)
– The initial set of respondents (“seeds”) are drawn with random probabilities
– Respondents report their number of ties accurately (“How many people do you
know who are members of the population of interest?”)
• About the social network structure
– Rapid mixing: The chain referral process converges very quickly to the
stationary distribution of a random walk (i.e. node selection probabilities are
independent of sample starting point)
– Connectedness: The target population must be connected by a network that
consists of a single component
– Network size: Network must be sufficiently large (sampling fraction small) that
sampling without replacement can be treated as if it is equivalent to sampling
with replacement
19. Prior evaluations of RDS
• Comparison of RDS estimates to known parameters of non-
hidden populations
– (Wejnert 2009; Wejnert & Heckathorn 2008; McCreesh et al. 2012)
• Test effects of violating RDS assumptions about social
network structure on synthetic populations
– (Gile & Handcock 2010; Thomas & Gile 2011; Lu et al. 2011)
• Examine effects of network structure in multiple empirical
settings with theoretical/ideal RDS samples
– (Goel & Salganik 2010; Mouw & Verdery 2012; Verdery, Mouw et al. 2015)
• Use full information on participants’ recruitment behavior to
evaluate non-preferential recruitment assumption
– (Yamanis, Merli, Neely et al. Sociological Methods and Research 2013)
20. RDS evaluation in the context of
Female Sex Workers in Liuzhou, China
• Evaluate SPPD assumption and
population coverage (Merli, Moody, Smith et
al., 2015 Social Science and Medicine)
• Evaluate performance of RDS
estimators (Verdery, Merli, Moody et al., 2015
Epidemiology)
• Propose RDS data collection
innovation to improve estimator
performance (Verdery, Merli, Moody, In
Progress)
• Evaluations with a simulation
approach grounded in empirical data
from a hidden population of FSWs in
China (Liuzhou, Guangxi Province)
(Weir, Merli, Li et al. 2012, Sexually Transmitted
Infections)
21. Data
• Two sources
– RDS: 583 FSWs (Oct. 2009 – Feb. 2010) (about 8% of total
FSW population in Liuzhou)
– PLACE (venue based sampling approach): 161 FSWs (Nov.
2009 – Mar. 2010)
• Same target population and inclusion definition
– Women who reside in Liuzhou who exchanged sex for money in last 4 weeks
• Same geographic area and similar time period
• Same measurement of key variables
– Test for biomarker of lifetime exposure to syphilis and core questionnaire
• Same face-to-face interview and common applicant pool for interviewers
• Rare to have two concurrent surveys in same population!
22. Description of the Liuzhou RDS sample
Tier of sex work | Venues where clients are solicited | RDS (N = 576)
High | Karaoke bars, star hotels, discos, night clubs | 250
Middle | Hair salons, saunas, massage parlors, foot cleaning/massage, bathhouses | 268
Low | Streets, parks, other public spaces | 27
Non-venue based | Telephone, text, internet, private referrals | 31
Fisher and Merli 2014, Network Science.
23. Approach, part 1
• Construct “population social network” from data
collected in RDS and PLACE
– Used new methodologies for estimating social network
parameters and simulating population network
• Use Case Control Logistic Regression to estimate homophily
parameters from the RDS data (Smith, SM 2012)
• Use Exponential Random Graph Modeling to generate full
network from local structural features (ERGM; Handcock et al., JOSS 2008)
– Tested various sensitivities about the means by which
this population social network is constructed
• (which data source, venue size estimates, and assumptions
about geographic distribution of social network ties)
24. “Population social network”
Generate “population characteristics”
based on PLACE survey estimates
Add “population social network”
based on RDS survey estimates
25. Approach, part 2
• Simulate RDS chains over “population social
network” (1000 per recruitment scenario)
– Scenarios vary according to different sample
recruitment assumptions
• Seeding of the chain
• Recruitment patterns
– How much does the ideal case (random seeding
and random recruitment) diverge from actual RDS
seeding and recruitment matched to the Liuzhou
FSW data?
26. Results:
Violation of SPPD assumption
• Compared individual degree to
the proportion of times an
individual was sampled across
the simulated chains
– Very high correlation when
seeds and referrals are random
– SPPD assumption increasingly
violated when seeds & referrals
are matched to the actual data
– Over-recruitment of middle tier
sex workers drives the result
• For more:
– Merli, Moody, Smith et al.,
Social Science & Medicine,
2015
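[Figure: individual degree plotted against the proportion of simulated chains in which the individual was sampled, three panels: r = 0.82, r = 0.96, r = 0.97]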
Merli, Moody, Smith et al., SSM, 2015
27. Distribution of RDS2-VH proportion estimates
(low/middle tier) across seeding and recruitment
scenarios
Verdery, Merli, Moody et al. 2015, Epidemiology
28. Variability of estimates: Design effects
(ratio of variance in RDS estimates to variance in estimates from same size SRS)
• Design effects (DE) are very large, but not out of line with findings of prior work (Goel
and Salganik 2010)
• Large design effects imply that much larger sample sizes would
be required to reach the level of precision currently assumed from
RDS samples, which are typically in the hundreds
• CDC recommends RDS sample sizes in the hundreds for public
health surveillance. Implication: insufficient power to
detect changes in behaviors or disease prevalence
Design effects by seeding/recruitment scenario:
Scenario | DemDem | DemRan | RanRan
Middle Tier | 6.18 | 19.60 | 28.20
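The design effect defined on this slide, and what it implies for precision, can be sketched as follows. This is a minimal illustration: the function names are assumptions, the SRS variance uses the standard formula for a proportion, p(1 - p)/n, and applying the reported design effects to a sample of 583 (the size of the Liuzhou RDS sample) is an assumption made only to show the arithmetic.

```python
from statistics import pvariance

def design_effect(estimates, p, n):
    """Variance of repeated estimates divided by the variance of a
    proportion estimated from a simple random sample (SRS) of size n."""
    srs_variance = p * (1 - p) / n
    return pvariance(estimates) / srs_variance

def effective_sample_size(n, de):
    """Size of the SRS that would give the same precision as n RDS respondents."""
    return n / de

# Toy check: two simulated estimates around a true proportion of 0.5.
print(design_effect([0.4, 0.6], p=0.5, n=100))

# Illustrative only: design effects reported above, applied to n = 583.
for scenario, de in [("DemDem", 6.18), ("DemRan", 19.60), ("RanRan", 28.20)]:
    print(scenario, round(effective_sample_size(583, de), 1))
```

Under that assumption, the effective sample sizes come to roughly 94, 30, and 21, which is the sense in which samples in the hundreds fall short of the precision usually attributed to them.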
29. Discussion
• Seeding and recruitment scenarios
– Matching on seeds not critical
– Matching on recruitment patterns has a larger
effect, exacerbates biases but reduces design
effects
• Problematic because recruitment patterns seem harder to control than
seed selection
30. Estimator performance
• Estimator development
– Only one (RDS1-LEN) works
markedly better than
others
• Robust to preferential
recruitment by taking into
account respondents’ ego-
network composition
– BUT unusable for most
(unobservable)
characteristics we care
about
– Still problems with variance
estimation
Verdery, Merli, Moody et al. 2015, Epidemiology
Distributions of estimates of proportions in low
tiers of sex work by estimator (recruitment and
seeds matched to the Liuzhou FSW data)
31. Recent innovation: IP-RDS
(Verdery, Merli, Moody, In Progress)
• What can be done to improve the performance of RDS
estimates while retaining the method’s desirable peer-
driven sample recruitment properties?
• Modify RDS data collection process
• Apply antithetic variate mean estimator to data
• Results from simulations: Improved estimation
performance
32. New data collection protocol
IP-RDS
• Incentivize respondents to invert their
preferences when choosing new respondents,
i.e. respondents are asked to invert their
recruitment preferences on the recruitment
biasing variable (e.g. tier of sex work)
36. Antithetic variate mean estimator
• $\hat{\mu}_{AV} = \frac{\sum_{i \in m_1} y_i}{2} + \frac{\sum_{i \in m_2} y_i}{2}$, where
$y_i$ is the value of the focal variable for respondent $i$,
$m_1$ is the count of recruitments by members of
one group of the recruitment biasing variable
(e.g. tier of sex work), and $m_2$ is the count of
recruitments by members of the other group
37. Distributions of estimates of proportions in low/mid tiers of sex work
by estimator (naïve mean, RDS2-VH, AV-IP_RDS) and level of biased
recruitment behavior (absolute difference in recruitment probabilities
conditional on attribute of targeted peer)
38. Discussion of IP-RDS
• Simple change to RDS protocol
– May or may not require financial incentives for
targeted recruitment (empirical question)
• Outperforms conventional estimators
– Gains in bias reduction comparable to RDS1-LEN
estimator
• Tested on more networks (similar results)
• BUT …Not yet field tested
39. Network Sampling with Memory
• Mouw and Verdery 2012, Sociological
Methodology
• Collects network data
• Introduces researcher’s control over the
sampling process
• Directs the recruitment process to more
efficiently explore the network (avoiding
bottlenecks)
40. How does NSM work?
• Recruitment starts with a few seed respondents
• Network roster data collected from respondents about
minimally identifying information of their network members
(last name and last four digits of cell phone number) to
connect nodes in the network (up to 10 network members per
respondent)
• NSM sampling algorithm selects up to 3 nominated network
members per respondent and asks respondents for full contact
information on these
• Process proceeds iteratively to recruit new waves of
respondents
42. How does NSM work?
• NSM sampling algorithm uses two sampling
modes, List and Search
• List mode
– keeps a list, L, of all nominated network members
– samples with replacement from L
– even sampling of new nodes -- new nodes sampled at
the same cumulative sampling rate as earlier nodes
– as list of sampled nodes approaches the full population
network, NSM sample converges to simple random
sampling
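The core of List mode can be sketched in a few lines. This is a deliberately simplified illustration and not the NSM algorithm of Mouw and Verdery (2012): it keeps the list L and draws from it with replacement, but it omits the even-sampling adjustment for newly revealed nodes, the Search mode, and the limits on nominations per respondent. The function name `list_mode_sample` and the toy nomination network are assumptions.

```python
import random

def list_mode_sample(nominations, seeds, n_draws, rng=None):
    """Simplified List-mode sketch: draw with replacement from the list L
    of everyone nominated so far; interviewing a node reveals its nominees."""
    rng = rng or random.Random()
    L = []             # nominated network members revealed so far
    in_L = set()
    interviewed = []   # distinct respondents, in order of first selection

    def interview(node):
        interviewed.append(node)
        for alter in nominations[node]:
            if alter not in in_L:
                in_L.add(alter)
                L.append(alter)

    for s in seeds:
        interview(s)
    done = set(seeds)
    for _ in range(n_draws):
        if not L:
            break
        node = rng.choice(L)   # sampling with replacement from L
        if node not in done:   # a repeat draw does not trigger a new interview
            done.add(node)
            interview(node)
    return interviewed, L

# Hypothetical nomination network: 50 people on a ring, 4 nominees each.
nominations = {i: [(i + k) % 50 for k in (-2, -1, 1, 2)] for i in range(50)}
interviewed, L = list_mode_sample(nominations, seeds=[0], n_draws=200,
                                  rng=random.Random(2))
print(len(interviewed), "interviewed;", len(L), "nominated")
```

Because draws come from the whole list rather than from the most recent respondent's contacts, the sample is not forced to follow a single referral chain, which is the intuition behind the convergence toward simple random sampling as L approaches the full population.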
43. How does NSM work?
• Search mode: look for “bridge” nodes to
unexplored parts of the network. Start in
Search mode, then switch to List mode.
44. Simulation results
• Test NSM vs. RDS using 162 university and school
networks from Facebook and Add Health
• Size of networks ranges from 300 to 16,500 nodes
• Estimate % white (Add Health) and % first year students
(Facebook)
• Start from a randomly selected student, repeat 500
times for each network
• Calculate bias, design effects and mean absolute bias
• In the test (162 networks), the design effect (DE) is 1.16 for NSM vs. 77.38 for RDS
45. Is it feasible?
• Is it feasible to collect network data on hidden
populations?
• 2010 NSIT (Network Survey of Immigration and
Transnationalism) (Mouw, PI)
• CAHS (Chinese in Africa Health Survey) (Merli, PI)
• Cost effectiveness of gains in precision
46. NSM field applications
Network Survey of Immigration and
Transnationalism (NSIT)
Mouw et al. 2014. Social Problems;
Verdery et al. 2016. Social Networks
Chinese in Africa Health Survey (CAHS)
Merli, Verdery, Mouw, Li 2016. Migration Studies
[Figure: NSIT network. Node color: red = RDU, blue = Mexico, green = Houston. Node size: small = nominated, large = sampled]
[Figure: Network of Chinese migrants in Dar es Salaam sampled by NSM; node size = probability of selecting next node]
47. Key challenge: Getting referrals from
respondents
• NSIT required recontacting respondents to get
contact information on alters
• CAHS used a “forward” sampling variant (FNSM),
which is more practical
– Asked for contact information on a small number
of alters at each interview (selected by the NSM
algorithm)
48. NSM -- Future directions
• NIH R21 grant to test NSM among Chinese
immigrants in RDU (Merli, Mouw, Verdery,
Moody, Keister, Sanders)
– Pilot various approaches to get referrals from
respondents
– Evaluate NSM against ACS
– Test multiple modes of data collection (in-person,
telephone, web)