2. 1. Collecting Data
A. Research Design
i. Relational Content
ii. Boundary Specification
iii. Network Samples
a. Local
b. Global
c. Link Tracing designs
B. Sources
i. Archive, Observation, Survey
ii. Survey
a. Name Generators
b. Delivery Mode
2. Data Accuracy
A. How accurate are network survey data?
B. Effect on measurement
C. What can we do about inaccurate or missing
data?
Outline
Social Network Data
3. What information do you want to collect?
This is ultimately a theory question about how you think the social network matters
and what social or biological mechanisms matter for the outcome of interest. This
is driven by thinking through:
Health Outcome | Mechanism | Relation(s)
Examples:
Sometimes the relations are clear:
STD/HIV | Contagion-carrying contact | Sex, drug sharing, etc.
Sometimes not so much:
Health behavior | Information flow | Discussion networks
Health behavior | Social conformity pressure | Admiration nets
Health behavior | Opportunities | Unsupervised interaction
Research Design: new data collection
4. What information do you want to collect?
Sometimes the outcome is deliberately unspecified, as when you are collecting data
for large common-use projects (GSS, Add Health, NHRS).
Then the design is effectively reversed: which relations capture the most (general?
comprehensive? efficacious? reliable?) social mechanisms that will be of broad
interest?
[Figure: candidate relation(s) (respect, contact, information, pressure, excitement) linked to multiple outcomes (substance use, suicidal ideation, treatment adherence, BMI, disease)]
Social mechanism ambiguity allows broad use, which favors relations that tend to be
general. This, of course, makes crisp causal associations more difficult.
5. What information do you want to collect?
Health Outcome | Mechanism | Relation(s)
Relations themselves are often multi-dimensional…do these matter for
your question?
- Perception vs. interaction?
“who do you like?” “who do you talk with?”
- Intensity?
“How often …”, “how much…”
strong vs. weak
- Dynamics?
Starting & ending dates, everyday contact or sporadic?
6. Boundary Specification
Network methods describe positions in relevant social fields, where flows of
particular goods are of interest. As such, boundaries are a fundamentally
theoretical question about what you think matters in the setting of interest.
In general, there are usually relevant social foci that bound the relevant social
field. We expect that social relations will be very clumpy. Consider the
example of friendship ties within and between a high-school and a Jr. high:
What is the theoretically relevant population?
Research Design: Boundary Specification
7. What is the theoretically relevant population?
“Realist” (boundary from actors’ point of view)
  Local: Everyone connected to ego in the relevant manner (all friends, all sex partners)
  Global: All relations relevant to social action (“adolescent peer network” or “Community Health Leaders”)
Nominalist (boundary from researchers’ point of view)
  Local: Relations defined by a name generator, typically limited in number (“5 closest friends”)
  Global: Relations within a particular setting (“School friends” or “Physicians serving this hospital”)
Networks are (generally) treated as bounded systems: what constitutes your bound?
Most of the time…these boundaries are porous.
8. Add Health: while students were given the option to name friends in the other
school, they rarely did so. As such, the school likely serves as a strong
substantive boundary.
What is the theoretically relevant population?
9. Boundaries are often defined theoretically by the relation, not the setting:
Physician patient-sharing networks:
Physicians who share (Medicare)
patients (within one hospital)
For all patients selected in Ohio….
10. Research Design: Boundary Specification
In practice:
a) set a pragmatic bound that captures the bulk of theoretically relevant data
b) Collect data on boundary crossing.
a) You might ask “friends in this neighborhood” but also “Other close
friends?”
b) Don’t limit nominations to current setting, but only trace within the
bounds.
Good prior research, ethnography, informants, etc. should be used to identify
the bounds as best as possible, but these sorts of data allow one to at least
control for out-of-sample effects in models.
For adaptive sampling, such as link-trace designs, you might use a
capture/recapture rule to figure out if you’ve saturated your population. Once
you stop receiving new names…you’ve finished.
--but, if you jump to a new population…this can be hard to discern.
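The stopping rule above can be sketched as follows. This is a minimal illustration, not a formal capture/recapture estimator: it only tracks how many previously unseen names each wave produces. The function names, the `patience` parameter, and the toy names are all hypothetical.

```python
def new_names_per_wave(waves):
    """Count the names in each wave that were not seen in any earlier wave."""
    seen = set()
    counts = []
    for wave in waves:
        fresh = set(wave) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def saturated(waves, patience=1):
    """True if the last `patience` waves produced no new names (assumed rule)."""
    counts = new_names_per_wave(waves)
    return len(counts) >= patience and all(c == 0 for c in counts[-patience:])


# Toy data: three waves of nominations.
waves = [{"ann", "bo"}, {"bo", "cy", "di"}, {"ann", "di"}]
print(new_names_per_wave(waves))  # [2, 2, 0]
print(saturated(waves))           # True
```

Note that a zero count only says the current chain has stopped reaching new people; as the slide warns, a jump into a new population would restart the count.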
11. 1. The level of analysis implies a perspective on sampling:
1. Local: random probability sampling
2. Adaptive: link trace, RDS
3. Complete: census
These are not as dissimilar as they may appear:
a) Local nets imply global connectivity:
a) Every ego-network is a sample from the population-level global
network, and thus should be consistent with a constrained range of
global networks.
b) If you have a clustered setting, many alters in a local network may
overlap, making partial connectivity information possible.
c) For attribute mixing (proportion of whites with black friends, low
BMI with high, users with non-users, etc.), ego-network data are
sufficient to draw global inferences.
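As a rough illustration of point (c), attribute mixing can be tallied directly from ego reports, with no global network needed. The sketch below assumes each record is an (ego attribute, list of alter attributes) pair from an unweighted sample; the function names and data are hypothetical, and a real analysis would apply survey weights.

```python
from collections import Counter


def mixing_table(ego_reports):
    """Tally (ego attribute, alter attribute) pairs over all reported ties."""
    table = Counter()
    for ego_attr, alter_attrs in ego_reports:
        for alter_attr in alter_attrs:
            table[(ego_attr, alter_attr)] += 1
    return table


def share_with_cross_tie(ego_reports, ego_value, alter_value):
    """Proportion of egos with ego_value naming at least one alter_value alter."""
    egos = [alters for attr, alters in ego_reports if attr == ego_value]
    if not egos:
        return float("nan")
    return sum(alter_value in alters for alters in egos) / len(egos)


# Toy data: each ego's own status and the statuses of the alters they named.
reports = [
    ("user", ["user", "non-user"]),
    ("user", ["user"]),
    ("non-user", ["non-user", "non-user"]),
    ("non-user", ["user"]),
]
print(mixing_table(reports)[("user", "non-user")])        # 1
print(share_with_cross_tie(reports, "user", "non-user"))  # 0.5
```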
Research Design: Network Sample
12. Research Design: Network Sample
Data collection strategy, by boundary type (the column distinction is squishy…):
Local
  Nominalist (researcher pov): probability samples; clinical samples; extracted from complete settings
  Realist (natural groups): family interviews; neighbors; workplace samples
Adaptive
  Nominalist (researcher pov): fixed diameter chain from qualifying seed(s)
  Realist (natural groups): unlimited diameter chain on qualifying relation
Complete
  Nominalist (researcher pov): census within a fixed setting (hospital, school, etc.)
  Realist (natural groups): only practical for real groups (“Duke Faculty” “Crip”). Get list from informant & enumerate.
13. Research Design: Network Sample
1. Ego Network Sampling (analysis will be covered in separate session)
• Most similar to standard social survey:
• Easily sampled (as any other survey implementation)
• All information comes from the respondent, so very subject to personal projection.
• Ask ego to report on characteristics of each alter.
For k alters and q attributes, this adds kq questions;
e.g., 5 friends with 10 behaviors adds 50 questions to the survey!
• Ask ego to report on relations among alters.
For k alters and j relational features, this adds j(k(k-1)/2) questions;
e.g., 5 friends and 2 relational questions add 20 questions: 2*((5*4)/2)
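The two question counts above are easy to check before committing to a survey length. A minimal sketch; the function names are hypothetical, and the `symmetric` switch is an added assumption for directed relations, where both orderings of each pair would be asked.

```python
def alter_attribute_questions(k, q):
    """Questions needed to ask about q attributes for each of k alters."""
    return k * q


def alter_tie_questions(k, j, symmetric=True):
    """Questions needed to ask j relational items about every alter pair."""
    pairs = k * (k - 1) // 2 if symmetric else k * (k - 1)
    return j * pairs


print(alter_attribute_questions(5, 10))  # 50
print(alter_tie_questions(5, 2))         # 20
```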
[Figure: ego network diagram showing the respondent and Alters 1–4]
14. 2. Snowball and “link trace” designs
[Figure: link-tracing designs placed between ego-networks and a complete census]
The basic idea is to use “adaptive sampling”: start with one or more seed nodes,
identify their network partners, and then interview them.
The earliest “snowball” samples are of this type. The most recent work is
“respondent-driven sampling” (RDS).
-- If done systematically, some inference elements are knowable. Otherwise, you
have to try to disentangle the sampling process from the real structure.
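A wave-by-wave trace of this kind can be sketched as a breadth-first walk from the seeds. This is an illustration only, not RDS: there are no coupons, recruitment limits, or sampling weights. The `partners` mapping stands in for what each interviewed person would report, and all names are hypothetical.

```python
def link_trace(partners, seeds, max_waves):
    """Trace outward from the seeds, one wave of newly named partners at a time."""
    sampled = set(seeds)
    frontier = list(seeds)
    waves = [sorted(seeds)]
    for _ in range(max_waves):
        next_frontier = []
        for person in frontier:
            for alter in partners.get(person, []):
                if alter not in sampled:  # interview each person only once
                    sampled.add(alter)
                    next_frontier.append(alter)
        if not next_frontier:  # the chain has died out
            break
        waves.append(sorted(next_frontier))
        frontier = next_frontier
    return waves


# Toy data: who each person would name if interviewed.
partners = {"s": ["a", "b"], "a": ["s", "c"], "b": ["c"], "c": ["d"], "d": []}
print(link_trace(partners, ["s"], max_waves=2))  # [['s'], ['a', 'b'], ['c']]
```

Capping `max_waves` corresponds to a fixed-diameter chain; a large cap lets the trace run until no new names appear.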
15. 3. Global network samples: Population Census
• Key issue is to enumerate the population & collect relational
information on all.
– If dynamic, this can make implementation difficult
– Tends to force case-study style designs (highly clustered
settings)
– Contrast N of networks with N of respondents
– Because behavior is self-reported (rather than alter
reported), adding network questions to a census-based
survey is low cost.
• If you are doing a census anyway, it is good to
add network questions. Prosper Peers followed this
strategy.
16. Network Data Sources: Secondary & archival data
Extant direct network data
National Health and Social Life Survey
Americans’ Changing Lives Study
Add Health
Prosper Peers
Archival Sources
Most common is two-mode data, records of people in groups or shared
activity
Examples:
Electronic Health Records
Hospital transfer records
Admission records
Group membership
Collaboration
Key issue with any secondary or archival data is you have to take what you can get…
17. Survey Elements
a) Informed consent
a) It is important to let people know that their identities matter: network data are
confidential but (at least in the construction) not anonymous.
b) Name Generator Questions
a) General term for what relation you are trying to tap.
b) Many extant name generators out there…most evidence suggests that people are very
sensitive to the questions asked.
a) If you ask multiple relations, be clear whether it is OK to repeat names!
c) Response Format
a) Open list: the number of lines suggests a “right” answer.
b) Check off/select: very simple on/off, but might result in over-estimates.
c) Limited choice: limiting choice limits degree, which affects *every* network statistic.
d) Rank/rate: asking people to rank each other is difficult (and can backfire!).
e) If multiple name generators: grid or separate questions?
Network Data Sources: survey data
18. If you use surveys to collect data, some general rules of thumb:
a) Network data collection can be time consuming.
If interests are in network-level structure effects, it is better to have breadth over depth.
Having detailed information on <50% of the sample will make it very difficult to draw
conclusions about the general network structure.
If interest is in detailed interpersonal information – social support, for example – detailed
information on one or two key ties might be more important.
Survey time is the crucial resource: never enough to ask everything you want.
b) Question format:
• If you ask people to recall names (an open list format), fatigue will
result in under-reporting
• If you ask people to check off names from a full list, you can often get
over-reporting
c) It is common to limit people to ~5 nominations. This will bias network stats
for stars, but is sometimes the best choice to avoid fatigue.
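The effect of a nomination cap on stars can be seen by truncating each respondent's list and recomputing degree. A minimal sketch, assuming nominations are stored in the order named; the function names and toy data are hypothetical.

```python
def cap_nominations(nominations, limit):
    """Keep only the first `limit` names each respondent listed."""
    return {ego: alters[:limit] for ego, alters in nominations.items()}


def out_degrees(nominations):
    """Number of people each respondent named."""
    return {ego: len(alters) for ego, alters in nominations.items()}


# Toy data: one "star" names seven people, the others name one or two.
nominations = {
    "hub": ["a", "b", "c", "d", "e", "f", "g"],
    "a": ["hub", "b"],
    "b": ["hub"],
}
capped = cap_nominations(nominations, 5)
print(out_degrees(nominations))  # {'hub': 7, 'a': 2, 'b': 1}
print(out_degrees(capped))       # {'hub': 5, 'a': 2, 'b': 1}
```

Only the high-degree respondent is affected, which is why the cap distorts the tail of the degree distribution more than its center.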
19. Local Network data:
• When using a survey, common to use an “ego-network module.”
• First part: “Name Generator” question to elicit a list of names
• Second part: Working through the list of names to get
information about each person named
• Third part: asking about relations among each person named.
GSS Name Generator:
“From time to time, most people discuss important matters with other people.
Looking back over the last six months -- who are the people with whom you
discussed matters important to you? Just tell me their first names or initials.”
Why this question?
• Only time for one question
• Normative pressure and influence likely travel through strong ties
• Similar to ‘best friend’ or other strong tie generators
• Note there are significant substantive problems with this name generator
20. Local Network data:
The third part usually asks about relations among the alters. Do this
by looping over all possible combinations. If you are asking about a
symmetric relation, then you can limit your questions to the n(n-1)/2
cells of one triangle of the adjacency matrix:
[Figure: empty 5 × 5 adjacency matrix with rows and columns labeled 1 to 5]
GSS: Please think about the relations between the people you just mentioned. Some of them may
be total strangers in the sense that they wouldn't recognize each other if they bumped into each
other on the street. Others may be especially close, as close or closer to each other as they are to
you. First, think about NAME 1 and NAME 2. A. Are NAME 1 and NAME 2 total strangers? B.
Are they especially close? PROBE: As close or closer to each other as they are to you?
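Looping over one triangle of the matrix amounts to enumerating unordered alter pairs. A minimal sketch of how a survey script might generate the pair prompts for a symmetric relation; the function name and the prompt template are illustrative, not the GSS instrument itself.

```python
from itertools import combinations


def alter_pair_prompts(alters, template="Are {a} and {b} total strangers?"):
    """One prompt per unordered alter pair: n(n-1)/2 prompts for n alters."""
    return [template.format(a=a, b=b) for a, b in combinations(alters, 2)]


alters = ["NAME 1", "NAME 2", "NAME 3", "NAME 4", "NAME 5"]
prompts = alter_pair_prompts(alters)
print(len(prompts))  # 10
print(prompts[0])    # Are NAME 1 and NAME 2 total strangers?
```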
22. Complete network surveys require a process that lets you link answers to respondents.
• You cannot have anonymous surveys.
• Recall format:
  • Need ID numbers & a roster to link, or hand-code names to find matches
• Checklists:
  • Need a roster for people to check through
23. Complete network surveys require a process that lets you link answers to respondents.
• Typically you have a number of data tradeoffs:
  • Limited number of responses.
    • Eases survey construction & coding, but lowers density & degree, which affects nearly every other system-level measure.
    • Evidence that people try to fill all of the slots.
  • Name check-off roster (names down a row or on screen, relations as check-boxes).
    • Easy in small settings or CADI, but encourages over-response.
    • The “Amy Willis” Problem.
  • Open recall list.
    • Very difficult cognitively; requires an extra name-matching step in analysis.
    • Still have to give slots in pen & paper; can be dynamic online.
Think carefully about what you want to learn from your survey items.
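The name-matching step that open recall lists require can be partly automated. The sketch below is one possible approach, not a standard routine: it normalizes written names, looks for exact roster matches, and falls back to fuzzy matching with the standard library's `difflib`. The roster, IDs, and the 0.8 cutoff are hypothetical; ambiguous or empty results would still need hand coding.

```python
import difflib


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def match_to_roster(written_name, roster, cutoff=0.8):
    """Return candidate roster IDs for a name written by a respondent."""
    lookup = {}
    for person_id, roster_name in roster.items():
        lookup.setdefault(normalize(roster_name), []).append(person_id)
    key = normalize(written_name)
    if key in lookup:  # exact match after normalization
        return lookup[key]
    close = difflib.get_close_matches(key, list(lookup), n=3, cutoff=cutoff)
    return [pid for name in close for pid in lookup[name]]


# Toy roster: two people share a name, so that match stays ambiguous.
roster = {101: "Dana Ortiz", 102: "Dana Ortiz", 103: "Lee Park"}
print(match_to_roster("dana ortiz", roster))  # [101, 102]
print(match_to_roster("Lee Parks", roster))   # [103]
```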
24. Network Data Sources: survey data
Check off or Open Ended?
Open-ended questions require more of respondents…subject to
fatigue & size suggestion
25. Network Data Sources: survey data
Check off or Open Ended?
Check off is simpler – particularly if yes/no – but also
subject to over-response.
26. Network Data Sources: survey data
Ask respondent for yes/no decisions or quantitative assessment?
Yes/no is cognitively easier (therefore reliable, believable).
Yes/no is *much* faster to administer.
But yes/no provides no discrimination among levels; ratings provide
more nuance.
• A series of binaries can replace one quantitative rating:
Instead of “How often do you see each person?”
1 = once a year; 2 = once a month; 3 = once a week; etc.
Use three questions (in this order):
Who do you see at least once a year?
Who do you see at least once a month?
Who do you see at least once a week?
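Because the three questions are nested, the answers can be collapsed back into a single ordinal score at analysis time. A minimal sketch, assuming each question yields a set of names and using the 1/2/3 coding from the example; the function name and toy names are hypothetical.

```python
def frequency_level(year, month, week):
    """Collapse three nested name lists into one ordinal score per alter.

    1 = at least once a year, 2 = at least once a month, 3 = at least once a week.
    """
    levels = {}
    for level, named in enumerate([year, month, week], start=1):
        for alter in named:
            levels[alter] = max(levels.get(alter, 0), level)
    return levels


# Toy answers to the three questions, asked in order.
year = {"ann", "bo", "cy"}
month = {"ann", "bo"}
week = {"ann"}
print(sorted(frequency_level(year, month, week).items()))
# [('ann', 3), ('bo', 2), ('cy', 1)]
```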
Slide from Steve Borgatti: http://www.analytictech.com/mgt780/slides/survey.pdf
27. Network Data Sources: survey data
Absolute:
“How often do you talk to _____, on average?”
–Need to do pre-testing to determine appropriate time scale
Danger of getting no variance
–Assumes a lot of respondents
Relative:
“How often do you speak to each person on the list below?”
Very infrequently, Somewhat infrequently, About average, Somewhat frequently, Very frequently
Assumes less of respondents; easier task
Is automatically normalized within respondent
Makes it harder to compare values across respondents
Slide from Steve Borgatti: http://www.analytictech.com/mgt780/slides/survey.pdf
28. Network Data Sources: survey data
Survey Mode
Lots of ongoing research on best practices.
Focus on clear design, careful wording.
Pretest as much as you can afford
Key advantage of electronic survey is data processing on the
back-end.
Even with open-ended; no data entry.
See: https://www.une.edu/sites/default/files/Microsoft-Word-Guiding-Principles-for-Mail-and-Internet-Surveys_8-3.pdf
29. Data Accuracy: Survey induced error
How reliable are network data?
In a well-known series of studies, Bernard, Killworth, and Sailer (BKS) compare
recall of communication with records of communication, and recall doesn’t do well…
• Killworth, P. D., Bernard, H. R. 1976. Informant accuracy in social network data. Hum. Organ. 35:269-86
• Bernard, H. R., Killworth, P. D. 1977. Informant accuracy in social network data, II. Hum. Commun. Res. 4:3-18
• Killworth, P. D., Bernard, H. R. 1979. Informant accuracy in social network data, III. A comparison of triadic structures in behavioral and cognitive data. Soc. Networks 2:19-46
• Bernard, H. R., Killworth, P. D., Sailer, L. 1980. Informant accuracy in social network data, IV. A comparison of clique-level structure in behavioral and cognitive data. Soc. Networks 2:191-218
• Bernard, H., Killworth, P., Sailer, L. 1982. Informant accuracy in social network data, V. Social Science Research 11:30-66
• The Problem of Informant Accuracy: The Validity of Retrospective Data. Annual Review of Anthropology 13:495-517 (volume publication date October 1984). DOI: 10.1146/annurev.an.13.100184.002431
30. Data Accuracy: Survey induced error
How reliable are network data?
The BKS studies sparked a bunch of work on network survey reliability and the results
are mixed. Some general features:
a) Important relations are recalled
b) People bias toward “common” activities…
c) …that are relationally salient.
d) Behavior reports are more consistent than attitude reports
e) Strong survey, interviewer or instrument effects.
32. Data Accuracy: Survey induced error
How reliable are network data?
Assessing accuracy is difficult, because respondents report on relations over
the last 6 months (or year, depending on type), but may be interviewed at
different times.
33. Data Accuracy: Survey induced error
How reliable are network data?
Once we account for observation windows and question length, we find very
high concordance on dates of relations.
34. Data Accuracy: Survey induced error
How reliable are network data?
For ego-level ties that were not timed, we can ask if a t1 nomination is
retained: If I “ever did drugs” with you at t1, then I should also have reported
doing so at future data collections.
Very few relations are “recanted” (4.7% sex, 13.6% drug, 3% social).
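The recant check can be sketched as a set difference between waves. This is a minimal illustration on made-up ties, not the study's actual procedure or data; it assumes each tie is stored as an (ego, alter) pair and that the question is an “ever” item, so every t1 tie should reappear at t2.

```python
def recant_rate(t1_ties, t2_ties):
    """Share of 'ever' ties reported at t1 that are not reported again at t2."""
    if not t1_ties:
        return float("nan")
    recanted = t1_ties - t2_ties
    return len(recanted) / len(t1_ties)


# Toy data: (ego, alter) pairs reported at each wave.
t1 = {("ego1", "a"), ("ego1", "b"), ("ego2", "c"), ("ego2", "d")}
t2 = {("ego1", "a"), ("ego1", "b"), ("ego2", "c"), ("ego2", "e")}
print(recant_rate(t1, t2))  # 0.25
```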
35. Data Accuracy: Survey induced error
How reliable are network data?
Proportion of times a “matrix” tie is corroborated by a direct response?
[Figure: ego with alters A and B; given a reported tie between A and B, how often the A–B tie is corroborated]
37. Data Accuracy: Survey induced error
How reliable are network data?
Why are the Colorado Springs data so much more reliable than the BKS data?
a) Very dedicated data collectors
b) No nomination limits on self-reports
c) Highly salient relations in a small community
38. • Interviewer effects
– Systematic variation in responses by interviewer (Paik
and Sanchagrin, 2013; Marsden, 2003)
• Design of the survey instrument (Lozar, Vehovar and Hlebec, 2004)
• Panel Conditioning (Lazarsfeld, 1940; Warren and Halpern-Manners, 2012)
– Rise of panels for basic social research (Keeter et al., 2015)
– Survey memory is short (Groves, 1986)
42. Whatever method is used, data will always be incomplete. What are the
implications for analysis?
Example 1. Ego is a matchable person in the School
[Figure: True Network vs. Observed Network for ego and alters labeled M, Out, and Un]
Effects of missing data
43. Example 2. Ego is not on the school roster
[Figure: True Network vs. Observed Network for alters labeled M and Un]
44. Example 3:
Node population: 2-step neighborhood of Actor X
Relational population: Any connection among all nodes
[Figure: network diagram (focal actor F; 1-step nodes 1.1–1.5; 2-step nodes 2.1–2.8; 3-step nodes 3.1–3.3) and the corresponding blocked adjacency matrix, with blocks marked Full, Full (0), or Unknown (UK)]
45. Example 4
Node population: 2-step neighborhood of Actor X
Relational population: Trace, plus All connections among 1-step contacts
[Figure: network diagram (focal actor F; 1-step nodes 1.1–1.5; 2-step nodes 2.1–2.8; 3-step nodes 3.1–3.3) and the corresponding blocked adjacency matrix, with blocks marked Full, Full (0), or Unknown (UK)]
46. Example 5.
Node population: 2-step neighborhood of Actor X
Relational population: Only tracing contacts
[Figure: network diagram (focal actor F; 1-step nodes 1.1–1.5; 2-step nodes 2.1–2.8; 3-step nodes 3.1–3.3) and the corresponding blocked adjacency matrix, with blocks marked Full, Full (0), or Unknown (UK)]
47. Example 6
Node population: 2-step neighborhood from 3 focal actors
Relational population: All relations among actors
[Figure: blocked adjacency matrix with rows and columns grouped as Focal, 1-Step, 2-Step, and 3-Step; blocks marked Full, Full (0), or Unknown (UK)]
48. Example 7.
Node population: 1-step neighborhood from 3 focal actors
Relational population: Only relations from focal nodes
[Figure: blocked adjacency matrix with rows and columns grouped as Focal, 1-Step, 2-Step, and 3-Step; blocks marked Full, Full (0), or Unknown (UK)]
49. Social Network Data
Effects of missing data on measures (Smith & Moody, 2014; Smith, Morgan & Moody, 2016)
Identify the practical effect of missing data as a measurement error problem:
induce error and evaluate effect.
Randomly select nodes to delete, remove their edges & recalculate statistics of
interest.
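The delete-and-recalculate procedure can be sketched with the standard library alone. This is a toy illustration, not the cited papers' simulation design: the network is a 20-node ring, the statistic is mean degree, and the function names, seed, and repetition count are assumptions.

```python
import random


def mean_degree(nodes, edges):
    """Average number of ties per node in an undirected network."""
    return 2 * len(edges) / len(nodes) if nodes else 0.0


def drop_nodes(nodes, edges, share, rng):
    """Delete a random share of nodes and every edge touching them."""
    n_drop = round(share * len(nodes))
    dropped = set(rng.sample(sorted(nodes), n_drop))
    kept_nodes = set(nodes) - dropped
    kept_edges = {(u, v) for u, v in edges if u not in dropped and v not in dropped}
    return kept_nodes, kept_edges


def simulate_bias(nodes, edges, share, reps=200, seed=0):
    """Average (observed - true) mean degree over repeated random deletions."""
    rng = random.Random(seed)
    truth = mean_degree(nodes, edges)
    diffs = []
    for _ in range(reps):
        kept_nodes, kept_edges = drop_nodes(nodes, edges, share, rng)
        diffs.append(mean_degree(kept_nodes, kept_edges) - truth)
    return sum(diffs) / len(diffs)


# Toy network: a ring of 20 nodes, so every node has degree 2.
nodes = set(range(20))
edges = {(i, (i + 1) % 20) for i in range(20)}
print(mean_degree(nodes, edges))             # 2.0
print(simulate_bias(nodes, edges, 0.3) < 0)  # True: mean degree is understated
```

The same loop works for any statistic of interest by swapping out `mean_degree`.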
53. Social Network Data
Effects of missing data on measures
What to do about missing data?
Easy:
• Do nothing. If associated error is small ignore it. This is the default, not
particularly satisfying.
Harder: Impute ties
• If the relation has known constraints, use those (symmetry, for example)
• If there is a clear association, you can use it to impute values.
• If imputing and can use a randomization routine, do so (akin to multiple
imputation routines)
• All ad hoc.
Hardest:
• Model missingness with ERGM/Latent-network models.
• Build a model for tie formation on observed, include structural missing &
impute. Handcock & Gile have new routines for this.
• Computationally intensive…but analytically not difficult.
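The simplest of the imputation options, using a known symmetry constraint, can be sketched as follows. It assumes a symmetric relation and that a single report is enough to count as a tie (a union rule); both are assumptions, as are the function name and toy data. Pairs where neither person responded stay unknown.

```python
from itertools import combinations


def tie_status(people, respondents, reports):
    """Classify each undirected pair for a symmetric relation.

    reports: set of (reporter, named) pairs collected from respondents only.
    """
    status = {}
    for a, b in combinations(sorted(people), 2):
        reported = (a, b) in reports or (b, a) in reports
        answered = [p for p in (a, b) if p in respondents]
        if len(answered) == 2:
            status[(a, b)] = "tie" if reported else "no tie"
        elif len(answered) == 1:
            # One side is missing: rely on the partner's report (symmetry).
            status[(a, b)] = "tie (imputed)" if reported else "no tie (imputed)"
        else:
            status[(a, b)] = "unknown"
    return status


# Toy data: c and d did not respond.
people = {"a", "b", "c", "d"}
respondents = {"a", "b"}
reports = {("a", "b"), ("a", "c")}
status = tie_status(people, respondents, reports)
print(status[("a", "c")])  # tie (imputed)
print(status[("c", "d")])  # unknown
```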
54. Summary:
Data collection design & missing data affect the information at hand to draw
conclusions about the system. Everything we do from now on is built on some
manipulation of the observed adjacency matrix, so we want to understand which
conclusions are valid and which are invalid due to systematic distortions in the data.
Statistical modeling tools hold promise. We can build models of networks that account
for missing data: we are able to “fix” the structural zeros in our models by treating them
as given. This then lets us infer to the world of all graphs with that same missing data
structure. These models are very new, and not widely available yet….
Network Data Sources: Missing Data