Lecture given as part of the BIGSSS 2019 summer school on migration (https://bigsss-css.jacobs-university.de/migration2019/migration/). See https://ingmarweber.de/publications/ for related publications. Mostly joint work with Emilio Zagheni.
Digital Trace Data for Demographic Research
1. Digital Trace Data for Demographic Research
Ingmar Weber
@ingmarweber
June 12, 2019
Lecture at BIGSSS CSS 2019
Or How I learned to Love Online Advertising
3. What is Demography?
Demography is the statistical study of populations.
According to IP address 70.67.193.176, user Pbsouthwood and other
contributors to https://en.wikipedia.org/wiki/Demography
4. The Population Equation
Change in population = Inputs – Outputs
Inputs = Births + In-migration
Outputs = Deaths + Out-migration
• ∆P = (B + I) − (D + O)
Fertility, Mortality and Migration
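The population equation above can be written as a few lines of Python. This is a minimal sketch; the function name and the numbers in the example are made up for illustration and do not come from any data set.

```python
def population_change(births, in_migration, deaths, out_migration):
    """Return the change in population: (B + I) - (D + O)."""
    inputs = births + in_migration      # Inputs = Births + In-migration
    outputs = deaths + out_migration    # Outputs = Deaths + Out-migration
    return inputs - outputs


# Illustrative (invented) yearly counts for a small region.
delta = population_change(births=1200, in_migration=300,
                          deaths=900, out_migration=450)
print(delta)  # 150
```

A positive value means the population grew over the period; a negative value means it shrank.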
5. Quant: How much? Where? When?
• Births
- Birth registry: India: ~75%, Kenya: ~65%, Liberia: ~25% (2017)
• Deaths
- “Global Burden of Disease” (Murray and Lopez, 1997):
“Medically certified information is available for less than 30% of
the estimated 50.5 million deaths that occur each year
worldwide.”
• Migration
- “The size of the irregular migrant stock of the EU-27 in 2008
was measured to be between 1.9 and 3.8 million, a decline from
between 2.4 and 5.4 million in the EU-25 in 2005” (Kovacheva
and Vogel, 2009).
6. Qual: Why? How?
• Births
- Effect of religiosity, available childcare, …
• Deaths
- Ikigai: “reason to get up in the morning”
• Migration
- Push/pull factors, assimilation, …
7. Opportunities for New Methods
• Filling data gaps
– New data on migration, fertility, employment, …
• Explaining behavior
– Richer data, including networks and long-term history
• Predictive modeling
– Multi-modal forecasting
• Take a global perspective on things
– Facebook, Google, satellites know (almost) no borders
Goal is to augment, not replace, traditional approaches
Big Data is not a panacea
8. Rest of the Talk: Data-Centric
• Online advertising audience estimates
- Migration stocks, migrant assimilation
- Male mean-age-at-childbirth
- Ethics, limitations and challenges
• More non-obvious data sources
- Google Correlate, Followerwonk
- Even more non-obvious data sources
• Thoughts on interdisciplinary work
19. Bias Reduction via Model-Fitting
Mean out-of-sample absolute percentage error 37%,
down from 56% without origin-age bias correction
Adjusted R^2 = .70
Does not use GDP, language, internet penetration, …
z = age-gender group
i = country of birth
j = US state of residence
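For readers unfamiliar with the error metric quoted on this slide, here is a minimal sketch of a mean absolute percentage error. It assumes the textbook definition (absolute error divided by the true value, averaged, in percent); the slide does not spell out the exact formula used, and the numbers below are invented.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Mean of |predicted - actual| / |actual|, expressed in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Invented held-out ("out-of-sample") counts and model predictions.
held_out_truth = [100, 200, 400]
model_predictions = [110, 150, 400]
print(round(mean_absolute_percentage_error(held_out_truth, model_predictions), 1))
```

"Out-of-sample" means the error is computed only on observations that were not used to fit the model.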
21. Do Refugees Share German Interests?
What interests to consider? Everybody likes “Music” and “Technology”.
How to interpret the score? High/low compared to European migrants?
Germans in DEU
FB Interests:
Football (90%)
Max Planck (70%)
Sauerkraut (40%)
…
Arabs in MENA
FB Interests:
Quran (80%)
Ibn Al-Haytham (60%)
Falafel (60%)
…
Arabs in DEU
FB Interests:
?
22. Obtaining an Assimilation Score
Migrant group                 Assimilation score
Austrian migrants             .900
Spanish migrants              .864
French migrants               .803
Turkish-speaking migrants     .746
Arabic-speaking migrants      .643
A: Women, non-uni, 45-64      .461
A: Men, uni, 18-24            .677
• Experimental methodology: take with a ton, not just a grain of salt
• Needs to be validated externally
• Goals include finding “bridging” interests/patterns
31. Ethical Challenges
• Privacy
– Was possible to obtain PII until early 2018 [Venkatadri
et al., 2018]
– Audience estimates for “custom audiences” no longer
supported
– The k in k-anonymity has been increased
• Vulnerable populations
– Was possible to exclude minorities from ads
– Was possible to target based on likely diseases
– Still targetable through proxy interests
We only use aggregate, anonymous data without
interacting with any user
32. Limitations: Selection Bias
Aren’t you just studying FB/LI/… vs. the “real
world”?
• If we understand the selection bias, we can
model it and de-bias the estimates
– Non-response biases in surveys
– Usual signal in a prediction model
– Non-random fake/duplicate accounts could
become problematic depending on domain
• Even if “only” LI, still real world implications
– LI used for hiring and to find keynote speakers
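One simple way to "model and de-bias" is to divide a platform count by the platform's penetration rate in that group. The sketch below illustrates only this idea; it is an assumption for illustration, not the correction used in the work presented here, and all group names, counts and rates are invented.

```python
def debias_estimate(platform_count, penetration_rate):
    """Scale a platform user count up by the (assumed known) penetration rate."""
    if not 0 < penetration_rate <= 1:
        raise ValueError("penetration_rate must be in (0, 1]")
    return platform_count / penetration_rate


# Invented example: (users seen on the platform, share of the group on the platform).
groups = {
    "age 18-24": (8000, 0.80),
    "age 65+": (1000, 0.25),
}

corrected = {name: debias_estimate(count, rate)
             for name, (count, rate) in groups.items()}
total = sum(corrected.values())
print(corrected, total)
```

The raw counts suggest the older group is one ninth of the total; after correction it is well over a quarter, which is why the penetration rates need to be understood per group rather than overall.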
33. Limitations: Black Box
Who knows how FB’s classifier labels “expats” or
SC’s classifier labels “math enthusiasts”?
• Use as signal, not as ground truth
– Empirically, highly predictive of “proper” definition
– Unified definition can be a plus
• Incentives are in the right place
– Companies try to provide value to advertisers and,
hence, are incentivized to have correct labels
• Inconsistencies over time problematic
– In March 2019 FB changed its “expat” classifier
34. Limitations: No Longitudinal Data
None of the services provide information on
running a hypothetical ad campaign in the past
• No historical data sets of audience estimates
exist
– Hard to do causal inference (natural experiments)
• Similar to Twitter streaming API
– The best time to start collecting data was 20 years
ago. The second best time is today.
35. Limitations: What about Myspace?
Services come and go and FB et al. might
become obsolete
• Only useful for understanding and modeling
processes with current relevance
Usage patterns change over time
• FB of 2009 unlike FB of 2019.
• Users might become more privacy concerned.
• Re-validate and re-train your model over time.
37. Google Correlate and Fertility
Discover search terms correlated with different fertility rates across US states
https://www.google.com/trends/correlate/search?e=id:f7PU4mFDWV-&t=all
Remove terms with no conceivable link to sex, pregnancy or maternity
38. Predicting Spatial Variability
• Performance of the regression models using
leave-one-out cross-validation. SMAPE is in [%], RMSE
values are multiplied by 1,000.
Use the previous terms to build
models predicting state-level fertility
rates
All these models make predictions
based on linear combinations of
search intensity
Goal: apply these spatial models
across time
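The evaluation setup on this slide (linear model, leave-one-out cross-validation, SMAPE and RMSE) can be sketched as follows. This is a minimal illustration using NumPy and ordinary least squares. The SMAPE formula is one common variant and may differ from the one behind the reported numbers, and the data are synthetic, not search or fertility data.

```python
import numpy as np


def smape(actual, predicted):
    """Symmetric MAPE in percent (one common definition; variants exist)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2.0
    return float(100.0 * np.mean(np.abs(predicted - actual) / denom))


def rmse(actual, predicted):
    """Root mean squared error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


def leave_one_out_predictions(X, y):
    """Predict each unit with a linear model fitted on all other units."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add an intercept column
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i              # hold out unit i
        coef, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
        preds[i] = design[i] @ coef
    return preds


# Synthetic stand-in: 20 "states", 2 search-intensity features, noise-free target.
rng = np.random.default_rng(0)
X = rng.random((20, 2))
y = 0.05 + 0.02 * X[:, 0] - 0.01 * X[:, 1]

preds = leave_one_out_predictions(X, y)
print(smape(y, preds), rmse(y, preds) * 1000)  # RMSE scaled by 1,000 as on the slide
```

Because each state is predicted by a model that never saw it, the errors approximate how well the search terms generalize to an unseen state.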
39. Learning Across Space, Predicting Across Time
• Temporal trend when applying the “teen” model
across time. Values are rescaled to a maximum of 1.0.
Pearson r correlation across 2010-2015 when using
the spatial model to predict trends across time.
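The two operations named on this slide, rescaling a series to a maximum of 1.0 and computing Pearson's r between predicted and observed trends, can be sketched as below. The yearly values are invented placeholders, not the series shown in the lecture.

```python
import numpy as np


def rescale_to_max_one(values):
    """Divide a series by its maximum so that the largest value becomes 1.0."""
    values = np.asarray(values, dtype=float)
    return values / values.max()


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    return float(np.corrcoef(x, y)[0, 1])


years = [2010, 2011, 2012, 2013, 2014, 2015]
predicted = [50.0, 46.0, 41.0, 39.0, 33.0, 30.0]   # invented model output
observed = [8.0, 7.1, 6.9, 6.0, 5.5, 4.8]          # invented reference series

pred_scaled = rescale_to_max_one(predicted)
obs_scaled = rescale_to_max_one(observed)
print(dict(zip(years, pred_scaled.round(2))))
print(pearson_r(pred_scaled, obs_scaled))
```

Rescaling does not change the correlation; it only puts two series with different units on the same plot.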
40. Followerwonk and Gender Roles
                      (mother|mom) of …   (father|dad) of …   Total
… (girls|daughters)   1,257               303                 1,560
… (sons|boys)         941                 545                 1,486
Total                 2,198               848
Location: (us|usa|united states)
https://followerwonk.com/bio/?q=(father|dad)%20of%20(sons|boys)&l=(us|usa|united%20states)
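The counts on this slide can be turned into shares and an odds ratio with a few lines of Python. The sketch also rebuilds the query URL pattern shown above. The helper names are hypothetical, and whether the service still accepts such URLs is not checked here (no request is made).

```python
from urllib.parse import quote

# Bio counts copied from the slide: (parent term, child term) -> number of bios.
counts = {
    ("mother", "daughters"): 1257,
    ("mother", "sons"): 941,
    ("father", "daughters"): 303,
    ("father", "sons"): 545,
}


def share_mentioning_sons(parent):
    """Share of a parent group's bios that mention sons rather than daughters."""
    sons = counts[(parent, "sons")]
    return sons / (sons + counts[(parent, "daughters")])


def odds_ratio_sons_father_vs_mother():
    """Odds of mentioning sons for fathers, relative to the same odds for mothers."""
    father_odds = counts[("father", "sons")] / counts[("father", "daughters")]
    mother_odds = counts[("mother", "sons")] / counts[("mother", "daughters")]
    return father_odds / mother_odds


def followerwonk_bio_url(query, location):
    """Build a bio-search URL following the pattern of the URL on the slide."""
    return ("https://followerwonk.com/bio/?q=" + quote(query, safe="()|")
            + "&l=" + quote(location, safe="()|"))


print(round(share_mentioning_sons("mother"), 3))
print(round(share_mentioning_sons("father"), 3))
print(round(odds_ratio_sons_father_vs_mother(), 2))
print(followerwonk_bio_url("(father|dad) of (sons|boys)", "(us|usa|united states)"))
```

In these counts, fathers' bios mention sons more often than daughters, while mothers' bios show the opposite pattern. How much of that reflects gender roles rather than who writes such bios is a separate question.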
41. More Creative Data Sources
Online genealogy
- see how marriage mobility has changed
Online obituaries
- monitor patients discharged from hospital
Google Street View
- parked cars tell income and political orientation
https://sites.google.com/site/digitaldemography/