Big Data? Big Issues: Degradation in Longitudinal Data and Implications for Social Sciences
1. Matthew S. Weber
Hai Nguyen
Rutgers University
WebSci 2015
Oxford, UK
2. BIG DATA, BIG ISSUES
3. 3
Dataset | Research potential | Dates | Captures | Unique URLs
Hurricane Katrina | Online networks and organizational resilience in the wake of disasters (Chewning, Lai, & Doerfel, 2012; Perry, Taylor, & Doerfel, 2003); information dissemination | 2003 – 2012 | 1,694,236 | 663,740
Superstorm Sandy | (as above) | 2003 – 2012 | 41,703,112 | 20,013,455
US Senate | Study the growth of political activity in online environments (Adamic & Glance, 2005; Bruns, 2007; Chang & Park, 2012); polarization & media discourse | 109th – 112th Congresses | 26,965,770 | 8,674,397
US House | (as above) | 109th – 112th Congresses | 51,840,777 | 12,410,014
Occupy Wall Street | Previous research on NGOs in the online environment (Bach & Stark, 2004; Shumate, 2003, 2012; Shumate, Fulk, & Monge, 2005); use of hyperlink data to study the formation and role of alliances between SMOs | 2010 – 2012 | 247,928,272 | 113,259,655
US Media | Previous studies of news media organizations (Greer & Mensing, 2006; Weber, 2012; Weber & Monge, in press); focus on evolutionary patterns | 2008 – 2012 | 1,315,132,555 | 539,184,823
17. • Scale out across multiple datasets (see the sketch below):
– US House: 2005 – 2013
– US Senate: 2005 – 2013
– Hurricane Katrina: 2003 – 2012
– Occupy Wall Street: 2010 – 2012
17
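As a hedged illustration of what "scaling out" can look like in practice (the dataset table and the analyze_window function below are hypothetical stand-ins, not the project's actual pipeline): the same per-window analysis is driven from a single mapping of datasets to date ranges.

```python
# Minimal sketch: run one analysis pipeline across several archived
# datasets, each with its own date range. `analyze_window` is a
# hypothetical placeholder for whatever per-window computation is run.
DATASETS = {
    "US House":           (2005, 2013),
    "US Senate":          (2005, 2013),
    "Hurricane Katrina":  (2003, 2012),
    "Occupy Wall Street": (2010, 2012),
}

def analyze_window(dataset, year):
    """Placeholder for the per-window computation (e.g., URL counts)."""
    print(f"analyzing {dataset}: {year}")

for name, (start, end) in DATASETS.items():
    for year in range(start, end + 1):
        analyze_window(name, year)
```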
18. [Figure: Potential vs. Actual URLs. Count of URLs (0 to 3,000,000) per analysis period t (0 to 30); series: Potential, Actual, Difference.]
18
19. 19
[Figure: Changes in Crawl Completeness. Count of URLs (0 to 6,000,000) per analysis period t, for OWS, House, Senate, and Katrina; series: existing vs. potential URLs.]
b = the degradation rate per analysis period.
To estimate b, set a unit of time for analysis by choosing n periods across a total time T (sketched below).
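A minimal sketch of this bucketing step, assuming capture records have already been extracted as (timestamp, url, archived) tuples; the record format and function name are illustrative, not the project's actual code:

```python
# Minimal sketch: bucket capture records into n equal periods across a
# total time span [start, end) and tally existing vs. potential URLs.
from collections import Counter

def count_by_period(captures, start, end, n_periods):
    """Count existing vs. potential URLs per analysis period."""
    period_len = (end - start) / n_periods   # length of one period
    existing = Counter()    # URLs actually present in the archive
    potential = Counter()   # all URLs referenced by the crawl
    for ts, _url, archived in captures:
        if not (start <= ts < end):
            continue        # outside the analysis window
        i = min(int((ts - start) / period_len), n_periods - 1)
        potential[i] += 1
        if archived:        # e.g., a capture with a stored body
            existing[i] += 1
    return existing, potential
```

With datetime timestamps the same code works unchanged, since dividing one timedelta by another yields a float; with 3-month windows (per the editor's notes), n_periods would be four per year of the total span T.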
20. In the ideal case, it would be possible to create a factor that corrects
for data degradation: e^(bt)
How does this help?
Each of the illustrated cases fits an exponential function e^(bt) with rate b:
• Senate: 0.13
• House: 0.13
• Katrina: 0.02
• OWS: 0.10
20
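One way to carry out such a fit, as a sketch with made-up counts (the real analysis would use the per-period URL tallies above): scipy's standard curve_fit routine estimates b, and dividing out the fitted trend applies the e^(bt)-style correction factor from the slide.

```python
# Minimal sketch: estimate the degradation rate b by fitting counts per
# period to a * e^(b t), then divide out the fitted trend as a correction
# factor. The counts below are made up for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a, b):
    """Exponential trend: a * e^(b t)."""
    return a * np.exp(b * t)

t = np.arange(10, dtype=float)           # period index (e.g., 3-month windows)
counts = np.array([9.1e5, 8.0e5, 7.2e5, 6.3e5, 5.6e5,
                   5.0e5, 4.4e5, 3.9e5, 3.5e5, 3.1e5])

(a_hat, b_hat), _ = curve_fit(exp_model, t, counts, p0=(counts[0], -0.1))
print(f"fitted rate b = {b_hat:.3f}")    # negative => decay over time

# Correct each observed count back toward its t = 0 level by dividing
# out the fitted exponential trend (i.e., multiplying by e^(-b t)).
corrected = counts / np.exp(b_hat * t)
```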
24. Lessons Learned
• Degradation is a factor in working with available large-scale data
– In part, degradation is related to the provenance of the data
– In turn, there is a need to record the origins of datasets (provenance)
• Patterns of degradation prove problematic for statistical analyses
– e.g., network analysis with a snowball sample vs. the whole network (see the sketch after this slide)
• Continued work is needed to develop research guidelines as more
scholars engage with these data
24
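To make the snowball-vs-whole-network caveat concrete, here is a small synthetic illustration (not the study's method): randomly deleting 30% of nodes, as archive degradation might, visibly shifts basic network statistics.

```python
# Synthetic illustration: the same network statistics computed on a
# whole network vs. a degraded one can diverge.
import random
import networkx as nx

G = nx.barabasi_albert_graph(1000, 3, seed=42)   # stand-in "whole network"

# Simulate archive degradation: 30% of pages (nodes) missing at random.
rng = random.Random(42)
missing = rng.sample(list(G.nodes), k=int(0.3 * G.number_of_nodes()))
H = G.copy()
H.remove_nodes_from(missing)

for label, g in (("whole", G), ("degraded", H)):
    mean_deg = sum(d for _, d in g.degree()) / g.number_of_nodes()
    print(f"{label}: density={nx.density(g):.5f}, mean degree={mean_deg:.2f}")
```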
25. Get in contact with us:
– matthew.weber@rutgers.edu
– @mediareinvented
The Team
– Kris Carpenter, Vinay Goel, Internet Archive
– David Lazer, Katherine Ognyanova, Northeastern University
– Allie Kosterich, Hai Nguyen, Rutgers University
Research supported by NSF Award #1244727 and the NetSCI Lab @ Rutgers
Editor's Notes
There are many types of large-scale data… we are only talking about Internet-based data… focusing on datasets that are re-used.
- Markus - “social scientists are used to fine-grain, well-controlled data, and that doesn’t exist on the web”
20th Century Collection = 9TB of metadata
Media Seed List = 4,891
For instance, researchers have proposed focusing archival efforts on capturing data that changes the most frequently, in order to capture the majority of new content [36]. Elsewhere, researchers have suggested that crawling strategies should prioritize archival efforts based on the size and relative position of websites within their larger ecosystems [37].
For instance, Driscoll and Walker (2014) compared Twitter data collected via a public API with data collected from a “fire hose” provided by GNIP PowerTrack and found significant differences between the two datasets. In most cases the PowerTrack data proved to be more powerful.
3-month windows of time…
We also looked at the size of the webpages and tried estimating size out over time; that wasn’t as reliable.