Methodological principles in dealing with Big Data, Reijo Sund
1. Methodological Principles in Dealing with Big Data
Reijo Sund
University of Helsinki, Centre for Research Methods, Faculty of Social Sciences
Big Data seminar
Statistics Finland, Helsinki 2.6.2014
1 June 2014
2. Big Data
Data have been produced for hundreds of years
The reasons for such production were originally
administrative in nature
There was a need for systematically collected numerical facts on a
particular subject
Advances in information technology have made it possible to
more effectively collect and store larger and larger data sets
3. From data to information
As long as there have been data, there has been the challenge of
transforming them into useful information
Too much data in an unusable form has always been a common complaint
Well-known hierarchy:
Data - Information - Knowledge - Wisdom - Intelligence
4. Secondary data
There are more and more “big data”, but the emphasis has
been on technical aspects rather than on the information itself
Data without explanations are useless
Big Data are often secondary data
Not tailored to the specific research question at hand
More (detailed) data would not solve the basic problems
More background information is required for utilization
5. Fundamental problem
The belief that big data consist of autonomous, atom-like
building blocks is fundamentally erroneous
Raw register data as such are of little value
No simple magic tricks to overcome problems arising from
the fundamental limitations of empirical research
More general aspects of scientific research are needed in order to
understand the related methodological challenges
6. Knowledge discovery process
The process consists of several main phases:
Understanding the phenomenon, Understanding the problem,
Understanding data, Data preprocessing, Modeling,
Evaluation, Reporting
The main difference from the “traditional” research process is
the additional interpretation-operationalization phase
[Figure: research process diagram with the elements Context, Debate, Idea, Theory, Problem, Data, Analysis, Question, Answer, Perspective]
7. Prerequisites
Effective use of big data presumes skills in various areas:
Measurement
Data modeling (information sciences)
Statistical computing (statistics)
Theory of the subject matter
8. Principles of measurement
Reality can be confronted by recording observations that
reflect the phenomenon of interest
Measurement aims to create data as symbolic
representations of the observations
Operationalization determines how the phenomenon P, which becomes
visible via observations O, is mapped to data D
It is successful if valid interpretations I of the symbolic data D
can be made with regard to the phenomenon P
9. Infological equation
Information is something that has to be produced from the
data and the pre-knowledge
Infological equation:
I = i(D,S,t)
Information I is produced from the data D and the pre-knowledge S
(at time t using the interpretation process i)
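The equation can be illustrated with a toy sketch in which the same data D yield information only when suitable pre-knowledge S is present. The function name `interpret`, the example record and the code table are hypothetical and not from the seminar; the time t is only passed through to the result here.

```python
from typing import Optional

def interpret(data: dict, pre_knowledge: dict, t: int) -> Optional[dict]:
    """Toy interpretation process i: returns information I, or None.

    data          -- D, a raw symbolic record
    pre_knowledge -- S, here a code table known to the interpreter
    t             -- time of interpretation (only recorded in this sketch)
    """
    meaning = pre_knowledge.get(data["code"])
    if meaning is None:
        return None  # without pre-knowledge the symbol stays uninterpreted
    return {"meaning": meaning, "interpreted_at": t}

record = {"code": "S72.0"}                            # D (hypothetical record)
novice = {}                                           # S without the code table
expert = {"S72.0": "fracture of the neck of femur"}   # S with ICD-10 knowledge

info_novice = interpret(record, novice, t=2014)
info_expert = interpret(record, expert, t=2014)
print(info_novice)
print(info_expert)
```

The data are identical in both calls; only the pre-knowledge differs, and with it the information produced.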
10. Data modeling
Data modeling can be used to construct (computer-based) symbol structures which
capture the meaning of data and organize it in ways that make it understandable
Only what is (or can be) represented is considered to exist
[Figure: data modeling diagram. Phenomenon ⇒ Concept ⇒ Object; labels: Host, Attributes, Time, Place, Realized observation; Data component, Knowledge component, Logical component; Taxonomy, Partonomy, Theoretical measurement properties]
11. Data preprocessing
Data cleaning and reduction
Correction of “global” deficiencies in the data
Dropping of “uninteresting” data
Data abstraction
“Intelligent enrichment” of data using background knowledge
This kind of preprocessing resembles qualitative analysis much more
than quantitative analysis
Each rule reflects the instability of the concept and is a step further from
the "objectivity" of the study
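The three preprocessing steps above could be sketched as follows. This is a minimal illustration under stated assumptions: the rows, the function names and the choice of ICD-10 codes S72.0 to S72.2 as the definition of a hip fracture are all illustrative and not taken from the seminar.

```python
from datetime import date

# Background knowledge used for abstraction (assumed definition of hip fracture)
HIP_FRACTURE_TYPES = {
    "S72.0": "femoral neck",
    "S72.1": "pertrochanteric",
    "S72.2": "subtrochanteric",
}

# Hypothetical raw register rows: (person_id, admission_date, diagnosis_code)
raw_rows = [
    ("A", "2001-01-15", "S72.0"),
    ("A", "2001-01-15", "S72.0"),   # exact duplicate
    ("B", "2001-02-30", "S72.1"),   # impossible date
    ("C", "2001-03-10", "J18.9"),   # not a hip fracture
    ("D", "2001-04-02", "s72.2 "),  # inconsistent formatting
]

def clean(rows):
    """Cleaning: normalize codes, drop impossible dates and duplicates."""
    seen, cleaned = set(), []
    for person, day, code in rows:
        code = code.strip().upper()
        try:
            date.fromisoformat(day)
        except ValueError:
            continue  # a "global" deficiency: the date cannot exist
        key = (person, day, code)
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned

def reduce_rows(rows):
    """Reduction: drop rows that are "uninteresting" for this question."""
    return [row for row in rows if row[2] in HIP_FRACTURE_TYPES]

def abstract(rows):
    """Abstraction: enrich rows with a concept derived from the code."""
    return [(person, day, HIP_FRACTURE_TYPES[code]) for person, day, code in rows]

prepared = abstract(reduce_rows(clean(raw_rows)))
print(prepared)
```

Every function encodes a decision (what counts as invalid, uninteresting, or the same concept), which is why each rule moves the result one step away from the raw observations.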
12. Preprocessing in practice
Need for a conceptual representation of each object
Two main classes of concept-data relation:
Factual = minimal background knowledge
Abstracted = cognitive fit acceptable
Sophisticated (and subjective) preprocessing that aims to
scale matters down to a size more suitable for specific
analyses is the most important and time-consuming part of
(big) data analysis
13. Greater statistics
Statistics offers not only a set of tools for problem solving,
but also a formal way of thinking about the modeling of the
actual problem
Rather than trying to squeeze the data into a predefined
model or saying too much about what can and cannot be done,
data analysis should work to achieve an appropriate
compromise between the practical problems and the data
14. Challenges
How to analyze massive data effectively when manual
management is unfeasible?
How to avoid ‘snooping/dredging/fishing/shopping’ without
assuming that data are automatically in concordance with the
theory?
How to deal with data that cover total populations, for which
sampling error and statistical significance lose their
traditional meaning?
15. Thank you!
For more information:
http://www.helsinki.fi/~sund
16. How to calculate the annual number of hip fractures in Finland?
Background knowledge: All hip fractures are recorded in the Hospital
Discharge Register
Data challenge: It is difficult to separate new admissions from the care of
old fractures
Change of theory: Consider only first hip fractures instead of all hip
fractures
Solution in terms of data: The number of first hip fractures is easy to
determine from the register if enough old data are available and
deterministic record linkage can be used
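A minimal sketch of this solution, assuming that deterministic linkage means matching admissions on an exact person identifier. The records, the function name `first_fractures_per_year` and the parameter `first_countable_year` are hypothetical; earlier years serve only as a look-back period for recognizing old fractures.

```python
from collections import Counter

# Hypothetical hip fracture admissions: (person_id, admission_year)
admissions = [
    ("A", 1995), ("A", 1999),               # 1999 is not A's first fracture admission
    ("B", 1999),
    ("C", 1998), ("C", 1999), ("C", 2000),  # repeated care of the same person
    ("D", 2000),
]

def first_fractures_per_year(records, first_countable_year):
    """Count persons by the year of their first hip fracture admission.

    Linking on person_id finds each person's earliest admission.
    Years before first_countable_year are used only as look-back data.
    """
    first_year = {}
    for person, year in records:
        if person not in first_year or year < first_year[person]:
            first_year[person] = year
    return Counter(
        year for year in first_year.values() if year >= first_countable_year
    )

counts = first_fractures_per_year(admissions, first_countable_year=1999)
print(sorted(counts.items()))
```

With too short a look-back period, person A's 1999 admission would be miscounted as a first fracture, which is why enough old data are needed.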
17. Are there more hip fractures during winter? How to define winter?
Based on the data, “Winter” is from November to April
[Figure: two monthly time-series panels, “Institutionalized” and “Over 50 years old”; x-axis from 1/98 to 1/03, y-axis from 0 to 20]
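One way a data-driven definition of winter could be derived is to look for the six consecutive months, wrapping around the year end, with the largest total count. This is only a sketch of that idea, not necessarily the method used in the seminar; the function names are hypothetical and the monthly counts are invented for illustration.

```python
def peak_window_start(monthly_counts, length=6):
    """Return the start month (1-12) of the `length` consecutive months,
    wrapping around the year end, with the largest total count."""
    best_start, best_total = None, -1
    for start in range(12):
        total = sum(monthly_counts[(start + k) % 12] for k in range(length))
        if total > best_total:
            best_start, best_total = start + 1, total
    return best_start

def window_months(start_month, length=6):
    """List the month numbers covered by a window starting at start_month."""
    return [(start_month - 1 + k) % 12 + 1 for k in range(length)]

# Invented counts for January..December (not the seminar's data)
invented_counts = [14, 13, 12, 11, 8, 7, 7, 8, 9, 10, 12, 14]

start = peak_window_start(invented_counts)
print(start, window_months(start))
```

The window length of six months is itself an analyst's choice, so the resulting definition remains an abstraction made from the data rather than a fact found in them.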
18. Data abstracted outcomes
Commonly used outcomes for measuring the effectiveness of (hip
fracture) surgery are death and complications
These are medical concepts, but must be abstracted from
individual level register-based data by using some ‘rules’,
such as a list of some particular diagnosis codes recorded in
the data
19. Stable and complex outcomes
It is typically straightforward to extract the event of
death from the data by using a “one-line rule”
Extraction of complications may require tens of
different rules, which are justified by domain
knowledge and evaluated against concrete data until
a saturation point is reached
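The contrast between the two kinds of outcome can be sketched as follows. The patient record, the field names and the three complication rules are hypothetical; a real rule set would be far longer and would be refined against the data with domain experts.

```python
# Hypothetical individual-level record abstracted from register data
patient = {
    "death_date": None,
    "diagnoses": ["S72.0", "T81.4"],
}

def died(p):
    """Stable outcome: a "one-line rule" is enough."""
    return p["death_date"] is not None

def has_code(p, prefix):
    """True if any recorded diagnosis code starts with the given prefix."""
    return any(code.startswith(prefix) for code in p["diagnoses"])

# Complex outcome: a growing list of (label, rule) pairs. The ICD-10
# prefixes below are illustrative choices, not the author's rule set.
COMPLICATION_RULES = [
    ("infection following a procedure", lambda p: has_code(p, "T81.4")),
    ("implant-related complication", lambda p: has_code(p, "T84")),
    ("pulmonary embolism", lambda p: has_code(p, "I26")),
]

def complications(p):
    """Return the labels of all complication rules that fire for a patient."""
    return [label for label, rule in COMPLICATION_RULES if rule(p)]

print(died(patient), complications(patient))
```

Keeping the rules in a single list makes it easy to add a rule, rerun it on concrete data and check whether any new cases appear, which is one way to judge when saturation has been reached.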