P. Struijs, Toward the Use of Big Data for European Statistics
1. Towards the Use of Big Data
for European Statistics
Peter Struijs
Statistics Netherlands
2. 1
Scheveningen Memorandum on Big Data
Examine the potential of Big Data sources for official
statistics
Official Statistics Big Data strategy as part of wider
government strategy
Address privacy and data protection
Collaboration at European and global level
Address need for skills
Partnerships between different stakeholders (government,
academics, private sector)
Developments in Methodology, quality assessment and IT
Adopt action plan and roadmap for the European Statistical
System
3. 2
Envisaged Benefits of Using Big Data for Official Statistics
Faster data production
More detail, e.g. geographic breakdown, frequency
More data
More flexible response to user needs
Increased efficiency
Stay relevant
4. 3
The ESSnet Big Data
Framework Partnership Agreement: 22 partners
Two Specific Grant Agreements:
SGA-1: February 2016 – July 2017, 1.0 M€
SGA-2: January 2017 – May 2018, 1.0 M€
6. 5
ESSnet Big Data: Pilots
List of pilot projects
Web scraping (2 work packages)
job vacancies ; enterprise characteristics
Smart meters
electricity consumption ; temporary vacant dwellings
Automatic Identification System (AIS)
vessel identification data
Mobile phone data
preparing for access to data
Early estimates
various domains
Multiple domains
population, tourism / border crossing, agriculture
7. 6
Subdivision of Pilots into Phases
1. Data access
Conditions; partnerships
2. Data handling
Production criteria; micro versus aggregated data;
visualisation
3. Methodology and technology
Methodology for long-lasting statistics; process design
4. Statistical output
Examples of existing and new outputs; potential users;
comparison with current estimates (quality,
timeliness, level of detail)
5. Future perspectives
Applicability in ESS; future production process;
exploration of further possibilities of using and
combining (big) data sources
8. 7
WP 1: Webscraping / Job Vacancies
WP leader: UK
Partners: Belgium, Denmark, France, Germany,
Greece, Italy, Portugal, Sweden, Slovenia
• Data access: job portals
• Data handling: legal and technical aspects, test webscraping
• Methodology for output production: from semi-structured
to structured data
• Future perspectives: webscraping enterprise websites,
methodology for future production, explore new products
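The step from semi-structured to structured data can be illustrated with a minimal Python sketch. It assumes adverts carry labelled lines such as "Title:" and "Employer:"; these labels, the JobVacancy record and the parse_job_ad function are hypothetical and not taken from the work package.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical field labels; real job portals differ and need their own rules.
FIELD_PATTERN = re.compile(
    r"^(Title|Employer|Location|Posted)\s*:\s*(.+)$",
    re.IGNORECASE | re.MULTILINE,
)


@dataclass
class JobVacancy:
    """Structured record for one advert; fields not found stay None."""
    title: Optional[str] = None
    employer: Optional[str] = None
    location: Optional[str] = None
    posted: Optional[str] = None


def parse_job_ad(text: str) -> JobVacancy:
    """Turn a labelled free-text advert into a structured record."""
    vacancy = JobVacancy()
    for label, value in FIELD_PATTERN.findall(text):
        setattr(vacancy, label.lower(), value.strip())
    return vacancy


# Made-up advert text, for illustration only.
sample_ad = """Title: Data Analyst
Employer: Example Ltd
Location: Newport
We are looking for an analyst to join our team."""

record = parse_job_ad(sample_ad)
```

Lines without a known label, such as the free-text description, are ignored here; in practice they would probably need text mining rather than fixed patterns.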
10. 9
Model for Measuring Job Vacancies
Target Population: All job vacancies
Advertised on enterprise website
Advertised on a job portal
‘Ghost’ vacancies
Employing business is identifiable
Advertised through an agency
11. 10
Approach to Data Integration
Sources:
• Counts from online sources: Enterprises A, B, C, D, E
• Survey estimates: Enterprises A, B, C, F, G
• Business Register: Enterprises A to J
Matching: integrated data set (Enterprises A to J)
Scaling factors (by NACE?)
Treatment by source coverage:
1. Survey and online: scale online data to survey estimates
2. Online only: apply scaling factors to online data
3. Survey only: use survey estimates
4. Neither survey nor online: modelled estimates
Total = Survey Estimate
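The four cases on this slide (survey and online, online only, survey only, neither) can be sketched in Python as follows. This is one possible reading, not the work package's method: it assumes a scaling factor per NACE class equal to survey total divided by online total over the enterprises that have both sources, a factor of 1.0 where a class has no such enterprises, and the class mean as a stand-in for the modelled estimate. The Enterprise record, the NACE codes and all figures are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Enterprise:
    name: str
    nace: str                       # activity class used for scaling
    online: Optional[float] = None  # vacancy count from online sources
    survey: Optional[float] = None  # vacancy estimate from the survey


def scaling_factors(units: List[Enterprise]) -> Dict[str, float]:
    """Per class: survey total / online total over units with both sources."""
    online_tot: Dict[str, float] = {}
    survey_tot: Dict[str, float] = {}
    for u in units:
        if u.online is not None and u.survey is not None:
            online_tot[u.nace] = online_tot.get(u.nace, 0.0) + u.online
            survey_tot[u.nace] = survey_tot.get(u.nace, 0.0) + u.survey
    return {n: survey_tot[n] / online_tot[n]
            for n in online_tot if online_tot[n] > 0}


def integrate(units: List[Enterprise]) -> Dict[str, float]:
    factors = scaling_factors(units)
    estimates: Dict[str, float] = {}
    for u in units:
        factor = factors.get(u.nace, 1.0)  # assumption: no scaling without matches
        if u.online is not None:
            estimates[u.name] = factor * u.online  # cases 1 and 2
        elif u.survey is not None:
            estimates[u.name] = u.survey           # case 3
    # Case 4: stand-in "model" = mean of the known estimates in the same class.
    by_class: Dict[str, List[float]] = {}
    for u in units:
        if u.name in estimates:
            by_class.setdefault(u.nace, []).append(estimates[u.name])
    for u in units:
        if u.name not in estimates:
            known = by_class.get(u.nace, [])
            estimates[u.name] = sum(known) / len(known) if known else 0.0
    return estimates


register = [
    Enterprise("A", "J", online=8, survey=10),
    Enterprise("B", "J", online=4, survey=5),
    Enterprise("C", "N", online=10, survey=20),
    Enterprise("D", "J", online=4),
    Enterprise("E", "N", online=3),
    Enterprise("F", "J", survey=7),
    Enterprise("G", "N", survey=2),
    Enterprise("H", "J"),
    Enterprise("I", "N"),
    Enterprise("J", "N"),
]
estimates = integrate(register)
```

With these factors the scaled online counts of the matched enterprises add up to their survey total within each class, which is one way to keep the integrated figures consistent with the survey.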
12. 11
WP 2: Webscraping / Enterprise Characteristics
WP leader: Italy
Partners: Bulgaria, Netherlands, Poland, Sweden, UK
• Data access: inventory of target enterprises, URLs; legal
and privacy aspects
• Data handling: use cases; actual webscraping
• Testing of methods and techniques: proof of concept for
selected use cases; build and apply predictor for estimates
of enterprise characteristics
15. 14
WP 3: Smart Meters
WP leader: Estonia
Partners: Austria, Denmark, Italy, Portugal, Sweden
• Data access: availability of smart meters, legal aspects
• Data handling: coverage assessment, production of
cleaned datasets
• Methodology and techniques: linkage with administrative
data; methodology for electricity consumption of businesses
and households; also seasonally vacant living spaces
• Future perspectives: potential new products, feasibility of
using aggregated data
17. 16
Estonian Data Structure
Estonian data structure: 4 main
tables
Metering data – main table
with hourly consumptions
Metering points – location
Agreements – contract info
Customers – contract holder
information
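A minimal sketch of how four such tables might relate, using Python's built-in sqlite3 module. Only the four table names come from the slide; every column name, the keys linking the tables and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: agreements link a customer to a metering point.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, holder_type TEXT);
CREATE TABLE metering_points (point_id INTEGER PRIMARY KEY, municipality TEXT);
CREATE TABLE agreements (
    agreement_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    point_id INTEGER REFERENCES metering_points(point_id)
);
CREATE TABLE metering_data (
    point_id INTEGER REFERENCES metering_points(point_id),
    hour TEXT,
    kwh REAL
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "household"), (2, "business")])
conn.executemany("INSERT INTO metering_points VALUES (?, ?)",
                 [(10, "Tallinn"), (11, "Tartu")])
conn.executemany("INSERT INTO agreements VALUES (?, ?, ?)",
                 [(100, 1, 10), (101, 2, 11)])
conn.executemany("INSERT INTO metering_data VALUES (?, ?, ?)", [
    (10, "2016-01-01T00:00", 0.5),
    (10, "2016-01-01T01:00", 0.25),
    (11, "2016-01-01T00:00", 4.0),
])

# Total consumption by type of contract holder and location.
rows = conn.execute("""
SELECT c.holder_type, p.municipality, SUM(m.kwh) AS total_kwh
FROM metering_data m
JOIN metering_points p ON p.point_id = m.point_id
JOIN agreements a ON a.point_id = p.point_id
JOIN customers c ON c.customer_id = a.customer_id
GROUP BY c.holder_type, p.municipality
ORDER BY c.holder_type
""").fetchall()
```

The join shows why the main table alone is not enough: hourly readings only become consumption of businesses or households once they are linked to contract and location information.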
19. 18
WP 4: AIS Data
WP leader: Netherlands
Partners: Denmark, Greece, Norway, Poland
• Data access: data availability (in particular EMSA)
• Data handling: processing and storage, aimed at
linking with data from port authorities, traffic
analyses, journeys
• Methodology and techniques: for linking with data
from port authorities and traffic analyses; estimate
emissions
• Future perspectives: qualitative cost-benefit analysis
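As a small building block for journey and traffic analyses, the distance sailed can be approximated from successive AIS positions. The sketch below is an illustration only, not the work package's method: it assumes positions arrive as (timestamp, latitude, longitude) tuples, uses the standard haversine great-circle formula with a mean Earth radius of 6371 km, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_length_km(positions):
    """Sum the distances between consecutive positions, ordered by timestamp."""
    ordered = sorted(positions)  # tuples sort by timestamp first
    return sum(
        haversine_km(a[1], a[2], b[1], b[2])
        for a, b in zip(ordered, ordered[1:])
    )


# Made-up positions along the equator, deliberately out of order.
track = [(2, 0.0, 2.0), (0, 0.0, 0.0), (1, 0.0, 1.0)]
length = track_length_km(track)
```

Real AIS data would first need cleaning, since gaps and erroneous positions distort a simple sum of segment lengths.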
23. 22
WP 5: Mobile Phone Data
WP leader: Spain
Partners: Belgium, Finland, France, Germany, Italy,
Netherlands, Romania, UK
• Data access: data availability (workshop with mobile network operators, MNOs)
• Data handling: investigation of IT tools and aggregation
level needed
• Statistical outputs: describe a statistical output to be
presented to MNOs as the basis for a pilot
26. 25
WP 6: Early Estimates
WP leader: Slovenia
Partners: Finland, Italy, Netherlands, Poland, Portugal
• Data access: sources for consumer confidence index,
nowcasts of turnover and early estimates
• Data handling: technical requirements; deployment of
collection system
• Methodology and techniques: includes feasibility of linking
administrative and other existing sources
• Future perspectives: calculation of the consumer
confidence index and nowcasts of turnover; pilots for
combining sources for early estimates
27. 26
Domains of First Interest
• Tourism
• Population mobility
• Health statistics
• Agriculture
• Quick and dirty statistics (all domains)
• Economic indicators:
• GDP
• Consumer Price Index (CPI)
• Retail sales
• Balance of Payments (BoP)
• Economic sentiment indicators
29. 28
WP 7: Multi Domains
WP leader: Poland
Partners: Netherlands, Portugal, UK
• Data access: data availability (inventory, based on
questionnaire), aimed at three domains (populations,
tourism / border crossings, agriculture)
• Data feasibility: exploration of combining sources for
these domains
• Data combination: experiments
• Future perspectives: suggest pilots for 2018
31. 30
Model for Daily Life Satisfaction
[Flow diagram: Twitter data; data extracting (Tweepy); training dataset with labels and feature vectors; machine learning algorithm (Sklearn); predictive model; result set]
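The flow from labelled training data to a predictive model and a result set can be sketched as follows. This is a plain-Python stand-in, not the pilot's code: a small multinomial Naive Bayes classifier replaces the Sklearn step, the Tweepy extraction is replaced by hard-coded example texts, and the labels "satisfied" and "dissatisfied" are assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace; the tokens are the features."""
    return text.lower().split()


def train(texts, labels):
    """Fit a multinomial Naive Bayes model from labelled texts."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training dataset: texts with labels.
train_texts = [
    "great day with friends",
    "happy and relaxed today",
    "terrible traffic again",
    "sad and tired today",
]
train_labels = ["satisfied", "satisfied", "dissatisfied", "dissatisfied"]

model = train(train_texts, train_labels)
result_set = [predict(model, t) for t in ["happy day", "tired of traffic"]]
```

A realistic training dataset would need to be far larger, and the way labels are assigned largely determines what the resulting indicator measures.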
32. 31
WP 8: Methodology, Quality and IT
WP leader: Netherlands
Partners: Austria, Bulgaria, Italy, Poland,
Portugal, Slovenia
• Literature overview
• Quality of Big Data
• Big Data and IT
• Big Data methodology
33. 32
Main Aspects Identified
Quality
• coverage
• comparability
• processing errors
• chain control
• linkability
• measurement errors
• model errors; precision
IT
• metadata management
• processing life cycle
• format of processing
• datahub
• data source integration
• infrastructure
• secure and tested APIs
• shared libraries; standards
• data lakes
• training, skills and knowledge
• speed of algorithms
Methodology
• assessing accuracy
• final product definition
• spatial dimension
• changes in data sources
• machine learning
• data linkage
• multi-party computation
• inference
• sampling
• data process architecture
• unit identification
36. 35
Plans (1)
Implementation
Online job vacancies
Enterprise characteristics
Electricity and energy consumption
Waterways and environmental statistics
37. 36
Plans (2)
New pilot projects
Financial transactions data
Remote sensing
Mobile network operator data
Innovative sources and methods for tourism statistics
40. 39
Conclusions
Approach very successful
Increased ambitions for coming years
Implement results obtained so far
Start with trusted smart statistics
Challenges
Data access, privacy, methods, implementation, etc.
The ESS dimension
Support and commitment
High interest in participation
Commitment at all levels
Recognition of relevance