As the volume and complexity of data from myriad Earth-observing platforms, both remote sensing and in situ, increase, so does the demand for access to both the data and the information products derived from them. The audience is no longer restricted to an investigator team with specialist science credentials. Non-specialist users, from scientists in other disciplines and the science-literate public to teachers, the general public and decision makers, want access. What prevents them from accessing these resources? It is the very complexity of the data, together with specialist-developed data formats, data set organizations and specialist terminology. What can be done in response? We must shift the burden from the user to the data provider. To achieve this, our data infrastructures are likely to need greater internal code and data structure complexity in order to present (relatively) simpler interfaces to end users. Evidence from numerous technical and consumer markets supports this scenario. We will cover the elements of modern data environments, what the new use cases are and how we can respond to them.
Shifting the Burden from the User to the Data Provider
1. Shifting the Burden from the User
to the Data Provider
Peter Fox
High Altitude Observatory,
NCAR (***)
With thanks to eGY and various NSF, DoE and
NASA projects
2. Outline
• Background, definitions
• Informatics -> e-Science
• Data has lots of uses
– Virtual Observatories: use cases
– Data Framework: Examples
– Data ingest, integration, mining and …
• Discussion
Fox HDF: Semantic Data Burden Shift Oct 15, 2008
3. Background
Scientists should be able to access a global,
distributed knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple instruments, using
various protocols, in differing vocabularies, using
(sometimes unstated) assumptions, with
inconsistent (or non-existent) meta-data. It may be
inconsistent, incomplete, evolving, and distributed
And… there exist(ed) significant levels of semantic
heterogeneity, large-scale data, complex data
types, legacy systems, inflexible and unsustainable
implementation technology…
4. Information
Information has
But data
products have
Lots of Audiences
More Strategic
Less Strategic
SCIENTISTS TOO
From “Why EPO?”, a NASA internal
report on science education, 2005
5. The Information Era: Interoperability
Modern information and communications
technologies are creating an
“interoperable” information era in which
ready access to data and information can
be truly universal. Open access to data
and services enables us to meet the new
challenges of understanding the Earth and
its space environment as a complex
system:
• managing and accessing large data sets
• higher space/time resolution capabilities
• rapid response requirements
• data assimilation into models
• crossing disciplinary boundaries.
6. Shifting the Burden from the User
to the Provider
8. Mind the Gap!
As a result - finding out who is doing what, sharing experience/expertise, and substantial coordination:
• There is/was still a gap between science and the underlying infrastructure and technology that is available
• Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.
• Informatics (or information science) includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. (Wikipedia)
9. Progression after progression
[Diagram labels: Informatics; IT Cyberinfrastructure; Cyber Informatics; Core Informatics; Science Informatics, aka Xinformatics; Science, SBAs]
10. Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
– Related measurements
• Remote sensing: Virtual, integrative
measurements
– Data integration
• Managing virtual data products/ sets
11. Virtual Observatories
Make data and tools quickly and easily accessible
to a wide audience.
Operationally, virtual observatories need to find the
right balance of data/model holdings, portals and
client software that researchers can use without
effort or interference, as if all the materials were
available on their local computer in the
user’s preferred language: i.e. appear to be
local and integrated.
Likely to provide controlled vocabularies that may
be used for interoperation in appropriate
domains along with database interfaces for
access and storage and “smart” tools for
evolution and maintenance.
12. Early days of discipline specific VOs
[Diagram labels: ?; VO1, VO2, VO3; DB1, DB2, DB3, …, DBn]
13. The Astronomy approach; datatypes as a service
[Diagram labels: VO App1, VO App2, VO App3; VOTable; Simple Image Access Protocol; Simple Spectrum Access Protocol; Simple Time Access Protocol; VO layer; DB1, DB2, DB3, …, DBn; Under review]
• Limited interoperability
• Lightweight semantics
• Limited meaning, hard coded
• Limited extensibility
Open Geospatial Consortium: Web {Feature, Coverage, Mapping} Service and Sensor Web Enablement: Sensor {Observation, Planning, Analysis} Service use the same approach
14. Added value
[Diagram labels: Education, clearinghouses, other services, disciplines, etc.; Semantic mediation layer - mid-upper-level; VO Portal, VO API, Web Serv.; Semantic interoperability; Semantic query, hypothesis and inference; Semantic mediation layer - VSTO - low level; Metadata, schema, data; Query, access and use of data; DB1, DB2, DB3, …, DBn; "Added value" marked at several levels]
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.
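A minimal sketch of what a mediation layer does when it maps a query to underlying data and generates access requests. The concept name, archive names, variable names and URLs are all hypothetical placeholders, not the VSTO ontology or its services.

```python
from dataclasses import dataclass

# Hypothetical ontology fragment: one concept mapped to the variable
# name each (invented) archive uses for it.
PARAMETER_MAPPINGS = {
    "NeutralTemperature": [
        {"archive": "archive_a", "variable": "tn"},
        {"archive": "archive_b", "variable": "T_NEUTRAL"},
    ],
}

# Hypothetical access endpoints, one per archive.
ARCHIVE_ENDPOINTS = {
    "archive_a": "https://example.org/archive_a/data",
    "archive_b": "https://example.org/archive_b/data",
}


@dataclass(frozen=True)
class AccessRequest:
    archive: str
    url: str


def mediate(concept, start, end):
    """Turn one ontology-level parameter into per-archive access requests."""
    requests = []
    for mapping in PARAMETER_MAPPINGS.get(concept, []):
        base = ARCHIVE_ENDPOINTS[mapping["archive"]]
        url = f"{base}?var={mapping['variable']}&start={start}&end={end}"
        requests.append(AccessRequest(mapping["archive"], url))
    return requests


if __name__ == "__main__":
    for request in mediate("NeutralTemperature", "2005-01-26", "2005-01-27"):
        print(request.archive, request.url)
```

The user only names the concept; the archive-specific variable names stay inside the provider's mapping table, which is the burden shift the slide describes.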
15. Content: Coupling Energetics and Dynamics
of Atmospheric Regions WEB
Community data
archive for
observations and
models of
Earth's upper
atmosphere and
geophysical
indices and
parameters
needed to
interpret them.
Includes
browsing
capabilities by
periods, > 310
instruments,
models, > 820
parameters…
16. Content: Mauna Loa Solar Observatory
Near real-time data products
from Hawaii from
a variety of solar
instruments.
Source for space
weather, solar
variability, and
basic solar
physics
Other content used
too - Center for
Integrated Space
Weather Modeling
17. Semantic Web Methodology and
Technology Development Process
• Establish and improve a well-defined methodology vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Diagram labels: Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy]
18. Science and technical use cases
Find data which represents the state of the neutral
atmosphere anywhere above 100km and toward the
arctic circle (above 45N) at any time of high
geomagnetic activity.
– Extract information from the use-case - encode knowledge
– Translate this into a complete query for data - inference and
integration of data from instruments, indices and models
Provide semantically-enabled, smart data query services
via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere
Observatory that retrieve
data, filtered by constraints on Instrument, Date-Time,
and Parameter in any order and with constraints
included in any combination.
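The technical use case above can be sketched as a query function whose constraints are all optional, so they can be supplied in any order and in any combination. This is an illustrative in-memory sketch only: the catalog entries, class names and function names are assumptions, and no SOAP interface is modelled.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DataProduct:
    instrument: str
    parameter: str
    start: datetime
    end: datetime


# Hypothetical catalog entries, invented for illustration.
CATALOG = [
    DataProduct("FabryPerot", "NeutralTemperature",
                datetime(2005, 1, 1), datetime(2005, 1, 31)),
    DataProduct("FabryPerot", "NeutralWind",
                datetime(2005, 1, 1), datetime(2005, 1, 31)),
    DataProduct("IncoherentScatterRadar", "ElectronDensity",
                datetime(2005, 2, 1), datetime(2005, 2, 28)),
]


def query(catalog, instrument=None, parameter=None, when=None):
    """Return products matching every constraint that was supplied."""
    results = []
    for product in catalog:
        if instrument is not None and product.instrument != instrument:
            continue
        if parameter is not None and product.parameter != parameter:
            continue
        if when is not None and not (product.start <= when <= product.end):
            continue
        results.append(product)
    return results


def remaining_choices(catalog, **constraints):
    """List the instruments and parameters still selectable under the constraints."""
    matches = query(catalog, **constraints)
    return {
        "instrument": sorted({p.instrument for p in matches}),
        "parameter": sorted({p.parameter for p in matches}),
    }
```

Calling `remaining_choices(CATALOG, parameter="NeutralWind")` narrows the instrument list to the ones that actually measure that parameter, whichever constraint the user happened to set first.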
19. VSTO - semantics and ontologies in an operational
environment: vsto.hao.ucar.edu, www.vsto.org
Web Service
Fox RPI: Semantic Data Frameworks May 14, 2008
20. Semantic filtering by
domain or instrument
hierarchy
Partial exposure of
Instrument
class
hierarchy - users
seem to LIKE THIS
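Filtering by an instrument class hierarchy can be sketched as a walk up subclass links: choosing a general class selects every instrument whose class descends from it. The class names and instrument identifiers below are hypothetical and are not taken from the VSTO ontology.

```python
# Hypothetical child -> parent links in an instrument class hierarchy.
SUBCLASS_OF = {
    "OpticalInstrument": "Instrument",
    "Radar": "Instrument",
    "Interferometer": "OpticalInstrument",
    "FabryPerotInterferometer": "Interferometer",
    "Photometer": "OpticalInstrument",
    "IncoherentScatterRadar": "Radar",
}

# Hypothetical instrument instances and the class each belongs to.
INSTRUMENTS = {
    "fpi_a": "FabryPerotInterferometer",
    "phot_b": "Photometer",
    "isr_c": "IncoherentScatterRadar",
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or descends from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)  # None once the root is passed
    return False


def filter_by_class(instruments, ancestor):
    """Names of instruments whose class falls under the chosen class."""
    return [name for name, cls in instruments.items() if is_a(cls, ancestor)]
```

Exposing only part of the hierarchy then amounts to choosing which ancestor classes the interface offers as filter options.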
22. Inferred plot type
and return formats
for data products
23. Inferred plot type
and return required
axes data
24. Semantic Web Benefits
• Unified/abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of selections from eight to three
• Generates only syntactically correct queries: which could not always be ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to only expose coherent queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data whereas now semantic mediation provides the level of sensible data integration required, now exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis, transformations, etc.
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)
25. What is a Non-Specialist Use Case?
A teacher accesses the internet, goes
to an educational virtual
observatory and enters a
search for “Aurora”.
Someone
should be able
to query a
virtual
observatory
without having
specialist
knowledge
26. What should the User Receive?
Teacher receives four groupings of search results:
1) Educational materials:
http://www.meted.ucar.edu/topics_spacewx.php
and http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via VSTO, VSPO and
VITMO, knows to search for brightness, or green/red
line emission
3) Did you know?: Aurora is a phenomenon of the
upper terrestrial atmosphere (ionosphere) also
known as Northern Lights
4) Did you mean?: Aurora Borealis or Aurora
Australis, etc.
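The four result groupings can be sketched as a lookup against a small vocabulary entry. The entry contents are taken from this slide; the data structure, the field names and the `respond` function are assumptions made for illustration, not how any of the named observatories implement search.

```python
# A single hypothetical vocabulary entry, filled in from the slide text.
VOCABULARY = {
    "aurora": {
        "education": [
            "http://www.meted.ucar.edu/topics_spacewx.php",
            "http://www.meted.ucar.edu/hao/aurora/",
        ],
        "research_terms": ["brightness", "green/red line emission"],
        "definition": (
            "Aurora is a phenomenon of the upper terrestrial atmosphere "
            "(ionosphere) also known as Northern Lights"
        ),
        "narrower": ["Aurora Borealis", "Aurora Australis"],
    },
}


def respond(term):
    """Group what is known about a search term into the four result sections."""
    entry = VOCABULARY.get(term.strip().lower())
    if entry is None:
        return {}
    return {
        "Educational materials": entry["education"],
        "Research, data and tools": entry["research_terms"],
        "Did you know?": [entry["definition"]],
        "Did you mean?": entry["narrower"],
    }
```

The teacher types a plain word; the provider-side vocabulary supplies the specialist search terms, which is where the burden moves.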
29. Issues for Virtual Observatories
• Scaling to large numbers of data providers and redefining the role(s)/relations with them
• Crossing discipline boundaries
• Security, access to resources, policies
• Branding and attribution (where did this data come from and who gets the credit, is it the correct version, is this an authoritative source?)
• Provenance/derivation (propagating key information as it passes through a variety of services, copies of processing algorithms, …)
• Data quality, preservation, stewardship
These are currently burden areas for users
30. Problem definition
• Data is coming in faster, in greater volumes and outstripping our ability to perform adequate quality control
• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision
• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data ingest procedure and collect different metadata and organize it differently. It is then hard to use with previous projects
• The task of event determination and feature classification is onerous and we don't do it until after we get the data
31. Use cases
• Determine which flat field calibration was applied to the image taken on January 26, 2005 around 2100UT by the ACOS Mark IV polarimeter.
• Which flat-field algorithm was applied to the set of images taken during the period November 1, 2004 to February 28, 2005?
• How many different data product types can be generated from the ACOS CHIP instrument?
• What images comprised the flat field calibration image used on January 26, 2007 for all ACOS CHIP images?
• What processing steps were completed to obtain the ACOS PICS limb image of the day for January 26, 2005?
• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005 1849:09UT taken by the ACOS Mark IV polarimeter?
• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21, 2008, 1900UT missing?
• Why does this image look bad?
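Several of these questions become simple lookups once each processing step is recorded alongside the data product. The sketch below assumes a hypothetical record structure; the product name, step names, agents and input file names are invented and do not describe the actual ACOS/MLSO pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    name: str                      # what was done
    agent: str                     # person or program that did it
    inputs: List[str] = field(default_factory=list)  # files the step used


@dataclass
class DataProductRecord:
    name: str
    history: List[ProcessingStep] = field(default_factory=list)

    def record(self, name, agent, inputs=()):
        """Append one processing step to this product's history."""
        self.history.append(ProcessingStep(name, agent, list(inputs)))


def steps_applied(product):
    """What processing steps were completed, in order?"""
    return [step.name for step in product.history]


def agents_for(product, step_name):
    """Who (person or program) performed a given step?"""
    return [s.agent for s in product.history if s.name == step_name]


def inputs_for(product, step_name):
    """Which input files did a given step use?"""
    return [i for s in product.history if s.name == step_name for i in s.inputs]


# Hypothetical example history.
image = DataProductRecord("limb_image_2005-01-26")
image.record("dark_subtraction", agent="pipeline_v1", inputs=["dark_001"])
image.record("flat_field", agent="pipeline_v1", inputs=["flat_017", "flat_018"])
image.record("add_comment", agent="observer_a")
```

Questions such as "why does this image look bad?" need more than a step log (observing conditions, quality flags), but they rest on the same idea of capturing information at ingest rather than reconstructing it later.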
32. Provenance
• Origin or source from which something
comes, intention for use, who/what
generated for, manner of manufacture,
history of subsequent owners, sense of
place and time of manufacture, production
or discovery, documented in detail
sufficient to allow reproducibility
39. Discussion (1)
• Taken together, an emerging set of collected
experience manifests an emerging informatics
core capability that is starting to take data
intensive science into a new realm of realizability
and potentially, sustainability
– Use cases (i.e. real users)
– X-informatics
– Core Informatics
– Cyber Informatics
• There are implications for data models
40. Progression after progression
[Diagram labels: Informatics; IT Cyberinfrastructure; Cyber Informatics; Core Informatics; Science Informatics; Science, SBAs]
Example:
• CI = OPeNDAP server running over HTTP/HTTPS
• Cyberinformatics = Data (product) and service ontologies, triple store
• Core informatics = Reasoning engine (Pellet), OWL
• Science (X) informatics = Use cases, science domain terms, concepts in an ontology
41. Discussion (2)
• Data and information science is becoming
the ‘fourth’ column (along with theory,
experiment and computation)
• Semantics (of the data) are a very key
ingredient -> may imply richer data models
42. Summary
• Informatics is playing a key role in filling the gap
between science (and the spectrum of non-expert)
use and generation and the underlying
cyberinfrastructure, i.e. in shifting the burden
– This is evident due to the emergence of Xinformatics
(world-wide)
• Our experience is implementing informatics as
semantics in Virtual Observatories (as a working
paradigm) and Grid environments
– VSTO is only one example of success
– Data mining, data integration, smart search, provenance
are close behind
• Informatics is a profession and a community activity
and requires efforts in all 3 sub-areas (science, core,
cyber) and must be synergistic
43. More Information
• Virtual Solar Terrestrial Observatory (VSTO):
http://vsto.hao.ucar.edu, http://www.vsto.org
• Semantically-Enabled Science Data Integration (SESDI):
http://sesdi.hao.ucar.edu
• Semantic Provenance Capture in Data Ingest Systems
(SPCDIS): http://spcdis.hao.ucar.edu
• Semantic Knowledge Integration Framework (SKIF/SAM):
http://skif.hao.ucar.edu
• Semantic Web for Earth and Environmental Terminology
(SWEET): http://sweet.jpl.nasa.gov
• Conferences: AGU 2008, EGU 2009, ISWC 2008, CIKM
2008, …
• Peter Fox pfox@ucar.edu
Editor's Notes
There are lots of different kinds of audiences interested in data. While we are talking about using data in the classroom today, several other audiences are of importance to virtual observatories. In particular, on the more strategic end are groups that, while smaller, have great impact on the public’s and the government’s perception of the value of the data and its providers. In this category, I would place both science policy specialists and the media. Policy specialists and decision makers have a tremendous impact on budgets, but also feel, at least at some level, beholden to the taxpayers. They want to see the impact that data has on people’s lives. They are also looking for information that will help them make an informed decision. In addition, the media plays a critical role, providing about 85% of the science content to the general public. A third group that is worth considering is the educated general public (the science-attentive public). They take science very seriously and can be vocal advocates for a scientific resource -- look at the Hubble scenario as an example.
Interoperability technologies have a 20 year history of development and are now mature. Combined with our growing ability to transmit large amounts of information efficiently (Internet, GRID), this provides us with an unprecedented ability to address new and old scientific problems in ways that were hitherto impossible. The challenge now in the geosciences is to capitalise on this new capability. e-Science initiatives are growing up in centers around the world in response to this opportunity.
In the “heroic” era of science, the provider of data plays a relatively passive role. The onus is on the user to identify each data source, accumulate the required data, and prepare the data for assimilation and analysis. A similar process applies for acquiring software for analysis, modeling, and visualisation. In many cases, the user and the provider are the same person.
In an interoperable e-Science world, the provider has to put much more work into describing the structure and content of data and information, and someone has to provide and support Web Services. The user is relieved of these burdens and benefits accordingly. The overall reduction in work load is enormous, but the provider does not see that.
This presentation is a template to be used by anyone as a basis for an introductory eGY presentation - please use it and modify it for your particular audience.
A collection of .ppt files from past presentations are on the website: www.egy.org/resources. Use any you wish.
Notes accompany each slide, so the presentation should be reviewed under “View: Normal”, or perhaps “View: Notes”.
eGY Development Team
July 2006