An e-Infrastructure is a distributed network of service nodes, residing on multiple sites and managed by one or more organizations. e-Infrastructures allow scientists residing at distant places to collaborate. They offer a multiplicity of facilities as-a-service, supporting data sharing and usage at different levels of abstraction, e.g. data transfer, data harmonization and data processing workflows. e-Infrastructures are gaining an important place in the field of biodiversity conservation. Their computational capabilities help scientists to reuse models, obtain results in a shorter time and share these results with colleagues. They are also used to access several heterogeneous biodiversity catalogues.
In this course, the D4Science e-Infrastructure will be used to conduct experiments in the field of biodiversity conservation. D4Science hosts models and contributions by several international organizations involved in biodiversity conservation. The course will give students an overview of the models, practices and methods that large international organizations like FAO and UNESCO apply by means of D4Science. At the same time, the course will introduce students to the basic concepts underlying e-Infrastructures, Virtual Research Environments, data sharing and the reproducibility of experiments.
2. Module 3 - Outline
• Biodiversity and geospatial data
• Trends in biodiversity observations
• Combining species observations
• Combining biodiversity and geospatial data
3. D4Science
D4Science is both a Data and a Computational e-Infrastructure:
• Used by several projects: i-Marine, EUBrazil OpenBio, ENVRI;
• Implements the notion of e-Infrastructure as-a-Service: it offers on-demand access to data management services and computational facilities;
• Hosts several VREs for Fisheries Managers, Biologists, Statisticians… and Students.
4. D4Science - Resources
• A large set of connected biodiversity and taxonomic datasets
• A network to distribute and access geospatial data
• A distributed storage system to store datasets and documents
• A social network to share opinions and useful news
• Algorithms for biology-related experiments
7. Biodiversity Data Providers
i-Marine hosts biodiversity datasets coming from several data providers:
• Some are accessed remotely and are maintained by their respective owners;
• Others are resident in the e-Infrastructure.
Currently, the accessible datasets are:
• Catalogue of Life (CoL)
• Global Biodiversity Information Facility (GBIF)
• Integrated Taxonomic Information System (ITIS)
• Interim Register of Marine and Nonmarine Genera (IRMNG)
• Ocean Biogeographic Information System (OBIS)
• World Register of Marine Species (WoRMS)
• World Register of Deep-Sea Species (WoRDSS)
Some data providers collect data from other data providers, but the alignment between them is not guaranteed!
The datasets allow the retrieval of:
• Occurrence points (presence points or specimens)
• Taxa names
17. Occurrences Points Operations
Two occurrence records, A and B, each carry the fields:
• x,y
• Event Date
• Modif Date
• Author
• Species Scientific Name
A and B are evaluated as equal (A = B) if:
• d(x,y) < Distance Thr
• LD(Author) * LD(SciName) > Lexical Thr
If they are equal, take the most recent.
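The comparison rule on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the D4Science implementation: LD is approximated with the similarity ratio of `difflib.SequenceMatcher` (1 means identical strings), d(x,y) is taken as the Euclidean distance in coordinate units, "most recent" is decided on the Modif Date, and the comparisons are inclusive so that the setting DThr=0; LThr=1 behaves as an exact match. The names `Occurrence`, `same_occurrence` and `most_recent` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher
from math import hypot


@dataclass
class Occurrence:
    x: float
    y: float
    event_date: date
    modif_date: date
    author: str
    sci_name: str


def lexical_similarity(a: str, b: str) -> float:
    """Assumed stand-in for LD: a score in [0, 1], 1 meaning identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_occurrence(a: Occurrence, b: Occurrence,
                    distance_thr: float, lexical_thr: float) -> bool:
    """True when A and B are close in space and lexically similar."""
    close = hypot(a.x - b.x, a.y - b.y) <= distance_thr
    score = (lexical_similarity(a.author, b.author)
             * lexical_similarity(a.sci_name, b.sci_name))
    return close and score >= lexical_thr


def most_recent(a: Occurrence, b: Occurrence) -> Occurrence:
    """Keep the record with the latest modification date (assumption)."""
    return a if a.modif_date >= b.modif_date else b
```

With DThr=0 and LThr=1 only records with identical coordinates, authors and scientific names are treated as the same occurrence; looser thresholds tolerate small coordinate shifts and differences in name formats.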
18. Experiment
Solea solea
Record counts shown in the workflow diagram: 57 085 Records, 2 324 Records, 1 871 Records, 10 542 Records
Duplicates Deletion with Exact Match (DThr=0; LThr=1)
Subtraction with three settings:
• DThr=0.01; LThr=0
• DThr=0.01; LThr=1
• DThr=0.0001; LThr=0.8
Subtraction results shown: 183 Records, 0 Records, 0 Records
Main remarks:
• The “recordedBy” fields contain differences in name formats
• The Scientific Names fields are different (names vs names and codes)
• D4Science helps in collecting a larger number of unique Solea solea occurrence records
• Even if GBIF collects data from OBIS, its coverage is not up to date
19. Occurrences Points Operations
Occurrences Duplicates Deleter:
An algorithm for deleting similar occurrences in a set of occurrence points coming from the Species Discovery Facility of D4Science.
22. Occurrences Points Operations
Occurrences Merger:
Given two Occurrence Sets A and B, enriches A with the elements of B that are not in A. Updates the elements of A with more recent elements in B. If one element in A corresponds to several more recent elements in B, these replace the element of A.
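The merging behaviour described on this slide can be sketched as follows. It is a minimal illustration, not the D4Science algorithm: the record type `Occ`, the function `merge_occurrences` and the `matches` predicate are hypothetical, dates are assumed to be ISO strings so that string comparison orders them, and each element of B is assumed to match at most one element of A.

```python
from typing import Callable, List, NamedTuple


class Occ(NamedTuple):
    x: float
    y: float
    modif_date: str  # ISO date string (assumption), e.g. "2011-03-01"
    sci_name: str


def merge_occurrences(a_set: List[Occ], b_set: List[Occ],
                      matches: Callable[[Occ, Occ], bool]) -> List[Occ]:
    """Enrich A with B: update matched records, append unmatched ones."""
    merged = []
    matched_b = set()
    for a in a_set:
        hits = [i for i, b in enumerate(b_set) if matches(a, b)]
        matched_b.update(hits)
        newer = [b_set[i] for i in hits if b_set[i].modif_date > a.modif_date]
        # All more recent matches replace the element of A; otherwise A is kept.
        merged.extend(newer if newer else [a])
    # Elements of B that correspond to nothing in A are added as they are.
    merged.extend(b for i, b in enumerate(b_set) if i not in matched_b)
    return merged


def same_point(a: Occ, b: Occ) -> bool:
    """Exact-match predicate used only for this example."""
    return (a.x, a.y, a.sci_name) == (b.x, b.y, b.sci_name)
```

Passing the matching rule as a parameter keeps the merge logic independent of the thresholds chosen for distance and lexical similarity.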
25. Combining Biodiversity and Geospatial data
• Inputs: a species occurrence dataset and environmental layers
• Output: an enriched dataset
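One way to picture the enrichment step is to read, for each occurrence point, the value of every environmental layer in the grid cell that contains the point. The sketch below assumes layers are regular global grids stored as nested lists with the first row at latitude 90 and the first column at longitude -180; the function names and the dictionary-based record layout are hypothetical and do not reflect the D4Science services.

```python
def cell_index(lon, lat, resolution, min_lon=-180.0, max_lat=90.0):
    """Row and column of the grid cell containing a point (assumed grid layout)."""
    col = int((lon - min_lon) // resolution)
    row = int((max_lat - lat) // resolution)
    return row, col


def enrich(occurrences, layers, resolution):
    """Attach the value of each environmental layer to each occurrence."""
    enriched = []
    for occ in occurrences:
        row, col = cell_index(occ["lon"], occ["lat"], resolution)
        record = dict(occ)
        for name, grid in layers.items():
            in_grid = 0 <= row < len(grid) and 0 <= col < len(grid[0])
            # Points falling outside the grid get no value for that layer.
            record[name] = grid[row][col] if in_grid else None
        enriched.append(record)
    return enriched


# Toy layer: 2 x 4 cells of 90 degrees each, with made-up values.
toy_layers = {"temperature": [[1, 2, 3, 4], [5, 6, 7, 8]]}
toy_points = [{"lon": 10.0, "lat": 45.0}, {"lon": -100.0, "lat": -30.0}]
enriched_points = enrich(toy_points, toy_layers, 90.0)
```

Each enriched record keeps its original fields and gains one field per layer, which is the form needed later by presence/absence models.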
28. The giant squid - Architeuthis
The giant squid (Architeuthis) has been reported worldwide since before the 16th century and was observed live in its habitat for the first time in 2012.
29. Why rare species?
• Biological and evolutionary investigations
• Fisheries management policies and conservation
• Vulnerable Marine Ecosystems
• Key role in affecting biodiversity richness
• Indicators of degradation for aquatic ecosystems
30. Detecting rare species
• How to build a reliable distribution from few observations?
• How to account for absence locations?
• Is there any approach for rare species?
31. Data quality
For rare species, data quality is fundamental:
• Reliable presence data
• Reliable absence locations
• High quality environmental features
• Non-noisy environmental features
32. Tools – i-marine.d4science.org
D4Science e-Infrastructure:
• Retrieve presence data
• Generate absence data
• Get environmental data
• Model, adjust data and produce maps
• Share results
33. 1. Presence data of A. dux from D4S
https://i-marine.d4science.org/group/biodiversitylab/species-data-discovery
34. 2. Simulating A. dux absence locations from AquaMaps
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
AquaMaps Native, 0 < Prob. < 0.2
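The selection rule shown here (cells whose AquaMaps probability lies strictly between 0 and 0.2) can be sketched as a filter followed by sampling. This is an illustration only: the mapping from cell centres to probabilities, the function name `simulate_absences` and the use of uniform random sampling are assumptions, not the D4Science processing tool.

```python
import random


def simulate_absences(cell_probabilities, n, low=0.0, high=0.2, seed=0):
    """Pick up to n cells whose probability p satisfies low < p < high."""
    candidates = sorted(
        cell for cell, p in cell_probabilities.items() if low < p < high
    )
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(candidates, min(n, len(candidates)))


# Toy probabilities keyed by (lon, lat) cell centres; the values are made up.
toy_probabilities = {
    (0.25, 0.25): 0.0,
    (0.75, 0.25): 0.1,
    (1.25, 0.25): 0.19,
    (1.75, 0.25): 0.2,
    (2.25, 0.25): 0.9,
}
absences = simulate_absences(toy_probabilities, 5)
```

Cells with probability exactly 0 are excluded as well as those at or above 0.2, matching the strict inequalities on the slide.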
36. 4. MaxEnt model as filter
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
• Input: presence data and env. data
• Model: MaxEnt
• Output: env. features most correlated to the giant squid
38. 5. Presence/absence modelling: Artificial Neural Networks (ANN)
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
• Input: presence/absence data and filtered env. features
• Output: a model trained on positive and negative examples in terms of env. features, stored as a binary file
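To make the idea of training on positive and negative examples concrete, here is a minimal feed-forward network with one hidden layer, written in plain Python. It is a teaching sketch under assumptions, not the network used in D4Science: the architecture, learning rate, number of epochs, the function name `train_ann` and the toy feature values are all made up for illustration.

```python
import math
import random


def train_ann(samples, labels, hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train a tiny network on presence (1) and absence (0) examples."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        z = sum(w * hj for w, hj in zip(w2, h)) + b2[0]
        return h, 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        for x, target in zip(samples, labels):
            h, y = forward(x)
            delta_out = y - target  # cross-entropy gradient at the output
            for j in range(hidden):
                # Hidden-layer gradient uses the output weight before its update.
                delta_h = delta_out * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * delta_h * x[i]
                b1[j] -= lr * delta_h
            b2[0] -= lr * delta_out

    return lambda x: forward(x)[1]


# Toy data: two env. features scaled to [0, 1]; the values are made up.
presence = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.7), (0.95, 0.9)]
absence = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.05, 0.1)]
model = train_ann(presence + absence, [1] * 4 + [0] * 4)
```

The returned function gives a value between 0 and 1 for any feature vector. Changing `hidden` is one way to try different learning topologies.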
39. 6. Projection of the Neural Network
https://i-marine.d4science.org/group/biodiversitylab/processing-tools
41. Conclusions
• Enhancing data quality produces a high-performance distribution
• A presence/absence ANN combines these data
• Biological, observation and expert evidence confirm the prediction by the ANN
42. Summary: modelling rare species distributions
1. Retrieve high-quality presence locations by relying on the metadata of the records.
2. Use expert knowledge or an expert system to detect absence locations. Select absence locations that are as widespread as possible.
3. Select a number of environmental characteristics correlated to the species presence.
4. Use MaxEnt to filter the environmental characteristics that are really important with respect to the presence points.
5. Train an Artificial Neural Network on presence and absence locations and select the best learning topology.
6. Project the ANN at global scale, using a resolution equal to the maximum in the environmental features.
7. Train a MaxEnt model as a comparison system.
43. Just another example
Coelacanth, Smith 1939
Models: GARP, MaxEnt, AquaMaps, Neural Network
Coro, Gianpaolo, Pasquale Pagano, and Anton Ellenbroek. "Combining simulated expert knowledge with Neural Networks to produce Ecological Niche Models for Latimeria chalumnae." Ecological Modelling 268 (2013): 55-63.