SlideShare una empresa de Scribd logo
1 de 67
1
Mining and Utilizing Dataset Relevancy
from Oceanographic Dataset
(MUDROD) Metadata, Usage Metrics,
and User Feedback to Improve Data
Discovery and Access
Dr. Chaowei (Phil) Yang (PI, George Mason University)
Mr. Edward M Armstrong (Co-I, Jet Propulsion Laboratory, NASA)
AIST-14-82, NNX15AM85G Final Review
June 22, 2017
2
Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
TRLin = 5
4/09
Objective
• Improve data discovery, selection and access to NASA
Observational Data.
• Intuitive interface to federated data holdings.
• Enable new user communities to discover and access
data for their projects.
• Reduce time for scientists to discover, download and
reformat data.
• Implement extensible ontology framework.
• Improve discovery accuracy of oceanographic data
• Foundation for Managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Approach
• Setup collaboration, testing environment.
• Design MUDROD knowledge base system.
• Reconstruct sessions from web logs
• Calculate vocabulary similarity from logs and metadata.
• Update semantic search and conduct alpha testing.
• Integrate MUDROD alpha into P.O.DAAC.
• Enhance knowledge base, to include GCMD.
• Demonstrate Prototype.
03/16
Key Milestones
AIST-14-82 NNX15AM85G
CoIs: E. Armstrong, T. Huang, D. Moroni, JPL;
• Start 06/15
• Identify Use cases 01/16
• Design search, query, reasoning engine 03/16
• Ontological System Implementation 07/16
• Complete Beta test at P.O. DAAC 12/16
• Integrated test 02/17
• PO.DAAC metadata discovery Demo (TRL 7) 05/17
MUDROD Engine Architecture
3
Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD)
PI: Chaowei (Phil) Yang, George Mason University
TRLin = 5 TRLout = 7
4/09
Objective
• Improve data discovery, selection and access to NASA
Observational Data.
• Intuitive interface to federated data holdings.
• Enable new user communities to discover and access
data for their projects.
• Reduce time for scientists to discover, download and
reformat data.
• Implement extensible ontology framework.
• Improve discovery accuracy of oceanographic data
• Foundation for Managing Big Data.
• Demonstrate MUDROD at PO.DAAC.
Accomplishments
• Developed framework
• Analyze web logs to discover user knowledge (query and
data relationships)
• Construct knowledge base by combining semantics and
profile analyzer
• Improve data discover
• Resulted in
• better ranking;
• recommendation;
• ontology navigation
03/16 AIST-14-0082
CoIs: T. Huang, D. Moroni, E. Armstrong, JPL;
4
Mr. Edward M Armstrong
(Senior Data Engineer , NASA Jet
Propulsion Laboratory)
Mr. David Moroni
(Data Engineer, NASA Jet
Propulsion Laboratory)
Team Members: Investigators
Dr. Chaowei Phil Yang
(Professor and Director, NSF
Spatiotemporal Innovation Center,
George Mason University)
Mr. Thomas Huang
(Principal Scientist, NASA Jet
Propulsion Laboratory)
5
Team Members: Students and Post-Docs
Ms. Yun Li
(Ph.D. student, George Mason
University; Research Assistant ).
Mr. Yongyao Jiang
(Ph.D. student, George Mason
University; Research Assistant ).
Dr. Lewis J. McGibbney
(Data Scientist, NASA Jet
Propulsion Laboratory).
Mr. Chris Finch,
(Software Engineer, NASA Jet
Propulsion Laboratory).
Mr. Frank Greguska,
(Software Engineer, NASA Jet Propulsion
Laboratory) .
Mr. Gary Chen
6
Presentation Contents
• Background / Objectives / Technical Advancements
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
7
Background / Objectives / Tech Advance (1)
• Keyword-based matching (traditional search engines)
– User query: ocean wind
– Final query: ocean AND wind
• Reveal the real intent of user query
– ocean wind = “ocean wind” OR “greco” OR
“surface wind” OR “mackerel breeze” …
• PO.DAAC UWG Recommendation 2014-07
• NASA ESDSWG Search Relevance Recommendations 2016 & 2017
Data Discovery Problems
8
Background / Objectives / Tech Advance (2)
• Analyze web logs to discover user knowledge (query and data
relationships)
• Construct knowledge base by combining semantics and profile
analyzer
• Improve data discovery by 1) better ranking; 2) recommendation; 3)
ontology navigation
Objectives
9
Background / Objectives / Tech Advance (2)
• Web log preprocessing
• Semantic analysis of user queries & Navigation
• Machine learning based search ranking
• Data Recommendation
Tech Advance (four technological modules)
10
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
11
Accomplishments Since last Review
TRL Assessment (1)
• Performance improvement
• Improved log mining performance using cloud computing
• Hybrid Recommendation algorithm
• Stem MUDROD ontologies from a number of sources
• create an OWL manifestation based on NASA GCMD-DIF data model
• create a PO.DAAC-specific extension of the SWEET concept
• Ingest PO.DAAC extractions into ESIP Ontology Portal
• Improve web UI
• Test MUDROD and ingest latest PO.DAAC logs
• Deploy MUDROD on JPL server
12
Current TRL Rational
TRL Assessment (2)
Component Current TRL Project end TRL Description
Semantic search engine
Search Dispatcher 7 7
Translating a user search query into a
set of new semantic queries
Similarity calculator 7 7
Calculating the semantic similarity from
weblogs, metadata, and ontology
Recommendation module 7 7
Recommending similar datasets to the
clicked dataset
Ranking module 7 7
Re-ranking the search results based on
RankSVM ML algorithm
Knowledge base
Ontology 7 7
Extensions from SWEET ontology for
earth science data
Triple Store 7 7 ESIP ontology repository
Vocabulary linkage discovery engine
Profile analyzer 7 7
Extracting user browsing pattern from
raw web logs
Web services/GUI
Ranking service/presenter 7 7
Providing and presenting the ranked
results
Recommendation service/presenter 7 7
Providing and presenting the related
datasets
Ontology navigation service/presenter 7 7
Providing and presenting related
searches
13
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans
• Schedule
• Financials
• Publications
• List of Acronyms
14
Overview
• Web log preprocessing
• Semantic analysis of user queries & Navigation
• Machine learning based search ranking
• Data Recommendation
Functions/Modules
15
Web log processing
16
• Requests sent from client e.g. browser, cmd line tool, etc. recorded by
server
• Log files provided by PO.DAAC (HTTP(S), FTP)
Client IP: 68.180.228.99
Request date/time: [31/Jan/2015:23:59:13 -0800]
Request: " GET /datasetlist/... HTTP/1.1 "
HTTP Code: 200
Bytes returned: 84779
Referrer/previous page: “/ghrsst/"
User agent/browser: "Mozilla/5.0 ...
68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] "GET /datasetlist/... HTTP/1.1" 200
84779 "/ghrsst/" "Mozilla/5.0 ..."
Web logs
17
Goal: reconstruct user browsing pattern
(search history & clickstream) from a set
of raw logs
Web logs
User identification
Crawler detection
Structure
reconstruction
Session
identification
Search
history
Clickstrea
m
Additional steps include: word
normalization, stop words removal, and
stemming
Data preprocess
18
Reconstructed session structure
19
Data preprocess results
1. User search history 2. Clickstream
Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) “Reconstructing Sessions from Data Discovery
and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery” ISPRS International Journal of
Geo-Information, 5, 54.
20
Semantic similarity
21
Existing ontology (SWEET)
• SWEET (Raskin and Pan
2003)
• Focus on only two relations
• The closer, the more similar
22
User search history
• Create query – user matrix
• Calculate binary cosine similarity
Conceptual example
23
Clickstream
• Hypothesis: similar queries can result in similar clicking behavior
• If two queries are similar, the data that get clicked after they are
searched would be more likely to be similar
Query b
Data
Query a
Similar?
24
Metadata
• Hypothesis: semantically related terms tend to appear in the same
metadata more frequently
• Essentially the same as the clickstream analysis
• Perform Latent Semantic Analyses (LSA) over the term – metadata
matrix
Query b
Metadata
Query a
Similar?
25
Integration
• All four results could be converted to
• Problem:
– None of them are perfect (uncertainty in data, hypothesis and
method)
– Metadata and ontology might have unknown terms to search
engine end users
– Sometimes, similarity values from different methods are
inconsistent
Concept
A
Concept
B
Similarity
26
Integration
• The maximum similarity of all of the components (large similarity
appears to be more reliable)
• The adjustment increment becomes larger when the similarity exists in
more sources
27
Semantic Similarity Calculation Workflow
28
Results and evaluation
Query
Search history Clickstream Metadata
SWEE
T
Integrated list
ocean
temperature
sea surface
temperature(0.66), sea
surface topography(0.56),
ocean wind(0.56),
aqua(0.49)
sea surface
temperature(0.94),
sst(0.94), group high
resolution sea surface
temperature dataset(0.89),
ghrsst(0.87)
sst(0.96), ghrsst(0.77), sea
surface temperature(0.72),
surface temperature(0.63),
reynolds(0.58)
None
sst(1.0), sea surface temperature(1.0),
ghrsst(1.0), group high resolution sea
surface temperature dataset(0.99),
reynolds sea surface temperature(0.74)
Sample group Overall accuracy
Most popular 10 queries 88%
Least popular 10 queries 61%
Randomly selected 10 queries 83%
By domain experts
Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T.
Huang & D. Moroni (2017) A Comprehensive
Approach to Determining the Linkage Weights
among Geospatial Vocabularies - An Example with
Oceanographic Data Discovery. International
Journal of Geographical Information Science (minor
revision)
29
Search ranking
30
Background
• Ranking is a long-standing problem in geospatial data discovery
• Typically, hundreds, even thousands of matches
• Can get larger as more Earth observation data is being collected
31
Objective and Methods
• Put the most desired data to the top of the result list
• What features can represent users’ search preferences for
geospatial data?
• How can the ranking function reach a balance of all these features?
• Identified eleven features from
– Geospatial metadata attributes
– Query – metadata content
overlap
– User behavior from web logs
32
Ranking features – Metadata attributes
Features Description
Release date The date when the data was published
Processing level (PL) The processing level of image products, ranging from
level 0 to level 4.
Version number The publish version of the data
Spatial resolution The spatial resolution of the data
Temporal resolution The temporal resolution of the data
• Five metadata features
• Verified by domains experts
• Query-independent: static, depends on the data itself, won’t
change with the query
33
Spatial query-metadata overlap
• Spatial similarity between query area and the coverage of a particular
data
• Overlap area normalized by the original area of query and data
34
Ranking features – User behavior
• All-time, monthly, user popularity, and semantic popularity (retrieved
from web logs)
• Semantic popularity: the number of times that the data has been
clicked after searching a particular query and its highly related ones
(query-dependent)
35
RankSVM
• One of the well-recognized ML ranking algorithm
• Convert a ranking problem into a classification problem that a
regular SVM algorithm can solve
• 3 main steps
 1) Standardize: mean = 0, std = 1
o SVM is not scale invariant
o Over-optimized
o Longer to train
 2) For any pair of training data, calculate the difference
 3) A ranking problem becomes a binary classification problem,
where SVM is applied to find the optimal decision boundary
36
Architecture
• All of these (except for
training) can be
finished within 2
seconds
• None of the open
source mainstream
ML library provide any
ranking algorithm
• Implemented it by
ourselves with the aid
of Spark MLlib
Index
User query
Semantic
query
Top K
retrieval
Ranking
model
Learning
algorithm
Training
data
Re-ranked
results
Feature
extractor
Weblogs
User clicksKnowledge
base
37
NDCG (K) for five different ranking
methods at varying K (1-40)
Jiang, Y., Y. Li, C.
Yang, K. Liu, E.
M. Armstrong, T.
Huang, D.
Moroni & L. J.
McGibbney
(2017) Towards
intelligent
geospatial
discovery: a
machine learning
ranking
framework.
International
Journal of Digital
Earth (minor
revision)
38
Data recommendation
39
How to recommend geospatial data?
• Use geospatial metadata for content-based recommendation
-Metadata spatiotemporal similarity
-Metadata attribute similarity
-Metadata content similarity
• Leverage user behaviors data for CF recommendation
-Session-based co-occurrence of data
40
Attribute type Attribute name Attribute description
Spatiotemporal attributes DatasetCoverage-EastLon The East longitude of the bounding rectangle
DatasetCoverage-WestLon The West longitude of the bounding rectangle
DatasetCoverage-NorthLat The North latitude of the bounding rectangle
DatasetCoverage-SouthLat The South latitude of the bounding rectangle
DatasetCoverage-StartTimeLong The start time of the data
DatasetCoverage-StopTimeLong The end time of the data
Categorical geographic
attributes
DatasetRegion-Region Region of dataset. Such as global, Atlantic
Dataset-ProjectionType Project type like cylindrical lat-lon
Dataset-ProcessingLevel Data processing level
DatasetPolicy-DataFormat Data format e.g. HDF, NetCDF
DatasetSource-Sensor-ShortName Short name of sensor
Ordinal geographic
attributes
Dataset-TemporalResolution Temporal resolution of dataset
Dataset-TemporalRepeat Temporal resolution of dataset
Dataset-SpatialResolution Spatial resolution of dataset
Descriptive attributes Dataset-description Describe the content of the dataset
Geographic metadata
41
• Spatial variables: NorthLat, SouthLat, WestLon, EastLon
• Temporal variables: DatasetCoverage-StartTimeLong, StopTimeLong
• Use volume overlap ratio to calculate similarity
𝑠𝑝𝑎𝑡𝑖𝑜𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙 𝑠𝑖𝑚 𝑟 𝑖,𝑟 𝑗
=
𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖 ∩ 𝑟𝑗
𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖
+
𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖 ∩ 𝑟𝑗
𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑗
∗ 0.5
𝑣𝑜𝑙𝑢𝑚𝑒 𝑟 = 𝑒𝑎𝑠𝑡𝑙𝑜𝑛 − 𝑤𝑒𝑠𝑡𝑙𝑜𝑛 ∗ 𝑠𝑜𝑢𝑡ℎ𝑙𝑎𝑡 − 𝑛𝑜𝑟𝑡ℎ𝑙𝑎𝑡 ∗ 𝑒𝑛𝑑𝑡𝑖𝑚𝑒 − 𝑠𝑡𝑎𝑟𝑡𝑖𝑚𝑒
Spatiotemporal similarity
42
• Fixed number of values
• No intrinsic ordering
• sensor-name: "AMSR-E", "MODIS", "AVHRR-3" and "WindSat"
𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑖𝑐𝑎𝑙_𝑣𝑎𝑟_𝑠𝑖𝑚 𝑣𝑖, 𝑣𝑗 =
𝑣 𝑖 ∩ 𝑣 𝑗
𝑣 𝑖 ∪ 𝑣 𝑗
Categorical similarity
43
Ordinal attribute is similar to categorical attribute but its values has a
clear order, e.g. spatial resolution
• Converted into rank from 1 to R
• Nominalize these ranks for similarity calculation
𝑜𝑟𝑑𝑖𝑛𝑎𝑙 𝑣𝑎𝑟
𝑠𝑖𝑚 𝑣 𝑖,𝑣 𝑗
= 1 − 𝑛𝑜𝑟𝑚 𝑟𝑎𝑛𝑘 𝑣 𝑖
− 𝑛𝑜𝑟𝑚 𝑟𝑎𝑛𝑘 𝑣 𝑗
𝑛𝑜𝑟𝑚_𝑟𝑎𝑛𝑘 𝑣𝑖 =
𝑅𝑎𝑛𝑘𝑣 𝑖+1
𝑅+1
Ordinal similarity
44
Original text
Aquarius Level 3 sea surface salinity (SSS) standard mapped image
data contains gridded 1 degree spatial resolution SSS averaged over
daily, 7 day, monthly, and seasonal time scales. This particular data
set is the seasonal climatology, Ascending sea surface salinity
product for version 4.0 of the Aquarius data set, which is the official
end of prime mission public data release from the AQUARIUS/SAC-D
mission. Only retrieved values for Ascending passes have been used
to create this product. The Aquarius instrument is onboard the
AQUARIUS/SAC-D satellite, a collaborative effort between NASA and
the Argentinian Space Agency Comision Nacional de Actividades
Espaciales (CONAE). The instrument consists of three radiometers in
push broom alignment at incidence angles of 29, 38, and 46 degrees
incidence angles relative to the shadow side of the orbit. Footprints
for the beams are: 76 km (along-track) x 94 km (cross-track), 84 km x
120 km and 96km x 156 km, yielding a total cross-track swath of 370
km. The radiometers measure brightness temperature at 1.413 GHz
in their respective horizontal and vertical polarizations (TH and TV). A
scatterometer operating at 1.26 GHz measures ocean backscatter in
each footprint that is used for surface roughness corrections in the
estimation of salinity. The scatterometer has an approximate 390km
swath.
Extracted terms
Radiometers Measure Brightness Temperature,AQUARIUS/SAC
Mission,Image Data,Broom Alignment,Resolution
SSS,AQUARIUS/SAC
Satellite,Scatterometer,Scatterometer,Aquarius Data,Argentinian
Space Agency Comision Nacional,Incidence Angles,Time
Scales,Actividades Espaciales,Cross-track Swath,Official
End,Aquarius Instrument,Shadow Side,Ascending Sea Surface
Salinity Product,Level,Level,Surface Roughness
Corrections,Data Release,Salinity,Density, Salinity,Density,
AQUARIUS,L3,SSS,SMIA,SEASONAL-CLIMATOLOGY,V4
Step 1: Phrase extraction
1. Extract term candidates from metadata description with POS (part of speech) Tagging
2. Introduce “occurrence” and “strength” to filter out terms from candidates.
“occurrence”: occurrences number of terms
“strength”: the number of words in a term
Descriptive similarity
45
Step 2: Represent metadata in the phrase vector space (The
dimension lower than word feature space)
Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term 7 … Term N
dataser1 1 0 1 0 0 0 0 1
dataser2 0 0 0 1 1 0 0 0
…
dataset k 1 0 1 0 0 0 0 1
Step 3: Calculate cosine similarity
Metadata abstract semantic similarity
46
Session
1
Session
2
…
Session
N
Data 1 1 0 1
Data 2 0 0 0
…
Data k 1 0 1
Calculate metadata similarity based on session level co-occurrence
Similarity (i,j) =
𝑁(𝑖ᴖ𝑗)
𝑁(𝑖)∗ 𝑁(𝑗)
N(i): The number of sessions in which dataset i was viewed or download
N(j): The number of sessions in which dataset j was viewed or download
N(𝑖ᴖ𝑗): The number of sessions in which both dataset i and j were viewed or download
Session based recommendation
47
Recommendation
method
Pros Cons Strategy
Descriptive
similarity
1. Natural language
processing methods
can be adopted to
find latent semantic
relationship
1. Many datasets has nearly
same abstract with few words/values
changed
2. It is hard to extract detailed
attributed from description
Used as the basic
method of
recommendation
algorithm
Attribute similarity
(spatiotemporal,
ordinal, categorical)
1. As structured
data, geographic
metadata have many
variables.
1. Variable values may be null or
wrong
2. The quality depends on the weight
assigned to every variable
As supplement to
semantic similarity
Session
concurrence
1. Reflect users’
preference
1. Cold start problem: Newly
published data don’t have usage
data
Fine-tune
recommendation
list
Recommend (i) = 𝑊 𝑠𝑠 ∗ 𝐷𝑒𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑣𝑒𝑐 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑖 + ∑𝑊 𝑐𝑣 ∗ 𝐶𝑎𝑡𝑒𝑔𝑜𝑟𝑖𝑎𝑙 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑖) + ∑𝑊 𝑜𝑣 ∗
𝑂𝑟𝑑𝑖𝑎𝑛𝑙 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑖) + 𝑊𝑠𝑡𝑣 ∗ 𝑆𝑝𝑎𝑡𝑖𝑜𝑇𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 + 𝑊𝑠𝑜 ∗ 𝑆𝑒𝑠𝑠𝑖𝑜𝑛𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦
Hybrid recommendation
48
Quantitative Evaluation
Hybrid similarity
outperform other
similarities since it
integrates metadata
attributes and user
preference.
Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data
Recommender System based on Metadata and User Behaviour (revision)
49
Plan Forward
• Add more features (e.g., temporal similarity)
• Create training data from web logs for RankSVM
• Develop a query understanding module to better interpret user’s search
intent (e.g. “ocean wind level 3” -> “ocean wind” AND “level 3”)
• Support Solr
• Support near real-time data ingestion to dynamically update knowledge
base
• Integration with DOMS and OceanXtremes to support an ocean science
analytics center
• Leverage advanced computing techniques to speed up the process
50
What has been done
• Offline/sequential log analysis
– Time and computation intensive
• Ranking
– Use manually labeled data as train data
– RankSVM
• Offline recommendation
– Cannot respond to what user clicked in the last few minutes
• Offline Similarity calculation
– Cannot be updated in real-time
– Cannot parse multi-concept queries
51
Software status
• Java 8
• Git
• Apache Maven 3.X
• Elasticsearch v5.X
• Kibana v4
• Apache Spark v2.0.0
• Apache Tomcat 7.X
• https://github.com/mudrod/mudrod
52
What remains to be done
• Support for Solr
• Add faceted search to the UI and backend
• Support for real-time ranking, recommendation,
similarity calculation
• Support for DOMS and OceanXstreams
53
Real-time ranking
54
Real-time recommendation
User clicked
metadata
Knowledge base
Metadata based
similar datasets
Top-k datasets
Hybrid similarity
calculator
Recommendation results
sequential clicked
datasets in a session
Weblogs
Train
model
55
Proposed timeline
55
Sept. –
Oct.
Nov. –
Dec.
Jan. –
Feb.
March -
April
May -
June
July –
Aug.
Support for Solr
Add faceted search
Support for DOMS &
OceanXstreams
Real-time ranking
Real-time
recommendation
Real-time similarity
56
Research points
• Semantic similarity
– Online matrix analysis
– Query parsing
• Ranking
– Generate training data from user clicks
– Online deep learning algorithm
• Near real-time log ingesting
– Spark Streaming APIs
– Real-time sessionization
• High performance log mining
– Use parallel computing to speed up
• Recommendation
– Session-based real-time recommendation
– Personalized recommendation
57
Summary
• Log mining enables a data portal integrating implicit user
preferences
• Word similarity retrieved by data mining tasks expands any given
query to improve search recall and precision.
• The rich set of ranking features and the ML algorithm provide
substantial advantages over using other ranking methods
• The recommendation algorithm can discover latent data relevancy
• The proposed architecture enables the loosely coupled software
structure of a data portal and avoids the cost of replacing the existing
system
58
Presentation Contents
• Background and Objectives
• TRL assessment
• Summary of Accomplishments and Plans Forward
• Schedule
• Financials
• Publications
• List of Acronyms
59
Project Schedule
60
Jun'1
5
Jul'15
Aug
'15
Sep'1
5
Oct
'16
Nov
'16
Dec
'16
Jan
'16
Feb
'16
Mar
'16
Apr
'16
May
'16
Jun
'16
Jul
'16
Aug
'16
Sep'1
6
Oct
'17
Nov
'17
Dec
'17
Jan
'17
Feb
'17
Mar
'17
Apr
'17
May
'17
Jun
'17
Funding Plan 208 208 208 208 208 208 208 208 208 208 208 208 392 392 392 392 392 392 392 392 392 392 392 392 392
Cum Obligation Plan 2 3 6 8 10 14 23 30 33 41 48 67 91 120 144 167 200 230 253 273 292 343 370 390 392
Cum Actuals 2 3 6 8 10 14 23 30 33 41 48 67 84.3 106.7 132.1 141.2 159.7 179.7 200.1 226.1 255.1 285.5 339.1 366.7
Variance 0 0 0 0 0 0 0 0 0 0 0 0 -7 -13 -12 -26 -40 -50 -53 -47 -37 -58 -31 -23
0
50
100
150
200
250
300
350
400
450
JPL Part: $K
Costs Status and Plan
61
Costs Status and Plan
Funding Plan 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 648,838 648,838 648,838
Cumulative Cost Actual 0 0 2498 9993 28880 56770 66100 74279 107932 126371 151600 181645 206108 218265 255809 323578 404836 434623 442374 461156 478063 482765 524561 578369 594214 628852 648838
Cumulative Cost Plan - - 2498 9993 28880 56770 66100 74279 107932 126371 151600 181645 208693 254642 301097 365107 411031 455801 471465 493314 508058 540992 561021 595815 616823 632186 648838
Variance 0 -2499 2499 -2499 2499 -2499 2499 -2499 2499 -2499 5083 31293 13995 27535 -21340 42518 -13426 45585 -15589 73817 -37356 54802 -32193 35527 -35528
• Actual costs are below or above plan
– Due to the GMU payment cycle is different from end of month
• Impact: None
62
Online Products
1. PO.DAAC Labs website https://podaac.jpl.nasa.gov/podaac_labs
2. PO.DAAC forum page
https://podaac.jpl.nasa.gov/forum/viewforum.php?f=88
3. Open Source with Github and Apache License 2.0:
https://github.com/mudrod/mudrod
4. Demo Site: http://mudrod.jpl.nasa.gov/
5. PD Leverage: http://pd.cloud.gmu.edu/
63
Demonstration
MUDROD deployed in PO.DAAC Labs
• https://podaac.jpl.nasa.gov/podaac_labs
64
Publications
Journal / Conference Papers (over 10 peer review publications)
1. Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access
Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54.
2. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access
log mining. IEEE Oceans 2016.
3. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the
Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of
Geographical Information Science (minor revision)
4. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial
discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
5. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender
System based on Metadata and User Behaviour (revision)
6. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A smart web-based data discovery
system for ocean sciences. (ongoing)
7. Yang, C., et al., 2017. Big Data and cloud computing: innovation opportunities and challenges. International Journal of
Digital Earth, 10(1), pp.13-53. (the 2nd most read paper of IJDE in it’s decadal history)
8. Yang, C.P., Yu, M., Xu, M., Jiang, Y., Qin, H., Li, Y., Bambacus, M., Leung, R.Y., Barbee, B.W., Nuth, J.A. and Seery, B., 2017,
March. An architecture for mitigating near earth object's impact to the earth. In Aerospace Conference, 2017 IEEE (pp. 1-13). IEEE
9. Yu, M. and Yang, C., 2016. Improving the Non-Hydrostatic Numerical Dust Model by Integrating Soil Moisture and Greenness
Vegetation Fraction Data with Different Spatiotemporal Resolutions. PloS one, 11(12), p.e0165616.
10. Vance, T.C., Merati, N., Yang, C. and Yuan, M., 2016, September. Cloud computing for ocean and atmospheric science. In OCEANS
2016 MTS/IEEE Monterey (pp. 1-4). IEEE. (also as a book with the same name)
11. Liu, K., Yang, C., Li, W., Gui, Z., Xu, C. and Xia, J., 2016. Using semantic search and knowledge reasoning to improve the discovery
of earth science records: an example with the ESIP semantic testbed. In Mobile Computing and Wireless Networks: Concepts,
Methodologies, Tools, and Applications (pp. 1375-1389). IGI Global.
12. Xia, J., Yang, C., Liu, K., Gui, Z., Li, Z., Huang, Q. and Li, R., 2015. Adopting cloud computing to optimize spatial web portals for
better performance to support Digital Earth and other global geospatial initiatives. International Journal of Digital Earth, 8(6), pp.451-
475.
65
Students Cultivated
• Ph.D. Dissertations focused on MUDROD: Yongyao Jiang and
Yun Li (GMU), will complete in phase 2 (OceanWorks)
• Graduate Students worked on tasks; Fei Hu, Kai Liu (graduate in
2017 summer), Manzhu Yu (graduate in 2017 summer), Han Qin
(graduated), Mengchao Xu, Ishan Shams (GMU)
• Undergraduate students: Joseph Michael/Univ. of Richmond
(HBCU), Alena/Virginia Tech (Female)
• Other (Approximately 20 presentations at AGU, ESIP, AAG,
GeoInformatics, and other national and international conferences)
66
Acronyms
List of Acronyms
• MUDROD Mining and Utilizing Dataset Relevancy from
Oceanographic Dataset
• HTTP Hypertext Transfer Protocol
• FTP File Transfer Protocol
• SWEET Semantic web for earth and environmental
terminology
• PO.DAAC Physical Oceanography Distributed Active
Archive Center
• LSA Latent Semantic Analysis
• SVM Support vector machine
• NDCG Normalized Discounted Cumulative Gain
67
1. NASA AIST Program (NNX15AM85G)
2. PO.DAAC SWEET Ontology Team (Initially funded by
ESTO)
3. Hydrology DAAC Rahul Ramachandran (providing the
earlier version of NOESIS)
4. ESDIS for providing testing logs of CMR
5. All team members at JPL and GMU
Acknowledgements

Más contenido relacionado

La actualidad más candente

Giving researchers credit for data
Giving researchers credit for dataGiving researchers credit for data
Giving researchers credit for dataJisc
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemCameron Kiddle
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...cscpconf
 
Federated Search: The Good, The Bad And The Ugly
Federated Search: The Good, The Bad And The UglyFederated Search: The Good, The Bad And The Ugly
Federated Search: The Good, The Bad And The Uglydorishelfer
 
ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015Juan Antonio Vizcaino
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringIRJET Journal
 
Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library servicesNikesh Narayanan
 
Cedar Overview
Cedar OverviewCedar Overview
Cedar Overviewjbgraybeal
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_pptManant Sweet
 
A survey on Design and Implementation of Clever Crawler Based On DUST Removal
A survey on Design and Implementation of Clever Crawler Based On DUST RemovalA survey on Design and Implementation of Clever Crawler Based On DUST Removal
A survey on Design and Implementation of Clever Crawler Based On DUST RemovalIJSRD
 
Query optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query managementQuery optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query managementijdms
 
using a semantic approach for a cataloguing service
using a semantic approach for a cataloguing serviceusing a semantic approach for a cataloguing service
using a semantic approach for a cataloguing serviceDesconnets Jean-Christophe
 
Open domain question answering system using semantic role labeling
Open domain question answering system using semantic role labelingOpen domain question answering system using semantic role labeling
Open domain question answering system using semantic role labelingeSAT Publishing House
 
Alternative impact measures for open access documents?
Alternative impact measures for open access documents? Alternative impact measures for open access documents?
Alternative impact measures for open access documents? uherb
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013Frauke Ziedorn
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Research Data Alliance
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused webcsandit
 

La actualidad más candente (20)

Giving researchers credit for data
Giving researchers credit for dataGiving researchers credit for data
Giving researchers credit for data
 
An On-line Collaborative Data Management System
An On-line Collaborative Data Management SystemAn On-line Collaborative Data Management System
An On-line Collaborative Data Management System
 
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...
 
Federated Search: The Good, The Bad And The Ugly
Federated Search: The Good, The Bad And The UglyFederated Search: The Good, The Bad And The Ugly
Federated Search: The Good, The Bad And The Ugly
 
ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015ProteomeXchange_and_PRIDE_Semmeting_2015
ProteomeXchange_and_PRIDE_Semmeting_2015
 
Perception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document ClusteringPerception Determined Constructing Algorithm for Document Clustering
Perception Determined Constructing Algorithm for Document Clustering
 
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and ApplicationsSemantics-enhanced Geoscience Interoperability, Analytics, and Applications
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
 
Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library services
 
The WSTIERIA Project – A Web of Services
The  WSTIERIA Project – A Web of ServicesThe  WSTIERIA Project – A Web of Services
The WSTIERIA Project – A Web of Services
 
Pieper NISO Virtual Conf Feb17
Pieper NISO Virtual Conf Feb17Pieper NISO Virtual Conf Feb17
Pieper NISO Virtual Conf Feb17
 
Cedar Overview
Cedar OverviewCedar Overview
Cedar Overview
 
Sekhon final 1_ppt
Sekhon final 1_pptSekhon final 1_ppt
Sekhon final 1_ppt
 
A survey on Design and Implementation of Clever Crawler Based On DUST Removal
A survey on Design and Implementation of Clever Crawler Based On DUST RemovalA survey on Design and Implementation of Clever Crawler Based On DUST Removal
A survey on Design and Implementation of Clever Crawler Based On DUST Removal
 
Query optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query managementQuery optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query management
 
using a semantic approach for a cataloguing service
using a semantic approach for a cataloguing serviceusing a semantic approach for a cataloguing service
using a semantic approach for a cataloguing service
 
Open domain question answering system using semantic role labeling
Open domain question answering system using semantic role labelingOpen domain question answering system using semantic role labeling
Open domain question answering system using semantic role labeling
 
Alternative impact measures for open access documents?
Alternative impact measures for open access documents? Alternative impact measures for open access documents?
Alternative impact measures for open access documents?
 
PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013PIDs and DOI registration with DataCite - IATUL Workshop 2013
PIDs and DOI registration with DataCite - IATUL Workshop 2013
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
Pdd crawler a focused web
Pdd crawler  a focused webPdd crawler  a focused web
Pdd crawler a focused web
 

Similar a MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

Using Feedback from Data Consumers to Capture Quality Information on Environm...
Using Feedback from Data Consumers to Capture Quality Information on Environm...Using Feedback from Data Consumers to Capture Quality Information on Environm...
Using Feedback from Data Consumers to Capture Quality Information on Environm...Anusuriya Devaraju
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...AKSHAY BHAGAT
 
Jisc Research Data Discovery Service Project
Jisc Research Data Discovery Service ProjectJisc Research Data Discovery Service Project
Jisc Research Data Discovery Service ProjectJisc RDM
 
UK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceUK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceLizLyon
 
UKRDDS Project Overview - Feb 2016
UKRDDS Project Overview - Feb 2016UKRDDS Project Overview - Feb 2016
UKRDDS Project Overview - Feb 2016Christopher Brown
 
IEDA Overview & Updates, March 2014
IEDA Overview & Updates, March 2014IEDA Overview & Updates, March 2014
IEDA Overview & Updates, March 2014iedadata
 
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...Ben Blaiszik
 
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...The University of Edinburgh
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data GenerationFilip Radulovic
 
RDM Roadmap to the Future, or: Lords and Ladies of the Data
RDM Roadmap to the Future, or: Lords and Ladies of the DataRDM Roadmap to the Future, or: Lords and Ladies of the Data
RDM Roadmap to the Future, or: Lords and Ladies of the DataRobin Rice
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...SEAD
 
Data catalog
Data catalogData catalog
Data catalogiamtodor
 
Research Data Service at the University of Edinburgh
Research Data Service at the University of EdinburghResearch Data Service at the University of Edinburgh
Research Data Service at the University of EdinburghRobin Rice
 
"Data in Context" IG sessions @ RDA 3rd Plenary
"Data in Context" IG sessions @  RDA 3rd Plenary"Data in Context" IG sessions @  RDA 3rd Plenary
"Data in Context" IG sessions @ RDA 3rd PlenaryBrigitte Jörg
 
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Brigitte Jörg
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube
 
SC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCSC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCJohn Cobb
 
Jisc Research Data Shared Service - Spring Update
Jisc Research Data Shared Service - Spring UpdateJisc Research Data Shared Service - Spring Update
Jisc Research Data Shared Service - Spring UpdateJisc RDM
 

Similar a MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access (20)

Using Feedback from Data Consumers to Capture Quality Information on Environm...
Using Feedback from Data Consumers to Capture Quality Information on Environm...Using Feedback from Data Consumers to Capture Quality Information on Environm...
Using Feedback from Data Consumers to Capture Quality Information on Environm...
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Jisc Research Data Discovery Service Project
Jisc Research Data Discovery Service ProjectJisc Research Data Discovery Service Project
Jisc Research Data Discovery Service Project
 
UK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalfaceUK Digital Curation Centre: enabling research data management at the coalface
UK Digital Curation Centre: enabling research data management at the coalface
 
UKRDDS Project Overview - Feb 2016
UKRDDS Project Overview - Feb 2016UKRDDS Project Overview - Feb 2016
UKRDDS Project Overview - Feb 2016
 
IEDA Overview & Updates, March 2014
IEDA Overview & Updates, March 2014IEDA Overview & Updates, March 2014
IEDA Overview & Updates, March 2014
 
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
 
MUDROD - Ranking
MUDROD - RankingMUDROD - Ranking
MUDROD - Ranking
 
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
Perspectives on the Role of Trustworthy Repository Standards in Data Journal ...
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
Linked Energy Data Generation
Linked Energy Data GenerationLinked Energy Data Generation
Linked Energy Data Generation
 
RDM Roadmap to the Future, or: Lords and Ladies of the Data
RDM Roadmap to the Future, or: Lords and Ladies of the DataRDM Roadmap to the Future, or: Lords and Ladies of the Data
RDM Roadmap to the Future, or: Lords and Ladies of the Data
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
 
Data catalog
Data catalogData catalog
Data catalog
 
Research Data Service at the University of Edinburgh
Research Data Service at the University of EdinburghResearch Data Service at the University of Edinburgh
Research Data Service at the University of Edinburgh
 
"Data in Context" IG sessions @ RDA 3rd Plenary
"Data in Context" IG sessions @  RDA 3rd Plenary"Data in Context" IG sessions @  RDA 3rd Plenary
"Data in Context" IG sessions @ RDA 3rd Plenary
 
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
Data in Context Interest Group Sessions @ RDA 3rd Plenary, Dublin (March 26-2...
 
EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013EarthCube Monthly Community Webinar- Nov. 22, 2013
EarthCube Monthly Community Webinar- Nov. 22, 2013
 
SC13 BoF: RDA and HPC
SC13 BoF: RDA and HPCSC13 BoF: RDA and HPC
SC13 BoF: RDA and HPC
 
Jisc Research Data Shared Service - Spring Update
Jisc Research Data Shared Service - Spring UpdateJisc Research Data Shared Service - Spring Update
Jisc Research Data Shared Service - Spring Update
 

Último

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

MUDROD - Mining and Utilizing Dataset Relevancy from Oceanographic Dataset Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

  • 1. 1 Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access Dr. Chaowei (Phil) Yang (PI, George Mason University) Mr. Edward M Armstrong (Co-I, Jet Propulsion Laboratory, NASA) AIST-14-82, NNX15AM85G Final Review June 22, 2017
  • 2. 2 Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD) PI: Chaowei (Phil) Yang, George Mason University TRLin = 5 4/09 Objective • Improve data discovery, selection and access to NASA Observational Data. • Intuitive interface to federated data holdings. • Enable new user communities to discover and access data for their projects. • Reduce time for scientists to discover, download and reformat data. • Implement extensible ontology framework. • Improve discovery accuracy of oceanographic data • Foundation for Managing Big Data. • Demonstrate MUDROD at PO.DAAC. Approach • Setup collaboration, testing environment. • Design MUDROD knowledge base system. • Reconstruct sessions from web logs • Calculate vocabulary similarity from logs and metadata. • Update semantic search and conduct alpha testing. • Integrate MUDROD alpha into P.O.DAAC. • Enhance knowledge base, to include GCMD. • Demonstrate Prototype. 03/16 Key Milestones AIST-14-82 NNX15AM85G CoIs: E. Armstrong, T. Huang, D. Moroni, JPL; • Start 06/15 • Identify Use cases 01/16 • Design search, query, reasoning engine 03/16 • Ontological System Implementation 07/16 • Complete Beta test at P.O. DAAC 12/16 • Integrated test 02/17 • PO.DAAC metadata discovery Demo (TRL 7) 05/17 MUDROD Engine Architecture
  • 3. 3 Mining and Utilizing Dataset Relevancy from Oceanographic Data (MUDROD) PI: Chaowei (Phil) Yang, George Mason University TRLin = 5 TRLout = 7 4/09 Objective • Improve data discovery, selection and access to NASA Observational Data. • Intuitive interface to federated data holdings. • Enable new user communities to discover and access data for their projects. • Reduce time for scientists to discover, download and reformat data. • Implement extensible ontology framework. • Improve discovery accuracy of oceanographic data • Foundation for Managing Big Data. • Demonstrate MUDROD at PO.DAAC. Accomplishments • Developed framework • Analyze web logs to discover user knowledge (query and data relationships) • Construct knowledge base by combining semantics and profile analyzer • Improve data discover • Resulted in • better ranking; • recommendation; • ontology navigation 03/16 AIST-14-0082 CoIs: T. Huang, D. Moroni, E. Armstrong, JPL;
  • 4. 4 Mr. Edward M Armstrong (Senior Data Engineer , NASA Jet Propulsion Laboratory) Mr. David Moroni (Data Engineer, NASA Jet Propulsion Laboratory) Team Members: Investigators Dr. Chaowei Phil Yang (Professor and Director, NSF Spatiotemporal Innovation Center, George Mason University) Mr. Thomas Huang (Principal Scientist, NASA Jet Propulsion Laboratory)
  • 5. 5 Team Members: Students and Post-Docs Ms. Yun Li (Ph.D. student, George Mason University; Research Assistant ). Mr. Yongyao Jiang (Ph.D. student, George Mason University; Research Assistant ). Dr. Lewis J. McGibbney (Data Scientist, NASA Jet Propulsion Laboratory). Mr. Chris Finch, (Software Engineer, NASA Jet Propulsion Laboratory). Mr. Frank Greguska, (Software Engineer, NASA Jet Propulsion Laboratory) . Mr. Gary Chen
  • 6. 6 Presentation Contents • Background / Objectives / Technical Advancements • TRL assessment • Summary of Accomplishments and Plans • Schedule • Financials • Publications • List of Acronyms
  • 7. 7 Background / Objectives / Tech Advance (1) • Keyword-based matching (traditional search engines) – User query: ocean wind – Final query: ocean AND wind • Reveal the real intent of user query – ocean wind = “ocean wind” OR “greco” OR “surface wind” OR “mackerel breeze” … • PO.DAAC UWG Recommendation 2014-07 • NASA ESDSWG Search Relevance Recommendations 2016 & 2017 Data Discovery Problems
  • 8. 8 Background / Objectives / Tech Advance (2) • Analyze web logs to discover user knowledge (query and data relationships) • Construct knowledge base by combining semantics and profile analyzer • Improve data discovery by 1) better ranking; 2) recommendation; 3) ontology navigation Objectives
  • 9. 9 Background / Objectives / Tech Advance (2) • Web log preprocessing • Semantic analysis of user queries & Navigation • Machine learning based search ranking • Data Recommendation Tech Advance (four technological modules)
  • 10. 10 Presentation Contents • Background and Objectives • TRL assessment • Summary of Accomplishments and Plans • Schedule • Financials • Publications • List of Acronyms
  • 11. 11 Accomplishments Since last Review TRL Assessment (1) • Performance improvement • Improved log mining performance using cloud computing • Hybrid Recommendation algorithm • Stem MUDROD ontologies from a number of sources • create an OWL manifestation based on NASA GCMD-DIF data model • create a PO.DAAC-specific extension of the SWEET concept • Ingest PO.DAAC extractions into ESIP Ontology Portal • Improve web UI • Test MUDROD and ingest latest PO.DAAC logs • Deploy MUDROD on JPL server
  • 12. 12 Current TRL Rational TRL Assessment (2) Component Current TRL Project end TRL Description Semantic search engine Search Dispatcher 7 7 Translating a user search query into a set of new semantic queries Similarity calculator 7 7 Calculating the semantic similarity from weblogs, metadata, and ontology Recommendation module 7 7 Recommending similar datasets to the clicked dataset Ranking module 7 7 Re-ranking the search results based on RankSVM ML algorithm Knowledge base Ontology 7 7 Extensions from SWEET ontology for earth science data Triple Store 7 7 ESIP ontology repository Vocabulary linkage discovery engine Profile analyzer 7 7 Extracting user browsing pattern from raw web logs Web services/GUI Ranking service/presenter 7 7 Providing and presenting the ranked results Recommendation service/presenter 7 7 Providing and presenting the related datasets Ontology navigation service/presenter 7 7 Providing and presenting related searches
  • 13. 13 Presentation Contents • Background and Objectives • TRL assessment • Summary of Accomplishments and Plans • Schedule • Financials • Publications • List of Acronyms
  • 14. 14 Overview • Web log preprocessing • Semantic analysis of user queries & Navigation • Machine learning based search ranking • Data Recommendation Functions/Modules
  • 16. 16 • Requests sent from client e.g. browser, cmd line tool, etc. recorded by server • Log files provided by PO.DAAC (HTTP(S), FTP) Client IP: 68.180.228.99 Request date/time: [31/Jan/2015:23:59:13 -0800] Request: " GET /datasetlist/... HTTP/1.1 " HTTP Code: 200 Bytes returned: 84779 Referrer/previous page: “/ghrsst/" User agent/browser: "Mozilla/5.0 ... 68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] "GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..." Web logs
  • 17. 17 Goal: reconstruct user browsing pattern (search history & clickstream) from a set of raw logs Web logs User identification Crawler detection Structure reconstruction Session identification Search history Clickstrea m Additional steps include: word normalization, stop words removal, and stemming Data preprocess
  • 19. 19 Data preprocess results 1. User search history 2. Clickstream Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) “Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery” ISPRS International Journal of Geo-Information, 5, 54.
  • 21. 21 Existing ontology (SWEET) • SWEET (Raskin and Pan 2003) • Focus on only two relations • The closer, the more similar
  • 22. 22 User search history • Create query – user matrix • Calculate binary cosine similarity Conceptual example
  • 23. 23 Clickstream • Hypothesis: similar queries can result in similar clicking behavior • If two queries are similar, the data that get clicked after they are searched would be more likely to be similar Query b Data Query a Similar?
  • 24. 24 Metadata • Hypothesis: semantically related terms tend to appear in the same metadata more frequently • Essentially the same as the clickstream analysis • Perform Latent Semantic Analyses (LSA) over the term – metadata matrix Query b Metadata Query a Similar?
  • 25. 25 Integration • All four results could be converted to • Problem: – None of them are perfect (uncertainty in data, hypothesis and method) – Metadata and ontology might have unknown terms to search engine end users – Sometimes, similarity values from different methods are inconsistent Concept A Concept B Similarity
  • 26. 26 Integration • The maximum similarity of all of the components (large similarity appears to be more reliable) • The adjustment increment becomes larger when the similarity exists in more sources
  • 28. 28 Results and evaluation Query Search history Clickstream Metadata SWEE T Integrated list ocean temperature sea surface temperature(0.66), sea surface topography(0.56), ocean wind(0.56), aqua(0.49) sea surface temperature(0.94), sst(0.94), group high resolution sea surface temperature dataset(0.89), ghrsst(0.87) sst(0.96), ghrsst(0.77), sea surface temperature(0.72), surface temperature(0.63), reynolds(0.58) None sst(1.0), sea surface temperature(1.0), ghrsst(1.0), group high resolution sea surface temperature dataset(0.99), reynolds sea surface temperature(0.74) Sample group Overall accuracy Most popular 10 queries 88% Least popular 10 queries 61% Randomly selected 10 queries 83% By domain experts Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision)
  • 30. 30 Background • Ranking is a long-standing problem in geospatial data discovery • Typically, hundreds, even thousands of matches • Can get larger as more Earth observation data is being collected
  • 31. 31 Objective and Methods • Put the most desired data to the top of the result list • What features can represent users’ search preferences for geospatial data? • How can the ranking function reach a balance of all these features? • Identified eleven features from – Geospatial metadata attributes – Query – metadata content overlap – User behavior from web logs
  • 32. 32 Ranking features – Metadata attributes Features Description Release date The date when the data was published Processing level (PL) The processing level of image products, ranging from level 0 to level 4. Version number The publish version of the data Spatial resolution The spatial resolution of the data Temporal resolution The temporal resolution of the data • Five metadata features • Verified by domains experts • Query-independent: static, depends on the data itself, won’t change with the query
  • 33. 33 Spatial query-metadata overlap • Spatial similarity between query area and the coverage of a particular data • Overlap area normalized by the original area of query and data
  • 34. 34 Ranking features – User behavior • All-time, monthly, user popularity, and semantic popularity (retrieved from web logs) • Semantic popularity: the number of times that the data has been clicked after searching a particular query and its highly related ones (query-dependent)
  • 35. 35 RankSVM • One of the well-recognized ML ranking algorithm • Convert a ranking problem into a classification problem that a regular SVM algorithm can solve • 3 main steps  1) Standardize: mean = 0, std = 1 o SVM is not scale invariant o Over-optimized o Longer to train  2) For any pair of training data, calculate the difference  3) A ranking problem becomes a binary classification problem, where SVM is applied to find the optimal decision boundary
  • 36. 36 Architecture • All of these (except for training) can be finished within 2 seconds • None of the open source mainstream ML library provide any ranking algorithm • Implemented it by ourselves with the aid of Spark MLlib Index User query Semantic query Top K retrieval Ranking model Learning algorithm Training data Re-ranked results Feature extractor Weblogs User clicksKnowledge base
  • 37. 37 NDCG (K) for five different ranking methods at varying K (1-40) Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
  • 39. 39 How to recommend geospatial data? • Use geospatial metadata for content-based recommendation -Metadata spatiotemporal similarity -Metadata attribute similarity -Metadata content similarity • Leverage user behaviors data for CF recommendation -Session-based co-occurrence of data
  • 40. 40 Attribute type Attribute name Attribute description Spatiotemporal attributes DatasetCoverage-EastLon The East longitude of the bounding rectangle DatasetCoverage-WestLon The West longitude of the bounding rectangle DatasetCoverage-NorthLat The North latitude of the bounding rectangle DatasetCoverage-SouthLat The South latitude of the bounding rectangle DatasetCoverage-StartTimeLong The start time of the data DatasetCoverage-StopTimeLong The end time of the data Categorical geographic attributes DatasetRegion-Region Region of dataset. Such as global, Atlantic Dataset-ProjectionType Project type like cylindrical lat-lon Dataset-ProcessingLevel Data processing level DatasetPolicy-DataFormat Data format e.g. HDF, NetCDF DatasetSource-Sensor-ShortName Short name of sensor Ordinal geographic attributes Dataset-TemporalResolution Temporal resolution of dataset Dataset-TemporalRepeat Temporal resolution of dataset Dataset-SpatialResolution Spatial resolution of dataset Descriptive attributes Dataset-description Describe the content of the dataset Geographic metadata
  • 41. 41 • Spatial variables: NorthLat, SouthLat, WestLon, EastLon • Temporal variables: DatasetCoverage-StartTimeLong, StopTimeLong • Use volume overlap ratio to calculate similarity 𝑠𝑝𝑎𝑡𝑖𝑜𝑡𝑒𝑚𝑝𝑜𝑟𝑎𝑙 𝑠𝑖𝑚 𝑟 𝑖,𝑟 𝑗 = 𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖 ∩ 𝑟𝑗 𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖 + 𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑖 ∩ 𝑟𝑗 𝑣𝑜𝑙𝑢𝑚𝑒 𝑟𝑗 ∗ 0.5 𝑣𝑜𝑙𝑢𝑚𝑒 𝑟 = 𝑒𝑎𝑠𝑡𝑙𝑜𝑛 − 𝑤𝑒𝑠𝑡𝑙𝑜𝑛 ∗ 𝑠𝑜𝑢𝑡ℎ𝑙𝑎𝑡 − 𝑛𝑜𝑟𝑡ℎ𝑙𝑎𝑡 ∗ 𝑒𝑛𝑑𝑡𝑖𝑚𝑒 − 𝑠𝑡𝑎𝑟𝑡𝑖𝑚𝑒 Spatiotemporal similarity
  • 42. 42 • Fixed number of values • No intrinsic ordering • sensor-name: "AMSR-E", "MODIS", "AVHRR-3" and "WindSat" 𝑐𝑎𝑡𝑒𝑔𝑜𝑟𝑖𝑐𝑎𝑙_𝑣𝑎𝑟_𝑠𝑖𝑚 𝑣𝑖, 𝑣𝑗 = 𝑣 𝑖 ∩ 𝑣 𝑗 𝑣 𝑖 ∪ 𝑣 𝑗 Categorical similarity
  • 43. 43 Ordinal attribute is similar to categorical attribute but its values has a clear order, e.g. spatial resolution • Converted into rank from 1 to R • Nominalize these ranks for similarity calculation 𝑜𝑟𝑑𝑖𝑛𝑎𝑙 𝑣𝑎𝑟 𝑠𝑖𝑚 𝑣 𝑖,𝑣 𝑗 = 1 − 𝑛𝑜𝑟𝑚 𝑟𝑎𝑛𝑘 𝑣 𝑖 − 𝑛𝑜𝑟𝑚 𝑟𝑎𝑛𝑘 𝑣 𝑗 𝑛𝑜𝑟𝑚_𝑟𝑎𝑛𝑘 𝑣𝑖 = 𝑅𝑎𝑛𝑘𝑣 𝑖+1 𝑅+1 Ordinal similarity
  • 44. 44 Original text Aquarius Level 3 sea surface salinity (SSS) standard mapped image data contains gridded 1 degree spatial resolution SSS averaged over daily, 7 day, monthly, and seasonal time scales. This particular data set is the seasonal climatology, Ascending sea surface salinity product for version 4.0 of the Aquarius data set, which is the official end of prime mission public data release from the AQUARIUS/SAC-D mission. Only retrieved values for Ascending passes have been used to create this product. The Aquarius instrument is onboard the AQUARIUS/SAC-D satellite, a collaborative effort between NASA and the Argentinian Space Agency Comision Nacional de Actividades Espaciales (CONAE). The instrument consists of three radiometers in push broom alignment at incidence angles of 29, 38, and 46 degrees incidence angles relative to the shadow side of the orbit. Footprints for the beams are: 76 km (along-track) x 94 km (cross-track), 84 km x 120 km and 96km x 156 km, yielding a total cross-track swath of 370 km. The radiometers measure brightness temperature at 1.413 GHz in their respective horizontal and vertical polarizations (TH and TV). A scatterometer operating at 1.26 GHz measures ocean backscatter in each footprint that is used for surface roughness corrections in the estimation of salinity. The scatterometer has an approximate 390km swath. Extracted terms Radiometers Measure Brightness Temperature,AQUARIUS/SAC Mission,Image Data,Broom Alignment,Resolution SSS,AQUARIUS/SAC Satellite,Scatterometer,Scatterometer,Aquarius Data,Argentinian Space Agency Comision Nacional,Incidence Angles,Time Scales,Actividades Espaciales,Cross-track Swath,Official End,Aquarius Instrument,Shadow Side,Ascending Sea Surface Salinity Product,Level,Level,Surface Roughness Corrections,Data Release,Salinity,Density, Salinity,Density, AQUARIUS,L3,SSS,SMIA,SEASONAL-CLIMATOLOGY,V4 Step 1: Phrase extraction 1. Extract term candidates from metadata description with POS (part of speech) Tagging 2. Introduce “occurrence” and “strength” to filter out terms from candidates. “occurrence”: occurrences number of terms “strength”: the number of words in a term Descriptive similarity
  • 45. 45 Step 2: Represent metadata in the phrase vector space (The dimension lower than word feature space) Term 1 Term 2 Term 3 Term 4 Term 5 Term 6 Term 7 … Term N dataser1 1 0 1 0 0 0 0 1 dataser2 0 0 0 1 1 0 0 0 … dataset k 1 0 1 0 0 0 0 1 Step 3: Calculate cosine similarity Metadata abstract semantic similarity
  • 46. 46 Session 1 Session 2 … Session N Data 1 1 0 1 Data 2 0 0 0 … Data k 1 0 1 Calculate metadata similarity based on session level co-occurrence Similarity (i,j) = 𝑁(𝑖ᴖ𝑗) 𝑁(𝑖)∗ 𝑁(𝑗) N(i): The number of sessions in which dataset i was viewed or download N(j): The number of sessions in which dataset j was viewed or download N(𝑖ᴖ𝑗): The number of sessions in which both dataset i and j were viewed or download Session based recommendation
  • 47. 47 Recommendation method Pros Cons Strategy Descriptive similarity 1. Natural language processing methods can be adopted to find latent semantic relationship 1. Many datasets has nearly same abstract with few words/values changed 2. It is hard to extract detailed attributed from description Used as the basic method of recommendation algorithm Attribute similarity (spatiotemporal, ordinal, categorical) 1. As structured data, geographic metadata have many variables. 1. Variable values may be null or wrong 2. The quality depends on the weight assigned to every variable As supplement to semantic similarity Session concurrence 1. Reflect users’ preference 1. Cold start problem: Newly published data don’t have usage data Fine-tune recommendation list Recommend (i) = 𝑊 𝑠𝑠 ∗ 𝐷𝑒𝑠𝑐𝑟𝑖𝑝𝑡𝑖𝑣𝑒𝑐 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑖 + ∑𝑊 𝑐𝑣 ∗ 𝐶𝑎𝑡𝑒𝑔𝑜𝑟𝑖𝑎𝑙 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑖) + ∑𝑊 𝑜𝑣 ∗ 𝑂𝑟𝑑𝑖𝑎𝑛𝑙 𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦(𝑖) + 𝑊𝑠𝑡𝑣 ∗ 𝑆𝑝𝑎𝑡𝑖𝑜𝑇𝑒𝑚𝑝𝑜𝑟𝑎𝑙𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 + 𝑊𝑠𝑜 ∗ 𝑆𝑒𝑠𝑠𝑖𝑜𝑛𝑆𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 Hybrid recommendation
  • 48. 48 Quantitative Evaluation Hybrid similarity outperform other similarities since it integrates metadata attributes and user preference. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision)
  • 49. 49 Plan Forward • Add more features (e.g., temporal similarity) • Create training data from web logs for RankSVM • Develop a query understanding module to better interpret user’s search intent (e.g. “ocean wind level 3” -> “ocean wind” AND “level 3”) • Support Solr • Support near real-time data ingestion to dynamically update knowledge base • Integration with DOMS and OceanXtremes to support an ocean science analytics center • Leverage advanced computing techniques to speed up the process
  • 50. 50 What has been done • Offline/sequential log analysis – Time and computation intensive • Ranking – Use manually labeled data as train data – RankSVM • Offline recommendation – Cannot respond to what user clicked in the last few minutes • Offline Similarity calculation – Cannot be updated in real-time – Cannot parse multi-concept queries
  • 51. 51 Software status • Java 8 • Git • Apache Maven 3.X • Elasticsearch v5.X • Kibana v4 • Apache Spark v2.0.0 • Apache Tomcat 7.X • https://github.com/mudrod/mudrod
  • 52. 52 What remains to be done • Support for Solr • Add faceted search to the UI and backend • Support for real-time ranking, recommendation, similarity calculation • Support for DOMS and OceanXstreams
  • 54. 54 Real-time recommendation User clicked metadata Knowledge base Metadata based similar datasets Top-k datasets Hybrid similarity calculator Recommendation results sequential clicked datasets in a session Weblogs Train model
  • 55. 55 Proposed timeline 55 Sept. – Oct. Nov. – Dec. Jan. – Feb. March - April May - June July – Aug. Support for Solr Add faceted search Support for DOMS & OceanXstreams Real-time ranking Real-time recommendation Real-time similarity
  • 56. 56 Research points • Semantic similarity – Online matrix analysis – Query parsing • Ranking – Generate training data from user clicks – Online deep learning algorithm • Near real-time log ingesting – Spark Streaming APIs – Real-time sessionization • High performance log mining – Use parallel computing to speed up • Recommendation – Session-based real-time recommendation – Personalized recommendation
  • 57. 57 Summary • Log mining enables a data portal integrating implicit user preferences • Word similarity retrieved by data mining tasks expands any given query to improve search recall and precision. • The rich set of ranking features and the ML algorithm provide substantial advantages over using other ranking methods • The recommendation algorithm can discover latent data relevancy • The proposed architecture enables the loosely coupled software structure of a data portal and avoids the cost of replacing the existing system
  • 58. 58 Presentation Contents • Background and Objectives • TRL assessment • Summary of Accomplishments and Plans Forward • Schedule • Financials • Publications • List of Acronyms
  • 60. 60 Jun'1 5 Jul'15 Aug '15 Sep'1 5 Oct '16 Nov '16 Dec '16 Jan '16 Feb '16 Mar '16 Apr '16 May '16 Jun '16 Jul '16 Aug '16 Sep'1 6 Oct '17 Nov '17 Dec '17 Jan '17 Feb '17 Mar '17 Apr '17 May '17 Jun '17 Funding Plan 208 208 208 208 208 208 208 208 208 208 208 208 392 392 392 392 392 392 392 392 392 392 392 392 392 Cum Obligation Plan 2 3 6 8 10 14 23 30 33 41 48 67 91 120 144 167 200 230 253 273 292 343 370 390 392 Cum Actuals 2 3 6 8 10 14 23 30 33 41 48 67 84.3 106.7 132.1 141.2 159.7 179.7 200.1 226.1 255.1 285.5 339.1 366.7 Variance 0 0 0 0 0 0 0 0 0 0 0 0 -7 -13 -12 -26 -40 -50 -53 -47 -37 -58 -31 -23 0 50 100 150 200 250 300 350 400 450 JPL Part: $K Costs Status and Plan
  • 61. 61 Costs Status and Plan Funding Plan 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 346,984 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 603,149 648,838 648,838 648,838 Cumulative Cost Actual 0 0 2498 9993 28880 56770 66100 74279 107932 126371 151600 181645 206108 218265 255809 323578 404836 434623 442374 461156 478063 482765 524561 578369 594214 628852 648838 Cumulative Cost Plan - - 2498 9993 28880 56770 66100 74279 107932 126371 151600 181645 208693 254642 301097 365107 411031 455801 471465 493314 508058 540992 561021 595815 616823 632186 648838 Variance 0 -2499 2499 -2499 2499 -2499 2499 -2499 2499 -2499 5083 31293 13995 27535 -21340 42518 -13426 45585 -15589 73817 -37356 54802 -32193 35527 -35528 • Actual costs are below or above plan – Due to the GMU payment cycle is different from end of month • Impact: None
  • 62. 62 Online Products 1. PO.DAAC Labs website https://podaac.jpl.nasa.gov/podaac_labs 2. PO.DAAC forum page https://podaac.jpl.nasa.gov/forum/viewforum.php?f=88 3. Open Source with Github and Apache License 2.0: https://github.com/mudrod/mudrod 4. Demo Site: http://mudrod.jpl.nasa.gov/ 5. PD Leverage: http://pd.cloud.gmu.edu/
  • 63. 63 Demonstration MUDROD deployed in PO.DAAC Labs • https://podaac.jpl.nasa.gov/podaac_labs
  • 64. 64 Publications Journal / Conference Papers (over 10 peer review publications) 1. Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54. 2. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access log mining. IEEE Oceans 2016. 3. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision) 4. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision) 5. Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision) 6. Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A smart web-based data discovery system for ocean sciences. (ongoing) 7. Yang, C., et al., 2017. Big Data and cloud computing: innovation opportunities and challenges. International Journal of Digital Earth, 10(1), pp.13-53. (the 2nd most read paper of IJDE in it’s decadal history) 8. Yang, C.P., Yu, M., Xu, M., Jiang, Y., Qin, H., Li, Y., Bambacus, M., Leung, R.Y., Barbee, B.W., Nuth, J.A. and Seery, B., 2017, March. An architecture for mitigating near earth object's impact to the earth. In Aerospace Conference, 2017 IEEE (pp. 1-13). IEEE 9. Yu, M. and Yang, C., 2016. Improving the Non-Hydrostatic Numerical Dust Model by Integrating Soil Moisture and Greenness Vegetation Fraction Data with Different Spatiotemporal Resolutions. PloS one, 11(12), p.e0165616. 10. Vance, T.C., Merati, N., Yang, C. and Yuan, M., 2016, September. Cloud computing for ocean and atmospheric science. In OCEANS 2016 MTS/IEEE Monterey (pp. 1-4). IEEE. (also as a book with the same name) 11. Liu, K., Yang, C., Li, W., Gui, Z., Xu, C. and Xia, J., 2016. Using semantic search and knowledge reasoning to improve the discovery of earth science records: an example with the ESIP semantic testbed. In Mobile Computing and Wireless Networks: Concepts, Methodologies, Tools, and Applications (pp. 1375-1389). IGI Global. 12. Xia, J., Yang, C., Liu, K., Gui, Z., Li, Z., Huang, Q. and Li, R., 2015. Adopting cloud computing to optimize spatial web portals for better performance to support Digital Earth and other global geospatial initiatives. International Journal of Digital Earth, 8(6), pp.451- 475.
  • 65. 65 Students Cultivated • Ph.D. Dissertations focused on MUDROD: Yongyao Jiang and Yun Li (GMU), will complete in phase 2 (OceanWorks) • Graduate Students worked on tasks; Fei Hu, Kai Liu (graduate in 2017 summer), Manzhu Yu (graduate in 2017 summer), Han Qin (graduated), Mengchao Xu, Ishan Shams (GMU) • Undergraduate students: Joseph Michael/Univ. of Richmond (HBCU), Alena/Virginia Tech (Female) • Other (Approximately 20 presentations at AGU, ESIP, AAG, GeoInformatics, and other national and international conferences)
  • 66. 66 Acronyms List of Acronyms • MUDROD Mining and Utilizing Dataset Relevancy from Oceanographic Dataset • HTTP Hypertext Transfer Protocol • FTP File Transfer Protocol • SWEET Semantic web for earth and environmental terminology • PO.DAAC Physical Oceanography Distributed Active Archive Center • LSA Latent Semantic Analysis • SVM Support vector machine • NDCG Normalized Discounted Cumulative Gain
  • 67. 67 1. NASA AIST Program (NNX15AM85G) 2. PO.DAAC SWEET Ontology Team (Initially funded by ESTO) 3. Hydrology DAAC Rahul Ramachandran (providing the earlier version of NOESIS) 4. ESDIS for providing testing logs of CMR 5. All team members at JPL and GMU Acknowledgements