Spatio-Temporal Pseudo Relevance Feedback
for Large-Scale and Heterogeneous Scientific
Repositories
Shinichi Takeuchi, Yuhei Akahoshi, Bun Theang Ong,
Komei Sugiura, and Koji Zettsu
National Institute of Information and Communications Tech., Japan
Background: Our target task is scientific data retrieval
• Why is scientific data retrieval important?
– Some funding agencies have started to request open access to research outcomes
– Open science data are useless unless
they are searchable
1 correct vs
9 incorrect
• Existing systems
– Portals: WDS Portal, Pangaea Portal, …
– Search engines: Google Fusion Tables, …
Examples of existing systems
• Google Fusion Tables
– https://research.google.com/tables?source=fthm
• Pangaea
– http://www.pangaea.de/
Difficulty: Text information is very limited
• Far less text is available than in web page search
– e.g., only 1.7% of Pangaea’s datasets have sufficient text data (an abstract)
Dataset attributes | # of datasets | Ratio [%]
With abstract | 7,028 | 1.7
With spatial info. | 404,145 | 99.6
With temporal info. | 297,478 | 73.3
With spatio-temporal info. | 297,037 | 73.2
Total (Pangaea) | 405,456 | 100.0
Definition: a “dataset” here is one that has metadata
cf. We have collected approx. 800,000 scientific datasets
Demo: Baseline has low recall
Conventional studies
• PRF = Pseudo (Blind) Relevance Feedback
Field | Example
Scientific data retrieval | Generation of spatio-temporal metadata [Pallickara+ 2010]; KVS for discretized spatio-temporal information [Fox+ 2013]
Original PRF | Validation with TREC tasks [Buckley+ 1995]
PRF applications | Microblog search, temporal expression extraction, … [Lioma+ 2008, Lv+ 2010, Chen+ 2013]
Main innovation and differentiation
• Pseudo relevance feedback using Space-Time-Text (STT) information
• Dataset similarity based on Bhattacharyya distance of spatio-temporal
probabilistic distributions
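As a rough illustration of what pseudo (blind) relevance feedback does, here is a minimal text-only sketch. It is not the authors' system: the overlap-count scorer, the term-frequency expansion rule, and the names `search`, `prf_search`, `top_l`, and `n_expand` are all assumptions made for this example.

```python
from collections import Counter

def text_score(query_terms, doc_terms):
    """Toy scorer: number of query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))

def search(query_terms, docs):
    """Rank document ids by descending toy score; drop zero-score documents."""
    scored = [(text_score(query_terms, terms), doc_id) for doc_id, terms in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def prf_search(query_terms, docs, top_l=2, n_expand=1):
    """Two-pass search: assume the top-L hits are relevant and expand the query."""
    first = search(query_terms, docs)[:top_l]
    # Count terms in the assumed-relevant hits that are not already in the query.
    counts = Counter(t for doc_id in first for t in docs[doc_id] if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(n_expand)]
    return search(list(query_terms) + expansion, docs)

# Hypothetical mini-collection: d3 has no query term but shares "core" with the top hits.
docs = {
    "d1": ["sediment", "core", "ocean"],
    "d2": ["sediment", "core", "lake"],
    "d3": ["core", "drilling"],
    "d4": ["typhoon", "wind"],
}
print(search(["sediment"], docs))      # ['d1', 'd2']
print(prf_search(["sediment"], docs))  # ['d1', 'd2', 'd3']
```

The second pass reaches `d3` even though it never mentions the original keyword, which is the recall gain that feedback is meant to provide. The deck's contribution is to run the same kind of expansion on spatial and temporal metadata, which most datasets have, in addition to text.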
Standard dataset example
[Figure: an example dataset record annotated with three parts: citation info (author, year, etc.), what is observed (sensory observations), and spatio-temporal info.]
Overview: Space-Time-Text (STT) query is used in the 2nd search
[Figure: system diagram with a Browser side (GUI input, GUI output) and a System side (index, DB search, clustering, retrieval). Labelled flow: text query → DB search → 1st search results → clustering of datasets into dataset clusters → text, space, and time query expansion (STT query expansion) → STT query → retrieval scored by text score, space score, and time score → 2nd search results.]
Proposed: Bhattacharyya distance is used for measuring similarity
between two spatio-temporal distributions
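The Space-Time-Text score $\phi(y)$ is defined as a simple linear combination of the space, time, and text scores:

$$\phi(y) = w_s\,\phi_s(y) + w_t\,\phi_t(y) + \phi_k(y)$$

The space score is based on the minimum distance from the top $L$ results $Y_L$:

$$\phi_s(y) = \exp\!\left(-\Big(\min_{y' \in Y_L} d_s(y, y')\Big)^{2}\right)$$

If we approximate distributions as Gaussians, Bhattacharyya distance can be written as follows:

$$d(y_i, y_j) = \frac{1}{8}\,(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)'\left(\frac{\Sigma_i + \Sigma_j}{2}\right)^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \frac{1}{2}\ln\frac{\det\!\left(\frac{\Sigma_i + \Sigma_j}{2}\right)}{\sqrt{\det\Sigma_i\,\det\Sigma_j}}$$

* Time score is calculated in the same manner
* Cosine distance is used as text score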
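The scoring on this slide can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: it handles only 2-D Gaussians (for example longitude and latitude) so that the matrix inverse and determinant have closed forms, and the function names, the sample datasets, and the default weights `w_s = w_t = 1.0` are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as ((a, b), (c, d))."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def bhattacharyya_2d(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two 2-D Gaussians."""
    # Averaged covariance (Sigma_i + Sigma_j) / 2
    s = [[(cov_i[r][c] + cov_j[r][c]) / 2.0 for c in range(2)] for r in range(2)]
    dx, dy = mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]
    # Quadratic form diff' * inverse(s) * diff via the closed-form 2x2 inverse
    quad = (s[1][1] * dx * dx - (s[0][1] + s[1][0]) * dx * dy + s[0][0] * dy * dy) / det2(s)
    log_term = math.log(det2(s) / math.sqrt(det2(cov_i) * det2(cov_j)))
    return quad / 8.0 + 0.5 * log_term

def proximity_score(candidate, top_results, distance):
    """exp(-(minimum distance to any of the top-L results)^2)."""
    d_min = min(distance(candidate, ref) for ref in top_results)
    return math.exp(-d_min ** 2)

def stt_score(space, time, text, w_s=1.0, w_t=1.0):
    """Linear combination of the space, time, and text scores."""
    return w_s * space + w_t * time + text

# Hypothetical datasets: (mean (lon, lat), 2x2 covariance) of the spatial extent.
identity = ((1.0, 0.0), (0.0, 1.0))
top_l = [((0.0, 0.0), identity), ((10.0, 10.0), identity)]
near = ((2.0, 0.0), identity)

dist = lambda a, b: bhattacharyya_2d(a[0], a[1], b[0], b[1])
space = proximity_score(near, top_l, dist)
print(dist(near, top_l[0]))  # 0.5
print(space)                 # exp(-0.25), about 0.78
print(stt_score(space, time=1.0, text=0.3))  # time and text scores are made-up inputs
```

Taking the minimum over the top-L results means a candidate scores well if it is close to any one of the first-pass hits, rather than to their average. The time score would follow the same pattern with a 1-D Gaussian.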
Experiment: We built a test set for evaluation
• No standard benchmark is available for scientific data retrieval
• Our test set
– Queries: scientific keywords
– Training/test datasets obtained from Pangaea
– Each label is the average over three expert labelers
Item | Size | Source of datasets
Queries (scientific keywords) | 50 | Cross-DB, Google Trends, Microsoft Academic Search, SWEET Ontology
Training/test datasets | 6,000 (120 × 50) | Top 120 Pangaea search results per query
Example queries: acid deposition, aerosol, air quality, atmospheric circulation, boreal forest, climate change, coastal waters, desert, glacier, global warming, heavy metal, hurricane, interannual variability, marine biology, ocean circulation, ozone, particulate matter, sea level pressure, sediment, soil pH, species richness, trade wind, typhoon, …
Qualitative example: query = “sediment”
[Figure: search results for the baseline and the proposed method, side by side. Green: correct (high relevance). Red: incorrect (low relevance).]
Experimental conditions: quantitative comparison
• Labeling by experts in natural science
– Labelers: holders of at least a master's degree
– Relevance: 0 (no relevance) to 3 (high relevance)
• Measures
– nDCG@k, Precision@k (P@k), Recall@k (R@k), Average Precision (AP)
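$$\mathrm{P@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fp@}k}$$

$$\mathrm{R@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fn@ALL}}$$

$$\mathrm{AP} = \frac{1}{N}\sum_{k=1}^{N} \mathrm{rel}(k)\,\mathrm{P@}k$$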
Method | Text PRF | Space-Time PRF
Baseline | No | No
Text-PRF | Yes | No
STT-PRF | Yes | Yes
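The set-based measures on this slide can be sketched as follows. Several points are assumptions, since the slide does not spell them out: relevance is treated as binary (the deck's 0 to 3 labels would need a threshold), the recall denominator is taken to be the total number of relevant datasets, and AP is normalised by that same total, which is the common convention. The function names and the sample ranking are hypothetical.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (rels holds 0/1 flags)."""
    top = rels[:k]
    return sum(top) / len(top) if top else 0.0

def recall_at_k(rels, k, n_relevant):
    """Fraction of all relevant datasets that appear in the top k."""
    return sum(rels[:k]) / n_relevant if n_relevant else 0.0

def average_precision(rels, n_relevant):
    """Sum of P@k over the ranks k that hold a relevant result, normalised."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / k  # P@k at a rank where rel(k) = 1
    return total / n_relevant if n_relevant else 0.0

# Hypothetical ranking: relevant results at ranks 1 and 3; 4 relevant datasets exist.
ranking = [1, 0, 1, 0, 0]
print(precision_at_k(ranking, 3))     # 2/3
print(recall_at_k(ranking, 3, 4))     # 0.5
print(average_precision(ranking, 4))  # (1/1 + 2/3) / 4
```

With few abstracts to match against, a baseline can keep precision reasonable while missing most relevant datasets, which is why the deck reports recall and AP alongside P@k.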
Quantitative result (1):
Text-PRF and STT-PRF improved Average Precision
[Figure: AP (y-axis, 0.0 to 0.7) versus ratio of datasets having an abstract [%] (x-axis: 0, 1, 2, 5, 10, 20, 50, 100) for Baseline, Text-PRF, and STT-PRF.]
STT-PRF beats baseline in
standard setups
Quantitative results (2): STT-PRF obtained best results in Recall, AP, and number of hits
Method | nDCG@30 | P@30 | R@30 | AP | #Hit
Baseline | 0.681 | 0.388 | 0.095 | 0.086 | 15.0
ST-PRF | 0.627 | 0.354 | 0.155 | 0.137 | 26.8
Text-PRF | 0.725 | 0.332 | 0.221 | 0.339 | 91.5
STT-PRF | 0.722 | 0.332 | 0.238 | 0.343 | 91.6
Ratio of datasets having abstract = 2% (simulating Pangaea’s condition)
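To put the table in relative terms, the short sketch below divides the STT-PRF figures by the baseline figures. The numbers are copied from the table above; the `gain` helper and the dictionary layout are just for illustration.

```python
# Figures copied from the results table (abstract ratio = 2%).
results = {
    "Baseline": {"R@30": 0.095, "AP": 0.086, "#Hit": 15.0},
    "STT-PRF": {"R@30": 0.238, "AP": 0.343, "#Hit": 91.6},
}

def gain(metric):
    """Ratio of the STT-PRF value to the baseline value for one metric."""
    return results["STT-PRF"][metric] / results["Baseline"][metric]

for metric in ("R@30", "AP", "#Hit"):
    print(f"{metric}: x{gain(metric):.2f}")
# R@30: x2.51
# AP: x3.99
# #Hit: x6.11
```

So under this setup STT-PRF has roughly 2.5 times the recall, 4 times the AP, and 6 times the hits of the baseline, while the table also shows P@30 dropping from 0.388 to 0.332.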
Future directions: Application to heterogeneous data
We have collected 1.25 million datasets (2.5 PB) as of January 2014
Asset category | Details
Physical sensor data | Wind, temperature, pressure, humidity, rainfall, snowfall, luminance, CO2, air quality, pollen allergy, radiation, typhoon, earthquake, landslide, infectious disease, etc. (49 sensors)
Social sensor data | Geo-tagged Twitter (JP, US, Sample, trend), Google news, RSS news
Web archive | Full-text data, sender data, reputation data, modification relation data
Science data | WDS metadata (40 domains, 25 sites from Pangaea, ICPSR, DRYAD, ESDS, ADA, etc.)
Open government data | Data.gov metadata
Geographical data | Landmarks, river-level data, shelter data
Text analysis data | Web text ontology, EDR concept dictionary, WordNet, sentiment dictionary
Language translation tools | VoiceTra text translation, JServer
Text analysis tools | Proper noun extractor, morphological analyzer, dependency parsing
GIS tools | Google Geocoding, Yahoo Contents Geocoder, landmark extractor, postal code search, GeoNLP
Speech tools | VoiceTra (speech recognition & synthesis), Rospeex
Summary
• Novelty of approach
– Pseudo relevance feedback using Space-Time-Text (STT)
information
• Results
– Proposed method improved Recall, AP, and #Hit under
practical setup
• Applications
– SNS and other geo-tagged
messages
