1

WebSmatch: a platform
 for data and metadata
       integration
      Remi Coletta, Emmanuel Castanier,
                Patrick Valduriez,
Christian Frisch, DuyHoa Ngo, Zohra Bellahsene
2




                      Motivations
Context: open data in France
Problems
   •   High number of data sources
   •   Heterogeneous formats
   •   Poorly structured
Example (DataPublica): a web crawl for French open data
sources found 148,509 Excel files and only 369 RDF files
Needs: integrate and visualize data sources to yield
high-value information


3




              www.data-publica.com
Business: marketplace for open data
Functions: crawl, classify, document and reference data
sources in a search engine
Data is extracted and structured in a database so that it can
be visualized and accessed through APIs
Problem: scale to high numbers of heterogeneous, poorly
structured sources




4




                DataPublica Workflow

DataPublica provides more than 10,000 XLS files (from sources
such as INSEE and various public organizations)
WebSmatch is integrated in their workflow




5




                Example of input
URL: http://www.data-publica.com/publication/4736


                                    Problem: where are the
                                    data and the metadata?
                                    Incomplete lines,
                                    unnamed attributes

                                    Existing tools such
                                    as OpenII or Google
                                    Refine work only on
                                    clean files



6




                Example of input
URL: http://www.data-publica.com/publication/4736


                                      Find data table
                                      Remove blank lines
                                      or columns
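
A minimal sketch of the blank-line clean-up, assuming the sheet has already been read into a list of rows. The function name strip_blank and the sample values are illustrative and do not come from WebSmatch.

```python
def strip_blank(rows):
    """Drop every row and every column in which all cells are empty."""
    def is_blank(cell):
        return cell is None or str(cell).strip() == ""

    # Pad ragged rows so that every row has the same number of columns.
    width = max((len(r) for r in rows), default=0)
    padded = [list(r) + [None] * (width - len(r)) for r in rows]

    kept_rows = [r for r in padded if not all(is_blank(c) for c in r)]
    kept_cols = [j for j in range(width)
                 if not all(is_blank(r[j]) for r in kept_rows)]
    return [[r[j] for j in kept_cols] for r in kept_rows]


# Made-up sheet with a blank first row, a blank spacer row and blank columns.
sheet = [
    ["", "", "", ""],
    ["", "Region", "", "2010"],
    ["", "", "", ""],
    ["", "Bretagne", "", 120],
]
cleaned = strip_blank(sheet)
print(cleaned)
```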




7




                Example of input
URL: http://www.data-publica.com/publication/4736


                                      Find metadata such
                                      as titles
                                      Identify collections
                                      for two-dimensional
                                      tables




8




                 WebSmatch workflow
Focus on metadata extraction service
This service is not used if the input is in a structured format
(such as RDF, RDFS, OWL...)




9




MetaData Extraction: XLS example



                          First step:
                          Table detection
                          using computer vision
                          operators (morphological
                          dilate/erode)
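
The slide does not detail how dilate/erode is applied, so the following is only a sketch under stated assumptions: the sheet is reduced to a 0/1 occupancy grid (1 = non-empty cell), a morphological closing (dilate, then erode) with a 3x3 window bridges one-cell gaps, and each remaining connected region is reported as one candidate table. All function names and the sample grid are hypothetical.

```python
from collections import deque


def _window(mask, i, j):
    """Values of the 3x3 window around (i, j); cells outside the grid count as 0."""
    h, w = len(mask), len(mask[0])
    return [mask[y][x] if 0 <= y < h and 0 <= x < w else 0
            for y in (i - 1, i, i + 1) for x in (j - 1, j, j + 1)]


def dilate(mask):
    """A cell becomes 1 if any cell of its 3x3 window is 1."""
    return [[int(any(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def erode(mask):
    """A cell stays 1 only if every cell of its 3x3 window is 1."""
    return [[int(all(_window(mask, i, j))) for j in range(len(mask[0]))]
            for i in range(len(mask))]


def close_gaps(mask):
    """Morphological closing on a zero-padded copy, cropped back afterwards."""
    w = len(mask[0])
    padded = [[0] * (w + 2)] + [[0] + list(r) + [0] for r in mask] + [[0] * (w + 2)]
    closed = erode(dilate(padded))
    return [row[1:-1] for row in closed[1:-1]]


def bounding_boxes(mask):
    """(top, left, bottom, right) of each 4-connected region of 1s."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j] or seen[i][j]:
                continue
            queue = deque([(i, j)])
            seen[i][j] = True
            top, left, bottom, right = i, j, i, j
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Made-up occupancy grid: one table with a blank spacer column (rows 0-1),
# three blank rows, then a second table (rows 5-6).
occupied = [
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
tables = bounding_boxes(close_gaps(occupied))
print(tables)
```

In this example the closing merges the two halves of the first table across its blank column, while the three blank rows keep the second table separate.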




10




MetaData Extraction: XLS example




                        Second step:
                        Attribute detection
                        using
                        machine learning
                        on cell content
                        and neighborhood
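
The slides name neither the features nor the learning algorithm. The sketch below covers only a possible feature-extraction step: it describes a cell by the kind of its own content and of its four neighbours, which a classifier could then use to decide whether the cell is an attribute name. The function names, the feature set and the sample values are assumptions made for illustration.

```python
def is_number(value):
    """True if the cell content parses as a number (comma accepted as decimal mark)."""
    try:
        float(str(value).replace(",", "."))
        return True
    except ValueError:
        return False


def cell_features(grid, i, j):
    """Describe cell (i, j) by its own content and by its four neighbours."""
    def get(y, x):
        # Cells outside the sheet are treated as empty.
        if 0 <= y < len(grid) and 0 <= x < len(grid[y]):
            return grid[y][x]
        return ""

    def kind(value):
        text = str(value).strip()
        if not text:
            return "empty"
        return "number" if is_number(text) else "text"

    return {
        "self": kind(get(i, j)),
        "above": kind(get(i - 1, j)),
        "below": kind(get(i + 1, j)),
        "left": kind(get(i, j - 1)),
        "right": kind(get(i, j + 1)),
    }


# Made-up sheet: a header row followed by two data rows.
grid = [
    ["Region", "Population"],
    ["Bretagne", "3200000"],
    ["Corse", "310000"],
]
header = cell_features(grid, 0, 1)  # a text cell sitting on top of numbers
value = cell_features(grid, 1, 1)   # a number with text above and to the left
print(header)
print(value)
```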




11




       MetaData Extraction: XLS example




Third step: automatic detection of concepts using YAM++
(14 matching techniques, such as string matching,
instance-based matching and WordNet-based matching)

YAM++ came 1st and 2nd at OAEI 2011: http://oaei.ontologymatching.org/2011/results/

12




                 WebSmatch Workflow
Focus on matching service
Relies on YAM++, combining different metrics (string, WordNet,
instance-based)
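
How YAM++ combines its metrics is not described on the slide. As a rough illustration of the general idea only, the sketch below scores a column against candidate concepts with two string metrics and one instance-based metric, then takes a weighted average. The weights, function names and sample data are assumptions, and the WordNet metric is left out.

```python
from difflib import SequenceMatcher


def edit_similarity(a, b):
    """Character-level similarity of two names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(items_a, items_b):
    """Jaccard overlap of two collections, in [0, 1]."""
    set_a, set_b = set(items_a), set(items_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def token_similarity(a, b):
    """Jaccard overlap of the word tokens of two names."""
    def split(name):
        return name.lower().replace("_", " ").split()
    return jaccard(split(a), split(b))


def combined_score(name_a, values_a, name_b, values_b, weights=(0.4, 0.2, 0.4)):
    """Weighted average of string, token and instance-based similarity."""
    scores = (
        edit_similarity(name_a, name_b),
        token_similarity(name_a, name_b),
        jaccard(values_a, values_b),  # instance-based: shared cell values
    )
    return sum(w * s for w, s in zip(weights, scores))


# Made-up column and candidate concepts.
column_name, column_values = "departement", ["Ain", "Aisne", "Allier"]
concepts = {
    "department": ["Ain", "Aisne", "Allier", "Ardennes"],
    "year": ["2009", "2010", "2011"],
}
best = max(concepts, key=lambda c: combined_score(column_name, column_values, c, concepts[c]))
print(best)
```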




13




                  Data Visualization
Structured export formats that are easy for third parties to use: DSPL
DSPL: Dataset Publishing Language, from Google; see
https://developers.google.com/public-data/
Two-dimensional tables must be denormalized, because DSPL
uses flat CSV files for data



                           =>
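
A minimal sketch of the denormalization step, assuming a cross-tab whose first row holds the column headers and whose first column holds the row headers. The function names, concept names and figures are made up for illustration; the rest of a DSPL dataset description is not covered here.

```python
import csv
import io


def denormalize(table, row_concept, col_concept, value_name):
    """Turn a cross-tab into one flat record per data cell."""
    header, *body = table
    col_labels = header[1:]
    records = [(row_concept, col_concept, value_name)]
    for row in body:
        row_label, *values = row
        for col_label, value in zip(col_labels, values):
            records.append((row_label, col_label, value))
    return records


def to_csv(records):
    """Serialize the flat records as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


# Made-up two-dimensional table: regions in rows, years in columns.
crosstab = [
    ["region", "2009", "2010"],
    ["Bretagne", 10, 12],
    ["Corse", 3, 4],
]
flat = denormalize(crosstab, "region", "year", "value")
print(to_csv(flat))
```

Each data cell of the cross-tab becomes one line of the flat file, with its row and column headers repeated as ordinary columns.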




14




     Exporting the Results: integrated metadata
How to make richer datasets: aggregation or intersection
   – using generic concepts such as time or location
   – finding a specific concept through matching
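
To illustrate the intersection case, here is a small sketch that joins two datasets on the generic concepts they share (location and time). It assumes both datasets have already been mapped to common concept names; the function name, field names and values are hypothetical.

```python
def join_on(concepts, left, right):
    """Inner join of two lists of dict records on the shared concept names."""
    index = {}
    for record in right:
        key = tuple(record[c] for c in concepts)
        index.setdefault(key, []).append(record)

    merged = []
    for record in left:
        key = tuple(record[c] for c in concepts)
        for other in index.get(key, []):
            merged.append({**other, **record})
    return merged


# Made-up datasets that share the concepts "location" and "year".
population = [
    {"location": "Bretagne", "year": 2010, "population": 100},
    {"location": "Corse", "year": 2010, "population": 20},
]
income = [
    {"location": "Bretagne", "year": 2010, "income": 5000},
    {"location": "Bretagne", "year": 2011, "income": 5200},
]
richer = join_on(("location", "year"), population, income)
print(richer)
```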




15




Visualizing the Results




16




      Visualizing the Results
http://api.data-publica.com/…/content.json?
limit=10&filter={revenue_fiscal_par_foyer:{$gt:25000}}
                     • Multi-format (json, xml, spreadsheet, csv)
                     • Geolocalized queries
                     • Mashups
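
A hedged sketch of how a client might assemble the query shown above. The path between the host and content.json is elided on the slide and stays elided here, so the URL is not directly usable. Sending the filter as JSON with quoted keys is an assumption (the slide writes the keys unquoted), and build_query is a hypothetical helper, not part of the DataPublica API.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_query(base_url, limit, filter_spec):
    """Append a limit and a JSON-encoded filter to base_url as query parameters."""
    params = {
        "limit": limit,
        # Assumption: the filter is passed as compact JSON.
        "filter": json.dumps(filter_spec, separators=(",", ":")),
    }
    return base_url + "?" + urlencode(params)


# The "…" is kept exactly as on the slide; the real path is not known.
url = build_query(
    "http://api.data-publica.com/…/content.json",
    limit=10,
    filter_spec={"revenue_fiscal_par_foyer": {"$gt": 25000}},
)
decoded = parse_qs(url.split("?", 1)[1])
print(url)
```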




17




                       Perspectives


1. Automating large-volume extraction: confidence / machine
   learning
2. Clustering documents (on specific concepts & concept
   instances)
3. Integration with other tools
     •   Google Refine
     •   RDF export



18




                       Conclusion


WebSmatch is a flexible environment for Open Data
integration
End-to-end process: importing, cleansing and
integrating data sources
DSPL export format for visualization
Real validation with DataPublica data sources




