Mapping French Open
Data actors on the web
with Common Crawl
guillaume.lebourgeois@data-publica.com
@glebourg
Mining the Web at Data Publica
Different needs, different techniques
   ● Scraping
   ● Focused crawling
   ● Prospective crawling
Mining the Web at Data Publica
Scraping
  ● Identified resources
  ● Configured extractors
  ● Structured content
  ● Not scalable
Mining the Web at Data Publica
Focused crawling
  ● Identified entities
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Scalable
  ● Useful to get meta information on known
    entities
Mining the Web at Data Publica
Prospective crawling
  ● No starting point
  ● Fuzzy extraction
  ● Structured content using text-mining
  ● Very hard to scale
  ● Heavy resources needed: CPU, RAM, HDD

Using a third party makes your life easier!
From a crawl to a map
Goal: build a map of French open data
actors on the web
  ● As a graph
  ● Showing websites
From a crawl to a map
Using Common Crawl
  ● Large, fully accessible web crawl archives
  ● Good coverage of the French web
  ● Easy access via AWS / MapReduce jobs
From a crawl to a map
Working on the French web
 ● The .fr TLD is not a reliable way to detect French websites
 ● Detecting page language
 ● Giving websites a "frenchness" score
     ○ Sw = number of French pages / total number of pages
     ○ Cutoff chosen manually by testing on French
       websites
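
A minimal sketch of the frenchness score above, assuming the language of each page has already been detected upstream. The function name `frenchness_scores`, the `(url, lang)` input shape, and the 0.5 cutoff are illustrative assumptions, not values from the talk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def frenchness_scores(pages):
    """Return {host: Sw}, where Sw = French pages / total pages for that host.

    `pages` is an iterable of (url, lang) pairs; `lang` is the code produced
    by whatever language detector runs upstream (not specified in the slides).
    """
    french = defaultdict(int)
    total = defaultdict(int)
    for url, lang in pages:
        host = urlsplit(url).hostname
        if host is None:  # skip URLs without a host part
            continue
        total[host] += 1
        if lang == "fr":
            french[host] += 1
    return {host: french[host] / total[host] for host in total}


# Hypothetical cutoff; the slides only say it was tuned by hand.
CUTOFF = 0.5

pages = [
    ("http://example.fr/a", "fr"),
    ("http://example.fr/b", "fr"),
    ("http://example.fr/c", "en"),
    ("http://example.com/x", "en"),
]
scores = frenchness_scores(pages)
french_sites = sorted(host for host, s in scores.items() if s >= CUTOFF)
```

Scoring the whole site rather than single pages means a French site with a few English pages is still kept, while the cutoff filters out sites that are French only marginally.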
From a crawl to a map
Working on Open Data websites
 ● Building an Open Data "vocabulary"
 ● Detecting whether a page talks about Open
    Data
 ● Giving websites an "opendataness" score
     ○ Sw = number of Open Data pages / total number of pages
     ○ Cutoff chosen manually by testing on Open Data
       websites
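
One possible way to sketch the vocabulary-based detection and the opendataness score. The term list, the `min_hits` threshold, and both function names are hypothetical; the slides do not give the actual vocabulary or the page-level rule.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary; the real term list is not given in the slides.
VOCABULARY = ("open data", "opendata", "données ouvertes", "données publiques")


def is_open_data_page(text, min_hits=2):
    """Flag a page containing at least `min_hits` vocabulary occurrences."""
    normalized = re.sub(r"\s+", " ", text.lower())
    hits = sum(normalized.count(term) for term in VOCABULARY)
    return hits >= min_hits


def opendataness_scores(pages):
    """Return {site: Sw}, where Sw = Open Data pages / total pages.

    `pages` is an iterable of (site, text) pairs.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for site, text in pages:
        total[site] += 1
        if is_open_data_page(text):
            flagged[site] += 1
    return {site: flagged[site] / total[site] for site in total}


pages = [
    ("a.fr", "Open Data portal: toutes les données ouvertes"),
    ("a.fr", "Contact"),
    ("b.fr", "Recettes de cuisine"),
]
od_scores = opendataness_scores(pages)
```

Requiring more than one hit per page is a guess at how to limit false positives from pages that mention the topic only in passing.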
From a crawl to a map
Building graph
  ● Inside our subset
     ○ Inlinks
     ○ Outlinks
  ● Generating two files
     ○ nodes.csv (list of websites with an id)
     ○ edges.csv (directed links between websites)


[Diagram: Node A with two incoming links (inlinks) and one outgoing link (outlink)]
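
A sketch of how the two files could be produced, assuming page-level links are available as `(source_url, target_url)` pairs and the subset is a set of hostnames. The function names, the choice to drop links within a single site, and the column headers (`Id`, `Label`, `Source`, `Target`, chosen to match what Gephi's spreadsheet import usually expects) are assumptions, not details from the talk.

```python
import csv
from urllib.parse import urlsplit


def build_graph(links, subset):
    """Collapse page-level links into site-level directed edges inside `subset`."""
    edges = set()
    for source_url, target_url in links:
        source = urlsplit(source_url).hostname
        target = urlsplit(target_url).hostname
        # Keep only links between two different sites of the subset
        # (dropping same-site links is an assumption).
        if source in subset and target in subset and source != target:
            edges.add((source, target))
    node_ids = {site: i for i, site in enumerate(sorted(subset))}
    return node_ids, sorted((node_ids[s], node_ids[t]) for s, t in edges)


def write_graph(node_ids, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the node list and the directed edge list as two CSV files."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        writer.writerows((i, site) for site, i in node_ids.items())
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)


subset = {"a.fr", "b.fr"}
links = [
    ("http://a.fr/1", "http://b.fr/x"),
    ("http://a.fr/2", "http://b.fr/y"),  # same site pair, deduplicated
    ("http://a.fr/1", "http://c.com/"),  # target outside the subset
    ("http://b.fr/x", "http://b.fr/y"),  # internal link, ignored
]
node_ids, edges = build_graph(links, subset)
```

Calling `write_graph(node_ids, edges)` then writes `nodes.csv` and `edges.csv` to the current directory.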
From a crawl to a map
Building graph
  ● Links tell a lot about websites
     ○ Authorities
     ○ Hubs
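
As a rough illustration, authorities and hubs can be approximated by counting inlinks and outlinks per site. This degree count is a simplification: link-analysis algorithms such as HITS compute both scores iteratively, and the slides do not say which method was used. The sample edges and the name `degree_roles` are made up.

```python
from collections import Counter


def degree_roles(edges):
    """Count inlinks and outlinks per node from directed (source, target) edges."""
    inlinks = Counter(target for _, target in edges)
    outlinks = Counter(source for source, _ in edges)
    return inlinks, outlinks


# Made-up site-level edges for illustration.
edges = [
    ("blog", "portal"),
    ("company", "portal"),
    ("agency", "portal"),
    ("blog", "agency"),
    ("blog", "company"),
]
inlinks, outlinks = degree_roles(edges)
top_authority = inlinks.most_common(1)[0][0]  # most linked-to site
top_hub = outlinks.most_common(1)[0][0]       # site linking out the most
```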
From a crawl to a map
Visualizing graph using Gephi
  ● Load graph
  ● Spatialize graph
     ○ links between websites create "attraction", so
       linked websites appear near each other
     ○ the more inlinks, the bigger the node (= authority)
     ○ categorizing websites for better understanding (one
       color per category)
        ■ Companies, Non-profits/blogs, Government
           agencies
     ○ communities can now appear!
From a crawl to a map
Visualizing graph on the web
  ● Sigma.js
  ● Uses Gephi files
  ● Gives better interactivity
Analysis
● The final graph is a good way to understand
  interactions between actors
  ○ Open Data was definitely initiated by a non-profit
    movement
  ○ Companies are beginning to work on the subject
  ○ The French state has had only sporadic initiatives
    so far
● This graph will be generated again in the near
  future, to see changes in this ecosystem
Results
● Large scale crawl made easy
  ○ Easy to focus on mining the results instead of
    finding/storing the data
● Nice workflow from raw data to an
  understandable visualisation
● The final graph is a good way to understand
  interactions between actors
Feedback
● Common Crawl
  ○ Common Crawl does not yet have an exhaustive crawl of
    the French web
  ○ Data is not as fresh as it could be
  ○ It lacks an index giving O(1) access to at least domains,
    and maybe pages
● Methodology
  ○ Opendataness scoring can exclude websites that are
    relevant but not focused enough on open data
Resources
● http://webatlas.fr/tempshare/OpenDataActeursTypes.pdf
   ○ poster by Franck Ghitalla
● http://french-opendata.data-publica.com/index.html
   ○ dynamic visualisation of the results, by Data Publica
● http://fr.slideshare.net/willounet/a-sneak-peek-into-the-web-presentation
   ○ A sneak peek into the web, by GL
● http://french-opendata.data-publica.com/
   ○ Project host page
