Archiving the French Web:
the BnF web archiving workflow
Sara Aubry
Web Archiving Project Manager, IT department
Bibliothèque nationale de France
International Conference on Web archives and e-LD
Biblioteca Nacional de España, Madrid, July 9th 2013
Let’s start with some figures
• Programme start in 2000, industrialisation in 2008-2012
• Collections:
– 1996 - now
– 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
– 18.8 billion URLs, 370 TB, growing by about 100 TB per year
• Resources:
– 9 Full Time Employees (5 librarians, 4 engineers)
– many partners inside and outside the Library, at both national and international levels
– 70 robots (648GB RAM, 144 CPUs 2.4GHz)
Digital curation is not different!
• « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »
Definition of Digital Archiving in Wikipedia
BnF workflow overview
Selecting
Collecting
Indexing
Accessing
Preserving
nas_preload
Selecting with BCWeb
• A form-based application, commonly called a « curator tool »
– for content curators and researchers to nominate websites to harvest
– giving basic information about them (content policies, trends watch)
• Most important information for each website:
– Internet address/URL
– frequency (daily, monthly, yearly, once…)
– size/budget (small, medium, big)
– depth (entire domain, part of it)
Content curators
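As an illustration only, the four pieces of information listed above could be held in a small record like the one below. This is a sketch, not the BCWeb data model: the class and field names are hypothetical, and the frequency values are limited to the four the slide names, although the slide's list is open-ended.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART = "part of it"


@dataclass(frozen=True)
class Nomination:
    """One nominated website (hypothetical record, not the BCWeb schema)."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth


# Example nomination; the values are chosen for illustration only.
example = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
```

Using enumerations rather than free text keeps each nomination within the fixed choices a form-based tool would offer.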
The Web is made of HTML pages
1 HTML page, 48 URLs
• 1 HTML
• 1 text/css
• 4 javascript
• 17 image/png
• 5 image/jpeg
• 21 image/gif
all links and inclusions are URL references
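To make the point concrete, here is a minimal sketch of how the URL references in a single HTML page can be listed, using only Python's standard library. It is not how Heritrix extracts links; the tag/attribute table is a deliberately small assumption and the sample page is invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    """Collects link and inclusion URLs found in one HTML page."""

    # Attribute that carries a URL, per tag (a deliberately small subset).
    URL_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's own URL.
                self.references.append(urljoin(self.base_url, value))


def extract_references(html, base_url):
    parser = ReferenceExtractor(base_url)
    parser.feed(html)
    return parser.references


# Invented sample page: one stylesheet, one script, one image, one link.
page = """
<html><head>
<link rel="stylesheet" href="/style.css">
<script src="app.js"></script>
</head><body>
<img src="logo.png">
<a href="http://example.org/next.html">next</a>
</body></html>
"""
refs = extract_references(page, "http://example.org/index.html")
```

Each entry in `refs` is an absolute URL that a harvester would have to fetch separately to reproduce the page.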
Harvesting with Heritrix
• A harvester is a piece of software (crawler, spider, robot)
• Simulates what a person would do with a browser, but repeatedly and very fast
• Follows a looping process
• Repeats as long as new, in-scope URLs are found and limits (budget and time) are not reached
Crawl loop: Pick a location → Make a request → Receive a response → Examine for references → Save the content (WARC), then repeat
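The loop above can be sketched in a few lines. This is a toy model under stated assumptions, not Heritrix code: the function names are hypothetical, the "web" is an in-memory dictionary so that the example runs without network access, saved content goes into a plain dictionary rather than WARC files, and only a URL budget is modelled (no time limit).

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seeds, fetch, extract, in_scope, budget):
    """Run the pick/request/receive/examine/save loop.

    fetch(url) returns content or None; extract(url, content) returns URLs.
    Stops when the frontier is empty or `budget` URLs have been saved.
    """
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}  # stands in for WARC output
    while frontier and len(saved) < budget:
        url = frontier.popleft()            # pick a location
        content = fetch(url)                # make a request, receive a response
        if content is None:
            continue
        for ref in extract(url, content):   # examine for references
            if ref not in seen and in_scope(ref):
                seen.add(ref)
                frontier.append(ref)
        saved[url] = content                # save the content
    return saved


# Toy in-memory "web": URL -> referenced URLs (an assumption for the demo).
TOY_WEB = {
    "http://example.fr/": ["http://example.fr/a", "http://other.com/x"],
    "http://example.fr/a": ["http://example.fr/", "http://example.fr/b"],
    "http://example.fr/b": [],
    "http://other.com/x": [],
}


def toy_fetch(url):
    links = TOY_WEB.get(url)
    return None if links is None else " ".join(links)


def toy_extract(url, content):
    return content.split()


def fr_only(url):
    # Scope rule for the demo: keep only hosts under .fr.
    return (urlparse(url).hostname or "").endswith(".fr")


archive = crawl(["http://example.fr/"], toy_fetch, toy_extract, fr_only, budget=10)
```

The `seen` set is what keeps the loop from revisiting a URL, and the scope test together with the budget is what makes it terminate.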
Assets:
- open source
- small and large scale
- textual or all-media formats
- data structures
Digital curators: legal deposit department
Engineers: IT department
Challenges:
• rich media and ever-changing environment
• social networks
• content behind paywalls (news sites, ebooks)
Piloting the crawls with NetarchiveSuite
• Prepare, schedule, run and monitor harvests of websites, perform QA
Digital curators: legal deposit department
Engineers: IT department
Offering access with Wayback
• Give readers the ability to browse the web “as it was” with:
– a regular web browser
– search and redisplay software
• An application called “Web archives”
– Wayback: for URL search, display and browsing
– Nutch prototype for keyword search
– Guided paths for collection highlights
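Browsing the web “as it was” implies choosing, for a given URL, which archived capture to display for the date the reader asks for. The sketch below shows one simple way to pick the nearest capture. It is an illustration under assumptions, not Wayback's implementation: the function name and the capture dates are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


def closest_capture(captures, wanted):
    """Return the capture datetime nearest to `wanted`.

    `captures` must be a non-empty list of datetimes sorted in ascending order.
    """
    i = bisect_left(captures, wanted)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c - wanted))


# Hypothetical capture dates for one URL (illustrative values only).
captures = [
    datetime(2010, 3, 1),
    datetime(2011, 6, 15),
    datetime(2013, 1, 20),
]
hit = closest_capture(captures, datetime(2011, 1, 1))
```

Because the capture list is sorted, the binary search keeps the lookup fast even when a URL has many captures.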
Challenges:
• links with our main Catalogue and open data repository
• “smart” URL search
• full text search and indexing
• small-scale data mining projects with researchers
Questions?
E-mail: sara.aubry@bnf.fr
Web site: http://www.bnf.fr
Twitter: http://twitter.com/DLWebBnF

 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Archiving the French Web: the BnF web archiving workflow. Sara Aubry

  • 1. Archiving the French Web: the BnF web archiving workflow
    Sara Aubry, Web Archiving Project Manager, IT department, Bibliothèque nationale de France
    International Conference on Web archives and e-LD, Biblioteca Nacional de España, Madrid, July 9th 2013
  • 2. Let’s start with some figures
    • Programme started in 2000, industrialisation in 2008-2012
    • Collections:
      – 1996 - now
      – 20 000 websites for focused crawls, 2.5 million .fr domains for broad crawls
      – 18.8 billion URLs, 370 TB, growing by +100 TB / year
    • Resources:
      – 9 Full Time Employees (5 librarians, 4 engineers)
      – many partners inside and outside the Library, at both the national and international level
      – 70 robots (648 GB RAM, 144 CPUs at 2.4 GHz)
  • 3. Digital curation is not different!
    • « Actions, tools and practices defined and applied to collect, identify, select, organize and preserve digital contents (…) in order to use them and make them available (…) »
      Definition of Digital Archiving in Wikipedia
  • 6. Selecting with BCWeb
    • A form-based application, commonly called a « curator tool »
      – for content curators and researchers to nominate websites to harvest
      – giving basic information about them (content policies, trends watch)
    • Most important information for each website:
      – Internet address/URL
      – frequency (daily, monthly, yearly, once…)
      – size/budget (small, medium, big)
      – depth (entire domain, part of it)
    Content curators
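The four pieces of information listed for each nominated website can be pictured as a small record. The sketch below is only an illustration: the class and field names (`Nomination`, `Frequency`, `Budget`, `Depth`) are hypothetical and do not come from BCWeb, whose real data model is not described in the slides. The frequency values are the ones the slide names, and the slide's "…" suggests there are others.

```python
from dataclasses import dataclass
from enum import Enum
from urllib.parse import urlparse


class Frequency(Enum):
    # Only the values named on the slide; the real list may be longer.
    DAILY = "daily"
    MONTHLY = "monthly"
    YEARLY = "yearly"
    ONCE = "once"


class Budget(Enum):
    SMALL = "small"
    MEDIUM = "medium"
    BIG = "big"


class Depth(Enum):
    ENTIRE_DOMAIN = "entire domain"
    PART = "part of it"


@dataclass(frozen=True)
class Nomination:
    """Hypothetical record for one website nominated by a curator."""
    url: str
    frequency: Frequency
    budget: Budget
    depth: Depth

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) address.
        parts = urlparse(self.url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute web address: {self.url!r}")


nomination = Nomination(
    url="http://www.bnf.fr",
    frequency=Frequency.MONTHLY,
    budget=Budget.MEDIUM,
    depth=Depth.ENTIRE_DOMAIN,
)
```

Using enumerations rather than free text keeps the form's choices closed, which is one plausible reason a form-based tool would store them this way.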
  • 7. The Web is made of HTML pages
    1 HTML page, 48 URLs
    • 1 HTML
    • 1 text/css
    • 4 javascript
    • 17 image/png
    • 5 image/jpeg
    • 21 image/gif
    all links and inclusions are URL references
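A short tally of the breakdown on this slide, as a sketch. The counts are copied from the slide; the variable names are illustrative. Summing the six entries gives 49, or 48 once the HTML page itself is set aside. Reading the slide's "48 URL" as the resources referenced by the page, rather than the page plus its resources, is an assumption made here.

```python
# Counts per content type, copied from the slide.
breakdown = {
    "HTML": 1,
    "text/css": 1,
    "javascript": 4,
    "image/png": 17,
    "image/jpeg": 5,
    "image/gif": 21,
}

total_entries = sum(breakdown.values())                # page + everything it pulls in
referenced_only = total_entries - breakdown["HTML"]    # assumption: exclude the page itself
image_count = sum(n for kind, n in breakdown.items() if kind.startswith("image/"))

print(total_entries, referenced_only, image_count)
```

Either way, images account for 43 of the entries, which is why a harvester has to follow inclusions as well as links to capture a page that displays correctly.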
  • 8. Harvesting with Heritrix
    • A harvester is a piece of software (crawler, spider, robot)
    • Simulates what a person would do with a browser, but repeatedly and very fast
    • Follows a looping process
    • Repeated as long as new, in-scope URLs are found and limits (budget and time) are not reached
    Loop steps (diagram): Pick a location → Make a request → Receive a response → Examine for references → Save the content
    Diagram label: WARC
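The looping process can be sketched in a few lines of Python. This is a toy illustration, not Heritrix code: the function names (`crawl`, `fake_fetch`, `same_host`), the in-memory `FAKE_WEB` dictionary standing in for the network, and the use of a simple URL count as the "budget" are all assumptions made for the example. A real harvester would also write WARC files, respect politeness delays and enforce a time limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values, i.e. links and inclusions."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.refs.append(value)


def crawl(seeds, fetch, in_scope, budget):
    """Breadth-first crawl; returns {url: body} for every saved URL."""
    frontier = deque(seeds)
    seen = set(seeds)
    saved = {}
    # Loop while there are URLs left and the budget is not exhausted.
    while frontier and len(saved) < budget:
        url = frontier.popleft()            # pick a location
        body = fetch(url)                   # make a request, receive a response
        if body is None:
            continue
        parser = LinkExtractor()            # examine for references
        parser.feed(body)
        for ref in parser.refs:
            absolute = urljoin(url, ref)
            if absolute not in seen and in_scope(absolute):
                seen.add(absolute)
                frontier.append(absolute)
        saved[url] = body                   # save the content
    return saved


# A tiny fake website so the sketch runs without network access.
FAKE_WEB = {
    "http://example.fr/": '<a href="/a.html">a</a><img src="logo.png">'
                          '<a href="http://other.org/">elsewhere</a>',
    "http://example.fr/a.html": '<a href="/">home</a><a href="/b.html">b</a>',
    "http://example.fr/b.html": "<p>leaf page</p>",
    "http://example.fr/logo.png": "",
}


def fake_fetch(url):
    return FAKE_WEB.get(url)


def same_host(url):
    return urlparse(url).netloc == "example.fr"


archive = crawl(["http://example.fr/"], fake_fetch, same_host, budget=10)
```

The `seen` set is what stops the loop from revisiting the home page when `a.html` links back to it, and the `in_scope` test is what keeps `other.org` out of the frontier.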
  • 9. Assets:
    – open source
    – small and large scale
    – textual or all-media formats
    – data structures
  • 11. Engineers: IT department
    Challenges:
    • rich media and ever-changing environment
    • social networks
    • content behind paywalls (news sites, ebooks)
  • 12. Piloting the crawls with NetarchiveSuite
    • Prepare, schedule, run and monitor harvests of websites, perform QA
    Digital curators: legal deposit department
    Engineers: IT department
  • 13. Offering access with Wayback
    • Give readers the ability to browse the web “as it was” with:
      – a regular web browser
      – search and redisplay software
    • An application called “Web archives”
      – Wayback: for URL search, display and browsing
      – Nutch prototype for keyword search
      – Guided paths for collection highlights
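Browsing the web “as it was” comes down to picking, for a given URL, the capture nearest to the date the reader asks for. The sketch below illustrates that idea only. It assumes captures are identified by 14-digit `YYYYMMDDhhmmss` timestamps, the convention commonly used by Wayback-style replay tools. The function names and the sample timestamps are made up for the example and say nothing about how the BnF application is implemented.

```python
from datetime import datetime


def parse_timestamp(ts):
    """Parse a 14-digit YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(captures, wanted):
    """Return the capture timestamp nearest in time to `wanted`."""
    target = parse_timestamp(wanted)
    return min(captures, key=lambda ts: abs(parse_timestamp(ts) - target))


# Made-up capture dates for a single URL.
captures = ["20080115120000", "20100601083000", "20130709000000"]

best = closest_capture(captures, "20100101000000")
```

Here the reader asks for 1 January 2010 and gets the June 2010 capture, which is about five months away, rather than the January 2008 one, which is almost two years away.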
  • 17. Challenges:
    • links with our main Catalogue and open data repository
    • “smart” URL search
    • full text search and indexing
    • small-scale data mining projects with researchers
  • 18. Questions?
    E-mail: sara.aubry@bnf.fr
    Web site: http://www.bnf.fr
    Twitter: http://twitter.com/DLWebBnF