The theory and practice of Website Archivability

  1. The Theory and Practice of Website Archivability: From CLEAR to ArchiveReady.com. Vangelis Banos (1), Yunhyong Kim (2), Seamus Ross (2), Yannis Manolopoulos (1). (1) Department of Informatics, Aristotle University, Thessaloniki, Greece; (2) University of Glasgow, United Kingdom.
  2. Table of Contents: 1. Problem definition, 2. CLEAR: A Credible Live Method to Evaluate Website Archivability, 3. Demo: http://archiveready.com/, 4. Future Work.
  3. Problem definition
      • Web content acquisition is a critical step in the process of web archiving.
      • Web bots face increasing difficulties in harvesting websites.
      • After web harvesting, archive administrators manually review the content and endorse or reject the harvested material.
      • Key problem: web harvesting is automated, while Quality Assurance (QA) is manual.
  4. What is Website Archivability? Website Archivability captures the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability or maintainability.
  5. CLEAR: A Credible Live Method to Evaluate Website Archivability
      • An approach to producing a credible on-the-fly measurement of Website Archivability, by:
      • Using standard HTTP to get website elements,
      • Evaluating information such as file types, content encoding and transfer errors,
      • Combining this information with an evaluation of the website's compliance with recognised practices in digital curation,
      • Using adopted standards, validating formats, assigning metadata,
      • Calculating a Website Archivability Score (0–100%).
  6. CLEAR: A Credible Live Method to Evaluate Website Archivability. Archivability Facets: Accessibility, Cohesion, Standards Compliance, Performance, Metadata.
  7. Website attributes evaluated using CLEAR
  8. CLEAR: the method can be summarised as follows:
      1. Perform specific Evaluations on Website Attributes.
      2. Use the results to calculate each Archivability Facet's score:
         • Scores range from 0 to 100%.
         • Not all evaluations are equal: if an important evaluation fails, its score is 0; if a minor evaluation fails, its score is 50%.
      3. Produce the final Website Archivability as the equally weighted sum of all Facets' scores, which amounts to their average.
  9. Accessibility: Are web archiving crawlers able to discover all content using standard protocols and best practices?
  10. Accessibility evaluation (http://ipres2013.ist.utl.pt/, Website Archivability evaluation on 23rd April 2013). Total Accessibility: 50%.
      • No RSS feed: 50%
      • No robots.txt: 50%
      • No sitemap.xml: 0%
      • 6 links, all valid: 100%
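
The rating rule from the method summary can be sketched against the Accessibility figures above. This is a minimal sketch, not the ArchiveReady implementation: the function names are hypothetical, which checks count as "important" is inferred from the ratings on the slide, and taking the facet total as the plain mean of its ratings is an assumption that happens to reproduce the 50% shown.

```python
# Ratings implied by the slides: pass = 100, minor failure = 50, important failure = 0.
PASS, MINOR_FAIL, IMPORTANT_FAIL = 100, 50, 0


def rate(passed, important):
    """Rate one evaluation (hypothetical helper, not part of any published API)."""
    if passed:
        return PASS
    return IMPORTANT_FAIL if important else MINOR_FAIL


def facet_total(ratings):
    """Assumption: a facet's total is the plain mean of its evaluation ratings."""
    return sum(ratings) / len(ratings)


accessibility = [
    rate(passed=False, important=False),  # no RSS feed
    rate(passed=False, important=False),  # no robots.txt
    rate(passed=False, important=True),   # no sitemap.xml
    rate(passed=True, important=False),   # 6 links, all valid
]
print(accessibility, facet_total(accessibility))
```
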
  11. Cohesion
      • Dependencies are a major issue in digital curation.
      • If a website is dispersed across different web locations (images, JavaScript, CSS, CDNs, etc.), acquisition and ingest are at risk if one or more of those locations fail or change.
      • Web bots may have trouble accessing many different web locations because of configuration issues.
  12. Cohesion evaluation. Total Cohesion: 70%.
      • 1 external and no internal scripts: 0%
      • 4 local and 1 external images: 80%
      • No proprietary (Quicktime & Flash) files: 100%
      • 1 local CSS file: 100%
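
The image and script ratings above are consistent with scoring each resource type by the share of files served from the page's own host (4 of 5 images gives 80%, 0 of 1 scripts gives 0%). The sketch below illustrates that reading; the function name `local_share` and the resource URLs are hypothetical, and the slides do not confirm that CLEAR computes the rating exactly this way.

```python
from urllib.parse import urljoin, urlparse


def local_share(page_url, resource_urls):
    """Percentage of resources hosted on the same host as the page.

    Assumption: cohesion per resource type is the share of local files.
    Returns None when there are no resources of that type.
    """
    if not resource_urls:
        return None
    page_host = urlparse(page_url).netloc
    local = sum(
        1
        for url in resource_urls
        # urljoin resolves relative references against the page URL.
        if urlparse(urljoin(page_url, url)).netloc == page_host
    )
    return 100 * local / len(resource_urls)


page = "http://ipres2013.ist.utl.pt/"
# Hypothetical resource lists, shaped to match the counts on the slide.
images = ["img/a.png", "img/b.png", "/img/c.png", page + "img/d.png",
          "http://example.org/banner.png"]
scripts = ["http://example.org/lib.js"]
print(local_share(page, images), local_share(page, scripts))
```
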
  13. Metadata
      • Metadata are necessary for digital curation and archiving.
      • Lack of metadata impairs the ability to manage, organise, retrieve and interact with content.
      • Web content metadata may be:
         • Syntactic (e.g. content encoding, character set)
         • Semantic (e.g. description, keywords, dates)
         • Pragmatic (e.g. FOAF, RDF, Dublin Core)
  14. Metadata evaluation. Total Metadata: 87%.
      • Meta description found: 100%
      • HTTP Content type: 100%
      • HTTP Page expiration not found: 50%
      • HTTP Last-modified found: 100%
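
The header checks above can be sketched as a function over an already-fetched response. This is an illustration under stated assumptions: `rate_metadata` is a hypothetical name, "page expiration" is taken to mean the HTTP `Expires` header, and every missing item is rated 50% because that is the only failure rating the slide shows for this facet.

```python
def rate_metadata(headers, has_meta_description):
    """Rate the four metadata checks listed on the slide.

    headers: mapping of HTTP response header names to values.
    Assumption: a missing item is a minor failure (50), a present one passes (100).
    """
    present = {name.lower() for name in headers}  # header names are case-insensitive

    def found(header_name):
        return 100 if header_name in present else 50

    return {
        "meta description": 100 if has_meta_description else 50,
        "HTTP Content-Type": found("content-type"),
        "HTTP page expiration (Expires)": found("expires"),
        "HTTP Last-Modified": found("last-modified"),
    }


# Hypothetical response headers with no Expires header, as in the slide's example.
example_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "Tue, 23 Apr 2013 10:00:00 GMT",
}
ratings = rate_metadata(example_headers, has_meta_description=True)
print(ratings, sum(ratings.values()) / len(ratings))
```

The mean of these four ratings is 87.5, which matches the 87% total on the slide if the value is truncated.
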
  15. Performance
      • Calculate the average network response time for all website content.
      • The throughput of web spider data acquisition affects the number and complexity of the web sources it can process.
      • Performance evaluation. Total Performance: 100%.
         • Average network response time is 0.546ms: 100%
  16. Standards Compliance
      • Digital curation best practices recommend that web resources be represented in known and transparent standards so that they can be preserved.
  17. Standards Compliance evaluation. Total Standards Compliance: 87%.
      • 1 Invalid CSS file: 0%
      • Invalid HTML file: 0%
      • Meta description found: 100%
      • No HTTP Content encoding: 50%
      • HTTP Content Type found: 100%
      • HTTP Page expiration found: 100%
      • HTTP Last-modified found: 100%
      • No Quicktime or Flash objects: 100%
      • 5 images found and validated with JHOVE: 100%
  18. iPRES 2013 Website Archivability Evaluation. Website Archivability: 77%.
      • Accessibility: 50%
      • Cohesion: 70%
      • Standards Compliance: 77%
      • Metadata: 87%
      • Performance: 100%
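
As a quick check of the figures above, the overall score can be reproduced by averaging the five facet ratings with equal weight. The snippet is only a worked calculation; the variable names are illustrative.

```python
# Facet ratings as reported for the iPRES 2013 website.
facets = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}

# Equal weighting: the overall score is the mean of the facet ratings.
website_archivability = sum(facets.values()) / len(facets)
print(website_archivability, round(website_archivability))
```

The mean is 76.8, which rounds to the 77% reported.
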
  19. ArchiveReady.com Demonstration
      - Web application implementing CLEAR,
      - Web interface and a Web API in JSON,
      - Running on Linux, Python, Nginx, Redis, MySQL.
  20. Impact
      1. Web professionals
         - evaluate the archivability of their websites in an easy but thorough way,
         - become aware of web preservation concepts,
         - embrace preservation-friendly practices.
      2. Web archive operators
         - make informed decisions on archiving websites,
         - perform large-scale website evaluations with ease,
         - automate web archiving Quality Assurance,
         - minimise wasted resources on problematic websites.
  21. Future Work
      1. It is not optimal to treat all Archivability Facets as equal.
      2. The method currently evaluates a single page, on the assumption that pages from the same website share the same components and standards. Sampling would be necessary.
      3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. Differential valuing of error classes and types is necessary.
      4. Cross-validation with web archive data is under way.
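
The first future-work item suggests moving from equal to differential weighting of facets. One way this might look is a weighted mean, sketched below. The weights are purely hypothetical placeholders, since the slides do not propose any values, and `weighted_archivability` is an illustrative name.

```python
def weighted_archivability(facets, weights):
    """Weighted mean of facet ratings; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in facets)
    return sum(facets[name] * weights[name] for name in facets) / total_weight


facets = {
    "Accessibility": 50,
    "Cohesion": 70,
    "Standards Compliance": 77,
    "Metadata": 87,
    "Performance": 100,
}

# Equal weights reproduce the current behaviour (plain mean of the facets).
equal = {name: 1.0 for name in facets}

# Hypothetical weights for illustration only: Accessibility counts double.
hypothetical = dict(equal, Accessibility=2.0)

print(weighted_archivability(facets, equal))
print(weighted_archivability(facets, hypothetical))
```
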
  22. Thank you. Any questions? Vangelis Banos. Web: http://vbanos.gr/ Email: vbanos@gmail.com. The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.
