LIS688_Group1

Automatic Metadata Generation through Extraction
LIS 688 / Spring 2011
Group 1: Mendy Ozan, Eunae Chae, and Vanessa Smith

Slide 2: Outline
- Overview of automatic metadata generation
- Differences between harvesting and extraction
- Examination of extraction (6 literature reviews)
- Examination of Klarity
- Conclusion

Slide 3: Background
- Web-based information resources are increasing
- Resources are limited: labor and money
- “For libraries to advance and take leadership in the bibliographic control of web resources, they must investigate more efficient and less costly metadata creation methods.” (Greenberg, Spurgin, and Crystal, 2005)

Slide 4: Benefits
- Helps serve the need for more efficient and less costly metadata creation
- Increases consistency in records, making them more searchable and interoperable
- Cuts down on the amount of human labor

Slide 5: Main methods
- Harvesting: relies on machine-enabled collection of previously tagged or populated metadata fields
- Extraction: uses complex indexing algorithms and mining techniques to read the contents of a resource, analyze this content, and extract information for application to a metadata schema
- Main difference: the type of content that is read and analyzed by the program (existing metadata fields for harvesting, the resource's own content for extraction)
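The contrast between the two methods can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how any tool discussed in these slides works: `harvest` reads `<meta>` fields an author already populated, while `extract_keywords` reads the body text and picks the most frequent words. The sample page, the stopword list, and all function names are made up for the example.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "from", "that"}


class PageReader(HTMLParser):
    """Collects <meta name=... content=...> pairs and the visible body text."""

    def __init__(self):
        super().__init__()
        self.meta = {}         # fields the author already populated (harvesting)
        self.text_parts = []   # raw content (used only by extraction)
        self._skip = False     # True while inside <script>, <style>, or <title>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("script", "style", "title"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style", "title"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def harvest(html):
    """Harvesting: return metadata fields that already exist in the page."""
    reader = PageReader()
    reader.feed(html)
    return reader.meta


def extract_keywords(html, top_n=3):
    """Extraction: derive keywords from the content by word frequency."""
    reader = PageReader()
    reader.feed(html)
    words = re.findall(r"[a-z]+", " ".join(reader.text_parts).lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical sample page for illustration.
page = """<html><head>
<meta name="DC.title" content="Metadata Basics">
<meta name="DC.creator" content="A. Author">
</head><body>
<p>Metadata describes resources. Good metadata makes resources searchable.</p>
<p>Libraries create metadata for web resources.</p>
</body></html>"""

print(harvest(page))
print(extract_keywords(page, 2))
```

Word frequency stands in here for the "complex indexing algorithms" the slide mentions; real extraction tools use far richer analysis.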

Slide 6: More about extraction
- Rule-based systems based on natural language processing: used to extract metadata from educational materials
- Machine learning methods: used to extract titles from general documents
- Weakness: many methods extract metadata mainly from the first page of a document and not from its inner pages

Slide 7: Literature reviews (1)
- "Automatic Metadata Retrieval from Ancient Manuscripts" by Le Bourgeois, F. & Kaileh, H. (2004)
- Aimed at automatically processing digitized manuscripts with a generic platform that non-specialists in image processing and pattern recognition can use
- Finding: the source of the image was the main factor in whether the automatically generated metadata was good; image quality is the key

Slide 8: Literature reviews (2)
- Metadata Extraction and Harvesting: Klarity & DC.dot
- Klarity generates five metadata elements: identifier, title, concepts, keywords, and description
- Some elements (identifier and title) are harvested directly from the resource, while keywords, description, and concepts are generated through extraction

Slide 9: Literature reviews (2)
- Klarity provides an interface for manually editing and adding to the metadata
- Weakness of description: a smaller amount of text creates better metadata
- Inaccuracy of title and keywords: character limit in the title field (>100)
- But it has potential as it is further developed

Slide 10: Literature reviews (2)
- DC.dot finds the resource “identifier” from the web browser’s address prompt and pulls the remaining elements from the metadata in the page's source code

Slide 11: Literature reviews (2)
- Both tools still require human oversight and input
- Scored low on their accuracy evaluations (86.2% for Klarity, 78.6% for DC.dot)

Slide 12: Literature reviews (3)
- New possibilities for metadata creation in an institutional repository context
- Institutional repository: archives, distributes, manages, and preserves research efforts
- Focused on two methods of metadata generation: text mining and machine learning

Slide 13: Literature reviews (3)
- Text mining: the process by which a computer recognizes the similarities between objects based on their textual content
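One common way to measure similarity between texts is cosine similarity over word counts. The sketch below uses that technique as a stand-in; the reviewed article does not say which measure it used, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercase word -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 to 1.0)."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def most_similar(target, corpus):
    """Return the name of the corpus document most similar to target."""
    return max(corpus, key=lambda name: cosine_similarity(target, corpus[name]))


# Hypothetical repository records for illustration.
corpus = {
    "record_a": "metadata extraction from web pages",
    "record_b": "ancient manuscripts and image quality",
}
print(most_similar("extraction of metadata", corpus))
```

A new item that scores close to an existing record could then borrow that record's subject terms as candidate metadata, subject to human review.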

Slide 14: Literature reviews (3)
- Machine learning: a process in which problems are given to the computer along with their solutions
- The number of characters, nouns, verbs, and adjectives per line can all help the system identify specific elements
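The idea of per-line features plus labeled examples can be sketched as follows. This is a toy illustration, not the method from the reviewed article: it uses a 1-nearest-neighbor rule, and it replaces noun, verb, and adjective counts (which would need a part-of-speech tagger) with simpler surface features. All names and training lines are hypothetical.

```python
def line_features(line):
    """Surface features of one line; part-of-speech counts are omitted
    because they would require an external tagger."""
    words = line.split()
    capitalized = sum(1 for w in words if w[0].isupper())
    return {
        "n_chars": float(len(line)),
        "n_words": float(len(words)),
        "capitalized_ratio": capitalized / len(words) if words else 0.0,
        "has_digit": 1.0 if any(ch.isdigit() for ch in line) else 0.0,
    }


def classify(line, labeled_examples):
    """Label a line with the label of its nearest labeled example
    (squared Euclidean distance over the features above)."""
    target = line_features(line)

    def distance(example):
        feats = line_features(example[0])
        return sum((target[k] - feats[k]) ** 2 for k in target)

    return min(labeled_examples, key=distance)[1]


# The "problems along with the solution": lines paired with known labels.
training = [
    ("Automatic Metadata Generation for Web Resources", "title"),
    ("This paper reviews several tools that create metadata records "
     "without human input.", "body"),
]
print(classify("Extracting Titles from General Documents", training))
```

The features are not scaled, so line length dominates the distance here; a real system would normalize features and train on many more examples.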

Slide 15: Literature reviews (3)
- Finding: records with good metadata can be taught to the system, so that similar metadata can be created for other records like them

Slide 16: Literature reviews (4)
- Automatic generation and extraction
- Provides an overview of metadata and automatic generation
- Goes into more detail about DescribeThis
- Using the Dublin Core format, DescribeThis creates records that web administrators can use to describe their own web content
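To make the Dublin Core format concrete, the sketch below turns a record into HTML `<meta>` tags using the common `DC.element` naming convention. It does not reproduce DescribeThis output, which these slides do not show; the function name and sample record are hypothetical.

```python
from html import escape

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta(record):
    """Render a dict of Dublin Core elements as HTML <meta> tags.

    Values may be a single string or a list of strings (repeatable elements).
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            content = escape(str(item), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Hypothetical record for illustration.
print(dublin_core_meta({
    "title": "Library Hours & Locations",
    "creator": ["A. Author", "B. Author"],
    "language": "en",
}))
```

A web administrator could paste the resulting tags into a page's `<head>`, where harvesting tools such as DC.dot could later read them.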

Slide 17: Literature reviews (5)
- Automated document metadata extraction
- Focused on theses and dissertations
- Nature of these documents: standard headers, titles, table of contents, abstract, acknowledgement, preface, introduction, conclusion, and references
- These preexisting categories aid automatic generation
- Rule-based system: extracts more metadata
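Because theses share standard section headings, a simple rule can split the text on them. The sketch below is an assumption-laden illustration, not the reviewed system: it treats a line consisting only of a known heading as a section boundary and takes the first non-empty line as the title. The heading list comes from the slide; the function names and sample text are hypothetical.

```python
import re

# A line that contains only one of the standard headings starts a section.
SECTION_PATTERN = re.compile(
    r"^\s*(abstract|acknowledge?ments?|preface|table of contents|"
    r"introduction|conclusions?|references)\s*$",
    re.IGNORECASE,
)


def split_sections(text):
    """Map each detected heading (lowercased) to the text that follows it.
    Text before the first heading is stored under 'front_matter'."""
    sections = {"front_matter": []}
    current = "front_matter"
    for line in text.splitlines():
        match = SECTION_PATTERN.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def extract_metadata(text):
    """Rule: title = first non-empty front-matter line (an assumption);
    abstract = body of the 'abstract' section, if present."""
    sections = split_sections(text)
    front = [l.strip() for l in sections["front_matter"].splitlines() if l.strip()]
    return {
        "title": front[0] if front else None,
        "abstract": sections.get("abstract"),
    }


# Hypothetical thesis text for illustration.
thesis = """Automatic Metadata Generation in Digital Libraries
by A. Student

Abstract
This thesis studies metadata extraction.

Introduction
Libraries face growing collections.
"""
print(extract_metadata(thesis))
```

Rules like these work only as long as documents follow the expected layout, which is why the approach suits theses and dissertations better than general web pages.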
