Taxonomies for Human vs. Auto-Indexing
Taxonomy Boot Camp, September 25, 2008, San Jose, CA

Heather Hedden
Hedden Information Management
heather@hedden.net
Background
Heather Hedden's taxonomy-creation experience
 For human indexing
    Developed controlled vocabularies for periodical article index
    databases (Gale)
 For auto-indexing
    Developed taxonomies for integration within an enterprise
    search software product for corporate content and web page
    searching (Viziant)
    Matched controlled vocabulary to keywords for consumer online
    products/services directories (various "yellow pages" clients)
 For either
    Created enterprise taxonomies for corporate web sites and
    intranets for site navigation (Earley & Associates)



                      © 2008 Hedden Information Management
Outline
 Taxonomies & Indexing Background
 Choosing Human vs. Auto-indexing
 Taxonomies & Human Indexing
 Taxonomies & Auto-indexing
 Taxonomy Creation Comparison
   Differences in taxonomy terms
   Differences in term relationships
   Differences in definitions & notes
   Differences in synonyms/variants
 Additional Work for the Taxonomist
 Resources

Taxonomies & Indexing
Types of Taxonomies
1.   Organization, classification, navigation support
     – more emphasis on hierarchies
2.   Search and retrieval support
     – more emphasis on synonyms

     For indexing, use #2 above:
     Search & retrieval support taxonomies



Taxonomies & Indexing
Search & retrieval taxonomies:
  Connect users to desired content by means of a
  common nomenclature/terminology/vocabulary
  Matching between:
    1.   the vocabulary of the users
    2.   the vocabulary of the content
  Taxonomies interface with
    1.   the users
    2.   the content


  Indexing/tagging/categorization deals with
  #2 in each case above: the connection of taxonomy to
  content
Taxonomies & Indexing
Indexing/tagging/categorization:
 Indexing
 – done by (trained) indexers
 – creating a (browsable) index
 Tagging
 – done by any person
 – applying labels, metatag descriptors to documents to
 be picked up by database or search software
 – may not require a taxonomy/controlled vocabulary
 Categorization
  – done more systematically/automatically
  – putting documents into (pre-defined) categories
  – often within facets

Choosing Human vs. Auto-indexing:
The Content
Human indexing
 Manageable number of documents
 Includes non-text files
 Varied and undifferentiated document types/formats
 Varied subject areas
Auto-indexing
 Very large number of documents
 Text files only
 Common document types/formats (or pre-tagged types)
 Focused subject areas (legal, medical, etc.)



Choosing Human vs. Auto-indexing:
The Culture
Human indexing
 Higher accuracy in indexing
 Invest in people
 Low-tech: can build your own indexing UI or buy
 Internal control, or outsourcing vendor relationship
Auto-indexing
 Greater volume indexed
 Greater speed in indexing
 Invest in technology
 High-tech: must purchase auto-indexing software
 Software vendor relationship



Taxonomies & Human Indexing
Who are indexers?
 Specialists or not
   the taxonomist and/or other information
   specialists, librarians
   dedicated hired indexers (with or without prior
   indexing experience)
   supplemental work for other staff (editors,
   writers, administrators)
 One person or multiple people
 Usually in-house but could be contracted out

Taxonomies & Human Indexing
Indexing software/module
 Indexing user interface optimized for ease,
 speed, and accuracy in indexing
 Method for indexers to nominate new taxonomy
 terms
Training & documentation for indexers
 Indexing policy guidelines
 Method to communicate new and changed
 taxonomy terms to indexers
 Method for checking and quality control

Taxonomies & Auto-Indexing
Technologies
 Entity extraction
 Text mining and text analytics
 Auto-categorization or auto-classification
 utilizing taxonomies:
  1. Machine-learning and training documents
  2. Rules-based categorization




Taxonomies & Auto-Indexing
Machine-learning auto-categorization:
 Complex mathematical algorithms are created
 Taxonomist must then provide several (at least
 5-10) representative sample documents for each
 taxonomy term to “train” the automated indexing
 system.
 If using only 5-10 documents, then profile/overview
 or encyclopedic articles work best.
 If pre-indexed records exist (i.e. converting from
 human to automated indexing), then hundreds of
 varied documents can be used for each term.
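
As a rough illustration of the training step, the sketch below builds a word-count profile for each taxonomy term from a few sample documents, then assigns a new document to the term whose profile it most resembles. The cosine-similarity method, the sample texts, and the names `train` and `categorize` are all assumptions made for illustration; commercial auto-categorization products use their own, more sophisticated algorithms.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(samples):
    """Build one word-count profile per taxonomy term from its sample documents."""
    profiles = {}
    for term, docs in samples.items():
        profile = Counter()
        for doc in docs:
            profile.update(tokenize(doc))
        profiles[term] = profile
    return profiles

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(text, profiles):
    """Return the taxonomy term whose profile is most similar to the text."""
    vector = Counter(tokenize(text))
    return max(profiles, key=lambda term: cosine(vector, profiles[term]))

# Hypothetical training documents; a real system needs at least 5-10 per term.
SAMPLES = {
    "U.S. president": [
        "The president signed the bill at the White House.",
        "The administration announced a new foreign policy.",
    ],
    "Shrubs": [
        "Prune the shrubs in early spring before new growth.",
        "A flowering bush needs sun, water, and mulch in the garden.",
    ],
}

profiles = train(SAMPLES)
print(categorize("The president met his administration at the White House", profiles))
print(categorize("Plant the bush in the garden and water it", profiles))
```

With so few samples the profiles are dominated by whichever words happen to appear, which is one reason broad overview articles make better training documents than narrow ones.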
Taxonomies & Auto-Indexing
[Figure: Machine-learning auto-categorization]

Taxonomies & Auto-Indexing
Rules-based auto-categorization:
  Taxonomist must write rules for each taxonomy
  term
  Similar to advanced Boolean searching

bush
IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH
    "administration*" OR AROUND "white house" OR NEAR
    "george"))
USE U.S. president
ELSE USE Shrubs
ENDIF
(Rule example: Data Harmony)
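
The same disambiguation logic can be approximated in a general-purpose language. The sketch below is a loose Python rendering, not Data Harmony syntax: the function name `categorize_bush` and the 50-character proximity window are assumptions, and the rule's separate MENTIONS, WITH, AROUND, and NEAR operators are collapsed into a single "nearby" check.

```python
import re

# Context clues from the rule above; a trailing * becomes a "word starts with" match.
CLUES = [r"president\w*", r"administration\w*", r"white house", r"george"]
WINDOW = 50  # assumed proximity window, in characters on each side

def categorize_bush(text):
    """Return one taxonomy term for each occurrence of 'bush' in the text."""
    results = []
    for match in re.finditer(r"\b[Bb]ush\b", text):
        capitalized = match.group(0)[0].isupper()
        start = max(0, match.start() - WINDOW)
        context = text[start:match.end() + WINDOW].lower()
        has_clue = any(re.search(r"\b" + clue + r"\b", context) for clue in CLUES)
        results.append("U.S. president" if capitalized and has_clue else "Shrubs")
    return results

print(categorize_bush("President Bush spoke at the White House."))
print(categorize_bush("Trim the bush beside the fence."))
```

A capitalized "Bush" at the start of a gardening sentence still falls through to Shrubs because no context clue is nearby, which is the point of combining the two conditions.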


Taxonomy Creation Comparison
 Differences in taxonomy terms
 Differences in term relationships
 Differences in term notes, definitions
 Differences in synonyms/variants




Differences in Taxonomy Terms
 For human indexing
  Create terms as specific (granular) as the
   content will support and users will expect.
 For auto-indexing
   Cannot have subtle differences between
   preferred terms:
   International relations; Foreign policy
   Avoid creating both action and topic terms:
   Investing; Investments

Differences in Term Relationships
  Hierarchical (broader/narrower) links
  Associative (related terms) links
  For human indexing
  Highly useful to the indexer, as to the end user, in finding the
     best term. Consider indexer behavior.
  For auto-indexing
  Not needed, but could be utilized in search results:
     Broader terms recursively include narrower term
     results
     Related terms display as suggestions
     Consider search results.
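
A minimal sketch of how a search on a broader term might recursively include results for its narrower terms. The hierarchy, the document IDs, and the function names `expand` and `search` are hypothetical, and the sketch assumes the hierarchy contains no cycles.

```python
# Hypothetical hierarchy: each broader term maps to its narrower terms.
NARROWER = {
    "Plants": ["Shrubs", "Trees"],
    "Trees": ["Conifers"],
}

# Hypothetical index: taxonomy term -> documents tagged with that term.
INDEX = {
    "Plants": {"doc1"},
    "Shrubs": {"doc2"},
    "Trees": {"doc3"},
    "Conifers": {"doc4"},
}

def expand(term):
    """Return the term plus all of its narrower terms, at every level."""
    terms = [term]
    for child in NARROWER.get(term, []):
        terms.extend(expand(child))
    return terms

def search(term):
    """Collect documents indexed with the term or with any narrower term."""
    docs = set()
    for t in expand(term):
        docs |= INDEX.get(t, set())
    return docs

print(sorted(search("Trees")))
print(sorted(search("Plants")))
```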
Differences in Term Relationships
  Facets
Certain facets may work better with human
 indexing than with auto-indexing.
 Automated indexing may not distinguish
 between different facet meanings of a term.
Examples:
  Mergers - Action/Event or Business Topic?
  Churches – Place or Organization type?



Differences in Term Notes
Concise explanatory notes (not a dictionary
   definition) on some terms, as needed:
1.   To restrict or expand the application of a term
2.   To distinguish between terms of overlapping meaning
     (may have reciprocal notes)
3.   To provide advice on term usage

For the end-user, optional aid
For indexing:
     often needed for some terms for human indexing
     never needed for auto-indexing
May have notes for indexers that are not for end-users.

Differences in Term Notes
Scope Notes examples
ProQuest Controlled Vocabulary:

Occupational health
SN: Employer activities designed to protect and promote the health and
     safety of employees on the job
Inequality
SN: Socioeconomic disparity stemming from racial, cultural, or social
     bias

Medical Subject Headings (MeSH):

Nonverbal Communication
Annotation: human only; for animals use ANIMAL COMMUNICATION
    or VOCALIZATION, ANIMAL

Differences in Synonyms/Variants
Non-preferred terms. Types include:
  synonyms: Cars USE Automobiles
  near-synonyms: Junior high USE Middle school
  variant spellings: Defence USE Defense
  lexical variants: Hair loss USE Baldness
  foreign language terms: Luftwaffe USE German Air Force
  acronyms/spelled out forms: UN USE United Nations
  scientific/technical names: Neoplasms USE Cancer
  antonyms: Misbehavior USE Behavior
  narrower terms and instances that are not preferred
  terms: Power hand drills USE Power hand tools

  Each preferred term may have multiple non-preferred
  terms.
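
The USE references above amount to a lookup table from non-preferred to preferred terms. A minimal sketch, assuming case-insensitive matching; the table holds the examples from this slide, and the function name `resolve` is hypothetical.

```python
# Non-preferred term (lowercased) -> preferred term, from the examples above.
USE = {
    "cars": "Automobiles",
    "junior high": "Middle school",
    "defence": "Defense",
    "hair loss": "Baldness",
    "luftwaffe": "German Air Force",
    "un": "United Nations",
    "neoplasms": "Cancer",
    "misbehavior": "Behavior",
    "power hand drills": "Power hand tools",
}
PREFERRED = set(USE.values())

def resolve(entry):
    """Map an entered term to its preferred term, or None if it is unknown."""
    key = entry.strip().lower()
    if key in USE:
        return USE[key]
    for term in PREFERRED:
        if term.lower() == key:
            return term  # already a preferred term
    return None

print(resolve("Cars"))
print(resolve("baldness"))
print(resolve("Trucks"))
```

Because several keys can point to the same value, the structure naturally allows multiple non-preferred terms per preferred term.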
Differences in Synonyms/Variants
For human indexing
 “Shortcuts”: unique abbreviations within each
 facet (2-3 letters) for commonly entered terms
    For countries, states, industry codes
    For use within a facet of limited size, so the codes can be memorized
  Examples:
    mna – Mergers & acquisitions
    bnk – Banking
    fr – France

 Phrase inversions for alphabetical browsing
 Example: Photography, digital
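
Both aids for human indexers are easy to sketch. In the example below, the facet names and the functions `expand_shortcut` and `invert` are hypothetical. The inversion assumes the last word of the term is the head noun and lowercases the modifiers, so it would need adjusting for terms that contain proper nouns.

```python
# Hypothetical shortcut tables, one per facet; the facet names are assumed.
SHORTCUTS = {
    "Business topics": {"mna": "Mergers & acquisitions"},
    "Industries": {"bnk": "Banking"},
    "Countries": {"fr": "France"},
}

def expand_shortcut(facet, code):
    """Return the full term for a shortcut typed within a facet, or None."""
    return SHORTCUTS.get(facet, {}).get(code.strip().lower())

def invert(term):
    """Move the last word to the front: 'Digital photography' -> 'Photography, digital'."""
    words = term.split()
    if len(words) < 2:
        return term  # nothing to invert in a single-word term
    head, modifiers = words[-1], " ".join(words[:-1])
    return f"{head.capitalize()}, {modifiers.lower()}"

print(expand_shortcut("Countries", "FR"))
print(invert("Digital photography"))
```

Keeping shortcuts scoped to a facet lets the same two- or three-letter code be reused in different facets without ambiguity.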

Differences in Synonyms/Variants
For Auto-indexing
If machine-learning auto-categorization:
   Need greater number of non-preferred terms
   Can include non-noun phrases
For human indexing
   Presidential candidates
   Candidates, presidential
For auto-indexing
   Presidential candidate
   Presidential candidacy
   Candidate for president
   Candidacy for president
   Presidential hopeful
   Running for president
   Campaigning for president
   Presidential nominee
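
To show why auto-indexing benefits from many varied non-preferred terms, including non-noun phrases, the sketch below tags a text with the preferred term whenever any listed variant appears. Plain phrase matching with an optional plural "s" is an assumption made for illustration, as is the function name `auto_index`; real products match text in their own ways.

```python
import re

PREFERRED = "Presidential candidates"
VARIANTS = [
    "presidential candidate", "presidential candidacy",
    "candidate for president", "candidacy for president",
    "presidential hopeful", "running for president",
    "campaigning for president", "presidential nominee",
]

# One pattern matching any variant as a whole phrase, optionally pluralized.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in VARIANTS) + r")s?\b",
    re.IGNORECASE,
)

def auto_index(text):
    """Return the preferred term if any variant phrase occurs in the text."""
    return PREFERRED if PATTERN.search(text) else None

print(auto_index("She is running for president next year."))
print(auto_index("The company president resigned."))
```

A human indexer would recognize the first sentence without the verb phrase "running for president" in the vocabulary; a phrase matcher would not.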
Taxonomy Creation Summary
Human indexing
  Rich relationships between terms
  Term notes for clarification
  Common-use shortcuts
  Phrase inversions as term variants
 Also:
  Browsable (A-Z) display
  Multiple ways to search (beginning of term, word within term, etc.)
Auto-indexing
  Cannot have subtle differences between terms
  Avoid creating action-type terms
  Be careful with facets
  Need more, varied non-preferred terms, including non-noun phrases


Additional Work for the Taxonomist
Human indexing
 Inform indexers of newly added terms
 Adjustments based on review of indexers’ work:
    If terms are overlooked (not used):
    - Create more non-preferred terms
    - Create more related-term links
    If terms are misused:
    - Re-word terms
    - Add scope notes
Auto-indexing
 Continual update work, for each new term:
    Add new training documents, or
    Write new rules
 Adjustments based on inappropriate results:
    Add, delete, edit training documents
    Tweak existing rules
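
Overlooked terms can be found by comparing the taxonomy against the terms indexers have actually assigned. A minimal sketch, assuming each indexing record is simply a list of assigned terms; the function name `unused_terms` and the sample data are hypothetical.

```python
from collections import Counter

def unused_terms(taxonomy_terms, indexing_records):
    """List taxonomy terms that have not been assigned to any document."""
    usage = Counter(term for record in indexing_records for term in record)
    return sorted(t for t in taxonomy_terms if usage[t] == 0)

# Hypothetical taxonomy and indexing records (one list of terms per document).
terms = ["Banking", "Investments", "Mergers & acquisitions"]
records = [["Banking"], ["Banking", "Investments"]]
print(unused_terms(terms, records))
```

Terms on the resulting list are candidates for more non-preferred terms or related-term links, provided the content actually covers them.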
Resources
 American Society for Indexing
 www.asindexing.org
 Taxonomies & Controlled Vocabularies SIG of
 the American Society for Indexing
 www.taxonomies-sig.org
 "Taxonomies and Controlled Vocabularies"
 Simmons College Graduate School of Library and
 Information Science Continuing Education Program
    onsite workshop (October 25, 2008, Boston)
    online workshop (February 2009)
 www.simmons.edu/gslis/continuinged/workshops

Contact
Heather Hedden
Hedden Information Management
98 East Riding Dr.
Carlisle, MA 01741

978-371-0822
978-467-5195 (mobile)
Heather@hedden.net
www.hedden-information.com





Taxonomies for Human vs Auto-Indexing

  • 1. Taxonomies for Human vs. Auto-Indexing Taxonomy Boot Camp, September 25, 2008, San Jose, CA Heather Hedden Hedden Information Management heather@hedden.net
  • 2. Background Heather Hedden's taxonomy-creation experience For human indexing Developed controlled vocabularies for periodical article index databases (Gale) For auto-indexing Developed taxonomies for integration within an enterprise search software product for corporate content and web page searching (Viziant) Matched controlled vocabulary to keywords for consumer online products/services directories (various "yellow pages" clients) For either Created enterprise taxonomies for corporate web sites and intranets for site navigation (Earley & Associations) © 2008 Hedden Information Management
  • 3. Outline Taxonomies & Indexing Background Choosing Human vs. Auto-indexing Taxonomies & Human Indexing Taxonomies & Auto-indexing Taxonomy Creation Comparison Differences in taxonomy terms Differences in term relationships Differences in definitions & notes Differences in synonyms/variants Additional Work for the Taxonomist Resources © 2008 Hedden Information Management
  • 4. Taxonomies & Indexing Types of Taxonomies 1. Organization, classification, navigation support – more emphasis on hierarchies 2. Search and retrieval support – more emphasis on synonyms For indexing, use #2 above: Search & retrieval support taxonomies © 2008 Hedden Information Management
  • 5. Taxonomies & Indexing Search & retrieval taxonomies: Connect users to desired content by means of a common nomenclature/terminology/vocabulary Matching between: 1. the vocabulary of the users 2. the vocabulary of the content Taxonomies interface with 1. the users 2. the content Indexing/tagging/categorization deals with #2 in each case above: the connection of taxonomy to content © 2008 Hedden Information Management
  • 6. Taxonomies & Indexing Indexing/tagging/categorization: Indexing – done by (trained) indexers – creating a (browsable) index Tagging – done by any person – applying labels, metatag descriptors to documents to be picked up by database or search software – may not require a taxonomy/controlled vocabulary Categorization – done more systematically/automatically – putting documents into (pre-defined) categories – often within facets © 2008 Hedden Information Management
  • 7. Choosing Human vs. Auto-indexing: The Content Human indexing Auto-indexing Manageable number of Very large number of documents documents Includes non-text files Text files only Varied and Common document undifferentiated types/formats (or pre- document types/formats tagged types) Varied subject areas Focused subject areas (legal, medical, etc.) © 2008 Hedden Information Management
  • 8. Choosing Human vs. Auto-indexing: The Culture Human indexing Auto-indexing Higher accuracy in Greater volume indexed indexing Greater speed in indexing Invest in people Invest in technology Low-tech: can build your High-tech: must purchase own indexing UI or buy auto-indexing software Internal control, or Software vendor outsourcing vendor relationship relationship © 2008 Hedden Information Management
  • 9. Taxonomies & Human Indexing Who are indexers? Specialists or not the taxonomist and/or other information specialists, librarians dedicated hired indexers (with or without prior indexing) experience supplemental work for other staff (editors, writers, administrators) One person or multiple people Usually in-house but could be contracted out © 2008 Hedden Information Management
  • 10. Taxonomies & Human Indexing Indexing software/module Indexing user interface optimized for ease, speed, and accuracy in indexing Method for indexers to nominate new taxonomy terms Training & documentation for indexers Indexing policy guidelines Method to communicate new and changed taxonomy terms to indexers Method for checking and quality control © 2008 Hedden Information Management
  • 11. © 2008 Hedden Information Management
  • 12. © 2008 Hedden Information Management
  • 13. Taxonomies & Auto-Indexing Technologies Entity extraction Text mining and text analytics Auto-categorization or auto-classification utilizing taxonomies: 1. Machine-learning and training documents 2. Rules-based categorization © 2008 Hedden Information Management
  • 14. Taxonomies & Auto-Indexing Machine-learning auto-categorization: Complex mathematical algorithms are created Taxonomist must then provide several (at least 5-10) representative sample documents for each taxonomy term to “train” the automated indexing system. If only using only 5-10 documents, then profile/ overview, encyclopedic articles are best. If pre-indexed records exist (i.e. converting from human to automated indexing), then hundreds of varied documents can be used for each term. © 2008 Hedden Information Management
  • 15. Taxonomies & Auto-Indexing Machine-learning auto-categorization © 2008 Hedden Information Management
  • 16. Taxonomies & Auto-Indexing Rules-based auto-categorization: Taxonomist must write rules for each taxonomy term Similar to advanced Boolean searching bush IF (INITIAL CAPS AND (MENTIONS "president*" OR WITH administration*" OR AROUND "white house" OR NEAR "george")) USE U.S. president ELSE USE Shrubs ENDIF Data Harmony © 2008 Hedden Information Management
  • 17. Taxonomy Creation Comparison Differences in taxonomy terms Differences in term relationships Differences in term notes, definitions Differences in synonyms/variants © 2008 Hedden Information Management
  • 18. Differences in Taxonomy Terms For human indexing Create terms as specific (granular) as the content will support and users will expect. For auto-indexing Cannot have subtle differences between preferred terms: International relations; Foreign policy Avoid creating both action and topic terms: Investing; Investments © 2008 Hedden Information Management
  • 19. Differences in Term Relationships
      Hierarchical (broader/narrower) links
      Associative (related terms) links
      For human indexing
        Highly useful to the indexer, as they are to the end-user, in finding the best term.
        Consider indexer behavior.
      For auto-indexing
        Not needed, but could be utilized in search results:
          Broader terms recursively include narrower term results
          Related terms display as suggestions
        Consider search results.
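How relationships might be used at search time can be sketched as follows. The hierarchy, the related-term links, the document IDs, and the function names are all hypothetical; the sketch only shows a broader term recursively pulling in results tagged with its narrower terms, with related terms returned as suggestions.

```python
# Hypothetical hierarchy: each term maps to its immediate narrower terms.
NARROWER = {
    "Vehicles": ["Automobiles", "Trucks"],
    "Automobiles": ["Sports cars"],
}

# Hypothetical associative (related-term) links.
RELATED = {"Automobiles": ["Automobile industry"]}

# Hypothetical index: IDs of the documents tagged with each term.
INDEXED = {
    "Vehicles": {1},
    "Trucks": {2},
    "Automobiles": {3},
    "Sports cars": {4, 5},
    "Automobile industry": {6},
}

def with_narrower(term, seen=None):
    """Return the term plus all of its narrower terms, at every level."""
    seen = set() if seen is None else seen
    if term not in seen:  # guards against accidental cycles in the hierarchy
        seen.add(term)
        for child in NARROWER.get(term, []):
            with_narrower(child, seen)
    return seen

def search(term):
    """Return (matching document IDs, related terms to suggest)."""
    docs = set()
    for t in with_narrower(term):
        docs |= INDEXED.get(t, set())
    return docs, RELATED.get(term, [])
```

Related terms are offered as suggestions only; their documents are not merged into the result set.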
  • 20. Differences in Term Relationships
      Facets
        Certain facets may work better with human indexing than with auto-indexing.
        Automated indexing may not distinguish between different facet meanings of a term.
        Examples:
          Mergers - Action/Event or Business Topic?
          Churches - Place or Organization type?
  • 21. Differences in Term Notes
      Concise explanatory notes (not a dictionary definition) on some terms, as needed:
        1. To restrict or expand the application of a term
        2. To distinguish between terms of overlapping meaning (may have reciprocal notes)
        3. To provide advice on term usage
      For the end-user, optional aid
      For indexing:
        often needed for some terms for human indexing
        never needed for auto-indexing
      May have notes for indexers that are not for end-users.
  • 22. Differences in Term Notes
      Scope Notes examples
      ProQuest Controlled Vocabulary:
        Occupational health
          SN: Employer activities designed to protect and promote the health and safety of employees on the job
        Inequality
          SN: Socioeconomic disparity stemming from racial, cultural, or social bias
      Medical Subject Headings (MeSH):
        Nonverbal Communication
          Annotation: human only; for animals use ANIMAL COMMUNICATION or VOCALIZATION, ANIMAL
  • 23. Differences in Synonyms/Variants
      Non-preferred terms. Types include:
        synonyms: Cars USE Automobiles
        near-synonyms: Junior high USE Middle school
        variant spellings: Defence USE Defense
        lexical variants: Hair loss USE Baldness
        foreign language terms: Luftwaffe USE German Air Force
        acronyms/spelled out forms: UN USE United Nations
        scientific/technical names: Neoplasms USE Cancer
        antonyms: Misbehavior USE Behavior
        narrower terms and instances that are not preferred terms: Power hand drills USE Power hand tools
      Each preferred term may have multiple non-preferred terms.
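The USE references listed on this slide amount to a lookup from any entry term to its preferred term. A minimal sketch, using only the slide's own examples; the table name, the `resolve` function, and the case-insensitive matching are assumptions made for illustration:

```python
# Non-preferred term -> preferred term, taken from the slide's examples.
USE_REFERENCES = {
    "Cars": "Automobiles",
    "Junior high": "Middle school",
    "Defence": "Defense",
    "Hair loss": "Baldness",
    "Luftwaffe": "German Air Force",
    "UN": "United Nations",
    "Neoplasms": "Cancer",
    "Misbehavior": "Behavior",
    "Power hand drills": "Power hand tools",
}

# Case-insensitive lookup that also accepts the preferred terms themselves.
_LOOKUP = {entry.lower(): preferred for entry, preferred in USE_REFERENCES.items()}
_LOOKUP.update({preferred.lower(): preferred for preferred in USE_REFERENCES.values()})

def resolve(entry):
    """Return the preferred term for an entry term, or None if it is unknown."""
    return _LOOKUP.get(entry.strip().lower())
```

Because several non-preferred terms can point to the same preferred term, the table is keyed by the non-preferred form.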
  • 24. Differences in Synonyms/Variants
      For human indexing
        “Shortcuts” – unique abbreviations within each facet (2-3 letters) for commonly entered terms
          For countries, states; industry codes
          For within a facet of limited size – memorizable
          Examples:
            mna – Mergers & acquisitions
            bnk – Banking
            fr – France
        Phrase inversions for alphabetical browsing
          Example: Photography, digital
  • 25. Differences in Synonyms/Variants
      For auto-indexing
        If machine-learning auto-categorization:
          Need greater number of non-preferred terms
          Can include non-noun phrases
      For auto-indexing:
        Presidential candidate
        Presidential candidacy
        Candidate for president
        Candidacy for president
        Presidential hopeful
        Running for president
        Campaigning for president
        Presidential nominee
      For human indexing:
        Presidential candidates
        Candidates, presidential
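Why auto-indexing benefits from many variants, including non-noun phrases, can be seen in a simple phrase-matching sketch. It assumes "Presidential candidates" is the preferred term and that the indexer assigns a term whenever any of its variants appears in the text; the `VARIANTS` table and the `auto_index` function are hypothetical.

```python
import re

# Assumed preferred term with the variants listed on the slide.
VARIANTS = {
    "Presidential candidates": [
        "presidential candidate",
        "presidential candidacy",
        "candidate for president",
        "candidacy for president",
        "presidential hopeful",
        "running for president",
        "campaigning for president",
        "presidential nominee",
    ],
}

def auto_index(text, variants):
    """Return the preferred terms whose name or variants occur in the text."""
    assigned = []
    for preferred, phrases in variants.items():
        alternatives = "|".join(re.escape(p) for p in [preferred] + phrases)
        pattern = r"\b(?:" + alternatives + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            assigned.append(preferred)
    return assigned
```

A sentence such as "She is running for president" contains no noun phrase resembling the preferred term, so only a verb-phrase variant lets the text be matched.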
  • 26. Taxonomy Creation Summary
      Human indexing
        Rich relationships between terms
        Term notes for clarification
        Common-use shortcuts
        Phrase inversions as term variants
        Also:
          Browsable (A-Z) display
          Multiple ways to search (beginning of term, word within term, etc.)
      Auto-indexing
        Cannot have subtle differences between terms
        Avoid creating action-type terms
        Be careful with facets
        Need more, varied non-preferred terms, including non-noun phrases
  • 27. Additional Work for the Taxonomist
      Human indexing
        Inform indexers of newly added terms
        Adjustments based on review of indexers’ work:
          If terms are overlooked (not used):
            - Create more non-preferred terms
            - Create more related-term links
          If terms are misused:
            - Re-word terms
            - Add scope notes
      Auto-indexing
        Continual update work, for each new term:
          Add new training documents, or
          Write new rules
        Adjustments based on inappropriate results:
          Add, delete, edit training documents
          Tweak existing rules
  • 28. Resources
      American Society for Indexing
        www.asindexing.org
      Taxonomies & Controlled Vocabularies SIG of the American Society for Indexing
        www.taxonomies-sig.org
      "Taxonomies and Controlled Vocabularies"
        Simmons College Graduate School of Library and Information Science Continuing Education Program
        onsite workshop (October 25, 2008, Boston)
        online workshop (February 2009)
        www.simmons.edu/gslis/continuinged/workshops
  • 29. Contact
      Heather Hedden
      Hedden Information Management
      98 East Riding Dr.
      Carlisle, MA 01741
      978-371-0822
      978-467-5195 (mobile)
      Heather@hedden.net
      www.hedden-information.com