Lucene revolution with Data Harmony
Leveraging the Power of Lucene and XML for Instant Semantically Enriched Data Distribution
Marjorie Hlava, President and Chairman
Lamine Idjeraoui, Java / Lucene programmer
Access Innovations, Inc.
mhlava@accessinn.com
October 8, 2010

Outline for today
- The Case: NICEM / Media Sleuth
- The Challenges
- Taxonomy - Semantic Enrichment
- Lucene Deployment
- The Search Interface
- Lucene Effect
- Wrap up
©2010 Access Innovations, Inc. All Rights Reserved

Access Innovations, Inc.
NICEM and Media Sleuth = the case
- Software and service company, founded 1978
- Create and implement taxonomies
- Create thesauri
- Provide semantic enrichment tools
- Provide metadata extraction tools
- Early adopters and developers of standards: thesauri (ANSI/NISO Z39.19), taxonomies, metadata, Dublin Core, etc.
- Developers of Data Harmony™: XML, Java, TCP/IP, Unicode

National Information Center for Educational Media (NICEM)
- NICEM data base: NICEM archive files
- 670,000 media from 25,000 sources
- MediaSleuth: e-commerce platform to purchase media
- Once a staff of 31 editors, now done by a staff of two
- Created and stored in an XML Intranet System (XIS)
- One click, ON SAVE: save to XIS, save to SQL, save to Lucene
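
The one-click ON SAVE step can be pictured as a simple fan-out: one save event is delivered to every registered store. The following is only a sketch under stated assumptions. The names `OnSaveSketch`, `Store`, and `RecordingStore` are hypothetical, and the three in-memory stores merely stand in for XIS, SQL, and Lucene; nothing here is taken from the XIS code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fan-out: one save call reaches every registered store. */
final class OnSaveSketch {

    /** Assumed minimal contract for a destination such as XIS, SQL, or Lucene. */
    interface Store {
        void save(String id, String xml);
    }

    /** In-memory stand-in that only remembers which record ids it received. */
    static final class RecordingStore implements Store {
        final String name;
        final List<String> savedIds = new ArrayList<>();

        RecordingStore(String name) {
            this.name = name;
        }

        @Override
        public void save(String id, String xml) {
            savedIds.add(id);
        }
    }

    private final List<Store> stores;

    OnSaveSketch(List<Store> stores) {
        this.stores = List.copyOf(stores);
    }

    /** A single editor action: the same record goes to all stores in order. */
    void onSave(String id, String xml) {
        for (Store store : stores) {
            store.save(id, xml);
        }
    }

    public static void main(String[] args) {
        RecordingStore xis = new RecordingStore("XIS");
        RecordingStore sql = new RecordingStore("SQL");
        RecordingStore lucene = new RecordingStore("Lucene");
        OnSaveSketch saver = new OnSaveSketch(List.of(xis, sql, lucene));
        saver.onSave("rec-1", "<record><title>Sample title</title></record>");
        for (RecordingStore store : List.of(xis, sql, lucene)) {
            System.out.println(store.name + " received " + store.savedIds);
        }
    }
}
```

A production version would also need to decide what happens when one store fails after another has already accepted the record; the sketch leaves that out.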

Data Flow / Collection
- Producer and distributor sources: catalogs, web sites, uploads
- Crawler harvesting and auto extraction
- Auto indexing
- Load to XIS
- NICEM thesaurus terms applied
- Editorial review

NICEM Data Base Creation
- The database is created using XIS, the XML Intranet System
- There are 57 possible data fields
- Many have a pick list or authority file
- Some have ranges of allowed values
- The NICEM Taxonomy is used to index all records
- MAI* is used to automatically suggest valid taxonomy terms
- The Metadata Extractor is used to pull the data from sources
*MAI is Data Harmony's automated indexer

MediaSleuth
- E-commerce division of NICEM
- Uses database records from the NICEM electronic database and adds e-commerce
- Calls on the NICEM taxonomy for an auto-completion feature at search time
- The search presentation layer, Search Harmony:
  - Draws on the full thesaurus (taxonomy)
  - Uses the same terms as used in semantic enrichment of the sources

[Diagram: NICEM data base creation. Labels: raw full-text data feeds; printed source materials; data crawls on sources; add metadata; XIS creation; XIS repository; on save; SQL for e-commerce; load to NICEM Lucene; load to MediaSleuth Lucene; taxonomy terms; MAI Concept Extractor; Metadata Extractor; MAI Rule Base; Taxonomy; Thesaurus Master; Search Harmony display]

NICEM / MediaSleuth builds XML-tagged elements

Machine Aided Indexer (M.A.I.) automatically suggests taxonomy descriptors

The Data Challenge
- The complexity of media
- Educational media:
  - Changes hands regularly (IP bought and sold)
  - Changes format often, e.g. film to CD to streaming media
  - In one year, 25% of the data changed format
- Linking related media
- Users with many search styles
- Need immediate access to changes
- No monthly loading cycle allowed

The Search Challenge
Considerations:
- Too long until available on website
- Use taxonomy for semantic enrichment
- Use the taxonomy in search
- XML records for portability
- Staff productivity
Key hurdles:
- One content set, two websites, three data files, from one data base
- E-commerce = YES
- Flexible search to match learning styles
- Support the ordering and delivery of media

The NICEM Thesaurus (Taxonomy)
- Hierarchical outline of content by subject categories
- Basis for browsing
- Framework for content organization
- Increased recall
- Better precision
- High accuracy
Terms: total terms 5068 preferred terms 4133 nonpreferred terms (use or see also)
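
One way a thesaurus raises recall is by redirecting a nonpreferred term to its preferred term before searching. The sketch below illustrates that lookup only. The class `UseReferenceSketch` and the sample terms are hypothetical and are not taken from the NICEM thesaurus or from Data Harmony.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical USE-reference lookup: nonpreferred term -> preferred term. */
final class UseReferenceSketch {

    // Keys are normalized nonpreferred terms; values are preferred terms as displayed.
    private final Map<String, String> useReferences = new HashMap<>();

    void addUse(String nonPreferred, String preferred) {
        useReferences.put(normalize(nonPreferred), preferred);
    }

    /** Returns the preferred term if one is registered, otherwise the input unchanged. */
    String resolve(String entered) {
        return useReferences.getOrDefault(normalize(entered), entered);
    }

    private static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        UseReferenceSketch thesaurus = new UseReferenceSketch();
        thesaurus.addUse("Movies", "Motion pictures");   // invented sample entry
        thesaurus.addUse("Films", "Motion pictures");    // invented sample entry
        System.out.println(thesaurus.resolve("films"));
        System.out.println(thesaurus.resolve("Geology"));
    }
}
```

Because records are indexed with preferred terms, a user who types a nonpreferred synonym still reaches the same records once the query is resolved this way.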

Thesaurus Term Record view / Taxonomy view

XML or RDF export

NICEM – Lucene Deployment
[Diagram labels: Query; Search; Search Harmony presentation layer; Lucene index; auto-completion; NavTree; narrower terms; related terms; building Lucene index; cleanup, etc.; NICEM data base in XIS repository; XIS]
- A query fetches the hit list from Search Harmony (SH) and snippets from the Repository.
- Data is forked so Data Harmony components can serve snippets and docs, and SH can build indexes.

Technical Detail
Before a record is added to the Lucene index, the data is submitted to the Data Harmony (DH) MAI autoindexer to extract taxonomy terms.
Code snippet:
    thesTerms = getSuggestedTerms(data);  // data is passed through the DH indexer
    doc.add(thesTerms);                   // the suggested terms are added to the Lucene doc
    doc.add(data);                        // add other data to the Lucene doc
    writer.addDocument(doc);              // add doc to the Lucene index
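
The enrich-then-index order on this slide can be tried without Lucene or Data Harmony installed. The following is a dependency-free stand-in, not the authors' implementation: a tiny in-memory inverted index replaces the Lucene index, and a hypothetical two-entry rule table replaces the MAI rule base. Only the method name `getSuggestedTerms` comes from the slide; every other name and all sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Stand-in for the enrich-then-index step; no Lucene or Data Harmony classes are used. */
final class EnrichThenIndexSketch {

    // Hypothetical rule base: a trigger word in the text suggests a taxonomy term.
    private static final Map<String, String> RULES = Map.of(
            "volcano", "Earth sciences",
            "photosynthesis", "Botany");

    // Inverted index: lower-cased word or taxonomy term -> ids of records carrying it.
    private final Map<String, Set<String>> postings = new LinkedHashMap<>();

    /** Plays the role of getSuggestedTerms(data) on the slide. */
    static List<String> getSuggestedTerms(String data) {
        List<String> terms = new ArrayList<>();
        String lower = data.toLowerCase();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            if (lower.contains(rule.getKey())) {
                terms.add(rule.getValue());
            }
        }
        Collections.sort(terms); // Map.of has no defined order, so sort for stable results
        return terms;
    }

    /** Enrich first, then index both the suggested terms and the record's own words. */
    void addDocument(String id, String data) {
        for (String term : getSuggestedTerms(data)) {
            post(term.toLowerCase(), id); // a whole taxonomy term is stored as one key
        }
        for (String token : data.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                post(token, id);
            }
        }
    }

    /** Exact-match lookup of a single word or a whole taxonomy term. */
    Set<String> search(String query) {
        return postings.getOrDefault(query.toLowerCase(), Set.of());
    }

    private void post(String key, String id) {
        postings.computeIfAbsent(key, k -> new TreeSet<>()).add(id);
    }

    public static void main(String[] args) {
        EnrichThenIndexSketch index = new EnrichThenIndexSketch();
        index.addDocument("rec-1", "A film about an erupting volcano");
        index.addDocument("rec-2", "How photosynthesis feeds a forest");
        // "Earth sciences" never appears in rec-1's text; the hit comes from enrichment.
        System.out.println(index.search("Earth sciences"));
    }
}
```

The point of the ordering is visible in the last line: a search on a taxonomy term finds a record whose own text never contains that term, because the suggested terms were stored alongside the text at indexing time.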

Taxonomy Search* on Lucene
- Auto-completion using the taxonomy
- Guide the user by applying various semantic relationships
- Navigate the full taxonomy "tree"
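
Auto-completion against a taxonomy amounts to a prefix lookup over the term list. The sketch below shows one common way to do that with a sorted map, where all terms sharing a prefix sit next to each other. It is an illustration only: `TaxonomyAutoCompleteSketch` and the sample terms are hypothetical, and the slides do not say how Search Harmony implements the feature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical prefix completion over taxonomy terms using a sorted map. */
final class TaxonomyAutoCompleteSketch {

    // Lower-cased term -> display form. Sorted keys make each prefix a contiguous range.
    private final TreeMap<String, String> terms = new TreeMap<>();

    void addTerm(String term) {
        terms.put(term.toLowerCase(Locale.ROOT), term);
    }

    /** Returns up to 'limit' terms that start with what the user has typed so far. */
    List<String> complete(String typed, int limit) {
        String prefix = typed.toLowerCase(Locale.ROOT);
        List<String> suggestions = new ArrayList<>();
        // Start at the first key >= prefix and stop at the first key outside the prefix range.
        for (Map.Entry<String, String> entry : terms.tailMap(prefix, true).entrySet()) {
            if (suggestions.size() >= limit || !entry.getKey().startsWith(prefix)) {
                break;
            }
            suggestions.add(entry.getValue());
        }
        return suggestions;
    }

    public static void main(String[] args) {
        TaxonomyAutoCompleteSketch taxonomy = new TaxonomyAutoCompleteSketch();
        for (String term : List.of("Biology", "Biochemistry", "Botany", "Chemistry")) {
            taxonomy.addTerm(term);
        }
        System.out.println(taxonomy.complete("bi", 10));
    }
}
```

Suggesting only taxonomy terms means whatever the user picks is a term the records were indexed with, which is the link between auto-completion and the semantic enrichment described earlier.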

Direct link to e-commerce to improve sales
- Link search and taxonomy directly to the supply of documents, or redirect to a shopping cart

Lucene / Solr EFFECT
- Lucene search for the NICEM and MediaSleuth web sites
- More items viewed, more items found, more orders
- Easy to implement taxonomy search
- Users find information faster
- Gave us the flexibility to do ON SAVE
- Multiple systems
- Semantics support contextual Google ads
- Web stats are up and increasing

Overview
- NICEM / Media Sleuth
- Data base creation:
  - Use XML and taxonomy
  - Automate the content semantic enrichment
  - High productivity achieved
- Lucene for search:
  - Semantically enhanced search
  - Cost effective, high accuracy
- Thank you Lucid!

Thank You!
Marjorie M.K. Hlava, President and Chairman
mhlava@accessinn.com
www.taxodiary.com - the taxonomy news blog
mmkhlava = twitter
mhlava = facebook, linkedin, eacademy, plaxo
Lamine Idjeraoui
lamine_Idjeraoui@accessinn.com
Access Innovations / Data Harmony
www.dataharmony.com
www.accessinn.com
505-998-0800

Sources
- www.nicem.com
- www.mediasleuth.com
Next: site search on other sites
- www.dataharmony.com
- www.accessinn.com
“Indispensable for anyone trying to identify instructional media for teaching.” – CHOICE Magazine

Data Harmony Architecture
[Diagram labels: M.A.I. Rule Bases; M.A.I. Concept Extractor; Auto Summarization; Entity Extractor; Novelty Detection; Search Software; Search Indexes; Thesaurus Master; Web Server; Data Harmony Administrative Module; Rules for Concept Extractor; subject terms; abstract; Dublin Core metadata; library OPAC; database system; bibliographic citation with abstract; Search Server; Web Portals; DH API; DH Concept Extraction System]
Inputs: web content; files, documents; databases; email, groupware, etc.
Taxonomies / ontology features: auto-completion; broader term; narrower term; related term; navigation tree; categorization; inline tagging; query expansion using rule base
Search features: fast indexing; massive data sets; incremental indexing; fast query speeds; search within results

XIS/Data Harmony
- Written in Java (Java plug-in installs automatically)
- Stores data in XML format
- Web Services or client-server
- Functions on any platform: Windows, NT, Mac, Unix, Linux, Solaris
- SaaS and ASP available
- Password-controlled access
