SlideShare una empresa de Scribd logo
1 de 17
Descargar para leer sin conexión
SHRINKING THE HAYSTACK WITH SOLR AND NLP
wes.caldwell@issinc.com

Wes Caldwell

@caldwellw

Chief Architect, Intelligent Software Solutions

http://linkd.in/1cfOR79
Topics
• 
• 
• 
• 
• 
• 
• 

Introduction to ISS, and our customer base
The data challenges our customers are facing
Our data processing pipeline (and how Solr and NLP fit in)
The document processing eco-system
Additional Solr features that we find useful
NLP techniques we use
Why we use multiple NLP techniques and how they complement
each other
•  Quick demo
About ISS
§ 

Headquartered	
  in	
  Colorado	
  Springs	
  
§ 

§ 

Other	
  offices	
  located	
  in	
  Washington	
  DC,	
  Hampton	
  
VA,	
  Tampa	
  FL,	
  and	
  Rome	
  NY	
  

InnovaEve	
  SoluEons	
  from	
  “Space	
  to	
  Mud	
  and	
  
Everything	
  Between”	
  
Sole	
  prime	
  on	
  mulEple	
  Air	
  Force	
  Research	
  
Labs	
  programs	
  IDIQ	
  
§  Currently	
  ExecuEng	
  More	
  Than	
  100	
  
SoSware	
  Development	
  Projects	
  
§  Over	
  800	
  employees	
  	
  
§  Strength	
  in	
  SoluEons	
  Development	
  and	
  
Deployment	
  
§ 

§ 

Consistently	
  Recognized	
  as	
  a	
  Leader	
  

Recognized	
  as	
  a	
  DeloiXe	
  Fast	
  50	
  Colorado	
  
company	
  and	
  a	
  DeloiXe	
  Fast	
  500	
  company	
  
over	
  eight	
  consecuEve	
  years	
  
§  Three-­‐Eme	
  Inc.	
  Magazine	
  500	
  winner	
  
§  2009	
  Defense	
  Company	
  of	
  the	
  Year	
  
§ 
The data challenge
• 

• 
• 

Most electronic information is not relational, but unstructured
(textual, binary) or semi-structured (spreadsheet, RSS feed).
–  In 2007, the estimated information content of all human
knowledge was 295 exabytes (295 million terabytes)
–  Data production will be 44 times greater in 2020 than in 2009
•  Approximately 35 zetabytes total (35 billion terabytes)
–  A majority of the data produced in the future will be
unstructured
Unstructured data is easily processed by human beings, but is
more difficult for machines.
A tremendous amount of information and knowledge is dormant
within unstructured data.
Our customer’s data environment
•  Literally thousands of data sources/feeds from a variety of
strategic, national, and tactical sources
– 
– 
– 
– 
– 
– 

Media (documents, images, etc.)
Human Interactions
Geospatial
Open Source
Imagery/Video
Many more…
How our analysts feel
The need
• 
• 
• 

Analysts are looking to extract knowledge from the massive
heterogeneous data sets, providing “actionable intelligence”
Search and NLP techniques are key enablers to allow an analyst to
reliably search for the information they know about, and to assist them
in discovering the information they don’t know about
It is critical (especially in tactical environments) to provide tools to the
analyst that allow them to “shrink the haystack” to a more digestible size,
and seed that information into an analytics pipeline, targeted at a
particular problem domain (e.g. C-Terrorism, C-Narcotics, etc.)
–  Time-to-live on the relevance of data collected can be very short
–  Its not about finding the needle in the haystack, its about giving a trained
analyst the tools to present the most relevant information in a timely manner,
allowing them to make an informed decision
Where our journey led us
Our approach
Content	
  AcquisiEon	
  

Search/Discovery	
  

SemanEc	
  Enrichment	
  

Data	
  PerspecEves	
  
Data	
  

GazeXeers	
  
Structured	
  Content	
  
NLP	
  Pipeline	
  

Content	
  Index	
  
Content	
  Cache	
  
(Haystacks)	
  

Semi-­‐Structured	
  
Content	
  

Un-­‐Structured	
  
Content	
  

Tenets	
  
• 
• 
• 
• 

Connector	
  architecture	
  
Data	
  normalizaEon	
  
Data	
  staging	
  
Data	
  CompartmenEng	
  
(MulEple	
  Haystacks)	
  

Tenets	
  

•  OpEmized	
  Index	
  of	
  
Content	
  for	
  Search	
  and	
  
Discovery	
  of	
  Big	
  Data	
  
•  Analyst	
  Topics	
  that	
  “Shrink	
  
the	
  Haystack”	
  
•  Advanced	
  Search	
  Features	
  
(Facets,	
  Auto-­‐Complete,	
  
Tagging,	
  Comments,	
  etc.)	
  
•  SemanEc	
  (Synonym)	
  
Search	
  based	
  on	
  pluggable	
  
taxonomies	
  

Tenets	
  

CategorizaEon	
  
Named	
  
EnEty	
  
RecogniEon	
  
Clustering	
  

•  “Domain	
  Spaces”	
  that	
  
support	
  pluggable	
  enEty	
  
recogniEon	
  and	
  
categorizaEon	
  
•  ConEnuous	
  feedback	
  loop	
  
that	
  improves	
  the	
  system	
  
over	
  Eme	
  with	
  analyst	
  
input	
  
•  Lexicon-­‐based	
  analyEcs	
  
that	
  allows	
  for	
  targeted	
  
categorizaEon	
  across	
  
corpus	
  of	
  data	
  

Tenets	
  

•  Data	
  ReducEon	
  into	
  
focused	
  “Data	
  
Perspec<ves”	
  
•  Data	
  perspecEves	
  
stored	
  in	
  op<mized	
  
formats	
  (e.g.	
  Graph,	
  
Time	
  Series,	
  Geo,	
  etc.)	
  
for	
  the	
  quesEons	
  being	
  
asked	
  
•  Leveraging	
  industry-­‐
standard	
  parallel	
  
processing	
  frameworks	
  
for	
  scalable	
  analyEcs	
  
Document Processing Pipeline Eco-System

Content	
  
Management	
  	
  

Text	
  
ExtracEon	
  

Named-­‐EnEty	
  
RecogniEon	
  

GeospaEal	
  
Tagging	
  

Clustering/	
  
ClassificaEon	
  

Indexing	
  
Additional Solr features that we find useful
• 

• 

Synonym (aka “Semantic Search” to us)
–  Allows us to load in pre-defined hierarchical synonym sets(driven by lexicons) to provide
search that is tuned for a particular customer domain
•  For example, a search for “weapon” finds various gun types (AK-47, M-16)
–  Currently implemented at index time
–  Simple feature to implement, but has proven very powerful as a “practical analytic”
Geospatial resolution (used in NLP pipeline)
–  Loaded GeoNames dataset into a separate Solr core
–  Allows for quick lookups in geospatial entity resolution
•  e.g. resolving “Paris” to latitude/longitude based geo-coordinate
–  Can boost based on general rules, or customer-specific ones
•  For example, which “Paris” is it? The one in France or Texas?
–  Population could be the boost parameter that returns Paris, France over Paris,
Texas
•  Allows us to easily override for local conditions
–  For example, if a customer wants all geo resolution to be focused in a
particular region of the world (i.e. their AOR)
NLP techniques we use
• 

• 

• 

Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline
–  Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML
techniques
–  Starting to use UIMA-AS (Asynch) to help in scaling out various pipeline steps
–  Abstracts vendor-specific NLP engine details, hence allowing you to plug in
different implementations without much disruption
GATE/Gazetteer approach
–  Essentially Dictionaries containing key terms used for categorization (facets)
–  Can have n number of “categories” that are generic, as well as customer domain
defined
OpenNLP/Supervised Machine Learning approach
–  “Context aware” models that are trained by data scientists/SMEs
–  Based on probabilistic theory (Maximum Entropy)
Why use both NLP approaches?
• 
• 

• 

• 

Both approaches have their pro/cons
Gazetteer approach
–  Pros
•  Good precision – you are going to find what is important to you
•  Simple for analyst to “tune” - does not require a data scientist
•  Quick and easy to add new categories to a problem domain
–  Cons
•  Only as good as the gazetteer
•  Not context aware
Supervised Machine Learning approach
–  Pros
•  Once properly trained, good at finding new concepts in context
–  Cons
•  Requires a data scientist/SME to produce quality models
•  Can be tedious to train
Bottom-line – A combined approach helps you find the things you know are relevant, and also helps you find
things that are relevant that you may not know about
Additional information
• 
• 
• 
• 
• 
• 
• 

Apache Jackrabbit - http://jackrabbit.apache.org/
UIMA - http://uima.apache.org/
GATE - http://gate.ac.uk/
OpenNLP - http://opennlp.apache.org/
Boilerpipe - https://code.google.com/p/boilerpipe/
Apache Tika - http://tika.apache.org/
Geonames - http://www.geonames.org/
Demo
Questions?

Más contenido relacionado

Similar a Shrinking the haystack wes caldwell - final

Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Lucidworks (Archived)
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
Salford Systems
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
S. Diana Hu
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
Artificial Intelligence Institute at UofSC
 

Similar a Shrinking the haystack wes caldwell - final (20)

II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
Machine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search EngineMachine Learned Relevance at A Large Scale Search Engine
Machine Learned Relevance at A Large Scale Search Engine
 
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Expert system (unit 1 &amp; 2)
Expert system (unit 1 &amp; 2)Expert system (unit 1 &amp; 2)
Expert system (unit 1 &amp; 2)
 
Systems analysis and design
Systems analysis and designSystems analysis and design
Systems analysis and design
 
Info 2402 irt-chapter_2
Info 2402 irt-chapter_2Info 2402 irt-chapter_2
Info 2402 irt-chapter_2
 
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
 
Webinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's NewWebinar: Fusion 3.1 - What's New
Webinar: Fusion 3.1 - What's New
 
Creating a Project Plan for a Data Warehouse Testing Assignment
Creating a Project Plan for a Data Warehouse Testing AssignmentCreating a Project Plan for a Data Warehouse Testing Assignment
Creating a Project Plan for a Data Warehouse Testing Assignment
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
Chris Brew - TR Discover: A Natural Language Interface for Exploring Linked D...
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Chatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine LearningChatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine Learning
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 

Más de lucenerevolution

Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
lucenerevolution
 

Más de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 

Último

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
ssuserdda66b
 

Último (20)

Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 

Shrinking the haystack wes caldwell - final

  • 1.
  • 2. SHRINKING THE HAYSTACK WITH SOLR AND NLP wes.caldwell@issinc.com Wes Caldwell @caldwellw Chief Architect, Intelligent Software Solutions http://linkd.in/1cfOR79
  • 3. Topics •  •  •  •  •  •  •  Introduction to ISS, and our customer base The data challenges our customers are facing Our data processing pipeline (and how Solr and NLP fit in) The document processing eco-system Additional Solr features that we find useful NLP techniques we use Why we use multiple NLP techniques and how they complement each other •  Quick demo
  • 4. About ISS §  Headquartered  in  Colorado  Springs   §  §  Other  offices  located  in  Washington  DC,  Hampton   VA,  Tampa  FL,  and  Rome  NY   InnovaEve  SoluEons  from  “Space  to  Mud  and   Everything  Between”   Sole  prime  on  mulEple  Air  Force  Research   Labs  programs  IDIQ   §  Currently  ExecuEng  More  Than  100   SoSware  Development  Projects   §  Over  800  employees     §  Strength  in  SoluEons  Development  and   Deployment   §  §  Consistently  Recognized  as  a  Leader   Recognized  as  a  DeloiXe  Fast  50  Colorado   company  and  a  DeloiXe  Fast  500  company   over  eight  consecuEve  years   §  Three-­‐Eme  Inc.  Magazine  500  winner   §  2009  Defense  Company  of  the  Year   § 
  • 5. The data challenge •  •  •  Most electronic information is not relational, but unstructured (textual, binary) or semi-structured (spreadsheet, RSS feed). –  In 2007, the estimated information content of all human knowledge was 295 exabytes (295 million terabytes) –  Data production will be 44 times greater in 2020 than in 2009 •  Approximately 35 zetabytes total (35 billion terabytes) –  A majority of the data produced in the future will be unstructured Unstructured data is easily processed by human beings, but is more difficult for machines. A tremendous amount of information and knowledge is dormant within unstructured data.
  • 6. Our customer’s data environment •  Literally thousands of data sources/feeds from a variety of strategic, national, and tactical sources –  –  –  –  –  –  Media (documents, images, etc.) Human Interactions Geospatial Open Source Imagery/Video Many more…
  • 8. The need •  •  •  Analysts are looking to extract knowledge from the massive heterogeneous data sets, providing “actionable intelligence” Search and NLP techniques are key enablers to allow an analyst to reliably search for the information they know about, and to assist them in discovering the information they don’t know about It is critical (especially in tactical environments) to provide tools to the analyst that allow them to “shrink the haystack” to a more digestible size, and seed that information into an analytics pipeline, targeted at a particular problem domain (e.g. C-Terrorism, C-Narcotics, etc.) –  Time-to-live on the relevance of data collected can be very short –  Its not about finding the needle in the haystack, its about giving a trained analyst the tools to present the most relevant information in a timely manner, allowing them to make an informed decision
  • 10. Our approach Content  AcquisiEon   Search/Discovery   SemanEc  Enrichment   Data  PerspecEves   Data   GazeXeers   Structured  Content   NLP  Pipeline   Content  Index   Content  Cache   (Haystacks)   Semi-­‐Structured   Content   Un-­‐Structured   Content   Tenets   •  •  •  •  Connector  architecture   Data  normalizaEon   Data  staging   Data  CompartmenEng   (MulEple  Haystacks)   Tenets   •  OpEmized  Index  of   Content  for  Search  and   Discovery  of  Big  Data   •  Analyst  Topics  that  “Shrink   the  Haystack”   •  Advanced  Search  Features   (Facets,  Auto-­‐Complete,   Tagging,  Comments,  etc.)   •  SemanEc  (Synonym)   Search  based  on  pluggable   taxonomies   Tenets   CategorizaEon   Named   EnEty   RecogniEon   Clustering   •  “Domain  Spaces”  that   support  pluggable  enEty   recogniEon  and   categorizaEon   •  ConEnuous  feedback  loop   that  improves  the  system   over  Eme  with  analyst   input   •  Lexicon-­‐based  analyEcs   that  allows  for  targeted   categorizaEon  across   corpus  of  data   Tenets   •  Data  ReducEon  into   focused  “Data   Perspec<ves”   •  Data  perspecEves   stored  in  op<mized   formats  (e.g.  Graph,   Time  Series,  Geo,  etc.)   for  the  quesEons  being   asked   •  Leveraging  industry-­‐ standard  parallel   processing  frameworks   for  scalable  analyEcs  
  • 11. Document Processing Pipeline Eco-System Content   Management     Text   ExtracEon   Named-­‐EnEty   RecogniEon   GeospaEal   Tagging   Clustering/   ClassificaEon   Indexing  
  • 12. Additional Solr features that we find useful •  •  Synonym (aka “Semantic Search” to us) –  Allows us to load in pre-defined hierarchical synonym sets(driven by lexicons) to provide search that is tuned for a particular customer domain •  For example, a search for “weapon” finds various gun types (AK-47, M-16) –  Currently implemented at index time –  Simple feature to implement, but has proven very powerful as a “practical analytic” Geospatial resolution (used in NLP pipeline) –  Loaded GeoNames dataset into a separate Solr core –  Allows for quick lookups in geospatial entity resolution •  e.g. resolving “Paris” to latitude/longitude based geo-coordinate –  Can boost based on general rules, or customer-specific ones •  For example, which “Paris” is it? The one in France or Texas? –  Population could be the boost parameter that returns Paris, France over Paris, Texas •  Allows us to easily override for local conditions –  For example, if a customer wants all geo resolution to be focused in a particular region of the world (i.e. their AOR)
  • 13. NLP techniques we use •  •  •  Leverage Unstructured Information Management Architecture (UIMA) for NLP pipeline –  Supports analysis engines for both GATE/Gazetteer and OpenNLP/SML techniques –  Starting to use UIMA-AS (Asynch) to help in scaling out various pipeline steps –  Abstracts vendor-specific NLP engine details, hence allowing you to plug in different implementations without much disruption GATE/Gazetteer approach –  Essentially Dictionaries containing key terms used for categorization (facets) –  Can have n number of “categories” that are generic, as well as customer domain defined OpenNLP/Supervised Machine Learning approach –  “Context aware” models that are trained by data scientists/SMEs –  Based on probabilistic theory (Maximum Entropy)
  • 14. Why use both NLP approaches? •  •  •  •  Both approaches have their pro/cons Gazetteer approach –  Pros •  Good precision – you are going to find what is important to you •  Simple for analyst to “tune” - does not require a data scientist •  Quick and easy to add new categories to a problem domain –  Cons •  Only as good as the gazetteer •  Not context aware Supervised Machine Learning approach –  Pros •  Once properly trained, good at finding new concepts in context –  Cons •  Requires a data scientist/SME to produce quality models •  Can be tedious to train Bottom-line – A combined approach helps you find the things you know are relevant, and also helps you find things that are relevant that you may not know about
  • 15. Additional information •  •  •  •  •  •  •  Apache Jackrabbit - http://jackrabbit.apache.org/ UIMA - http://uima.apache.org/ GATE - http://gate.ac.uk/ OpenNLP - http://opennlp.apache.org/ Boilerpipe - https://code.google.com/p/boilerpipe/ Apache Tika - http://tika.apache.org/ Geonames - http://www.geonames.org/
  • 16. Demo