Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
Lucid Imagination, Inc.
Topics
What is an Intelligent Application?
Examples
I’ve heard of Lucene/Solr, but what else can I use?
Mahout
OpenNLP
Others? UIMA, Weka, Mallet, MinorThird, etc.
Building Blocks
Tying it all together
What is an Intelligent Application?
I favor a loose definition
Evolving as techniques get better
General Characteristics:
Embraces fuzziness and uncertainty by:
• Learning from past behavior and adapting
• Leveraging the masses while incorporating the personal
Provide Content Insight
• Organize vast quantities of data into consumable chunks
• Encourage Serendipity
Does what users want, even if they don’t know it yet, without turning them off
Caveats
I’m mostly interested in applications where:
Unstructured text is a component
• i.e. I’m not building a next-gen video game
Users interact via text, clicks, etc.
• Typing in queries
• Browsing links, reading ads/content, etc.
Some of these tools are useful for other applications too
Consider the topics here a toolkit; not all apps need all features
Examples
http://www.netflix.com
Amazon
http://www.fancast.com
Yahoo
Apache Open Source Players
Lucene/Solr
http://lucene.apache.org
Mahout
http://mahout.apache.org
UIMA
http://uima.apache.org
Nutch
http://nutch.apache.org
Tika
http://tika.apache.org
Hadoop
http://hadoop.apache.org
ManifoldCF
http://incubator.apache.org/connectors
Other Open Source Players
OpenNLP (ASL)
http://opennlp.sourceforge.net
-> Incubator?
Carrot2 (BSD)
http://project.carrot2.org/
MALLET (CPL)
http://mallet.cs.umass.edu/
Weka (GPL)
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Building Blocks (diagram): Content, Users, Acquisition, Extraction, Language Analysis, Domain Knowledge, Search, Relationships, Discovery/Guides/Organization, User History, User Profile/Model, Context, Aggregating Analysis, Adaptation
Building Blocks: Acquisition and Extraction
Garbage in, garbage out
Acquisition:
Nutch
Solr Data Import Handler
ManifoldCF
Extraction
Tika (PDFBox, POI, etc.)
Building Blocks: Language Analysis
Basics:
Morphology, Tokenization, Stemming/Lemmatization, Language Detection…
Lucene has extensive support and is pluggable
Intermediate:
Phrases, Part of Speech, Collocations, Shallow Parsing…
Lucene, Mahout, OpenNLP
Advanced:
Concepts, Sentiment, Relationships, Deep Parsing…
Machine Learning tools like Mahout
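The "Basics" tier above can be illustrated with a toy analysis chain. This is a minimal sketch in plain Python, not Lucene's analyzer API: the stop list, the suffix rules, and the function names `tokenize`, `strip_suffix`, and `analyze` are all assumptions made for illustration, and the suffix stripping is far cruder than a real stemmer.

```python
import re

# Assumed, deliberately tiny stop list for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def strip_suffix(token):
    """Crude stemming stand-in: drop one common English suffix if enough remains."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Chain the steps: tokenize, drop stop words, then stem what is left."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOPWORDS]


if __name__ == "__main__":
    print(analyze("The indexing of documents"))
```

Each step is a separate function so that one stage can be swapped without touching the others, which is the same idea behind a pluggable analysis chain.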
Building Blocks: Domain Knowledge
You, Your Business, Your Requirements
Focus groups
Examples:
Synonyms, taxonomies
Genre (sublanguage: jargon, abbreviations, etc.)
Content relationships (explicit and implicit links)
Metadata: location, time, authorship, content type
Tools:
Tika, Machine Learning tools like Mahout
Building Blocks: Search
Search is often the interface through which users interact with a system
Doesn’t require users to explicitly type in keywords
Sometimes a search need not be a search
Less frequently used capabilities become more important:
Pluggable Query Parsing
Spans/Payloads
Terms, TermVectors
Lucene/Solr can actually stand in for many of the higher (organizational) layers
Building Blocks: Organization/Discovery
Organization
Classification
• Named Entity Extraction
Clustering
• Collection
• Search Results
Topic Modeling
Summarization
• Document
• Collection
Discovery/Guidance
Faceting/Clusters
Auto-suggest
Did you mean?
Related Searches
More Like This
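A "Did you mean?" suggestion can be sketched with nothing more than string similarity against the terms already in the index. The example below is an illustrative assumption, not how Solr's spell checking is implemented: it uses Python's standard `difflib.get_close_matches`, and the function name `did_you_mean`, the vocabulary list, and the 0.6 cutoff are all hypothetical.

```python
import difflib


def did_you_mean(query, vocabulary, cutoff=0.6):
    """Return the closest known term, or None if the query is known or nothing is close."""
    if query in vocabulary:
        return None  # nothing to correct
    matches = difflib.get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    terms = ["lucene", "solr", "mahout"]  # assumed index vocabulary
    print(did_you_mean("lucen", terms))
```

In practice the vocabulary would come from the indexed terms or from past queries, so suggestions only point at things the collection can actually answer.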
Building Blocks: Relationships
Harness multilevel relationships
Within documents: phrases/collocations, co-reference resolution, anaphora; even sentences and paragraphs have relationships
Doc <-> Doc:
• Explicit: links, citations, etc.
• Implicit: shared concepts/topics
User <-> Doc:
• Read/Rated/Reviewed/Shared…
User <-> User
• Explicit: Friend, Colleague, Reports to, friend of friend
• Implicit: email, Instant Msg, asked/answered question
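The User <-> Doc relationship above is enough to derive implicit Doc <-> Doc links: two documents are related if the same users read both. The sketch below counts those co-occurrences and uses them for a simple item-item recommendation. It is a stand-in written in plain Python, not Mahout's recommender API; the data layout (`user_docs` mapping a user to the documents they read) and the function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_docs):
    """Count, for each pair of documents, how many users read both."""
    counts = defaultdict(lambda: defaultdict(int))
    for docs in user_docs.values():
        for a, b in combinations(sorted(set(docs)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, user_docs, top_n=3):
    """Suggest unseen documents that co-occur most with what the user already read."""
    seen = set(user_docs[user])
    counts = cooccurrence(user_docs)
    scores = defaultdict(int)
    for doc in seen:
        for other, n in counts[doc].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked][:top_n]


if __name__ == "__main__":
    history = {
        "ann": ["d1", "d2"],
        "bob": ["d1", "d2", "d3"],
        "cat": ["d2", "d3"],
        "dan": ["d1"],
    }
    print(recommend("dan", history))
```

Raw co-occurrence counts favor popular documents; a real system would normally weight or normalize them, which this sketch leaves out.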
Building Blocks: Users
History
Saved Searches -> Deeper analysis -> Alerts
Profile
Likes/Dislikes
Location
Roles
Enhance/Restrict Queries, personalize scoring/ranking/recommendations
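Personalized ranking can be as simple as re-scoring search hits against the user's profile. The following is a minimal sketch under stated assumptions: hits arrive as `(doc_id, score, topic)` tuples, the profile is a set of liked topics, and the multiplicative `boost` of 1.5 is an arbitrary illustrative value. None of these names come from Lucene or Solr.

```python
def personalize(hits, likes, boost=1.5):
    """Re-rank (doc_id, score, topic) hits, boosting topics the user likes."""
    rescored = [
        (doc_id, score * (boost if topic in likes else 1.0), topic)
        for doc_id, score, topic in hits
    ]
    # Sort by the adjusted score, best first.
    return sorted(rescored, key=lambda hit: -hit[1])


if __name__ == "__main__":
    hits = [("a", 1.0, "sports"), ("b", 0.8, "search")]
    for doc_id, score, topic in personalize(hits, likes={"search"}):
        print(doc_id, topic, round(score, 2))
```

The same profile could instead restrict the query (for example, filtering by role or location) rather than only adjusting scores.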
Building Blocks: Aggregating Analysis
You’re an engineer: do you know what’s in your production logs?
Log analysis
Who, what, when, where, why?
Hadoop, Pig, Mahout etc.
Classification/Clustering
Label/Group users based on their actions
• Power users, new users, etc.
• Mahout and other Machine Learning techniques
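Labeling users from their actions can start with a simple count over the query log before any machine learning is involved. The sketch below assumes a hypothetical tab-separated log format (`timestamp<TAB>user<TAB>query`) and an arbitrary threshold of three queries for a "power" user; both are illustrative choices, and at scale the same aggregation would run as a Hadoop or Pig job rather than in memory.

```python
from collections import Counter


def label_users(log_lines, power_threshold=3):
    """Label each user 'power' or 'new' based on how many queries they issued."""
    counts = Counter(
        line.split("\t")[1]  # second field is the user id in the assumed format
        for line in log_lines
        if line.strip()
    )
    return {
        user: ("power" if n >= power_threshold else "new")
        for user, n in counts.items()
    }


if __name__ == "__main__":
    log = [
        "t1\tann\tlucene",
        "t2\tann\tsolr",
        "t3\tbob\tmahout",
        "t4\tann\ttika",
    ]
    print(label_users(log))
```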
Adaptation
Automated
Retrain models based on user interactions on a regular basis
Manual
Lessons learned incorporated over time
Tying it Together
Key Extension Points
Analyzer Chain
UpdateProcessor
Request Handler
SearchComponent
QParser(Plugin)
Event Listeners
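The idea behind an update processor chain is that each document passes through a sequence of small enrichment steps before it is indexed. The sketch below models that idea in plain Python; it is not Solr's Java plugin API. The two processors (`detect_names`, `label_topic`) are hypothetical stand-ins for real name detection and a trained classifier.

```python
def detect_names(doc):
    """Hypothetical enrichment step: treat capitalized tokens as candidate names."""
    doc["names"] = [t for t in doc["text"].split() if t[:1].isupper()]
    return doc


def label_topic(doc):
    """Hypothetical classifier stand-in: a keyword rule instead of a trained model."""
    doc["topic"] = "search" if "search" in doc["text"].lower() else "other"
    return doc


def run_chain(doc, processors):
    """Pass a copy of the document through each processor in order."""
    enriched = dict(doc)  # shallow copy so the caller's document is not modified
    for process in processors:
        enriched = process(enriched)
    return enriched


if __name__ == "__main__":
    raw = {"id": "1", "text": "Lucene powers search at Apache"}
    print(run_chain(raw, [detect_names, label_topic]))
```

Because each step only adds fields, steps can be reordered or removed independently, which is what makes the chain a convenient extension point.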
Example
http://github.com/gsingers/ApacheCon2010
Work-in-Progress Proof of Concept
Wikipedia dataset
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
Index, classify, cluster, recommend
Indexing
Document
• Request Handler
Update Proc. Chain
• Bayes Update Request Processor
• UIMA (SOLR-2129)
Update Handler
IndexWriter
Analysis
• NameFilter
• Payloads
• Sentence Det.
• Parsing
New Searcher Event
• Cluster Collection
Searching
Query
• Request Handler
Query Comp
• QParser (SOLR-1337)
• Analysis
• Spans
• DocList/Set
• Spatial
Clustering Comp.
• Carrot2
• Mahout
Suggestions
• Spell Checking
• Auto Suggest
• Related Searches (SOLR-2080)
Recommendations
• Item-Item
Results
Resources
Handles
@gsingers
grant@lucidimagination.com
http://blog.lucidimagination.com
http://lucene.grantingersoll.com
Taming Text by Grant Ingersoll, Thomas Morton and Drew Farris
http://lucene.li/1c
Code: apachecon2010


Editor's notes

  1. Do what users expect; go beyond just UI design. Early days of the Amazon recommender
  2. In fact, Gmail shows 3 examples
  3. On the users side, I won’t go into too much detail about things like history, profile, modeling
  4. In my day to day experience, this seemingly mundane task is where you will spend a good amount of time
  5. You can build recommenders, classifiers, clustering, etc. on Lucene/Solr