Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
Lucid Imagination, Inc.
Topics
What is an Intelligent Application?
Examples
I’ve heard of Lucene/Solr, but what else can I use?
Mahout
OpenNLP
Others? UIMA, Weka, Mallet, MinorThird, etc.
Building Blocks
Tying it all together
What is an Intelligent Application?
I favor a loose definition
Evolving as techniques get better
General Characteristics:
Embraces fuzziness and uncertainty by:
• Learning from past behavior and adapting
• Leveraging the masses while incorporating the personal
Provides content insight:
• Organize vast quantities of data into consumable chunks
• Encourage Serendipity
Do what users want even if they don’t know it yet, but don’t turn
them off either
Caveats
I’m mostly interested in applications where:
Unstructured text is a component
• i.e. I’m not building a next-gen video game
Users interact via text, clicks, etc.
• Typing in queries
• Browsing links, reading ads/content, etc.
Some of these tools are useful for other applications too
Consider the topics here to be a toolkit; not all apps need all
features
Examples
http://www.netflix.com
Amazon
http://www.fancast.com
Yahoo
Apache Open Source Players
Lucene/Solr
http://lucene.apache.org
Mahout
http://mahout.apache.org
UIMA
http://uima.apache.org
Nutch
http://nutch.apache.org
Tika
http://tika.apache.org
Hadoop
http://hadoop.apache.org
ManifoldCF
http://incubator.apache.org/connectors
Other Open Source Players
OpenNLP (ASL)
http://opennlp.sourceforge.net
-> Incubator?
Carrot2 (BSD)
http://project.carrot2.org/
MALLET (CPL)
http://mallet.cs.umass.edu/
Weka (GPL)
http://www.cs.waikato.ac.nz/~ml/weka/index.html
Building Blocks
[Diagram of the building blocks: Content, Users, Acquisition, Extraction, Language Analysis, Domain Knowledge, Search, Relationships, Discovery/Guides/Organization, User History, User Profile/Model, Context, Aggregating Analysis, Adaptation]
Building Blocks: Acquisition and Extraction
Garbage In Garbage Out
Acquisition:
Nutch
Solr Data Import Handler
ManifoldCF
Extraction
Tika (PDFBox, POI, etc.)
Building Blocks: Language Analysis
Basics:
Morphology, Tokenization, Stemming/Lemmatization, Language
Detection…
Lucene has extensive support, and its analysis is pluggable
Intermediate:
Phrases, Part of Speech, Collocations, Shallow Parsing…
Lucene, Mahout, OpenNLP
Advanced:
Concepts, Sentiment, Relationships, Deep Parsing…
Machine Learning tools like Mahout
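The basic and intermediate layers above can be illustrated with a small, library-free sketch. It is not how Lucene or OpenNLP implement analysis: `tokenize`, `crude_stem`, and `bigram_counts` are hypothetical helper names, and the suffix stripper is a deliberately naive stand-in for a real stemmer. It only shows the shape of the pipeline: tokenize, normalize, then count adjacent pairs as collocation candidates.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (illustrative only, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words survive intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def bigram_counts(tokens):
    """Count adjacent token pairs as simple collocation candidates."""
    return Counter(zip(tokens, tokens[1:]))


text = "Search engines index documents. Search engines rank indexed documents."
tokens = [crude_stem(t) for t in tokenize(text)]
pairs = bigram_counts(tokens)
```

Pairs that recur, such as the stemmed form of "search engines" here, are the ones a collocation detector would go on to score statistically.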
Building Blocks: Domain Knowledge
You, Your Business, Your Requirements
Focus groups
Examples:
Synonyms, taxonomies
Genre (sublanguage: jargon, abbreviations, etc.)
Content relationships (explicit and implicit links)
Metadata: location, time, authorship, content type
Tools:
Tika, Machine Learning tools like Mahout
Building Blocks: Search
Search is often the interface through which users interact
with a system
Doesn’t require explicit typing in of keywords
Sometimes a search need not be a search
Less frequently used capabilities become more important:
Pluggable Query Parsing
Spans/Payloads
Terms, TermVectors
Lucene/Solr can actually stand in for many of the higher
layers (organizational)
Building Blocks: Organization/Discovery
Organization
Classification
• Named Entity Extraction
Clustering
• Collection
• Search Results
Topic Modeling
Summarization
• Document
• Collection
Discovery/Guidance
Faceting/Clusters
Auto-suggest
Did you mean?
Related Searches
More Like This
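As a rough illustration of the "Did you mean?" item above, the sketch below suggests the closest term from a vocabulary using Python's standard `difflib`. It assumes the vocabulary comes from the terms already in the index; the `did_you_mean` function and the sample vocabulary are made up for this example, and Solr's own spell checking works differently.

```python
import difflib


def did_you_mean(word, vocabulary, cutoff=0.6):
    """Return the closest known term, or None if the word is known or nothing is close."""
    word = word.lower()
    if word in vocabulary:
        return None  # Already a known term, so no suggestion is needed.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabulary, standing in for terms pulled from an index.
vocabulary = ["lucene", "mahout", "hadoop", "solr", "tika", "nutch"]
suggestion = did_you_mean("lucen", vocabulary)
```

In practice the candidate list would also be weighted by term frequency, so that common terms win over rare ones with the same similarity.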
Building Blocks: Relationships
Harness multilevel relationships
Within documents: phrases/collocations, co-reference resolution,
anaphora, even sentences, paragraphs have relationships
Doc <-> Doc:
• Explicit: links, citations, etc.
• Implicit: shared concepts/topics
User <-> Doc:
• Read/Rated/Reviewed/Shared…
User <-> User
• Explicit: Friend, Colleague, Reports to, friend of friend
• Implicit: email, Instant Msg, asked/answered question
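One way to see how these levels connect: implicit Doc <-> Doc relationships can be derived from User <-> Doc interactions by counting how often two documents are read by the same user. The sketch below is a minimal co-occurrence count under that assumption; the user names, document ids, and function names are all hypothetical, and a production system would more likely use a library such as Mahout.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_docs):
    """Count, for each pair of documents, how many users interacted with both."""
    counts = defaultdict(lambda: defaultdict(int))
    for docs in user_docs.values():
        for a, b in combinations(sorted(set(docs)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related(counts, doc, n=2):
    """Return up to n documents most often co-read with doc (ties broken by id)."""
    neighbors = counts.get(doc, {})
    return sorted(neighbors, key=lambda d: (-neighbors[d], d))[:n]


# Hypothetical User <-> Doc "read" relationships.
user_docs = {
    "alice": ["lucene-intro", "solr-faceting", "mahout-clustering"],
    "bob": ["lucene-intro", "solr-faceting"],
    "carol": ["solr-faceting", "tika-extraction"],
}
counts = cooccurrence(user_docs)
top = related(counts, "solr-faceting")
```

The same table doubles as a very simple item-item recommender: given the document a user is reading, suggest its strongest neighbors.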
Building Blocks: Users
History
Saved Searches -> Deeper analysis -> Alerts
Profile
Likes/Dislikes
Location
Roles
Enhance/Restrict Queries, personalize
scoring/ranking/recommendations
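A sketch of what "enhance/restrict queries" might look like against Solr: the user's roles become a filter query (`fq`) and their likes become boost queries (`bq`) for the edismax parser. The parameter names `q`, `fq`, `bq`, and `defType` are standard Solr request parameters; the field names `allowed_role` and `topic`, the profile layout, and the boost factor are assumptions made up for this example.

```python
from urllib.parse import urlencode


def personalized_params(user_query, profile):
    """Build Solr request parameters that restrict and boost by user profile."""
    params = [("q", user_query), ("defType", "edismax")]
    roles = profile.get("roles", [])
    if roles:
        # Restrict: one filter query matching any of the user's roles.
        params.append(("fq", "allowed_role:(" + " OR ".join(roles) + ")"))
    for topic in profile.get("likes", []):
        # Enhance: boost documents about topics the user likes.
        params.append(("bq", f"topic:{topic}^2"))
    return params


# Hypothetical profile; in a real system this would come from the user store.
profile = {"roles": ["engineering", "support"], "likes": ["mahout"]}
params = personalized_params("clustering", profile)
query_string = urlencode(params)  # Ready to append to a /select request URL.
```

Filtering changes which documents can match, while boosting only reorders them, so a restriction belongs in `fq` and a preference in `bq`.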
Building Blocks: Aggregating Analysis
You’re an engineer. Do you know what’s in your production
logs?
Log analysis
Who, what, when, where, why?
Hadoop, Pig, Mahout etc.
Classification/Clustering
Label/Group users based on their actions
• Power users, new users, etc.
• Mahout and other Machine Learning techniques
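As a toy version of labeling users from their actions, the sketch below counts events per user in a log and applies a fixed threshold. The log format (timestamp, user, action, detail), the threshold, and the `label_users` name are assumptions for illustration; at real log volumes this is the kind of counting that Hadoop or Pig would do, with Mahout replacing the threshold by a learned model.

```python
from collections import Counter


def label_users(log_lines, power_threshold=3):
    """Label each user by how many logged actions they performed."""
    # Assumes each line looks like: "<timestamp> <user> <action> <detail>".
    counts = Counter(line.split()[1] for line in log_lines if line.strip())
    return {
        user: ("power user" if n >= power_threshold else "new user")
        for user, n in counts.items()
    }


# Hypothetical log excerpt.
log_lines = [
    "2010-11-03T10:00:01 alice search q=lucene",
    "2010-11-03T10:00:09 alice click doc=42",
    "2010-11-03T10:01:30 bob search q=solr",
    "2010-11-03T10:02:12 alice search q=mahout",
]
labels = label_users(log_lines)
```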
Adaptation
Automated
Retrain models based on user interactions on a regular basis
Manual
Lessons learned incorporated over time
Tying it Together
Key Extension Points
Analyzer Chain
UpdateProcessor
Request Handler
SearchComponent
QParser(Plugin)
Event Listeners
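To make the idea of an extension point concrete, here is a conceptual sketch of an update processor chain: each step receives a document, enriches it, and hands it to the next step before indexing. This is plain Python, not Solr's Java plugin API; the function names are hypothetical, and the keyword rule merely stands in for a real classifier.

```python
def add_category(doc):
    """Tag the document with a category (a keyword rule standing in for a classifier)."""
    text = doc.get("body", "").lower()
    doc["category"] = "search" if "lucene" in text or "solr" in text else "other"
    return doc


def add_length(doc):
    """Record the number of whitespace-separated tokens in the body."""
    doc["body_length"] = len(doc.get("body", "").split())
    return doc


def run_chain(doc, processors):
    """Pass a copy of the document through each processor in order."""
    doc = dict(doc)  # Leave the caller's document untouched.
    for process in processors:
        doc = process(doc)
    return doc


original = {"id": "1", "body": "Solr builds on Lucene"}
enriched = run_chain(original, [add_category, add_length])
```

The point of the pattern is that enrichment happens once, at index time, so the added fields are available to every later query.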
Example
http://github.com/gsingers/ApacheCon2010
Work-in-Progress Proof of Concept
Wikipedia dataset
http://people.apache.org/~gsingers/wikipedia/enwiki-20070527-pages-articles.xml.bz2
Index, classify, cluster, recommend
Indexing
Document
• Request Handler
Update Proc. Chain
• Bayes Update Request Processor
• UIMA (SOLR-2129)
Update Handler
IndexWriter
Analysis
• NameFilter
• Payloads
• Sentence Det.
• Parsing
New Searcher Event
• Cluster Collection
Searching
Query
• Request Handler
Query Comp
• QParser (SOLR-1337)
• Analysis
• Spans
• DocList/Set
• Spatial
Clustering Comp.
• Carrot2
• Mahout
Suggestions
• Spell Checking
• Auto Suggest
• Related Searches (SOLR-2080)
Recommendations
• Item-Item
Results
Resources
Handles
@gsingers
grant@lucidimagination.com
http://blog.lucidimagination.com
http://lucene.grantingersoll.com
Taming Text by Grant Ingersoll, Thomas Morton and Drew
Farris
http://lucene.li/1c
Code: apachecon2010

Editor's notes

  1. Do what users expect. Go beyond just UI design. Early days of the Amazon recommender.
  2. In fact, the Gmail example shows 3 examples.
  3. On the users side, I won’t go into too much detail about things like history, profile, and modeling.
  4. In my day-to-day experience, this seemingly mundane task is where you will spend a good amount of time.
  5. You can build recommenders, classifiers, clustering, etc. on Lucene/Solr.