OpenSearchLab and the Lucene Ecosystem
1. OpenSearchLab and Lucene
Grant Ingersoll
Chief Scientist @LucidWorks
Member and Committer, Apache Software Foundation
Co-Founder, Apache Mahout
2. Hats
I’m here as an individual who happens to contribute (and commit)
to Lucene, Solr, Mahout and other open source projects.
I don’t officially represent the ASF or even Lucene/Solr/Mahout.
3. Topics
• Openness
• What are some OpenSearchLab (OSL) needs?
• The Lucene Ecosystem
• Lucene for Research?
• A Sample Architecture
4. Putting the Open in OpenSearchLab
• Open Development >> Open Source
• Open community
• Open corpora
• Open evaluations
• Open Research
• w/o being onerous
http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
7. “An ecosystem is a community of living organisms in conjunction
with the nonliving components of their environment interacting
as a system.”
– Wikipedia
Code
Committers
Contributors
ASF
Users
8. The ASF and ASL
• ASF == Apache Software Foundation
– Volunteer-based, but many are paid to work on open source by their
employer
– Community Over Code
• Consensus-driven development
– Meritocracy
• “Those who do, make the decisions”
– 100+ Top Level Projects
– Infrastructure to support projects
– “The Apache Way”
• ASL == Apache Software License (v2), formally the Apache License, Version 2.0
ASL ≠ ASF
9. Lucene Community
• In a nutshell: Large, Active Community
• 30+ committers, many, many more contributors
• (Tens of?) Thousands of Practitioners
• Thousands of production instances
– Twitter, Apple, IBM Watson, LinkedIn, Netflix,
Commercial Search Engines, …
– “… they frequently turn to real-time search: our
system serves over two billion queries a day, with an
average query latency of 50 ms. Usually, tweets are
searchable within 10 seconds after creation.” --
EarlyBird, Busch et al.
11. Apache Lucene
• Flagship Java library for building search applications
– Indexing, Searching, Language Analysis
• Powers apps large and small the world over
• More in Apache Lucene 4 talk later
• Fast, small footprint
• Lots of useful related modules
– Highlighting, Joins, Spatial, etc.
• http://lucene.apache.org/core
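The three tasks named on this slide (language analysis, indexing, searching) can be illustrated with a toy inverted index. This is a minimal sketch and not Lucene's API: `analyze`, `TinyIndex` and the sample documents are hypothetical names chosen for illustration, and the analyzer only lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class TinyIndex:
    """Hypothetical in-memory inverted index (illustration only, not Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def add(self, doc_id, text):
        """Index a document: analyze it and record each term's posting."""
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return sorted ids of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add(1, "Lucene is a search library")
index.add(2, "Solr is a search server built on Lucene")
index.add(3, "Hadoop stores large files")
print(index.search("lucene search"))
```

A production engine adds ranking, on-disk segment storage and far richer analysis; the sketch only shows how analysis feeds the term-to-document mapping that search then intersects.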
12. Apache Solr
• Search server built using Lucene and HTTP
• Faceting, highlighting, most Lucene features,
easy admin
• Highly Extensible
• Scalable (query volume and index size)
• Lucene Best Practices
• http://lucene.apache.org/solr
13. Apache Hadoop
• Originally built for Nutch to solve large-scale
crawling problems
• Distributed File System and Computation Model
– HDFS and MapReduce, YARN coming
• Common Use Cases: storage, log analysis, ETL
• http://hadoop.apache.org
14. Apache Nutch
• Web-scale crawler and search built on
Lucene/Solr and Hadoop
• Link analysis (aka PageRank)
• Plugin framework
• Parsers for common document formats (PDF,
Word, HTML, etc.)
• http://nutch.apache.org
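The link analysis bullet refers to PageRank-style scoring. The following is a minimal power-iteration sketch, not Nutch's implementation: the function name `pagerank`, the damping value of 0.85, the fixed iteration count and the three-page sample graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

At web scale the same iteration is typically expressed as a distributed batch job over the link graph, which is where Hadoop comes in.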
16. Apache Tika
• Toolkit for detecting and extracting content from
MIME types
• Support for many common file formats
– Office, PDF, HTML, etc.
• Intuitive API (think SAX parser)
• Wraps best of breed open source extractors
• Plug in your own
• http://tika.apache.org
17. Apache OpenNLP
• Supports common NLP tasks
– NER, POS tagging, Chunking, Parsing, Coreference
resolution
• MaxEnt and Perceptron based
– Working to make the machine learning pluggable
• Some Multilingual support
• New life at the ASF
• Related: cTakes, Stanbol
18. Other Useful Tools
• Apache ZooKeeper – Distrib. Coordination
• Apache Pig – Hadoop scripting w/o Java
• Apache HBase/Accumulo/Cassandra –
BigTable/Dynamo
• Avro and Protobufs – Serialization
frameworks
• Netty: Server framework – easy to add
protocols and to scale
• Stanbol – Semantic Content Management
using Solr, OpenNLP, others
• UIMA – Unstructured Info Management
19. LUCENE CAN HAS RESEARCH?
• Dispelling a few misconceptions:
– No such thing as Lucene OOTB
– Lucene ≠ Solr
• Researchers are welcome!
– Large audience and many domains
– http://wiki.apache.org/lucene-java/HowToContribute
– Battle-tested code
– Speed v. Quality tradeoffs
http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
20. Research/Contribution Areas
• Work with the community to do evaluations
• Scoring
– BM25, LM, IM, DFR and others already implemented
– Easy to add your own
• Codecs
– Extensible compression/storage
– Many already implemented approaches and more coming
– SimpleText FTW!
• Others:
– Faceting, auto-suggest, spell-checking, highlighting,
expansion and more
– Different domains: machine generated data, mobile,
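As a concrete picture of what a scoring function in this list computes, here is a minimal sketch of Okapi BM25 over pre-tokenized documents. It is a standalone illustration rather than Lucene's similarity API: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the non-negative IDF variant and the sample documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant: rarer terms get a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score well.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["lucene", "search"],
    ["hadoop", "storage"],
    ["lucene", "lucene", "search", "engine"],
]
scores = bm25_scores(["lucene"], docs)
print(scores)
```

Swapping in a different ranking model amounts to replacing the body of the inner loop, which is the kind of experiment a pluggable scoring layer is meant to make cheap.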
21. Abstract OSL Architecture
[Diagram: Clients; Access APIs; Search View (Shard 1, Shard 2, ... Shard n); Personalization & Machine Learning; Users/Admin/Other; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet)]
Keys
• Service-Oriented Architecture
• Stateless
• Failover/Fault Tolerant
• Glue is lightweight
• Smart about updates
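The sharded search view in this architecture can be sketched as a scatter-gather query: documents are routed to one of n shards, each shard answers with its own top hits, and the caller merges them. This is a toy, single-process sketch under stated assumptions: `Shard`, `ShardedIndex`, the CRC32 routing and the raw term-count scoring are hypothetical and stand in for real distributed services.

```python
import heapq
import zlib


class Shard:
    """Hypothetical shard: holds tokenized docs and scores by raw term count."""

    def __init__(self):
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text.lower().split()

    def search(self, term, k):
        """Return this shard's top-k (score, doc_id) pairs for one term."""
        hits = ((tokens.count(term), doc_id) for doc_id, tokens in self.docs.items())
        return heapq.nlargest(k, (h for h in hits if h[0] > 0))


class ShardedIndex:
    """Routes documents to shards and merges per-shard results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _route(self, doc_id):
        # Stable hash so the same id always lands on the same shard.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def add(self, doc_id, text):
        self.shards[self._route(doc_id)].add(doc_id, text)

    def search(self, term, k=3):
        # Scatter the query to every shard, then merge the partial top-k lists.
        partial = [hit for shard in self.shards for hit in shard.search(term.lower(), k)]
        return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


idx = ShardedIndex(3)
idx.add("d1", "search search search")
idx.add("d2", "search engine")
idx.add("d3", "storage layer")
idx.add("d4", "search search")
print(idx.search("search", k=2))
```

Because each shard keeps no query state, any replica of a shard could answer the same request, which is what makes the stateless and failover keys on the slide compatible with this layout.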
22. Lucene Ecosystem Implementation
[Diagram: Clients; Access APIs; Search View (Shard 1, Shard 2, ... Shard n); Personalization & Machine Learning; Users/Admin/Other; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet)]
Keys
• Service-Oriented Architecture
• Stateless
• Failover/Fault Tolerant
• Glue is lightweight
• Smart about updates
23. Takeaways
• Open Development >> Open Source >> Shared
Source
– Corollary: You never know where good ideas are
coming from
• ASF is a proven model for collaboration
• Lucene ecosystem: extensive, production ready
• Lucene 4 is viable for IR algorithms and data
structure research
• OSL (IMO) needs a services-based, pluggable
architecture
24. Resources
• Getting Started
– {Lucene|Mahout|Hadoop} In Action
– Taming Text
• grant@lucidworks.com
• @gsingers
• http://www.lucidworks.com
Editor's Notes
Shared source, visible source and BDFL are not open source. Open DEVELOPMENT is far more powerful.
Anyone can be a “researcher”: Jack Andraka. His study resulted in over 90 percent accuracy and showed his patent-pending sensor to be 28 times faster, 28 times less expensive and over 100 times more sensitive than current tests. Jack received the Gordon E. Moore Award, of $75,000, named in honor of the Intel co-founder and retired chairman and CEO. You never know where the next good idea is coming from.
Open corpora: anyone anywhere should be able to download and run evaluations. If Common Crawl can do it, why can’t we? iBiblio, ASF, others can likely help.
How can we build, leverage and share an open evaluation framework? How do we leverage the Internet? Crowdsourcing? Dynamic nature of content, engines, community, users, etc.? Can we time slice experiments on a real system?
Open Research: how do we encourage open methodology, open process, publications, etc. without being heavy-handed?
Community will be the single most important piece.
Bottom up and top down are both needed to establish a community.
https://en.wikipedia.org/wiki/Ecosystem
Most people have this pyramid backwards.
The ASF has a well-developed community model that has been proven out over time.
Committers: many are paid to work on Lucene full time.
Images: Commits: Ohloh, Traffic: lucene.markmail.org