As organisations store more and more information in their Alfresco content hubs, search and discovery of content become important. Alfresco comes bundled with Apache Lucene and Apache Solr for search. Although these provide full-text capabilities, they do not have the scalability and functionality of the newer cloud-scalable search software such as Apache Solr Cloud 4, Elastic Search and Amazon Cloud Search. Also, searching across multiple Alfresco instances, including Alfresco Cloud, is quite a challenge, and none of the possible approaches is good enough to be production ready.
This talk shows you how to index and search content stored in one or more Alfresco repositories, other CMIS repositories or file systems using either Apache Solr Cloud 4, Elastic Search or Amazon Cloud Search, while still ensuring the confidentiality of the documents based on the permissions configured in Alfresco or any other repositories.
1. #SummitNow
Super Size Your Search
6th November 2013
Piergiorgio Lucidi (Sourcesense)
Fran Alvarez (Zaizi)
2. #SummitNow#SummitNow
Piergiorgio Lucidi
• Open Source ECM Specialist at Sourcesense
• Alfresco Certified Trainer / Engineer
• Alfresco Wiki Gardener / Community Star
• Alfresco forum supporter
• Global Moderator of the Italian forum
• Author and Technical Reviewer at Packt
• PMC Member and Mentor at ASF
• Project Leader in the JBoss Community
5. #SummitNow#SummitNow
Scenario - Alfresco limitations
Alfresco supports these search engines:
• Apache Lucene (embedded)
• Apache Solr (provided by Alfresco)
• Development is needed if other repositories must be involved
Every other approach must be implemented (ScheduledActions, WebScripts, etc.)
6. #SummitNow#SummitNow
Scenario – Embedded
Simple Search Architecture
Alfresco is the only repository involved in the architecture, using the embedded search engine:
• the repository must also take care of the indexes, including managing index transactions
(Diagram: FrontEnd applications, Alfresco, Apache Lucene, Indexes)
7. #SummitNow#SummitNow
Scenario – Embedded - Cluster
Not easy to scale out with Lucene
1. Every cluster node must have its own search indexes
2. The cluster must synchronize the indexes
(Diagram: two Alfresco nodes, each with Apache Lucene and its own Indexes, linked by JGroups)
8. #SummitNow#SummitNow
Scenario – Simple Architecture
Simple search architecture
Alfresco is the only repository involved in the architecture, with an external search server
1. The search server can be used to publish content in the front-end architecture
2. The repository stays in the logical back end
(Diagram: Alfresco, Search Engine, Indexes, FrontEnd applications)
9. #SummitNow#SummitNow
Scenario – Publish with search
A search engine can be used for:
• advanced management of search indexes
• scaling out
• executing complex searches on content
• publishing content in the FE architecture
10. #SummitNow#SummitNow
Scenario – Publish with search
Publish with search architecture
Alfresco is the only repository involved in the architecture, with an external search server
1. The search server can be used to publish content in the front-end architecture (HTML)
2. The repository stays in the logical back end
(Diagram labels: BackEnd, FrontEnd, Alfresco, Lucene / Solr, Indexes, Search Engine, Indexes, FrontEnd applications)
12. #SummitNow#SummitNow
Scenario – Complex Architecture
1. Alfresco is only one of the platforms that must be involved in your search architecture
2. You don’t want to increase the development effort
3. You just want something to configure
13. #SummitNow#SummitNow
Scenario – Complex Architecture
Architecture with different ECM systems
Alfresco is one of the content platforms that must be involved in the indexing process
(Diagram: Alfresco, SharePoint, FileNet, CMIS, JIRA, Google Drive, DropBox, Search Engine, Indexes)
17. #SummitNow#SummitNow
Apache ManifoldCF - History
The ManifoldCF code base was granted by MetaCarta to the Apache Software Foundation in December 2009.
The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments.
The project graduated to an Apache Top Level Project in July 2012.
18. #SummitNow#SummitNow
Apache ManifoldCF – What is it?
Open Source crawler
• crawling model (add, change, delete)
• schedules jobs to create indexes
• gets content from repositories
• pushes content to search servers
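The crawling model above can be illustrated with a minimal sketch. It is not ManifoldCF code: `ChangeEvent`, `InMemorySearchServer` and `run_job` are hypothetical names, and the search server is just a dictionary. It only shows how add, change and delete events coming from a repository might be pushed to every configured search server.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    """One crawl event (hypothetical type); kind is 'add', 'change' or 'delete'."""
    kind: str
    doc_id: str
    content: str = ""


@dataclass
class InMemorySearchServer:
    """Stand-in for a real search server: maps document id to indexed content."""
    index: dict = field(default_factory=dict)

    def apply(self, event: ChangeEvent) -> None:
        if event.kind in ("add", "change"):
            # Adds and changes both become an upsert of the latest content.
            self.index[event.doc_id] = event.content
        elif event.kind == "delete":
            # Deleting a document that was never indexed is harmless.
            self.index.pop(event.doc_id, None)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")


def run_job(events, servers):
    """Push every crawl event to every configured search server."""
    for event in events:
        for server in servers:
            server.apply(event)
```

A job run with the events add, change, add, delete leaves each server holding only the latest version of the documents that still exist.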
19. #SummitNow#SummitNow
Apache ManifoldCF – What is it?
(Diagram: Repositories 1 to 4, Apache ManifoldCF, Search Servers 1 to 4)
20. #SummitNow#SummitNow
Apache ManifoldCF – What is it?
Out of the box it is distributed as a web application
• REST API
• Authority Service
• ACL indexes
• Crawler UI
It can also be embedded in any Java application
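As a rough illustration of the REST API bullet, the sketch below only builds request URLs and does not contact a server. The base URL (port 8345 and the `/mcf-api-service/json/` path) and the resource names `jobs` and `jobstatuses` are assumptions taken from the ManifoldCF example deployment and should be checked against the documentation of the installed version; `api_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Assumed base URL of the ManifoldCF API service in the example deployment;
# adjust host, port and path for a real installation.
DEFAULT_BASE = "http://localhost:8345/mcf-api-service/json"


def api_url(resource, item_id=None, base=DEFAULT_BASE):
    """Build the URL of an API resource, optionally for one item of it."""
    url = base.rstrip("/") + "/" + resource
    if item_id is not None:
        # Identifiers are percent-encoded so they are safe inside a path.
        url += "/" + quote(str(item_id), safe="")
    return url


# A GET on the first URL would list jobs, a GET on the second would return
# the status of a single job (assumed resource names, see lead-in).
list_jobs_url = api_url("jobs")
job_status_url = api_url("jobstatuses", 1383753600000)
```

Keeping URL construction separate from the HTTP call makes it easy to point the same client at another ManifoldCF instance by changing only `base`.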
22. #SummitNow#SummitNow
ManifoldCF – Why? - Reliability
Job scheduling and configuration are stored in the database to maintain the state of all executions
(Diagram: Repositories 1 to 4, Apache ManifoldCF with Pull Agent Daemon and Database, Search Servers 1 to 4)
23. #SummitNow#SummitNow
ManifoldCF – Why? - Incremental
Content changesets are obtained from the repository API
(Diagram: Apache ManifoldCF, with Pull Agent Daemon and Database, queries Repository 1 and receives complete changesets)
24. #SummitNow#SummitNow
ManifoldCF – Why? - Flexible
If the repository can't supply all the changes, ManifoldCF can discover them through crawling
(Diagram: Apache ManifoldCF, with Pull Agent Daemon and Database, queries the repository, receives incomplete changesets and completes them through change discovery)
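Change discovery can be sketched as a comparison between what was seen in the previous crawl and what is seen now. This is an illustrative sketch only, assuming that each document exposes some version string (for example a modification timestamp); `discover_changes` is a hypothetical function, not part of ManifoldCF.

```python
def discover_changes(previous, current):
    """Compare two crawls and report what has to be indexed or removed.

    Both arguments map a document id to a version string (for example a
    modification timestamp). Returns three sorted lists: added, changed
    and deleted document ids.
    """
    added = sorted(doc for doc in current if doc not in previous)
    deleted = sorted(doc for doc in previous if doc not in current)
    changed = sorted(
        doc for doc in current
        if doc in previous and current[doc] != previous[doc]
    )
    return added, changed, deleted
```

Only the documents in the three lists need to be fetched or removed, so a re-crawl stays incremental even when the repository cannot report its own changes.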
25. #SummitNow#SummitNow
ManifoldCF – Why? – Multi repo
Jobs can retrieve content from the following repositories:
• Google Drive
• Dropbox
• HDFS
• CMIS-compliant
• Alfresco
• IBM FileNet
• EMC Documentum
• Microsoft SharePoint
• OpenText LiveLink
• Autonomy Meridio
• Memex Patriarch
• Windows Share/DFS
• Generic JDBC
• Generic Filesystem
• Generic RSS and Web
26. #SummitNow#SummitNow
ManifoldCF – Why? – Multi repo
Jobs can ingest content into the following search servers:
• Apache Solr
• ElasticSearch
• OpenSearchServer
• MetaCarta GTS
27. #SummitNow#SummitNow
ManifoldCF – Why? - Security
Retrieve per-content ACLs
(Diagram: Repositories 1 to 4, Apache ManifoldCF, Search Servers 1 to 4; the Authority Service obtains access tokens from Authority 1 and Authority 2)
28. #SummitNow#SummitNow
ManifoldCF – Why? - Security
Retrieve per-content ACLs
(Diagram: Repositories 1 to 4, Apache ManifoldCF, Search Servers 1 to 4; the Authority Service, backed by Authority 1 and Authority 2, supplies user access tokens, which yield user-specific search results)
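The idea of user-specific search results can be sketched as a filter over indexed documents. The sketch assumes that each document was indexed with the access tokens taken from its ACL, split into allow and deny tokens, and that the user's tokens come from an authority service; `IndexedDoc` and `visible_docs` are hypothetical names and the token strings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """A document as stored in the index, with the tokens from its ACL."""
    doc_id: str
    allow_tokens: frozenset
    deny_tokens: frozenset = frozenset()


def visible_docs(docs, user_tokens):
    """Return the ids of the documents the user is allowed to see.

    A document is visible when the user holds at least one of its allow
    tokens and none of its deny tokens (deny wins over allow).
    """
    tokens = set(user_tokens)
    return [
        doc.doc_id
        for doc in docs
        if tokens & doc.allow_tokens and not tokens & doc.deny_tokens
    ]
```

In a real deployment this check would normally run inside the search server as a query filter, so that counts and facets also reflect only the permitted documents.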
29. #SummitNow#SummitNow
ManifoldCF – Why? – Monitoring
The Crawler UI allows you to:
• configure jobs and connectors
• monitor job execution
• monitor content ingestion
• status reports
  • document status
  • queue status
• history reports
  • simple history
  • maximum activity
  • maximum bandwidth
  • result histogram
39. #SummitNow#SummitNow
ManifoldCF - Resources
The project is available at
http://manifoldcf.apache.org/
From this website you can access the mailing lists, the documentation and the download links for binaries and source.
40. #SummitNow#SummitNow
ManifoldCF – Resources - Book
ManifoldCF in Action
by Karl Wright
published by Manning
Karl is the original developer and the principal committer of Apache ManifoldCF
The book is available at
http://www.manning.com/wright
42. #SummitNow#SummitNow
Fran Alvarez
• Director of Zaizi Iberia and Lead Architect
• Alfresco Certified Engineer
• Responsible for large Alfresco architectures
• Semantic Consultant for Sensefy
• Alfresco Meetups Organizer
43. #SummitNow#SummitNow
Alfresco + Solr Approach
Quite a good architecture
• Performance issues are solved
• Different architectures depending on business requirements
However…
• It does not cover some use cases or scenarios
• It does not leverage Cloud benefits or latest technologies
• With huge data volumes, other approaches are available
How can we overcome the limitations and enhance the benefits?
44. #SummitNow#SummitNow
Alfresco + Solr Approach
• Decouples the search solution from Alfresco
• Allows different search solutions to be implemented
• Allows the search solution to be changed without changing anything in Alfresco
  • Not even a property!
• Provides an API to integrate it with Alfresco as the search engine
  • Even with other repository vendors! E.g. Filesystem, SharePoint, Documentum, FileNet, Drupal…
• Preserves security permissions in the results
  • Alfresco permissions are indexed and used during search
It’s included in our Semantic solution: Sensefy!
45. #SummitNow#SummitNow
What we’ve done in Manifold
Repository Connector:
• Alfresco Repository Connector: New implementation
  • Removes the dependency on the Alfresco Solr API
Output Connectors:
• Cloud Search Output Connector: Design & Development
• Elastic Search Output Connector: Improvements
• Solr Cloud Output Connector: Configuration for Alfresco
Authority Connector:
• Alfresco Authority Connector: Design & Development
  • Similar approach to Alfresco Solr
  • ACL reads for Users and Groups in Alfresco
47. #SummitNow#SummitNow
I: Several Alfresco instances
Current Approach:
• Each Alfresco instance has its own Search subsystem
• They can’t share indexes
Implications:
• Federated search is not an option
• Results can’t be merged
  • If they were, which result set should come first?
Conclusion
Results could be presented to users in different tabs or “manually” merged. Not the best approach.
48. #SummitNow#SummitNow
I: Several Alfresco instances
Zaizi Approach:
• Our solution acts as the search box
  • It manages a single index
Implications:
• All documents go to the same index
• Users can select results from either all Alfresco instances or a subset
Conclusion
Search across repositories
Could be based on Elastic Search, Solr Cloud, Amazon Cloud, etc.
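A single shared index with a per-instance filter can be sketched as follows. It is a toy example, not the Zaizi implementation: the index is a list of dictionaries, matching is a plain substring test, and the field names `id`, `source` and `text` as well as the function `search` are assumptions made for illustration.

```python
def search(index, term, sources=None):
    """Search one shared index, optionally restricted to some instances.

    index   -- list of dicts with the keys 'id', 'source' and 'text'
    term    -- text to look for (case-insensitive substring match)
    sources -- names of the Alfresco instances to include, or None for all
    """
    wanted = None if sources is None else set(sources)
    needle = term.lower()
    return [
        doc["id"]
        for doc in index
        if needle in doc["text"].lower()
        and (wanted is None or doc["source"] in wanted)
    ]
```

Because every document carries the name of the instance it came from, results arrive already merged and ranked by one engine, and selecting a subset of instances is just an extra filter.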
49. #SummitNow#SummitNow
II: Alfresco + Other data providers
Current Approach:
• Alfresco has its own Search subsystem
• Other repositories may or may not have their own Search subsystem
Implications:
• Different data providers mean different formats
  • E.g. Filesystem does not support CMIS
• Alfresco can’t reach external data
Conclusion
No way to merge results and present them uniformly to end users
50. #SummitNow#SummitNow
II: Alfresco + Other data providers
Zaizi Approach:
• Both Alfresco and other repositories share the Search subsystem (Manifold)
Implications:
• Results from Alfresco and other providers will have the same format in our solution
  • They will speak ‘our’ language
• Alfresco reaches external data when communicating with our solution
Conclusion
Results are available and accessible across data providers
51. #SummitNow#SummitNow
III: Alfresco + O(TB) data
Current Approach:
• Alfresco has its own Search subsystem
• All data is in one Solr instance (or several, if clustered)
Implications:
• Every Solr node manages the whole index
• No chance to apply scaling techniques for indexing:
  • Sharding, Replication…
Conclusion
Huge servers are required and performance might be compromised
52. #SummitNow#SummitNow
III: Alfresco + O(TB) data
Zaizi Approach:
• Alfresco uses our solution
• Data is indexed in the search solution that suits best:
  • Amazon Cloud, Solr Cloud, Elastic Search…
Implications:
• The Cloud Search solution manages the index
• Indexing techniques can be applied according to use cases
  • Sharding, Replication
Conclusion
A search strategy can be adopted and easily implemented with the search solution that fits best
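Sharding and replication can be pictured with a small routing sketch. It is illustrative only and does not reproduce the routing used by Solr Cloud, Elastic Search or Amazon Cloud Search: a CRC32 hash is assumed here simply because it is stable across runs, and `shard_for`, `placement` and the `shardN-replicaM` naming are hypothetical.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick the shard for a document from a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def placement(doc_id, num_shards, replicas_per_shard):
    """List every replica that should hold a copy of the document."""
    shard = shard_for(doc_id, num_shards)
    return [f"shard{shard}-replica{r}" for r in range(replicas_per_shard)]
```

Each node then manages only its own slice of the index, and queries can be spread over the replicas of each shard instead of hitting one huge server.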
53. #SummitNow#SummitNow
Apache Manifold: Other benefits
Can extract, index and map information from any other sources
• Apache Stanbol, RedLink, any other data enricher
• Our solution will gather everything in one place
  • Documents, entities…
Permissions are checked just once
• Everything is in the same place, even user authorization capabilities
• Performance and scalability are improved
• Faceted search and other search capabilities are combined with this permission feature
55. #SummitNow#SummitNow
Conclusions
Zaizi solution allows searching and indexing in the most popular Cloud Search solutions
• Other Search solutions can be integrated as well
Zaizi solution allows retrieving information from the most popular repositories
• Other data providers can be integrated too
• It solves plenty of current issues related to search and indexing in Alfresco
• Can be used outside Alfresco, or with Alfresco and any other data repository
Zaizi solution manages permissions and security from the most popular repositories and the latest Cloud search technologies
Fully supported by us!