(Re)Indexing Large Repositories
Agenda
• Alfresco SOLR in a Nutshell
• Indexing Process
• Indexing Scenarios
• When to Re-Index
• Deployment Alternatives
• Demo time without downtime!
• Benchmark Review
• Improvements in 1.4.2
• Future Improvements
• Recap
Alfresco SOLR in a Nutshell
SOLR 6 is used in Alfresco for two main processes:
• Indexing (or tracking) metadata, permissions, and content from the Alfresco Repository
• Returning results for search queries, supporting several query syntaxes (AFTS, CMIS)
Indexing process: asynchronous. SOLR indexes the information after the database has committed the transaction, so there is a short period of time when not all documents are available in the SOLR index. We call this eventual consistency, as SOLR will eventually catch up with the Repository.
Searching process: eventually consistent, supporting the AFTS and CMIS query syntaxes; permission checks are performed synchronously at query time.
Alfresco SOLR in a Nutshell
Alfresco SOLR Storage
By default two SOLR cores are created, one for live documents (alfresco) and one for deleted documents (archive).
Each core includes the following storage folders:
• Default SOLR Index files in the solrhome/<core>/index folder
• Alfresco customized Content Store in the contentstore folder
• This folder includes a cached copy of Repository content and metadata
• Content Store will be removed in Search Services 2.0
These folders are populated by the indexing process.
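Putting those folders together, the on-disk layout looks roughly like this (a sketch: the exact location of the contentstore folder can vary by distribution, so check your installation):

```
<solr home>/
├── alfresco/
│   └── index/        # default SOLR index files for the live core
├── archive/
│   └── index/        # default SOLR index files for the archive core
└── contentstore/     # cached copy of Repository content and metadata
```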
Indexing process
• Each tracker fires asynchronously according to a cron expression: alfresco.cron or alfresco.*.tracker.cron
• Transactions and ACL Change Sets are processed in batches of nodes or ACLs
• Batches are split to be executed in parallel by workers
• However, the Content Tracker retrieves text from content nodes one by one
• The Commit Tracker writes the changes from the different trackers to the SOLR index "eventually"
Note: the Cascade Tracker is not run when indexing from scratch
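For example, the tracking frequency can be tuned in solrcore.properties. The per-tracker property name below is only an illustration of the alfresco.*.tracker.cron pattern mentioned above; verify the exact names against your version's property reference:

```properties
# Fire all trackers every 10 seconds
alfresco.cron=0/10 * * * * ? *
# Override a single tracker (name is illustrative of the alfresco.*.tracker.cron pattern)
alfresco.content.tracker.cron=0/30 * * * * ? *
```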
Indexing scenarios
1. When updating the repository using applications or bulk ingestion processes, the transactions will include a long list of nodes to be indexed
2. When using Alfresco Share to create new content, there will be more transactions, but every transaction will include a small list of nodes to be indexed
3. When setting the permission level for every node in a hierarchy manually, the ACL Change Sets will include a long list of ACLs to be indexed
4. When using the default Alfresco permissions design, the ACL Change Sets will include a small list of ACLs to be indexed
5. When using complex document formats, the Transformation Service will require additional resources
6. When using large documents, the SOLR index will require additional storage
Indexing scenarios
Controlling what to index
• Content can be excluded from SOLR Index by configuration
solrcore.properties > alfresco.index.transformContent=false
https://docs.alfresco.com/search-community/concepts/solrcore-properties-file.html
• Some nodes can be excluded from SOLR Index by using the Index Control aspect
cm:indexControl > cm:isIndexed :: false, metadata and content are not indexed
cm:indexControl > cm:isContentIndexed :: false, content is not indexed
https://docs.alfresco.com/community/concepts/admin-indexes.html
• Some properties can be excluded from SOLR Index by design in the Content Model
<property>
  <index enabled="false"/>
</property>
https://docs.alfresco.com/community/references/dev-extension-points-content-model-define-and-deploy.html
Add this setting to the archive core by default!
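As a sketch, the cm:indexControl aspect can be applied through the Alfresco REST API (v1). The node id below is a placeholder, and actually sending the request requires a running repository, so this only builds the PUT payload; note too that updating aspectNames replaces the node's aspect list, so in practice you would include its existing aspects as well:

```python
import json

# Placeholder node id: replace with a real one from your repository
node_id = "00000000-0000-0000-0000-000000000000"
url = f"http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1/nodes/{node_id}"

# PUT body adding the Index Control aspect and disabling indexing for this node
payload = {
    "aspectNames": ["cm:indexControl"],
    "properties": {
        "cm:isIndexed": False,        # metadata and content are not indexed
        "cm:isContentIndexed": False, # content is not indexed
    },
}
body = json.dumps(payload)
print(body)
```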
When to re-index
The re-indexing process can take some time in large repositories, from a few hours to a few days.
Full re-index
• When upgrading to a major Search Services release, like 2.0
• When the SOLR Index has been corrupted for technical reasons
• When breaking changes are introduced in common custom Content Models
Partial re-index
• This process can also take some time, depending on the number of documents to be re-indexed, but less than a full re-index
• When incremental changes are introduced in a Content Model, a partial re-index can be fired using the SOLR REST API
http://localhost:8983/solr/admin/cores?action=reindex&query=TYPE:person
Installing alternatives
• Using the ZIP Distribution file
https://docs.alfresco.com/search-community/concepts/solr-install-config.html
• Using Docker or Docker Compose
https://github.com/Alfresco/SearchServices/tree/master/search-services
https://github.com/Alfresco/acs-community-deployment/tree/master/docker-compose
https://github.com/Alfresco/alfresco-docker-installer
• Using Kubernetes
https://github.com/Alfresco/acs-community-deployment/tree/master/helm/alfresco-content-services-community
Deployment for Re-Indexing
Deployment schema to minimize downtime in re-indexing processes:
• When using a different SOLR version, configure Alfresco Repository to use the new SOLR server *
• When using the same SOLR version, the INDEX folder can be used directly
* Upgrading from SOLR 4 to SOLR 6 is not allowed when using Alfresco CE 6.2.0-ga (thanks for raising this @AFaust) >> SEARCH-2289
When configuring an Alfresco node to perform the re-indexing process, there are some services you can switch off depending on your requirements:
• Scheduled Jobs can be disabled, as they will be run by the Alfresco instance in the live service
https://docs.alfresco.com/6.2/concepts/scheduled-jobs.html
• Some ACS features can be disabled
https://docs.alfresco.com/6.2/concepts/maincomponents-disable.html
• Additional subsystems (apart from Search or Transformation) can be disabled
https://docs.alfresco.com/6.2/concepts/subsystem-categories.html
• Activities
• Audit
• Email
• …
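As an illustration, some of these can be switched off in alfresco-global.properties on the indexing node; the property names below are my best-guess examples, so verify them against the documentation for your ACS version before relying on them:

```properties
# Best-guess sketch for a dedicated indexing node; verify names for your version
audit.enabled=false
activities.feed.notifier.enabled=false
system.usages.enabled=false
```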
Don't make a copy of your Alfresco Repository production configuration and press the start button!
Alfresco Repository Indexing Configuration
Monitoring
Profiling
• Using VisualVM or YourKit Java Profiler for the JVMs (Repository, SOLR)
• Using the pg_stat_statements extension or some other DB tool
https://hub.alfresco.com/t5/alfresco-content-services-blog/alfresco-6-profiling-with-docker/ba-p/295846
https://github.com/aborroy/alfresco-6-profiling
Monitoring
• Using Prometheus with Grafana (Repository, SOLR)
https://hub.alfresco.com/t5/alfresco-content-services-blog/monitoring-alfresco-solr-with-prometheus-and-grafana/ba-p/294157
https://github.com/aborroy/alfresco-solr-monitoring
Demo time without downtime!
(Search Services 2.0 is not released yet!)
• Live Docker Compose environment running with around 4,000 text documents indexed
• Using YourKit Java Profiler to monitor Repository performance
• Starting a new Search Services 2.0 server locally to start indexing the repository
• Once Search Services 2.0 has caught up, change the Solr hostname value from the Admin Web Console or modify alfresco-global.properties
http://127.0.0.1:8083/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json
http://127.0.0.1:8983/solr/alfresco/select?indent=on&q=TEXT:[* TO *]&wt=json
1.2 Billion Baseline Plan (2020)
• Customer-sponsored benchmark to measure the performance of the system with their configuration
• Goal: 1.2 billion documents indexed into Search Services
• 20 instances, each containing a single shard (DB_ID_RANGE based sharding)
Performance considerations
• Bottlenecks
• Database (getChildAssocs)
• Transformers (when using large documents)
• Network (when using large metadata/content)
• Time spent processing data for other shards
DB_ID_RANGE Sharding
• Does not require specifying total number of shards in advance
• Index can continue to grow with repository
See https://docs.alfresco.com/search-enterprise/concepts/solr-shard-approaches.html
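As a sketch, DB_ID_RANGE sharding is configured per shard in solrcore.properties; the range values below are illustrative only, so size them for your repository and check the sharding documentation linked above for the exact property semantics:

```properties
# Shard 0 indexes nodes whose database ids fall in this range (illustrative values)
shard.method=DB_ID_RANGE
shard.range=0-100000000
shard.instance=0
```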
Time spent processing transactions for other shards
• With DB_ID_RANGE sharding we know that only a range of transactions is relevant
• Skip transactions when using DB_ID_RANGE
• To support path queries we sometimes need to update data on multiple shards from a single change
• Option to disable cascade tracking
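The idea of skipping irrelevant transactions can be sketched as follows (a simplified model, not the actual tracker code; the range values are illustrative):

```python
# Simplified model: each transaction carries the min/max database ids of its nodes
SHARD_RANGE = range(0, 100_000_000)  # this shard's DB_ID_RANGE (illustrative)

def is_relevant(tx_min_id: int, tx_max_id: int) -> bool:
    """A transaction matters to this shard only if its id range overlaps ours."""
    return tx_min_id < SHARD_RANGE.stop and tx_max_id >= SHARD_RANGE.start

print(is_relevant(50, 120))                   # True: overlaps this shard
print(is_relevant(200_000_000, 200_000_099))  # False: belongs to a later shard
```

Transactions that fail this check can be discarded before any node metadata is fetched, which is what saves the database and network round trips.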
Reduce Database Access and Network Usage
• Reduce amount of data requested
• Remove unused calls to getChildAssocs
• Compress communication where appropriate
• Add option to compress content transfer
[Diagram: rather than asking the Repository for all metadata for a node, the tracker requests only the fields it needs, and the content transfer can be compressed]
Overview of Improvements in 1.4.2
• Search Services 1.4.2 (and Insight Engine 1.4.2)
• ACS Repository 6.2 Enterprise
• No ACS Community release containing this yet
• However, you can use an existing ACS with the jars from https://github.com/aborroy/solr-performance-services-repo
Reindex of 1.2b documents in 10 days (6 repo nodes, 20 search nodes):
• Search Services 1.3.0: 150 documents/second
• Search Services 1.4.2: 1200-3500 documents/second* (depending on the number of shards, size of documents, etc.)
* Depending on exact configuration (NB: not yet validated on the production system)
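As a quick sanity check on those figures, 1.2 billion documents in 10 days corresponds to an aggregate rate of roughly 1,400 documents per second, consistent with the per-configuration range above:

```python
docs = 1_200_000_000         # documents reindexed
seconds = 10 * 24 * 60 * 60  # 10 days in seconds
rate = docs / seconds
print(round(rate))           # 1389 documents/second overall
```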
Future Improvements - Coming in 2.0.0
• Schema Simplification
• Smaller index
• Removing Duplicate Fields
• Smaller communication
• Improved Trackers
• Less duplication with large transactions
• New tracker parallelism option
• Content Store Removal
• Reduced disk usage
• Less duplication
• Better usage of Solr optimisations
• Adds potential to use other Solr features
Improved Trackers - Testing
Scenario datasets
• 100,000 documents created with 100,000 transactions
• 100,000 documents created with 1 transaction
• Changing the path for 100,000 documents
• 200,000 ACLs created with 200,000 ACL change sets
Parameters investigated
• The existing *BatchSize parameters
• The new *MaxParallelism parameters
• These change the number of workers assigned to the tracker. They use a ForkJoinPool and can impact the resources available to other processes
Improved Trackers - Testing
Hotspot calculation
• Increasing the Transaction Batch Size for nodes and ACLs has an impact until the optimal number for your deployment is reached. After that, you can keep increasing this batch size but there will be no performance change
• Increasing the Node Batch Size improves performance while you are below the right number for your deployment. After that, increasing this batch size penalises performance
• Increasing the maximum number of Parallel Threads improved performance until the maximum number for our deployment was reached. However, in a real-world deployment it may be useful to use a lower number to avoid impacting other processes
[Chart: tracker run duration (ms) for each configuration]
Content Store Removal
• Solr Content store removal will reduce disk usage and simplify replication
The Solr Content Store
Replication of index across Solr nodes
Recap
When to re-index
• When upgrading to major Search Services releases
How to re-index
• Run some small tests to verify the performance of the indexing process before running it in production
• Index from scratch with the upgraded Repository
• Index in a parallel deployment
How to measure
• Profiling
• Monitoring