3. Agenda
1 - General best practices on tuning
2 - Common mistakes
3 - Sizing, what to expect from a single server
4 - Solr tuning
5 - JVM tuning (Part 2)
6 - Caches (Part 2)
7 - Alfresco is running slow… where to start? (Part 2)
4. 1 - General Best Practices on Tuning
Disable unused services and features
• Disable virtual file-systems
• cifs.enabled=false, ftp.enabled=false
• webdav.enabled=false, nfs.enabled=false, imap.enabled=false
• Disable thumbnails and document previews
• system.thumbnail.generate=false
• Disable the Share web preview (in share-config-custom.xml)
<config evaluator="string-compare" condition="DocumentDetails" replace="true">
<document-details>
<!-- display web previewer on document details page -->
<display-web-preview>false</display-web-preview>
</document-details>
</config>
• Disable replication
• replication.enabled=false
• transferservice.receiver.enabled=false
5. 1 - General Best Practices on Tuning
Disable unused services and features (cont.)
• Disable cloud-sync features
• syncService.mode=OFF
• sync.mode=OFF
• sync.pullJob.enabled=false
• sync.pushJob.enabled=false
• Disable user quotas
• system.usages.enabled=false
• system.usages.clearBatchSize=0
• Disable eager creation of home folders
• home.folder.creation.eager=false
• Disable activities feed
• activities.feed.notifier.enabled=false
• activities.feed.cleaner.enabled=false
• activities.post.cleaner.enabled=false
6. 1 - General Best Practices on Tuning
Golden Rules for the Repository
• Limit the group hierarchy to 5 levels (nested groups)
• Use an inheritance-based permission model
• Limit the maximum number of nodes in a folder
• Keep some control over the number of sites: do you really need 10,000 sites?
• Keep a low ratio of group memberships per user
• Limit the depth of the folder hierarchy
7. 2 – Common Mistakes
• Not keeping extended configurations and customizations separate in the shared
directory. Do not put them in the configuration root. If you do, you will lose them
during upgrades.
• Not testing the backup strategy.
• Insufficient Monitoring, Insufficient troubleshooting tools
• Making changes to the system without testing them thoroughly on a test and
pre-production machine first.
• Forgetting to adjust the system sizing for increased users and sessions
• Increase the database connection pool
• Tune maxThreads on Tomcat
• Tune the JVM and GC
• Benchmarks and Stress tests
• Using a shared database with other applications
• Network/Infrastructure constraints
• Not following the SPM
8. 2 – Common Mistakes
• Customizations / Custom Code mistakes
• Not closing search result sets in try…catch…finally blocks (memory leaks)
• Incorrect usage of policies/behaviors (collisions, poor code quality)
• Using lower case versions of Alfresco beans
• Direct access to the database (should use Spring and the existing DAOs)
• Usage of private APIs
9. 2 – Common Mistakes
• Customizations / Custom Code mistakes
• Using TransactionService directly instead of RetryingTransactionHelper
• Not using the CMIS query language with SearchService
• Improper exception handling
10. 3 – Sizing - what to expect from a single server
Let's assume we're running Alfresco on a single server with the following hardware:
Red Hat Linux 64-bit, 16 GB RAM, 2 quad-core CPUs at 3.2 GHz, local SSD disk.
We will have 3 web applications running on the same JVM and container (i.e. Tomcat):
• Alfresco Repository
• Alfresco Share UI
• Solr
According to our internal benchmarks, and highly dependent on the specifics of each
use case, this server should be able to handle 200 concurrent users or up to
2000 casual users.
11. 3 – Sizing - what to expect from a single server
The following factors will affect sizing and architecture:
• Use Case
• Concurrent users
• Document types, sizes and distribution ratios
• Architecture (virtualization? failover? replication? integrations? component stack?)
• Authority structure
• Operations
• Components, Protocols and Apis
• Batch operations
• Response times requirements
13. 3 – Sizing: Divide and Conquer
• Know when, where and what processes are running on your
server, and which resources those processes affect.
• Do it with appropriate monitoring
• JavaMelody as a simple approach (DEMO)
• https://github.com/miguel-rodriguez/alfresco-monitoring
• Use support tools for troubleshooting (DEMO)
• https://github.com/Alfresco/alfresco-support-tools
• Have specific servers dedicated to specific tasks
• Offload the user facing nodes
16. 4 – Solr Tuning
Golden Rules for Solr
• Do you search on deleted content? If not, disable the archive core.
• Go to solrHome and edit the solr.xml file, commenting out the archive core
• Also disable the archive core backup scheduled task
• Do you search on content or only on metadata? If metadata only, you can disable full-text indexing
• alfresco.index.transformContent=false
• Alfresco can make use of transactional metadata queries (DB fetch)
• Is SSL really needed? Inside an intranet it can be disabled to reduce complexity.
• Optimize your ACL policy: re-use your permissions, use inheritance and use groups
17. 4 – Solr Tuning
Golden Rules for Solr Indexing
• Have local indexes (don't use shared folders or NFS); use fast hardware
(RAID, SSD, …)
• Tune the mergeFactor: 25 is ideal for indexing, while 2 is ideal for search.
• Tune your RAM buffer size (ramBufferSizeMB) in solrconfig.xml (32 MB by
default)
• Analyze your indexing processes (check alfresco repository health)
• Tune the transformations that occur on the repository side, set a
transformation timeout.
18. 4 – Solr Tuning
Golden Rules for Solr Indexing
• Closely monitor the Solr JVM (especially GC and heap usage)
• Enable GC logs, analyze GC performance, tune the GC algorithm
• Do you need tracking to happen every 15 seconds?
• Use a dedicated Alfresco tracking instance; several architecture options exist
• Increase your index batch counts to get more results on your indexing
webscript
• In each core's solrcore.properties, raise the batch count to 2000
• Impacting factors in Indexing
• JVM memory and CPU usage on the repository layer (text extraction/transformations)
• JVM memory, CPU, disk I/O and disk cache size on the Solr layer
• Number of indexing threads, Solr caches
19. 4 – Solr Tuning
Golden Rules for Solr Search
• Have local indexes (don't use shared folders or NFS); use fast hardware (RAID,
SSD, …)
• Tune the mergeFactor: 2 is ideal for search.
• Increase your query caches and the RAM buffer
• Avoid PATH queries (they are slow); avoid * searches; avoid ALL searches
• Avoid using sort; you can sort your results on the client side using JS or any
client-side framework of your choice.
• Search is CPU-intensive rather than RAM-intensive: increase CPU power.
• Upgrade your Alfresco release with the latest service packs and hotfixes. These
contain the latest Solr improvements and bug fixes, which can have a great impact on
overall search performance.
20. 4 – Solr Tuning
Solr Caches
Tracking the usage of the solr caches can help to tune them for your use case.
• http://<solr_server>:<solr_port>/solr/alfresco/admin/stats.jsp#cache
The URL above shows statistics on cache usage. If you see many evictions, you
should look into increasing that cache so all elements can fit (but don't
overdo it; adjust it and see what fits your setup). It is likewise a good idea to
decrease the size of caches that have a lot of unused slots.
The goal should be to get the hit rate as close to 1.00 as possible (1.00 being a 100% hit ratio).
21. 4 – Solr Tuning
Solr usage on Alfresco Share
• Solr indexing and search performance will positively affect the overall Share
performance. Share relies on Solr in the following situations:
• Full Text Search (search field in top right corner)
• Advanced Search
• Filters
• Tags
• Categories (implemented as facets)
• Dashlets such as the Recently Modified Documents
• Wildcard searches for People, Groups, Sites (uses database search if not wildcard)
Editor's notes
Disabling some features that are not being used will release important resources allowing them to be used for active tasks, contributing for increased performance.
Transformations
When users access Alfresco via the Share interface and open the document details page, a full document preview is generated. This involves calls to various third-party tools such as OpenOffice, Ghostscript and ImageMagick to create a Flash version of the document. If previews are not being used, we can prevent their creation by including the following snippet of XML in the share-config-custom.xml file.
User Quotas
Checking for user quotas can add some overhead to Alfresco. When not needed, this feature can also be disabled.
User home folders
Alfresco creates a home folder for each new user automatically. If your users are not using this folder for any business-related tasks, you can disable the automatic creation of home folders for new users.
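In Alfresco 4.x this is controlled by a single property in alfresco-global.properties (verify the property name against your version's documentation):

```properties
# Create home folders lazily, on first access, instead of eagerly at user creation
home.folder.creation.eager=false
```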
Activities feed
If this feature is not necessary and is not being used disabling it will prevent regular checks to the activities and will again save on system resources.
activities.feed.notifier.enabled=false
activities.feed.cleaner.enabled=false
activities.post.cleaner.enabled=false
1,2 - ACL checks are known to slow down performance when the maximum group hierarchy depth exceeds 5 levels. Our advice is, when possible, to limit the maximum group hierarchy depth to 5.
3 - When using Share or another client to browse a repository folder, Alfresco needs to perform a series of actions before it actually renders or delivers the content (permission checking, etc.). The more nodes that reside inside a specific folder, the slower the response time. We recommend, when possible, limiting the maximum number of document nodes inside a folder to 2000.
4 - The number of sites on the system has some influence on performance, especially when checking site membership for users. Although, of the three limits suggested here, this is the factor with the least impact, we recommend keeping the number of sites below 5000.
5 - The number of groups a user belongs to has an impact on performance while rendering some client pages (especially some Share dashlets, like the My Sites dashlet). Alfresco runs some complex queries (based on the user's group membership) while determining the assets to render on some Share pages. We advise keeping a low ratio of groups per user. When possible, and to optimize Share client rendering performance, a user should not belong to more than 5 or 6 groups.
6 - The depth of the folder hierarchy also has an impact when browsing and performing document actions under a certain folder. We recommend, when possible, limiting the maximum depth of a folder hierarchy to 15 levels.
Not closing resources
Certain resources in Alfresco (specifically search result objects) are not cleaned up automatically by Alfresco and must therefore be cleaned up correctly by extension code (i.e. in a “finally” block). Failing to clean up such resources results in leaks, not only of memory but also, in some cases, “real” operating system resources (such as file handles) as well.
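The pattern can be sketched as follows. `FakeResultSet` here is a hypothetical stand-in for Alfresco's `org.alfresco.service.cmr.search.ResultSet` (so the sketch stays self-contained); the real class holds index readers and file handles until `close()` is called:

```java
// Sketch of the mandatory close pattern for Alfresco search results.
// FakeResultSet is a hypothetical stand-in for the real Alfresco ResultSet.
class FakeResultSet implements AutoCloseable {
    boolean closed = false;
    int length() { return 3; }
    @Override public void close() { closed = true; }
}

public class ResultSetClosing {
    // Process results; the finally block guarantees close() even if an
    // exception is thrown while iterating the results.
    static int countResults(FakeResultSet results) {
        try {
            return results.length();
        } finally {
            results.close();
        }
    }

    public static void main(String[] args) {
        FakeResultSet rs = new FakeResultSet();
        int n = countResults(rs);
        System.out.println(n + " results, closed=" + rs.closed);
    }
}
```

The same try…finally shape applies to any extension code that obtains a result set from SearchService.query().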
Lower case beans
The “lower case” versions of the Alfresco service beans (i.e. those whose names start with a lowercase letter, e.g. nodeService) are configured to bypass Alfresco’s security, transaction and auditing checks, with no recourse for the administrator to turn them back on. There have been persistent (but incorrect) rumors in the Alfresco community that these versions of the services perform significantly better than the official (“upper case”) versions, but that hasn’t been the case since at least Alfresco v2.x.
Content policies
Content policies are wired in at a very low level in the repository and as a result can be called many hundreds of times a second in some cases (e.g. when content is being manipulated via CIFS). In addition they are executed synchronously within each Alfresco transaction. The result is that even the slightest poor performance in a custom content policy or behavior can have a profound impact on Alfresco performance. For that reason it is critically important that custom content policies / behaviours are either fast (conduct minimal computation and only perform minimal I/O to the repository) or are made asynchronous.
Direct access to database
Alfresco’s database schema and the SQL the product uses has been carefully tuned. Uncontrolled access to the same tables can interfere with Alfresco’s normal operation, impacting both performance and (in some cases) stability. Note that this includes reads (SELECTs), as this can block concurrent write operations in some circumstances.
Use of private API’s
Only the public Alfresco Java APIs may be used in a certified extension; the private APIs should not be used and are not supported. http://docs.alfresco.com/4.2/concepts/java-public-api-list.html
Using Transaction Service instead of RetryingTransactionHelper
As the name implies, RetryingTransactionHelper contains retry logic for certain recoverable, expected database exceptions (deadlocks, basically). It also uses Spring’s “template” pattern to ensure a transaction is always completed (committed or rolled back) correctly, regardless of what happens in the logic inside the transaction.
The “raw” TransactionService provides neither of these benefits, and for that reason is considered unsafe for use in extensions.
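A minimal, self-contained sketch of what the helper does internally (`DeadlockException` is a stand-in for the recoverable database exceptions the real helper recognizes; the real entry point is `transactionService.getRetryingTransactionHelper().doInTransaction(callback)`):

```java
import java.util.concurrent.Callable;

// Sketch of RetryingTransactionHelper's template pattern: run a unit of
// work, retry on recoverable errors (deadlocks), and always finish the
// "transaction" (commit or rollback) regardless of what the work does.
public class RetrySketch {
    static class DeadlockException extends RuntimeException {}

    static <T> T doInTransaction(Callable<T> work, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                // begin transaction (elided in this sketch)
                T result = work.call();
                // commit (elided)
                return result;
            } catch (DeadlockException e) {
                // rollback (elided), then retry if attempts remain
                if (attempt >= maxRetries) throw e;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // The work fails once with a deadlock, then succeeds on retry.
        String out = doInTransaction(() -> {
            if (calls[0]++ == 0) throw new DeadlockException();
            return "committed";
        }, 3);
        System.out.println(out + " after " + calls[0] + " attempts");
    }
}
```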
CMIS query language
Alfresco’s SearchService API supports different “languages” (XPath, Lucene, SOLR and CMIS), which roughly correlate to the different underlying search implementations in Alfresco. Of these languages, only CMIS is fully abstracted away from the underlying implementation, and so is the only language that provides some guarantee of consistent behavior regardless of how a given Alfresco instance has been configured (SOLR vs Lucene, or MDQ, for example). Note however that SearchService doesn’t fully implement CMIS-QL: specifically, the “SELECT” clause in CMIS queries sent to the SearchService is not processed (it is silently ignored). SearchService, regardless of the query language used, always returns sets of matching NodeRefs.
Improper exception handling
Exceptions should either be caught and recovered from, or allowed to flow up the call stack. It is almost never appropriate to “swallow” an exception (catch it and do nothing), and excessive wrapping of exceptions inside other exceptions should be minimized (it makes triage more difficult).
Catching or throwing java.lang.Error instances, and catching java.lang.Throwable, is also inappropriate: these classes (java.lang.Error, specifically) indicate fatal JVM problems and therefore cannot be safely caught or thrown.
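A small, self-contained contrast of the two styles (the method names are illustrative only):

```java
// Sketch: handle what you can genuinely recover from; let the rest flow up.
public class ExceptionHandling {
    // Discouraged: swallowing hides the failure from callers and the logs.
    static int swallow(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return -1; // silent fallback: the caller never learns why
        }
    }

    // Preferred when no meaningful recovery exists: let the exception
    // propagate unwrapped so triage sees the original failure.
    static int parseOrFail(String s) {
        return Integer.parseInt(s); // NumberFormatException flows to the caller
    }

    public static void main(String[] args) {
        System.out.println(swallow("x"));      // -1
        System.out.println(parseOrFail("42")); // 42
    }
}
```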
Alfresco sizing, like that of other systems, has many subtleties that are hard to fully systematize. Each deployment has specifics that demand consideration when estimating sizing for the different layers (Share front end, repository, indexing, transformation, storage). In any case, the numbers presented above assume certain fairly generic steps and concerns that are mostly common across the different use cases. Changes to those assumptions can drastically alter the predictions above.
For memory calculations, consider the repository L2 cache, plus initial JVM overhead, plus basic Alfresco system memory.
This means that you can run the Alfresco repository and web client, with many users accessing the system, on a basic single-CPU server. However, you must add memory as your user base grows, and add CPUs depending on the complexity of the tasks you expect your users to perform and on how many concurrent users access the client.
The terms concurrent users and casual users are used. Concurrent users are users who constantly access the system through Alfresco with only a small pause between requests (3-10 seconds maximum), with continuous access 24/7. Casual users are users who occasionally access the system through the Alfresco or WebDAV/CIFS interfaces with a large gap between requests (for example, occasional document access during the working day).
Common use cases
Alfresco use cases vary considerably because of the elasticity of the platform, and although we can enumerate some generic common use cases, the details in which each real implementation differs may be very important from an architecture and sizing perspective. Generally, Alfresco solutions can be classified into one of the two following cases:
• Collaboration
• Backend Repository
Authority Structure
We know from recent benchmark comparisons that the authority structure has a direct and important impact on performance, especially of SOLR. So when sizing SOLR it is important to consider the importance of search operations for the solution, the types of searches being executed, and the repository size and characteristics, but also, and equally important, the authority structure of the corresponding use case. Collaboration use cases will thus in principle be more demanding (keeping the other factors constant) than backend use cases with simpler authority structures.
Injection rate: the repository growth or document upload rate. The document types and sizes will impact transformation, text extraction and indexing. The impact of repository size on performance grows with the ratio of search operations expected in the solution.
The injection rate has an obvious impact on near-future repository sizes, but also on the capability of the different solution layers to handle the throughput, which stresses not only the content storage and database (metadata extraction/upload/rules) but also the indexing layer. Depending on the amount of document injection, this may imply that dedicated nodes are reserved for injection (clustered or not with the front-end service nodes) and that the Solr layer is scaled up/out. The requirements around uploading and downloading large or small documents, and around indexing and transforming different types of documents, will vary. This can have architectural consequences: a preference for certain protocols and mechanisms for large files (for example FTP or bulk ingestion), or the use of dedicated caching solutions for large document downloads. It also has consequences at the sizing level: indexing large documents is very different from indexing smaller ones, and it also differs between document types.
Repository size is also dynamic, and the project may expect, even in the near term, to go through different phases: first migrating a legacy repository, then new content roll-out, archival, etc. Sizing estimates should cover the different phases, especially in the short term.
One of the secrets of a successful architecture is to know exactly what, when and where the processes are occurring and which resources those processes affect. Having this information gives the architect the power to “Divide and Conquer”.
Working with a fairly flexible technology, the architect can wisely divide the overall processes across the resources (servers), achieving the necessary balance. Consider, for example, the scheduled jobs that Alfresco executes: in a distributed architecture there are many advantages to offloading some of those jobs to a specific server, releasing important resources from the servers that are actually serving user requests.
From an Alfresco perspective, offloading (disabling) these scheduled jobs from the front-end servers is no more than configuring their cron expressions to fire far in the future, and having a dedicated server (normally separate from the cluster) execute these jobs.
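As an illustration, the content store cleaner could be pushed far into the future on the front-end nodes via alfresco-global.properties (the property name below is from Alfresco 4.x's repository.properties; check the actual cron property of each job you offload):

```properties
# Front-end nodes: never fire the orphaned-content cleanup job here;
# a dedicated maintenance node keeps the normal schedule instead
system.content.orphanCleanup.cronExpression=* * * * * ? 2099
```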
Capacity planning is the science and art of estimating the space, computer hardware, software and connection infrastructure resources that will be needed over some future period of time. It's a means to predict the types, quantities and timing of critical resource capacities needed within an infrastructure to meet accurately forecasted workloads.
Predicting and sizing a system depends on a good understanding of user behavior. The following diagram represents the stages of a capacity-planning scenario.
Validate your predictions with stress tests; compare the obtained data with the LIVE data from production, adjusting the REAL user behaviors into your stress-test scenarios.
Performing a regular analysis of the monitoring/capacity-planning data will help you know exactly when and how you need to scale your architecture, allowing for incremental and continuous improvement. The more data that gets indexed into Elasticsearch over the application life cycle, the more accurate your capacity predictions will be. They represent the “real” application usage and how that usage affects the various layers of your application. This plays a very important role when modeling and sizing the architecture for future business requirements.
The peak-period methodology is an efficient way to implement a capacity planning strategy, allowing you to analyze vital performance information when the system is under the most load/stress; furthermore, it represents YOUR system. The peak period may be an hour, a day, 15 minutes, or any other period used to analyze the collected utilization statistics. Assumptions may be estimated based on business requirements or on specific benchmarks of a similar implementation.
The diagram shows the most important factors in the deployment that should be analyzed on a regular basis. Note that certain inspection targets have overhead while they are being inspected. These targets might not be appropriate for long-term monitoring, or may require tuning to minimize impact.
Pay attention to Alfresco transformations and text extraction
Alfresco executes a high number of transformations while working with documents; these include document text extraction (for indexing), preview generation, thumbnail generation and renditions. It's wise to regularly monitor the health and performance of transformations: check the longest-running transformations, measure transformation times, and use transformation limits when applicable.
HTTP Requests and Responses
Debugging HTTP requests can yield useful information to help you tune your applications: find out which components are making a page take so long to load, or make sure the JSON your web script returns looks like you expect it to. There are a couple of tools that can be used to debug HTTP requests and responses, like Charles or Fiddler; other tools are included in the browsers (Firebug, the Chrome inspector).
Disable all full-text indexing
We can disable all full-text indexing activities and tune our search layer for performance on metadata-based searches, making use of a recent Alfresco feature called transactional metadata queries. To disable full-text search, we need to configure the workspace SpacesStore Solr core: edit the solrcore.properties file and set the following property:
alfresco.index.transformContent=false
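Relatedly, Alfresco can be told to answer metadata-only queries from the database whenever possible instead of going to Solr (this property is available from Alfresco 4.2 onwards; verify the value against your version's documentation):

```properties
# alfresco-global.properties: run queries transactionally against the DB
# when they can be answered from metadata alone, falling back to Solr otherwise
solr.query.fts.queryConsistency=TRANSACTIONAL_IF_POSSIBLE
```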
Disable archive core
If you are not planning to search for deleted content, you can safely disable the indexing of archived content. Alfresco never searches inside files that have been deleted/archived.
We can disable the indexing of archived content by going into solrHome and editing the solr.xml file that resides in the root of this folder. Comment out the archive core as shown in the XML below.
<?xml version='1.0' encoding='UTF-8'?>
<solr sharedLib="lib" persistent="true">
<cores adminPath="/admin/cores" adminHandler="org.alfresco.solr.AlfrescoCoreAdminHandler">
<!-- <core name="archive" instanceDir="archive-SpacesStore"/>-->
<core name="alfresco" instanceDir="workspace-SpacesStore"/>
</cores>
</solr>
Note that we are commenting out the Solr core named “archive”. This prevents Solr from indexing archived content, saving disk space, Solr memory, CPU during indexing, and overall resources.
We also need to disable the archive core backup scheduled task. We do this by setting its cron expression to a date far in the future. You should do this on every Alfresco node (including the new tracking instance):
# disabling the archive backup as we are not using archive search
solr.backup.archive.cronExpression=* * * * * ? 2199
solr.backup.archive.numberToKeep=0
Optimize your ACL policy: re-use your permissions, use inheritance, and use groups. Don't set up specific permissions for users or groups at folder level. Try to re-use your ACLs.
ramBufferSizeMB
ramBufferSizeMB sets the amount of RAM that Solr indexing may use for buffering added documents and deletions before they are flushed to disk.
Increasing this to 64 or even 128 MB has generally improved performance, but it depends on the amount of free memory you have available.
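As a sketch, the relevant settings in solrconfig.xml might look like this (the exact element placement varies between Solr versions; the values are starting points, not universal recommendations):

```xml
<!-- solrconfig.xml: per-core index tuning -->
<mergeFactor>25</mergeFactor>         <!-- 25 for heavy indexing; drop to 2 for search -->
<ramBufferSizeMB>64</ramBufferSizeMB> <!-- up from the 32 MB default -->
```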
Analyze Indexing process
During the indexing process, plug in a monitoring tool (e.g. YourKit) to check repository health. During indexing, the repository layer sometimes executes heavy, I/O-, CPU- and memory-intensive operations, such as transforming content to text in order to send it to Solr for indexing. This can become a bottleneck when, for example, the transformations are not working properly or the GC cycles are taking a long time.
GC Tuning
Solr operations are memory-intensive, so tuning the garbage collector is an important step towards good performance. jClarity's Censum tool is really good for analyzing GC logs, but there are others.
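As a starting point, GC logging can be enabled with JVM flags such as the following (Java 7/8-era flags; the log path is illustrative only):

```shell
# Append to the Solr JVM options, e.g. in Tomcat's setenv.sh
JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/alfresco/solr_gc.log"
```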
Tracking frequency
Consider whether you really need tracking to happen every 15 seconds (the default). This is configured in the Solr configuration files via the cron frequency property:
alfresco.cron=0/15 * * * * ? *
This property can heavily affect performance, for example during bulk injection of documents or during a Lucene-to-Solr migration. You can change this to 30 seconds or more when you are re-indexing.
This allows more time for the indexing threads to perform their work before more arrives on their queue.
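For example, to poll every 30 seconds during a re-index (same Quartz cron syntax as the default above):

```properties
# solrcore.properties: track the repository every 30 seconds instead of 15
alfresco.cron=0/30 * * * * ? *
```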
Increase your index batch counts to get more results from your indexing webscript on the repository side. In each core's solrcore.properties, raise the batch count to 2000 or more:
alfresco.batch.count=2000
For index updates, Solr relies on fast bulk reads and writes. One way to satisfy these requirements is to ensure that a large disk cache is available. Use local indexes and the fastest disks possible.
In a nutshell, you want to have enough memory available in the OS disk cache so that the important parts of your index, or ideally your entire index, will fit into the cache. Let’s say that you have a Solr index size of 8GB. If your OS, Solr’s Java heap, and all other running programs require 4GB of memory, then an ideal memory size for that server is at least 12GB. You might be able to make it work with 8GB total memory (leaving 4GB for disk cache), but that also might NOT be enough.
Troubleshooting common Indexing problems
Database – If it’s a database performance issue, adding more connections to the connection pool can normally increase performance.
I/O – If it’s an I/O problem (these typically occur in virtualized environments), use hdparm to check read/write disk speed if you are running on a Linux-based system; there are also some variations for Windows. For example: sudo hdparm -Tt /dev/sda
The rule for troubleshooting involves testing and measuring initial performance, applying some tuning and parameter changes, then retesting and measuring again until we reach the necessary performance. I strongly advise plugging a profiling tool such as YourKit into both the repository and Solr servers to help with the troubleshooting.
Make sure you are using only one transformation subsystem. Check alfresco-global.properties and see whether you are using either OOoDirect or JodConverter; never enable both subsystems.
Typical issues with Searching
It can happen that you are searching and indexing at the same time; this causes concurrent access to the indexes, which is known to cause performance issues. There are some workarounds for this situation.
To start, you should plug in a profiler and search for commit issues (I/O locks); this will allow you to check whether you are facing this problem.
Solr makes some statistics available by default. Analyzing those statistics can help you tune your Solr caches for your use case.
The way to read these results (analyzing the caches) is the following:
If you have many evictions, you should look into increasing that cache so all elements can fit (but don't overdo it; adjust it and see what fits your setup).
It is likewise a good idea to decrease the size of caches that have a lot of unused slots.
The goal should be to get the hit rate as close to 1.00 as possible (1.00 being a 100% hit ratio).
If your project relies on the Share client offered by Alfresco, you should know that tuning your Solr indexing and search performance will positively affect the overall Share performance.