Caching

This section provides technical information about the caching technologies used in Sophora.

Basic Information

The directory that should be used to store the cache can be set in the sophora.properties of the delivery. The following subdirectories are created automatically within the configured cache directory:

  • a directory db,
  • a directory removal, and
  • a link htdocs.

The directory db is the location of the registry of the cache manager. The directory removal contains the persistent data of the removal queue and the pre-generation queue. The link htdocs ensures that the HTML cache is reachable at a defined position within the directory structure, while the cache itself may reside elsewhere if necessary.

The cache manager executes two tasks:

  1. The pre-generation of files: Requests are sent to the Tomcat. Files that are created by these requests are written into the docroot directory of the webserver.
  2. Clearing the caches: If necessary, files in the docroot directory are renamed or deleted.

Normally, the pre-generation and deletion are triggered by activities in Sophora. Files are not directly removed from the cache but renamed (to files with the same name but the extension .old). Doing this has the advantage that these renamed files can be used as backup in the case that the new generation is in progress or has failed when a document is requested.
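The rename-instead-of-delete approach can be sketched as follows. This is a minimal Python illustration, not the delivery's actual implementation: the function names invalidate and read_cached are hypothetical, and the sketch assumes that the .old suffix is appended to the full file name.

```python
import os
import tempfile

OLD_SUFFIX = ".old"  # assumption: suffix is appended to the complete file name


def invalidate(path):
    """Rename a cached file to <name>.old instead of deleting it."""
    if os.path.exists(path):
        os.replace(path, path + OLD_SUFFIX)


def read_cached(path):
    """Return the fresh file if present, else the .old backup, else None."""
    for candidate in (path, path + OLD_SUFFIX):
        if os.path.exists(candidate):
            with open(candidate, encoding="utf-8") as handle:
                return handle.read()
    return None


# Usage: invalidate a cached page, then fall back to the backup copy.
cache_dir = tempfile.mkdtemp()
page = os.path.join(cache_dir, "index.html")
with open(page, "w", encoding="utf-8") as handle:
    handle.write("version 1")
invalidate(page)
backup_content = read_cached(page)  # read from index.html.old
```

The backup copy stays readable until a freshly generated file replaces it, which is the property the paragraph above relies on.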

When rendering a document, the content provider registers this document with its ID, its URL and the corresponding JSP template at the cache manager. Thus, the cache database knows all documents that have been used to render HTML pages. When the cache manager is informed that a document has been published (or deleted), it searches its database for the list of HTML files that need to be updated.

The same applies to resources such as JSP templates: When a template is changed, the cache manager determines the corresponding entries in the database and updates them accordingly.

When using such a registry mechanism you have to be aware that

  1. a large cache entails many entries in the registry/cache database
  2. an inconsistency or mistake in the registry requires resetting (clearing) the entire cache and rebuilding it

In general, content is cached on the file system. The Apache has read-only access to the cache and uses a rewrite condition to check whether a file exists in the cache. If it does not, the Apache forwards the request to the Tomcat server.

The Tomcat controls the writing access to the cache:

1. Cache files are deleted. In fact, these files are not removed directly but renamed to .old files.

2. Cache files are renewed. Changes in the content are caused by:

  1. Pre-generation has been initiated
  2. A requested file doesn't exist in the cache. The request has to wait until the cache entry is created. If other requests ask for the same content in the meantime, the content from the .old file is delivered. Once the cache entry has been generated, the .old file is deleted.

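To ensure that the Apache doesn't deliver .old files from docroot directly, you can use the following configuration in your httpd.conf:

# Apache 2.2 syntax. With Apache 2.4 (mod_authz_core), use "Require all denied" instead of the two directives below.
<Files *.old>
Order allow,deny
Deny from all
</Files>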

The Apache is responsible for resolving SSIs (server-side includes). Within the Tomcat this behaviour is turned off when the isMounted property in the configuration is set to true.

Caching content in the docroot directory of the webserver, although purely file-based, is powerful and robust. This way the delivery by the webserver almost achieves the performance of a static website.

Pre-generation

When content is flushed, two situations can occur afterwards:

  1. The content is added to the pre-generation queue. In the meantime the cached file remains unchanged in the cache.
  2. The content will be generated when it is requested again. In this case, the cached file is renamed to an .old file.

The pre-generation is triggered when the sophora.delivery.cache.pregeneration.enabled property in the sophora.properties file is set to true and either no cached file exists or the last document access happened within the configured time frame. Thus, rarely accessed sites/documents are not pre-generated.

Dependencies between Content and Cache Fragments

During the generation phase of a cache fragment the current content is read from Sophora documents and structurenodes. Consequently, the cache fragment must be invalidated when the content is de-published or published again. Such a relationship between a cache fragment and content is called a 'dependency'.

Not only the used documents and structurenodes are tracked during the generation phase but also their used properties and childnodes. So when a cache fragment only displays the topline of a story, the cache fragment does not need to be invalidated when only the copytext has changed in a newer version of the document. In this case the cache fragment only has a dependency on the property topline. The cache fragment is therefore invalidated only when this property changes between versions or when the corresponding document is de-published.

The same applies to structurenodes. When for example the name and the property sophora:isActive are read during the generation of a cache fragment, the cache fragment has only dependencies to these two properties. When e.g. the default document of the structurenode changes, the cache fragment does not need to be invalidated.

Dependencies on properties are very useful when a document (e.g. with the Sophora id story100) is linked within a different document (story102). In that case the cache fragment of story102 needs to be invalidated only when the Sophora id or the structurenode of story100 changes.

If more than ten properties or at least one childnode is used, the cache fragment has a full dependency on the Sophora document.
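The dependency rules above can be sketched in a few lines of Python. This is an illustrative model only: the names TrackedAccess, dependency_for and needs_invalidation are hypothetical and do not come from the Sophora API, and the way a dependency is represented (the string "full" or a set of property names) is an assumption.

```python
from dataclasses import dataclass, field

FULL_DEPENDENCY_PROPERTY_LIMIT = 10  # "more than ten properties" => full dependency


@dataclass
class TrackedAccess:
    """What a cache fragment read from one document during generation."""
    properties: set = field(default_factory=set)
    childnodes: set = field(default_factory=set)


def dependency_for(access):
    """Return "full" or the frozenset of property names the fragment depends on."""
    if access.childnodes or len(access.properties) > FULL_DEPENDENCY_PROPERTY_LIMIT:
        return "full"
    return frozenset(access.properties)


def needs_invalidation(dependency, changed_properties, depublished=False):
    """Decide whether a new document version invalidates the fragment."""
    if depublished or dependency == "full":
        return True
    return bool(dependency & set(changed_properties))


# A teaser that only shows the topline ignores copytext changes.
teaser = dependency_for(TrackedAccess(properties={"topline"}))
ignores_copytext = not needs_invalidation(teaser, {"copytext"})
reacts_to_topline = needs_invalidation(teaser, {"topline"})
```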

Details about Cache Update

Changes to a single document may affect many cache files, all of which are then marked as invalid. Deleting and pre-generating cache entries can take a long time. During this time, small changes to important pages should not be held up by long-running removal or generation processes. To handle such situations the delivery uses two priority queues:

  • Removal queue
  • Pre-generation queue

Each entry in these queues always refers to an individual file in the cache.

Cache entries are marked as invalid, if one of the following actions is performed:

  • Changing a document's status to "published" or "deleted"
  • Changing the YellowData of a document (if it had a dependency to one of its yellow data objects)
  • Time scheduled content update
  • Modification of JSP templates (or resources in general)
  • Flushing explicitly using the flushCacheEntry tag

If a document has been changed, all related cache files are determined via the cache database. In general, the total number of files found is used to assign a priority to the removal job: the more files are found, the lower the priority.

Before prioritization takes place, the cache entries are divided into files whose names start with the Sophora ID of the changed document and all remaining files. If (and only if) the number of files starting with the Sophora ID is smaller than the number of remaining files, the removal job is split into two tasks and their priorities are calculated separately.
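The splitting rule can be illustrated with a short Python sketch. The text does not state the actual priority formula, so priority_for below is purely hypothetical (it only preserves the property that more files mean a lower priority), and plan_removal_jobs is an invented helper name.

```python
import os


def priority_for(file_count, max_priority=100):
    """Hypothetical formula: the more files, the lower the priority (minimum 1)."""
    return max(1, max_priority - file_count)


def plan_removal_jobs(sophora_id, files):
    """Return a list of (priority, files) jobs for one changed document."""
    own = [f for f in files if os.path.basename(f).startswith(sophora_id)]
    rest = [f for f in files if not os.path.basename(f).startswith(sophora_id)]
    # Split only if the document's own files are the smaller group.
    groups = [own, rest] if len(own) < len(rest) else [list(files)]
    # Empty groups produce no job (assumption for the case of no own files).
    return [(priority_for(len(group)), group) for group in groups if group]


split_jobs = plan_removal_jobs(
    "story100",
    ["story100.html", "index.html", "news/index.html", "sport/index.html"],
)
single_job = plan_removal_jobs(
    "story100",
    ["story100.html", "story100-gallery.html", "index.html"],
)
```

In the first call the document's own page ends up in a small job with a higher priority than the job for the three overview pages, so the page itself is refreshed first.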

Every entry is either assigned to the removal queue or to the pre-generation queue. The decision is based on:

  • If the pre-generation is turned off, the entry is assigned to the removal queue.
  • If the particular file has been processed by the forcePregeneration tag, the entry is moved to the pre-generation queue.
  • When the file has been accessed recently (within the interval specified by the property sophora.delivery.cache.pregeneration.maxAccessTimeInterval and at least once after the file creation), it is assigned to the pre-generation queue.
  • Otherwise the entry is written to the removal queue.
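The decision rules in the list above, evaluated in that order, can be sketched as a small Python function. The function and parameter names are hypothetical, and the sketch assumes that all times and the maximum access interval are expressed in the same unit (seconds here); the unit actually used by the property is not stated in this section.

```python
def assign_queue(pregeneration_enabled, forced, created_at, last_access,
                 now, max_access_interval):
    """Return "removal" or "pregeneration" for one invalidated cache file."""
    if not pregeneration_enabled:
        return "removal"
    if forced:  # file was processed by the forcePregeneration tag
        return "pregeneration"
    accessed_after_creation = last_access is not None and last_access > created_at
    if accessed_after_creation and (now - last_access) <= max_access_interval:
        return "pregeneration"
    return "removal"


# A file created at t=0, last requested at t=50, checked at t=100
# with an interval of 60 seconds, counts as recently accessed.
recent = assign_queue(True, False, created_at=0, last_access=50,
                      now=100, max_access_interval=60)
```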

If the cache doesn't contain an entry belonging to this Sophora document, the generateNonExistentFiles property is checked. If it is set to true, a URL pointing to this document is determined and the document is added to the pre-generation queue with a priority of 50. The document's URL is discovered using a JSP page which includes the property sophora.delivery.cache.pregeneration.urlProviderTemplate.

The entire process of determining and adding the queue entries is carried out single-threaded. It is fast since the relatively big cache database is only read and the queues themselves are held in a separate database.

If duplicate entries occur, the queues are organized in a way that only the entry with the higher priority is kept.

The removal queue is processed single-threaded whereas the pre-generation queue is processed multi-threaded. The number of threads can be configured by the property sophora.delivery.cache.pregeneration.numberOfThreads. High priority entries are processed first. In theory, low priority entries may be starved indefinitely if entries with a higher priority are added constantly.
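The queue behaviour described above (duplicates collapse to the higher priority, high priorities are served first) can be modelled with a small in-memory sketch in Python. The class name DedupPriorityQueue is hypothetical, and unlike the real queues, which are persisted in a database, this sketch keeps everything in memory.

```python
import heapq


class DedupPriorityQueue:
    """Priority queue that keeps only the highest priority per cache file."""

    def __init__(self):
        self._heap = []   # entries are (-priority, path); may contain stale items
        self._best = {}   # path -> highest priority currently queued

    def add(self, path, priority):
        current = self._best.get(path)
        if current is not None and current >= priority:
            return  # an entry with equal or higher priority already exists
        self._best[path] = priority
        heapq.heappush(self._heap, (-priority, path))

    def pop(self):
        """Remove and return (path, priority) of the highest-priority entry."""
        while self._heap:
            negative_priority, path = heapq.heappop(self._heap)
            if self._best.get(path) == -negative_priority:
                del self._best[path]
                return path, -negative_priority
            # otherwise the heap item was superseded by a later add(); skip it
        raise IndexError("queue is empty")

    def __len__(self):
        return len(self._best)


queue = DedupPriorityQueue()
queue.add("a.html", 10)
queue.add("b.html", 50)
queue.add("a.html", 70)   # replaces the earlier, lower-priority entry
first = queue.pop()
second = queue.pop()
```

Because pop always returns the highest priority, an entry with a low priority is never reached while higher-priority entries keep arriving, which is the starvation effect mentioned above.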

Flush Jobs and Synchronization (Cache Servlet)

Explicit flush jobs can be defined not only by the sophora:flushCacheEntry tag but also via a special cache servlet. When using the cache servlet, flush jobs are synchronized between different delivery installations automatically if this is configured in the Staging Slave's sophora.properties accordingly. For this purpose, deliveries are organised in groups, and the synchronization takes place within these configured groups.

A cache flush or synchronization can be triggered by calling the servlet's URL with a set of given parameters. See section "Usage" for additional information.

Configuration

Configuring the cache servlet takes two steps:

  1. Define the cache servlet in the web.xml file of the webapp and (optionally) set a password to restrict access to the servlet.
  2. Set up communication between the deliveries in the Sophora Staging Slave's configuration.

Delivery

In order to configure the cache servlet it must be added to the web.xml of the web application. To achieve this you can use the following code snippet. Afterwards the servlet is available at http://HOSTNAME/CONTEXT/system/servlet/cache.servlet.

web.xml

<servlet>
    <servlet-name>cacheServlet</servlet-name>
    <servlet-class>com.subshell.sophora.delivery.cache.servlet.CacheServlet</servlet-class>
</servlet>
 
<servlet-mapping>
    <servlet-name>cacheServlet</servlet-name>
    <url-pattern>/system/servlet/cache.servlet</url-pattern>
</servlet-mapping>

If the cache servlet needs to be protected by a password, add the property sophora.delivery.cache.event.password to the delivery's sophora.properties file. Please note that the passwords of all deliveries must be identical.

Server

As a prerequisite for synchronizing flush jobs the deliveries must be able to communicate with each other. Therefore the connected Staging Slaves must know which deliveries are connected. This information must be added to the Staging Slave's configuration file. Please refer to the Server documentation for details.

Usage

A flush job is always sent to a single delivery by invoking a URL like the following:

http://HOSTNAME/CONTEXT/system/servlet/cache.servlet?action=flush&key=KEY&group=GROUP

An individual flush job will be synchronized with all deliveries that are in the group defined by the flush job. These deliveries receive a request to the cache servlet and execute this flush accordingly. To avoid cyclic requests the request parameter forward is set to false internally so that no more servlet calls will be executed. At startup, a delivery synchronizes its flush database with the other deliveries in the same group(s). All flush jobs obtained that way are executed to catch up (and synchronize).

Parameter

The following table gives an overview of all available URL parameters (and values) of the cache servlet.

| Parameter | Description | Values | Mandatory |
|---|---|---|---|
| action | Defines the action to trigger. Available actions are flush for triggering a flush event for a specific cache fragment and sync to trigger synchronisation between deliveries and to retrieve a list of last flush events. | flush, sync | yes |
| key | The name of the key of the cache entry that should be flushed. It is not checked whether it is a valid and existing cache key. However, if the cache key is the UUID of a document and some fragments have actually been removed, then pregeneration for this key is triggered. | Any String value representing a cache key | yes for action flush |
| group | The delivery group's name which needs to be synchronised. It is not checked whether a group with the given name actually exists. | Any String value representing a configured group | yes |
| forward | If set to false a flush or sync event is not forwarded to the other deliveries in the specified group. Internally this parameter is used to avoid cyclic requests when synchronising between deliveries. If not set the parameter is set to true. | true or false | no |
| since | This parameter can be used to trigger a sync event ignoring all flushes before the given date. The date must be specified as a UNIX timestamp. This parameter only affects sync actions. | UNIX timestamp e.g. 1328196977070 | no |
| timestamp | Used internally to forward the point of time when a flush event was triggered to other deliveries. Do not set this parameter manually! | UNIX timestamp e.g. 1328196977070 | no |
| password | If the cache servlet is protected via the property sophora.delivery.cache.event.password, the parameter password is needed to specify the set password. | The configured password | no |

It is not recommended to use the parameter forward explicitly in your own cache flush trigger. If you do so you have to make sure that calls to the cache servlets do not result in cyclic servlet calls and that cache flushes will not lead to different states of the cache.

Response

The next table lists all possible responses of the cache servlet.

| Response | Error Message | Description |
|---|---|---|
| HTTP 200 OK | - | The flush or sync event was triggered successfully. In case of a sync event, the content of the response might contain a list of the last flushed cache fragments. The fragments are specified by their cache key and the event's timestamp. The two values are separated by semicolon. |
| HTTP 500 Internal Server Error | Invalid use of the cache servlet ... | The cache is disabled in the delivery's sophora.properties. Therefore no flush or sync event can be triggered for this web application. |
| HTTP 500 Internal Server Error | unknown action | The mandatory parameter action is missing or invalid. |
| HTTP 500 Internal Server Error | password required / invalid password | A password was set in the sophora.properties of the delivery via the property sophora.delivery.cache.event.password but the parameter is missing or contains a wrong password. |
| HTTP 500 Internal Server Error | no cache key specified | The parameter key is missing for this flush action. |
| HTTP 500 Internal Server Error | no group specified | The mandatory parameter group is missing. |

Forward and Inactive Cache
Please note that flush events are forwarded to all other deliveries even if the cache is disabled for the delivery the servlet is invoked on. In this case no error message is returned.
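A client that calls the sync action may want to read the list of flush events from an HTTP 200 response. The following Python sketch shows one way to do so. It assumes one event per line in the form key;timestamp; only the semicolon separator is documented above, so the line-based layout and the helper name parse_sync_response are assumptions.

```python
def parse_sync_response(body):
    """Parse 'cacheKey;timestamp' entries into (key, timestamp) tuples."""
    events = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the last semicolon so keys containing ';' stay intact.
        key, separator, timestamp = line.rpartition(";")
        if not separator:
            raise ValueError("missing ';' in line: %r" % line)
        events.append((key, int(timestamp)))
    return events


sample_body = "test;1328196977070\n\nstory100;1328196978000\n"
events = parse_sync_response(sample_body)
```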

Examples

Flush cache fragment with key 'test', for all deliveries in group 'live' and use password 'secret' for authentication.

http://HOSTNAME/CONTEXT/system/servlet/cache.servlet?action=flush&password=secret&key=test&group=live

Trigger synchronisation for all deliveries of group 'live' and get list of last flush events within this group.

http://HOSTNAME/CONTEXT/system/servlet/cache.servlet?action=sync&group=live
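If the URLs are assembled by a script, the parameters should be URL-encoded. The following Python sketch builds the two example URLs above using only the standard library. The helper name cache_servlet_url is hypothetical, and the placeholders HOSTNAME and CONTEXT are kept as in the examples.

```python
from urllib.parse import urlencode

SERVLET_PATH = "/system/servlet/cache.servlet"


def cache_servlet_url(base_url, action, group, key=None, password=None, since=None):
    """Build a cache servlet URL; base_url is e.g. http://HOSTNAME/CONTEXT."""
    if action not in ("flush", "sync"):
        raise ValueError("action must be 'flush' or 'sync'")
    if action == "flush" and not key:
        raise ValueError("key is mandatory for action 'flush'")
    params = {"action": action}
    if password is not None:
        params["password"] = password
    if key is not None:
        params["key"] = key
    params["group"] = group
    if since is not None:
        params["since"] = str(since)
    # urlencode escapes special characters in keys, groups and passwords.
    return base_url.rstrip("/") + SERVLET_PATH + "?" + urlencode(params)


flush_url = cache_servlet_url("http://HOSTNAME/CONTEXT", "flush", "live",
                              key="test", password="secret")
sync_url = cache_servlet_url("http://HOSTNAME/CONTEXT", "sync", "live")
```

The parameter forward is deliberately not exposed, in line with the recommendation not to set it in your own flush triggers.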

Derby NG and Flush Jobs for Resource Changes

Derby NG can be used as cache database by setting the property sophora.delivery.cache.db to derbyng. Flushes caused by resource changes will then be done based on the resources' content instead of the resources' modification date. For each resource a hash value will be stored in the cache database and this hash value will be used when comparing the webapps' resources for changes. If the hash value has changed, that means the file's content has been changed and the corresponding cache fragments will be flushed.
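The idea of content-based change detection can be sketched as follows in Python. The hash algorithm used by the delivery is not stated here, so SHA-256 is an assumption, and the names content_hash and changed_resources are hypothetical.

```python
import hashlib


def content_hash(data):
    """Hash the resource content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def changed_resources(stored_hashes, current_contents):
    """Return names of resources whose content hash differs from the stored one.

    stored_hashes is updated in place, so a second call with the same
    contents reports no changes.
    """
    changed = []
    for name, data in current_contents.items():
        digest = content_hash(data)
        if stored_hashes.get(name) != digest:
            changed.append(name)
            stored_hashes[name] = digest
    return sorted(changed)


stored = {}
first_run = changed_resources(stored, {"story.jsp": b"<p>v1</p>"})
unchanged_run = changed_resources(stored, {"story.jsp": b"<p>v1</p>"})
edited_run = changed_resources(stored, {"story.jsp": b"<p>v2</p>"})
```

A redeployment that only touches modification dates but leaves the file content identical produces the same hash and therefore, in this model, no flush.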

If you switch from another cache database to the Derby NG implementation you have to reset the entire cache. Note that the comparison via hash values is only available if you are using Derby NG as cache database.

Remove Cache Files and Cache Entries from Cache Database

The standard procedure for deleting cache entries is shown in the image below. Once there is a removal job, the associated file is deleted from the file system and all references to this file are removed from the cache database. Removing the references from the cache database may delay the removal job significantly, e.g. if the server is under high load.

In order to avoid these delays when deleting files from the filesystem, the property sophora.delivery.cache.removeFromCacheRegistry can be set to 'true' in the delivery's property file. When set to 'true', only files and not the references in the cache database are deleted. Superfluous references can then be removed later by a cleanup job, which is enabled by setting the property sophora.delivery.cache.cleanup.enabled to 'true' in the delivery's property file. How often the job is executed is defined by the property sophora.delivery.cache.cleanup.cron. The use of the cleanup job is shown in the following picture.

With this configuration not only is the removal job faster, but generating files also takes less time. This is because the cache database has more time to save the references of the generated files.

Asynchronous Update of the Cache Database

Since versions 1.32.4 and 1.33.1 the update behavior of the cache database has changed from a synchronous update to an asynchronous update. The cache database is updated during the generation process of a requested HTML page. With a synchronous update, the cache database is updated after the HTML page is generated but before the generated page is delivered to the requesting client. Thus the update slows down the delivery of a page. The following diagram illustrates this process in versions before 1.32.4 and 1.33.1, respectively.

To improve the performance of the delivery the cache database is updated asynchronously. For this reason the generated page can be delivered immediately after it is generated. While the database is not up to date (in the following diagram this time is marked with the keyword 'meantime'), read operations are guaranteed to return valid values. Furthermore, if the system crashes during this specific period of time, the database is updated after the application is restarted. This is done with the help of persistent update objects, which are stored in the folder queue within the cache folder.

Ephemeral Cache Fragments

If your webapp generates cache fragments with a very short lifetime, the overall effort of maintaining and organizing them might be disproportionate. We call these short-lived cache fragments "ephemeral".

The Sophora delivery framework offers a special handling for ephemeral cache fragments to prevent an organization overhead and still offer all required support. This only works in conjunction with the asynchronous cache implementation and is configured using this parameter:

sophora.delivery.cache.ephemeralEntriesThresholdInSeconds=60

If this property is set then all cache fragments are considered ephemeral if they will be flushed within the configured amount of time after their generation.

Such ephemeral fragments will not be added to the cache database but will be held in memory instead. To still guarantee that these fragments are cleaned up if the webapp server crashes, a backup file with crucial information is written to the webapp's data directory.

This feature is by default not enabled but implicitly switched on if the mentioned parameter is set.
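The classification rule can be expressed in a few lines of Python. This is a sketch with hypothetical names. It assumes that the time at which a fragment will be flushed is known when it is generated, and that a fragment flushed exactly at the threshold still counts as ephemeral; neither detail is specified above.

```python
def is_ephemeral(generated_at, scheduled_flush_at, threshold_seconds):
    """Return True if the fragment will be flushed within the threshold.

    threshold_seconds=None models an unset property (feature disabled).
    """
    if threshold_seconds is None or scheduled_flush_at is None:
        return False
    return (scheduled_flush_at - generated_at) <= threshold_seconds


# With a threshold of 60 seconds, a fragment that lives 30 seconds is ephemeral,
# while one that lives 5 minutes goes through the regular cache database.
short_lived = is_ephemeral(generated_at=1000, scheduled_flush_at=1030, threshold_seconds=60)
long_lived = is_ephemeral(generated_at=1000, scheduled_flush_at=1300, threshold_seconds=60)
```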

Support for ephemeral cache fragments requires a delivery version of 2.5.26, 2.6.0 or newer.