Sophora Offline Document Indexer

The offline document indexer maintains a separate database that enables smart redirects when clients request documents that are no longer online.

The offline document indexer is a server plugin that, if enabled, fills a separate Solr core with information about all existing documents that are offline. Based on the information in this Solr core, the delivery component can provide redirects for links to documents that were online before.

Master server configuration

The offline document indexer plugin relies on some configuration properties which must be set on the master server. These properties are listed below.

Property name | Explanation
offlineDocumentIndexer.active | Enables the feature. The default is false.
offlineDocumentIndexer.propertyNames | May contain a comma-separated list of additional document properties to make available on the Solr documents. See the section "Solr fields" below for details.
offlineDocumentIndexer.coreName | Names the Solr core to use for the offline documents. The default is offline.
offlineDocumentIndexer.cacheSize | The offline document indexer caches Sophora documents and Solr documents to improve performance when indexing into the indexes of several slaves at the same time. This property controls the size of both caches. The default is 1000.
offlineDocumentIndexer.cacheDurationInMinutes | Controls how long documents are kept in the two caches described for offlineDocumentIndexer.cacheSize. The default is 10 (in minutes).
sophora.solr.username | The username for accessing Solr. This parameter is optional.
sophora.solr.password | The password for accessing Solr. This parameter is optional.
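
As an illustration of how these properties and their defaults fit together, the following sketch reads them from a properties text. The class OfflineIndexerSettings, the sample values and the property name example:teaser are hypothetical and not part of Sophora; only the property keys and default values are taken from the table above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Sophora: reads the indexer properties
// and falls back to the defaults documented in the table above.
public class OfflineIndexerSettings {
    final boolean active;
    final String coreName;
    final int cacheSize;
    final int cacheDurationInMinutes;
    final List<String> propertyNames;

    OfflineIndexerSettings(Properties p) {
        active = Boolean.parseBoolean(p.getProperty("offlineDocumentIndexer.active", "false"));
        coreName = p.getProperty("offlineDocumentIndexer.coreName", "offline");
        cacheSize = Integer.parseInt(p.getProperty("offlineDocumentIndexer.cacheSize", "1000"));
        cacheDurationInMinutes = Integer.parseInt(
                p.getProperty("offlineDocumentIndexer.cacheDurationInMinutes", "10"));
        // The property list is comma separated; blanks around the names are ignored.
        propertyNames = Arrays.stream(p.getProperty("offlineDocumentIndexer.propertyNames", "").split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toList());
    }

    static OfflineIndexerSettings parse(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new OfflineIndexerSettings(p);
    }

    public static void main(String[] args) {
        // Sample configuration: only two properties are set, the rest use defaults.
        OfflineIndexerSettings s = parse("offlineDocumentIndexer.active=true\n"
                + "offlineDocumentIndexer.propertyNames=sophora-content:title, example:teaser\n");
        System.out.println(s.active + " " + s.coreName + " " + s.cacheSize + " "
                + s.cacheDurationInMinutes + " " + s.propertyNames);
    }
}
```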

Solr fields

By default, the Solr documents representing offline documents have the following fields:

Solr field | Content
channel_names_ss | The names of the channels this document was enabled for.
channel_uuids_ss | The UUIDs of the channels this document was enabled for.
id | The UUID of the document.
primaryType_s | The document's primary type.
sophora_structureNode_s | The UUID of the structure node of the document. If there was a recent live version, that version's structure node is used. Otherwise, the current version's structure node is used.
sophora_modificationDate_dt | The millisecond timestamp of the document's property sophora:modificationDate.
sophora_id_s | The document's readable ID from the document property sophora:id.
sophora_idHistory_ss | Contains all readable IDs the document ever had. This property is multivalued.
sophora_cronNextOnDate_dt | If present, the date when the document will be published again. See "Cron Server Feature" for details.

An additional list of properties can be passed through the offlineDocumentIndexer.propertyNames property. For each of these properties, a Solr field matching the field naming conventions is added. A Solr field name consists of the actual property name, with ":" replaced by "_", plus a suffix indicating the type and whether the field is multivalued.
For example, the Sophora property sophora-content:title results in the Solr field sophora-content_title_t.
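
The naming convention can be sketched as a small function. The class and method names below are hypothetical, and the suffix is passed in by the caller because this page does not describe how the indexer derives it from the property type; the suffixes that appear on this page are _s, _ss, _dt and _t.

```java
// Hypothetical helper, not part of Sophora: builds a Solr field name
// from a Sophora property name and a type suffix.
public class SolrFieldNames {

    // Replaces ":" with "_" and appends the given suffix (for example "_t" or "_ss").
    static String toSolrFieldName(String propertyName, String suffix) {
        return propertyName.replace(':', '_') + suffix;
    }

    public static void main(String[] args) {
        System.out.println(toSolrFieldName("sophora-content:title", "_t"));
        System.out.println(toSolrFieldName("sophora:idHistory", "_ss"));
    }
}
```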

Slave server configuration

The offline document indexer does not require any specific configuration on the slaves to be enabled. However, each slave needs to propagate its hostname correctly to the master so that the master can access the slave's Solr. The slave tries to determine its own hostname and propagates it as part of its ServerInfo object. Often this hostname is not fully qualified, so the master might have trouble reaching the slave. To correct this, you can explicitly set a fully qualified hostname for a slave.

Property name | Explanation
sophora.replication.slaveHostname | Sets the fully qualified hostname for this Sophora server. This property should also be set on Sophora master servers in order to be prepared for switching the master.
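
To check whether a machine would report a fully qualified hostname on its own, a quick test along the following lines may help. This is only a sketch: the class HostnameCheck is hypothetical, the check is a simple heuristic (it only looks for a dot, so an IP address would also pass), and how Sophora itself determines the hostname is not described here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Sophora.
public class HostnameCheck {

    // Heuristic: a fully qualified name has at least one dot that is
    // neither the first nor the last character.
    static boolean looksFullyQualified(String hostname) {
        if (hostname == null) {
            return false;
        }
        String h = hostname.trim();
        int dot = h.indexOf('.');
        return dot > 0 && dot < h.length() - 1;
    }

    public static void main(String[] args) {
        try {
            InetAddress local = InetAddress.getLocalHost();
            String name = local.getHostName();
            System.out.println("Detected hostname: " + name);
            System.out.println("Canonical hostname: " + local.getCanonicalHostName());
            if (!looksFullyQualified(name)) {
                System.out.println("Consider setting sophora.replication.slaveHostname explicitly.");
            }
        } catch (UnknownHostException e) {
            System.out.println("Local hostname could not be resolved: " + e.getMessage());
        }
    }
}
```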

Delivery example

One way to use the offline core is for redirects. For this purpose, create a class that implements the interface IRedirectBuilder.
The class might look like this:

public class ProjectRedirectBuilder implements IRedirectBuilder {
	private static final String OFFLINE_CORE_NAME = "offline";
	private static final String REDIRECT_DOCUMENT_S = "redirectDocument_s";
	private static final String SOPHORA_STRUCTURE_NODE_S = "sophora_structureNode_s";
	private static final Logger log = LoggerFactory.getLogger(ProjectRedirectBuilder.class);
	private final String[] fields = { "*" };
	@Override
	public String createRedirectUrl(SophoraUrl url, IContentMapContext context) {
		if (url == null || url.getSophoraId() == null) {
			return null;
		}
		String sophoraId = url.getSophoraId();
		if (StringUtils.isNotBlank(sophoraId)) {
			SolrQuery solrQuery = new SolrQuery();
			solrQuery.setQuery("sophora_id_s:" + sophoraId);
			solrQuery.setFields(fields);
			solrQuery.setRows(1);
			solrQuery.setStart(0);
			solrQuery.addSort("score", ORDER.desc);
			SolrResult solrResult = SolrClient.query(OFFLINE_CORE_NAME, solrQuery);
			String redirectDocumentUuid = null;
			if (solrResult != null && !solrResult.getEntries().isEmpty()) {
				SolrDocument solrDocument = solrResult.getEntries().get(0);
				if (solrDocument.containsKey(REDIRECT_DOCUMENT_S)) {
					// A redirect document is present
					try {
						String redirectDocUuid = (String) solrDocument.getFieldValue(REDIRECT_DOCUMENT_S);
						context.getDocumentByUuid(redirectDocUuid);
						return createRedirectUrlForUuid(context, redirectDocUuid);
					} catch (ItemNotFoundException e) {
						log.debug(e.getMessage(), e);
					}
				}
				// Walk up the structure node hierarchy looking for a default document
				String structureNodeUuid = (String) solrDocument.getFieldValue(SOPHORA_STRUCTURE_NODE_S);
				StructureInfo origStructureInfo = context.getStructureInfo(UUID.fromString(structureNodeUuid));
				List<UUID> structureNodeHierarchy = origStructureInfo.getStructureNodeHierarchy();
				List<UUID> reverseStructureNodeHierarchy = Lists.reverse(structureNodeHierarchy);
				for (UUID snUuid : reverseStructureNodeHierarchy) {
					StructureInfo structureInfo = context.getStructureInfo(snUuid);
					UUID defaultDocumentUUID = structureInfo.getDefaultDocumentUUID();
					if (defaultDocumentUUID != null) {
						redirectDocumentUuid = defaultDocumentUUID.toString();
						break;
					}
				}
				return createRedirectUrlForUuid(context, redirectDocumentUuid);
			}
		}
		return null;
	}
	private String createRedirectUrlForUuid(IContentMapContext context, String redirectDocumentUuid) {
		String redirectUrl = null;
		if (StringUtils.isNotBlank(redirectDocumentUuid)) {
			redirectUrl = context.createUrl(false, false, true, (String) null, redirectDocumentUuid, (String) null, (String) null, new HashMap<String, Object>(), new HashMap<String, Object>(), null, null);
			redirectUrl = "/" + StringUtils.substringAfter(redirectUrl.replaceFirst("/", ""), "/");
			log.debug("Creating redirect url {} for document uuid {}", redirectUrl, redirectDocumentUuid);
		}
		return redirectUrl;
	}
}
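
The fallback step in the example, walking up the structure node hierarchy until a node with a default document is found, can be looked at in isolation. The following sketch uses plain Java collections instead of the Sophora API: the class DefaultDocumentFallback and the map of default documents are hypothetical stand-ins, and the sketch assumes that the hierarchy list is ordered from the root down to the document's own structure node, which is what the reversal in the example suggests.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical stand-in for the hierarchy walk, not part of Sophora.
public class DefaultDocumentFallback {

    // hierarchyRootFirst: structure node UUIDs from the root down to the original node (assumed order).
    // defaultDocuments: maps a structure node UUID to the UUID of its default document, if it has one.
    static Optional<UUID> findNearestDefaultDocument(List<UUID> hierarchyRootFirst,
            Map<UUID, UUID> defaultDocuments) {
        // Start at the most specific node and move towards the root.
        for (int i = hierarchyRootFirst.size() - 1; i >= 0; i--) {
            UUID defaultDocument = defaultDocuments.get(hierarchyRootFirst.get(i));
            if (defaultDocument != null) {
                return Optional.of(defaultDocument);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        UUID root = UUID.randomUUID();
        UUID section = UUID.randomUUID();
        UUID subsection = UUID.randomUUID();
        UUID sectionIndex = UUID.randomUUID();

        Map<UUID, UUID> defaults = new HashMap<>();
        defaults.put(root, UUID.randomUUID());
        defaults.put(section, sectionIndex);

        // The subsection has no default document, so the section's default document is chosen.
        Optional<UUID> target = findNearestDefaultDocument(Arrays.asList(root, section, subsection), defaults);
        System.out.println(target.equals(Optional.of(sectionIndex)));
    }
}
```

Returning an Optional makes the "no default document anywhere in the hierarchy" case explicit; in the example above, the same case leads to a null redirect URL.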

Technical details

Although the offline documents core is available on the Solr instances of the master and of the slaves (both replication and staging slaves), all indexing is done by the master.
Updating a slave is not part of the regular synchronization process. When a new slave connects, the master updates it by comparing the slave's current offline documents core with the overall list of offline documents.
The offline documents index will not contain any documents that are deleted.
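
Conceptually, such a comparison amounts to two set differences. The sketch below is not the actual Sophora implementation, which is not described here; the class OfflineCoreDiff and the use of plain ID strings are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not part of Sophora.
public class OfflineCoreDiff {

    // IDs that are offline but not yet present in the slave's offline core.
    static Set<String> missingInCore(Set<String> offlineDocumentIds, Set<String> coreIds) {
        Set<String> missing = new TreeSet<>(offlineDocumentIds);
        missing.removeAll(coreIds);
        return missing;
    }

    // IDs that are in the slave's offline core but no longer offline (for example deleted or republished).
    static Set<String> staleInCore(Set<String> offlineDocumentIds, Set<String> coreIds) {
        Set<String> stale = new TreeSet<>(coreIds);
        stale.removeAll(offlineDocumentIds);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> offline = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> core = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println("to index: " + missingInCore(offline, core));
        System.out.println("to remove: " + staleInCore(offline, core));
    }
}
```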

Controlling the OfflineIndexer through JMX

The Sophora master server will provide a specific JMX-Bean if the OfflineIndexer is activated. This bean is com.subshell.sophora.server.plugins/OfflineDocumentIndexer and will provide a list of indexers where there is one indexer for each server including the master. Each indexer comes with these properties:

Indexer JMX Properties
Name | Description
host | Holds the Solr base URL for this indexer, e.g. http://stage01.mycompany.com:1196/solr
id | The ID of the indexer, which is also the ID of its corresponding Sophora server.
state | Can be one of:
  • inactive: The indexer has been explicitly switched off or cannot reach the host
  • running: The indexer is currently indexing elements from a queue
  • ready: The indexer has been started and is listening for events
currentWorkToDo | The number of documents in the queue (in case the state is running).

This bean also offers operations on these indexers. They all take the ID of the indexer as input.

  • deactiveIndexer: Explicitly deactivates a single indexer
  • activateIndexer: Activates a deactivated indexer
  • fullRebuild: Rebuilds the index from scratch. You might want to use this if you have configured new properties for your offline index and want them to be present on all indexed documents. We do not recommend triggering a rebuild of all indexers at the same time.
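
These operations can also be invoked programmatically with the standard javax.management API. The following is a minimal sketch under several assumptions: the class OfflineIndexerJmx is hypothetical, the indexer ID is assumed to be passed as a String, and the caller has to supply the connection and the exact ObjectName of the bean, since neither the JMX service URL nor the ObjectName syntax is given on this page.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper, not part of Sophora.
public class OfflineIndexerJmx {

    // Invokes an operation such as "fullRebuild" or "activateIndexer" for one indexer.
    // Assumption: the operation takes the indexer ID as a single String parameter.
    static Object invokeOnIndexer(MBeanServerConnection connection, ObjectName bean,
            String operation, String indexerId) {
        try {
            return connection.invoke(bean, operation,
                    new Object[] { indexerId },
                    new String[] { String.class.getName() });
        } catch (Exception e) {
            // Checked JMX and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("JMX operation " + operation + " failed", e);
        }
    }
}
```

A remote connection would typically be obtained through JMXConnectorFactory.connect with a JMXServiceURL for the master server and then getMBeanServerConnection on the resulting connector.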

If an indexer is activated (either through the JMX operation or because its corresponding Sophora server reconnects to the Sophora master server after having been disconnected for some time), it automatically fills its queue with the IDs of all documents that have been set offline since the latest modification to the offline index of this server.

A full rebuild therefore is only performed if the offline core on the server is completely empty or the JMX operation "fullRebuild" has been triggered.
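
The rule described in the two paragraphs above can be summarized as a small decision function. The names below are hypothetical and only restate the documented behavior; they do not reflect the internal implementation.

```java
// Hypothetical illustration of the documented rule, not part of Sophora.
public class RebuildDecision {

    enum Mode { FULL_REBUILD, CATCH_UP }

    // A full rebuild happens only if the offline core is empty or "fullRebuild" was triggered;
    // otherwise the indexer only queues documents set offline since the last index modification.
    static Mode onActivation(long documentsInOfflineCore, boolean fullRebuildRequested) {
        if (fullRebuildRequested || documentsInOfflineCore == 0) {
            return Mode.FULL_REBUILD;
        }
        return Mode.CATCH_UP;
    }

    public static void main(String[] args) {
        System.out.println(onActivation(0, false));
        System.out.println(onActivation(1200, false));
        System.out.println(onActivation(1200, true));
    }
}
```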

Setting an indexer to inactive is not persistent. If you restart the master server or switch the master role to a former slave then the inactive indexer will switch back to running.

Cache metrics through JMX

The offline document indexer also provides the bean com.subshell.sophora.server.plugins/OfflineDocumentIndexerMetrics. It holds the metrics gathered for the two caches (Sophora documents and Solr documents). The more slaves there are, the higher the hit rate should be. If you have many Sophora staging slaves and the hit rates are still low, consider increasing the cache size.

Caching and metrics are available with Sophora Server versions 2.6.0, 2.5.25 and newer.