Indexer 4

Indexer Guide

Sophora's Indexer synchronises a search engine's index with the documents of the Sophora Server.

The Sophora Indexer is connected to the Sophora Primary Server's ContentManager and receives notifications when structure nodes or documents are changed. How this applies to the connected search engine is defined through an indexer plugin.

The Indexer holds a priority update queue to handle state change events and to manage resulting document operations with the following priorities:

Event | Priority | Explanation
Publishing a structure node | low | All subordinate documents of the published structure node are updated, i.e. re-indexed.
Publishing a single document | high | Individual documents are favoured and re-indexed.
Enabling a structure node | medium | All subordinate documents of the enabled structure node are updated, i.e. re-indexed.
Disabling a structure node | - | No need for an update. All subordinate documents will just be removed from the index. The same applies when a structure node is set offline.
Removing a document | - | No need for an update. The document will just be removed from the index. The same applies when a document is set offline.
Removing a structure node | - | Has no impact on the document index since a structure node may only be deleted if no document is located at this node anymore.
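
The prioritisation described above can be pictured with a small priority queue. The following Java sketch is only an illustration under assumptions: the names UpdateQueueSketch, Priority and UpdateTask are hypothetical and not part of the Sophora API, and the Indexer's real queue is persistent and may be implemented differently. It shows that high-priority events are handled before medium and low ones, while events of equal priority keep their arrival order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Sophora code: orders update tasks by priority.
class UpdateQueueSketch {
    // Declaration order defines the processing order: HIGH is handled first.
    enum Priority { HIGH, MEDIUM, LOW }

    static final class UpdateTask implements Comparable<UpdateTask> {
        final String description;
        final Priority priority;
        final long sequence; // insertion counter, keeps FIFO order within one priority

        UpdateTask(String description, Priority priority, long sequence) {
            this.description = description;
            this.priority = priority;
            this.sequence = sequence;
        }

        @Override
        public int compareTo(UpdateTask other) {
            int byPriority = priority.compareTo(other.priority);
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    private final PriorityQueue<UpdateTask> queue = new PriorityQueue<>();
    private long counter = 0;

    void add(String description, Priority priority) {
        queue.add(new UpdateTask(description, priority, counter++));
    }

    // Removes all tasks and returns their descriptions in processing order.
    List<String> drainInOrder() {
        List<String> order = new ArrayList<>();
        UpdateTask task;
        while ((task = queue.poll()) != null) {
            order.add(task.description);
        }
        return order;
    }
}
```

In this sketch, a published document that arrives after a published structure node is still processed first, which mirrors the "high" versus "low" rows of the table.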

The Indexer is split into the projects com.subshell.sophora.indexer and com.subshell.sophora.indexer.api. It is started by the class com.subshell.sophora.indexer.Indexer.

Directory Structure

As described in the Sophora Server Documentation, a certain directory structure is recommended for an installation of the Sophora Indexer.

The internal arrangement of files and directories within the tar.gz archive file, in which the indexer application is assembled, encourages the use of this structure.

----cms-directory
--------apps
------------...
------------com.subshell.sophora.indexer-2.3.0
------------...
------------sophora-indexer -> Symbolic link to com.subshell.sophora.indexer-2.3.0
------------...
--------indexer
------------config
----------------indexer.properties
------------groovy
------------logs
------------plugins
------------sophora-indexer.sh -> Symbolic link to ../apps/sophora-indexer/indexer.sh

apps - The directory apps contains the software components used in your Sophora environment. In this case the Sophora Indexer application is located in the subdirectory com.subshell.sophora.indexer-2.3.0. A symbolic link points to this directory so that you can easily switch between versions of the indexer, e.g. in case of an update.

indexer - This folder is the workspace of the indexer installation. It contains the indexer's configuration file (in the subdirectory config), the log files (in the subdirectory logs), the plugin to use (in the subdirectory plugins), Groovy scripts for providing field values (in the subdirectory groovy), and a symbolic link to the start and stop script. Except for the logs directory, which is created automatically on the first execution, this directory including its subdirectories and the symbolic link must be created manually.

plugins - Place an indexer plugin jar of your choice in this directory. The plugin will then be made available for the indexer application at runtime. See the plugin mechanism section for more information.

Configuration

The following sections describe how to configure the Sophora Indexer for your needs.

Configuration file (sophora.properties)

The Indexer's behaviour is defined by a configuration file. This file is mandatory. It is passed to the Indexer at startup as a VM argument. The syntax to specify the configuration file is as follows:

-Dsophora.properties=<path to properties file>

Properties from the configuration file overwrite those from the default configuration. To apply changes in this file the Indexer needs to be restarted.
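
How settings from the file can override defaults is sketched below with java.util.Properties. This is a minimal illustration, not the Indexer's actual start-up code: the class ConfigSketch and its load method are hypothetical, and the single default shown (bulkSize=100) is taken from the property table in this guide.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical sketch, not Sophora code: layers a properties file over defaults.
class ConfigSketch {
    // Values read from userConfig win; missing keys fall back to the defaults.
    static Properties load(Properties defaults, Reader userConfig) {
        Properties effective = new Properties(defaults);
        try {
            effective.load(userConfig);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return effective;
    }

    public static void main(String[] args) throws IOException {
        Properties defaults = new Properties();
        defaults.setProperty("sophora.indexer.queue.bulkSize", "100");

        // The path is expected as VM argument: -Dsophora.properties=<path to properties file>
        String path = System.getProperty("sophora.properties");
        if (path == null) {
            throw new IllegalStateException("VM argument -Dsophora.properties is missing");
        }
        try (Reader reader = Files.newBufferedReader(Paths.get(path))) {
            Properties config = load(defaults, reader);
            System.out.println(config.getProperty("sophora.indexer.queue.bulkSize"));
        }
    }
}
```

Because the file is read once at startup in this sketch, a changed value only becomes visible after a restart, which matches the behaviour described above.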

Property | Mandatory | Description / sample value
sophora.contentmanager.serviceUrl | yes | URL of the content manager; protocol: RMI or HTTP. Example: rmi://localhost:1199/ContentManager
sophora.contentmanager.username | yes | Username for the content manager
sophora.contentmanager.password | yes | Password for the content manager
sophora.contentmanager.proxyHost | no | URL of the proxy
sophora.contentmanager.proxyPort | no | Port of the proxy (between 1024 - 65535)
sophora.contentmanager.proxyUsername | no | Username for the proxy
sophora.contentmanager.proxyPassword | no | Password for the proxy
sophora.contentmanager.connectRetries | no | Number of attempts to log into the Sophora Server in case the login fails on the first try.
sophora.contentmanager.connectRetryInterval | no | The time in seconds to wait between connection attempts.
sophora.searchEngine.connection | yes | Define a specific implementation here. There must be a corresponding Spring bean which implements the ISearchEngineFactory interface (e.g. subsearch, solr, forum, facebook)
sophora.indexer.jolokia.port | no | The port for the Jolokia JMX adapter service.
sophora.indexer.db.directory | no | Directory for the update-queue DB (default: ./db)
sophora.indexer.searchMixinName | yes | Only documents with this mixin will be indexed. If this property is not set, no documents will be indexed. If the search mixin changes, it is necessary to reset an existing search index manually.
sophora.indexer.unsearchableFieldName | no |
sophora.indexer.removeBeforeUpdate | no | If set to true, a remove request for all index keys is sent to the search engine before updating. Default is true.
sophora.indexer.removeAfterUpdate | no | Defines whether, after an update, the document should be removed from all index keys to which it was not added. Default is true. Applies only if the property sophora.indexer.removeBeforeUpdate is set to false.
sophora.indexer.urlService.url | no | The URL of a web service that generates URLs for Sophora documents. See Generating URLs for Documents for details. If this is not set, URLs are generated using a built-in algorithm.
sophora.indexer.urlService.onError.maxDelay | no | If an error occurs while requesting the URL from the specified web service, the indexer retries the request. This property configures the maximum time, in seconds, for which the Indexer keeps trying to get the URL of a document. The default is 600 (10 minutes). The indexing of all subsequent documents is also delayed by this duration.
sophora.replication.restartDate | no | Starting date of the synchronisation process after a restart. All documents that have been modified after this date are re-indexed. The date has to have this format: yyyy.mm.dd hh:mm
sophora.replication.restartQuery | no | Query that is executed at a restart. Only documents that match this query are indexed. If this property is set, sophora.replication.restartDate will be ignored. The query requires an XPath statement like the following: element(*, sophora-mix:document) [@sophora:id = 'test100']
sophora.startDatePropertyName | no | Name of the property which contains the "online from" information of a document (e.g. sophora:startdate).
sophora.searchEngine.fullupdate | no | Defines whether all available documents are indexed. Default is true.
sophora.indexer.alive.logfile | no | Destination of the logfile which stores the last indexing date. It behaves like sophora.replication.restartDate, but reads and sets its date from the logfile automatically. If a specific restart date is set with sophora.replication.restartDate or sophora.searchEngine.fullupdate is set to true, this logfile will be ignored (default: logs/indexerLastAlive.log).
sophora.indexer.jmx.registry.port | no | Port for JMX connections (between 1024 - 65535)
sophora.indexer.rmi.registry.port | no | Port for the RMI registry (between 1024 - 65535)
sophora.indexer.jmx.registry.username | no | Login for JMX connections
sophora.indexer.jmx.registry.password | no | Password for JMX connections
sophora.indexer.directory.xsl | no | Directory containing XSL files for the transformation of string properties whose values are in XML format
sophora.indexer.useExternalId | no | When set to true, the External-ID is used to identify a document instead of the UUID. This id is set as the value of the key documentKey in the document data map given to the indexer plugin. Most plugins use this key to identify each index record. E.g. in the case of the solr plugin this value is written into the search index field id. This setting should not be changed in a running system; the search index has to be cleared before changing the id. Default is false.
sophora.indexer.queue.mechanism | no | Defines how the indexer processes elements from the queue. Possible values are:
    singleProcessing (default): One document at a time is processed.
    bulkProcessing: The indexer takes a given number of documents at once from the queue. This can speed up processing with plug-ins that profit from bulk processing (currently only the GSA plug-in). The number of documents has to be set with sophora.indexer.queue.bulkSize. The indexer then tries to get up to the defined number of documents. If the queue does not hold that many documents, all available documents will be processed. If documents are processed faster than the queue is filled, this mechanism behaves like singleProcessing.
    delayedBulkProcessing: Like bulkProcessing, a number of documents is processed at a time, and the whole description of bulkProcessing applies. Additionally, to maximize the number of processed documents, the indexer waits up to a defined time to collect documents. The maximum time to wait before processing is defined with sophora.indexer.queue.maxDelay. The delay is counted from the moment the first document is available in the queue, so it is the maximum delay for indexing a document. Processing is delayed until the defined bulkSize or the maxDelay is reached, whichever comes first. If the queue is always filled with more elements than configured, then this mechanism behaves exactly like bulkProcessing.
sophora.indexer.queue.bulkSize | no | Defines the maximum number of documents to process at once (default is 100). See the description of sophora.indexer.queue.mechanism for more information on when to use this.
sophora.indexer.queue.maxDelay | no | Defines the maximum delay in milliseconds for processing a document (default is 1000 ms). See the description of sophora.indexer.queue.mechanism for more information on when to use this.
sophora.indexer.numberOfRepeatAttempts | no | Defines the number of repeat attempts the indexer should perform if a search engine throws a RetryException (default is 6).
sophora.indexer.repeatAttemptDelay | no | Defines the delay in milliseconds between repeat attempts after the occurrence of a RetryException within a search engine plugin (default is 10000).
sophora.indexer.name | no | The Indexer's name to be used for JMX.
sophora.client.dataDir | no | Defines a directory which may be used by the Sophora Client API for persisting information like the available nodes in a cluster. The directory must be specified as an absolute path.
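
The delayedBulkProcessing behaviour described for sophora.indexer.queue.mechanism can be sketched as follows. This is an illustration under assumptions, not the Indexer's implementation: DelayedBulkSketch and nextBatch are hypothetical names, an in-memory BlockingQueue stands in for the persistent update queue, and the delay is measured from the moment the first document is taken from the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Sophora code: collects a batch of up to bulkSize
// documents, waiting at most maxDelayMillis after the first one arrived.
class DelayedBulkSketch {
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int bulkSize, long maxDelayMillis) {
        List<T> batch = new ArrayList<>(bulkSize);
        try {
            batch.add(queue.take()); // block until the first document is available
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxDelayMillis);
            while (batch.size() < bulkSize) {
                // Take whatever is already queued, without exceeding bulkSize.
                queue.drainTo(batch, bulkSize - batch.size());
                if (batch.size() >= bulkSize) {
                    break; // bulkSize reached first
                }
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    break; // maxDelay reached first
                }
                T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                if (next == null) {
                    break; // timed out while waiting for more documents
                }
                batch.add(next);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt, return what we have
        }
        return batch;
    }
}
```

With a permanently full queue the loop never waits and returns bulkSize documents immediately, which corresponds to the statement that the mechanism then behaves exactly like bulkProcessing.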

Exemplary configuration

# Connection to the ContentManager
sophora.contentmanager.serviceUrl=rmi://localhost:1199/ContentManager
sophora.contentmanager.username=admin
sophora.contentmanager.password=admin
 
# JMX settings
sophora.indexer.jmx.registry.port=50
sophora.indexer.rmi.registry.port=5031
sophora.indexer.jmx.registry.username=admin
sophora.indexer.jmx.registry.password=secret
 
# Query for the synchronisation after restarting (inactive)
#sophora.replication.restartQuery=element(*, sophora-mix:document)[@sophora:id = 'test100']
 
# Starting date for the synchronisation
sophora.replication.restartDate=2015.03.27 12:00
sophora.subsearch.fullUpdate=false
 
# Selected connection
sophora.searchEngine.connection=dummySubsearch
 
# Search mixins and fields
sophora.indexer.searchMixinName=sophora-content-mix:searchable
sophora.indexer.unsearchableFieldName=sophora-content:unsearchable
 
# Directory for XSL files to transform property values
sophora.indexer.directory.xsl=c:/temp

Updating the Index on Startup

Mapping Document Properties to Index Fields of the Search Engine

The mapping of Sophora properties to index fields of the search engine is done in the siteAndMappingConfiguration.xml file. This file has to be created and placed next to the sophora.properties file in the same directory. If you apply changes to this configuration file, the Indexer needs to be restarted for changes to take effect.

The following example demonstrates all supported use cases and possible configurations. The XML schema file can be downloaded here: indexer-configuration-1.0.0.xsd

Example of siteAndMappingConfiguration.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration xmlns="http://www.sophoracms.com/indexer-configuration/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.sophoracms.com/indexer-configuration/1.0 http://www.sophoracms.com/indexer-configuration/1.0/indexer-configuration-1.0.0.xsd">
     
    <!-- Assign index keys to sites and filters -->
    <indexes>   
        <index indexKey="indexKey1" isDefault="true">
            <sites>
                <!-- name is optional, id is required. Use the uuid of the structure node or the externalid of the structure node document to identify a structure node. -->
                <site name="sitename1" id="5c34195a-5574-4948-9b72-bc1df857fb8a" />
            </sites>
            <filter>
                <allowedNodeTypes>
                    <allowedNodeType>sophora-content-nt:story</allowedNodeType>
                </allowedNodeTypes>
                <requiredChannel>c0970f7e-85e6-412b-9e52-27073ca84e58</requiredChannel>
                <requiredProperty>sophora-content:topline</requiredProperty>
            </filter>
        </index>
        <index indexKey="indexKey2">
            <sites>
                <site name="sitename2" id="32dc2576-93b4-407d-a971-c2c6a437d7fc" />
                <site name="sitename3" id="b6d3cf76-b114-4e6c-95da-2033a413c08e" />
                <site name="aStructureNodeOfSite3" id="c7ad6486-b35b-4e16-9cec-c3b26332580d" />
                <site name="anotherStructureNodeOfSite3" id="05621d7b-585d-471d-8604-538c0b315880" />
            </sites>
            <filter>
                <allowedNodeTypes>
                    <allowedNodeType>sophora-content-nt:audio</allowedNodeType>
                    <allowedNodeType>sophora-content-nt:story</allowedNodeType>
                </allowedNodeTypes>
            </filter>
        </index>
    </indexes>
     
    <!-- Assign index key fields to Sophora document properties -->
    <mappings>
        <!-- The simplest mapping is to assign a Sophora property to a search engine's index field -->
        <mapping key="sophoraid">
            <property>sophora:id</property>
        </mapping>
         
        <!-- You can set the format of date properties as they will appear in the search engine's index.
             For instance, if you want a property called "dateToSearch" to appear in the format "yyyy.MM.dd.HH.mm.ss",
             add the following lines to the configuration.
             In most cases the format "yyyy.MM.dd.HH.mm.ss" is used as default value. Exceptions are the properties
             "dateToSearch" and "publicationDate" which occur in "yyyy.MM.dd" by default. -->
        <mapping key="dateToSearch" format="yyyy.MM.dd.HH.mm.ss">
            <property>sophora:publicationDate</property>
        </mapping>
         
        <!-- Selectvalues: Without special configuration the selected key of a drop-down list is indexed.
             If the mapped property is configured as a select value, it is possible to write the label of the
             selected key into the index field. This is achieved by appending ".value" to the property name: -->
        <mapping key="selectedValue">
            <property>sophora:dropdownField.value</property>
        </mapping>
         
        <!-- Raw property value: To map the property value without replacing all html/xml tags, you can append ".rawValue" to the property name: -->
        <mapping key="rawPropertyValue">
            <property>sophora:property.rawValue</property>
        </mapping>
          
        <!-- For properties containing XML data, an XSL transformation can be performed.
             The result of the transformation is then written into the index field.
             The XSL file name is given by the attribute "xsl".
             The file name is relative to the directory set in the property "sophora.indexer.directory.xsl" of the sophora.properties file. -->
        <mapping key="longitude" xsl="longitude.xsl">
            <property>sophora-content:map</property>
        </mapping>
         
        <!-- In order to write the content of a childnode into an index field use path expressions like these: -->
        <mapping key="teaserImageOverwrittenAlttext">
            <property>sophora-content:image/sophora-extension:alttext</property>
        </mapping>
        <mapping key="teaserImageUuid">
            <property>sophora-content:image/sophora:reference</property>
        </mapping>
        <mapping key="teaserCopytextImageUuid">
            <property>sophora-content:copytext/sophora-extension:paragraph/sophora-extension:paragraphimage[0]/sophora-extension:image[0]/sophora:reference</property>
        </mapping>
         
        <!-- If you want to insert multiple property values into a single index field, you can define these properties as a list.
             As you can see, each part of the property list can also be a path expression which refers to childnode values.
             Such a configuration will result in a single index field value, where the assigned values are separated by a white space character. -->
        <mapping key="sequence">
            <property>sophora-content:topline</property>
            <property>sophora-content:headline</property>
            <property>sophora-content:teasertext</property>
            <property>sophora-content:image/sophora-extension:alttext</property>
        </mapping>
         
        <!-- It is also possible to configure alternative properties, if the intended property of a document is empty. You can also define multiple alternatives.
             Example: If the 'sophora-content:date' property is not set, the value of 'sophora:publicationDate' is taken alternatively and so on: -->
        <mapping key="date" format="yyyy.MM.dd.HH.mm">
            <alternative>
                <property>sophora-content:date</property>
                <property>sophora:publicationDate</property>
                <property>sophora:modificationDate</property>
            </alternative>
        </mapping>
         
        <!-- Define multiple properties into a single index field, containing alternatives: -->
        <mapping key="sequenceWithAlternatives">
            <alternative>
                <property>sophora-content:topline</property>
                <property>sophora-content:headline</property>
            </alternative>
            <property>sophora-content:teasertext</property>
            <alternative>
                <property>sophora-content:date</property>
                <property>sophora:publicationDate</property>
                <property>sophora:modificationDate</property>
            </alternative>
        </mapping>
         
        <!-- If a property is not available for a document, it may be retrieved from the structure hierarchy. To do so, you can use the following expression.
             The example would be interpreted as follows: If the property "sophora:property" does not exist in the document that should be indexed,
             the indexer tries to retrieve it from the structure node documents of the parent structure nodes. -->
        <mapping key="field">
            <alternative>
                <property>sophora:property</property>
                <operation>sophora.indexer.getPropertyValueFromStructureHierarchy</operation>
            </alternative>
        </mapping>
         
        <!-- To write the UUIDs of the active channels of a document into a single index field,
             you need to configure the mapping property accordingly. The index field's content will be a space separated list of UUIDs like:
             e91c87d4-8e16-4e40-9d69-689548efe5ab f833f3f6-b894-4064-9d10-eadb244a52cf. The list includes the default structure hierarchy information;
             e.g. if no channels are defined for this document, it might inherit some from the parent nodes. -->
        <mapping key="channels">
            <operation>sophora.indexer.activeChannels</operation>
        </mapping>
              
        <!-- Another possibility to use information from a document's structure is to determine the structure nodes' UUIDs by defining the subsequent mapping:  -->
        <mapping key="structureNodes">
            <operation>sophora.indexer.generateStructurePathUuids</operation>
        </mapping>
         
 
        <!-- If you want to generate the URL of the document that is indexed, configure the following:  -->
        <mapping key="generatedUrl">
            <operation>sophora.indexer.generateUrl</operation>
        </mapping>
        <!-- Or alternatively, only generate a URL if the document doesn't provide one: -->
        <mapping key="url">
            <alternative>
                <property>sophora-content:url</property>
                <operation>sophora.indexer.generateUrl</operation>
            </alternative>
        </mapping>
 
        <!-- It is possible to provide values using classes defined in Groovy or Java. The content
             of the <operation>-element is the fully qualified classname of a class implementing
             the interface IFieldValueSource. -->
        <mapping key="groovyFoo">
            <operation>Foo</operation>
        </mapping>
        <mapping key="groovyAlternative">
            <alternative>
                <operation>Foo</operation>
                <operation>mycompany.Bar</operation>
                <operation>sophora.indexer.generateUrl</operation>
            </alternative>
        </mapping>
    </mappings>
</configuration>
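
The way a mapping with several <property> entries and <alternative> blocks might be evaluated is sketched below in Java. All names (MappingSketch, firstNonEmpty, resolve) are hypothetical, and the document is reduced to a simple map of property names to string values. The sketch assumes that an alternative yields its first non-empty value and that parts without a value are skipped; the Indexer's actual handling of empty parts is not specified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Sophora code: evaluates a mapping that consists of
// parts, where each part is a list of alternative property names.
class MappingSketch {
    // Returns the first non-empty value among the alternatives, or null if none is set.
    static String firstNonEmpty(Map<String, String> document, List<String> alternatives) {
        for (String property : alternatives) {
            String value = document.get(property);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        return null;
    }

    // Joins the resolved values of all parts with a single white space character.
    static String resolve(Map<String, String> document, List<List<String>> parts) {
        List<String> values = new ArrayList<>();
        for (List<String> alternatives : parts) {
            String value = firstNonEmpty(document, alternatives);
            if (value != null) {
                values.add(value); // assumption: parts without a value are skipped
            }
        }
        return String.join(" ", values);
    }
}
```

A plain <property> entry corresponds to a part with exactly one alternative, so the "sequenceWithAlternatives" mapping from the example becomes three parts: two alternatives, one property, and three alternatives.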

Built-in field operations

Name | Description
sophora.indexer.activeChannels | Writes the UUIDs of the active channels of a document into the index field. The field's content will be a space-separated list of UUIDs like: "e91c87d4-8e16-4e40-9d69-689548efe5ab f833f3f6-b894-4064-9d10-eadb244a52cf". The list includes the default structure hierarchy information; e.g. if no channels are defined for this document, it might inherit some from the parent nodes.
sophora.indexer.getPropertyValueFromStructureHierarchy | Retrieves the content of a property from the structure hierarchy. This operation searches the structure nodes in the path of the document for the first structure node document that contains the property. This operation may only be used as the second entry in an <alternative> block. The first entry must be a <property> element, which defines the property to search for.
sophora.indexer.generateStructurePathUuids | Writes the space-separated UUIDs of the structure path of the document.
sophora.indexer.generateUrl | Generates a URL for the document using an internal algorithm or, if the property sophora.indexer.urlService.url is set, by asking the URL service.

Using Groovy scripts to implement custom operations

Operations providing field values can be implemented using Groovy scripts. The scripts must be located in the groovy directory next to the config directory. Each script must define a class that implements the interface com.subshell.sophora.indexer.source.IFieldValueSource. A custom operation implemented by a script is referenced in the mapping using the fully qualified name of the class defined by the script.

The following example defines an operation which sets the value of the field "structureNodeName" to the name of the structure node, where the document to be indexed is located.

siteAndMappingConfiguration.xml

<mapping key="structureNodeName">
    <operation>GetStructureNodeName</operation>
</mapping>

groovy/GetStructureNodeName.groovy

import com.subshell.sophora.api.content.INode;
import com.subshell.sophora.api.structure.StructureInfo
import com.subshell.sophora.client.ISophoraClient;
import com.subshell.sophora.indexer.api.IFieldValueSource;
 
class GetStructureNodeName implements IFieldValueSource {
    private ISophoraClient client;
     
    @Override
    public void setClient(ISophoraClient client) {
        this.client = client;
    }
     
    @Override
    public String getValue(INode document, String fieldName) {
        def structure = client.getStructureInfo(document.getString("sophora:structureNode"))
        return structure.getStructureNodeName()
    }
}

Deprecated: Mapping Document Properties to Index Fields of the Search Engine in the sophora.properties File

If the siteAndMappingConfiguration.xml file does not exist, the mapping of Sophora properties to index fields of the search engine is read from the sophora.properties file.

The following properties can be set in the sophora.properties file to configure the mapping:

Property | Mandatory | Description / sample value
sophora.indexer.sites | yes | Comma-separated list of sites. For each site the properties sophora.indexer.site.<sitename>.id and sophora.indexer.site.<sitename>.indexkey are required.
sophora.indexer.site.<sitename>.id | yes | UUID of a site or structure node, or the ExternalId of a structure node document. The placeholder "sitename" refers to a value of the property sophora.indexer.sites.
sophora.indexer.site.<sitename>.indexkey | yes | For example tagesschauKey
sophora.indexer.site.default.indexkey | no | If the site's UUID is empty, this one will be used instead
mapping.<propertyname> | no | Maps a document's property to an index field of the search engine; e.g. sophora:id
sophora.indexer.filter.<indexkey>.allowedTypes | no | Comma-separated list of document types that should be included in the index. If this property is empty, all document types will be indexed. Example: sophora-nt:audio, sophora-nt:video
sophora.indexer.filter.<indexkey>.requiredChannel | no | UUID of a delivery channel. If a document is excluded from this channel explicitly, it won't be indexed. If this property is empty, delivery channels are not considered while indexing. Can be set for each index separately.
dateFormat.<propertyname> | no | Defines the date format for the mapping of the property with the given name, e.g. yyyy.MM.dd.HH.mm.ss

Example mapping configuration in sophora.properties file (same configuration as in siteAndMappingConfiguration.xml explained above):

Mapping in sophora.properties

# -----------------------------
# Configure index keys to sites
# -----------------------------
sophora.indexer.sites=sitename1,sitename2,sitename3,aStructureNodeOfSite3,anotherStructureNodeOfSite3
 
sophora.indexer.site.sitename1.id=5c34195a-5574-4948-9b72-bc1df857fb8a
sophora.indexer.site.sitename1.indexkey=indexKey1
 
sophora.indexer.site.sitename2.id=32dc2576-93b4-407d-a971-c2c6a437d7fc
sophora.indexer.site.sitename2.indexkey=indexKey2
 
sophora.indexer.site.sitename3.id=b6d3cf76-b114-4e6c-95da-2033a413c08e
sophora.indexer.site.sitename3.indexkey=indexKey2
 
sophora.indexer.site.aStructureNodeOfSite3.id=c7ad6486-b35b-4e16-9cec-c3b26332580d
sophora.indexer.site.aStructureNodeOfSite3.indexkey=indexKey2
 
sophora.indexer.site.anotherStructureNodeOfSite3.id=05621d7b-585d-471d-8604-538c0b315880
sophora.indexer.site.anotherStructureNodeOfSite3.indexkey=indexKey2
 
# If the site UUID is empty
sophora.indexer.site.default.indexkey=indexKey1
 
# -----------------
# Configure filters
# -----------------
sophora.indexer.filter.indexKey1.allowedTypes=sophora-content-nt:story
sophora.indexer.filter.indexKey1.requiredChannel=c0970f7e-85e6-412b-9e52-27073ca84e58
sophora.indexer.filter.indexKey1.requiredProperty=sophora-content:topline
 
sophora.indexer.filter.indexKey2.allowedTypes=sophora-content-nt:audio,sophora-content-nt:story
 
# ---------------------------------------------------------
# Configure index key fields to Sophora document properties
# ---------------------------------------------------------
# Each mapping of Sophora properties to index fields of the search engine is done in the following way:
#  mapping.SEARCH_ENGINE_INDEXFIELD=SOPHORA_PROPERTY_EXPRESSION
 
# The simplest mapping is to assign a Sophora property to a search engine's index field:
mapping.sophoraid=sophora:id
 
# You can set the format of date properties as they will appear in the search engine's index.
# For instance, if you want a property called "dateToSearch" to appear in the format "yyyy.MM.dd.HH.mm.ss",
# add the following in the configuration.
# In most cases the format "yyyy.MM.dd.HH.mm.ss" is used as default value. Exceptions are the properties
# "dateToSearch" and "publicationDate" which occur in "yyyy.MM.dd" by default.
mapping.dateToSearch=sophora:publicationDate
dateFormat.dateToSearch=yyyy.MM.dd.HH.mm.ss
 
# Selectvalues: Without special configuration the selected key of a drop-down list is indexed.
# If the mapped property is configured as a select value, it is possible to write the label of the
# selected key into the index field. This is achieved by appending ".value" to the property name:
mapping.selectedValue=sophora:dropdownField.value

# raw property value: To map the property value without replacing all html/xml tags, you can append a ".rawValue" to the property name: 
mapping.rawPropertyValue=sophora:property.rawValue
 
# For properties containing XML data, an XSL transformation can be performed.
# The result of the transformation is then written to the index field.
# The XSL file name is given in a property with the name "mapping.<field>.xsl".
# The file name is relative to the directory set in the property "sophora.indexer.directory.xsl".
sophora.indexer.directory.xsl=c:/temp
mapping.longitude=sophora-content:map
mapping.longitude.xsl=longitude.xsl
 
# In order to write the content of a childnode into an index field use a path expression like:
mapping.teaserImageOverwrittenAlttext=sophora-content:image/sophora-extension:alttext
mapping.teaserImageUuid=sophora-content:image/sophora:reference
mapping.teaserCopytextImageUuid=sophora-content:copytext/sophora-extension:paragraph/sophora-extension:paragraphimage[0]/sophora-extension:image[0]/sophora:reference
 
# If you want to insert multiple property values into a single index field, you can define these properties as list.
# As you can see each part of the property list can also be a path expression to refer to childnode values.
# Such a configuration will result in a single index field value where the assigned values are separated by a white space character.
mapping.sequence=sophora-content:topline,sophora-content:headline,sophora-content:teasertext,sophora-content:image/sophora-extension:alttext
 
# It is also possible to configure alternative properties, if the intended property of a document is empty.
# This is achieved using the delimiter "|". You can also define multiple alternatives.
# Example: If the sophora-content:date property is not set, the value of sophora:publicationDate is taken alternatively and so on:
mapping.date=sophora-content:date|sophora:publicationDate|sophora:modificationDate
dateFormat.date=yyyy.MM.dd.HH.mm
 
# NOTE: The combination of "," and "|" within one mapping expression is allowed. A combination of "/" and "|" on the contrary is prohibited.
# Define multiple properties into a single index field, containing alternatives:
mapping.sequenceWithAlternatives=sophora-content:topline|sophora-content:headline,sophora-content:teasertext,sophora-content:date|sophora:publicationDate|sophora:modificationDate
 
# If a property is not available for a document, it may be retrieved from the structure hierarchy. To do so add the following expression.
# The example would be interpreted as follows: If the property "sophora:property" does not exist in the document that should be indexed,
# the indexer tries to retrieve it from the structure node documents of the superior structure nodes.
mapping.field=sophora:property|sophora.indexer.getPropertyValueFromStructureHierarchy
 
# To write the UUIDs of the active channels of a document (including default structure hierarchy information;
# e.g. if no channels are defined for this document, it might inherit some from the superior nodes) into a single index field,
# you need to configure the mapping property accordingly. The index field's content will be a space separated list of UUIDs like:
# e91c87d4-8e16-4e40-9d69-689548efe5ab f833f3f6-b894-4064-9d10-eadb244a52cf
mapping.channels=sophora.indexer.activeChannels
 
# Another possibility to use information from the structure is to determine the structure nodes' UUIDs by defining the subsequent mapping:
mapping.structureNodes=sophora.indexer.generateStructurePathUuids
 
# If you want to generate the URL of the document that is indexed, configure the following:
mapping.generatedUrl=sophora.indexer.generateUrl
# Or alternatively, only generate a URL if the document doesn't provide one:
mapping.url=sophora-content:url|sophora.indexer.generateUrl
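
The "," and "|" rules from the mapping comments above can be illustrated with a small sketch. It is not the indexer's implementation: the class and method names are hypothetical, properties are looked up in a flat map, path expressions ("/") and the sophora.indexer.* operations are not handled, and it assumes that a list part whose alternatives are all empty is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MappingExpressionSketch {

    /** Resolves a mapping expression against a flat map of property values (hypothetical helper). */
    static String resolve(String expression, Map<String, String> properties) {
        List<String> values = new ArrayList<>();
        // "," separates the parts that are joined into a single index field value.
        for (String part : expression.split(",")) {
            // "|" separates alternatives; the first non-empty property wins.
            for (String alternative : part.split("\\|")) {
                String value = properties.get(alternative.trim());
                if (value != null && !value.isEmpty()) {
                    values.add(value);
                    break;
                }
            }
            // Assumption: a part without any non-empty alternative contributes nothing.
        }
        // The collected values are separated by a single white space character.
        return String.join(" ", values);
    }

    public static void main(String[] args) {
        Map<String, String> document = Map.of(
                "sophora-content:headline", "Indexer released",
                "sophora-content:teasertext", "A short teaser",
                "sophora:publicationDate", "2021.01.12.10.30");
        String expression = "sophora-content:topline|sophora-content:headline,"
                + "sophora-content:teasertext,"
                + "sophora-content:date|sophora:publicationDate|sophora:modificationDate";
        System.out.println(resolve(expression, document));
    }
}
```

In this sample the topline and the date are missing, so the headline and the publication date are taken as alternatives.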

Generating URLs for Documents

If you want to generate the URL of the document that is indexed, using the sophora.indexer.generateUrl operation, the URL is generated from these parameters by default:

  • URL configured for the site (e.g. http://www.sophoracms.com),
  • name of the structure node from the structure path (e.g. "home"),
  • the Sophora ID (e.g. "sophoraid100") and
  • the extension ".html"

Example: http://www.sophoracms.com/home/sophoraid100.html
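
As a rough illustration of how the four parts listed above combine, the following sketch joins them into a URL. The class and method names are hypothetical, and passing the structure path as a list of node names is an assumption made for the example; the indexer's actual algorithm may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

class DefaultUrlSketch {

    /** Joins site URL, structure path, Sophora ID and the ".html" extension (hypothetical helper). */
    static String buildUrl(String siteUrl, List<String> structurePath, String sophoraId) {
        // Avoid a double slash if the configured site URL already ends with "/".
        String base = siteUrl.endsWith("/") ? siteUrl.substring(0, siteUrl.length() - 1) : siteUrl;
        List<String> segments = new ArrayList<>(structurePath);
        segments.add(sophoraId + ".html");
        return base + "/" + String.join("/", segments);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://www.sophoracms.com", List.of("home"), "sophoraid100"));
    }
}
```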

Using a Web Service

Instead of using the built-in algorithm, URLs for indexed documents can also be queried from a web service. To use this feature, you need to set the sophora.indexer.urlService.url property in the configuration to the URL of the web service. The web service will be given the UUID of the indexed document as the HTTP-GET parameter "uuid" and must return the URL of the document as plain text.

The indexer sets the following HTTP-GET parameters:

  • uuid: The UUID of the document for which the web service should return the URL.
  • modificationDate: The modification date of the document as milliseconds since the epoch (UTC). This is used for checking that the indexer and the web service have the same version of the document.

The following example shows the interaction between the indexer and the URL service for one document:

sophora.properties of indexer

sophora.indexer.urlService.url=http://mydomain.de/system/servlet/urlService.servlet

sophora.properties of web application

sophora.delivery.site.demosite.domain=http://mydomain.de

UUID of the indexed document: a2acc8f7-e2c3-4180-ada1-4fc4794453c9

Request by the indexer: http://mydomain.de/system/servlet/urlService.servlet?uuid=a2acc8f7-e2c3-4180-ada1-4fc4794453c9&modificationDate=1319720134811

Response by the web service: http://mydomain.de/demosite/news/news104.html
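
For testing a URL service by hand, the request from this example can be rebuilt with a few lines of Java. This is only a sketch from a client's point of view, not the indexer's own code; the class and method names are hypothetical, and it assumes Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class UrlServiceRequestSketch {

    /** Appends the two HTTP-GET parameters described above to the service URL. */
    static String buildRequestUrl(String serviceUrl, String uuid, long modificationDateMillis) {
        return serviceUrl
                + "?uuid=" + URLEncoder.encode(uuid, StandardCharsets.UTF_8)
                + "&modificationDate=" + modificationDateMillis;
    }

    /** Sends the GET request and returns the plain-text response body (not called in main). */
    static String fetchDocumentUrl(String requestUrl) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(requestUrl)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body().trim();
    }

    public static void main(String[] args) {
        System.out.println(buildRequestUrl(
                "http://mydomain.de/system/servlet/urlService.servlet",
                "a2acc8f7-e2c3-4180-ada1-4fc4794453c9",
                1319720134811L));
    }
}
```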

It is necessary to configure the service within your web application accordingly. Therefore you have to make sure the servlet is set up in the web.xml as follows.

web.xml

[...]
<servlet>
  <servlet-name>urlServlet</servlet-name>
  <servlet-class>com.subshell.sophora.delivery.servlet.UrlForIdServlet</servlet-class>
</servlet>
 
<servlet-mapping>
  <servlet-name>urlServlet</servlet-name>
  <url-pattern>/system/servlet/urlService.servlet</url-pattern>
</servlet-mapping>
[...]

Besides the UUID, the servlet may take additional parameters for URL creation:

  • type: defines the template type to be included within the URL. Default: 'default'
  • suffix: file suffix to use for the URL creation. Default: 'html'.
  • modificationDate: The modification date of the document as milliseconds since the epoch (UTC). This is used for checking that the indexer and the servlet have the same version of the document. If the modification date differs, an error is returned.
  • channel: The URL is created for the given channel. Default is the default channel. Do not use together with domainProperty.
  • domainProperty: The domain may be optionally determined via a property, which is passed through this parameter to the servlet. The value of the property is read from the siteproperties. Do not use together with channel.

Whereas the uuid parameter is set by the indexer, all other parameters can be set for project-specific adjustments. In this case a JSP template has to be called instead of invoking the servlet directly. This JSP sets the parameters and then redirects the call to the servlet. See the following example for a JSP template which automatically sets the file suffix based on the MIME type of the binary data of a document.

indexer.jsp

<%@ page session="false" pageEncoding="utf-8" contentType="text/html; charset=UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
<%@ taglib tagdir="/WEB-INF/tags/sophora-commons" prefix="sc"%>
<%@ taglib uri="http://www.subshell.com/sophora/jsp" prefix="sophora" %>
 
<c:if test="${not empty param.uuid}">
    <sophora:getDocument var="document" uuid="${param.uuid}" />
 
    <%-- Determine and set the suffix --%>
    <c:choose>
        <c:when test="${not empty document.binarydata}">
            <c:set var="mimetype" value="${document.binarydata.mimeType}" />
            <c:if test="${not empty mimetype}">
                <sophora:getSuffixByMimeType var="suffix" mimeType="${mimetype}" />
            </c:if>
        </c:when>
        <c:when test="${document['jcr:primaryType'] eq 'sophora-extension-nt:image'}">
            <c:set var="suffix">jpeg</c:set>
        </c:when>
    </c:choose>
</c:if>
 
<c:set var="redirectUrl" >/system/servlet/urlService.servlet</c:set>
 
<c:redirect url="${redirectUrl}">
    <c:param name="uuid" value="${param.uuid}" />
    <c:param name="suffix" value="${suffix}" />
    <c:param name="type" value="${param.type}" />
</c:redirect>

If no URL is configured, an error is returned. This is necessary because the generated URL must always point to the live version (with the live domain) of the given document.

Update Queue Database

The update queue database is a prioritized queue (like the update queue in the delivery). The queue contains the UUIDs of documents which have been sent to the index.

Actions like structure node changes are inserted into the queue with a lower priority than single operations such as changes to an individual document. This is because structure node changes affect all documents located at the node in question and thus might take longer. Events for documents being set offline or deleted are not added to the update queue at all, because these actions are sent directly to the index.

In general, the queue does not contain large numbers of UUIDs, except after structure node changes or while the indexer is synchronizing.
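
The ordering behaviour of such a queue can be sketched in memory as follows. This is a simplified illustration, not the indexer's database-backed queue: the class, the enum and the tie-breaking by insertion order are assumptions made for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class UpdateQueueSketch {

    /** Declared in processing order: entries with HIGH priority are polled first. */
    enum Priority { HIGH, MEDIUM, LOW }

    private static final class Entry {
        final String uuid;
        final Priority priority;
        final long sequence;

        Entry(String uuid, Priority priority, long sequence) {
            this.uuid = uuid;
            this.priority = priority;
            this.sequence = sequence;
        }
    }

    // Order by priority first, then by insertion order (assumption) within one priority.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparing((Entry e) -> e.priority).thenComparingLong(e -> e.sequence));
    private long nextSequence = 0;

    void enqueue(String uuid, Priority priority) {
        queue.add(new Entry(uuid, priority, nextSequence++));
    }

    /** Returns the UUID to process next, or null if the queue is empty. */
    String poll() {
        Entry entry = queue.poll();
        return entry == null ? null : entry.uuid;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        UpdateQueueSketch updateQueue = new UpdateQueueSketch();
        updateQueue.enqueue("doc-below-published-node", Priority.LOW);
        updateQueue.enqueue("single-published-doc", Priority.HIGH);
        System.out.println(updateQueue.poll());
    }
}
```

Although the structure node's document was enqueued first, the individually published document is polled before it.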

The update queue database is irrelevant for backing up the indexer. When you (re)start the indexer, the sophora.replication.restartDate property should be set to a date that ensures all changes made since the indexer was stopped are synchronized.

Indexer Plugin Mechanism

The Sophora Indexer uses a plugin mechanism to be able to connect to different search engines or to other external systems. Since version 1.33.1 the indexer itself only contains a dummy implementation which creates and writes to a log file. So in order to use the indexer to work with a specific search engine, a plugin has to be added and configured.

Add and configure an Indexer Plugin

First, add the plugin's library to the indexer's plugins folder. This folder must be created in the directory that contains the config folder. A plugin usually consists of a single jar file that contains the Java code, the necessary configuration files and all dependencies. Second, configure the plugin in the indexer's properties file by adding the plugin-specific configuration properties, which are listed in the plugin's documentation. In addition, set the property sophora.searchEngine.connection to the name of the plugin's bean, which is defined in the plugin's indexerExtension.xml file. The plugin is used once the indexer has been restarted.

Please note that it is possible to use only one plugin at a time.

Create your own Plugin

To integrate another search engine or external system you have to set up a corresponding Java project. There, you have to put the library com.subshell.sophora.indexer.api into the build path.
Two classes are required that implement the interfaces com.subshell.sophora.indexer.api.ISearchEngine and com.subshell.sophora.indexer.api.ISearchEngineFactory respectively. The implementation of ISearchEngineFactory provides the method getSearchEngine which returns an object of ISearchEngine. All necessary instantiations can be done here as well.

Spring Configuration

Within the build path of the newly created project there must be a directory called "spring". This folder needs to contain an XML file "indexerExtension.xml" (the names of the folder and the file must not be altered). This configuration file comprises a Spring bean definition that is an instance of com.subshell.sophora.indexer.api.ISearchEngineFactory, like the following example:

indexerExtension.xml

<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">
     
    <!-- enable annotation driven dependency injection -->
    <context:component-scan base-package="com.subshell.sophora.indexer.myindexerplugin"/>
     
    <bean id="myConnection" lazy-init="true" class="com.subshell.sophora.indexer.myindexerplugin.MyConnection">
        <property name="searchEngine" ref="mySearchEngine" />
    </bean>
     
</beans>

In this example only the bean defining an instance of ISearchEngineFactory is included. All other beans (like mySearchEngine) and their dependencies are configured using annotation-driven dependency injection. The following code snippets show the MyConnection class, which makes use of Spring's dependency injection, and an excerpt of the MySearchEngine class, which is instantiated automatically using annotations.

MyConnection.java

public class MyConnection implements ISearchEngineFactory {
     
    private MySearchEngine mySearchEngine;
     
    @Override
    public ISearchEngine getSearchEngine(ISophoraClient client) {
        return mySearchEngine;
    }
 
    public void setSearchEngine(MySearchEngine mySearchEngine) {
        this.mySearchEngine = mySearchEngine;
    }
     
}

MySearchEngine.java

@Component
@Qualifier("mySearchEngine")
public class MySearchEngine implements ISearchEngine {
 
   [...]
 
}

In addition to the interfaces ISearchEngine and ISearchEngineFactory the API provides an interface called ISiteIndexKeyProvider. A class implementing this interface is instantiated on startup and can be used within all plugins to retrieve the configured index keys for specified sites. You can get an instance of this class by using Spring's dependency injection. The name of the bean to inject is siteIndexKeyProvider.

Build your Plugin

Plugins should be built with and managed by Maven. Therefore the Indexer API must be added as a dependency to the Maven project. The API itself brings some dependent libraries into the project, like Spring, Apache Commons, Sophora Client etc. On the one hand this makes it easy to use those libraries for your own plugin, but on the other hand you have to be careful not to create conflicts when adding new dependencies.

Due to the fact that plugins should only consist of one jar file, it is recommended to use the Maven assembly plugin for building.

mvn package assembly:single

JMX Connection

To set up a JMX connection use the following pattern:

service:jmx:rmi://<host>:<sophora.indexer.jmx.registry.port>/jndi/rmi://<host>:<sophora.indexer.rmi.registry.port>/server

An example:

service:jmx:rmi://localhost:5030/jndi/rmi://localhost:5031/server

Username and password are read from the sophora.properties file (sophora.indexer.jmx.registry.username and sophora.indexer.jmx.registry.password), if configured.
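
A JMX client could assemble this service URL and connect with the standard javax.management.remote API roughly as sketched below. The class and method names are hypothetical; only the URL pattern and the port values come from the examples above, and the connect method is not called in main because it needs a running indexer.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class IndexerJmxSketch {

    /** Fills the pattern shown above with a host and the two configured ports. */
    static String serviceUrl(String host, int jmxRegistryPort, int rmiRegistryPort) {
        return "service:jmx:rmi://" + host + ":" + jmxRegistryPort
                + "/jndi/rmi://" + host + ":" + rmiRegistryPort + "/server";
    }

    /** Builds the connection environment carrying the username and password. */
    static Map<String, Object> credentials(String username, String password) {
        Map<String, Object> environment = new HashMap<>();
        environment.put(JMXConnector.CREDENTIALS, new String[] {username, password});
        return environment;
    }

    /** Connects and returns the number of registered MBeans as a simple connectivity check. */
    static int countMBeans(String url, Map<String, Object> environment) throws IOException {
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url), environment)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            return connection.getMBeanCount();
        }
    }

    public static void main(String[] args) {
        System.out.println(serviceUrl("localhost", 5030, 5031));
    }
}
```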

The indexer provides the following operations:

MBean | Operation | Description
Indexer | updateDocument(uuid) | Indexes the document with the given UUID
UpdateQueue | getSize() | How many documents are enqueued
UpdateQueue | getSizeByPriority(priority) | How many documents are enqueued with the given priority
UpdateQueue | getSizePriorityMap() | Returns a map where the key is the priority and the value is the number of documents enqueued
UpdateQueue | removeAllByPriority(priority) | Removes all documents with the given priority from the queue

Last modified on 1/12/21

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
