Importer Migration Guide

Migrating from an older version of the Sophora Importer to the Sophora Importer 4.

With Sophora 4, we have renovated the Sophora Importer. It is now based on Spring Boot, which we use for all our tools developed in recent years. This brings a few new features, e.g. metrics export in Prometheus format, and makes future enhancements easier.

With this change, the configuration format of the Importer changes. Once you have converted your configuration to the new format, existing imports using the watch folder or the web service should work without modification.

Configuration Files

The new Importer uses the following configuration files:

  • application.yml: This is the main configuration file which replaces the sophora-importer.properties and the sophora-importer_instance-X.properties.
  • sophora-importer-<version>.conf (optional): Used to set the JVM properties (JAVA_OPTS) such as heap size.
  • loader.properties (optional): Used for adding the contents of a folder to the classpath of the Importer.
  • logback-spring.xml: Logging configuration.

Deployment

Spring Boot expects to find the configuration files next to the jar file, so we recommend putting the Importer application jar and the configuration files into the same directory. The name of the sophora-importer.conf file must exactly match the name of the jar file apart from the file extension, i.e., if the name of the jar file includes the version number, the name of the conf file must include the version number as well.

Recommended directory structure:

/cms
    /sophora-importer
        /additionalLibs
        /groovy
        /logs
        application.yml
        sophora-importer-4.0.0.conf
        sophora-importer-4.0.0.jar
        loader.properties
        logback-spring.xml

Our Maven repository contains two files suitable for deploying the Importer:

  • com.subshell.sophora.importer-<VERSION>-executable.jar
  • com.subshell.sophora.importer-<VERSION>-bin.tar.gz

The executable jar is basically the Importer application without any configuration files. It is suitable when deploying the Importer using Ansible, Puppet or similar configuration management tools. The bin.tar.gz contains the executable jar as well as sample configuration files. Use this for manual deployments to get started quickly.

Starting and Stopping

The Importer jar file is a Spring Boot executable jar. On Linux and macOS, the Importer can be started by running the executable jar as follows:

./sophora-importer-<version>.jar run

For running the Importer as a background daemon, use the options start and stop.

More options are documented in the Spring Boot documentation.

sophora.importer.additionalClasspath

With the Sophora Importer 3, it was possible to specify a directory to be added to the classpath of the Importer using the configuration property sophora.importer.additionalClasspath. This must now be done with the following entry in the file loader.properties:

# Loads resources (.class files etc.) from nested jar files in directories.
# Should contain comma-separated list of directories, archives, or directories within archives
# (e.g. lib,${HOME}/app/lib, earlier entries take precedence).
loader.path=additionalLibs

JAVA_OPTS

Options for the Java VM, such as heap size, can be set using the environment variable JAVA_OPTS or using an entry in the sophora-importer.conf file. For example:

JAVA_OPTS="-Xmx1G"

logback-spring.xml

The configuration file for logback must now be named logback-spring.xml.

Management-Endpoints / Actuators

The Importer exposes a few HTTP endpoints for management and metrics. These are available on the same HTTP port as the SOAP web service. Access to the management endpoints also uses the authentication settings configured for the web service. Notable endpoints are:

  • /actuator/health
  • /actuator/jolokia
  • /actuator/prometheus
  • /actuator/sophora-server

Note that the jolokia and health endpoints have moved from /health and /jolokia to /actuator/health and /actuator/jolokia respectively.

The Importer will now open the HTTP port even if the web service is disabled.
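
As a rough illustration of how the metrics endpoint might be consumed, the following Prometheus scrape configuration is a minimal sketch. It belongs to the Prometheus server, not to the Importer. It assumes that the Importer runs on localhost with server.port set to 8081, that basic authentication is enabled for the web service, and that the login admin / xxx from the example configuration applies. The job name is a made-up label.

```yaml
# prometheus.yml (fragment), a sketch under the assumptions stated above.
scrape_configs:
  - job_name: sophora-importer          # hypothetical label
    metrics_path: /actuator/prometheus  # endpoint listed above
    basic_auth:
      username: admin                   # assumed web service login
      password: xxx
    static_configs:
      - targets: ['localhost:8081']     # assumed host and server.port
```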

Configuration Format

The configuration files sophora-importer.properties and sophora-importer_instance-X.properties have been replaced with the single configuration file application.yml. See below for an example file and the mapping of old configuration options to the new format.

Instance Keys

The importer instances now each have a key (string) instead of an index number. In previous versions, SOAP imports used the instance index to select the instance to import into. In the new version, the instance is selected using the key. The instanceIndex in the SOAP XML can now be a string and refers to the instance key.

When converting an older configuration file to the new format, we recommend using the old instance index as the key, so that existing SOAP imports continue to work. For new instances configured in the future, a descriptive key is recommended.
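
A sketch of what this recommendation could look like for an installation that previously had two instance files, assuming the X in sophora-importer_instance-X.properties was the instance index. The instance names and the key press-releases are hypothetical. Quoting the numeric keys is only a precaution so that YAML treats them as strings.

```yaml
importer:
  instances:
    # Formerly sophora-importer_instance-1.properties. Existing SOAP imports
    # that send instanceIndex 1 continue to select this instance.
    - name: News Imports        # hypothetical name
      key: "1"
    # Formerly sophora-importer_instance-2.properties.
    - name: Image Imports       # hypothetical name
      key: "2"
    # Instance added after the migration, with a descriptive key.
    - name: Press Releases      # hypothetical name
      key: press-releases
```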

Binary Properties

With Sophora 3, configuration of binary property names was done using the file binary-property-names.xml. With Sophora 4, this configuration has moved into the application.yml:

  # Map of binary properties to mimetype properties.
  # Every binary property entry maps a binary property to the name of a corresponding mimetype. While importing, those
  # properties whose name matches one of the "binaryProperty" entries, are interpreted as binary data. At the same time,
  # there must be another property on the same level of the Sophora-XML that matches the value of the corresponding
  # "binaryProperty" entry.
  binaryProperties:
    'sophora-content:binarydata': 'sophora:mimetype'

Webservice Users

With Sophora 3, usernames and passwords for access to the web service were configured using the webservice_users.json file. With Sophora 4, this configuration has moved into the application.yml:

importer:
  webService:
    # Enables or disables the SOAP webservice interface.
    enabled: true
    # Enables basic authentication for the SOAP webservice interface.
    authenticationRequired: true
    # List of users for authentication.
    logins:
      admin: xxx

site-mappings.xml

This feature was removed.

Mapping of Configuration Options

Global Options

| Importer 3 (sophora-importer.properties) | Importer 4 (application.yml) |
|---|---|
| sophora.client.dataDir | sophora.client.misc.data-dir |
| sophora.contentmanager.connectRetries | sophora.client.server-connection.retries |
| sophora.contentmanager.connectRetryInterval | sophora.client.server-connection.retry-interval |
| sophora.contentmanager.documentCacheSize | sophora.client.cache.document-cache-elements-in-memory |
| sophora.contentmanager.migrationMode | sophora.client.server-connection.use-migration-mode |
| sophora.contentmanager.password | sophora.client.server-connection.password |
| sophora.contentmanager.proxyHost | sophora.client.proxy.host |
| sophora.contentmanager.proxyPassword | sophora.client.proxy.password |
| sophora.contentmanager.proxyPort | sophora.client.proxy.port |
| sophora.contentmanager.proxyUsername | sophora.client.proxy.username |
| sophora.contentmanager.publishedDocumentCacheSize | sophora.client.cache.published-document-cache-elements-in-memory |
| sophora.contentmanager.serviceUrl | sophora.client.server-connection.url |
| sophora.contentmanager.username | sophora.client.server-connection.username |
| sophora.importer.additionalClasspath | see Deployment |
| sophora.importer.cleanupFolders.cron | importer.cleanupFoldersCron |
| sophora.importer.cleanupFolders.failure.maxAge | importer.cleanupFoldersFailureMaxAge |
| sophora.importer.cleanupFolders.successful.maxAge | importer.cleanupFoldersSuccessfulMaxAge |
| sophora.importer.directory.failure | importer.folders.failure |
| sophora.importer.directory.feedpolling.data | importer.folders.feedPollingData |
| sophora.importer.directory.successful | importer.folders.success |
| sophora.importer.directory.xsl | importer.folders.xsl |
| sophora.importer.disableImport | importer.disabled |
| sophora.importer.feedpolling.active | importer.feedPollingEnabled |
| sophora.importer.fileaccess.basedir | importer.folders.fileAccessBase |
| sophora.importer.filenames.addTimestamp | importer.filenamesAddTimestamp |
| sophora.importer.httpSoTimeout | importer.httpSoTimeout |
| sophora.importer.jolokia.port | Not available anymore. |
| sophora.importer.keepTempfiles | importer.keepTempFiles |
| sophora.importer.maximumImportsToKeep | importer.maximumImportsToKeep |
| sophora.importer.minimumFailedImportsToKeep | importer.minimumFailedImportsToKeep |
| sophora.importer.name | importer.name |
| sophora.importer.preProcessing.className | importer.preprocessing.className |
| sophora.importer.preProcessing.scriptfolder | importer.preprocessing.scriptFolder |
| sophora.importer.proxy.host | importer.httpProxyHost |
| sophora.importer.proxy.password | importer.httpProxyPassword |
| sophora.importer.proxy.port | importer.httpProxyPort |
| sophora.importer.proxy.user | importer.httpProxyLogin |
| sophora.importer.spring.additionalBasePackages | importer.springAdditionalBasePackages |
| sophora.importer.transformationMode | importer.transform |
| sophora.importer.validate.documents | importer.validateDocuments |
| sophora.importer.watchfolder.checkInterval | importer.folders.watchCheckInterval |
| sophora.importer.watchfolder.includeSubfolder | importer.folders.watchRecursive |
| sophora.importer.watchfolder.regex.filesToImport | importer.folders.watchFilesRegex |
| sophora.importer.webservice.active | importer.webService.enabled |
| sophora.importer.webservice.authentication.active | importer.webService.authenticationRequired |
| sophora.importer.webservice.baseAddress | server.port, server.address |
| sophora.importer.webservice.defaultInstance | importer.webService.defaultInstance |
| sophora.importer.xslTransformerFactory | importer.xslTransformerFactory |
| sophora.jmx.password | importer.jmxPassword |
| sophora.jmx.username | importer.jmxLogin |
| sophora.rmi.registryPort | importer.rmiRegistryPort |
| sophora.rmi.servicePort | importer.rmiServicePort |

Instance Options

| Importer 3 (sophora-importer_instance-X.properties) | Importer 4 (application.yml) |
|---|---|
| sophora.importer.cleanupFolders.failure.maxAge | cleanupFoldersFailureMaxAge |
| sophora.importer.cleanupFolders.successful.maxAge | cleanupFoldersSuccessfulMaxAge |
| sophora.importer.defaultSite | defaultSite |
| sophora.importer.defaultStructureNode | defaultStructureNode |
| sophora.importer.directory.failure | folders.failure |
| sophora.importer.directory.feedpolling.data | folders.feedPollingData |
| sophora.importer.directory.successful | folders.success |
| sophora.importer.directory.temp | folders.temp |
| sophora.importer.directory.watchfolder | folders.watch |
| sophora.importer.directory.xsl | folders.xsl |
| sophora.importer.disableImport | disabled |
| sophora.importer.fileaccess.basedir | folders.fileAccessBase |
| sophora.importer.filenames.addTimestamp | filenamesAddTimestamp |
| sophora.importer.instance.name | name |
| sophora.importer.instance.webservice.enabled | webServiceEnabled |
| sophora.importer.keepTempfiles | keepTempFiles |
| sophora.importer.maximumImportsToKeep | maximumImportsToKeep |
| sophora.importer.minimumFailedImportsToKeep | minimumFailedImportsToKeep |
| sophora.importer.preProcessing.className | preprocessing.className |
| sophora.importer.preProcessing.scriptfolder | preprocessing.scriptFolder |
| sophora.importer.transformation.repairXml | Not available anymore. |
| sophora.importer.transformationMode | transform |
| sophora.importer.validate.documents | validateDocuments |
| sophora.importer.watchfolder.checkInterval | folders.watchCheckInterval |
| sophora.importer.watchfolder.includeSubfolder | folders.watchRecursive |
| sophora.importer.watchfolder.regex.filesToImport | folders.watchFilesRegex |
| sophora.importer.xslTransformerFactory | xslTransformerFactory |
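
To illustrate how the dotted names in the mapping translate into nested YAML, here is a sketch that converts a small, hypothetical Importer 3 configuration. The old property values are invented for the example and shown as comments. The sketch assumes that instance options are placed under an entry of importer.instances, as in the example file, and that the old instance index is reused as the key.

```yaml
# Importer 3, sophora-importer.properties (hypothetical values):
#   sophora.contentmanager.serviceUrl=http://localhost:1196
#   sophora.contentmanager.username=alice
#   sophora.importer.name=Demo-Importer
#   sophora.importer.webservice.active=true
sophora:
  client:
    server-connection:
      url: http://localhost:1196
      username: alice

importer:
  name: Demo-Importer
  webService:
    enabled: true

  # Importer 3, sophora-importer_instance-1.properties (hypothetical values):
  #   sophora.importer.instance.name=Common Imports
  #   sophora.importer.directory.watchfolder=/cms/data/import/incoming
  #   sophora.importer.defaultStructureNode=/import
  instances:
    - name: Common Imports
      key: "1"                  # assumed: old instance index reused as key
      defaultStructureNode: /import
      folders:
        watch: /cms/data/import/incoming
```

Each dot in a new option name corresponds to one level of nesting, so importer.webService.enabled becomes the enabled entry below webService below importer.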

Example application.yml

# Connection to the Sophora server.
# Note: There is one connection which is shared among all importer instances of the importer process.
sophora:
  client:
    server-connection:
      # The URL used to connect to the Sophora server (e.g. http://sophora.example.com:1196)
      url: http://localhost:1196
      # Username to access the Sophora server.
      username: alice
      # Password to access the Sophora server.
      password: secret
      # If a connection to the Sophora server is not possible, try again a few times. Default is 3.
      retries: 100
      # The time in seconds to wait between connection attempts.
      retry-interval: 10

    # cache:
      # The size of the document cache.
      # If you apply a transformation or a preprocessor that frequently accesses different existing documents from the
      # Sophora server, you may want to increase this cache. Consider the increased memory footprint and assign more
      # memory to the importer if necessary.
      # document-cache-elements-in-memory: 1000
      # The size of the published document cache.
      # Similar to document-cache-elements-in-memory, except that this value only considers the published versions of
      # documents. If you retrieve the published version of documents in a transformation or a preprocessor, this is the
      # value you may want to adjust.
      # published-document-cache-elements-in-memory: 100

    # Optional proxy configuration for HTTP connections to the Sophora server.
    # proxy:
    #   host: proxy.example.com
    #   port: 8080
    #   username: alice
    #   password: secret

importer:
  # The Importer's name to be used for JMX and logging.
  name: Demo-Importer

  # Disable import for test purposes (e.g. if you want to check the XSL transformation): The
  # importer won't try to send XML files to the content manager, if the value of this property
  # is set to true. Values: "true" or "false"; default value is "false".
  disabled: false

  # Proxy for accessing external content during the import.
  # A proxy configuration is needed if the importer operates behind a
  # proxy and the Import XML is passed to the webservice as a remote URL
  # or if the Import XML refers to binary files via http or https.
  # httpProxyHost:
  # httpProxyPort:
  # httpProxyLogin:
  # httpProxyPassword:

  # JMX
  rmiServicePort: 5000
  rmiRegistryPort: 5001
  # Username and password for the JMX interface [Optional]
  jmxLogin: importerjmx
  jmxPassword: password

  folders:
    # Interval (in milliseconds) to check the import directory (watch folder); e.g. 10000.
    watchCheckInterval: 1000
    # If set to true all subfolders (and their subfolders etc.) of the watch folder are included when watching for
    # incoming Sophora-XML files. Make sure that no system folders (success or failure) are configured as subfolders of
    # the watch folder if this parameter is set to true. Default value is false. The importer instance executes the
    # individual document imports in lexicographical order based on the relative paths of the documents; i.e. an
    # incoming file subfolder-A/import.xml is handled before a file subfolder-B/import.xml.
    watchRecursive: true

  webService:
    # Enables or disables the SOAP webservice interface.
    enabled: true
    # Enables basic authentication for the SOAP webservice interface.
    authenticationRequired: true
    # Key of the instance to use for SOAP requests which don't specify an instance.
    defaultInstance: common
    # List of users for authentication.
    logins:
      admin: xxx

  # Set to true for polling feeds configured in the DeskClient. Default is false.
  feedPollingEnabled: true

  # Determines whether to keep the temporary files after the Importer finishes. If the value is true, these files are
  # moved to the success or failure directory together with the XML files.
  keepTempFiles: true
  # Determines whether a timestamp is attached to the names of the files that are imported and to the names of the
  # temporary files.
  filenamesAddTimestamp: false

  # A cron expression that specifies when the "success" and "failure" folders will be cleaned up. The expression uses
  # the format of the Quartz CronTrigger.
  cleanupFoldersCron: "0 0 9 ? * * *"
  # When cleaning up the "success" folder of the instance, files in the folder must be at least this many days old to be
  # deleted. Set to 0 to disable deletion for this instance / folder.
  cleanupFoldersSuccessfulMaxAge: 90
  # When cleaning up the "failure" folder of the instance, files in the folder must be at least this many days old to be
  # deleted. Set to 0 to disable deletion for this instance / folder.
  cleanupFoldersFailureMaxAge: 90

  # Configuration of the importer instances
  instances:
    - name: Common Imports
      # The key is used to reference this instance in SOAP and feed imports.
      key: common
      # Defines the XSL transformation mode. The following values are valid:
      # 'transformIfNotSophoraXml' (default), 'forceTransform', 'skipTransform'
      transform: skipTransform
      folders:
        # The import directory that the importer instance monitors.
        watch: /cms/data/import/incoming
        # This regular expression determines which files in the watch folder are processed by the importer.
        watchFilesRegex: "(?i).+[.](xml)"
        # Directory to save temporary files in.
        temp: /cms/data/import/temp
        # Target directory to move the XML files to, if the import process finished successfully.
        # This property supports patterns within the given path in the form of ${pattern}. Supported patterns:
        # ${date;<DateFormat>} - the date of the import in the given format. For supported date formats, see https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html
        # ${<xslParameter>} - XSL parameter keys defined in the XML-Feeds
        success: /cms/data/import/success
        # Target directory to move the XML files to, if the import process failed.
        # This property supports patterns within the given path. For the supported patterns, see above.
        failure: /cms/data/import/failure
        # The directory where the XSL files are located.
        # This property may be omitted if the 'transform' property has the value 'skipTransform'
        xsl: /cms/sophora-importer/xsl/
      # The classname of the XSL transformer factory, which is used for XSL transformations.
      # The default value is 'org.apache.xalan.xsltc.trax.TransformerFactoryImpl'.
      # xslTransformerFactory: org.apache.xalan.xsltc.trax.TransformerFactoryImpl
      # Enables or disables the web service for this instance.
      webServiceEnabled: true
      # The site to import the documents to. This parameter is considered, if
      # - the XML neither contains an empty <site> nor empty <structureNode> tag
      # - the import operation is not an update of an existing document
      # defaultSite: demosite
      # The structure node to import the documents to. This parameter is considered, if
      # - the <structureNode> element in the XML is empty
      # - the import operation is not an update of an existing document
      defaultStructureNode: /import
      # Optional: Preprocess files before the import using a Java or Groovy class.
      # The result of the preprocessor must be either valid Sophora XML or XML to be transformed using XSL.
      # preprocessing:
        # The class which implements the IPreProcession interface.
        # className:
        # Folder containing groovy preprocessing scripts. Can be left undefined if the preprocessor class is on the
        # classpath.
        # scriptFolder: /cms/sophora-importer/groovy

server:
  # HTTP port of the web server for the SOAP web service and management endpoints (e.g. health).
  port: 8081

Last modified on 10/16/20

The content of this page is licensed under the CC BY 4.0 License. Code samples are licensed under the MIT License.
