
Pipelines API - Developer Guide


Configurations available for pipeline developers

As mentioned in the Develop pipelines section, the end product of the application development process is a JAR file that can be deployed to the pipeline service and used to process data. Several configuration options are available during application development, including setting up system properties and environment variables and creating configuration files such as credentials.properties, application.properties, pipeline-config.conf, and pipeline-job.conf.

The following sections examine these configuration options.

Use of the runtime environment

An essential part of the pipeline development process is the selection of the runtime environment. The HERE platform provides two types of runtime environments - batch and stream. Different versions of the stream and batch runtime environments are based on the different versions of the Apache Flink and Apache Spark frameworks, with a number of additional libraries included.
For a list of libraries included in the latest versions of the runtime environments, see the following articles:

Note

We recommend using the latest available versions of the runtime environments and avoiding deprecated versions.

To ensure that library versions are aligned during the pipeline development, we recommend using the sdk-batch-bom_2.12.pom and sdk-stream-bom_2.12.pom BOM files, depending on the chosen runtime environment.
For more information on these BOM files, please see this article.

credentials.properties

The credentials.properties file is used to manage access to services and resources provided by the HERE platform. You can download this file from the platform portal when you create an access key for an application. For more information, please see the Credentials setup.

Local development

For local development, you need to copy the credentials.properties file to the .here folder in your home directory. For more information, see the Set up your credentials user guide.

Platform development

The credentials.properties file is not used during the platform pipeline development. Instead, the HERE account token is provided, which is generated based on the application or user credentials that were selected during the pipeline version activation:

(Screenshot: Select runtime credentials)

This token is available to the Data Client library which resolves it and refreshes it before it expires. This token has the same access level as the application or user selected when the pipeline version was activated.
For more information, see the Identity and Access Management - Developer Guide.

Logging configuration

For troubleshooting and other maintenance purposes, your data processing pipelines may need to track various custom events. To control how events are logged and how logs are processed, you need to provide a logging configuration for your pipeline. The configuration details depend on whether you develop your pipelines locally or via the HERE platform.

Note

You are charged for the volume of logs written during the execution of the pipeline.
This usage is reported under the Log Search IO chargeable billing record.

Local development

For local development, add logging to your application code through the slf4j logging API. You can provide any slf4j binding, although we recommend logback.
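For example, a minimal logging sketch (the class name MyPipelineApplication and the log messages are illustrative):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MyPipelineApplication {

        // The logger is obtained through the slf4j API; the actual binding
        // (for example, logback) is picked up from the process classpath at runtime.
        private static final Logger LOGGER = LoggerFactory.getLogger(MyPipelineApplication.class);

        public static void main(String[] args) {
            LOGGER.info("Pipeline application started");
            try {
                // ... application logic ...
            } catch (Exception e) {
                LOGGER.error("Processing failed", e);
            }
        }
    }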

To specify a logging configuration, you can use external configuration files in XML, Java Properties, or any other format supported by your binding.
Whichever option you choose, make sure that the configuration files you've added are not packaged into the application's Fat JAR file: otherwise, multiple logging configuration files are present in the process classpath at the same time, which can lead to unexpected application behavior and the loss of logs.

Another requirement is that no logging JAR files, such as slf4j-api or slf4j-log4j12, should be bundled into the application JAR artifact. For example, slf4j-api should be a provided dependency, as defined in the BOM, so that it is not packaged into the application's Fat JAR file.

Platform development

Files related to the logging configuration are not used during the platform pipeline development - the platform itself is responsible for this. The amount of information reported in the logs depends on the logging level you select for each pipeline version when it is executed. The Debug, Info, Warn, and Error logging levels are supported, with Warn being used by default.

Use the Logging configuration menu on the pipeline version page to update it:

(Screenshot: Logging configuration)

For more information about the basics of pipeline logging, changing and retrieving the pipeline version logging level, etc., see the Pipeline logging section.

Runtime parameters

During pipeline development, certain parameters can be specified at runtime to configure the pipeline runtime environment. There are several ways to use them for your pipeline. All of these options are described below.

Local development

For local development, you can use the application.properties file to describe the runtime parameters in the Java Properties format. You need to include this file in the process classpath, or specify its location on the development machine using the config.file system property:

mvn compile exec:java -D"exec.mainClass"="YourApplicationMainClass" -Dconfig.file=PATH/TO/application.properties
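Because config.file is the standard override mechanism of the Typesafe Config library, one way for the application to read these parameters itself is sketched below. This is only one possible approach, and the myexample.* keys are the illustrative ones used in the platform example that follows:

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    public final class RuntimeParameters {

        public static void main(String[] args) {
            // ConfigFactory.load() reads application.properties (or application.conf)
            // from the classpath and honors the -Dconfig.file override automatically.
            Config config = ConfigFactory.load();

            // Illustrative keys; replace them with the parameters your pipeline expects.
            int threads = config.getInt("myexample.threads");
            String language = config.getString("myexample.language");

            System.out.println("threads = " + threads + ", language = " + language);
        }
    }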

Platform development

For the platform development, this file is constructed from the value of the pipeline template's defaultRuntimeConfig property, overridden on a key-by-key basis by the value of the pipeline version's customRuntimeConfig property.
Note that the pipeline template's defaultRuntimeConfig property can only be specified if the template was created using the OLP CLI. If only the platform portal is used for pipeline deployment, the values entered in the runtime parameters form are used as the contents of the application.properties file.

The example below demonstrates how the defaultRuntimeConfig and customRuntimeConfig properties interact during the construction of application.properties:

    # Value of Pipeline Template's "defaultRuntimeConfig" property
    "myexample.threads = 3\nmyexample.language = \"en_US\"\nmyexample  .processing.window=300\nmyexample.processing.mode=stateless"

    # Value of Pipeline Version’s "customRuntimeConfig" property
    "myexample.threads=5\n\n myexample.processing.mode=    \"stateful\"\nmyexample.processing.filterInvalid = true"

    # The resulting application.properties file on the pipeline classpath
    # (for the given values of "defaultRuntimeConfig" and "customRuntimeConfig")
    myexample.threads = 5
    myexample.language = "en_US"
    myexample.processing.window = 300
    myexample.processing.mode = "stateful"
    myexample.processing.filterInvalid = true

Note

For stream applications, if the JAR contains application.properties, then it will take precedence in the classpath over the application.properties provided by the runtime.

pipeline-config.conf

The pipeline-config.conf file is a configuration file that specifies the input catalogs, the output catalog, and billing tags.
An example of pipeline-config.conf is shown below:

    pipeline.config {
         billing-tag = "first-billing-tag,second-billing-tag"
         output-catalog { hrn = "hrn:here:data::realm:example-output" }
         input-catalogs {
             test-input-1 { hrn = "hrn:here:data::realm:example1" }
             test-input-2 { hrn = "hrn:here:data::realm:example2" }
             test-input-3 { hrn = "hrn:here:data::realm:example3" }
         }
     }

Where:

  • billing-tag specifies cost allocation tags used to group billing records. If multiple tags are used, they should be separated by a comma (,).
  • output-catalog specifies the HRN that identifies the output catalog of the pipeline.
  • input-catalogs specifies one or more input catalogs for the pipeline. For each input catalog, its fixed identifier is provided along with the HRN of the actual catalog.

Note

The format of the file is HOCON, a superset of JSON and Java properties. It can be parsed with Lightbend's open-source Typesafe Config library.

Local development

For local development, you can include the pipeline-config.conf file in the process classpath or specify its location on the development machine using the pipeline-config.file system property:

mvn compile exec:java -D"exec.mainClass"="YourApplicationMainClass" -Dpipeline-config.file=PATH/TO/pipeline-config.conf

Whichever option you choose, make sure that the pipeline-config.conf file you've added is not included in the application's Fat JAR file, as explained in the next chapter.

If the data processing application is implemented using the Data Processing Library, the parsing is handled automatically by the pipeline-runner package.
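If your application does not use the Data Processing Library, a minimal sketch for parsing the file yourself with the Typesafe Config library could look like the following (the catalog identifiers match the example above):

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    import java.io.File;

    public final class PipelineConfigReader {

        public static void main(String[] args) {
            // Read the file from -Dpipeline-config.file if set, otherwise from the classpath.
            String path = System.getProperty("pipeline-config.file");
            Config root = path != null
                    ? ConfigFactory.parseFile(new File(path)).resolve()
                    : ConfigFactory.load("pipeline-config");

            Config pipelineConfig = root.getConfig("pipeline.config");

            String billingTags = pipelineConfig.getString("billing-tag");
            String outputCatalog = pipelineConfig.getString("output-catalog.hrn");
            // test-input-1 is one of the fixed identifiers from the example above.
            String firstInput = pipelineConfig.getString("input-catalogs.test-input-1.hrn");

            System.out.println(billingTags + " / " + outputCatalog + " / " + firstInput);
        }
    }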

Platform development

The pipeline-config.conf file is not used during the platform pipeline development. Instead, it is generated by the pipeline service based on the billing tags and the input and output catalogs that are specified during the pipeline template and pipeline version creation. For more information about these properties, please see the following chapters in the Deploy a pipeline via the web portal section:

During platform development, we strongly recommend against using Fat JAR files that contain pipeline-config.conf files.
This is considered bad practice because:

  • Pipeline implementations may bind to and distinguish between multiple input catalogs using fixed identifiers. The fixed identifiers are defined in a pipeline template. An HRN is defined for each pipeline version so that the same pipeline template may be reused in multiple setups. If the pipeline-config.conf file is included in the template's Fat JAR, such a template may not be reusable for different pipeline versions, because the HRNs of the catalogs are hard-coded in the config file at the pipeline template level.
  • It can lead to unexpected application behaviour because two pipeline-config.conf files (one generated by the pipeline service and another included in the template's Fat JAR) are available in the process classpath at the same time.

pipeline-job.conf

Batch pipelines perform a specific job and then terminate. Stream pipelines don't perform a specific, time-constrained job, but run continuously. For batch pipelines, you may be interested in customizing the execution mode of the application, so that it only runs when certain conditions are met.

Use the pipeline-job.conf file to do this:

    pipeline.job.catalog-versions {
        output-catalog { base-version = 42 }
        input-catalogs {
            test-input-1 {
                processing-type = "no_changes"
                version = 19
            }
            test-input-2 {
                processing-type = "changes"
                since-version = 70
                version = 75
            }
            test-input-3 {
                processing-type = "reprocess"
                version = 314159
            }
        }
    }

Where:

  • base-version of output-catalog indicates the already-existing version of the catalog on top of which new data should be published.

  • input-catalogs contain, for each input, the version of that input that is the most up-to-date. This is the version that should be processed. In addition, information that specifies what has changed since the last time the job ran is also included. Catalogs can be distinguished via the same identifiers present in the pipeline configuration file.

  • processing-type describes what has changed in each input since the last successful run. The value can be no_changes, changes, or reprocess, as illustrated in the sketch after this list:

    • no_changes indicates that that input catalog has not changed since the last run.
    • changes indicates that that input catalog has changed. A second parameter, since-version, is included to indicate which version of that catalog was processed during the last run.
    • reprocess does not specify whether that input catalog has changed or not. The pipeline is requested to reprocess that whole catalog instead of attempting any kind of incremental processing. This may be due to an explicit user request or to a system condition, such as the first time a pipeline runs.
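A minimal sketch of how an application might consume these values, assuming it parses the file itself with the Typesafe Config library (the pipeline-job.file system property used here is described in the next chapter):

    import com.typesafe.config.Config;
    import com.typesafe.config.ConfigFactory;

    import java.io.File;

    public final class PipelineJobReader {

        public static void main(String[] args) {
            // Read the file from -Dpipeline-job.file if set, otherwise from the classpath.
            String path = System.getProperty("pipeline-job.file");
            Config root = path != null
                    ? ConfigFactory.parseFile(new File(path)).resolve()
                    : ConfigFactory.load("pipeline-job");

            Config catalogVersions = root.getConfig("pipeline.job.catalog-versions");
            long baseVersion = catalogVersions.getLong("output-catalog.base-version");
            System.out.println("Publishing on top of output catalog version " + baseVersion);

            Config inputs = catalogVersions.getConfig("input-catalogs");
            for (String id : inputs.root().keySet()) {
                Config input = inputs.getConfig(id);
                String processingType = input.getString("processing-type");
                long version = input.getLong("version");
                switch (processingType) {
                    case "no_changes":
                        System.out.println(id + ": unchanged at version " + version);
                        break;
                    case "changes":
                        long since = input.getLong("since-version");
                        System.out.println(id + ": process changes from " + since + " to " + version);
                        break;
                    case "reprocess":
                        System.out.println(id + ": reprocess the whole catalog at version " + version);
                        break;
                    default:
                        throw new IllegalArgumentException("Unknown processing-type: " + processingType);
                }
            }
        }
    }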

Local development

For local development, you can include the pipeline-job.conf file in the process classpath or specify its location on the development machine using the pipeline-job.file system property:

mvn compile exec:java -D"exec.mainClass"="YourApplicationMainClass" -Dpipeline-job.file=PATH/TO/pipeline-job.conf

Whichever option you choose, make sure that the pipeline-job.conf file you've added is not included in the application's Fat JAR file, as explained in the next chapter.

Platform development

The pipeline-job.conf file is not used during the platform pipeline development. Instead, it is generated based on the properties selected during the pipeline version activation, and then added to the process classpath.

Two activation modes are available. The first is the Run Now mode, which forces the pipeline version to run immediately without waiting for the input data to change:

(Screenshot: Activation options - Run now)

When this mode is selected, the contents of the generated pipeline-job.conf file will look like this:

    pipeline.job.catalog-versions {
       output-catalog { base-version = 1 }
       input-catalogs {
          input {
             processing-type = "reprocess"
             version = 2
          }
       }
    }

We can see that the content of the generated file is fully aligned with the values specified during the pipeline version activation, including the input catalog key, its version, and so on.

The other activation mode is Schedule. In this mode, the pipeline version only runs when the input data changes:

(Screenshot: Activation options - Schedule)

As you can see from the screenshot above, the web portal does not allow you to specify which catalog version you want to depend on. It is determined automatically by the Pipelines API - when the input data changes, a new version of the catalog is created, then the input catalogs are validated and an appropriate version is selected. Based on this information, the pipeline-job.conf file is generated:

    pipeline.job.catalog-versions {
      output-catalog { base-version = 1 }
       input-catalogs {
        input {
          processing-type = "changes"
          since-version = 1
          version = 2
        }
      }
    }

For more information about the batch pipeline activation options, see this article.

During platform development, we strongly recommend against using Fat JAR files that contain pipeline-job.conf files.
This is considered bad practice because:

  • If the pipeline-job.conf file is included in the template's Fat JAR, this may prevent the activation mode from being customized for different pipeline versions, because the processing types and catalog versions are hard-coded in the config file at the pipeline template level.
  • It can lead to unexpected application behaviour because two pipeline-job.conf files (one generated by the pipeline service and another included in the template's Fat JAR) are available in the process classpath at the same time.

System properties

The following JVM system properties are set by the Pipeline API when a pipeline is submitted as a new job to provide integration with other HERE services.
They can be obtained using the System.getProperties() method or its equivalent, as shown in the sketch after the following list:

  • olp.pipeline.id: Identifier of the pipeline, as defined in the Pipeline API.
  • olp.pipeline.version.id: Identifier of the pipeline version, as defined in the Pipeline API.
  • olp.deployment.id: Identifier of the job, as defined in the Pipeline API.
  • olp.realm: The customer realm.
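A minimal sketch of reading these properties from application code; the fallback values are only there so the snippet also runs locally, where the properties are not set:

    public final class PlatformContext {

        public static void main(String[] args) {
            // Each property is set by the Pipeline API when the job is submitted;
            // during local runs they are absent, so provide a fallback for testing.
            String pipelineId = System.getProperty("olp.pipeline.id", "local-run");
            String pipelineVersionId = System.getProperty("olp.pipeline.version.id", "local-run");
            String deploymentId = System.getProperty("olp.deployment.id", "local-run");
            String realm = System.getProperty("olp.realm", "unknown");

            System.out.println("pipeline=" + pipelineId
                    + " version=" + pipelineVersionId
                    + " job=" + deploymentId
                    + " realm=" + realm);
        }
    }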

Below are additional property paths used by the platform:

  • env.api.lookup.host
  • akka.*
  • here.platform.*
  • com.here.*

In addition to these, other properties are set by the system to configure the runtime environment. These include Spark or Flink configuration parameters associated with the pipeline version configuration that you have selected. These configuration parameters are specific to the chosen framework and its version. Because these configuration parameters may change, they are considered implementation-specific and are left to your determination.

System properties specified in this section are visible from the main user process only. These system properties are not necessarily replicated to the JVMs that run in worker nodes of the cluster.

Configuration for third-party services

Connecting your application to third-party services can offer several advantages and functionalities that might be challenging or impractical to implement independently. This section presents the method of connecting a pipeline application to a third-party service using the credentials for that service and the platform's secrets mechanism.

Local development

For example, suppose you have developed an application that uses an AWS credentials file and lists all available S3 buckets:

    // Build an S3 client with the AWS SDK for Java v2; the credentials are resolved
    // by the SDK's default provider chain (for example, from the file referenced by
    // the AWS_SHARED_CREDENTIALS_FILE environment variable).
    S3Client s3client = S3Client.builder()
            .region(Region.US_EAST_1)
            .httpClient(UrlConnectionHttpClient.builder().build())
            .build();

    // List all buckets visible to the supplied credentials and log their names.
    // LOGGER is an slf4j logger obtained via LoggerFactory.getLogger(...).
    List<Bucket> buckets = s3client.listBuckets().buckets();
    for (Bucket bucket : buckets) {
        LOGGER.info(bucket.name());
    }

The following dependencies are used for this application:

    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>s3</artifactId>
        <version>2.20.37</version>
    </dependency>

    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>url-connection-client</artifactId>
        <version>2.20.37</version>
    </dependency>

To run this application successfully and to allow interaction with S3 buckets, the location of the AWS credentials file must be provided to the pipeline application via the AWS_SHARED_CREDENTIALS_FILE environment variable:

AWS_SHARED_CREDENTIALS_FILE=PATH/TO/AWS_CREDENTIALS_FILE

Platform development

As mentioned above, during the platform pipeline development, you can use the platform's secrets mechanism to securely upload and manage third-party credentials that are used to connect your pipeline to third-party services. The platform supports two types of third-party credentials - custom and AWS.

Credentials of the custom type are used to connect pipeline applications to a variety of web services that are provided by different vendors. The format of such a credentials file is defined by the vendor and may vary from one third-party service to another.

Credentials of the AWS type are used to connect to and use various Amazon web services - for example, to interact with S3 buckets. For more information about AWS credentials, their format, etc., please see the AWS SDKs and Tools User Documentation.

Note

The AWS credentials must be in the form of AWS Key-Secret (AWS IAM roles are not supported at this time). Contact your AWS administrator or manager to create it and set up the access. To reduce the security risk, it is recommended to grant minimal privileges to this new identity.

To run the application from the chapter above as a platform pipeline, follow these steps:

  1. Create all the necessary resources such as pipeline, pipeline template, pipeline version, etc.
  2. Use the olp secret create command with the --grant-read-to parameter to create a new platform secret for the same AWS credentials file that was used previously. This grants read permission on the secret to the HERE application or user whose HRN is specified by the --grant-read-to parameter.
  3. During pipeline version activation, select the appropriate HERE application or user from the SELECT RUNTIME CREDENTIALS drop-down menu:

(Screenshot: Select the appropriate HERE application or user)

Once the pipeline is activated, the AWS SDK reads the credentials from the file whose location is specified by the AWS_SHARED_CREDENTIALS_FILE variable, which is set by the platform.

If custom secrets have been used, the credentials are stored as a credentials file in the /dev/shm/identity/.here/ directory. Note that this file may not be read automatically by your pipeline application - in that case, you need to read it programmatically, for example as sketched below.
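A minimal sketch of reading such a file; the file name credentials is an assumption, so adjust the path to match the secret you actually created:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public final class CustomSecretReader {

        // Directory where the platform places custom secrets for the pipeline.
        private static final Path SECRET_DIR = Paths.get("/dev/shm/identity/.here");

        public static String readSecret() throws IOException {
            // The file name below is an assumption; use the name of the secret you uploaded.
            Path secretFile = SECRET_DIR.resolve("credentials");
            return new String(Files.readAllBytes(secretFile), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            String thirdPartyCredentials = readSecret();
            // Pass the credentials on to the third-party client of your choice.
            System.out.println("Loaded " + thirdPartyCredentials.length() + " characters of credentials");
        }
    }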

Third-party credentials are automatically refreshed every 12 hours to maintain pipeline functionality. If the credentials change and need to be consumed immediately, the pipeline version has to be manually reactivated.
