Deploy a pipeline via the web portal
- Last Updated: Sep 18, 2024
As described on the Pipeline lifecycle page, deploying a pipeline is required to process any data. This section of the Developer's Guide focuses on the process of setting up the pipeline to run. The object of the exercise is to load an application JAR file onto the pipeline service and define the runtime parameters that are needed to create a pipeline version, which is the operational form of the pipeline. While you can do this from the command line using the OLP CLI, the platform portal provides a quick and easy way to deploy and run a pipeline.
Let's start with creating a pipeline.
Add a pipeline
First, you need HERE platform credentials to use the HERE platform portal. After logging in, you will see the home page shown in Figure 1.
You can follow many functional paths from the home page. Usually, you will click the application icon (the three short lines in the top-right corner of the home page) to open the Launcher.
The Launcher is a menu of the major functional areas of the HERE platform (see Figure 2). Next, click the Pipelines menu item under the Build section in the Launcher.
If you are new to the HERE platform, there will be no deployed pipelines listed on this page, as shown in Figure 3 below.
To start creating a pipeline, click Create new pipeline to open the pipeline creation page shown in Figure 4. You must specify the pipeline's name and description, and identify the group or project that will share access to the pipeline.
Note
When creating pipelines, it is recommended to use projects.
The notification email is used when the HERE platform needs to send you a message. Only one email address is allowed, so if more than one person needs to be notified, use a group email address.
Fields marked with an asterisk (*) are required; all other fields are optional. Do not attempt to finalize the creation process until all required information is complete.
Note
Please consider the following limitations for pipeline parameters:
- The name cannot be longer than 64 characters.
- The description cannot be longer than 512 characters.
- The contact email cannot be longer than 256 characters.
When you complete this step, the pipeline service assigns a unique pipeline ID (UUID) to identify this new pipeline entity.
Note
The HERE platform typically assigns IDs (such as a pipeline ID and a pipeline version ID) using a Universally Unique Identifier, abbreviated as UUID. The term globally unique identifier (GUID) is another name for this ID. HERE platform documentation commonly writes the ID name with (UUID) appended to it, which simply means that the ID uses a UUID as its unique identifier. To learn more about UUIDs, see the article Universally unique identifier.
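As a quick illustration of the format, a version-4 UUID can be generated with Python's standard library (illustrative only; the platform generates the actual IDs on the server side):

```python
# Generate a version-4 UUID, the same format as the IDs the pipeline
# service assigns (e.g. a pipeline ID or pipeline version ID).
import uuid

pipeline_id = uuid.uuid4()
text = str(pipeline_id)
print(text)  # e.g. 550e8400-e29b-41d4-a716-446655440000

# The canonical string form is 36 characters: five hyphen-separated
# groups of hex digits (8-4-4-4-12).
assert len(text) == 36 and text.count("-") == 4
```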
Click Next to proceed to the next page and define the pipeline version's runtime parameters using the pipeline template.
Configure pipeline template and pipeline version
Next, you must create the pipeline template and pipeline version. Most of this process deals with runtime information to be applied to the application JAR file by the pipeline service, the combination of which is uniquely identified as an executable pipeline version. This template information is applied to the pipeline instance you have created by its pipeline ID. You must also specify at least one input catalog layer as a data source and one output catalog layer as a data sink.
Let's take a closer look at this process.
Select a runtime environment
The first option you need to specify is the runtime environment on which your application will run.
Different versions of the stream and batch runtime environments are based on different versions of the Apache Flink and Apache Spark frameworks, with a set of additional libraries included. For a list of libraries included in the latest versions of the runtime environment, see the following articles:
Note
It is recommended that you use the latest versions of runtime environments available and avoid using deprecated versions.
Pipeline template selection
The next step is to select a template for your data processing pipeline. If there are no existing pipeline templates to use, you will need to create a new one in order to proceed. Start by clicking Upload JAR file.
If there are existing templates created within this group or project, the layout of this page will change to allow you to select them, as shown in the image below:
In this case, reusing an existing template also reuses the pipeline JAR file and the runtime configuration specified by that template. Note that the JAR file associated with the selected template is displayed so that you can confirm that it is the correct JAR file.
Entry point class name
Once the pipeline template has been selected, the entry point class name property must be specified. This is specific to the application JAR file identified in the template and must be specified exactly (class names are case-sensitive):
Multi-region setup
To ensure that a pipeline continues to operate even if the platform's primary region fails, enable the Multi-region option when creating the pipeline version so that the pipeline automatically switches to a secondary region if the primary region fails:
This is achieved by periodically transferring the state of the pipeline from the primary to the secondary region. This saved state is used to restore the processing in the secondary region if the primary region fails.
Caution
Additional cost for multi-region setup
Enabling this feature introduces additional cost for the extra resources.
For more information on this option, see Enable multi-region setup for pipelines.
The catalogs also need to be configured to be multi-regional.
Input and output catalogs
Now, let's specify the pipeline's input catalog. You may specify more than one catalog, but only if the application is designed to handle multiple input catalogs. Select the catalog from the drop-down list of available catalogs:
Caution
Input catalog ID
Be aware that the input catalog identifier field requires an input catalog ID value. The value used here must be the same input catalog ID value used for the corresponding input catalog in the local pipeline-conf.conf file. If the input catalog ID values do not match in both places, the resulting runtime error may not correctly indicate the source of the problem.
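For reference, a local pipeline configuration file maps input catalog IDs to catalog HRNs roughly along these lines (a hedged sketch: the catalog ID `sensor-data` and the HRNs are hypothetical placeholders; the key must match the input catalog identifier entered in the portal):

```hocon
pipeline.config {
  output-catalog { hrn = "hrn:here:data::myrealm:my-output-catalog" }
  input-catalogs {
    // The key ("sensor-data") is the input catalog ID that must match
    // the value entered in the portal's input catalog identifier field.
    sensor-data { hrn = "hrn:here:data::myrealm:my-input-catalog" }
  }
}
```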
After the input catalog is selected, continue by selecting the output catalog from the drop-down list. Note that you may specify only one output catalog.
Note
Use same catalog for input and output
A stream pipeline version can use the same catalog for input and output. This does not apply to batch pipeline versions, which must use a different output catalog.
Cluster configuration
Now it's time to configure your cluster to match the specific requirements of your pipeline. The following configuration options are available:
- Resource profiles - a predefined ratio of hardware assets (CPU count, memory (GB), and disk space (GB)) allocated per resource unit for pipeline execution.
- Garbage collector profiles - the garbage collection strategy used for pipeline execution.
- Additional non-heap memory - the amount of additional non-heap memory per resource unit available for pipeline execution.

Let's take a closer look at each of them, starting with Resource profiles.
Resource profiles
When configuring a pipeline template, you need to choose the amount of resources to allocate to a supervisor and workers. Resources are allocated as units, each containing a pre-defined number of CPUs, RAM size, and disk space.
The amount of resources per unit is defined by the Resource profile that you can select for each pipeline template.
The HERE platform supports multiple resource profiles with different combinations of CPU count, RAM size, and disk space per unit to ensure the most effective resource consumption. There are memory-intensive, CPU-intensive, and default resource profiles; the memory-intensive and CPU-intensive profiles contain, respectively, more or less RAM than the default profile.
Depending on the data processing type and the tasks that the pipeline is designed to perform, you can choose a resource profile that is the most optimal for your project goals.
To specify the amount of resources to be allocated to a supervisor and workers, use the following fields:
- For Spark clusters:
  - Spark Driver & Executor Profiles: Opt for a pre-configured profile or customize the Spark Driver or Executor settings to suit your pipeline's needs, ensuring efficient job orchestration.
  - Spark Driver & Executor Sizes: Define the CPU, RAM, and disk space for Spark, ensuring it has the resources to manage application execution.
- For Flink clusters:
  - Flink JobManager & TaskManager Profiles: Select from standard profiles for the JobManager and TaskManagers or customize settings to meet the demands of your task execution.
  - Flink JobManager & TaskManager Sizes: Assign CPU, RAM, and disk space to each JobManager and TaskManager, optimizing for task processing and state management capabilities.
Caution
Limited resource requirements
HERE platform limits are still applicable. A pipeline can only consume a maximum of 200 CPUs and 1.4 TB RAM. In other words:
Size of supervisor + (Size of each worker * Number of workers) ≤ 200 CPU AND 1.4 TB RAM
If you experience any resource provisioning issues within these limits, you should contact HERE Support/Services.
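This constraint can be checked before deploying; the following is a minimal Python sketch (the helper name and the example cluster sizes are hypothetical, and RAM is expressed in GB with 1.4 TB taken as 1400 GB):

```python
# Sanity-check a planned cluster configuration against the platform-wide
# pipeline limits: supervisor + (worker size * number of workers) must
# stay within 200 CPUs and 1.4 TB (1400 GB) of RAM.
MAX_CPUS = 200
MAX_RAM_GB = 1400  # 1.4 TB

def within_pipeline_limits(supervisor_cpu, supervisor_ram_gb,
                           worker_cpu, worker_ram_gb, workers):
    total_cpu = supervisor_cpu + worker_cpu * workers
    total_ram = supervisor_ram_gb + worker_ram_gb * workers
    return total_cpu <= MAX_CPUS and total_ram <= MAX_RAM_GB

# A supervisor with 4 CPUs / 16 GB plus 20 workers of 8 CPUs / 32 GB:
# 4 + 8*20 = 164 CPUs and 16 + 32*20 = 656 GB, so within limits.
print(within_pipeline_limits(4, 16, 8, 32, 20))  # True
```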
Additional Non-Heap Memory
If you're working with batch pipelines, it's possible to improve their performance by increasing the amount of non-heap memory. The overall (heap + non-heap) memory is sized sufficiently to support regular JVM activities such as garbage collection. You can increase the non-heap memory size by up to 5% (0.05) of the heap size. The increased use of memory is billed as part of Compute RAM.
If a value is not specified:
- For a pipeline template: The system defaults this to 0% (0.00).
- For a pipeline version: It inherits the overhead memory from its associated pipeline template.
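As a quick worked example of the 5% cap (the heap size below is hypothetical):

```python
# The additional non-heap memory is expressed as a fraction of the heap
# size, capped at 5% (0.05). For a hypothetical 32 GB heap:
heap_gb = 32
max_fraction = 0.05

extra_non_heap_gb = heap_gb * max_fraction
print(extra_non_heap_gb)  # 1.6 GB at the 5% maximum
```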
GC Profile selection
Another way to improve the performance of your pipeline is to select a specific Garbage Collector profile. When configuring your pipeline, select the GC profile that best suits your workload demands to optimize efficiency.
If a GC profile is not specified:
- For a pipeline template: The system defaults to the pre-configured pipeline GC settings: G1 GC for stream pipelines and Parallel GC for batch pipelines.
- For a pipeline version: It inherits the GC profile from its associated pipeline template.
List of available GC Profiles:
- G1 GC: Designed for large memory systems, it improves performance with predictable pause times, making it ideal for applications requiring consistent responsiveness. This GC is pre-configured for the stream processing environment in the default settings.
- Parallel GC: Focused on maximizing throughput by parallelizing garbage collection tasks, this profile is suitable for high-throughput systems where longer GC pauses are acceptable. This GC is pre-configured for the batch processing environment in the default settings.
Runtime parameters
If you need to specify any additional configuration for the pipeline application, use the Runtime parameters field to do so. Note that all runtime parameters should be specified as key-value pairs, one pair per line. If no additional configuration is required, leave this field blank - it's optional.
For more information about parameters that can be included, see Configurations available for pipeline developers - Runtime parameters.
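For example, the field's contents might look as follows (the keys and values are hypothetical placeholders, assuming the common `key=value` notation, one pair per line):

```
my.app.window-size=300
my.app.output-format=parquet
my.app.debug-logging=false
```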
Cost allocation tag
This field defines a set of cost allocation tags (also known as billing tags) used to group billing records together:
Multiple cost allocation tags can be entered, separated by commas.
Note
Please note that the maximum length of the cost allocation tags is 101 characters. Up to six tags are allowed.
For more information on cost allocation tags, see Cost Management - Developer Guide.
For more information on how usage is calculated, see Billable services.
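Those two limits can be checked before submission; the following is a minimal Python sketch (the helper name is hypothetical, and it assumes the 101-character limit applies to each individual tag):

```python
# Validate a comma-separated cost allocation tag string against the
# documented limits: at most six tags, each at most 101 characters.
MAX_TAGS = 6
MAX_TAG_LENGTH = 101

def valid_cost_allocation_tags(tag_string):
    tags = [t.strip() for t in tag_string.split(",") if t.strip()]
    return (len(tags) <= MAX_TAGS
            and all(len(t) <= MAX_TAG_LENGTH for t in tags))

print(valid_cost_allocation_tags("team-alpha,project-roads"))  # True
print(valid_cost_allocation_tags(",".join(["t"] * 7)))         # False
```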
Pipeline version name
Since the pipeline version is the executable form of the pipeline, it is good practice to assign it a meaningful name. This parameter is near the bottom of the page:
Save the pipeline version and template
When you have filled all the required fields, click Save at the bottom of the screen to save the pipeline version and start the process of creating the pipeline template. Note that this process may take some time to complete, and it will not be possible to activate a pipeline version during this time:
As you can see, the pipeline version is assigned a unique pipeline version ID (UUID), which is used to ensure that all operational commands are sent to the correct pipeline version. The new pipeline version will be available in the Ready state, and it will be possible to activate it once the pipeline template has been created:
Note that the new pipeline version does NOT start running automatically as soon as you click Save. For further information on activating pipeline versions, see the Run pipelines section.
In This Article
- Add a pipeline
- Configure pipeline template and pipeline version
- Select a runtime environment
- Pipeline template selection
- Entry point class name
- Multi-region setup
- Input and output catalogs
- Cluster configuration
- Resource profiles
- Additional Non-Heap Memory
- GC Profile selection
- Runtime parameters
- Cost allocation tag
- Pipeline version name
- Save the pipeline version and template