Incremental data validation
- Last updated: Mar 27, 2025
- 10 minute read
Note
This chapter describes a new feature of the Data Processing Library and the API might change in future versions. Validation Suites are currently available for the Scala language only.
Data Validation Workflow
Validating the datasets in a catalog can prevent situations where data changes cause libraries or services to behave unexpectedly or stop working. In addition, large datasets, such as cartographic data, require systematic, full-coverage testing.
In the Data Processing Library, the validation workflow comprises:
- Feature extraction — the extraction logic reads the input catalog's data and groups it in self-contained partitions of data that can be validated in parallel.
- Validation — the validation logic validates each test data partition against a set of acceptance criteria. It outputs a test report and extracts a set of test metrics.
- Assessment — the assessment logic inspects the test metrics to make a final decision about the quality of the input data. The result is published and can be used downstream, for example, to gate or trigger a live deployment of the input catalog, that is, of your data release candidate.
The Data Processing Library provides specific classes and transformations to implement these phases in the com.here.platform.data.processing.validation package.
Feature Extraction
The validation module does not provide any specific API to extract test features from the input data; instead, it relies on the extraction logic to provide a DeltaSet[K, TestData]. K is either Partition.HereTile, in case of geographically partitioned data, or Partition.Generic, in case of non-geographical data. TestData is a user-defined type representing the data under test. Each TestData value comprises a self-contained, fully specified subset of the input data that can be tested in isolation (for example, the content of a tile and the tiles referenced from it).
The DeltaSet must be partitioned with a PartitionNamePartitioner, which guarantees that the transformations that publish the test reports and aggregate the metrics do not shuffle. The default partitioner provided by a DeltaContext is a PartitionNamePartitioner and can be safely used.
See the chapter about DeltaSets for a description of all the available transformations.
Validation Suite
Test scenarios operate on a single instance of TestData and interact with the module via a TestContext to register the outcome of each test case and to log metric values. Test scenarios extend the base class Suite and implement their testing logic in the run method.
Custom data can be attached to each test outcome, as well as custom GeoJSON geometry, which can be rendered when you inspect the test report layers in the platform portal.
The Suite class can be subclassed and the TestContext interface used directly to implement test scenarios. However, the intended usage is through built-in extensions of the Suite interface that integrate with existing testing frameworks. The module currently provides one such extension, based on Scalatest.
The following snippet shows how the Suite and TestContext interfaces can be used directly:
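The sketch below illustrates the pattern with simplified stand-ins: the Suite, TestContext, and TestData definitions here only mirror the shape described above, and the actual signatures in com.here.platform.data.processing.validation may differ. RoadSegment and all test-case names are hypothetical.

```scala
// Hypothetical data under test: a tile's road segments.
final case class RoadSegment(id: String, lengthMeters: Double)
final case class TestData(tileId: Long, segments: Seq[RoadSegment])

// Minimal stand-in for the library's TestContext.
final class TestContext {
  private var outcomes = Vector.empty[(String, Boolean)]
  def registerOutcome(testCase: String, passed: Boolean): Unit =
    outcomes = outcomes :+ (testCase -> passed)
  def results: Vector[(String, Boolean)] = outcomes
}

// Minimal stand-in for the library's Suite base class.
abstract class Suite[T] {
  def run(data: T, context: TestContext): Unit
}

final class RoadSuite extends Suite[TestData] {
  def run(data: TestData, context: TestContext): Unit = {
    // Partition-level test case: a global property of the whole partition.
    context.registerOutcome("tile has segments", data.segments.nonEmpty)
    // Sub-partition-level test cases: one per extracted road segment.
    data.segments.foreach { s =>
      context.registerOutcome(s"segment ${s.id} has positive length", s.lengthMeters > 0)
    }
  }
}
```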
The snippet above shows the difference between partition-level test cases, which verify a global property of the whole partition, and sub-partition-level test cases, which verify properties of sub-features extracted from the partition (for example, single road topologies or nodes). Even though a Suite is run on a whole TestData partition, the module tracks statistics per test case, which provides the finest granularity during the assessment phase.
Note
Implementing a test Suite itself does not require knowledge of Spark, DeltaSet transformations, or partitioning concepts. Given the definition of a TestData type and the set of acceptance criteria, a developer with no previous knowledge of the Data Processing Library can immediately start writing test scenarios. This is further simplified by the integration of popular test DSLs, like Scalatest.
Scalatest Integration
The com.here.platform.data.processing.validation.scalatest package contains a set of traits to mix into an org.scalatest.Suite that provide access to the TestData instance and the TestContext. Tests can be written using any of the available Scalatest domain-specific languages, and outcomes are automatically registered. In particular, the Bindings trait provides access to the current test context and the data under test. The PayloadAndGeometry trait provides methods to register custom data and geometry that is automatically attached to each test outcome. Nested org.scalatest.Suites can be implemented to test sub-features extracted from the TestData partition. The snippet below shows the same example described earlier, this time implemented with Scalatest:
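To keep the sketch self-contained, a minimal FunSuite-like base class replaces org.scalatest.funsuite.AnyFunSuite below; in real code you would extend the actual Scalatest class and mix in the library's Bindings trait. TestData, NodeSuite, and the test names are hypothetical.

```scala
// Hypothetical data under test.
final case class TestData(tileId: Long, nodeIds: Seq[String])

// Minimal stand-in for a Scalatest-style suite: `test` registers cases,
// `runAll` executes them and records pass/fail, as TestContext would.
abstract class FunSuiteStandIn {
  private var tests = Vector.empty[(String, () => Unit)]
  protected def test(name: String)(body: => Unit): Unit =
    tests = tests :+ ((name, () => body))
  def runAll(): Map[String, Boolean] =
    tests.map { case (n, b) => n -> scala.util.Try(b()).isSuccess }.toMap
}

// Stand-in for the library's Bindings trait exposing the data under test.
trait Bindings { def data: TestData }

final class NodeSuite(val data: TestData) extends FunSuiteStandIn with Bindings {
  test("tile has nodes") { assert(data.nodeIds.nonEmpty) }
  test("node ids are unique") { assert(data.nodeIds.distinct.size == data.nodeIds.size) }
}
```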
Note
The use of nested org.scalatest.Suites is recommended over Scalatest's idiomatic "should behave like", as the latter might create a large number of test cases, which Scalatest runs less efficiently than a large number of nested suites.
An org.scalatest.Suite class can be adapted into a Suite using the ScalatestSuite class, as shown in the snippet below:
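The adapter pattern can be sketched as follows. ScalatestSuiteAdapter here is only a stand-in for the library's ScalatestSuite class, whose actual constructor and signatures are not shown in this chapter; the idea is that a factory builds a Scalatest-style suite per partition and runs it where a validation Suite is expected.

```scala
final case class TestData(ids: Seq[String])
final class TestContext { var failures = 0 }

// Minimal stand-in for the library's Suite base class.
abstract class Suite[T] { def run(data: T, ctx: TestContext): Unit }

// Stand-in for an org.scalatest.Suite built per TestData partition.
abstract class ScalatestStyle { def executeInto(ctx: TestContext): Unit }

final class IdSuite(data: TestData) extends ScalatestStyle {
  def executeInto(ctx: TestContext): Unit =
    if (data.ids.isEmpty) ctx.failures += 1 // failing test case
}

// The adapter: builds the Scalatest-style suite for each partition and runs it.
final class ScalatestSuiteAdapter[T](factory: T => ScalatestStyle) extends Suite[T] {
  def run(data: T, ctx: TestContext): Unit = factory(data).executeInto(ctx)
}
```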
Metrics and Accumulators
For each suite, and for each test case, the validation module tracks the number of failures and successes. This information is later stored and aggregated in a Metrics object, together with custom accumulated metric values stored in generic Accumulators.
The library provides a set of built-in Accumulator implementations to accumulate and track Long and Double values. You can use TestContext.withAccumulator to create a new accumulator or update an existing one:
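A minimal sketch of the pattern, with the caveat that withAccumulator's exact signature is an assumption based on the description above, and LongAccumulator is a simplified stand-in for the library's built-in Long accumulator:

```scala
// Simplified stand-in for the built-in Long accumulator.
final class LongAccumulator {
  private var total = 0L
  def add(n: Long): Unit = total += n
  def value: Long = total
}

// Minimal stand-in for the part of TestContext that manages accumulators.
final class TestContext {
  private val accumulators = scala.collection.mutable.Map.empty[String, LongAccumulator]
  // Creates the named accumulator on first use, then hands it to the caller.
  def withAccumulator(name: String)(f: LongAccumulator => Unit): Unit =
    f(accumulators.getOrElseUpdate(name, new LongAccumulator))
  def accumulatorValue(name: String): Option[Long] = accumulators.get(name).map(_.value)
}
```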
You can implement custom Accumulator classes by subclassing the Accumulator interface. If you use the default JSON serializers, you must then augment the Metrics serializer/deserializer with an additional type hint for your custom Accumulator class:
Running, Publishing and Metrics Aggregation
Given a DeltaSet[K, TestData] containing the distributed TestData partitions and a Suite[TestData] implementing a test scenario, you need to map the suite over all values of the test deltaset, get the returned test Report and Metrics, serialize them into Payloads, and map the payloads to the right output layers.
A SuiteCompiler takes care of all this. Given a Suite or a collection of suites and an instance of TestData, it returns a Map[Layer.Id, Payload] with the encoded test reports and metrics mapped to the report layer (or layers) and to the metrics layer.
A SuiteCompiler is typically mapped over the test data using the DeltaSet mapValues transformation, as the snippet below shows:
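In the sketch below a plain Scala Map stands in for the DeltaSet, and a compile function stands in for the SuiteCompiler; the real code would be of the form testData.mapValues(suiteCompiler.compile). Layer names and payload contents are illustrative assumptions.

```scala
final case class TestData(tileId: Long, ok: Boolean)
final case class Payload(bytes: Array[Byte])

// Hypothetical compile step: runs the suites on one partition and maps the
// encoded report and metrics to their output layers.
def compile(data: TestData): Map[String, Payload] = Map(
  "report"  -> Payload(s"""{"tile":${data.tileId},"passed":${data.ok}}""".getBytes("UTF-8")),
  "metrics" -> Payload(s"""{"failures":${if (data.ok) 0 else 1}}""".getBytes("UTF-8"))
)

val testData = Map(1L -> TestData(1L, ok = true), 2L -> TestData(2L, ok = false))
// The mapValues equivalent: compile every TestData partition.
val published: Map[Long, Map[String, Payload]] =
  testData.map { case (k, v) => k -> compile(v) }
```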
Alternatively, TestData may provide an API to retrieve additional input data referenced by the current partition, without resolving all references beforehand. mapValuesWithResolver can be used to achieve this, passing a Resolver instance to the TestData constructor, as shown in the following code snippet:
Multiple Suite instances, parametrized on the same TestData, can be grouped together in a collection of suites and used from the same SuiteCompiler:
By default, a SuiteCompiler publishes the test reports in the "report" layer and the test metrics in the "metrics" layer, but you can change these defaults. If you use a SuiteCompiler to run multiple suites, you can specify a different report layer for each suite.
A SuiteCompiler requires implicit serializers/deserializers for the Report and Metrics classes. The snippets above use the default serializers, which encode the test reports and metrics as JSON.
Applying a SuiteCompiler to a DeltaSet[K, TestData] produces a DeltaSet[K, Map[Layer.Id, Payload]], where K is again either Partition.HereTile or Partition.Generic, depending on the original partitioning of the input data or on the feature extraction logic. A DeltaSet[K, Map[Layer.Id, Payload]] can be published using a choice of implicit transformations in Transformations.
All these transformations publish the test reports and the metrics in the corresponding output layers, and recursively aggregate the published metrics partitions to build a single, fully aggregated metrics partition. How the aggregation is realized depends on the partitioning of the TestData (the type of K). HERE tile-partitioned metrics are aggregated at progressively coarser zoom levels, up the tile hierarchy. The snippet below shows such a scenario:
Note
By default, publishAndAggregateByLevel walks all zoom levels configured in the output metrics layer, up to zoom level 0. Level 0 (the root HERE tile covering the whole map) must be included in the set of valid tile levels.
Generically partitioned metrics (for example, admin hierarchies, phonetics, any other non-geographical data) are aggregated in a fixed number of steps, where you can specify the number of aggregated partitions in each step:
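The fixed-step aggregation can be sketched as below. The hash-bucket grouping key and the step sizes are assumptions made for the sketch; the point is only that each step reduces many generic partitions to a configurable, smaller number of aggregated partitions.

```scala
final case class Metrics(failures: Long, successes: Long) {
  def merge(other: Metrics): Metrics =
    Metrics(failures + other.failures, successes + other.successes)
}

// One aggregation step: hash-group the partitions into `buckets` partitions.
def step(parts: Map[String, Metrics], buckets: Int): Map[String, Metrics] =
  parts.groupBy { case (name, _) => (name.hashCode & Int.MaxValue) % buckets }
    .map { case (b, group) => (s"agg-$b", group.values.reduce(_ merge _)) }

// Two fixed steps, for example: many partitions -> 16 -> 1.
def aggregate(parts: Map[String, Metrics]): Metrics =
  step(step(parts, 16), 1).values.head
```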
Both these methods return the PublishedSet of the test reports and metrics, together with a DeltaSet[Partition.Key, Metrics] containing a single, fully aggregated Metrics partition for later assessment.
Validation as Part of Compilation Process
If needed, you can run a suite manually, without a SuiteCompiler. For example, you may want to run a test scenario from the same pipeline that compiles the release candidate catalog, either to immediately abort the batch job if the output data does not comply with strict acceptance criteria or to add a quality marker to it:
Assessment
The assess transformation can be applied to a DeltaSet[Partition.Key, Metrics] to compute a custom assessment type containing the final quality-assurance assessment. This typically contains a boolean value indicating whether the validation succeeded, but it can also contain custom, per-use-case evaluations:
If you have multiple SuiteCompilers, mapped over different TestData types and/or different partitioning schemes, you end up with a sequence of DeltaSet[Partition.Key, Metrics], one per SuiteCompiler. You can still use the assess transformation on the sequence of deltasets, which further aggregates the Metrics partitions generated by the different SuiteCompilers:
Note
Two different SuiteCompilers cannot publish to the same report and metrics layers.
Rendering Test Reports and Metrics
If you use the default JSON serializers, you can configure your HERE tiled report and metrics layers with the following schema HRNs:
hrn:here:schema:::com.here.platform.data.processing.validation.schema:report_v2:1.0.0
hrn:here:schema:::com.here.platform.data.processing.validation.schema:metrics_v2:1.0.0
These schemas include rendering plugins that draw the geometry stored in the test reports and render metrics as a heatmap.

