Using PowerCenter Lookup transformations with Ranges of Data

Content

INFA_Content

The Lookup transformation may be used to map an input value to a key if the input value falls within a range. However, the intuitive way to configure the Lookup transformation for this type of semantic is not the highest performing way.

Data Model - Types of Ranges

Three cases should be considered when you look at the data model for range checking:

The first case is when the ranges are contiguous. To fall into this case, all values must fall within a range.

The second case is when the ranges are non-contiguous. In this case, a value may fall outside of a range.

The third case is when the ranges overlap. A value may fall into multiple ranges.

This article describes how to develop high performance connected Lookup transformations to find a key for contiguous or non-contiguous ranges of data.

Note: This article does not address unconnected Lookup transformations, overlapping ranges, or the need to report an error on multiple match.

Mapping a Value into a Range

The intuitive way to find a value that falls within a range is to use a Lookup transformation with two conditions. The conditions are used to satisfy start_range <= input_value <= end_range. However, a Lookup transformation that has two range conditions may have to search through many rows before the conditions are met.

A better performing implementation is to use a Lookup transformation with only one range condition and a Filter transformation. The lookup condition will only use the start value and take advantage of the "Use Last Value" Lookup transformation property. The row returned from the lookup will be the greatest start value that is less than or equal to the given input value. If a value falls outside of a valid range, then a Filter transformation is also needed to check that the input_value <=end_range criteria is met.

As a side note, the lookup condition may also have to satisfy an equality condition that is separate from the range checks. If this is the case, the lookup will contain two conditions with the equality condition being the first and the start_range <= input_value being the second.

Example

You want to find the store identifier forgiven a department number. All department numbers for a given store fall within a range. For example, store identifier "1" has department numbers in the range of 0-999, store identifier "2" has department numbers in the range of 1000-1999, and so on. The lookup table has three columns: store_id, department_id_start, department_id_end.

You create a Lookup transformation with the following ports:

store_id department_id_start department_id_end

Intuitive Solution

The intuitive way to find the store identifier given a department number is to use two lookup conditions:

Lookup Conditions:

Lookup Table Column Operator Transformation Port

department_id_start <= input_department_id

department_id_end >= input_department_id

This transformation will return the correctstore_id. However, the number of rows that must be checked to satisfy the condition is large. This will negatively impact performance.

Better Solution

A better way to do this is to use one lookup condition and one filter condition:

Lookup Condition:

Lookup Table Column Operator Transformation Port

department_id_start <= input_department_id

Filter Condition:

NOT ISNULL(store_id) and

input_department_id <= department_id_end

Note: The check for the store_id being not NULL is used to insure the lookup found a row.

Use the lookup property "Use Last Value". With that property set, the lookup row satisfying the condition will have the highest department_id_start value that is less than theinput_department_id.

If all input values fall into exactly one range, then the row returned will be the same as the lookup with two conditions.

If a value falls outside of a range, the store_id,department_id_end, and input_department_id from the lookup needs to be passed to a Filter transformation. A filter condition is used to check that the value is less than or equal to the range end value(otherwise it is outside of the range).

Using one range condition in the Lookup transformation and one condition in the Filter transformation instead of using two range conditions in the Lookup transformation reduces the number of rows that the lookup searches.

Lookup Index

The next two sections are included to give a better understanding about the Lookup transformation with regard to the amount of data that needs to be searched in order to satisfy a condition.

The Lookup transformation builds an index using the Lookup ports specified in the condition. A composite key is used when there is more than one lookup port in the condition. The port order (from most to least significant) of the keys are the same as the order listed in the condition. A key used in two conditions is only included once in the composite key. The key order is very important to the performance of the Lookup transformation. The index is used to reduce the number of rows that need to be checked. This set of rows is called the search space.

There are three items that affect how much data is searched when finding data:

Index key order

Operators in the condition

Lookup Policy On Multiple Match property

The search space is formed by reducing the row set using the condition from the most significant to the least significant key. The number of keys used to prune the search space depends on the lookup condition operator.

The equality operator reduces the search space the most. These ports should be placed first in the list of conditions. If all lookup conditions contain the equality operator, then the search space is reduced to only one row. The not equal operator does not reduce the search space and should be used judiciously. The <, >, <=, and >= operators will reduce the search space but only will eliminate rows outside of the range that the input value determines.

The Lookup Policy On Multiple Match property affects processing in two ways. The "Use First Value" and "Use Last Value" determine which way to scan the data defined in the search space. "Use First Value" scans the data from low to high, while the "Use Last Value" scans the data from high to low. The first value found to satisfy the condition is returned. The "Error On Multiple Match" property will cause extra processing after a row is found that meets the conditions. The extra processing is needed to see if other rows also meet the condition. This may have a heavy impact if operators other than equality are used.

Why Performance is Better

Why does the lookup perform so much faster using only the start range condition? The answer is the amount of data that needs to be searched is much less. Lets look at the Lookup transformation that uses two conditions. Suppose the store identifiers range from 1-1,000,000 (the departments range is the same in the above example). Take the worst case scenario when the Lookup transformation has the "Use First Value" property set, and the lookup conditions are department_id_start <=input_department_id and department_end_id >=input_department_id. Suppose the input_department_id port is equal to 1,000,999 (the very last department). Since we are looking forth first value (lowest) to satisfy the condition, we need to look at the data starting from department 1 to department 1,000,000. Every start value in the cache satisfies the first condition (start). All values in the lookup cache will be examined in this case. Talk about poor performance!

If the property "Use Last Value" is set, we will greatly reduce the search space since we will start looking from department 1,000,000. Since we also have a second range condition (end) the rows that satisfy the first condition need to be looked at to see if they also satisfy the second condition. In this case the first row that is found in the index satisfies both conditions. The addition of the Filter transformation to check the end range will not improve performance when the data is 1,000,000.

The use of the filter optimizes the lookup when the input value is not within a valid range. Suppose we use two conditions, and the input value is 1,001,000. This department does not exist in the lookup tables data. The first condition (department_id_start <= input_department_id) is satisfied by all rows in the cache. We need to examine all rows to see that the second condition is never met.

Using only one range condition with a filter reduces the search space for the lookup. The search space is reduced to the minimum when the data falls within a range or falls outside of a range (found or not found).

Summary

The performance of range lookups may be increased significantly by designing transformations that reduce the lookup search space. Use the following rules when making range checks:

Place equality conditions before range conditions in the Lookup transformation.

Use only the start of range condition and the "Use Last Value" lookup property.

Use a Filter transformation after the Lookup transformation to check for values that fall within the end range.

This document resolved my issue This document did not resolve my issue This document helped but additional information was required to resolve my issue

What can we do to improve this information (2000 or fewer characters)