Getting Rid of Range Joins

Posted June 1, 2017 by Soniya Shah, Information Developer

This blog post was authored by Soniya Shah.

You can use range joins to categorize data into buckets. Vertica provides performance optimizations for <, <=, >, >=, and BETWEEN predicates. These optimizations are particularly useful when a column from one table is restricted to be in a range specified by two columns of another table.

Range joins can be long running if you have very large table(s). This post provides you with options to work with range joins by using the following methods:

• Convert range join to inline lookup function—create a new function to group the age into two categories.
• Convert range join to equi join – where a range predicate (such as between) is replaced by an = predicate

The following describes a typical use case for using a range join:
• There is a large fact table (person) with individual age data
• There is a dimension table (buckets), which is used to classify each individual in the person table along two dimensions, based on age => CREATE TABLE person (ssn varchar primary key, classid int, dateOfBirth date, dateDeceased date) ORDER BY classid, ssn segmented by hash(classid) all nodes ksafe;
=> CREATE TABLE buckets (classid int NOT NULL, fromValue int NOT NULL, toValue int NOT NULL, description varchar(100) NOT NULL) ORDER BY classid, fromValue, toValue unsegmented all nodes ksafe;
=> SELECT * FROM person; ssn | classid | dateOfBirth | dateDeceased --------------+---------+-------------+-------------- 170-749-963 | 2 | 1957-04-28 | 1978-04-28 924-435-7894 | 1 | 0198-04-28 | 1994-04-28 25-154-3821 | 1 | 1952-04-28 | 1987-04-28 430-263-3135 | 2 | 1910-04-28 | 1977-04-28 705-218-7617 | 1 | 1964-04-28 | 1992-04-28 502-687-4360 | 2 | 1957-04-28 | 1963-04-28 568-959-8677 | 1 | 1927-04-28 | 692-368-8739 | 2 | 1944-04-28 | 715-83-4558 | 1 | 1954-04-28 | 732-260-3942 | 2 | 1984-04-28 | 121-01-1111 | 1 | | (11 rows)
=> SELECT * FROM buckets; classid | fromValue | toValue | description ---------+-----------+---------+------------- 1 | 0 | 2 | infant 1 | 3 | 5 | toddler 1 | 6 | 12 | child 1 | 13 | 19 | teen 1 | 20 | 62 | adult 1 | 63 | 100 | senior 2 | 0 | 18 | dependent 2 | 19 | 100 | taxpayer (8 rows)
Buckets classify the person based on their age and categorizes them into taxpayers and dependents. Use Query 1 to join the buckets and person table. The description column in this query is a classification of the person based on the data in the buckets table. It is a result of the range join operation on the person’s age (as derived from dateDeceased and dateOfBirth) and the range data in the buckets table (fromValue and toValue). => SELECT p.ssn,p.classid,dateOfBirth,dateDeceased,description FROM person p JOIN buckets b ON (p.classid=b.classid AND COALESCE(DATEDIFF('year', p.dateOfBirth, COALESCE(p.dateDeceased,SYSDATE)),-5) BETWEEN b.fromValue AND b.toValue) -- <== This is the range join that can add time to this query. It checks to see if a persons age is between a given range order by 1,2,3,4,5; ssn | classid | dateOfBirth | dateDeceased | description --------------+---------+-------------+--------------+------------- 170-749-963 | 2 | 1957-04-28 | 1978-04-28 | taxpayer 25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult 430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | taxpayer 502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent 568-959-8677 | 1 | 1927-04-28 | | senior 692-368-8739 | 2 | 1944-04-28 | | taxpayer 705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult 715-83-4558 | 1 | 1954-04-28 | | senior 732-260-3942 | 2 | 1984-04-28 | | taxpayer (9 rows) Many times this range join can be slow running and can increase the time it takes for a query to complete. Let’s see two options to eliminate range joins.

Option 1: Convert the range join to inline lookup function

To eliminate the slow running range join, replace the entire join by a materialized column on the person table. This option is only possible if the buckets table is small.

The following is an example of how you can modify the person table to eliminate the buckets table entirely: => ALTER TABLE person add column description varchar default case WHEN classid=2 THEN ( CASE WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=0 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=18 then 'dependent' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=19 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) <=100 then 'tax payer' ELSE null END) when classid=1 then ( CASE WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=0 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=2 then 'infant' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=3 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=5 then 'toddler' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=6 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=12 then 'child' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=13 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=19 then 'teen' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=20 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=62 then 'adult' WHEN COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5) >=63 and COALESCE(DATEDIFF('year', dateOfBirth, COALESCE(dateDeceased,SYSDATE)),-5)<=100 then 'senior' ELSE null END) ELSE null END; Rewrite Query 1 as follows. This query has no range join and the query produces the same output as before. => SELECT ssn,classid,dateOfBirth,dateDeceased,description FROM person WHERE description is not null ORDER by 1,2,3,4,5; ssn | classid | dateOfBirth | dateDeceased | description --------------+---------+-------------+--------------+------------- 170-749-963 | 2 | 1957-04-28 | 1978-04-28 | tax payer 25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult 430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | tax payer 502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent 568-959-8677 | 1 | 1927-04-28 | | senior 692-368-8739 | 2 | 1944-04-28 | | tax payer 705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult 715-83-4558 | 1 | 1954-04-28 | | senior 732-260-3942 | 2 | 1984-04-28 | | tax payer (9 rows)

Option 2: Convert the range join to an equi join

If the buckets table is very large you cannot write a function to materialize the description column. Use another alternative and flatten the buckets table as shown below.

Here, we use the buckets2 table instead of the buckets table. The buckets2table is a flattened table and does not have a fromValue and toValue for age. Instead, the table has a single value age, and a corresponding description. => SELECT * FROM buckets2; classid | age | description ---------+-----+------------- 1 | 0 | infant 1 | 1 | infant 1 | 2 | infant 1 | 3 | toddler 1 | 4 | toddler 1 | 5 | toddler 2 | 6 | child … … 2 | 0 | dependent 2 | 1 | dependent 2 | 2 | dependent 2 | 3 | dependent … 2 | 19 | taxpayer … 2 | 100 | taxpayer (202 rows) Rewrite Query 1 as follows. This query has no range join (between clause), instead there are equi-joins (in bold). => SELECT * FROM ( SELECT ssn,p.classid,dateOfBirth,dateDeceased,description FROM person p join buckets2 b using (classid, age) where p.age > 0 union SELECT ssn,p.classid,dateOfBirth,dateDeceased,description FROM person p join buckets2 b on (p.classid = b.classid and datediff('year',dateofbirth,sysdate) = b.age) WHERE p.age < 0 ) AS X ORDER BY 1,2,3,4,5; ssn | classid | dateOfBirth | dateDeceased | description --------------+---------+-------------+--------------+------------- 170-749-963 | 2 | 1957-04-28 | 1978-04-28 | taxpayer 25-154-3821 | 1 | 1952-04-28 | 1987-04-28 | adult 430-263-3135 | 2 | 1910-04-28 | 1977-04-28 | taxpayer 502-687-4360 | 2 | 1957-04-28 | 1963-04-28 | dependent 568-959-8677 | 1 | 1927-04-28 | | senior 692-368-8739 | 2 | 1944-04-28 | | taxpayer 705-218-7617 | 1 | 1964-04-28 | 1992-04-28 | adult 715-83-4558 | 1 | 1954-04-28 | | senior 732-260-3942 | 2 | 1984-04-28 | | taxpayer (9 rows)

Summary

The example above shows how to remove a long running range join from your query and replace it with a (option 1) inline materialized column OR (option 2) flattened table with equi-joins.