myVertica  

Vertica Blog

Working with Joins

This blog post was authored by Soniya Shah.

Vertica supports a variety of join types. This post discusses the following joins:

• Inner joins
• Left, right, and full outer joins
• Natural joins
• Cross joins

In Vertica, we refer to the tables participating in the join as left or right. The left table is specified first in the join statement and the right table is the table mentioned second.

Example Scenario

This post uses the following scenario:

Your organization wants employees to increase or refresh their computer programming skills. To accomplish this, you put together a list of courses from a local university that employees can take. To help keep track of employees and completed classes, you reference the following two tables:

• employees: A table with detailed employee information for all of your organization’s employees. For simplicity, let’s imagine you only have ten employees.
• courses: A table the university uses to record each time someone takes a course. This includes people from outside your organization. For simplicity, let’s imagine the university only has records for ten class sessions.

The following code builds each table: => CREATE TABLE employees (personal_id INT, employee_name VARCHAR(50), employee_age INT, emplyee_gender CHAR(1)); => INSERT INTO employees VALUES (1, 'John', 43, 'M'); => INSERT INTO employees VALUES (2, 'Dan', 52, 'M'); => INSERT INTO employees VALUES (3, 'Lori', 38, 'F'); => INSERT INTO employees VALUES (4, 'Tom', 55, 'M'); => INSERT INTO employees VALUES (5, 'Mary', 39, 'F'); => INSERT INTO employees VALUES (6, 'Virginia', 66, 'F'); => INSERT INTO employees VALUES (7, 'Gary', 63, 'M'); => INSERT INTO employees VALUES (8, 'Rebecca', 19, 'F'); => INSERT INTO employees VALUES (9, 'Steven', 32, 'M'); => INSERT INTO employees VALUES (10, 'Jessica', 25, 'F'); => CREATE TABLE courses (record_id INT, personal_id INT, course_id INT, course_name VARCHAR(50), date_taken DATE); => INSERT INTO courses VALUES (1, 10, 1, 'Intro to Comp Sci', '2015-01-05'); => INSERT INTO courses VALUES (2, 7, 4, 'Java 303', '2015-04-05'); => INSERT INTO courses VALUES (3, 53, 3, 'Database Architecture', '2015-08-03'); => INSERT INTO courses VALUES (4, 4, 2, 'SQL 101', '2016-01-13'); => INSERT INTO courses VALUES (5, 6, 4, 'Java 303', '2014-09-10'); => INSERT INTO courses VALUES (6, 6, 1, 'Intro to Comp Sci', '2015-03-29'); => INSERT INTO courses VALUES (7, 2, 2, 'SQL 101', '2014-10-30'); => INSERT INTO courses VALUES (8, 27, 2, 'SQL 101', '2013-07-04'); => INSERT INTO courses VALUES (9, 8, 4, 'Java 303', '2016-01-05'); => INSERT INTO courses VALUES (10, 3, 3, 'Database Architecture', '2015-07-13');

How Vertica Performs Joins

Vertica uses two basic algorithms to perform a join operation:

•Hash join: Used to join large data sets. In a hash join, Vertica uses the smaller (or inner) table to build an in-memory hash table on the join column. The Vertica execution engine then scans the larger (or outer) table and examines the hash table to look for matches. The table size is determined by the number of rows times the size of each row. The optimizer chooses a hash join when projections are not sorted on the join columns. Although there are no sort requirements, the cost for performing a hash join can rise if the entire hash table can’t fit in memory.



•Merge join: If both inputs are pre-sorted on the join column, the optimizer chooses a merge join.

The good news is you don’t need to pay attention to the join algorithms because the Vertica optimizer automatically chooses the most appropriate algorithm given the query and projections in a system. In the JOIN SQL statement, the join predicate specifies how Vertica should join the tables. Specify the join predicates in the ON clause along with the columns from the left and right tables that you want to combine. For the examples in this tutorial, we’ll use the equality predicate (=), meaning we want to joins columns that match:

Inner Joins

An inner join combines records from two tables based on a join predicate and returns rows from columns specified in the SELECT statement that satisfy that predicate. In the diagram below, the green shaded area represents values that are returned by an inner join. Duplicate values might be returned.



If you want to list your employees who have taken classes, along with information about these classes, you can use the equality predicate to join the two tables on their common column, personal_id. This inner join includes rows from both tables where the personal_id from the employees table matches a value in the courses table’s personal_id column: => SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e INNER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;

If a row from the left table’s joined column matches more than one row from the right table’s specified column, the join will return all of those rows (provided the columns are specified in the SELECT statement). This is why some employees, like Virginia, show up multiple times because they have taken multiple classes.

When a value in the joined column appears in one table but not the other, that row is not returned. That is why some employees like Mary don’t show up at all, because they haven’t taken any classes yet.

Outer Joins

Outer joins extend the functionality of inner joins by letting you preserve rows in one or both tables that do not have matching rows in the other table.

A left outer join preserves a complete set of records from the left (preserved) table, along with any matched records in the right (non-preserved) table. When Vertica finds no match, it inserts a null value for the right table.

In the diagram below, the green area represents values that are returned by a left outer join:



Suppose you want to see a list of all your employees along with classes they’ve taken. The result of a left outer join will be a record of every employee and a blank entry for rows from the courses table if that employee hasn’t taken a course. => SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e LEFT OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;

When performing a right outer join, Vertica returns a complete set of records from the right-join (preserved) table as well as matched values from the left joined (non-preserved) table. Where Vertica finds no match, it returns null values for the left table.

In the diagram below, the green area represents values that are returned by a right outer join:



You can use this example to list all courses taken, whether or not one of your employees participated: => SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e RIGHT OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;

You can combine left outer joins and right outer joins with a full outer join. A full outer join returns all records specified in the SELECT list, including nulls (missing matches), from either table in the join. In the diagram below, the green area represents values that are returned by a full outer join:



A full outer join is useful if you want to see, for example, each employee who has taken a particular class and each class that was taken by one of your employees, but you also want to see all of the employees who have not taken any courses, as well as any course that has not been attended by one of your employees: => SELECT e.personal_id AS e_personal_id, employee_name, c.personal_id AS c_personal_id, c.course_name, c.date_taken FROM employees e FULL OUTER JOIN courses c ON e.personal_id = c.personal_id ORDER BY e.personal_id ASC;

Natural Joins

A natural join in a join with an implicit join predicate. An implicit join predicate doesn’t include any explicit join conditions, such as e.personal_id = c.personal_id. Instead, implicit connections are formed by matching all pairs of columns in the tables that have the same name and compatible data types. The result set contains only one column representing the pair of equally-named columns.

Natural joins take the following form: => SELECT FROM NATURAL [ INNER | LEFT OUTER | RIGHT OUTER | FULL OUTER ] JOIN Vertica performs a natural join by creating an inner join on the column common to both tables. This is useful both syntactically and if you don’t know the names or commonality of columns.

For example, this query naturally joins the two personal_id columns:

Cross Joins

Use a cross join when you want to list all possible combinations of matching one table with another.

Cross joins take the following form: => SELECT FROM CROSS JOIN For example, you might want to compare every employee with every course record to see who they’ve taken classes with. In the result set, Vertica retrieves a record from the employees table and creates a row for every record in the courses table. It then does the same for the rest of the records in the employee table until each row in the employee table is displayed with each row of the courses table. The number of rows equals the number of rows in the first table multiplied by the number of rows in the second table. In our example, the output is 100 rows (10 *10). The example output below only shows a sample: => SELECT * FROM employees e CROSS JOIN courses ORDER BY e.personal_id ASC;

For more information, see Joins in the Vertica documentation.