Vertica Integration with Data Virtuality: Connection Guide

About Vertica Connection Guides

Vertica connection guides provide basic instructions for connecting a third-party partner product to Vertica. Connection guides are based on our testing with specific versions of Vertica and the partner product.

Data Virtuality and Vertica: Latest Versions Tested

This document is based on testing using the following versions:

  • Partner Product: Data Virtuality Studio 2.3.5, Data Virtuality Server 2.3.5
  • Desktop Platform: Windows Server 2016
  • Vertica Client: Vertica JDBC 9.3.1
  • Vertica Server: Vertica Analytic Database 9.3.1

Data Virtuality Overview

Data Virtuality is a federated relational query engine that lets you access multiple data sources as a single virtual database. You can use standard SQL or client tools to query the virtual database. The following diagram illustrates the Data Virtuality architecture.

architecture diagram.png

Data Virtuality software allows you to:

  • Configure Vertica for use as analytical storage.
  • Connect multiple data sources, including Vertica, to the Data Virtuality server.
  • Write cross-database joins and queries.
  • Create virtual views to provide only relevant information for business users.
  • Connect SQL and client tools to Data Virtuality Server to generate reports using data in views and data obtained directly from the source tables. Client tools may be visualization tools such as Tableau, database browsers such as DbVisualizer, or spreadsheet applications such as Excel.
  • Optimize reports by accepting Data Virtuality optimizations. When you accept optimizations, query results are stored in tables in the analytical storage.
  • Run optimized reports. After you accept optimizations, Data Virtuality automatically redirects the queries to data in the analytical storage and not to the live data in the source databases.

About Data Virtuality Server

You can install Data Virtuality Server on site or as a cloud-based instance on Amazon. In production environments, Data Virtuality Server uses Postgres as the query engine. The trial version of Data Virtuality uses an H2 database.

About Analytical Storage

The analytical storage, also known as the virtualization engine or the logical or internal data warehouse, is a database that Data Virtuality Server uses to materialize query results. Data Virtuality Server only uses analytical storage when you accept query optimizations. By default, Data Virtuality Server accesses live data directly from the data sources without accessing the analytical storage.

The benefit of query optimization is enhanced performance. However, the materialized queries in the analytical storage must be kept up to date. You can choose to refresh the contents of the analytical storage incrementally, or you can perform a full load each time you refresh the source data.

You can choose to query live data from the source, from the analytical storage-hosted data, or from a combination of the two. You can specify which data you want to store in the analytical storage and which data you want to query directly from the source.

Where Does Vertica Fit In?

You can configure Vertica to run as the analytical storage, as a data source, or both.

Important If you use Vertica as the analytical storage, the materialized tables count towards your Vertica license.

Installing Data Virtuality Suite

The Data Virtuality Suite includes:

  • Data Virtuality Server: The query engine that executes the data federation.
  • Data Virtuality Studio: The client tool for managing Data Virtuality Server. Data Virtuality Studio uses the Data Virtuality JDBC driver to connect to Data Virtuality Server.
  • Data Virtuality drivers: JDBC and ODBC drivers to connect to Data Virtuality Server from SQL and client tools.

Use the Data Virtuality Suite installer to install the Data Virtuality software. The installer is available for Windows, Linux, and MacOS.

This guide provides installation instructions for the trial version of the Data Virtuality Suite on Windows.

For an onsite installation on Windows, follow these instructions:

  1. Go to the Data Virtuality download site.
  2. Register and download the Data Virtuality Suite installer.
  3. As Windows Administrator, start the installer and select all the components for a full on-premises installation.
  4. Follow the installation instructions in the Data Virtuality documentation.

About the Vertica JDBC Driver

Data Virtuality ships the Vertica JDBC driver version 9.3.1 with the Data Virtuality Suite. You do not need to install the Vertica driver unless you need to enable capabilities in later versions of Vertica.

Important Data Virtuality Studio uses this driver to connect to Vertica as a data source or as analytical storage.

If you need a different version of the Vertica JDBC driver, replace it in the following folder:

C:\Program Files\Data Virtuality Suite\DVServer\modules\com\vertica\main

For information about installing Vertica drivers, see the Vertica documentation.
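If you script the driver swap, a small sketch like the following can back up the bundled jar before copying the new one into place. This helper is illustrative only, not part of Data Virtuality; depending on the server's module configuration, a module.xml in the same folder may also need to reference the new jar name.

```python
import shutil
from pathlib import Path

def replace_vertica_jar(module_dir, new_jar):
    """Back up any jar currently in module_dir, then copy new_jar in.

    module_dir -- the DVServer modules folder for Vertica, e.g.
    C:\\Program Files\\Data Virtuality Suite\\DVServer\\modules\\com\\vertica\\main
    new_jar    -- path to the replacement Vertica JDBC jar
    """
    module_dir, new_jar = Path(module_dir), Path(new_jar)
    backup = module_dir / "backup"
    backup.mkdir(exist_ok=True)
    for jar in module_dir.glob("*.jar"):
        # keep the old driver in case you need to roll back
        shutil.move(str(jar), str(backup / jar.name))
    shutil.copy(str(new_jar), str(module_dir / new_jar.name))
```

Restart Data Virtuality Server after replacing the driver so the new jar is loaded.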

Connecting Data Virtuality Studio to Data Virtuality Server

After you install Data Virtuality Suite, you must create a connection between Data Virtuality Server and Data Virtuality Studio.

Starting Data Virtuality Server

When installed on Windows for onsite access, Data Virtuality Server runs as a Windows service. You can start Data Virtuality Server from the Windows Start menu or by using the Service Manager utility in Windows Control Panel.

Starting Data Virtuality Studio

You can start Data Virtuality Studio from the Windows Start menu or by double-clicking the executable file, dvstudio.exe, located in the DVStudio folder. The default path in Windows is:

 C:\Program Files\Data Virtuality Suite\DVStudio\dvstudio.exe

Creating the Connection

When you start Data Virtuality Studio for the first time, the Connect to Data Virtuality Server wizard opens.

connectDVserver.png

  1. Provide the following connection information:

    • Connections: Select an existing connection to Data Virtuality Server from the list, or provide information for a new connection.
    • Title: The name of the connection.
    • Host (default localhost): For Data Virtuality Server on premises, the name or IP address of the machine running Data Virtuality Server. For Data Virtuality Server in the cloud, the URL of the Data Virtuality instance on Amazon.
    • Port (default 31000): The port on which the server listens.
    • SSL (default OFF): Disabled by default.
    • Database (default datavirtuality): Name of the current Data Virtuality database.
    • Schema (default datavirtuality): Do not change the schema name.
    • User name (default admin): Name of the administrative user.
    • Password (default admin): Password for the administrative user.
  2. Click Test Connection.
  3. Click Connect to connect to Data Virtuality Server.

Note Ensure that Data Virtuality Server is running before you connect. If the server is not running, a Connection failed error is displayed.

Configuring the Analytical Storage

After configuring the connection to Data Virtuality Server, you can configure the analytical storage. This step is required if you intend to accept query optimizations, as described in Accept Optimizations.

Follow these steps to configure Vertica for use as analytical storage:

  1. Create an empty schema (dwh in this example) in your Vertica database.

    => CREATE SCHEMA dwh;
  2. Create a new user and grant read/write privileges to the dwh schema.

    => CREATE USER datavirtuality_user;
    => ALTER USER datavirtuality_user IDENTIFIED BY 'datavirtuality_admin';
    => GRANT ALL PRIVILEGES ON SCHEMA dwh TO datavirtuality_user WITH GRANT OPTION;
    => GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA dwh TO datavirtuality_user;
  3. In the Data Explorer of Data Virtuality Studio, right-click the Analytical Storage node and select Add Analytical Storage.

  4. In the Add Analytical Storage wizard, select Vertica as the data source type and click Next.
  5. In the New Analytical Storage window, enter the following required connection parameters:

    • Host (required): The Vertica server IP address or name.
    • Port (required): The default is 5433.
    • Database (required): The Vertica database name, for example Partner72DB.
    • Schema (required): The schema name; in this example, dwh.
    • User (required): The database user; in this example, datavirtuality_user.
    • Password (required): The database password; in this example, datavirtuality_admin.
    • Datasource Parameters (required):

      importer.useFullSchemaName=FALSE,importer.tableTypes="TABLE,VIEW",importer.importIndexes=TRUE
    • Auto-generated parameters:

      importer.schemaPattern=dwh,importer.defaultSchema=dwh
      

      The import settings search for and collect all available metadata about the database. This metadata is stored in the SYSTEM and SYS schemas in the Data Virtuality Server data engine (an H2 or Postgres database). In the case of Vertica, the importer.importIndexes setting is ignored.

    • Translator Parameters: empty by default.

      A translator is an interface between Data Virtuality Server and the data source. A translator imports metadata and determines which SQL constructs are supported for pushdown and how data is retrieved.

      In most cases, you do not need to adjust the translator parameters except when your queries use international character sets. To support multibyte characters, increase the value of the varcharReserveAdditionalSpacePercent translator parameter to accommodate the Vertica VARCHAR data type. Most SQL databases, including Data Virtuality Server, calculate the length of VARCHAR in characters. Vertica, however, calculates the length of VARCHAR in bytes. This means that a VARCHAR(X) field in Vertica can sometimes store fewer characters than comparable data types in other systems, especially when international characters are used.

    • JDBC parameters: Use this field to specify additional JDBC settings such as Native Connection Load Balancing. For a complete list of Vertica JDBC settings see the Vertica documentation.
  6. To add Vertica as the analytical storage, click Test connection then Finish.

To view the objects in the analytical storage, double-click the Analytical storage node (dwh) in the Data Explorer.
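The byte-based VARCHAR sizing described above is easy to demonstrate. This short Python sketch (illustrative only, not part of either product) shows how a string's character count and its UTF-8 byte count diverge once international characters appear, which is why extra reserve space can be needed for Vertica VARCHAR columns:

```python
# A VARCHAR(6) in most databases holds 6 characters; in Vertica it holds 6 bytes.
name = "Müller"                      # 6 characters
chars = len(name)                    # character count: 6
octets = len(name.encode("utf-8"))   # byte count: 7 (ü occupies 2 bytes in UTF-8)
print(chars, octets)                 # a 6-byte Vertica VARCHAR could not store this value
```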

Adding Data Sources

Before you can use Data Virtuality to query data, you must connect Data Virtuality Server to one or more data sources. You can configure different SQL databases as Data Virtuality data sources.

To add Vertica as a data source using JDBC, follow these steps:

  1. In Data Virtuality Studio, right-click the Data source node.
  2. Choose Add data source.
  3. In the Add data source wizard, select Databases and then select Vertica.

    adddatasource.png

  4. Click Next.
  5. Enter the connection information for your specific Vertica database.

    • Alias (required): The name of the data source that will appear in the Data Explorer and that you use in the query editor to fully qualify the table names. In this example, the alias is VerticaVMartSource.
    • Host (required): The Vertica server IP address or name.
    • Port: The default is 5433.
    • Database (required): The Vertica database name. In this example, the database name is VMart.
    • User name (required): The database user.
    • Password (required): The database password.
    • Data source parameters:

      This example shows the default import settings with importer.schemaPattern set to query the VMart public, store, and online_sales schemas.

      importer.schemaPattern="store,public,online_sales",
      importer.useFullSchemaName=FALSE,
      importer.tableTypes="TABLE,VIEW",importer.importIndexes=TRUE

      The default import settings search for and collect all available metadata about the data source. This metadata is stored in the SYSTEM and SYS schemas in the Data Virtuality Server data engine (an H2 or Postgres database). You can add other import settings, such as importer.importKeys, which makes the keys from data source tables visible in the SYS.Keys table.

      Similarly, you can set UseCommentsInSourceQuery=true to label queries with session and request IDs. The label looks as follows:

      /*teiid sessionid:Wj37rmp5TLX8, requestid:Wj37rmp5TLX8.15.8*/

      However, importing metadata is an expensive process. You may want to limit the import settings to two: importer.schemaPattern and importer.tableTypes. In the case of Vertica, the importer.importIndexes setting is silently ignored.

    • Translator Parameters: Empty by default.

      A translator is an interface between Data Virtuality Server and the data source. A translator imports metadata and determines which SQL constructs are supported for pushdown and how data is retrieved.

      In most cases, you do not need to adjust the translator parameters except when your queries use international character sets. To support multibyte characters, increase the value of the varcharReserveAdditionalSpacePercent translator parameter to accommodate the Vertica VARCHAR data type. Most SQL databases, including Data Virtuality Server, calculate the length of VARCHAR in characters. Vertica, however, calculates the length of VARCHAR in bytes. This means that a VARCHAR(X) field in Vertica can sometimes store fewer characters than comparable data types in other systems, especially when international characters are used.

    • JDBC parameters: Use this field to specify additional JDBC settings such as Native Connection Load Balancing. For a complete list of Vertica JDBC settings, see the Vertica documentation.

  6. Click Next.
  7. Check Gather Statistics to collect statistics.

    These statistics help the query engine make better recommendations for optimizing queries. Collected statistics include:

    • Table statistics: Total number of records in the table.

    • Column statistics: Number of distinct values, number of null values, and the min and max values in a column.

  8. Click Finish to add the data source.
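The data source parameters shown above are a comma-separated list of key=value settings. As an illustration only (this helper is not a Data Virtuality API), the string can be assembled programmatically, which makes it easier to keep quoting and commas straight:

```python
def datasource_params(settings):
    """Render importer settings as the comma-separated string Data Virtuality expects."""
    return ",".join(f"{key}={value}" for key, value in settings.items())

# Reproduce the example settings from the steps above.
params = datasource_params({
    "importer.schemaPattern": '"store,public,online_sales"',
    "importer.useFullSchemaName": "FALSE",
    "importer.tableTypes": '"TABLE,VIEW"',
})
print(params)
```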

Running Live Queries Against Multiple Data Sources

After you add data sources, you can start writing SQL queries in the SQL Query Editor. To open a new SQL editor window, choose Open SQL editor from the Window menu or click the Open SQL editor icon. In this editor window, you can issue queries against one or more tables from one or more connected data sources. To distinguish between data sources, fully qualify each table name with the data source alias and the schema name, separated by dots.

In this example, VerticaVMartSource is the data source name, public is the schema in Vertica, and inventory_fact is the table in Vertica:

VerticaVMartSource.public.inventory_fact

Column names can be referenced in the following form:

VerticaVMartSource.public.inventory_fact.product_key

You can write simple or complex queries that are directed to different data sources. The following example is a SQL query that returns data from Vertica, Oracle, and SQL Server:

--Inventory fact table - analysis

=> SELECT
"VMartVertica.date_dimension".date,
"VMartVertica.date_dimension".full_date_description,
"VMartVertica.date_dimension".day_of_week,
"VMartVertica.date_dimension".calendar_month_name,
"VMartVertica.date_dimension".calendar_month_number_in_year,
"VMartVertica.date_dimension".calendar_year_month,
"VMartVertica.date_dimension".calendar_quarter,
"VMartVertica.date_dimension".calendar_year_quarter,
"VMartVertica.date_dimension".calendar_year,
"VMartSQLServer.product_dimension".product_key || "VMartSQLServer.product_dimension".product_version as product_version_key,
"VMartSQLServer.product_dimension".product_description,
"VMartSQLServer.product_dimension".sku_number,
"VMartSQLServer.product_dimension".category_description,
"VMartSQLServer.product_dimension".department_description,
"VMartSQLServer.product_dimension".package_type_description,
"VMartSQLServer.product_dimension".package_size,
"VMartSQLServer.product_dimension".fat_content,
"VMartSQLServer.product_dimension".diet_type,
"VMartVertica.warehouse_dimension".warehouse_name,
"VMartVertica.warehouse_dimension".warehouse_city,
"VMartVertica.warehouse_dimension".warehouse_state,
"VMartVertica.warehouse_dimension".warehouse_region,
"VMartOracle.inventory_fact".qty_in_stock
FROM
"VMartOracle.inventory_fact"
INNER JOIN "VMartVertica.date_dimension"
ON "VMartOracle.inventory_fact".date_key = "VMartVertica.date_dimension".date_key
INNER JOIN "VMartVertica.warehouse_dimension"
ON "VMartOracle.inventory_fact".warehouse_key = "VMartVertica.warehouse_dimension".warehouse_key
INNER JOIN "VMartSQLServer.product_dimension"
ON "VMartOracle.inventory_fact".product_key = "VMartSQLServer.product_dimension".product_key AND
"VMartOracle.inventory_fact".product_version = "VMartSQLServer.product_dimension".product_version
WHERE
"VMartVertica.date_dimension".date >= '2003-01-01' AND "VMartVertica.date_dimension".date <= '2017-12-31' AND
"VMartSQLServer.product_dimension".discontinued_flag = 0;

Data Virtuality executes a separate query at each source to get the necessary data. Data Virtuality Server applies the filters from the original query at each source, minimizing the data transferred to the Data Virtuality Server data engine. The result sets from each source are transferred to Data Virtuality Server, whose data engine performs the joins and the final calculations. Data Virtuality Server then passes the result to the client.

All queries connect live to the original sources unless optimizations have been accepted, in which case Data Virtuality Server uses the data stored in the analytical storage to process the queries.

Configuring Virtual Views

Data Virtuality Studio allows you to create views for queries that are complex and used frequently, so you do not have to retype the query every time a user wants to see the data. You can also use views to expose only the relevant data instead of everything the original source offers.

To save your query as a view, use the CREATE VIEW command.

=> CREATE VIEW <virtual schema name>.<view name> AS SELECT...

This is an example of a virtual view based on VMart tables from different data sources:

--Average inventory over time by product category

=> CREATE VIEW views.avg_inventory_by_product AS
SELECT
"VMartVertica.date_dimension".calendar_year AS "Year",
"VMartVertica.date_dimension".calendar_month_number_in_year AS "Month number",
"VMartVertica.date_dimension".Calendar_month_name AS "Month name",
"VMartSQLServer.product_dimension".category_description AS "Product category",
AVG("VMartVertica.inventory_fact".qty_in_stock) AS "Average quantity in stock"
FROM
"VMartVertica.inventory_fact"
INNER JOIN "VMartVertica.date_dimension"
ON "VMartVertica.inventory_fact".date_key = "VMartVertica.date_dimension".date_key
INNER JOIN "VMartOracle.warehouse_dimension"
ON "VMartVertica.inventory_fact".warehouse_key = "VMartOracle.warehouse_dimension".warehouse_key
INNER JOIN "VMartSQLServer.product_dimension"
ON "VMartVertica.inventory_fact".product_key = "VMartSQLServer.product_dimension".product_key AND
"VMartVertica.inventory_fact".product_version = "VMartSQLServer.product_dimension".product_version
WHERE
"VMartVertica.date_dimension".date >= '2003-01-01' AND "VMartVertica.date_dimension".date <= '2017-12-31' AND
"VMartSQLServer.product_dimension".discontinued_flag = 0
GROUP BY
1, 2, 3, 4
ORDER BY
1, 2;

Data Virtuality Server provides a default virtual schema named views to hold all virtual views, but you can create new virtual schemas if necessary. To see all virtual views, expand the appropriate virtual schema under the Virtual Schemas node in the Data Explorer.

Connecting Client Tools to Data Virtuality Server

After you add data sources and create virtual views, you can connect a front-end application to the Data Virtuality server using the JDBC or ODBC drivers supplied by Data Virtuality. For example, you can connect Microsoft Excel to the Data Virtuality server using the Data Virtuality ODBC driver to query and report data from multiple data sources. Excel is a 32-bit application, so you need to install the Data Virtuality 32-bit ODBC driver and create a 32-bit DSN.

Downloading Data Virtuality Drivers

You can download the Data Virtuality drivers in two ways:

  • Use Data Virtuality Studio to install the drivers on the client machine where the front-end application is running.
  • Download the driver that matches your client operating system from http://my_DV_Server:8080.

Installing and Configuring ODBC

To install the Data Virtuality ODBC driver, run the installer and follow the instructions. After you install the ODBC driver, create a system DSN using the Windows ODBC Administrator tool.

To configure the 64-bit ODBC driver, use the 64-bit version of the ODBC Administrator tool:

 %SystemRoot%\system32\odbcad32.exe

To configure the 32-bit ODBC driver, use the 32-bit version of the ODBC Administrator tool:

 %SystemRoot%\SysWOW64\odbcad32.exe

This example shows how to use the Data Virtuality 64-bit ODBC driver to create a 64-bit DSN.

DVSQLdriver.png

Parameters:

  • Data Source: An arbitrary DSN name.
  • Database: Must be datavirtuality.
  • Server: The IP address of the Data Virtuality server.
  • SSL Mode: disable.
  • Port: 35432.
  • User: The Data Virtuality Server user name.
  • Password: The Data Virtuality Server password.

Test the connection and then save the DSN. You can now use the DSN in a front-end tool to connect to the Data Virtuality server.
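From Python, for example, a front-end could reach the DSN through an ODBC bridge such as pyodbc (an assumption here; Data Virtuality does not ship it). A minimal sketch of assembling the connection string from the DSN name and the Data Virtuality credentials:

```python
def dv_odbc_conn_str(dsn, user, password):
    """Build a standard ODBC connection string (DSN, UID, PWD keywords)."""
    return f"DSN={dsn};UID={user};PWD={password}"

# "DataVirtuality64" is a hypothetical DSN name created as described above.
conn_str = dv_odbc_conn_str("DataVirtuality64", "admin", "admin")
# With pyodbc installed and the DSN configured, you would then connect with:
#   import pyodbc
#   conn = pyodbc.connect(conn_str)
```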

Installing and Configuring JDBC

Install the JDBC driver by placing the jar file datavirtuality-jdbc.jar in the external libraries folder of the front-end application. Then use the following JDBC URL in the front-end application, supplying the IP address or name of the machine where Data Virtuality Server is running:

 jdbc:datavirtuality:datavirtuality@mm://<my_DV_Server>:31000;SHOWPLAN=ON
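The URL's parts can be assembled from the host and port. The following Python sketch (illustrative only, not a Data Virtuality API) mirrors the format shown above:

```python
def dv_jdbc_url(host, port=31000, showplan=False):
    """Build the Data Virtuality JDBC URL for the given server."""
    url = f"jdbc:datavirtuality:datavirtuality@mm://{host}:{port}"
    if showplan:
        url += ";SHOWPLAN=ON"   # ask the server to return query plans
    return url

print(dv_jdbc_url("my_DV_Server", showplan=True))
```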

Accept Optimizations

Optimizations are query results that Data Virtuality stores in tables in the analytical storage. Data Virtuality recommends that you optimize queries based on how often the queries are issued.

When you accept optimizations, the query results in the source are transferred to the analytical storage where they are available for fast access. Data Virtuality Server automatically detects data that is fully or partially available in the analytical storage and knows when to use stored data instead of querying the original source.

Each optimization has a priority based on a color, as follows:

  • Red means that the query is issued very often. You should accept red-colored optimizations first.
  • Yellow means that the query is issued quite often.
  • Green means that the query is rarely issued.

To accept an optimization, go to Optimization View in Data Virtuality Studio. Right-click on the optimization and choose Enable and Run Optimizations. This action creates a table in the analytical storage with the name mat_table_xx_xx and redirects all subsequent queries to this table.

You can stop redirecting a query to the analytical storage by reverting an accepted optimization. Later, you can re-enable it by clicking Accept optimization. After re-enabling an optimization, be sure to run or schedule it to ensure that the analytical storage is kept up to date.

directory.png

Updating Data in the Analytical Storage

Data Virtuality does not automatically refresh the query results in the analytical storage. To ensure that the data in the analytical storage is kept up to date, you must rerun the optimizations. You can schedule optimizations to rerun periodically, based on how often the data changes at the source.

To refresh an optimization, click its name and select Run Optimization. To schedule periodic refreshes, select Add Schedule.

To specify the type of refresh, you must create a replication job. There are several types of replication jobs, including:

  • Complete replication (full load), which replaces the copy of the data in the storage with every refresh.
  • Up to history update (incremental load), which lets you keep track of new data and changes over time. This replication technique is also known as slowly changing dimensions.

Running Reports on Optimized Data

After you have accepted optimizations and scheduled refreshes, the data needed to populate reports is available in the analytical storage. Data Virtuality Server automatically recognizes that the data is fully or partially stored in the analytical storage and returns it from the tables in the analytical storage instead of the original sources.

Data Virtuality SQL Syntax

Because Data Virtuality is a data virtualization product, it does not use the Vertica SQL dialect directly. Instead, it uses a generic SQL dialect that is independent of the databases running underneath it. Most SQL constructs are available, but some of them work differently. For example, the to_hex function in Vertica is expressed as to_chars(value, 'HEX') in Data Virtuality:

--Original query using Vertica functions and syntax:

=> SELECT
  ValueDesc,
  to_hex(binary_column) AS binary_column,
  LENGTH(to_hex(binary_column)),
  SUBSTR(to_hex(binary8k_column), 0, 15) AS binary8k_column,
  LENGTH(to_hex(binary8k_column))
  FROM VERT_DATATYPE_v1_0_5_SCHEMA.Binary_Table;

Running this query with the to_hex function fails with this error:

queryerror.png

You must rewrite the query and express the to_hex function as to_chars using Data Virtuality syntax:

SELECT
  ValueDesc,
  TO_CHARS(binary_column, 'HEX') as binary_column,
  LENGTH(TO_CHARS(binary_Column, 'HEX')) as length_binary_column,
  SUBSTRING(TO_CHARS(binary8k_Column, 'HEX'), 0, 15) as binary8k_column,
  LENGTH(TO_CHARS(binary8k_Column, 'HEX')) AS length_binary8k_column
  FROM "VerticaDatatypesTest.Binary_Table";
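Both functions compute a hexadecimal rendering of the binary value. A quick Python check (illustrative only) shows the kind of output to expect:

```python
import binascii

raw = b"\x0f\xa1"                              # a sample binary value
hex_text = binascii.hexlify(raw).decode("ascii")
print(hex_text)          # two hex digits per input byte
print(len(hex_text))     # hex output is twice the binary length
```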

The following are examples of Data Virtuality SQL syntax for data type conversion:

  • To define a date literal explicitly, use the escape syntax {d '2003-01-01'}.
  • To convert a date to a string, use cast(datefield as string) or use the formatdate function, which is described in the reference guide and which accepts different patterns. For example: formatdate(date, 'YYYY-DD-MM')
  • Data Virtuality automatically converts string literals to the implied type. For example: SELECT * FROM my_table WHERE created_by = '2003-01-02'

Data Virtuality Documentation

Data Virtuality documentation is installed with Data Virtuality Server. To access the documentation, use the following URL, where my_dv_server is the IP address or host name of the computer where Data Virtuality Server is running:

http://my_dv_server:8080/

To access the documentation, supply the user name and password that you specified in the connection information for Data Virtuality Server.

Troubleshooting

The Data Virtuality log files, located in the Data Virtuality Server directory, are useful for troubleshooting.

The following log file (on Windows) contains information about queries, errors, and other actions performed by the Data Virtuality server:

%pathToDVserver%\standalone\log\boot.log 

For example:

C:\Program Files (x86)\datavirtuality\dvserver\standalone\log\boot.log

Known Limitations

Data Virtuality has the following limitations in its support for Vertica data types:

  • For the TIMETZ and TIMESTAMPTZ data types, the time zone offset is not displayed.
  • For the TIME data type, milliseconds are not displayed.
  • For the TIMETZ data type, milliseconds are rounded off.

For More Information

See the following resources: