
Vertica Integration with Talend: Tips and Techniques

For Vertica 7.0.x and 7.1.x

 

About Vertica Tips and Techniques

Vertica develops Tips and Techniques documents to provide you with the information you need to use Vertica with third-party products. This document provides guidance using one specific version of Vertica and one version of the vendor's software. While other combinations are likely to work, the specific versions you are using may not have been tested.

Overview

This document provides guidance for configuring Talend Open Studio to connect to Vertica. It covers integration with Talend Data Integration; it does not specifically cover any other products in the Talend Unified Platform. However, connectivity options for other Talend products should be similar to the options covered here.

Compatibility Matrix

The examples and recommendations in this guide were tested with Talend DI 5.5.1 and Vertica 7.0.x. The guidance provided in this document should also apply to earlier versions of both products.

The following table lists the Talend and Vertica version combinations that were successfully tested using the native Vertica connector provided by Talend.

Talend Data Integration   Vertica JDBC Driver   Vertica Server
5.2 MR4                   5.1.1                 5.1.1
5.2 MR4                   6.0.1                 6.0.1
5.2 MR4                   6.1.1                 6.1.1
5.4.1                     7.0.2                 7.0.2
5.4.1                     7.1.1                 7.1.1
5.5.1                     7.1.1                 7.1.1
5.6 MR2                   7.1.1                 7.1.1

Connecting Talend and HP Vertica

While you can use a generic JDBC driver for a Talend/Vertica connection, the best way to connect is using the Vertica JDBC driver. When you create a connection, Talend automatically downloads the version of the Vertica JDBC driver appropriate for your version of Vertica. After the download, you can find the driver here:  <Talend_Install_Dir>/lib/java

 

Setting Up a Connection

To create a connection using the Vertica JDBC driver, select Db Connections, enter the information for your database, and click Next.

image002.png

Upgrading Your JDBC Driver

Sometimes you need to upgrade your JDBC driver to get fixes available in a later version. In these instances, download a more current driver from my.vertica.com and replace the existing driver. The following example assumes that you need to replace the default Vertica driver provided by Talend with the version 6.1.3 driver downloaded from my.vertica.com.

  1. Exit Talend.
  2. To locate the .jar file containing your JDBC driver, search for vertica*jar in the directory where Talend is installed.
  3. Move the .jar file (vertica-jdk5-6.0.0-0.jar) to another location outside of your Talend directory. Be sure to move all instances of the .jar file.
  4. Copy the new .jar file (vertica-jdk5-6.1.3-0.jar) to all the locations where you found the original file.
  5. Rename the new jar file (vertica-jdk5-6.1.3-0.jar) you just downloaded from the Vertica web site to vertica-jdk5-6.0.0-0.jar.

When you restart Talend, the new version of the driver is active.

Vertica Client Driver/Server Compatibility

Usually, each version of the Vertica server is compatible with the previous versions of the client drivers. This compatibility lets you upgrade your Vertica server without having to immediately upgrade your client software. However, some enhancements in the new server version may not be available through the old drivers.

The following table summarizes the compatibility of each recent version of the client drivers with the Vertica server versions.

Vertica Client Driver Version   Compatible Vertica Server Versions
6.1.x                           6.1.x, 7.0.x, 7.1.x
7.0.x                           7.0.x, 7.1.x
7.1.x                           7.1.x


Only the first two digits in the version string matter for client/server compatibility. For example, any 7.0.x client driver can communicate with any 7.0.x server. 

For more information about client driver/server compatibility, see the Connecting to Vertica guide.

Installing Service Packs

Always install the latest service pack when upgrading your driver. Service packs contain important fixes and updates. (The third digit in the release number represents the service pack.) Check the my.vertica.com downloads page to see release notes for the latest service pack.

Configuring Talend for Use with Vertica

To configure Talend for use with Vertica, you must understand:

  • Using Talend Components for Vertica
  • Using the Talend SQL Builder
  • Enabling Parallelization in Talend

Using Talend Components for Vertica

Talend provides a number of Vertica-specific components that you can use to manage your ETL jobs. Understanding the ETL flow and loading guidelines can help you determine which components to use and how best to use them. For complete information about all Vertica components, see the Talend component documentation.

Understanding ETL Flow and Loading Guidelines

Understanding the three basic ETL flows helps you decide which Vertica component and options to use. This section addresses the following load types:

Type of Load                 COPY Option   Results
Small bulk load (<100 MB)    AUTO          Writes to WOS; spills to ROS when the WOS overflows.
Large bulk load              DIRECT        Writes to ROS; each commit becomes a new ROS container.
Incremental (trickle) load   TRICKLE       Writes to WOS; errors when the WOS overflows.
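
For reference, here is a minimal sketch of what the three load methods look like as SQL; the table names and file paths are placeholders, not taken from a tested job:

=> COPY talend.small_table FROM '/data/small.csv' DELIMITER '|' AUTO;     -- small bulk load, writes to WOS
=> COPY talend.large_fact FROM '/data/large.csv' DELIMITER '|' DIRECT;    -- large bulk load, writes to ROS
=> COPY talend.delta_table FROM '/data/delta.csv' DELIMITER '|' TRICKLE;  -- incremental load, writes to WOS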

Follow these guidelines when loading data into Vertica:

  • Use COPY to load large amounts of data. Using COPY helps to avoid fragmenting the WOS and doesn't require the overhead that using INSERT does.
  • If your row count is small (fewer than 1000 rows), use INSERT.
  • Load multiple streams on different nodes.

 If you are using the INSERT INTO...SELECT...FROM syntax in ETL jobs with large volumes, be sure to use the following syntax.

=> INSERT /*+direct*/ INTO table SELECT…

Commonly Used Vertica-Specific Talend Components

The following Talend 5.5.1 components are specifically for use with Vertica:

Component                Description
tVerticaInput            Extracts data from Vertica.
tVerticaBulkExec         Loads data from a file.
tVerticaOutputBulk       Writes data to a file.
tVerticaOutputBulkExec   Writes data to a file and then loads the data.
tVerticaOutput           Inserts or updates rows in a Vertica table.
tVerticaRow              Executes the specified SQL query against the Vertica database.

tVerticaInput

The tVerticaInput component allows you to extract data from Vertica. Give special attention to DATE and VARCHAR fields, as follows:

  • DATE fields—All DATE fields must be in MM-DD-YYYY format. Check your DATE fields and correct the format wherever necessary, as shown in the following dialog box.

image003.png

  • VARCHAR fields—Talend examines a sampling of the data when determining the size of a VARCHAR field. If the table contains large VARCHAR values, consider increasing the default column size.

Note  You can also use the tMap component to transform or remap the data type.

tVerticaOutput

tVerticaOutput provides options that you can change, depending on the size and type of the load. Specifically, change the Action on Data and Action on Table options as described in the following topics.

Basic Settings for Trickle or Incremental Loads (Writing to WOS)

Set the Action on Table and Action on Data options according to your use case. For trickle or incremental loads, set Action on Data to INSERT.

image004.png

Advanced Settings for Trickle or Incremental Loads (Writing to WOS)

When writing to the WOS, use the default values for Advanced Settings as follows.

image005.png

To enable fields for INSERT, UPDATE, and DELETE, select the Use field options checkbox. The default settings in the previous graphic generate the following SQL commands:

2014-09-16 17:10:59.679 Init Session:0x7f0a4800fe30-13000000009c0a8 [Txn] 
<INFO> Starting Commit: Txn: 13000000009c0a8 'INSERT INTO 
datatype_ss.VarChar_Table (KeyColumn,Varchar_Column,Varchar_Max_Column) 
VALUES (?,?,?)'

Note These default settings do not generate a COPY statement.

Basic Settings for Large, Bulk Loads (Writing to ROS)

For large, bulk loads, use the DIRECT syntax to write to ROS. To do so, you must set the Action on data option to COPY. 

image006.png

Advanced Settings for Large, Bulk Loads (writing to ROS)

When using COPY, use Advanced Settings to specify filenames and locations for exception files.

image007.png

The settings in the preceding graphic generate the following SQL:

2014-09-16 17:26:50.474 Init Session:0x7f0a4800fe30-13000000009c13e [Txn] 
<INFO> Rollback Txn: 13000000009c13e 'COPY datatype_ss.VarChar_Table 
(KeyColumn,Varchar_Column,Varchar_Max_Column) FROM local 
STDIN UNCOMPRESSED WITH DELIMITER ';' RECORD TERMINATOR 
'' ENCLOSED BY '|' NULL AS '' EXCEPTIONS 'C:/data/except_file.txt' 
REJECTED DATA 'C:/data/except_records.txt' DIRECT NO COMMIT'

tVerticaOutputBulkExec

The tVerticaOutputBulkExec component writes to a file and then copies the data using standard input (STDIN).

Basic Settings

image008.png

Advanced Settings

By default, tVerticaOutputBulkExec writes to the ROS.

image009.png

The settings in the preceding graphic result in the following SQL:

2014-09-16 17:46:14.526 Init Session:0x7f0a48010510-13000000009c1fe [Txn] 
<INFO> Starting Commit: Txn: 13000000009c1fe 'COPY 
datatype_ss.VarChar_Table FROM local STDIN DELIMITER ';' NULL 'null' 
DIRECT returnrejected'

 image010.png

tVerticaRow

The tVerticaRow component allows you to specify any valid Vertica SQL, including COPY statements. Use tVerticaRow to load data into Vertica flex tables, or for other scenarios that require Vertica structures that are not supported by the Vertica-specific Talend components.

Example: High-Performance COPY

In the following example, the source file is on the Vertica server, and the job uses the high-performance Vertica COPY (not COPY FROM LOCAL). Load data this way whenever the source files are on the Vertica cluster.

"COPY talend.store_target FROM '/home/dbadmin/store.csv' DELIMITER ';' 
NULL '' DIRECT;"
Example: Loading into Flex Tables

The next example shows how to run the Vertica flex example included in the Vertica package directory.

To create the flex table:

=> CREATE FLEX TABLE mountains();

To load data to the flex table:

=> COPY mountains FROM 
   '/opt/vertica/packages/flextable/examples/mountains.json' 
   PARSER FJSONPARSER();

To create the view of the flex table:

=> SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('mountains');
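
Once the keys are computed, you can query the view that the function builds. By default the view is named <table>_view, so for this example (assuming the default naming):

=> SELECT * FROM mountains_view;
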
Example:  Performing Vertica-to-Vertica COPY

The following example shows a Vertica-to-Vertica COPY:

"CONNECT TO VERTICA vmart USER dbadmin PASSWORD '' ON 
'VertTest01',5433;COPY customer_dimension FROM VERTICA 
vmart.customer_dimension DIRECT;"

You can also use the tVerticaRow component to copy from Vertica to Vertica when no transformations are required. To do so, you need to perform additional steps to define export addresses. Connecting to a public network requires some configuration. For information about using this statement to copy data across a public network, see Using Public and Private IP Networks in the Vertica documentation.

Using Generic Talend ELT Components with Vertica

The ELT family of components groups together database connectors and processing components for ELT mode, in which the target DBMS becomes the transformation engine. When possible, SQL statements are combined and processed in a single query on the database. The following graphic illustrates how Talend supports ELT with Vertica. This example uses the generic components for the entire job.

image011.png

When this job runs, Talend generates the following SQL:

'INSERT INTO store.profit_by_region(store_region, store_state, 
cost_dollar_amount, gross_profit_dollar_amount) 
(SELECT STORE.store_region, STORE.store_state,
SUM(SALE.cost_dollar_amount ), SUM(SALE.gross_profit_dollar_amount )
FROM store.store_dimension STORE INNER JOIN store.store_sales_fact
SALE ON( SALE.store_key = STORE.store_key ) group by STORE.store_region,
STORE.store_state ORDER BY STORE.store_region, STORE.store_state )'

Important As of 2014, Talend does not provide ELT components specific to Vertica. You can use the generic components, but be aware that the INSERT syntax they generate is not optimized for Vertica. As you can see from the preceding SQL, the /*+DIRECT*/ hint does not appear.
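
If the target table is large, one workaround is to run the equivalent statement yourself through tVerticaRow with the direct hint added. The following is a sketch based on the generated SQL above, not output produced by Talend:

=> INSERT /*+direct*/ INTO store.profit_by_region (store_region, store_state,
   cost_dollar_amount, gross_profit_dollar_amount)
   SELECT STORE.store_region, STORE.store_state,
   SUM(SALE.cost_dollar_amount), SUM(SALE.gross_profit_dollar_amount)
   FROM store.store_dimension STORE
   INNER JOIN store.store_sales_fact SALE ON (SALE.store_key = STORE.store_key)
   GROUP BY STORE.store_region, STORE.store_state
   ORDER BY STORE.store_region, STORE.store_state;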

Other Components for Bulk Loading

The tVerticaOutputBulk and tVerticaBulkExec components are generally used together in a two-step process: The first step generates an output file. In the second step, the output file is used in the INSERT operation that loads data into a database.

You can also choose to do both steps at once using the tVerticaOutputBulkExec component.  However, using tVerticaOutputBulk and tVerticaBulkExec allows the data to be transformed before it is loaded in the database.

Using the Talend SQL Builder

When using the SQL builder, be sure to include the schema name in the query:

=> SELECT * FROM store.store_dimension;

Enabling Parallelization in Talend

You can enable parallelized data flows. To do so, partition an input data flow of a Talend subjob into parallel processes and execute these processes simultaneously.    

Enabling Parallelization in the Talend Enterprise Edition

If you are using the Talend Enterprise Edition, you can enable or disable parallelization with a single click. Talend Studio then automates the implementation across a given job. For more information about this feature, see the Talend documentation.

Enabling Parallelization in the Talend Community Edition

If you are using the Talend Community Edition, add a WHERE clause to the original query to chunk the data.  This example results in four chunks.

original_sql + " and hash(" + primaryKey + ") % " + noOfThreads + " = " + i

Example:

=> SELECT if.* FROM inventory_fact if, warehouse_dimension wd WHERE
   if.warehouse_key=wd.warehouse_key

The preceding query chunks the SELECT operation into the following four queries:

=> SELECT if.* FROM inventory_fact if, warehouse_dimension wd WHERE
   if.warehouse_key=wd.warehouse_key AND HASH(product_key, date_key) % 4=0;
=> SELECT if.* FROM inventory_fact if, warehouse_dimension wd WHERE
   if.warehouse_key=wd.warehouse_key AND HASH(product_key, date_key) % 4=1;
=> SELECT if.* FROM inventory_fact if, warehouse_dimension wd WHERE
   if.warehouse_key=wd.warehouse_key AND HASH(product_key, date_key) % 4=2;
=> SELECT if.* FROM inventory_fact if, warehouse_dimension wd WHERE
   if.warehouse_key=wd.warehouse_key AND HASH(product_key, date_key) % 4=3;

Choose keys that have an equal distribution. For example, the two keys chosen in the preceding example (product_key, 235,164 values; date_key, 50,148 values) produce the following row counts per chunk:

chunk   count
0       7,497,028
1       7,501,441
2       7,500,008
3       7,501,523
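
One way to verify the distribution before building the job is to run the hash expression directly against the fact table. A sketch (4 is the number of chunks):

=> SELECT HASH(product_key, date_key) % 4 AS chunk, COUNT(*) AS count
   FROM inventory_fact
   GROUP BY 1
   ORDER BY 1;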

Vertica Tips and Techniques

Consider these modifications and best practices for using HP Vertica with Talend DI, described in the next sections:

  • Managing Vertica Resources
  • Managing Vertica Storage Containers
  • Monitoring Data Load
  • Validating Load Results
  • Exporting Data from the Source Database to Flat Files
  • Expected Results for Various Scenarios
  • Data Type Mappings
  • Using Vertica COPY
  • Loading Over the Network
  • Using Non-ASCII data
  • Internationalization

Managing Vertica Resources

To prevent query users from being impacted by ETL jobs, create a separate resource pool for Talend.  Your resource pool settings should depend on the amount of memory on your machine and how many other resource pools are created for other users. The following example creates a resource pool for the Talend user:

=> CREATE RESOURCE POOL batch_pool MEMORYSIZE '4G' 
MAXMEMORYSIZE '84G' MAXCONCURRENCY 36;
=> DROP USER talend;
=> DROP SCHEMA talend;
=> CREATE SCHEMA IF NOT EXISTS talend;
=> CREATE USER talend IDENTIFIED BY 'talend@pwd' SEARCH_PATH talend;
=> GRANT USAGE ON SCHEMA talend TO talend;
=> GRANT USAGE ON SCHEMA PUBLIC TO talend;
=> GRANT USAGE ON SCHEMA online_sales TO talend;
=> GRANT USAGE ON SCHEMA store TO talend;
=> GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO talend;
=> GRANT SELECT ON ALL TABLES IN SCHEMA store TO talend;
=> GRANT SELECT ON ALL TABLES IN SCHEMA online_sales TO talend;
=> GRANT CREATE ON SCHEMA talend TO talend;
=> GRANT USAGE ON RESOURCE POOL batch_pool TO talend;
=> ALTER USER talend RESOURCE POOL batch_pool; 
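
To verify that the pool and the user assignment took effect, you can query the system tables. A sketch, assuming the standard v_catalog.users and v_monitor.resource_pool_status tables:

=> SELECT user_name, resource_pool FROM v_catalog.users WHERE user_name = 'talend';
=> SELECT * FROM v_monitor.resource_pool_status WHERE pool_name = 'batch_pool';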

Managing Vertica Storage Containers

Vertica supports INSERT, UPDATE, DELETE, bulk load operations (COPY), and other queries in a typical data warehouse workload. The storage model consists of three elements that operate identically on each Vertica node:

  • Write Optimized Store (WOS) is a memory-resident data structure for storing INSERT, UPDATE, DELETE, and COPY (without /*+DIRECT*/ hints) actions. To support very fast data load speeds, the WOS stores records without data compression or indexing. The WOS organizes data by epoch and holds both committed and uncommitted transaction data.
  • Read Optimized Store (ROS) is a highly optimized, read-oriented, disk storage structure. The ROS makes heavy use of compression and indexing. You can use the COPY...DIRECT and INSERT (with /*+DIRECT*/ hints) statements to load data directly into the ROS.
  • The Tuple Mover (TM) is the Vertica database optimizer component that moves data from memory (WOS) to disk (ROS). The Tuple Mover runs in the background, performing some tasks automatically at time intervals determined by its configuration parameters.

For more information on Vertica ROS and WOS, see Loading Data into the Database in Best Practices for OEM Customers in the Vertica documentation.
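
To see how data is currently distributed between the WOS and ROS, you can query the storage system tables. A sketch, assuming the v_monitor.storage_containers table and its node_name, storage_type, and used_bytes columns:

=> SELECT node_name, storage_type, COUNT(*) AS containers, SUM(used_bytes) AS bytes
   FROM v_monitor.storage_containers
   GROUP BY node_name, storage_type
   ORDER BY node_name, storage_type;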

Managing Vertica ROS Containers

Vertica creates ROS containers:

  • With every moveout
  • Each time COPY DIRECT is executed
  • Whenever a table is partitioned

ROS containers consume system resources, so their number has a direct effect on system performance. A large number of ROS containers can degrade performance, while too few ROS containers can prevent the system from taking advantage of inherent parallelism.  Too many ROS containers can also result in errors.

For best performance, Vertica recommends:

  • 10 ROS containers per projection when not loading data
  • Up to 20 ROS containers per projection during data load
  • No more than 50 ROS containers per projection

ROS Container Errors

ROS container exceptions can occur if you accidentally target frequent, small loads directly to the ROS instead of the WOS. This situation occurs if the batch load API does not correctly convert INSERT statements into COPY statements. (Check the Vertica logs to see whether this is the case.)

ERROR:  Too many ROS containers exist for the following projections:
customer_100061.ark_DBD_1_seg_ptoole2_ptoole2 (limit = 1,000,000, 
ROS files = 999673, DV files = 0, new files = 479)
HINT:  Please wait for the tuple mover to catch up.  Use 'select * from
v_monitor.tuple_mover_operations;' to monitor.
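
To see which projections are approaching the container limit, a query along the following lines can help (again assuming the v_monitor.storage_containers system table):

=> SELECT projection_name, COUNT(*) AS ros_containers
   FROM v_monitor.storage_containers
   WHERE storage_type = 'ROS'
   GROUP BY projection_name
   ORDER BY ros_containers DESC
   LIMIT 10;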

Best Practices for Loading and Updating Data

To avoid creating too many or too few ROS containers, follow these recommendations:

  • Load and delete data less frequently in larger batches.
  • If you are working with many small files of raw data, use a single COPY statement to load these files at the same time. Use the wildcard support to COPY from all files, or concatenate all files to a named pipe and have COPY load from the pipe (see the sketch after this list).
  • Use a single DELETE or UPDATE statement to delete or update batches of rows at a time. Working in batches avoids creating too many delete vector ROS containers.
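
A sketch of both approaches; the table name, file location, and predicate are placeholders:

=> COPY talend.raw_events FROM '/data/incoming/*.csv' DELIMITER '|' DIRECT;
=> DELETE FROM talend.raw_events WHERE event_date < '2014-01-01';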

Managing Vertica WOS Containers

The WOS can use a maximum of 25% of the physical memory on each node. Vertica issues a WOS overflow error when you reach or exceed this limit. For example, if you have up to 1 GB available for your WOS, you'll receive a WOS Overflow error if you fill the 1 GB before the data can be processed by the tuple mover's moveout operation.

Monitoring Data Load

You can monitor the progress of running jobs using either standard SQL statements or the Vertica Management Console.

Using SQL to Monitor Load Streams

You can use the following SQL statements to view how your data load is progressing:

\echo ..Load streams
SELECT
table_name AS table,
stream_name AS stream,
TO_CHAR(DATE_TRUNC('second', current_timestamp - 
load_start::timestamp), 'hh24:mi:ss') AS run_time,
TO_CHAR(load_start::timestamp, 'yyyy-mm-dd hh24:mi') AS load_start,
TO_CHAR(accepted_row_count, '999,999,999,999') AS accepted,
TO_CHAR(rejected_row_count, '999,999,999,999') AS rejected,
TO_CHAR(unsorted_row_count, '999,999,999,999') AS unsorted,
TO_CHAR(sorted_row_count, '999,999,999,999') AS sorted,
sort_complete_percent AS sort_pct,
TO_CHAR(accepted_row_count/extract(EPOCH FROM current_timestamp - 
load_start::timestamp), '999,999') AS rps,
TO_CHAR(sorted_row_count/extract(EPOCH FROM current_timestamp - 
load_start::timestamp), '999,999') AS sort_rps
FROM load_streams ls;
--
--
\echo ..Load Totals
SELECT
LPAD(COUNT(DISTINCT table_name)::CHAR, MAX(LENGTH(table_name))) 
AS tables, 
LPAD(COUNT(DISTINCT stream_name)::CHAR, MAX(LENGTH(stream_name)))
AS streams,
TO_CHAR(DATE_TRUNC('second', current_timestamp - 
MIN(load_start::timestamp)), 'hh24:mi:ss') AS run_time,
TO_CHAR(MIN(load_start), 'yyyy-mm-dd hh24:mi') as load_start,
TO_CHAR(SUM(accepted_row_count), '999,999,999,999') AS accepted,
TO_CHAR(SUM(rejected_row_count), '999,999,999,999') AS rejected,
TO_CHAR(SUM(unsorted_row_count), '999,999,999,999') AS unsorted,
TO_CHAR(SUM(sorted_row_count), '999,999,999,999') AS sorted,
SUM(sort_complete_percent) AS sort_pct,
TO_CHAR(SUM(accepted_row_count)/extract(EPOCH FROM current_timestamp - 
MIN(load_start::timestamp)), '999,999') AS rps
FROM load_streams ls;

Using Management Console to Monitor Running Jobs

You can use the Vertica Management Console to better understand resource usage while jobs are executing. The following graphic shows executing COPY statements and the resources that the ETL job is consuming. Below the list of statements, charts display the resources consumed by the batch load job that is running.

image013.png

Downloading and Installing Vertica Management Console

You can download the version of Management Console you need from my.vertica.com. Follow the installation instructions in Installing and Configuring Management Console in the Vertica documentation.

   

Additional Information on Management Console

See the HP Vertica Core documentation.

Using Linux to Monitor Disk and Resource Usage

Use the following Linux commands to monitor disk and resource usage:

  • df—Use df to monitor the disk so that enough space remains available.
  • vmstat—Use vmstat to monitor CPU usage at specified intervals. The bi (blocks in) and bo (blocks out) columns are reported in kilobytes per second.

The maximum values observed for the bi + bo columns provide an estimate of the disk bandwidth the load is using. The si (swap in) and so (swap out) columns report swapping I/O, also in kilobytes per second. If the swapping I/O is significant over many minutes (si + so exceeds 20% of the maximum observed bi + bo, stealing up to 20% of the disk bandwidth), especially during periods of high block I/O, the system is under stress. In that case, reduce the parallelism of the load at the earliest opportunity by reducing the number of load streams after the COPY for a chunk finishes.

For example, suppose you run vmstat 60 to obtain a report every 60 seconds. During the load, you see high CPU usage in bursts for the sequence of memory sorts in the first part of chunk processing. During the rest of the process, you observe high block I/O (vmstat columns bi and bo). The system should not become idle until the load is done, where "idle" means low CPU use and low bi + bo.

You can also use sar and sar -r.

Validating Load Results

Vertica does not enforce constraints at data load. Constraints are enforced when data is loaded into a table with a pre-joined dimension, or when the table is joined to a dimension table during a query. Therefore, you could experience constraint errors in any of these scenarios:

  • There is not exactly one dimension row that matches each foreign key value.
  • An inner join query is processed. (Vertica treats an outer join as an inner join because of the foreign key constraint.)
  • A new pre-join projection anchored on the table with the foreign key constraint is refreshed.

Note If you are testing your queries for performance, you may not see the full query speed right away. Delays in performance can occur because Vertica can be busy reorganizing the data for several hours. How long data reorganization takes depends on how much data was loaded and the type of load.
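
You can also check for violations yourself after a load by using the ANALYZE_CONSTRAINTS function, which reports rows that violate declared constraints. For example, using a table from the earlier examples:

=> SELECT ANALYZE_CONSTRAINTS('store.store_dimension');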

Exporting Data from the Source Database to Flat Files

If, for any reason, you cannot connect directly to the source database, you may need to export data from the source system to flat files. Before doing so, be aware of these considerations:

  • Smaller tables generally fit into a single load file. Split any large tables into 250–500 GB load files. For example, a 10 TB fact table requires 20–40 load files to maintain performance.
  • The default delimiter for the COPY statement is a vertical bar (|). Before loading your data, verify that no CHAR(N) or VARCHAR(N) data values include this delimiter character. To test for the existence of a specific character in a column, use a query like the following:

    => SELECT COUNT(*) FROM T WHERE X LIKE '%|%'
    If only a few rows contain |, you can eliminate them from the load file using a WHERE clause. Then, load the rows separately using a different delimiter.
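
    A sketch of that second load, assuming the problem rows were exported to a separate tab-delimited file (the path is a placeholder):

    => COPY T FROM '/data/rows_with_pipes.txt' DELIMITER E'\t' DIRECT;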

Expected Results for Various Scenarios

Depending on the type of SQL insert generated by Talend, mapping exceptions can have varying results:

CHAR field with n characters mapped to a CHAR column with fewer than n characters

  • HP Vertica COPY LOCAL: The record is inserted with silent truncation.
  • SQL INSERT: The record is not inserted; a warning message is issued.

EN (Edited Numeric) field mapped to an INTEGER column with data that forces numeric overflow

  • HP Vertica COPY LOCAL: The record is rejected and sent to the rejected records log file.
  • SQL INSERT: 0 is silently loaded in place of the number that would cause the overflow.

EN field mapped to a NUMERIC column with data that exceeds the scale

  • HP Vertica COPY LOCAL: The record is rejected and sent to the rejected records log file.
  • SQL INSERT: The record with the value that exceeds the scale is silently rejected.

Data Type Mappings

Talend can generate Vertica DDL automatically when moving data from a non-Vertica database to Vertica. However, you should have a solid understanding of the differences between the two databases. For instance, Oracle supports the NVARCHAR data type, and Vertica does not.

UTF-8 Considerations

Sometimes, the source system contains multi-byte character data. In such cases, when you convert the DDL from the source to the Vertica target, increase the size of the column by at least a factor of 3 to allow for multi-byte UTF-8 characters (Vertica column lengths are specified in bytes). For example, change an NVARCHAR(4) source column to VARCHAR(12) in Vertica.

Vertica and Oracle Data Type Mappings

Oracle uses proprietary data types for all main data types (for example, VARCHAR, INTEGER, FLOAT, and DATE). When migrating to HP Vertica, follow these guidelines:

  • If your data is multilingual, convert your schema before migrating. Converting your schema before migrating can minimize errors and reduce the time spent fixing data issues.
  • You must convert the non-standard NUMBER data type to the SQL-standard INT or INTEGER.
  • Include the necessary foreign key clauses in the table definitions or in separate ALTER TABLE T ADD CONSTRAINT … statements (see the example after this list). Foreign key constraints play an important role in guiding the Vertica Database Designer in its work.
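
A sketch of such a constraint, using the store schema tables from the earlier ELT example:

=> ALTER TABLE store.store_sales_fact
   ADD CONSTRAINT fk_store_key FOREIGN KEY (store_key)
   REFERENCES store.store_dimension (store_key);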

For a table comparing Oracle and Vertica data types, see Data Type Mappings Between Vertica and Oracle in the Vertica product documentation.

Using Vertica COPY

The COPY statement bulk loads data into a Vertica database. You can load one or more files, or use pipes on a cluster host. You can also load directly from a client system by using the COPY statement with its FROM LOCAL option.

Raw input data must be in UTF-8, delimited text format. Vertica compresses and encodes the data for efficient storage.

For more details on the COPY statement, see COPY in the Vertica product documentation.
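
As a concrete illustration, a client-side load of the store file used earlier in this document might look like the following; the path is a placeholder:

=> COPY talend.store_target FROM LOCAL 'C:/data/store.csv' DELIMITER ';' NULL '' DIRECT;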

Loading Over the Network

A 1 Gbps (gigabits per second) network can deliver about 50 MB per second, or 180 GB per hour. Vertica can load about 30–50 GB per hour per node for a projection design with K-safety 1. Therefore, use a dedicated 1 Gbps LAN. A slower LAN loads data proportionally more slowly.

Do not load data across an external network, because the delays over distance slow down the TCP protocol to a small fraction of its available bandwidth, even without competing traffic.

Note The actual load rates you obtain can be higher or lower. These rates depend on the properties of the data, number of columns, number of projections, and hardware and network speeds. You can improve load speeds further by using multiple parallel streams.
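
One way to use multiple parallel streams is to place source files on different nodes and name them in a single COPY statement. A sketch, with placeholder file paths and node names from the VMart example database:

=> COPY store.store_sales_fact
   FROM '/data/sales_part1.csv' ON v_vmart_node0001,
        '/data/sales_part2.csv' ON v_vmart_node0002
   DELIMITER '|' DIRECT;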

Using Non-ASCII data

Vertica stores data in the UTF-8 compressed encoding of Unicode. The resulting UTF-8 codes are identical to ASCII codes for the ASCII characters (codes 0 to 127 in one byte). If your table data (CHAR columns) is all ASCII, it should be easy to transport, since all current operating systems treat ASCII the same way. If you have UTF-8 data, be sure that the extraction method you use does not convert CHAR column values to the current (non-UTF-8) locale of the source system.

On most UNIX systems, you can use the locale command to see the current locale. You can change the locale for a session by setting the LANG environment variable to en_US.UTF-8.

Sometimes, data actively uses the non-ASCII characters of Latin-1, such as the euro sign (€) and the diacritical marks of many European languages. If you have data in another character encoding such as Latin-1 (ISO 8859), you must convert it to UTF-8. Use the Linux tool iconv to perform the necessary conversions. Large fact tables are unlikely to contain non-ASCII characters, so these conversions are usually needed only for data in smaller dimension tables.

Internationalization

Vertica supports Unicode character encoding and locales. 

Unicode Character Encoding: UTF-8 (8-bit UCS/Unicode Transformation Format)

All input data received by the database server must be in UTF-8 format, and all data output by Vertica is in UTF-8. The ODBC API operates on data in UCS-2 on Windows systems, and in UTF-8 on Linux systems. (A UTF-16 ODBC driver is available for use with the DataDirect ODBC manager.)

JDBC and ADO.NET APIs operate on data in UTF-16. The Vertica client drivers automatically convert data to and from UTF-8 when sending to and receiving data from Vertica using API calls. The drivers do not transform data loaded by executing a COPY or COPY LOCAL statement.

Locales

The locale parameter defines the user's language, country, and any special variant preferences such as collation. Vertica uses the locale to determine the behavior of various string functions and for collation for various SQL commands that require ordering and comparison, such as GROUP BY, ORDER BY, joins, the analytic ORDER BY clause, and so forth. 

By default, the locale for the database is en_US@collation=binary (English US). You can establish a new default locale that is used for all sessions on the database, and you can override individual sessions with different locales. You can also set the locale through ODBC, JDBC, and ADO.NET.
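
For example, to check the current locale and override it for the current session (the locale value is only an example):

=> SHOW LOCALE;
=> SET LOCALE TO 'en_GB';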

Be aware of the following when working with locales:

  • Projections are always collated using the en_US@collation=binary collation, regardless of the session collation. Any locale-specific collation is applied at query run time.
  • The maximum length parameter for the VARCHAR and CHAR data types refers to the number of octets (bytes) that can be stored in that field, not the number of characters. When using multi-byte UTF-8 characters, size the fields to accommodate from 1 to 4 bytes per character, depending on the data.

When the locale is non-binary, you can use the collation function to transform the input to a binary string that sorts in the proper order. This transformation increases the number of bytes required for the input according to this formula (CollationExpansion defaults to 5):

result_column_width = input_octet_width * CollationExpansion + 4 

For more information, see About Locale in the product documentation.

Known Issues

This section describes known issues in Vertica/Talend connections. This list was last updated on 12/9/14.

Case for Schema Names

Schema names in Vertica are case-insensitive. However, in Talend, you must provide the exact case of the schema name.

Schema Name not Provided with Generic JDBC Connection

The Generic JDBC Connection does not qualify the table name with the schema name; therefore, jobs that use this connection fail in Talend.

Vertica Data Type Mapping and Min-Max Values

Testing has discovered several data-type transfer issues. In the following situations, all the rows are transferred, but data precision is lost for extremely large values. The loss of precision occurs for the INTEGER, DOUBLE, and DECIMAL data types. For more information, see the Talend known issue at https://jira.talendforge.org/browse/TDI-29446.

To ensure the values described in this issue transfer correctly, map FLOAT to DOUBLE.

Loss of precision also occurs with INTEGER. By default, Talend maps Vertica INTEGER data types to INTEGER. If you change the Talend type to LONG, the mapping changes to BIGINT and the values transfer accurately.

Editing Data Type Mappings

To edit the mappings in Talend Open Studio, navigate to Window > Preferences > Talend > Specific Settings.

You can find the default mappings file (mapping_vertica.xml) for Vertica in the following folders:

Linux:

<Talend_Install_Dir>/configuration/org.eclipse.osgi/bundles/1707/1/.cp/mappings

Windows:

C:\unzipped\talend_di\TOS_DI-Win32-r118616-V5.5.1\configuration\org.eclipse.osgi\bundles\1707\1\.cp\mappings

Before you make changes to the mapping_vertica.xml file, contact Talend support for guidance. For more information, see Changing the default data type mapping.
