Vertica Integration with Syncsort DMExpress: Tips and Techniques

For Vertica 6.0.x to 7.1.x

About Vertica Tips and Techniques

Vertica develops Tips and Techniques documents to provide you with the information you need to use Vertica with third-party products. This document provides guidance using one specific version of Vertica and one version of the vendor's software. While other combinations are likely to work, they may not have been tested. 

Overview

This document assumes that the reader is familiar with both Syncsort and Vertica. It provides best practices for configuring Syncsort DMExpress (DMX) to connect to and load data into your Vertica database.

The recommendations in this document are based on testing Syncsort DMExpress 7.x and Vertica 6.0 and 7.1.

Connecting Vertica and Syncsort DMExpress

There are several ways to connect to a Vertica database from Syncsort DMExpress.

Effective with Vertica 7.0, DMExpress 7.13 establishes parallel connections using the COPY LOCAL statement. This capability provides optimal performance, ease of use, and dynamic tuning. 

Connecting to Vertica Using vsql

Prior to DMExpress 7.13, using vsql was the preferred way to connect to Vertica. However, this method required shell scripts. You can still use vsql to connect, but this option requires more setup, which this document does not describe. Syncsort has customized the default Vertica ODBC connector to leverage parallel COPY operations and automatic source partitioning, providing very fast loads without having to split up the source data.

Connecting to Vertica Using ODBC

DMExpress supports both Windows and Linux environments. The following dialog box shows how to connect to Vertica using the Windows implementation of ODBC:

image002.png

If you are creating a connection in the Windows ODBC and your deployment environment is Linux, verify that the ODBC DSN names match. For examples of ODBC configuration files, see Using ODBC Configuration Files in this document.

For additional details on setup and configuration, refer to the DMExpress Help that ships with the DMExpress software.

Loading Data into Vertica

When connecting to a Vertica database using ODBC, DMExpress uses different load methods, depending on which Vertica version you are connecting to. The following table applies to DMExpress 7.13 and later.

Vertica Version     Load Method

Prior to 6.0        Single-stream INSERT via ODBC on both Windows and Linux

6.x                 • Linux: Multi-stream COPY LOCAL via ODBC
                    • Windows: Multi-stream INSERT via ODBC

7.x and later       Multi-stream Vertica COPY LOCAL via ODBC on both Windows and Linux

This document focuses on loading data into Vertica using Syncsort DMExpress 7.13.

Using Vertica COPY

The COPY statement bulk loads data into a Vertica database. You can initiate loading one or more files or pipes on a cluster host. Alternatively, you can load directly from a client system using the COPY statement with its FROM LOCAL option. As of DMExpress 7.13, Syncsort DMExpress generates only the COPY LOCAL option.

Raw input data must be in UTF-8, delimited text format. Data is compressed and encoded for efficient storage. 

For more information about the COPY command, see COPY in the Vertica product documentation.    

Sample COPY Statements

The following log excerpts show Vertica COPY LOCAL statements generated by DMExpress. The first example loads the public.MXR_HTTP_Access1 table from STDIN; the second loads the ss_tpch.H_CUSTOMER table from a local file, with exception and rejected-data logs specified.

2012-09-18 09:44:28.445 Init Session:0x2aaaf40c2540 [Session] <INFO> [Query] TX:a00000000ae9d8(engvmco141-29267:0x2321ab) COPY public.MXR_HTTP_Access1 ( ID, DstIp, msisdn, starttime, method, server, url_path, version, status, RxBytes, TxBytes, referrer, content_type, content_length, user_agent, DstPort, device, Date_Key, Time_Key, content_type1, domain, domain_key, contenttype_key ) FROM LOCAL STDIN NATIVE VARCHAR ENFORCELENGTH RETURNREJECTED AUTO

2012-09-18 09:44:28.623 nameless:0x1f9f5490 [EE] <INFO> COPY: Load 2114 rows and Rejected 0 rows from inputfile:  Txn: a00000000ae9d8 stream name: MXR_HTTP_Access1-57174020

2014-11-12 11:30:29.238 Init Session:0x7f213800fa40-c000000019d7fd [Txn] <INFO> Starting Commit: Txn: c000000019d7fd 'COPY "ss_tpch"."H_CUSTOMER" ( "C_CUSTKEY", "C_NAME", "C_ADDRESS", "C_NATIONKEY", "C_PHONE", "C_ACCTBAL", "C_MKTSEGMENT", "C_COMMENT" ) FROM LOCAL '/tmp/dmexpress_2014.11.12.11.30.04_1396929832_18486_139910686074624_218785883.dat' EXCEPTIONS '/usr/tmp/dmexpress_ss_tpch.H_CUSTOMER_2014.11.12.11.30.04_exceptions_18486_139910686074624_1424127860.log' REJECTED DATA '/usr/tmp/dmexpress_ss_tpch.H_CUSTOMER_2014.11.12.11.30.04_rejectedRecords_18486_139910686074624_1778448321.log' NULL 'NULL' DELIMITER '|' ESCAPE '\''

Parallel Loading

As of Vertica 7, when the target Vertica cluster consists of multiple multi-core machines, DMExpress uses an asynchronous multi-stream approach to improve load performance.

DMExpress dynamically adjusts the number of parallel streams and balances the load among the streams. How DMX performs this work depends on available ETL and Vertica server resources, as follows:

  • The number of streams increases according to the speed at which DMExpress produces data, subject to a maximum.
  • The faster Vertica nodes receive more data than the slower nodes.
  • The maximum number of connections that DMExpress attempts to use for loading data is 75% of the average number of cores per node in the Vertica cluster.
  • When loading data into Vertica from Hadoop, DMExpress further divides the calculated maximum by the number of reducers. It is up to the user to configure the system so that each reducer gets at least one connection.
  • The number of parallel connections can exceed the maximum resource limit of the Vertica server, typically determined by the Vertica PLANNEDCONCURRENCY parameter. If this situation occurs, DMExpress stops creating new connections, finishes the load as is, and issues an error that indicates that the resource limit was exceeded.
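If you suspect that a load is approaching this limit, you can inspect the concurrency settings of the target resource pool directly in Vertica. The following query is a minimal sketch; the pool name batch_pool is an assumption taken from the resource pool example later in this document, and the connection options are illustrative:

# Inspect concurrency limits on the target resource pool (pool name is an example).
vsql -U dbadmin -d vmart -c "SELECT name, plannedconcurrency, maxconcurrency FROM resource_pools WHERE name = 'batch_pool';"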

Vertica Bulk Loading

Starting with DMExpress 7.13, Syncsort automatically chunks the source data and initiates multiple COPY operations over an ODBC connection. No add-ons or plug-ins are required. The following figure describes the bulk-loading differences between DMX 7.11 and 7.13 and includes an illustration of how parallel loading works.

image003.png

Most data loads into Vertica go through ODBC COPY LOCAL. The only scenario where ODBC INSERT is still used is when the user requests that a COMMIT be performed every n records instead of at the end of the load.

Bulk Loading Using vsql

If you are working in a Linux environment and want to leverage the Vertica vsql client for bulk loading, consider the following example.

Load the data using the DMExpress database load utility. Using the native vsql COPY statement, the DMExpress database load utility scales a Vertica database load horizontally and vertically by distributing the workload across multiple initiator nodes in the cluster. Multiple parallel COPY operations to multiple initiator nodes load data faster than a single COPY to one initiator node, improving the performance of the data load.

When the ETL project consists of one large input file that must be loaded into a Vertica database, initiate the DMExpress runLargeFileLoader job initiator script. The runLargeFileLoader script performs the following tasks:

  • Executes the configuration file, LargeFileLoader.env.
  • Submits a DMExpress partition job, which partitions the input and places the partitions on named pipes.
  • For each partition, submits an instance of the DMExpress transformation and load job that runs in parallel with other instances of the DMExpress transformation and load job.
  • Calls the DMExpress transformation task, which converts and reformats the data on the named pipes into standard input.
  • Calls the custom load task, which initiates the database load shell script.

The following figure illustrates the steps that DMExpress takes to parallelize and execute bulk loading.

image004.png

Loading Data over the Network

A 1 Gbps (gigabits per second) network can deliver about 50 MB/s, or 180 GB/hr. Vertica can load about 30–50 GB/hour/node for a K-safety = 1 projection design. Therefore, you should use a dedicated 1 Gbps LAN. Using a LAN with performance less than 1 Gbps will be proportionally slower.
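For example, a three-node cluster loading at 40 GB/hour/node can ingest roughly 120 GB/hour, which fits within the approximately 180 GB/hr that a dedicated 1 Gbps LAN can deliver.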

Do not load data across an external network. The delays over distance slow down the TCP protocol to a small fraction of its available bandwidth, even without competing traffic.

The actual load rates that you experience can be higher or lower, depending on the properties of the data, the number of columns, the number of projections, and the hardware and network speeds. Load speeds can be further improved by using multiple parallel streams.

Loading Data to Ephemeral Nodes

Syncsort designed DMExpress to be aware of ephemeral nodes in a Vertica cluster. You can configure DMExpress jobs to load only to ephemeral nodes. Alternatively, you can configure DMExpress to avoid loading data to ephemeral nodes.

To specify whether DMExpress loads data to ephemeral nodes, to non-ephemeral nodes, or both, set the following options in the DMExpress configuration file, DMXVertica.ini:

 UseEphemeralNodesForLoading = yes/no
 UseNonEphemeralNodesForLoading = yes/no

To set the path to DMXVertica.ini, define the DMExpress environment variable DMX_VERTICA_INI_FILE. If you do not define the DMX_VERTICA_INI_FILE environment variable, DMExpress loads data to non-ephemeral nodes by default.  For more details about DMX_VERTICA_INI_FILE, refer to the DMExpress help.
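The following shell sketch shows one way to create the configuration file and point DMExpress to it. The file path is an assumption; any location that the DMExpress user can read works:

# Create a DMXVertica.ini file (path is an example) and point DMExpress to it.
cat > /home/dmxuser/DMXVertica.ini <<EOF
UseEphemeralNodesForLoading = no
UseNonEphemeralNodesForLoading = yes
EOF
export DMX_VERTICA_INI_FILE=/home/dmxuser/DMXVertica.ini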

Monitoring the Data Loads

There are two ways to see the progress of running data load jobs:

  • Running SQL statements
  • Using the Vertica Management Console

Using SQL to Monitor Data Loads

Use the following SQL statements to monitor the progress of your data load:

\echo ..Load streams
SELECT
table_name AS table,
stream_name AS stream,
TO_CHAR(DATE_TRUNC('second', current_timestamp - load_start::TIMESTAMP), 'hh24:mi:ss') AS run_time,
TO_CHAR(load_start::TIMESTAMP, 'yyyy-mm-dd hh24:mi') AS load_start,
TO_CHAR(accepted_row_count, '999,999,999,999') AS accepted,
TO_CHAR(rejected_row_count, '999,999,999,999') AS rejected,
TO_CHAR(unsorted_row_count, '999,999,999,999') AS unsorted,
TO_CHAR(sorted_row_count, '999,999,999,999') AS sorted,
sort_complete_percent AS sort_pct,
TO_CHAR(accepted_row_count/EXTRACT(epoch FROM current_timestamp - load_start::TIMESTAMP), '999,999') AS rps,
TO_CHAR(sorted_row_count/EXTRACT(epoch FROM current_timestamp - load_start::TIMESTAMP), '999,999') AS sort_rps
FROM load_streams ls;
--
\echo ..Load Totals
SELECT
LPAD(COUNT(DISTINCT table_name)::CHAR, MAX(LENGTH(table_name))) AS tables,
LPAD(COUNT(DISTINCT stream_name)::CHAR, MAX(LENGTH(stream_name))) AS streams,
TO_CHAR(DATE_TRUNC('second', current_timestamp - MIN(load_start::TIMESTAMP)), 'hh24:mi:ss') AS run_time,
TO_CHAR(MIN(load_start), 'yyyy-mm-dd hh24:mi') AS load_start,
TO_CHAR(SUM(accepted_row_count), '999,999,999,999') AS accepted,
TO_CHAR(SUM(rejected_row_count), '999,999,999,999') AS rejected,
TO_CHAR(SUM(unsorted_row_count), '999,999,999,999') AS unsorted,
TO_CHAR(SUM(sorted_row_count), '999,999,999,999') AS sorted,
SUM(sort_complete_percent) AS sort_pct,
TO_CHAR(SUM(accepted_row_count)/EXTRACT(epoch FROM current_timestamp - MIN(load_start::TIMESTAMP)), '999,999') AS rps
FROM load_streams ls;

Using Management Console to Monitor Data Loads

Use the Vertica Management Console (MC) to better understand resource usage of executing jobs. The following figure shows various running COPY statements and the resources consumed by an ETL job. Below the list of statements, sortable charts display the resources consumed by each running batch load job.

image005.png

Downloading and Installing Management Console

Download the appropriate version of Management Console from http://vertica.com. Follow the instructions in Installing Vertica.

After the installation, you can access Management Console at https://xx.xx.xx.xx:5450/, where xx.xx.xx.xx is the IP address or host name where Management Console is installed.

For more information about Management Console, see the Vertica product documentation.

Validating Load Results

Vertica does not enforce constraints at load time. Constraints are enforced when:

  • Data is loaded into a table with a pre-join dimension.
  • The table is joined to a dimension table during a query.

You may experience constraint errors in any of these scenarios:

  • There is not exactly one dimension row that matches each foreign key value.
  • An inner join query is processed. (An outer join is treated as an inner join because of the foreign key constraints.)
  • A new pre-join projection anchored on the table with the foreign key constraint is refreshed.
  • Your dimension tables have duplicate values in their primary key columns, or the fact tables lack foreign keys. In these cases, Vertica may return errors indicating constraint violations, because Vertica checks these constraints only as part of a join and does not perform a referential integrity (RI) check at load time.
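Because Vertica does not enforce these constraints at load time, consider validating them explicitly after a load completes. One option is the Vertica ANALYZE_CONSTRAINTS function, which reports rows that violate declared constraints. The following is a minimal sketch; the connection options are illustrative, and the table name is taken from the COPY example earlier in this document:

# Report constraint violations for a loaded table.
vsql -U dbadmin -d vmart -c "SELECT ANALYZE_CONSTRAINTS('ss_tpch.H_CUSTOMER');"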

Optimizing Loads

There are many possible reasons why a load may not perform as expected. Both Syncsort and Vertica provide a number of documents and best practices. The next two topics review the capabilities that you can implement in Syncsort and in Vertica.

Note: If you are testing your queries for performance, you may not see the full query speed right away. Delays can occur because Vertica may be busy reorganizing the data for several hours after a load. The amount of time that data reorganization requires depends on how much data was loaded and the type of load.

Vertica Best Practices for Data Load

For information about optimizing Vertica for data loads, see the Vertica product documentation.

The size of a COMMIT operation is important for data load performance. However, you cannot optimize data loads by changing the COMMIT size in Vertica; DMExpress controls that setting.

DMExpress Best Practices for Data Load

To optimize your data load in DMExpress, consider the maximum pool size override setting, which controls the number of parallel Vertica connections on the target side. The number of connections that Syncsort creates is based on how fast the DMExpress engine can process the source data for loading into Vertica. The example below limits the number of connections to 31:

DMX_VERTICA_MAX_PARTITIONS=31

Configuring Vertica to Work with SyncSort DMExpress

Consider the following tips for configuring Vertica to work optimally with Syncsort DMExpress:

Creating Resource Pools for ETL Jobs

To prevent query users from being affected by ETL jobs, create a separate resource pool for DMExpress. Your resource pool settings depend on the amount of memory on the machine and the number of resource pools available for users.

The following SQL statements show how to create a DMX resource pool:

=> CREATE RESOURCE POOL batch_pool MEMORYSIZE '4G' MAXMEMORYSIZE '84G'
   MAXCONCURRENCY 36;
=> DROP USER dmxdemo;
=> DROP SCHEMA dmxdemo_s;
=> CREATE SCHEMA dmxdemo_s;
=> CREATE USER dmxdemo IDENTIFIED BY 'Syncsort_pwd' SEARCH_PATH dmxdemo_s;
=> GRANT USAGE ON SCHEMA dmxdemo_s TO dmxdemo;
=> GRANT USAGE ON SCHEMA PUBLIC TO dmxdemo;
=> GRANT USAGE ON SCHEMA online_sales TO dmxdemo;
=> GRANT USAGE ON SCHEMA store TO dmxdemo;
=> GRANT SELECT ON ALL TABLES IN SCHEMA PUBLIC TO dmxdemo;
=> GRANT SELECT ON ALL TABLES IN SCHEMA store TO dmxdemo;
=> GRANT SELECT ON ALL TABLES IN SCHEMA online_sales TO dmxdemo;
=> GRANT ALL PRIVILEGES ON SCHEMA dmxdemo_s TO dmxdemo WITH GRANT OPTION;
=> GRANT CREATE ON SCHEMA dmxdemo_s TO dmxdemo;
=> GRANT USAGE ON RESOURCE POOL batch_pool TO dmxdemo;
=> ALTER USER dmxdemo RESOURCE POOL batch_pool;

Using Linux to Monitor Resource Usage

On the Vertica cluster, use the following Linux commands to monitor disk and resource usage:

  • df— Use df to monitor the disk so that enough space remains available.
  • vmstat— Use vmstat to monitor CPU usage at specified intervals. The column values bi (blocks in) and bo (blocks out) are reported in kilobytes per second.

The vmstat display shows swapping I/O using columns si (swap in) and so (swap out), also in kilobytes per second.

When you use vmstat, you can estimate the disk bandwidth used in the load in kilobytes per second. The maximum values observed for the bi + bo columns provide the disk bandwidth estimate.

If the swapping I/O is significant (si + so exceeds 20% of the maximum observed bi + bo, thus stealing up to 20% of the disk bandwidth) over many minutes, especially during the high block I/O periods, the system is under stress. In this situation, reduce the parallelism of the load at the earliest opportunity by reducing the number of load streams after the COPY for a chunk finishes.

For example, suppose you run vmstat 60 to obtain a report every 60 seconds. During the load, you see high CPU usage in bursts for the sequence of memory sorts in the first part of chunk processing. During the rest of the process, you observe high block I/O (vmstat columns bi and bo).

The system should not become idle until the load is done, where "idle" means both low CPU use and low bi + bo.

You can also use the sar and sar -r commands.
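For example, the following commands provide a simple monitoring baseline during a load; the 60-second interval matches the vmstat example above, and the mount point is illustrative:

df -h /data        # confirm that enough free disk space remains
vmstat 60          # report CPU usage, bi/bo, and si/so every 60 seconds
sar -r 60          # report memory utilization every 60 seconds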

Exporting Data from the Source Database to Flat Files

If, for any reason, you cannot connect directly to the source database, you need to export data from the source system to flat files. Before doing so, be aware of the following considerations:

  • Smaller tables generally fit into a single load file. Split any large tables into 250‒500 GB load files. For example, a 10 TB fact table requires 20‒40 load files to maintain performance.
  • The default delimiter for the COPY statement is a vertical bar (|). Before loading your data, verify that no CHAR(N) or VARCHAR(N) data values include this delimiter character. To test for the existence of a specific character in a column, use a query similar to the following:
    => SELECT COUNT(*) FROM t WHERE x LIKE '%|%'
    If only a few rows contain |, you can eliminate them from the load file using a WHERE clause. Then, load the rows separately using a different delimiter.
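For example, assuming only a few rows in table t contain the | character, you might place those rows in a separate file and load them with a comma delimiter. The file path, table name, and connection options below are illustrative:

# Load the excluded rows from a separate file, using a comma delimiter.
vsql -U dbadmin -d vmart -c "COPY t FROM LOCAL '/tmp/rows_with_pipe.csv' DELIMITER ',';"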

Data Type Mappings

Syncsort can generate Vertica DDL automatically when moving data from another database to a Vertica database. Make sure that you understand the differences between the two databases. For example, Oracle supports NVARCHAR and Vertica does not. If the source system contains multi-byte character data, when converting the DDL from the source to the Vertica target, increase the length of each VARCHAR or CHAR column by at least a factor of 3. Doing so allows for characters that occupy multiple bytes after conversion to UTF-8. For additional information, see Locales in this document.
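For example, under this guideline, an Oracle column defined as VARCHAR2(100) that may contain multi-byte character data would be created as VARCHAR(300) in the Vertica DDL.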

Contact Vertica Customer Support for example scripts that show how to convert DDL from SQL Server and Oracle to Vertica.

Vertica and Oracle Data Type Mappings

Oracle uses proprietary data types for all common data types (for example, VARCHAR, INTEGER, FLOAT, and DATE). If you plan to migrate your database from Oracle to Vertica, convert the schema first. Doing so is a simple and important exercise that can minimize errors and the time spent fixing erroneous data issues.

For a table comparing Oracle and Vertica data types, see Data Type Mappings between Vertica and Oracle in the Vertica product documentation.

For information about the SQL data types that Vertica supports, see SQL Data Types in the product documentation. For example, if you are migrating from Oracle, you must convert the non-standard type named NUMBER to SQL-standard INT or INTEGER. The required foreign key clauses can be in the table definitions themselves or in separate ALTER TABLE t ADD CONSTRAINT … commands. The foreign key constraints are very important to guide the Vertica Database Designer in its work.
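As a sketch, the following statement adds a foreign key constraint after table creation. The table and column names are illustrative (they follow the VMart example schema referenced elsewhere in this document), as are the connection options:

# Declare a foreign key so the Database Designer can take it into account.
vsql -U dbadmin -d vmart -c "ALTER TABLE store.store_sales_fact ADD CONSTRAINT fk_store FOREIGN KEY (store_key) REFERENCES store.store_dimension (store_key);"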

Expected Data Type Mapping Results

The following list describes several data type mapping exceptions. The results differ depending on the type of SQL that Syncsort generates:

  • A CHAR field with n characters is mapped to a CHAR column with fewer than n characters.
    Vertica COPY LOCAL: The record is inserted and silently truncated.
    SQL INSERT: The record is not inserted; a warning message occurs.
  • An EN (edited numeric) field is mapped to an INTEGER column with data that forces numeric overflow.
    Vertica COPY LOCAL: The record is rejected and sent to the rejected records log file.
    SQL INSERT: 0 is silently loaded in place of the number that causes the overflow.
  • An EN (edited numeric) field is mapped to a NUMERIC column with data that exceeds the scale.
    Vertica COPY LOCAL: The record is rejected and sent to the rejected records log file.
    SQL INSERT: Any record whose value exceeds the scale is silently rejected.

Using SET SESSION AUTOCOMMIT

When connecting to a Vertica database through a vsql COPY statement or through an ODBC connection, the SET SESSION AUTOCOMMIT option is set to Off by default.

To ensure optimal performance, retain the default SET SESSION AUTOCOMMIT setting. When SET SESSION AUTOCOMMIT is set to Off, transactions are not automatically committed after a specified number of records are processed.
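To confirm the session setting from vsql, you can use a quick check such as the following (a minimal sketch; the connection options are illustrative):

# Verify that AUTOCOMMIT is off for the session.
vsql -U dbadmin -d vmart -c "SHOW AUTOCOMMIT;"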

For additional information, see SET SESSION AUTOCOMMIT in the Vertica product documentation.  

Using Non-ASCII Data

Vertica stores data in the UTF-8 compressed encoding of Unicode. The resulting UTF-8 codes are identical to ASCII codes for the ASCII characters (codes 0‒127 in one byte). Because all current operating systems treat ASCII in the same way, table data (CHAR columns) in all ASCII transports are also stored in UTF-8 compressed format. For UTF-8 data, verify that the extraction method you use does not convert CHAR column values to the current (non-UTF-8) locale of the source system.

On most UNIX systems, use the locale command to see the current locale. You can change the locale for a session by setting the LANG environment variable to en_US.UTF-8.

Sometimes, data actively uses the non-ASCII characters of Latin-1, such as the Euro sign (€) and the diacritical marks of many European languages. If you have data in another character encoding such as Latin-1 (ISO 8859), you must convert it to UTF-8. To perform the conversion, use the Linux tool iconv. Large fact tables are unlikely to contain non-ASCII characters, so these conversions are usually needed only for data in tables with smaller dimensions.
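For example (the file names are illustrative):

locale                                  # display the current locale settings
export LANG=en_US.UTF-8                 # use a UTF-8 locale for this session
iconv -f ISO-8859-1 -t UTF-8 dim_customer.csv > dim_customer_utf8.csv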

Internationalization

Vertica supports the following internationalization features, described in the sections that follow:

  • Unicode character encoding
  • Locales

For more information on configuring internationalization for your database, see Internationalization Parameters in the Vertica product documentation.

Unicode Character Encoding: UTF-8 (8-bit UCS/Unicode Transformation Format)

All input data that the database server receives must be in UTF-8 format. All data output by Vertica must also be in UTF-8 format. The ODBC API operates on data in

  • UCS-2 on Windows systems
  • UTF-8 on Linux systems

A UTF-16 ODBC driver is available for use with the DataDirect ODBC manager.

JDBC and ADO.NET APIs operate on data in UTF-16. The Vertica client drivers automatically convert data to and from UTF-8 when sending to and receiving data from Vertica using API calls. The drivers do not transform data that you load by executing a COPY or COPY LOCAL statement.

Locales

The locale parameter defines the user's language, country, and any special variant preferences, such as collation. Vertica uses the locale to determine the behavior of various string functions. It also uses the locale when collating various SQL commands that require ordering and comparison. Such commands can include, for example, GROUP BY, ORDER BY, joins, and the analytic ORDER BY clause.

By default, the locale for the database is en_US@collation=binary (English US). You can establish a new default locale to be used for all sessions on the database. In addition, you can override individual sessions with different locales. You can set the locale through ODBC, JDBC, and ADO.NET.

Be aware of the following considerations when working with locales:

  • Projections are always collated using the en_US@collation=binary collation, regardless of the session collation. Any locale-specific collation is applied at query run time.
  • The maximum length parameter for VARCHAR and CHAR data types refers to the number of octets (bytes) that can be stored in that field, not the number of characters. When you use multi-byte UTF-8 characters, the fields must be sized to accommodate from 1 to 4 bytes per character, depending on the data.
  • When the locale is non-binary, use the collation function to transform the input to a binary string that sorts in the proper order. This transformation increases the number of bytes required for the input according to this formula (CollationExpansion defaults to 5):
result_column_width = input_octet_width * CollationExpansion + 4  
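For example, with the default CollationExpansion of 5, a 20-octet input value requires a result column width of 20 * 5 + 4 = 104 octets.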

Configuring Syncsort DMExpress for Use with Vertica

The following topics describe settings in DMExpress that can improve the performance of your integration with Vertica:

DMExpress Job Design

When designing jobs to run in DMExpress, you can build them with multiple tasks (called swim lanes) executing concurrently. To achieve the highest throughput, balance your loads so that the total number of rows to be processed is equal for each parallel load.

In the following figure, the job has two parallel swim lanes, each with three DMExpress tasks. Each of these DMExpress tasks runs multiple concurrent copies at the same time. Test to see whether this scenario is faster than having three swim lanes with two tasks each. Your goal is to keep the Vertica cluster busy enough with the load, but not to overload it; otherwise, requests may error out and be pushed to the queue.

image006.png

DMExpress Parallelization

DMExpress 7.13 and later and Vertica 6.x and later perform automatic parallelization of data loads. In the previous figure, DMExpress divides the data load into two swim lanes. The two jobs execute concurrently, and DMExpress loads data into two different Vertica tables.

Using DMExpress Command-Line Interface Scripts

Any ETL job in a Windows environment can also run on a DMExpress Linux machine. The following commands show how to run a DMExpress task from the command line:

cd /home/dbadmin/partner/syncsort/calin
 . syncsort_env_7.12.9.sh
 vsql -h 192.168.118.24 -U userid -w owd -d vmart -f create_table_lineitem.sql
 dmexpress /run VerticaLoaderUI.dxt

The following commands show another example of running a DMExpress job from the command line:

#!/bin/bash
# the following values are constants for the run
 export DMXDataDirectory=/home/dmxuser/vertica/POC/data
 curTime=`date +"%Y_%m_%d_%H.%M.%S"`
 export log_directory=logs_${curTime}
 export job_master_log=main.log
 mkdir $log_directory
 mlog=${log_directory}/${job_master_log}
 echo created log directory $log_directory 
 echo created log directory $log_directory > ${mlog} 
 default_jobs=1
 export V_USER=dbadmin
 export PW=password
 export V_DB=vmartdb
 export V_TABLE=MXR_HTTP_Access
 export V_TEMP_DATA_1=temp_data1
 export V_TEMP_DATA_2=temp_data2
 export V_CompressionType=
 
# test whether command line argument is present
 if [ -n "$1" ]
 then
   jobs=${1}
 else
   jobs=$default_jobs
 fi 

 for (( job=1; job <= $jobs; job++ ))
 do
    # Begin execution loop 
    echo ' ' 
    echo ' ' >> ${mlog}
    echo Beginning RUN ${job} >> ${mlog}
    export V_INFILE=$DMXDataDirectory/HTTP_Access${job}.txt
    export V_HOST1=192.168.91.173
    export V_HOST2=192.168.91.173
    export V_SLICE=${job}
    echo Environment Variables used during job ${job} were: >> ${mlog}
    env | grep V_ >> ${mlog}
    echo 'dmxjob /run T07_SplitReformatting.dxj > '${log_directory}'/dmxjob_'${job}'.log 2>&1 &'
    echo 'dmxjob /run T07_SplitReformatting.dxj > '${log_directory}'/dmxjob_'${job}'.log 2>&1 &' >> ${mlog}
    time dmxjob /run T07_SplitReformatting.dxj > ${log_directory}/dmxjob_${job}.log 2>&1 &
    echo ' ' >> ${mlog}
    echo DMExpress log for run ${job} is in dmxjob_${job}.log
    echo END OF RUN ${job}  >> ${mlog}
    echo '****************' >> ${mlog}
    echo ' ' >> ${mlog}
 done
 exit

The following script shows an example of how to set the DMExpress environment variables:

# ODBC settings
 export ODBCSYSINI=/home/dbadmin/partner/syncsort/
 export ODBCINI=${ODBCSYSINI}/odbc.ini
 export VERTICAINI=/home/dbadmin/partner/vertica.ini
# DMExpress settings
 export LD_LIBRARY_PATH=/home/dbadmin/partner/syncsort/calin/dmexpress_7.12.9/lib:$LD_LIBRARY_PATH
 export PATH=/home/dbadmin/partner/syncsort/calin/dmexpress_7.12.9/bin:$PATH
# turn on DMExpress tracing – optional
 export DMXTraceDirectory=/home/dbadmin/partner/syncsort/calin/trace
 export DMX_TRACE_DB_TARGET=1
# set source file and target table for the DMExpress task
 export SYNCSORT_TARGET_TABLE=syncsort_load_lineitem_calin
 export SYNCSORT_INPUT_FILE=/data/tpch/lineitem.tbl
 export DMX_VERTICA_MAX_PARTITIONS=31

Using ODBC Configuration Files

If you are developing ETL jobs on Windows but you want to run the Syncsort ETL job on the DMExpress Linux host, you must set some environment variables. These settings allow DMExpress to locate the necessary configuration files. 

In the following two Run Job dialog boxes, set the data directory location to any directory for which you have write permissions. This directory is where DMExpress copies the ETL job and tasks to execute on the DMX server.

image007.png image008.png

The following values are examples of three profile environment variables that you should set for the Syncsort user:

  • ODBCSYSINI—Location of the directory where the ODBC-related files are stored
  • ODBCINI—Location of the odbc.ini file
  • VERTICAINI—Location of the vertica.ini file

For example:

export ODBCSYSINI=/usr/local/dmexpress/etc        # this is a folder, not a file
 export ODBCINI=/usr/local/dmexpress/etc/odbc.ini
 export VERTICAINI=/usr/local/dmexpress/etc/vertica.ini

Set these variables in the user profile file, or in any script that sets the environment variables for DMExpress.

When configuring the Vertica ODBC client driver 5.1 or later on UNIX/Linux platforms, you must include the following DSN parameters:

  • ODBCInstLib—Absolute path to the file that contains the ODBC installer library.
  • ErrorMessagesPath—Absolute path to the parent folder that contains the Vertica client driver's localized error message files. These files are usually stored in the same folder as the Vertica ODBC driver files.

The following is an example of a vertica.ini file:

[Driver]
 #ODBCInstLib=/usr/lib64/libodbcinst.so
 ODBCInstLib=/usr/dmexpress/lib/libodbcinstSSL.so
 ErrorMessagesPath=/opt/vertica/lib64
 DriverManagerEncoding=UTF-16

The following is an example of an odbc.ini file:

[ODBC Data Sources]
 vmart = Vertica database vmart
 
 [vmart]
 Description = Vertica database vmart
 Driver = Vertica
 ServerName = part1 
 Database = vmart
 Port = 5433
 ColumnsAsChar = 1

The following examples show two different ways to set the Linux VERTICAINI environment variable:

  • VERTICAINI environment variable that is set to the absolute path where a system-wide vertica.ini file is located:
export VERTICAINI=/etc/vertica.ini
  • VERTICAINI environment variable that is set in a user's .profile and points to the location of an individual user's vertica.ini file:
export VERTICAINI=~/.vertica.ini

The following dialog box lists example environment variable settings for submitting a job from Windows to run on Linux:

image009.png

Troubleshooting DMExpress

To troubleshoot DMExpress, turn on DMExpress tracing on Linux using the following commands:

# turn on DMExpress tracing – optional
 export DMXTraceDirectory=/home/dbadmin/partner/syncsort/calin/trace
 export DMX_TRACE_DB_TARGET=1

To enable ODBC tracing, specify two parameters in the vertica.ini file:

  • LogLevel—Severity level of the messages logged between the client and the server.
  • LogPath—Absolute path of the folder in which to store log files.

For example:

 LogLevel=DEBUG
 LogPath=/tmp

Known Issues

Database Connection Could Not Be Established

If you are using the Syncsort-provided driver for ODBC instead of unixODBC, you may see an error about not being able to establish the database connection.

Symptom

The following text displays when this error occurs:

DMExpress : (DBCNNTES) database connection "DatabaseConnection1" to ODBC data source "VerticaODBC" could not be established
 (ODBC_ERROR) [unixODBC][Vertica][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.

Area Affected

This error occurs if you are running tasks that connect to Vertica databases after you have upgraded to DMExpress 7.14 or later.

Problem

In DMExpress 7.14, the SQLGetPrivateProfileString function that the Vertica driver needs has been moved to a different library.

Solution

In the vertica.ini file, modify the ODBCInstLib setting to point to <dmx_install_dir>/lib/libodbcinstSSL.so instead of libodbcSS.so.

Reference Number

DMX-13943

Unsupported Data Types

The DateTime Table data type is not supported. Syncsort has documented the following limitations: 

  • The default date/time mask does not include fractional seconds. However, you can specify a suitable mask by selecting Other format in the Source Database Table dialog box.
  • Syncsort does not support extracting time zone data from Vertica using ODBC.

The DMExpress Help topic "Conversion between Vertica data types and DMExpress data types" contains a matrix documenting the supported data types and default behaviors.