Share this article:

Vertica Integration with Tungsten: Connection Guide

Applies to Vertica 7.2.x and earlier 

 

 Click here to view a PDF version of this document. 

About Vertica Connection Guides

Vertica connection guides provide basic information about setting up connections to Vertica from software that our technology partners create. These documents provide guidance using one specific version of Vertica and one specific version of the third party vendor’s software. Other versions of the third-party product may work with Vertica. However, other versions may not have been tested. This document provides guidance using the latest versions of Vertica and Tungsten as of December, 2015.

Tungsten Replicator Overview

The Tungsten Replicator is an open source replication engine that supports many different extractor and applier modules. Data can be extracted from MySQL and Oracle, and applied to transactional stores. Data can also be applied to NoSQL stores such as MongoDB, and data warehouse stores such as Vertica.

The core of the replication functionality lies in three major components:

  • Transaction History Log (THL): The THL is the data that Tungsten takes from the master's binary logs and transports to its servers, with the addition of some metadata.
  • Extractor: The extractor component reads data from the source data server and writes that information to the Transaction History Log (THL). The extractor is also responsible for writing the data into the THL in the native or derived format either as a SQL statement or as row-based information.
  • Applier: Appliers within the Tungsten Replicator convert the THL information and write it to a destination data server. The applier has the capability to work with a number of different target databases, such as Vertica or MySQL.

This document demonstrates how users can replicate data from MySQL to Vertica using Tungsten.

Tungsten uses the JDBC driver to connect to Vertica. This document provides the replication mechanism from MySQL 5.6 to Vertica 7.2.x using Tungsten 4.0. This document is based on the results of testing Vertica 7.1.x and 7.2.x with Tungsten 4.0.0.18 on a Linux platform.

Replication to Vertica

In heterogeneous systems, it is not possible to execute the DDL and DML statements from source systems into target systems because the SQL dialects are different. Tungsten uses row-based replication when performing replication from MySQL to Vertica.

Replication to Vertica follows this flow:

Data is extracted from the source database into the THL.

 arrowtungsten.png

While extracting data from the THL, the Tungsten Replicator writes the data into CSV files based on the names of the source tables. These files contain all of the row-based data, including the unique global transaction ID generated by the Tungsten Replicator during replication. The operation type (Insert or Delete) is also listed as part of the CSV file.

 arrowtungsten.png

Vertica loads the CSV files into staging tables.

 arrowtungsten.png

Tungsten executes SQL statements to perform updates on live versions of the tables. The statements use the CSV, batch-loaded information to delete old rows and insert new data into tables. The statements also perform updates as necessary to work effectively with Vertica.

Download and Install Tungsten

Follow these steps to download and install the Tungsten Replicator.

  1. Download the software for Tungsten 4.0.
  2. Unpack the file into the Tungsten Replicator deployment directory using the following command:
Tungsten: shell > $>tar –xvf tungsten-replicator-4.0.0-18.tar.gz

Download and Install the Vertica Client Drivers

Before you can connect to Vertica using Tungsten, you must install the Vertica client package. This package includes the JDBC client driver that Tungsten uses to connect to Vertica.

Download Vertica Client Drivers

  1. Go to the Vertica Client Drivers page. 
  2. Download the version of the Vertica client package that is compatible with the architecture of your operating system and Vertica server version.

Note Vertica drivers are forward compatible, so you can connect to the Vertica server using previous versions of the client. For more information about client and server compatibility, see Client Driver and Server Version Compatibility in the Vertica documentation.

Install Vertica Client Drivers

Based on the client package you downloaded, follow the steps for installation from the Vertica documentation.

Copy the Vertica JDBC driver into the Tungsten Replicator deployment directory:

<Tungsten_Home>/tungsten-replicator-4.0.0-18/tungsten-replicator/lib

Set up MySQL to Vertica Replication

Before you begin, you must configure the staging host that is responsible for setting up the replication services involved in the replication process. For details on configuring the staging host, see Staging Host Configuration in the Tungsten documentation.

Configure Hosts

MySQL uses row-based replication to replicate data to Vertica. You must create tables that need to be replicated in Vertica; tables are not automatically created. You must use the ddlscan utility to create tables for replication. To prepare the MySQL and Vertica hosts, follow the steps in Preparing Hosts for Vertica Deployments.

Prepare Schemas in Vertica

The Tungsten Replicator does not prepare target database schemas and tables based on the source database. The Tungsten Replicator includes a tool called the ddlscan, which reads the schema definition from MySQL and translates that information into the schema definition required for the target database. You should prepare the source database tables you want to replicate to Vertica and then prepare table definitions for your Vertica database. Follow these steps to create schema and tables for the staging tables and target tables:

  1. Create a table in MySQL using the following command:
mysql -utungsten DBNAME
 ...
mysql> CREATE TABLE Char1_Table(Id int primary key,
                          DataTypeSet VARCHAR (20) 'Char1', 
                          ValueDesc   VARCHAR (50),
                          Char1_Column CHAR
                          ); default charset=utf8;
  1. Execute the ddlscan tool twice.
Shell $> cd <Tungsten_Home>/tungsten-replicator-4.0.0-18/tungsten-replicator/samples/extensions/velocity

The first time generates the live table definitions:

Tungsten: shell $> /<Tungsten_Home>//tungsten-replicator-4.0.0-18/tungsten-replicator/bin/ddlscan -user tungsten –url jdbc:mysql:thin://host1:3306/DBNAME -pass password -template ddl-mysql-vertica.vm -db DBNAME >>ddl.sql

The second time creates the table definitions for the staging data using the staging template:

Tungsten: shell $> /<Tungsten_Home>//tungsten-replicator-4.0.0-18/tungsten-replicator/bin/ddlscan -user tungsten –url jdbc:mysql:thin://host1:3306/DBNAME -pass password -template ddl-mysql-vertica-staging.vm -db DBNAME >>ddl.sql
  1. Edit ddl.SQL file and update this file according to the target database.
  2. Execute the Vertica database using the following command:
dbadmin: shell $> vsql –U dbadmin –w <PASSWORD> <ddl.sql

Install Vertica Replication

  1. Change to the staging directory using the following command:
Tungsten: shell $> cd <Tungsten_Home>/ tungsten-replicator-4.0.0-18
  1. Configure the main parameters for the replicator service:
Tungsten: shell $> ./tools/tpm configure alpha \
    --master=host1 \
    --members=host1, host2 \
    --install-directory=/opt/continuent \
    --disable-relay-logs=true \
    --skip-validation-check=HostsFileCheck \
    --enable-heterogenous-service=true \
    --start
  1. Configure and install the MySQL Master:
Tungsten: shell $> ./tools/tpm update alpha\
        --master=host1 \
        --hosts=host1 \
        --datasource-host=host1 \
        --datasource-user=tungsten \
        --datasource-password=password \
        --datasource-mysql-conf=/usr/my.cnf \
        --home-directory=/opt/continuent \
        --java-file-encoding=UTF8 \
        --java-user-timezone=GMT \
        --svc-extractor-filters=colnames,pkey \
        --
property=replicator.filter.pkey.addColumnsToDeletes=true \
        --
property=replicator.filter.pkey.addPkeyToInserts=true \
        --mysql-use-bytes-for-string=false \
        --start-and-report

Note The preceding command has some essential settings that help with heterogeneous replication:

  • The Java VM file encoding and time zone are UTF-8 and GMT, respectively. Standardizing these values is required to avoid corrupting data in batch loads.
  • Tungsten translates string values to UTF-8 rather than passing these values to slaves as bytes.
  • Tungsten inserts filters to add column names and identify the primary key on tables. These additions are required for batch loading to work properly.
  1. Configure and install the Vertica slave server:
Tungsten: shell $> ./tools/tpm update alpha \
    --hosts=host2 \
--replication-user=dbadmin \
--replication-password=password \
--batch-enabled=true \
--batch-load-language=js \
--batch-load-template=vertica6 \
--datasource-type=vertica \
--vertica-dbname=DBNAME \
--replication-host=host2 \
--replication-port=5433 \
--skip-validation-check=InstallerMasterSlaveCheck \
--svc-applier-block-commit-size=25000 \
--svc-applier-block-commit-interval=30s \
--start-and-report

Monitor Vertica Deployment

Monitoring a Vertica replication scenario requires checking the status of both the master, which extracts data from MySQL, and the slave, which retrieves the remote THL information and applies it to Vertica.

The following graphic shows the master server. The output of the trepctl shows the current sequence number and applier status.

tungsten1.png

The following graphic shows the slave server. The output of the trepctl shows the current sequence number and applier status.

tungsten2.png

Test MySQL with Vertica Replication

To replicate data, you need a table on MySQL to hold some data.

Follow these steps to test MySQL with Vertica using the Tungsten Replicator. The following example shows how to move a row from one table to another:

  1. Log in to MySQL and insert a row:
mysql -utungsten DBNAME
...
mysql> INSERT INTO Char1_Table VALUES(1,default, 'Empty', '');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO Char1_Table VALUES(2,default, 'Typical', 'a');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO Char1_Table VALUES(3,default, 'Max', 'Z');
Query OK, 1 row affected (0.00 sec)
  1. If you configured things properly, you should see the following on the Vertica side:
dbadmin=> SELECT * from tungsten.char1_table;
Id | DataTypeSet | ValueDesc | Char1_Column
----+-------------------------------------------
1 | default     | Empty     |
2 | default     | Typical   | a
3 | default     | Max       | Z
(3 rows)

Example: Use the SimpleBatchApplier

The Tungsten Replicator applies data to Vertica using a new applier class called the SimpleBatchApplier.  It implements the CSV files through the following process: 

  1. As new transactions arrive, the Replicator writes them to CSV files named after the corresponding Vertica tables. For example, if you have updates for a table named simple_tab in a schema test, the format would look like the following:
Schema  |   Table     |     Column      |  Type       | Size |
--------+-------------+-----------------+--------------+-----  
|tungsten| Char1_Table | id              | int        |    8 |
|tungsten| Char1_Table | DataTypeSet     | varchar(20)|   20 |
|tungsten| Char1_Table | ValueDesc       | varchar(40)|   40 |
|tungsten| Char1_Table | Char1_Column    | char       |    1 |
  1. The updates go into a file named simple_tab. The following is an example of the data in the CSV file:
"64087","I","5","Some data","Some data","B","1"
"64087","I","6","more data","more data","C","2"
"64088","D",3",","default","Max",”z”,"3"
  1. The CSV file includes a unique global transaction ID, an operation code (I for insert and D for delete), and the primary key. For inserts, there are additional columns that contain data; for deletes, columns contain nulls. The last column is a row number, which allows for ordering after the data is loaded into Vertica.
  2. The Tungsten Replicator applies transactions to replicas in serial order without deviations. If you INSERT and then UPDATE a row, it always works because the Replicator applies them to the slave server in the same order.
  3. The Tungsten Replicator continues writing transactions until it reaches the block commit maximum. The Replicator then closes each CSV file and loads the content into a staging table that is named according to the base name. The staging table format mimics the CSV file columns. For example, the staging table could look like the following example:
Schema |          Table     |     Column      |     Type     |Size |
--------+------------------+-----------------+--------------+------+
tungsten| stage_xxx_Char1_Table| tungsten_seqno  | int        |  8 |
tungsten| stage_xxx_Char1_Table | tungsten_opcode| char(1)    |  1 |
tungsten| stage_xxx_Char1_Table | id             | int        |  8 |
tungsten| stage_xxx_Char1_Table | DataTypeSet    | varchar(20)| 20 |
tungsten| stage_xxx_Char1_Table | ValueDesc      | Varchar(40)| 40 |
tungsten| stage_xxx_Char1_Table | Char1_Column   | Char       |  1 |
tungsten| stage_xxx_Char1_Table | tungsten_row_id| int        |  8 |
  1. Finally, the Replicator applies the deletes and inserts to the table simple_tab by executing SQL commands as shown in the following example:
DELETE FROM tungsten.Char1_Table WHERE id IN
    (SELECT id FROM tungsten.stage_xxx_Char1_Table
        WHERE tungsten_opcode = 'D');
INSERT INTO tungsten.Char1_Table(id, DataTypeSet,ValueDesc,Char1_Column)
    SELECT id, DataTypeSet,ValueDesc,Char1_Column
    FROM tungsten.stage_xxx_Char1_Table AS stage_a
    WHERE tungsten_opcode='I' AND tungsten_row_id IN
        (SELECT MAX(tungsten_row_id)
          FROM tungsten.stage_xxx_Char1_Table GROUP BY id);

Known Limitations

Tungsten currently has some important limitations for batch loading, namely:

  • Primary keys must be a single column only. Tungsten does not handle multi-column keys. You must define the primary keys.
  • Binary data may cause problems when converted to CSV as it converts to Unicode.

Data Type Mapping: MySQL to Vertica

The following table shows data type mapping between MySQL data types and Vertica data types.

MySQL Data Type Vertica Data Type

DATETIME

DATETIME

TIMESTAMP

TIMESTAMP

DATE

DATE

TIME

TIME

TINYINT

TINYINT

SMALLINT

SMALLINT

MEDIUMINT

INT

INT

INT

BIGINT

INT

VARCHAR

VARCHAR

CHAR

CHAR

BINARY

BINARY

VARBINARY

VARBINARY

TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT

VARCHAR(65000)

BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB

VARBINARY(65000)

FLOAT

FLOAT

DOUBLE

DOUBLE PRECISION

ENUM

VARCHAR

SET

VARCHAR(4000)

BIT(1)

BOOLEAN

BIT

CHAR(64)

For More Information

For More Information About… See…

Tungsten

http://pubs.vmware.com/continuent/tungsten-replicator-4.0/index.html

Vertica Community Edition

https://my.vertica.com/community/

Vertica Documentation

http://my.vertica.com/docs/latest/HTML/index.htm

Big Data and Analytics Community

https://my.vertica.com/big-data-analytics-community-content/

Share this article: