Vertica Integration with Tungsten: Tips and Techniques

Applies to Vertica 7.2.x and earlier 

 

About Vertica Tips and Techniques

Vertica develops Tips and Techniques documents to provide you with the information you need to use Vertica with third-party products. This document provides guidance using one specific version of Vertica and one version of the third-party vendor's software. While other combinations are likely to work, they may not have been tested. This document provides guidance using the latest versions of Vertica and Tungsten as of December 2015.

Document Overview

This document provides in-depth guidance on configuring the Tungsten Replicator to connect to Vertica. Tungsten connects to Vertica using the Vertica JDBC client driver.

This document describes how to replicate data from MySQL 5.6 to Vertica 7.2.x using Tungsten 4.0. It is based on the results of testing Vertica 7.1.x and 7.2.x with Tungsten 4.0.0.18 on a Linux platform.

Configuration Setting Changes

Tungsten replicates data according to how you have configured the Tungsten Replicator. You can change that configuration to alter settings such as the filter type or the log location. For example, you may want to add new filters to an existing configuration. You can make changes for a specific host from the /opt/continuent/tungsten directory.

Enter the following command to make changes to a host:

shell> ./tools/tpm update service_name --hosts=host1,host2

This command updates the local configuration with new configuration changes and restarts the Tungsten Replicator. You can view which components restarted by using the ./tools/tpm help update command.  
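
For example, a minimal sketch of adding a filter to an existing service and then confirming that the replicator came back online. The service name alpha, the host name host2, and the trepctl path are placeholders for your own values:

shell> cd /opt/continuent/tungsten
shell> ./tools/tpm update alpha --hosts=host2 --svc-applier-filters=dbtransform
shell> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 status | grep -E 'state|pendingError'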

Manage Vertica Schemas with Tungsten Replicator

Migration of Multiple Schemas from MySQL to Vertica

The database transform (dbtransform) filter transforms the name of the source database (MySQL) into the name of the target schema (Vertica). Apply the filter between reading data from the internal queue and applying the data to the destination database.

The following command applies the database transform filter, where DB1 and DB2 represent the MySQL databases and database1 and database2 represent the Vertica schemas:

Shell $> ./tools/tpm update alpha --hosts=host2 \
--svc-applier-filters=dbtransform \
--property=replicator.filter.dbtransform.from_regex1=DB1 \
--property=replicator.filter.dbtransform.to_regex1=database1 \
--property=replicator.filter.dbtransform.from_regex2=DB2 \
--property=replicator.filter.dbtransform.to_regex2=database2
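
Because Tungsten does not replicate DDL statements (see Control Database Schema Upgrades below), the target schemas named in the mapping typically must already exist in Vertica before replication starts. A minimal sketch using vsql, assuming the database1 and database2 schema names from the example above and a dbadmin user:

shell> vsql -U dbadmin -c "CREATE SCHEMA database1;"
shell> vsql -U dbadmin -c "CREATE SCHEMA database2;"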

Exclusion and Inclusion of Schemas and Tables for Replication to Vertica

The replicate filter performs explicit inclusion or exclusion of tables and schemas. Unless you specify otherwise, the filter replicates all tables and schemas. Each specification supports wildcards and multiple comma-separated entries. Use the .do configuration setting to include a table or schema. Use the .ignore configuration setting to exclude a table or schema.

The following sample command applies the replicate filter and excludes the table database1.tab1 from replication:

Shell $> ./tools/tpm update alpha --hosts=host2 \
--svc-applier-filters=replicate \
--property=replicator.filter.replicate.ignore=database1.tab1

The replicate filter applies the following rules to the filter parameters:

  • When both .do and .ignore are empty, Tungsten replicates all schemas and tables.
  • When only .do is specified, Tungsten replicates only the schemas and tables mentioned in the list.
  • When only .ignore is specified, Tungsten replicates all schemas and tables except those defined.
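
For example, a hedged sketch that replicates everything in the database1 schema plus a single table from database2, using the .do setting with a wildcard and multiple comma-separated entries (the schema and table names are placeholders):

Shell $> ./tools/tpm update alpha --hosts=host2 \
--svc-applier-filters=replicate \
--property=replicator.filter.replicate.do=database1.*,database2.tab1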

Control Database Schema Upgrades

Tungsten does not replicate DDL statements to the target database. You must manually apply schema and table changes on both the master and slave servers. If both the master and slave servers are in ONLINE mode, any schema changes that you apply online must be backward compatible. Backward compatibility ensures that the new changes do not affect the existing replication service.

Note: You cannot use the online method to make schema changes on slave servers that would otherwise break replication.

To perform a manual upgrade, follow these steps:

  1. Bring the slave server offline using the following command:
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 offline
  2. Update your Vertica schema with the latest changes.
  3. Bring the slave server back online using the following command:
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 online
  4. Bring the master server offline using the following command:
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host1 offline
  5. Update your MySQL database with the latest changes.
  6. Bring the master server back online using the following command:
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host1 online
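
Once both servers are back online, you can confirm that replication resumed by checking the replicator state on each host (paths and host names as in the steps above):

Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host1 status | grep state
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 status | grep state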

Increase Batch Performance

Tungsten loads data into Vertica using COPY or COPY LOCAL. The batch_size parameter determines how many rows are inserted into Vertica in one batch. The more rows Tungsten inserts per batch, the better the performance. You can increase batch performance by using the block commit parameters. The Tungsten Replicator commits changes read from the Transaction History Log (THL). Tungsten commits these changes on the slave servers during the applier stage, based on the block commit size or the block commit interval.

The block commit parameters replace the single replicator.global.buffer.size parameter that controls the size of buffers used within each stage of replication.

Tungsten commits a block of transactions applied to the database when either of the following occurs:

  • The event count reaches the specified event limit (set by blockCommitRowCount).
  • The commit timer reaches the specified commit interval (set by blockCommitInterval).

The block commit size is controlled by the --repl-svc-applier-block-commit-size option to tpm or by the blockCommitRowCount parameter. The block commit interval is controlled by the --repl-svc-applier-block-commit-interval option to tpm, or by the blockCommitInterval parameter.

If only a number is supplied, that number is used as the interval in milliseconds. Tungsten supports suffixes of s, m, h, and d for seconds, minutes, hours and days, respectively, as shown in the following example:

shell> ./tools/tpm update alpha --repl-svc-applier-block-commit-size=20000 --repl-svc-applier-block-commit-interval=100s

Note: Whichever limit is reached first (block commit size or block commit interval) triggers the commit of the block of transactions.
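
To see how the applier is actually batching commits, you can inspect the per-task apply statistics. A minimal check; the exact field names in the output may vary by Tungsten version:

shell> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 status -name tasks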

Troubleshooting Tungsten with Vertica

Tungsten Replicator Does Not Run on Vertica Server Nodes

The Tungsten Replicator uses the COPY_DIRECT command to load CSV files into a staging table in Vertica. If the Tungsten slave replicator runs on a different node from the Vertica server, the COPY_DIRECT command fails because the CSV files do not reside on the same server.

To fix the issue with COPY_DIRECT, follow these steps:

  1. Update the vertica6.js file to use COPY LOCAL. The vertica6.js file is available here:
<Tungsten_Home>/tungsten-replicator-4.0.0-18/tungsten-replicator/samples/scripts/batch/vertica6.js
  2. Replace COPY with COPY LOCAL by changing the copy_sql line as follows:
copy_sql = runtime.sprintf("COPY %s FROM LOCAL '%s' DELIMITER ',' ENCLOSED BY '\"'",
      stage_table_fqn, csv_file);
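
If you want to confirm that COPY LOCAL works from the replicator host before editing the script, you can run an equivalent statement by hand. A hypothetical sketch; the staging table name and CSV path are placeholders:

shell> vsql -U dbadmin -c "COPY test.stage_xxx_tab1 FROM LOCAL '/tmp/staging/tab1.csv' DELIMITER ',' ENCLOSED BY '\"';"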

Tables Do Not Have Primary Keys

Replication can fail on the slave servers when primary keys are not defined on the source table. However, you can work around the issue by modifying the vertica6.js file used by the Tungsten Replicator.

Follow these steps to use the vertica6.js file:

  1. Locate the vertica6.js file at the following location:
<Tungsten_Home>/tungsten-replicator-4.0.0-18/tungsten-replicator/samples/scripts/batch/vertica6.js
  2. Replace the original code with the workaround code as shown below.

In the original code, where_clause is an empty string when no key is defined on the source table, which produces invalid SQL:

// Remove deleted rows from base table. 
      delete_sql = runtime.sprintf(
      "DELETE FROM %s WHERE EXISTS (SELECT * FROM %s WHERE %s AND %s.tungsten_opcode = 'D')",
      base_table_fqn, 
      stage_table_fqn, 
      where_clause, 
      stage_table_fqn
  );

In the workaround code, this issue is fixed:

// Add the trailing " AND " to the WHERE condition only when one exists
 where_clause = (where_clause.equals("") ? "" : where_clause + " AND ");
// Remove deleted rows from base table. 
      delete_sql = runtime.sprintf(
      "DELETE FROM %s WHERE EXISTS (SELECT * FROM %s WHERE %s %s.tungsten_opcode = 'D')",
      base_table_fqn, 
      stage_table_fqn, 
      where_clause, 
      stage_table_fqn
  );

In the original code, pkey_columns is an empty string when no key is defined on the source table, which produces an invalid GROUP BY clause:

// Insert non-deleted INSERT rows, i.e. rows not followed by another INSERT
  // or a DELETE. 
  insert_sql = runtime.sprintf(
     "INSERT INTO %s (%s) SELECT %s FROM %s WHERE tungsten_opcode='I' AND tungsten_row_id IN (SELECT MAX(tungsten_row_id) FROM %s GROUP BY %s)", 
      base_table_fqn, 
      base_columns, 
      base_columns, 
      stage_table_fqn, 
      stage_table_fqn, 
      pkey_columns
  );

In the workaround code, this issue is fixed:

insert_sql = runtime.sprintf(
     "INSERT INTO %s (%s) SELECT %s FROM %s WHERE tungsten_opcode='I' AND tungsten_row_id IN (SELECT MAX(tungsten_row_id) FROM %s GROUP BY %s)", //removed group by clause
      base_table_fqn,  
      base_columns, 
      base_columns, 
      stage_table_fqn, 
      stage_table_fqn, 
      base_columns
  );
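
After editing vertica6.js, the modified script is typically not picked up until the replicator goes through an offline/online cycle. A minimal sketch, using the same host name as the earlier examples:

Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 offline
Shell $> <home-directory>/tungsten/tungsten-replicator/bin/trepctl -host host2 online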

Debug the Tungsten Replicator

You may run into errors when setting up heterogeneous replication. If trepctl status does not show a meaningful error message, check the replicator logs.

The most common issues that you might face are:  

  • MySQL organizes tables into databases, while Vertica uses a single database that contains multiple schemas. If you write to a database named “test” in MySQL, the data goes to the schema named “test” in Vertica. Be sure to differentiate between databases and schemas.
  • If you define an incorrect staging table, the Vertica slave server fails. If this happens, drop the staging table, recreate it using ddlscan (see the sketch after this list), and then bring the Tungsten Replicator back online.
  • Dates may appear to be off by a few hours. Verify that the Java time zone and date settings are configured correctly.
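
For reference, a hedged sketch of regenerating and reloading the staging table DDL with ddlscan. The MySQL connection details, the database name test, and the ddl-mysql-vertica-staging.vm template name are assumptions based on a standard Tungsten batch-loading setup:

Shell $> <home-directory>/tungsten/tungsten-replicator/bin/ddlscan \
-user tungsten -pass secret \
-url jdbc:mysql:thin://host1:3306/test -db test \
-template ddl-mysql-vertica-staging.vm > staging_ddl.sql
Shell $> vsql -U dbadmin -f staging_ddl.sql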

Configuration and Server Logs

  • To check the configuration logs for an issue such as ERROR>>There was a problem parsing the arguments, navigate to the following:
/tmp/tungsten-configure.log
  • To view the server logs (for both slave and master), navigate to the following (see the example after this list):
<home-directory>/tungsten/tungsten-replicator/log/trepsvc.log
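
A quick way to watch for new errors while reproducing an issue is to follow the replicator log and filter for error lines (standard tail and grep usage):

shell> tail -f <home-directory>/tungsten/tungsten-replicator/log/trepsvc.log | grep -i error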

For More Information

For more information about the following topics, see:

  • Tungsten: http://docs.continuent.com/tungsten-replicator-4.0/index.html
  • Vertica Community Edition: https://vertica.com/community/
  • Vertica Documentation: http://vertica.com/docs/latest/HTML/index.htm
  • Big Data and Analytics Community: https://vertica.com/big-data-analytics-community-content/