Vertica QuickStart for Talend Data Integration

What is a QuickStart Application?

The Vertica QuickStarts are sample applications that show how complementary technologies can work together to deliver outstanding benefits to end users. Each QuickStart uses the Vertica Analytic Database with a different BI or ETL tool from a Vertica technology partner.

The QuickStarts are available for download on the Big Data Marketplace in the QuickStart Examples category.

The Vertica QuickStarts are freely available for demonstration and educational purposes. They are not governed by any license or support agreements and are not suitable for deployment in production environments.

About this Document

This document provides instructions for installing, configuring, and deploying the Vertica QuickStart for Talend Data Integration. It includes an overview of the QuickStart ETL functionality.

Details about the ETL processing and the source and target data sets are provided in the companion document, Vertica VHist ETL Overview. Both documents are included in the QuickStart download package.

About the Vertica QuickStart for Talend Data Integration

The Vertica QuickStart for Talend Data Integration is a sample ETL application powered by Vertica Analytic Database. The QuickStart uses Talend Data Integration to extract data from Vertica system tables and load it into a data warehouse called VHist (Vertica History).

VHist ETL Overview

VHist ETL occurs in two steps:

  1. Extracts the data from system tables in the V_CATALOG and V_MONITOR schemas and loads it into a staging schema called VHIST_STAGE. Only minimal transformation occurs during this step.
  2. Extracts the data from VHIST_STAGE, transforms it, and loads it into the VHist star schema.
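
As a concrete illustration of the two steps, the sketch below loads one V_MONITOR system table into staging and then into a star-schema fact table. This is an illustrative sketch only: the staging and star table names and columns shown are hypothetical; the actual schemas are described in VHist ETL Overview.

```sql
-- Step 1 (illustrative): land a system table in the staging schema
-- with minimal transformation. The target table name is hypothetical.
INSERT INTO vhist_stage.stg_query_requests
SELECT * FROM v_monitor.query_requests;

-- Step 2 (illustrative): transform staged rows into a star-schema
-- fact table. Table and column names here are hypothetical.
INSERT INTO vhist.fact_query_events (session_id, start_time, duration_ms)
SELECT session_id, start_timestamp, request_duration_ms
FROM vhist_stage.stg_query_requests;
```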

For more information about VHist ETL and Vertica system tables, see the companion document, Vertica VHist ETL Overview.

VHist ETL Processing Modes

Talend Data Integration ETL jobs are encapsulated in jar files and are thus portable to any environment that supports Java. You can create the QuickStart VHist ETL jar file in developer mode or in batch mode, as described in the following sections.

Software Used to Create the QuickStart

The Vertica QuickStart for Talend Data Integration was created and tested using:

  • Talend Open Studio for Data Integration version 6.1 (the Community Edition)
  • Linux CentOS 5
  • Vertica Analytic Database 7.1.0
  • Vertica JDBC driver 7.0.1
  • JDK 1.8

Requirements

The Vertica QuickStart for Talend Data Integration requires the following:

  • A Vertica database
  • JDK 1.8 or later
  • Talend Data Integration, either the Community Edition (Talend Open Studio for Data Integration) or the Enterprise Edition

Install the Software

To install the software that is required for running the QuickStart, follow these steps.

Install Talend Data Integration

If you want to develop the Talend QuickStart ETL job yourself, you must have Talend Data Integration installed on your computer. If you do not already have Talend Data Integration, you can download the free Open Studio version, or you can obtain a free trial of the Enterprise Edition.

To install Talend Open Studio for Data Integration:

  1. Go to the Download page on the Talend website: https://www.talend.com/download/
  2. Scroll down to Download Free Talend Products.
  3. Under Data Integration, click DOWNLOAD FREE TOOL.
  4. Download the compressed file and extract the contents to <Talend_Location> on your local machine.
  5. Verify that the TOS folder for your version of Talend Data Integration is present in <Talend_Location>. For example: TOS_DI-20151214_1327-V6.1.1 for Talend Data Integration 6.1.
  6. Add EXECUTE permission to all shell scripts:
    chmod +x <Talend_Location>/<TOS_folder>/*.sh

To install a free trial of Talend Enterprise for Data Integration:

  1. Go to the Download page on the Talend website: http://www.talend.com/download/
  2. Under Test Drive Talend Products, select Data Integration and click Try Now.
  3. Download the compressed file and extract the contents to <Talend_Location> on your local machine.
  4. Start the installer, and follow the installation instructions.
  5. Verify that the data-integration folder is present in <Talend_Location>.
  6. Add EXECUTE permission to all shell scripts:
    chmod +x <Talend_Location>/data-integration/*.sh

Install the Vertica Database Server

The Vertica database server runs on Linux platforms. If you do not have the Vertica database server, you can download the Community Edition free of charge, either directly from my.vertica.com or through the Big Data Marketplace.

To download directly from my.vertica.com:

  1. Go to https://my.vertica.com/.
  2. Click Register.
  3. Provide your information and click Register.
  4. Follow the on-screen instructions to download and install Vertica Community Edition.

To download from the Big Data Marketplace:

  1. Go to the Big Data Marketplace and sign in with your Marketplace credentials.
  2. Select Vertica Platform.
  3. Select Vertica Analytic Database Server.
  4. Click Get It!.
  5. Provide your information and click Register.
  6. Follow the on-screen instructions to download and install Vertica Community Edition.

For more information, see Vertica Integration with Talend: Tips and Techniques.

Install the QuickStart Application

The Vertica QuickStart for Talend Data Integration download package includes:

  • the QuickStart application
  • the JDBC driver that connects Talend Data Integration to the Vertica database server
  • two PDFs:
    • Vertica QuickStart for Talend Data Integration, which provides installation and deployment instructions (this document)
    • VHist ETL Overview, which provides details about VHist ETL and the source and target schemas.

To install the QuickStart:

  1. Go to the Big Data Marketplace and sign in with your Marketplace credentials.
  2. Select the QuickStart Examples category.
  3. Select Vertica QuickStart for Talend Data Integration.
  4. Click Download.
  5. Save the compressed (.zip) file on your machine.
  6. Extract the contents of the file to <VHIST_Job_Location>. You will see these subdirectories:
    • config—contains information for configuring the ETL source and target
    • dev—contains the source code of the VHist ETL project as a zip file (Job Designs.zip)
    • jobs—contains the Talend ETL jobs
    • logs—contains log files in which the batch scripts record events
    • setup—contains batch scripts for creating VHist and performing ETL
    • sql—contains the SQL scripts that create and populate the VHist_stage and VHist schemas
  7. Add EXECUTE permission to all shell scripts:
    chmod +x <VHIST_Job_Location>/VHIST_ETL/setup/*.sh
    chmod +x <VHIST_Job_Location>/VHIST_ETL/jobs/create_schema/*.sh
    chmod +x <VHIST_Job_Location>/VHIST_ETL/jobs/Load_VHISTDW/*.sh

Configure the Source and Target

To configure the ETL process with your source- and target-specific information, follow these steps:

  1. Open the configuration file, Config.properties:
    <VHIST_Job_Location>/VHIST_ETL/config/Config.properties
  2. Edit the file, supplying the following information:
    • Your source and target server
    • Your database name
    • Your database credentials
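
A minimal Config.properties might look like the sketch below. The key names and values shown are hypothetical; use the keys that actually appear in the file shipped in the config directory.

```properties
# Hypothetical example values; the real key names are defined in the
# Config.properties file shipped with the QuickStart.
host=vertica01.example.com
port=5433
database=vmart
user=dbadmin
password=secret
```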

Create the ETL Job in Developer Mode

To create the ETL job in developer mode, you must create and configure a project in Talend Data Integration. Then you can create and populate the data warehouse within the project.

Create a Project

  1. Open Talend Data Integration and create a new project.

    image001.png

  2. To import the QuickStart VHist ETL source code, right click Job Designs and click Import items.

    image002.png

  3. In the Import items dialog box, select the path of the source code archive file, then select all the jobs listed in the bottom pane and click Finish.

    image003.png

    The QuickStart VHist ETL jobs are now available under Job designs.

  4. Under Contexts, select Global_Context.

    image004.png 

  5. Set the context_path to <VHIST_Job_Location>/VHIST_ETL. Leave the other variables blank.

    image005.png

  6. Click Finish to complete the configuration of the QuickStart VHist ETL project.

    The VHist ETL jobs are now available for Edit, Execute, Build and Export.

Create the Data Warehouse

  1. Under Job Designs, select create_schema.
  2. Click the run icon.

    The create_schema job displays.

    image006.png

Populate the Data Warehouse

The Load_VHISTDW job loads the stage and star schemas. Run this job for the initial and incremental loads.

  1. Under Job designs, select Load_VHISTDW.
  2. Click the run icon.

    The load process displays.

    image007.png

Create the ETL Job in Batch Mode

Once you have installed, configured, and tested the development deployment, you can use Talend Data Integration to build the ETL job as a jar file. The Build functionality in Talend Data Integration lets you create standalone scripts for Windows batch and Linux shell deployments. To build the jar file, follow the instructions in the Talend documentation.

The following instructions for Batch Mode deployment assume that you have already built the ETL job using Talend Data Integration:

  1. Create the data warehouse (a one-time task):

    Run this command, providing the parameter <VHIST_Job_Location>:

    sh <VHIST_Job_Location>/VHIST_ETL/setup/setup_schema.sh <VHIST_Job_Location>

    Check the execution log to determine if the script executed successfully:

    <VHIST_Job_Location>/VHIST_ETL/logs/create_schema.txt
  2. Populate the data warehouse:

    Run this command, providing the parameter <VHIST_Job_Location>:

    sh <VHIST_Job_Location>/VHIST_ETL/setup/load_VHISTDW_run.sh <VHIST_Job_Location>

    Check the execution log to determine if the job executed successfully:

    <VHIST_Job_Location>/VHIST_ETL/logs/job.txt
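
Because both batch scripts record their results in log files, a small helper can automate the success check. This is a sketch under assumptions: it simply greps the log for common error keywords, and the exact text Talend writes on failure may differ in your installation.

```shell
# check_etl_log: succeed (return 0) when the given ETL log contains no
# obvious error markers; fail (return 1) otherwise. The keywords matched
# here are assumptions about what a failed Talend job writes.
check_etl_log() {
  if grep -qiE 'exception|error' "$1"; then
    echo "errors found in $1"
    return 1
  fi
  echo "no errors found in $1"
  return 0
}

# Example usage (path illustrative):
# check_etl_log <VHIST_Job_Location>/VHIST_ETL/logs/job.txt
```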

Validate the ETL

The VHist ETL process records events in log tables that you can query to determine the success or failure of the data load.

To query the ETL log tables:

  1. Connect to the target database using vsql or a client tool such as DbVisualizer.
  2. Run this query to validate the vhist_stage schema:
    SELECT *
    FROM vhist_stage.vhist_stage_load_log
    WHERE batch_id = (SELECT MAX(batch_id)
                      FROM vhist_stage.vhist_stage_load_log);
  3. Run this query to validate the vhist schema:
    SELECT *
    FROM vhist.vhist_load_log
    WHERE batch_id = (SELECT MAX(batch_id)
                      FROM vhist.vhist_load_log);

Schedule Incremental Loads

Once the data warehouse has been created and populated, you can perform incremental loads to keep the warehouse up to date. To continually refresh the data warehouse, schedule incremental loads to run at intervals.

To schedule incremental loads:

  1. At the Linux command prompt, type this command:
    $> crontab -e
  2. Type this line, substituting the appropriate values for your system. In this example, incremental loads are scheduled to run every 30 minutes:
    0,30 * * * * <VHIST_Job_Location>/VHIST_ETL/setup/load_VHISTDW_run.sh <VHIST_Job_Location>
  3. Restart the cron daemon with the following command:
    $> service crond restart

Cautions for Incremental Loads

Take care when scheduling incremental loads so that you do not place undue demands on system resources or cause the data warehouse to grow too large. The amount of data stored in Vertica system tables depends on many factors, and the individual tables are not flushed at the same rate. Keep in mind the following:

  • To avoid running incremental loads more often than needed, start with daily loads, then review the results in the log tables. If there are gaps in the results, decrease the interval between loads until you find an optimal balance.
  • Repeated incremental loads increase the size of the data warehouse over time. The growth amount varies depending on system activity and frequency of collection.

    Note The data that you load into VHist counts towards the limit specified in your Vertica license.

  • You may need to increase the size of the heap available to Talend.

    Tip If you are using the Community Edition of Vertica, your license allows up to one terabyte of free storage. If you already have a licensed installation of Vertica, you can build the VHist warehouse using the Community Edition in a separate cluster.
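
To watch warehouse growth between loads, you can query Vertica's storage system table. The following is a sketch, assuming the standard v_monitor.projection_storage system table; schema-name casing is an assumption, so adjust it to match your database.

```sql
-- Report how much raw storage each VHist schema currently uses.
SELECT anchor_table_schema,
       SUM(used_bytes) / (1024*1024*1024) AS used_gb
FROM v_monitor.projection_storage
WHERE anchor_table_schema IN ('vhist', 'vhist_stage')
GROUP BY anchor_table_schema;
```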

Troubleshooting

Depending on the memory available in your environment and the amount of data that you are processing, you may need to increase the size of the heap that is available to Talend.

If you encounter this error, you need to increase the heap size:

exception:  java.lang.OutOfMemoryError: Java heap space

The default heap size for Talend is:

java -Xms256M -Xmx1024M

To increase the heap size for Talend:

  1. Edit the file:
    <VHIST_Job_Location>/VHIST_ETL/jobs/Load_VHISTDW/Load_VHISTDW_run.sh
  2. In the following statement, increase the values for -Xms and -Xmx to values that are reasonable for your environment.
    java -Xms256M -Xmx1024M

See the Talend documentation for more information.


Find More Information

  • Talend: www.talend.com
  • QuickStart video: https://www.youtube.com/watch?v=VNhXMQT9kk0&
  • Vertica and Talend integration: Vertica Integration with Talend: Tips and Techniques
  • VHist overview: http://my.vertica.com/docs/Ecosystem/QS_ETL_Vhist_overview.pdf
  • Vertica documentation: https://my.vertica.com/documentation/vertica/
  • Vertica Knowledge Base: https://my.vertica.com/kb/tkb-p
  • Developer Community Big Data and Analytics: https://my.vertica.com/big-data-analytics-community-content/
  • Big Data Marketplace: https://marketplace.saas.hpe.com/big-data

Contact Us


We welcome your feedback. If you have questions, comments, or suggestions, please contact us by clicking the Contact developer button on the Big Data Marketplace download page for the QuickStart.
