Last week we announced the open-source release of vertica.dplyr, an R package that simplifies accessing the fast and scalable SQL analytics of HP Vertica by enabling the popular semantics of dplyr.
Today, our focus turns to leveraging vertica.dplyr to prepare data for machine learning and predictive analytics workloads, which often involve significant data preparation (this interesting NYTimes big-data article reports an estimate of the proportion of time spent on data preparation to be about 50 to 80%).
HP Vertica simplifies the process of collecting and preparing data, with features such as schema-less access to data using flex tables and support for loading and/or reading data from popular formats such as ORC, Parquet, Avro, and CSV files at big-data scale.
I am really excited to see vertica.dplyr bring these powers of HP Vertica to the simple-to-use dplyr package. In this blog, I will take a customer-response modeling scenario in a bank marketing dataset that’s available in the UCI machine learning repository. Using this example, I will illustrate how to use vertica.dplyr to perform data munging in HP Vertica using only the R language. I will also show how vertica.dplyr simplifies the data flow from HP Vertica to Distributed R by leveraging the native connector that optimizes data transfer. It optimizes data transfer by maintaining data locality between HP Vertica and Distributed R worker nodes.
Altogether, this approach greatly reduces the need for writing R scripts to operationalize predictive analytics workflows.
We will leverage HP Vertica, vertica.dplyr and Distributed R to build and end-to-end predictive analytics pipeline.
To achieve this, we will:
- Collect data
- Load data into HP Vertica
- Pre-process the data in HP Vertica using vertica.dplyr
- Split the processed data into training and testing sets
- Load the training data directly into Distributed R data structures using vertica.dplyr
- Train the model using the distributed version of GLM in Distributed R
- Deploy the R logistic-regression model to Vertica
- (Not shown) use the deployed R model to run in-database SQL prediction on the testing set
Step 1: Collect data
This dataset (http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing) contains quantitative and categorical features that describe the targeted customers of a marketing campaign carried out by a Portuguese banking institution, as well as the conditions under which contact was made. In this context, the business goal of generating a predictive model is to optimize future marketing campaigns by identifying the customers who are likely to respond to product offers.
We use bank_additional_full.csv, which contains 45k rows and 20+ columns. Note that this is a very small dataset; in the real-world, big datasets involve billions of rows and hundreds of columns, to which the techniques outlined below still apply.
Step 2: Load the data into HP Vertica
Creating a table in HP Vertica with vertica.dplyr is as simple as defining an R vector of data types, assigning to that vector names corresponding to the column names of the table, and then using the db_create_table:
columns <- c( "int","varchar","varchar", "varchar", "varchar", "varchar", "varchar","varchar", "varchar", "char(5)","int","int","int","int","varchar","float
names(columns) <- c("age","job","marital","education","defaulted","h
Once the destination table is created, we can load the data from a CSV into HP Vertica using db_load_from_file, specifying a delimiter, a table name, and a source file:
orig <- db_load_from_file(vertica, file.name= "/home/dbadmin/bank-additional/bank-additional-ful
By default, db_load_from_file uses the internal HP Vertica COPY command to handle large datasets. If the use of this feature is not desired, we may set fast.load=FALSE to perform data loading using INSERT INTO statements.
Step 3: Data preparation using vertica.dplyr
Besides typical data preprocessing steps like cleaning and formatting, there may be other necessary transformations involved to make the data work with machine-learning algorithms. For example, knowledge of the problem domain influences how a data scientist might define new features in the data.
At a high level, there are three common transformations performed on data before applying algorithms — scaling (aka. normalization, binning), feature decomposition, and aggregation.
In this example, decomposition and aggregations are already done; columns such as month, last contact month, year, and day_of_week are most likely derived from an original date column. pdays, a column which represents the number of days since the client was last contacted, was most likely derived from performing aggregate queries on other data .
In this example, I will use vertica.dplyr to perform normalization, as well as the reduction of domain values in high-cardinality columns.
Looking at the banking dataset, we recognize a couple of things:
- There are several quantitative variables, which range over different scales (e.g,. age ranges over 20-60 while nr.employed has values in the thousands). We should normalize these variables by calculating their z-scores
- Various categorical variables have values that contain low-frequency occurrences (e.g., job seems to have many blue-collar, technician, services, and management values, but professions like housemaid occur infrequently). It is better to reclassify low-frequency occurrences into categories labeled other or miscellaneous.
For the first point, summaries allows us to invoke aggregate functions on columns in the database. Note how we invoke these in R-style (using the function names mean and sd); under the hood, this action is translated into their HP Vertica equivalents, AVG and STDDEV.
m_sd <- summarise(orig,m_age=mean(age),std_age=sd(age),m_c
Following the computation of the means and the standard deviations, we use mutate to define new columns as functions of existing ones (collect moves results from HP Vertica to the R session). Then we select the columns we want to proceed with (rejecting the non-normalized data in the original table).
z <- collect(m_sd)
normalized <- mutate(orig,age_z=(age-z[["m_age"]])/z[["std_age"]
norm_table <- select(normalized,age_z,job,marital,education,defa
We again use the mutate command, but this time, instead of computing the normalized scores, we run the HP Vertica DECODE function, which allows us to redefine values of columns according to a search and replace pattern.
Note that dplyr can convert decode into SELECT AS DECODE( ). This means that it provides simple compatibility for the invocation of HP Vertica analytic functions (vertica.dplyr also supports invocation of Window functions, providing parameters for specifying width, ordering, and partitioning).
decoded <- mutate(norm_table,job=decode(job,'"admin"',"admin"
We can use db_save_view to save our intermediate results (decoded in the above R session) to a view in HP Vertica.
hpdglm requires that all data be numerical, so we convert the categorical data into integers using a cat2num.R script:
Step 4: Divide the data into testing and training sets
Next, we can split our data set into training and testing sets using filter with random(). setdiff (consistent with the analogous function in R-base) allows us to form the training set from the set by taking the set difference between the testing set and the original set.
The use of setdiff is translated into an EXCEPT in SQL, and random() calls HP Vertica’s internal RANDOM() function.
compute will save the results into tables in HP Vertica.
top_tbl <- tbl(vertica,"bank_top_n_num")
testing_set <- filter(top_tbl,random() < 0.2)
testing_set <- compute(testing_set,name="testing_set")
training_set <- compute(setdiff(top_tbl,testing_set),"training_set")
Step 5: Load the data into Distributed R using tbl2darrays
Since we are going to use Distributed R to train our logistic regression model, let’s make use of another vertica.dplyr utility function for Distributed R: tbl2darrays.
tbl2darrays is a wrapper function around HPdata’s db2darrays function. This handy function allows you to move from a dplyr tbl_sql to a Distributed R object in one step!
LoadedData <- tbl2darrays(training_set,resp=list('y_1'))
Step 6: Train the model using hpdglm
Once the data are loaded into Distributed R, we can run the algorithm. hpdglm is a high-performance, high-scalability logistic regression algorithm for Distributed R:
theModel <- hpdglm(responses=LoadedData$Y, predictors=LoadedData$X, family=binomial)
Step 7: Deploy the model into Vertica
Finally, we deploy the constructed model using HPdata’s deploy.model function:
deploy.model(model=theModel, dsn='VerticaDSN', modelName='demoModel', modelComments='A logistic regression model for bank data')
Step 8: Use the model for prediction
In HP Vertica, the model can now be used to predict on the remaining testing data! Experiments show that this technique results in a roughly 90% accurate prediction rate.
A fully R solution
As demonstrated with the above example, vertica.dplyr further strengthens the partnership between R and Vertica in the following ways:
- It provides an easy way to perform data transformations in HP Vertica from R, using R-only syntax.
- It utilizes the tbl2dobject functions, which provide a one-step way of converting a dplyr tbl_sql into a Distributed R distributed object, where it is ready for Distributed R’s powerful machine-learning algorithms.
As vertica.dplyr is still in beta and in active development, we very much look forward to providing additional functionality in support of Distributed R and HP Vertica; the future looks bright.
The demo shown in this post can be found here (where the same dataset is used, but without vertica.dplyr): https://github.com/vertica/DistributedR/tree/master/demo/example/hpdglm