myVertica  

Vertica Blog

Announcing vertica.dplyr: Part 1

Today’s data scientists are challenged by large volumes of data from a number of data stores, such as HDFS, enterprise data warehouses, data marts, and external data APIs. They must spend a significant amount of time organizing data in an understandable manner and looking for patterns to discover an inference or to find a solution to a specific problem.

Toward that end, data scientists prefer languages and tools that allow them to do end-to-end analysis, that allow them to accomplish tasks ranging from preliminary data preparation to applying a wide array of algorithms and statistical techniques.

Among the possible choices, it is no surprise to see R, SQL, and Python leading the pack.

According to a poll from KDnuggets, nearly half of all respondents identified R as a primary tool for data analytics. Furthermore, around 30% responded that they use SQL, since large data often resides in relational databases.

dplyr1

Poll from KDnuggets

As a proud member of the HP Distributed R development team, I fall squarely into the R camp. I’’ve grown quite comfortable with all of R’’s little syntactical quirks, and, when working with HP Distributed R’’s capabilities, I don’’t want to leave my R session, nor do I want to think about any language other than R.

However, as anyone who’’s worked with data for any length of time can tell you, staying entirely within one environment is not always possible. With only a few gigabytes of system memory, what if I’’ve got terabytes of structured data to “data prep”, and don’’t want to give up the convenience of using R’’s data frame? Doesn’’t seem doable, right? You may be thinking of some out-of-memory, chunk-wise, external approach, or something that doesn’’t use R altogether. As it turns out, with vertica.dplyr, you can have your cake and eat it too. At least, you can if you use HP Vertica.

dplyr

Ever the prolific contributor to R, Hadley Wickham released dplyr (currently in beta) earlier this year. As the successor to the popular plyr (identified by KDnuggets as one of the top-5 most popular R packages), dplyr promises high-performance and convenient manipulation of data in R, with a special emphasis on data frames.

With dplyr, I can form complex pipelines of operations by chaining together “verbs” (for example filter, arrange, select, mutate, and so forth) that transform data by filtering, reordering, selecting, and adding rows and columns to my data.

For example, let’’s say I imported some baseball data from the “Lahman” R package, available on CRAN:

library(Lahman)
> nrow(Salaries)
[1] 23956
head(Salaries)
yearID teamID lgID playerID salary
1   1985   BAL   AL murraed02 1472819
2   1985   BAL   AL lynnfr01 1090000
3   1985   BAL   AL ripkeca01 800000
4   1985   BAL   AL lacyle01 725000
5   1985   BAL   AL flanami01 641667
6   1985   BAL   AL boddimi01 625000

In this 24k observation dataset, I’’ve got 5 columns of data, giving information about each player’’s team, league, and his salary for that year. Now suppose (just for demonstration’’s sake) that I want to look only at the salary data from before 1987 for players who played for the California Angels (CAL), and who made more than 700k in that year. I don’’t care about the other columns, so I want to see only the players’’ names and their salaries. Finally, I want to order the rows from the highest-earning player/year combination to least earning.

To accomplish that, I could then do the following:

> Salaries %>% filter(yearID < 1986,teamID == 'CAL',salary>700000) %>% arrange(desc(salary)) %>% select(playerID,salary)
playerID salary
1 decindo01 1100000
2 jacksre01 1058894
3 boonebo01 883000
4 carewro01 875000
5 burleri01 856667
6 downibr01 808333
7 suttodo01 750000
8 hendrge01 737000

dplyr has compiled C++ code, which enables faster execution of these operations from within R. The above operations are just a small sample of what I can easily express using dplyr (for more information, check out the dplyr website.

That’’s cool! But what does this have to do with HP Vertica? What’’s vertica.dplyr?

As mentioned before, R users do not always have the luxury of having everything sitting directly inside of the R session; data may reside in various external data stores, including SQL-databases.

In our case, that entity is HP Vertica.

For those of you familiar with SQL, you may be thinking, “Yes, but I can do the same in SQL by using a single SELECT statement! Why should I care about vertica.dplyr?”

Yes, you could very well do it in SQL if you are proficient in it. R provides an easy way to execute SQL over either JDBC or ODBC using the popular packages RJDBC and/or RODBC. For those of us who prefer a more ‘‘R’’ approach for data transformations, however, the dplyr package provides a comprehensive framework to perform such work either in-memory or in a SQL-database.

To adapt to the unique implementations of different databases, dplyr requires adapter packages for each backend type.

Available now on Github (currently without support – visit the forums for help), vertica.dplyr is an HP Vertica adapter for dplyr. Supporting both JDBC and ODBC, I can easily make a connection to my HP Vertica instance and have the same R-style fun (from the previous section) with my data in HP Vertica, without having to transport the data back into R.

Using the vertica.dplyr package, the key benefit that HP Vertica users can enjoy is the ability to use their favorite R language to handle data transformations in HP Vertica at big-data scale without moving data to memory. Without sacrificing the convenience of the R data frame, dplyr also exposes the concept of a “tbl_sql” object, which, to the happy R user, behaves similarly to an R data frame. And this means–you guessed it–all of those dplyr verbs work on them as well.

So I set up my own private HP Vertica instance in my own virtual machine:

dbadmin => dt
List of tables
Schema |       Name         | Kind | Owner | Comment
--------+---------------------+-------+---------+---------
public | Salaries           | table | dbadmin |
(1 row)

This is the same data as the example used earlier. Now from my R session:

> library(vertica.dplyr)
>
>
> vertica
>
> sal
>
> sal
Source: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 07.01.0002
-----+ODBC Version: 03.80
From: Salaries [23,956 x 5]
yearID teamID lgID playerID salary
1   1985   ATL   NL barkele01 870000
2   1985   ATL   NL bedrost01 550000
3   1985   ATL   NL benedbr01 545000
4   1985   ATL   NL campri01 633333
5   1985   ATL   NL ceronri01 625000
6   1985   ATL   NL chambch01 800000
7   1985   ATL   NL dedmoje01 150000
8   1985   ATL   NL forstte01 483333
9   1985   ATL   NL garbege01 772000
10   1985   ATL   NL harpete01 250000

What did I do? I loaded up the R package, made a connection to HP Vertica, and created a “tbl_sql” object from the “Salaries” table in the DB.

You may be wondering what’’s actually happening under the hood. When I print the “sal” variable from within R, vertica.dplyr executes a SELECT statement in the database and bringing it back:

 > sal$query$sql
SELECT yearID, teamID, lgID, playerID, salary
FROM Salaries

The code above executes the command, and causes the data to be brought back over the ODBC connection. But it doesn’’t happen if I don’’t print the variable. That’’s another benefit you get from using vertica.dplyr. Nothing happens in the database until you actually look at what you’’ve done (in this case, printing the dplyr “tbl_sql” object), or tell dplyr to collect the data back into R. In fact, each time you act upon a “tbl_sql” object with dplyr, you simply amend the underlying SQL statement associated with that object.

Let’’s revisit our filter-arrange-select chain operation on the Lahman baseball data, but instead this time do it with vertica.dplyr and Vertica:

> ops % filter(yearID < 1986,teamID == 'CAL',salary>700000) %>% arrange(desc(salary)) %>% select(playerID,salary)

So what did vertica.dplyr do in this case? If you thought, “ran a query in HP Vertica corresponding to the dplyr operations”, you’’re wrong. If you instead thought, “constructed a SQL query corresponding to the dplyr operations”, you’’re right. Importantly, this means that dplyr allows you to create efficient chains of operations with ease without unnecessarily triggering work inside of the database.

Take a look at the actual SQL constructed:

> ops$query$sql
SELECT playerID AS playerID, salary AS salary
FROM Salaries
WHERE yearID < 1986.0 AND teamID = 'CAL' AND salary > 700000.0
ORDER BY salary DESC

Because I stored the tbl_sql object into “ops”, nothing executed in the database. This allows for even more chaining and working with the data.

For example, if I wanted to take this result and do something else, say, create a column, I would run the mutate command:

> ops <- mutate(ops,half_salary=salary/2)

And still, there would be no DB activity. All we’’ve done is change the SQL even further:

> ops$query$sql
SELECT playerID AS playerID, salary AS salary, salary / 2.0 AS half_salary
FROM Salaries
WHERE yearID < 1986.0 AND teamID = 'CAL' AND salary > 700000.0
ORDER BY salary DESC

It’’s not until I print the variable that the execution is coerced to happen in HP Vertica:

> ops
Source: Vertica ODBC Connection
-----+DSN: VerticaDSN
-----+Host: 127.0.0.1
-----+DB Version: 07.01.0002
-----+ODBC Version: 03.80
From: Salaries [8 x 3]
Filter: yearID < 1986, teamID == "CAL", salary > 7e+05
Arrange: desc(salary)
playerID salary half_salary
1 decindo01 1100000   550000.0
2 jacksre01 1058894   529447.0
3 boonebo01 883000  441500.0
4 carewro01 875000   437500.0
5 burleri01 856667   428333.5
6 downibr01 808333   404166.5
7 suttodo01 750000   375000.0
8 hendrge01 737000   368500.0

So here we see the convenience and efficiency of performing database operations in R using vertica.dplyr. Given its lazy execution, these operations are not only easy to express to the R user, they are actually efficient and avoid repeated ODBC commands.

Though the examples shown here haven’’t exactly resulted in super-complex generated SQL, given more operations, you could construct a complex one fairly easily, given the intuitiveness of dplyr operation chains. To accomplish the same efficiency using pure SQL, you would have to be comfortable with crafting such a lengthy statement in one go.

Furthermore, this is really useful for any R user who would like to perform data analysis or preparation in HP Vertica from entirely within R, without the need for loading all of the data into the R session nor any knowledge/use of SQL.

The fun doesn’’t stop there

What if I wanted to go in the other direction, and copy an existing data frame in R into Vertica? That’’s doable with vertica.dplyr using the copy_to command. What about, if, upon finishing my prep operations, I wanted to move the data into R’’s memory? I’’d just have to run collect, which will transform the data into a data frame in R.

Also possible with this R package: explaining queries, copying data into HP Vertica from a CSV file, and creating new tables and views. We can also easily run HP Vertica analytic functions on our “tbl_sql” objects using vertica.dplyr.

But wait! There’’s more……

Of course, with HP Distributed R, we saw more opportunity to integrate HP Vertica and Distributed R using vertica.dplyr, to create a fully-integrated R and HP Vertica scalable analytics solution. Stay tuned for Part 2 of this blog post, where we discuss and show an end-to-end predictive analytics demo using vertica.dplyr, HP Vertica, and Distributed R.

For more information, download the package at: https://github.com/vertica/vertica.dplyr

A detailed user guide is available in the form of a vignette under vignettes/.

For more information on dplyr, please visit: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html