Enter the Flex Zone – Under the Hood

Posted February 14, 2014 by Ben Vandiver

With Vertica’s latest release (Vertica 7 “Crane”), we introduced Vertica Flex Zone, based on the patent-pending flex tables technology, which dynamically adapts to whatever schema is present in the data. Flex tables offer dramatic usability improvements over regular tables. In this post, we take a look under the hood and show how flex tables are similar to regular Vertica tables, with a little pinch of magic thrown in.

The Tour Starts with Storage
Let’s take a peek at a flexible table:

=> create flex table tweets();
CREATE TABLE
=> \d tweets
 Schema | Table  |    Column    |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
--------+--------+--------------+------------------------+--------+---------+----------+-------------+-------------
 public | tweets | __identity__ | int                    |      8 |         | t        | f           |
 public | tweets | __raw__      | long varbinary(130000) | 130000 |         | t        | f           |
(2 rows)

At the heart of the implementation is the simple idea that we will store an entire record in __raw__, a single wide varbinary column. Any record provided, no matter the schema, can be stored – assuming that the column is wide enough to hold the whole record. From the example above, you can see that the default max size is around 128K. We’ve found that this size is sufficient for most use cases, but you can stretch it up to 32 MB if necessary. For performance reasons, Vertica includes the __identity__ column as the column on which to sort and segment the data.
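If your records run larger than the default, you can widen the __raw__ column after table creation. A sketch of how that looks, using Vertica's ALTER COLUMN syntax (the 1 MB size here is just an illustration; the hard ceiling is the 32 MB mentioned above):

```sql
-- Widen the __raw__ column so each record can hold up to ~1 MB.
-- Only the size changes; the type stays long varbinary.
ALTER TABLE tweets ALTER COLUMN __raw__ SET DATA TYPE long varbinary(1000000);
```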

So what does the __raw__ column actually contain? Let’s take a peek:
=> copy tweets from 'tweets.json' parser fjsonparser();
 Rows Loaded
-------------
         542
(1 row)

=> select __raw__ from tweets limit 1;
                                                               __raw__
----------------------------------------------------------------------------------------------------------------------------------------
 01000000H0000000400000024000000&0000008000000@00000023273424499939737623273424499939737638291201382912010400000024000000$0000008000000M000000delete.status.iddelete.status.id_strdelete.status.user_iddelete.status.user_id_str
(1 row)

First, notice that this isn’t the JSON value that we supplied as input. When you load a flex table, the parser reads each input record, parses it into a collection of key/value pairs, and writes it into an internal map format. Assuming you query the data at least once, it is worth parsing upfront to avoid re-parsing with each query. Vertica also does a little in-record indexing to make finding field values efficient. Additionally, the storage is now agnostic to the input format, freeing you from having to specify the input format upfront or locking the table to a specific format. Finally, Vertica’s storage engine has more flexibility with compression algorithms for storing the data efficiently on disk.

It’s a Kind of Magic

So why can I run “select text from tweets” despite the “text” column not existing in the table? Looking at the query plan shows us what we would expect, namely we are reading the __raw__ column:
=> EXPLAIN SELECT text FROM tweets;
Access Path:
+-STORAGE ACCESS for tweets [Cost: 21K, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Projection: public.tweets_super
| Materialize: tweets.__raw__
| Execute on: All Nodes

If you dig into the “GraphViz” (see http://www.graphviz.org/) query plan, you will discover that the actual query that is running is:
SELECT MapLookup(__raw__,'text') FROM tweets;

Flex tables have a special parse rule that rewrites unknown column references into calls to the MapLookup function. The MapLookup function takes a key and returns the value, or null if the key does not exist in the map:
=> SELECT maplookup(__raw__,'id') FROM tweets limit 1;
     maplookup
--------------------
 257000038482460672
(1 row)

=> SELECT maplookup(__raw__,'nonexistentkey') FROM tweets limit 1;
 maplookup
-----------

(1 row)
This little piece of magic is what lets you write vanilla SQL queries against data with unknown or varying schema.
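Because the values come out of the map as strings, you cast them when you need typed comparisons, and you double-quote keys that contain dots. A hedged example (the "user.screen_name" and "user.followers_count" keys are assumed here from the Twitter stream payload; substitute keys from your own data):

```sql
-- Virtual columns are returned as strings by the MapLookup rewrite,
-- so cast to int before a numeric comparison. Dotted keys need quotes.
SELECT "user.screen_name", "user.followers_count"::int AS followers
FROM tweets
WHERE "user.followers_count"::int > 1000
LIMIT 5;
```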

Maps to Treasure

Now that you know about the __raw__ column, we can play with a number of the map functions that come with flex tables. Here are a couple of examples:

Getting JSON back:

=> SELECT maptostring(__raw__) FROM tweets limit 1;
                   maptostring
-------------------------------------------------
 {
    "delete.status.id" : "230242346078457856",
    "delete.status.id_str" : "230242346078457856",
    "delete.status.user_id" : "524683721",
    "delete.status.user_id_str" : "524683721"
 }
(1 row)

Shredding a record into rows of key value pairs:

=> SELECT mapitems(__raw__) OVER (partition auto) FROM (SELECT __raw__ FROM tweets limit 1) a;
           keys            |       values
---------------------------+--------------------
 delete.status.id          | 232734244999397376
 delete.status.id_str      | 232734244999397376
 delete.status.user_id     | 38291201
 delete.status.user_id_str | 38291201
(4 rows)

Finding keys of all records, similar to computing contents of the keys table:
=> SELECT distinct key FROM (select mapkeys(__raw__) OVER (partition auto) as key from tweets) a ORDER BY key;
key
--------------------------
contributors
coordinates
coordinates.coordinates
coordinates.type
created_at
delete.status.id
delete.status.id_str
delete.status.user_id
...
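You don't have to write that query by hand: Vertica 7 ships helper functions that compute the keys table for you. A sketch, assuming the flex helpers from the Crane release (the keys table is created as <table>_keys in the same schema):

```sql
-- Scan the flex table and populate public.tweets_keys with each
-- distinct key, how often it occurs, and a guessed data type.
SELECT compute_flextable_keys('tweets');

-- Inspect the result, most common keys first.
SELECT key_name, frequency, data_type_guess
FROM tweets_keys
ORDER BY frequency DESC;
```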
Checking how close you are to overflowing the record size:
=> SELECT max(length(__raw__)) FROM tweets;
 max
------
 8675
(1 row)
Isn’t that a row store?

By storing the whole record in a single column, Vertica must read the entire record even if the query needs only a subset of the keys – the definition of a row-store database. Flex tables still perform respectably thanks to Vertica’s solid executor and distributed execution, but Vertica’s columnar tables are orders of magnitude faster. As described in a previous post, you accelerate flex table performance by promoting important keys to real columns in the table. These hybrid tables offer the best of both worlds: flexible schema, thanks to the __raw__ column, and columnar access to the most important data. You can see this for yourself by creating a hybrid flex table:

bvandiver=> create flex table logs (time timestamp, sessionid varchar, component varchar);
CREATE TABLE
bvandiver=> \d logs
 Schema | Table |  Column   |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key
--------+-------+-----------+------------------------+--------+---------+----------+-------------+-------------
 public | logs  | __raw__   | long varbinary(130000) | 130000 |         | t        | f           |
 public | logs  | "time"    | timestamp              |      8 |         | f        | f           |
 public | logs  | sessionid | varchar(80)            |     80 |         | f        | f           |
 public | logs  | component | varchar(80)            |     80 |         | f        | f           |
(4 rows)
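You can also promote columns after the fact instead of declaring them up front. A sketch of that workflow on the tweets table, assuming the flex helper functions shipped with Vertica 7 (the column count passed to the materialize call is just an illustrative choice):

```sql
-- First compute the keys table, then ask Vertica to promote the
-- most frequent keys (here, up to 3 of them) to real table columns.
SELECT compute_flextable_keys('tweets');
SELECT materialize_flextable_columns('tweets', 3);
```

After this, queries that touch the promoted keys read real columnar storage, while rarely-used keys still resolve through the __raw__ map.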

Conclusion

Hopefully this whirlwind tour through the back end of flex tables has given you a better understanding of how it all works and the performance tradeoffs we made. As usual, we encourage you to download our Community Edition and give it a try!