Enter the Flex Zone – An Exploration in Processing Twitter Data

Posted January 13, 2014 by Ben Vandiver

With our Vertica 7 release, we announced Vertica Flex Zone, a new product offering that simplifies the way you consume and explore semi-structured data, such as Web logs, sensor data, and other emerging data types. In this first “Flex Zone Friday” post, let’s look at how you can use Vertica Flex Zone to get a leg up on your latest data analysis problem, using Twitter data as our sample.

Tweet child of mine

For this example, let’s look at Twitter data. Tweets are an interesting source because the data volume is large, the tweet format is typically JSON, and many companies are mining tweet data for insights into how the community perceives their brand, product, or service. A tweet might look like the following:
{"id":257000038495043584,"text":"Let's play uno","created_at":"Sat Oct 13 06:09:18 2012","user":{"name":"scout",    ;"lang":"en",    "followers_count":521,...}...}
It has the tweet text itself, with a lot of interesting metadata. Note that much of the metadata is stored in sub-structures like the “user” field.

You can acquire Twitter data quite easily by making requests to the Twitter API. For the purposes of following along here, you can download a small sample of tweets: tweets.json.gz

Look Ma: No Columns!

Typically, loading this type of data into a relational database would require either transforming the raw data into something more easily consumable or building a custom Twitter data loader. With Vertica Flex Zone, you can skip the schema definition phase and load the JSON directly into the database, saving you hours of time:
CREATE FLEX TABLE tweets ();
COPY tweets FROM '/path/to/tweets.json.gz' GZIP PARSER fjsonparser();
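Before moving on, it’s worth a quick sanity check that the load worked. A minimal sketch, assuming the standard flex table layout in which the data lands in a raw VMap column named __raw__ and maptostring() renders it as readable text:
select count(*) from tweets;
select maptostring(__raw__) from tweets limit 1;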
As easy as “SELECT … FROM T”

Now we can just write SQL queries against the data, as if it were naturally loaded into columns. For example, to discover what the average tweet length by language is:
select "user.lang",avg(length(text))::int from tweets group by "user.lang" order by 2 desc;
user.lang | avg
------------+--------
       de | 119
       ar | 108
       es | 80
       en | 67
       it | 66
       tr | 66
       nl | 64
       id | 61
       fr | 60
       ru | 57
       ja | 44
       pt | 43
       ko | 40
       th | 22

From the SQL point of view, it doesn’t matter that the source data was JSON. The only concession is the column names themselves: we do need to double-quote column names that contain the ‘.’ character.
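For example, the top-level text column can be referenced bare, while a nested key such as user.screen_name needs the quotes (an illustrative query, not from the original walkthrough):
select text, "user.screen_name" from tweets limit 3;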

Let’s make a Schema

But how can you tell what columns are available? The Vertica Analytics Platform has a couple of mechanisms for discovering which columns are available and making them explicit.
select compute_flextable_keys_and_build_view('tweets');
compute_flextable_keys_and_build_view
--------------------------------------------------------------------------------------------------
Please see public.tweets_keys for updated keys.

The view public.tweets_view is ready for querying
(1 row)
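If you would rather run the two steps separately, for example to recompute keys after loading more data without immediately rebuilding the view, Vertica exposes them as individual functions as well; as best we recall, the calls look like this:
select compute_flextable_keys('tweets');
select build_flextable_view('tweets');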

The tweets_keys table now contains a row for each available column in the data. For example, to find all the user-related count columns:
select key_name from tweets_keys where key_name::varchar ilike 'user%count';
     key_name
-----------------------
 user.statuses_count
 user.listed_count
 user.friends_count
 user.followers_count
 user.favourites_count
(5 rows)
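The keys table also records how often each key occurs and a guessed data type, which is handy for spotting sparse or oddly typed columns. A small sketch, assuming the keys table columns are named frequency and data_type_guess:
select key_name, frequency, data_type_guess from tweets_keys order by frequency desc limit 5;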

Armed with these new columns, we can compute some statistics about the followers in our dataset:
select min("user.followers_count"::int),avg("user.followers_count")::int,max("user.followers_count"::int) from tweets;
min | avg | max  
-----+------+--------
  0 | 1968 | 253435
(1 row)

Clearly someone is tweeting to the void. And we probably have some observer bias in our small sample: there’s probably a correlation between tweeting frequency and follower count.
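If you’re curious just how many accounts are shouting into the void, a quick follow-up query (not from the original walkthrough) counts them directly:
select count(*) from tweets where "user.followers_count"::int = 0;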

In addition to building the keys table, the function also whipped up a view that selects all available columns, which can be handy to see what is in the table. Tweets are quite large:
select * from tweets_view limit 1;
contributors | created_at | entities.hashtags | entities.urls | entities.user_mentions | favorited | id | id_str | in_reply_to_screen_name | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | retweet_count | retweeted | source | text | truncated | user.contributors_enabled | user.created_at | user.default_profile | user.default_profile_image | user.description | user.favourites_count | user.follow_request_sent | user.followers_count | user.following | user.friends_count | user.geo_enabled | user.id | user.id_str | user.is_translator | user.lang | user.listed_count | user.location | user.name | user.notifications | user.profile_background_color | user.profile_background_image_url | user.profile_background_image_url_https | user.profile_background_tile | user.profile_image_url | user.profile_image_url_https | user.profile_link_color | user.profile_sidebar_border_color | user.profile_sidebar_fill_color | user.profile_text_color | user.profile_use_background_image | user.protected | user.screen_name | user.statuses_count | user.time_zone | user.url | user.utc_offset | user.verified | coordinates | geo | place | user.profile_banner_url | delete.status.id | delete.status.id_str | delete.status.user_id | delete.status.user_id_str | retweeted_status.contributors | retweeted_status.created_at | retweeted_status.entities.hashtags | retweeted_status.entities.urls...
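In practice you will usually pick a handful of columns out of the view rather than select everything; something along these lines keeps the result readable (the column names come straight from the view above, though the types the view assigns may vary, hence the explicit cast):
select "user.screen_name", "user.followers_count", text
  from tweets_view
 order by "user.followers_count"::int desc
 limit 5;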

A key differentiator of Vertica Flex Zone is that you can explore and visualize this semi-structured data with your preferred BI and visualization tools. Those tools can work directly against the generated views, which expose metadata about the flex table in a form the tools already understand. For example, we used Tableau to generate a report against tweets_view:

(Figure: Tableau report generated against tweets_view)

And now for something completely different

It turns out that the tweet stream includes two types of records: tweets and deletes. Deletes are revocations of tweets and they have the following structure:
{"delete":{  "status":{   "user_id_str":"228092142",   "id_str":"218710262428073984",   "id":218710262428073984,   "user_id":228092142  } }}

Our earlier load put both record types into the same table. So how did the previous queries work? Well, if you reference a column that doesn’t exist in a record, you get NULL. We can use this to find out how many deletes we loaded:
select count(*) from tweets where "delete.status.id" is not null;
 count
-------
   118
(1 row)
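Since count() on a column skips NULLs, you can tally both record types in a single pass: count(text) ignores the delete records because their text field is NULL, and vice versa. A small follow-on example (the column aliases are just names we picked):
select count(text) as tweet_count, count("delete.status.id") as delete_count from tweets;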

Lots yet to come

We haven’t yet talked about a number of key things like column promotion (performance!), dynamic schema, or substructure. Here’s a teaser about substructure: the JSON input is quite complex, and the Vertica Analytics Platform loaded all of it. You can descend into the nested structures and get the most out of the data. The following query computes the top hashtags in the last month:
select values, count(*)
  from (select mapitems("entities.hashtags") over ()
          from tweets
         where created_at::timestamp > now() - interval '1 month') a
 where keys::varchar like '%text'
 group by values
 order by count desc
 limit 10;
              values                | count
------------------------------------+-------
 10ThingsThatGetsOnMyNerves         |     2
 Massa                              |     1
 ???????????????                    |     1
 Up                                 |     1
 ????????????????                   |     1
 Turn                               |     1
 50ThingsAboutMyBoyfriend           |     1
 askenchancer                       |     1
 always                             |     1
 kaburrr                            |     1
(10 rows)

Note: If you pulled recent data from Twitter, this will work without modification; if you used our data set, use interval ‘3 years’ (or more).
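If you want to inspect the raw key/value pairs inside the hashtag substructure before filtering them, the same mapitems() pattern from the query above works on its own, stripped down:
select keys, values
  from (select mapitems("entities.hashtags") over () from tweets) a
 limit 10;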

We encourage you to give Vertica Flex Zone a test drive and actively participate in our community site! Stay tuned for our next Flex Zone Friday blog post.