Vertica Quick Tip: Lightning Fast Text Search

Posted January 25, 2018 by Soniya Shah, Information Developer

This blog post was authored by Jim Knicely.

Searching the contents of a sizeable CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific keyword can be quite time consuming. Especially when dealing in Big Data.

Fortunately, Vertica includes a text indexing feature which allows you to query that data with mind blowing results!

Example: dbadmin=> \d big_varchar_table List of Fields by Tables Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key --------+-------------------+-------------+----------------+-------+---------+----------+-------------+------------- public | big_varchar_table | pk | int | 8 | | t | t | public | big_varchar_table | the_varchar | varchar(25000) | 25000 | | f | f | (2 rows) dbadmin=> \timing Timing is on. dbadmin=> SELECT COUNT(*) FROM big_varchar_table; COUNT ------------ 1000000000 (1 row) Time: First fetch (1 row): 315.768 ms. All rows formatted: 315.819 ms dbadmin=> SELECT COUNT(*) FROM big_varchar_table WHERE the_varchar LIKE '%GAEMO%'; COUNT ------- 703 (1 row) Time: First fetch (1 row): 82770.310 ms. All rows formatted: 82770.419 ms That took about 1 minute and 40 seconds.

Now let’s try a text index. dbadmin=> CREATE TEXT INDEX big_varchar_table_text_index ON big_varchar_table (pk, the_varchar); CREATE INDEX Time: First fetch (0 rows): 3432163.531 ms. All rows formatted: 3432163.531 ms dbadmin=> SELECT COUNT(*) FROM big_varchar_table_text_index WHERE token = v_txtindex.StemmerCaseInsensitive('GAEMO'); COUNT ------- 703 (1 row) Time: First fetch (1 row): 94.237 ms. All rows formatted: 94.296 ms Wow, querying the text index takes less than one second!

Now I can join the text index back to the source table to grab the complete contents of its text field… dbadmin=> SELECT pk, substr(the_varchar, 1, 35) "Small portion to fit on screen" dbadmin-> FROM big_varchar_table dbadmin-> WHERE pk IN (SELECT doc_id dbadmin(> FROM big_varchar_table_text_index dbadmin(> WHERE token = v_txtindex.StemmerCaseInsensitive('GAEMO')) dbadmin-> LIMIT 1; pk | Small portion to fit on screen ---------+------------------------------------- 4408169 | IGCUY OOWNS GAEMO DRJQK NOVXN DUBVY (1 row) Time: First fetch (1 row): 300.099 ms. All rows formatted: 300.174 ms Have fun!