myVertica  

Vertica Blog

Phrase Search with Vertica Text Search

This blog post was authored by Serge Bonte.

Vertica Text Search

Vertica already provides Text Search. Text Search allows you to quickly search the contents of a single CHAR, VARCHAR, LONG VARCHAR, VARBINARY, or LONG VARBINARY field within a table to locate a specific token.

Vertica implements that capability using a dedicated Text Index to store all the individual tokens from a source table’s text field.

In this blog post, we will explore how to enable sentence (text fragment) or phrase search (consecutive keywords) with a Vertica text index that is configured to track tokens’ position.

Create a Tokenizer

First, let’s instantiate a tokenizer. The role of a tokenizer is to break the text in a field into a sequence of tokens. Vertica provides a variety of prebuilt tokenizers, here we use a tokenizer that will also store the token position within the text. We also configure the tokenizer (via the major and minor separators settings) so that the text will be tokenized in both sentences and keywords. => CREATE TRANSFORM FUNCTION v_txtindex.rhyme_tokenizer AS LANGUAGE 'C++' NAME 'logWordITokenizerPositionFactory' LIBRARY v_txtindex.logSearchLib NOT FENCED; select proc_oid from vs_procedures where procedure_name = ' rhyme_tokenizer'; ---> 45035996308658904 SELECT v_txtindex.SET_TOKENIZER_PARAMETER('stopwordscaseinsensitive','for,the' USING PARAMETERS proc_oid='45035996308658904'); SELECT v_txtindex.SET_TOKENIZER_PARAMETER('majorseparators', E'{}()\r\n&[].' USING PARAMETERS proc_oid='45035996308658904'); SELECT v_txtindex.SET_TOKENIZER_PARAMETER('minorseparators', E' ' USING PARAMETERS proc_oid='45035996308658904'); SELECT v_txtindex.SET_TOKENIZER_PARAMETER('used', 'True' USING PARAMETERS proc_oid='45035996308658904');

Create a Text Index

Let’s use a simple fact table of nursery rhymes sentences. => CREATE TABLE rhymes(id int primary key not null, text1 varchar); => INSERT into rhymes values(1,'Humpty Dumpty sat on a wall'); => INSERT into rhymes values(2,'Humpty Dumpty had a great fall'); => INSERT into rhymes values(3,'they could not put Humpty together again'); => INSERT into rhymes values(4,E'Humpty Dumpty sat on a wall.\nHumpty Dumpty had a great fall.');

Next, let’s create a text index for that table using the tokenizer we previously created. create TEXT INDEX idx_rhymes on rhymes (id, "text1") tokenizer v_txtindex.rhyme_tokenizer (long varchar) STEMMER NONE; If you look at the index, you will notice that at position zero we have two tokens, the “sentence” (e.g. Humpty Dumpty sat on a wall) and the first word (e.g. Humpty)

Token Search

You can use the Vertica text index to quickly locate all rhymes that contain the token wall. dbadmin=> SELECT id,text1 FROM rhymes a WHEREa.id in (SELECT a.doc_id FROM idx_rhymes a WHERE a.token='wall'); id | text1 ----+-------------------------------------------------------------- 1 | Humpty Dumpty sat on a wall 4 | Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. (2 rows)

Sentence Search

Because we configured our tokenizer to first separate on sentences (majorseparators), we can quickly locate all rhymes containing the sentence: “Humpty Dumpty sat on a wall” dbadmin=> SELECT id,text1 FROM rhymes a WHERE a.id in (SELECT a.doc_id FROM idx_rhymes a WHERE a.token='Humpty Dumpty sat on a wall'); id | text1 ----+-------------------------------------------------------------- 1 | Humpty Dumpty sat on a wall 4 | Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. (2 rows)

Phrase Search

Because we also tokenize on words (space is a minor separator) and we also keep track of the words positions, we can also search on the phrase “great fall” as follows and get the two rhymes containing that phrase (i.e. token “great” right before token “fall”): dbadmin=> SELECT id,text1 FROM rhymes a WHERE a.id in (SELECT a.doc_id FROM idx_rhymes a, idx_rhymes b WHERE a.doc_id=b.doc_id and a.token='great' and b.token='fall' and a.position < b.position); id | text1 ----+-------------------------------------------------------------- 2 | Humpty Dumpty had a great fall 4 | Humpty Dumpty sat on a wall. Humpty Dumpty had a great fall. (2 rows)