Thursday, May 20, 2010

wildcard query: permuterm index and k-gram index

Let “$” mark the end of a term such that “hello” –> “hello$”. The permuterm entry for term “hello” are:

hello$, ello$h, llo$he, lo$hel, o$hell, $hello

In a wildcard query “m*n”, rotate the query so that the “*”appears at the end of the string. So “m*n” changes to “n$m*”. The look up this string in the permuterm index. Seeking “n$m*” leads the rotations of the terms “man” and “moron”.

When query X, look up on X$, query X*, look up on $X*

*X –> X$*, *X* – > X*, X*Y –> Y$X*

That means: rotate wildcard query to the right and use B-tree to lookup.

permuterm index need a lot of space.

--

In a k-gram index, the dictionary contains all k-grams that occur in any term in the vocabulary. For example, the 3-gram “etr”would point to terms sucha as: retrieval, petrify, metric, beetroot.

Use “$”as a special word boundary symbol. The 2-grams in text “the month” are: $t, th, he e$, $m, mo, on, nt, th, h$”

In wildcard query “mon*”, run as: $m AND mo AND on

No comments:

Post a Comment