Tuesday, May 18, 2010

Dictionary data structure in IR

If using an array, it looks like:
term ---- document freq. ---- pointer to postings list
a       |       34594          | pointer
..
as     |    32423             |
..
zzz   | ....

The three parts are 'char[20], int, postings *', the sizes are 20 bytes, 4 bytes, and 4 bytes (or 8)

In IR systems, Hash table or Tree is used.
Hash table: each vocabulary term is hashed to an integer, the lookup time is O(1). The problems are: not easy to find minor variants, no prefix search (tolerant retrieval), and if vocabulary growing, need to rehash everything.

Trees: B-trees is more usual than binary tree. This solves the prefix problem, but the search time is O(logM).

No comments:

Post a Comment