Random Noise: Dictionary data structure in IR

If using an array, it looks like:
term ---- document freq. ---- pointer to postings list
a       |       34594          | pointer
..
as     |    32423             |
..
zzz   | ....

The three parts are 'char[20], int, postings *', the sizes are 20 bytes, 4 bytes, and 4 bytes (or 8)

In IR systems, Hash table or Tree is used.
Hash table: each vocabulary term is hashed to an integer, the lookup time is O(1). The problems are: not easy to find minor variants, no prefix search (tolerant retrieval), and if vocabulary growing, need to rehash everything.

Trees: B-trees is more usual than binary tree. This solves the prefix problem, but the search time is O(logM).

Random Noise

Tuesday, May 18, 2010

Dictionary data structure in IR

No comments:

Post a Comment

About Me

Categories

Blog Archive