If using an array, it looks like:
term ---- document freq. ---- pointer to postings list
a | 34594 | pointer
..
as | 32423 |
..
zzz | ....
The three parts are 'char[20], int, postings *', the sizes are 20 bytes, 4 bytes, and 4 bytes (or 8)
In IR systems, Hash table or Tree is used.
Hash table: each vocabulary term is hashed to an integer, the lookup time is O(1). The problems are: not easy to find minor variants, no prefix search (tolerant retrieval), and if vocabulary growing, need to rehash everything.
Trees: B-trees is more usual than binary tree. This solves the prefix problem, but the search time is O(logM).
Tuesday, May 18, 2010
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment