Monday, May 17, 2010

Information Retrieval (IR)

Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections.
IR vs. databases: unstructured vs. structured data. Structured data tends to refer to information in 'tables'; unstructured data typically refers to free text, which allows keyword queries (including operators) and more sophisticated 'concept' queries.

Indexer steps produce the dictionary, the postings lists, and document frequency information (document frequency is used for query optimization).
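As a rough, illustrative sketch (not from the original notes), a minimal in-memory indexer could build the dictionary, the sorted postings lists, and the document frequencies like this in Python:

from collections import defaultdict

def build_index(docs):
    # docs: dict mapping docID -> list of tokens.
    # Returns: dict mapping term -> (document frequency, postings list sorted by docID).
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {term: (len(ids), sorted(ids)) for term, ids in postings.items()}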
Query steps: for example "aaa AND bbb".
  1. locate aaa in the dictionary, retrieve its postings (postings sorted by docID)
  2. locate bbb in the dictionary, retrieve its postings
  3. merge the two postings: walk through the two postings lists simultaneously; if the list lengths for aaa and bbb are x and y, the merge takes O(x+y) operations.
while p1 != NULL && p2 != NULL
do  if docID(p1) == docID(p2)
      then add docID(p1) to answer
           p1 = next(p1); p2 = next(p2)
      else if docID(p1) < docID(p2)
             then p1 = next(p1)
             else p2 = next(p2)
For a general query (aaa AND bbb AND ccc), we should start with the term that has the smallest document frequency, then keep cutting the intermediate result further by processing terms in order of increasing document frequency; a sketch follows below.
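A runnable Python version of the merge, plus the smallest-document-frequency-first ordering for multi-term AND queries (build_index and the index layout are the illustrative assumptions from the sketch above):

def intersect(p1, p2):
    # Walk the two sorted postings lists simultaneously: O(x + y) operations.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(index, terms):
    # Process terms in order of increasing document frequency so the
    # intermediate result shrinks as quickly as possible.
    ordered = sorted(terms, key=lambda t: index[t][0])
    result = index[ordered[0]][1]
    for term in ordered[1:]:
        result = intersect(result, index[term][1])
    return result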

Basic indexing pipeline: Documents to be indexed -> (Tokenizer) -> Token stream -> (Linguistic modules) -> Modified tokens -> (Indexer) -> inverted index

Tokenization
A token is an instance of a sequence of characters; each token is a candidate for an index entry. (Input: "friends, romans and countrymen"; output tokens: friends, romans, countrymen.)
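A deliberately naive tokenizer that reproduces this example (real tokenizers must also handle hyphens, apostrophes, numbers, language-specific issues, and so on):

import re

def tokenize(text):
    # Split on any run of non-alphanumeric characters and drop empty pieces.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

tokenize("friends, romans and countrymen")
# -> ['friends', 'romans', 'and', 'countrymen']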

Normalization to terms
A term is a (normalized) word type, which is an entry in the IR system's dictionary. Indexed text and query terms must be normalized into the same form (e.g., "U.S.A." and "USA" should match).
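As a toy example of such normalization (only case-folding and dropping periods; real systems use far richer rules):

def normalize(token):
    # Case-fold and remove periods so that, e.g., "U.S.A." and "usa" map to the same term.
    return token.lower().replace(".", "")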

Stemming suggests crude affix chopping ("compressed", "compression", ... -> "compress"); Porter's algorithm is the most common stemmer for English.
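If the NLTK package happens to be available, Porter's algorithm can be tried directly (this library call is an assumption, not something from the original notes):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("compressed"), stemmer.stem("compression"))
# -> compress compress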

Biword indexes for phrases: index each pair of consecutive terms as its own dictionary entry, as in the sketch below.
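A small sketch of producing biword terms from a token stream (the pairing scheme here is the standard one: every pair of consecutive tokens becomes a single dictionary entry):

def biwords(tokens):
    # Each consecutive pair of tokens becomes one "phrase term" in the dictionary.
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

biwords(["friends", "romans", "countrymen"])
# -> ['friends romans', 'romans countrymen']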
