IR vs. Database: structured vs. unstructured data. Structured data tends to refer to information in 'tables'; unstructured data typically refers to free text, which allows keyword queries (including operators) and more sophisticated concept queries.
Indexer output: the dictionary, the postings lists, and document frequency information (df is used for query optimization).
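The indexer stage above can be sketched in Python (names and the toy corpus are illustrative, not from the source): each term maps to a sorted postings list of docIDs, and the document frequency is just the length of that list.

```python
# Minimal inverted-index builder (illustrative sketch, not production code).
from collections import defaultdict

def build_index(docs):
    """docs: dict of docID -> text. Returns term -> sorted postings list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    # Postings are sorted by docID; df(term) == len(index[term]).
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "aaa bbb", 2: "bbb ccc", 3: "aaa ccc"}
index = build_index(docs)
# index["aaa"] -> [1, 3], so df("aaa") == 2
```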
Query processing: for example, "aaa AND bbb".
- locate aaa in the dictionary, retrieve its postings (postings sorted by docID)
- locate bbb in the dictionary, retrieve its postings
- merge the two postings lists: walk through the two lists simultaneously; if the list lengths for aaa and bbb are x and y, the merge takes O(x+y) operations.
    while p1 != NULL and p2 != NULL
    do if docID(p1) == docID(p2)
       then add docID(p1) to answer
            p1 = next(p1); p2 = next(p2)
       else if docID(p1) < docID(p2)
            then p1 = next(p1)
            else p2 = next(p2)
For a general query (aaa AND bbb AND ccc), process the terms in order of increasing document frequency: start with the two rarest terms, then keep intersecting the (small) intermediate result with the next rarest term, cutting the work at each step.
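The merge above can be written as a runnable Python function; representing postings as sorted Python lists (rather than linked lists) is an assumption for illustration.

```python
def intersect(p1, p2):
    """Merge two docID-sorted postings lists in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # same docID: in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

# intersect([1, 2, 4, 8], [2, 3, 8]) -> [2, 8]
```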
Basic indexing pipeline: Documents to be indexed -> (Tokenizer) -> Token stream -> (Linguistic modules) -> Modified tokens -> (Indexer) -> inverted index
A token is an instance of a sequence of characters; each token is a candidate for an index entry. (Input: "friends, romans and countrymen"; output tokens: friends, romans, countrymen.)
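A minimal tokenizer sketch (the regex and the choice to case-fold here are assumptions; dropping stop words like "and" would happen in the later linguistic modules):

```python
import re

def tokenize(text):
    """Split text into lowercase alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokenize("Friends, Romans and countrymen")
# -> ['friends', 'romans', 'and', 'countrymen']
```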
Normalization to terms
A term is a (normalized) word type that serves as an entry in the IR system's dictionary. Indexed text and query terms must be normalized into the same form.
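A crude normalization sketch (the specific rules, case-folding and stripping periods, are illustrative assumptions; real systems also handle accents, hyphens, dates, etc.). The key point is that the same mapping is applied to both indexed text and queries:

```python
def normalize(token):
    """Toy normalization: case-fold and strip periods."""
    return token.lower().replace(".", "")

# Indexed text and query terms map to the same dictionary entry:
normalize("U.S.A.")  # -> 'usa'
normalize("usa")     # -> 'usa'
```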
Stemming suggests crude affix chopping ("compressed", "compression", ... -> compress). Porter's algorithm is the most common English stemmer.
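A toy suffix chopper in the spirit of, but far simpler than, Porter's algorithm; the rule list below is an illustrative assumption, not Porter's actual rules:

```python
def crude_stem(word):
    """Chop a known suffix if enough of a stem remains (toy rules)."""
    rules = [("sses", "ss"), ("ssion", "ss"), ("ies", "i"),
             ("ed", ""), ("ing", ""), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

crude_stem("compressed")   # -> 'compress'
crude_stem("compression")  # -> 'compress'
```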
Biword indexes for phrases: index each pair of consecutive tokens as a single dictionary term, so a two-word phrase query becomes one lookup.
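A sketch of biword generation (the function name is an assumption): each consecutive token pair becomes one index term.

```python
def biwords(tokens):
    """Turn a token stream into consecutive-pair phrase terms."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

biwords(["friends", "romans", "countrymen"])
# -> ['friends romans', 'romans countrymen']
```

A phrase query such as "friends romans" is then answered by looking up the single biword term 'friends romans' in the dictionary.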