5. Information Retrieval vs. Data Retrieval
                 Data Retrieval                               Information Retrieval
Data             Database tables, structured                  Free text, unstructured
Accessibility    Knowledgeable users or automatic processes   Non-expert humans
Results          Unordered                                    Ordered by relevance
Results          Exact matches                                Approximate matches
Queries          SQL, relational algebra                      Keywords, natural language
6. Information Retrieval Systems
[Diagram: IR system architecture. Components: User, Corpus, text processor, indexer, index, query processor, ranking procedure. Data-flow labels: raw docs, tokenized docs, postings, user query, system query, retrieved docs, ranked retrieved docs.]
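The components named in the diagram can be sketched as a toy pipeline. This is only an illustrative sketch under loud assumptions: the function names mirror the diagram labels, but their bodies (whitespace tokenization, a term-frequency index, scoring by summed term frequency) and the three-document corpus are hypothetical choices, not something the slides specify.

```python
from collections import Counter, defaultdict


def text_processor(raw_doc):
    """Turn a raw document into tokens (assumed: lowercase + whitespace split)."""
    return raw_doc.lower().split()


def indexer(corpus):
    """Build the index: term -> {doc_id: term frequency} (the postings)."""
    index = defaultdict(dict)
    for doc_id, raw_doc in corpus.items():
        for term, tf in Counter(text_processor(raw_doc)).items():
            index[term][doc_id] = tf
    return index


def query_processor(user_query):
    """Turn the user query into a system query (assumed: same tokenization)."""
    return text_processor(user_query)


def ranking_procedure(system_query, index):
    """Score each retrieved doc by summed term frequency, best first."""
    scores = Counter()
    for term in system_query:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Sort by descending score, breaking ties by doc id for determinism.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Hypothetical corpus, for illustration only.
corpus = {
    "doc1": "information retrieval ranks documents by relevance",
    "doc2": "data retrieval returns exact matches",
    "doc3": "web search engines crawl the web",
}
index = indexer(corpus)
ranked = ranking_procedure(query_processor("information retrieval"), index)
print(ranked)
```

Here doc1 matches both query terms and doc2 matches only one, so doc1 is ranked first; doc3 matches neither and is not retrieved. A real system would replace the scoring with a proper relevance model.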
7. Search Engines
[Diagram: search engine architecture. Components: User, Web, crawler, repository, global analyzer, text processor, indexer, index, query processor, ranking procedure. Data-flow labels: tokenized docs, postings, user query, system query, retrieved docs, ranked retrieved docs.]
8. Classical IR vs. Web IR
                     Web IR                 Classical IR
Documents            Hypertext              Text
# of matches         Large                  Small
Data accessibility   Partially accessible   Accessible
Volume               Huge                   Large
IR techniques        Link-based             Content-based
Format diversity     Widely diverse         Homogeneous
Data change rate     In flux                Infrequent
Data quality         Noisy, duplicates      Clean, no duplicates
30. Inverted Index
Documents (word positions in parentheses):
d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

Vocabulary   Postings
author       (d1,4)
berkeley     (d1,13)
date         (d2,9)
famous       (d2,2)
graphical    (d1,6)
jordan       (d1,2), (d2,6)
legend       (d2,4)
like         (d2,7)
michael      (d1,1), (d2,5)
model        (d1,7), (d2,10)
nba          (d2,3)
professor    (d1,10)
uc           (d1,12)
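A minimal sketch of how such a positional inverted index might be built for the two example documents. Several pieces are assumptions made only so the output matches the slide: the stopword set is inferred from the words that have no vocabulary entry, and the tiny `STEMS` lookup table is a hypothetical stand-in for a real stemmer (it only maps "liked" to "like" and "models" to "model").

```python
from collections import defaultdict

# Assumed stopword list: the words absent from the slide's vocabulary.
STOPWORDS = {"the", "of", "is", "a", "at", "to"}
# Hypothetical stand-in for a stemmer, covering just this example.
STEMS = {"liked": "like", "models": "model"}


def tokenize(text):
    """Split on whitespace, drop punctuation, lowercase ("U.C." -> "uc")."""
    tokens = []
    for word in text.split():
        cleaned = "".join(ch for ch in word if ch.isalnum()).lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens


def build_index(docs):
    """Map each vocabulary term to its postings: (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Positions are 1-based and count stopwords, as on the slide.
        for position, token in enumerate(tokenize(text), start=1):
            if token in STOPWORDS:
                continue
            index[STEMS.get(token, token)].append((doc_id, position))
    return dict(index)


def adjacent_docs(index, first, second):
    """Docs where `second` occurs immediately after `first` (a phrase match)."""
    second_postings = set(index.get(second, []))
    return sorted({d for d, p in index.get(first, []) if (d, p + 1) in second_postings})


docs = {
    "d1": 'Michael Jordan, the author of "graphical models", '
          "is a professor at U.C. Berkeley.",
    "d2": "The famous NBA legend Michael Jordan liked to date models.",
}
index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Storing positions rather than only document ids is what makes phrase queries possible: `adjacent_docs(index, "michael", "jordan")` finds both documents, while `adjacent_docs(index, "graphical", "model")` finds only d1.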