SlideShare una empresa de Scribd logo
1 de 69
INFORMATION RETRIEVAL A Look into the Science of Web Search Engines 1 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk Muhammad AtifQureshi
Contents Story Mode Learning Learning by Imagination Appendix 2 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Story Mode Learning (Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist in Twitter) 3 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Information Retrieval Systems Information What is “information”? Retrieval What do we mean by “retrieval”? What are different types information needs? Systems How do computer systems fit into the human information seeking process? 4 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Information? What do you think? There is no “correct” definition Cookie Monster’s definition:  “news or facts about something” Different approaches: Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science 5 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Dictionary says… Oxford English Dictionary information: informing, telling; thing told, knowledge, items of knowledge, news knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known Random House Dictionary information: knowledge communicated or received concerning a particular fact or circumstance; news 6 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Intuitive Notions Information must Be something, although the exact nature (substance, energy, or abstract concept) is not clear; Be “new”: repetition of previously received messages is not informative Be “true”: false or counterfactual information is “mis-information” Be “about” something Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269. 7 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Three Views of Information Information as process Information as communication Information as message transmission and reception 8 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
One View Information = characteristics of the output of a process Tells us something about the process and the input Information-generating process do not occur in isolation Input Output Process Input Output Input Output Process1 Process2 Input Output … 9 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where’s the human? If a tree falls in the forest, and no one is around to hear it, is information transmitted? In the “information as process”: Yes, but that’s not very interesting to us We’re concerned about information for human consumption Transmission of information from one person to another Recording of information Reconstruction of stored information 10 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Another View Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the recipient's structure Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient” Information = “the structure of any text which is capable of changing the image-structure of a recipient” 11 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Transfer of Information Communication = transmission of information Thoughts Thoughts Telepathy? Words Words Writing Sounds Sounds Speech Encoding Decoding 12 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Information Theory Better called “communication theory” Developed by Claude Shannon in 1940’s Concerned with the transmission of electrical signals over wires How do we send information quickly and reliably? Underlies modern electronic communication: Voice and data traffic… Over copper, fiber optic, wireless, etc. Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy Information = “reduction in surprise” 13 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Noisy Channel Model Communication = producing the same message at the destination that was sent at the source The message must be encoded for transmission across a medium (called channel) But the channel is noisy and can distort the message Semantics (meaning) is irrelevant channel Receiver message Transmitter noise Source Destination message 14 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
A Synthesis Information retrieval as communication over time and space, across a noisy channel Sender Recipient Encoding Decoding Transmitter Receiver channel storage message message indexing/writing retrieval/reading noise Source Destination message message noise 15 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
“Retrieval?” “Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find some messages relevant to the task at hand Encoding Decoding storage Sender Recipient message message indexing/writing Retrieval/reading noise 16 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143. 17 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Types of Information Needs Retrospective “Searching the past” Different queries posed against a static collection Time invariant Prospective “Searching the future” Static query posed against a dynamic collection Time dependent 18 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Retrospective Searches (I) Ad hoc retrieval: find documents “about this” Known item search Directed exploration Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Find Jimmy Lin’s homepage. What’s the ISBN number of “Modern Information Retrieval”? Who makes the best chocolates? What video conferencing systems exist for digital reference desk services? 19 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Retrospective Searches (II) Question answering Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992? “Factoid” What countries export oil? Name U.S. cities that have a “Shubert” theater. “List” Who is Aaron Copland? What is a quasar? “Definition” 20 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Prospective “Searches” Filtering Make a binary decision about each incoming document Routing Sort incoming documents into different bins? Spam or not spam? Categorize news headlines: World? Nation? Metro? Sports? 21 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What types of information? Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.)  Video Source code Applications/Web services 22 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Content-Based Search This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications? 23 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Interesting Examples Google image search Google video search Query by humming http://images.google.com/ http://video.google.com/ http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html 24 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What about databases? What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world” 25 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
A (Simple) Database Example Student Table Department Table Course Table Enrollment Table 26 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Database Queries What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address? 27 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Databases vs. IR 28 IR Databases What we’re retrieving Mostly unstructured.  Free text with some metadata. Structured data. Clear semantics based on a formal model. Queries we’re posing Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries.  Unambiguous. Results we get Sometimes relevant, often not. Exact.  Always correct in a formal sense. Interaction with system Interaction is important. One-shot queries. Other issues Issues downplayed. Concurrency, recovery, atomicity are all critical. Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Big Picture The four components of the information retrieval environment: User Process System Collection What computer geeks care about! What we care about! 29 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Information Retrieval Cycle Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Query Formulation Search Selection Examination Delivery 30 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Supporting the Search Process Source Selection Resource Query Formulation Query Search Ranked List Selection Indexing Documents Index Examination Acquisition Documents Collection Delivery 31 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Simplification? Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Is this itself a vast simplification? Query Formulation Search Selection Examination Delivery 32 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Tackling the IR Challenge Divide and conquer! Strategy: use encapsulation to limit complexity Approach: Define interfaces (input and output) for each component Define the functions performed by each component Study each component in isolation Repeat the process within components as needed Make sure that this decomposition makes sense Result: a hierarchical decomposition 33 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where do we make the cut? Study the IR black box in isolation Simple behavior: in goes query, out comes documents Optimize the quality of documents that come out Study everything else around the black box Put the human back in the loop! Search Query Ranked List 34 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The IR Black Box Documents Query Hits 35 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Inside The IR Black Box Documents Query Representation Function Representation Function Query Representation Document Representation Index Comparison Function Hits 36 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
The Central Problem in IR Information Seeker Authors Concepts Concepts Query Terms Document Terms Do these represent the same concepts? 37 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning by Imagination 38 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Imagine a System We have 1000s of web pages, what make these web pages different? May be different key terms or key words occurring in different web pages (e.g., sports, education, video sharing) 39 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Realize Query Needs What do we expect when query? Query can be single word (no order), collection of words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan") How to manage data of web pages Bag of words data structure with/without position of words/terms (simply, posting list of words/terms) What’s the best match? We have many matching results, but what’s the order? 40 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Order of Matching Results How could we rank web pages?  Via query content matching score against web pages i.e., content based methods  Via importance of web pages i.e., link based methods 41 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What does Content Tell? Content Information: Rare terms give more information than frequent terms as common terms do not differentiate well between the content of documents (Information entropy) So what does common words make?  Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., word science inside scientific documents) 42 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Ranking Methods Content based methods: Examples: Tf-idf with cosine similarity, bm25, etc. Link based methods: Examples: PageRank, HITS, etc. 43 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is More in Ranking? What other measures we can take for ranking better? Combining content based methods with link based methods How about learning to rank by user click through data (apply machine learning) How about learning from social web (apply social science theories) 44 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Lots of Web Pages How about scalability?  We have too many words, can we limit them?  Example: Is Studying conceptually different from study or studies? may be not (concept called stemming could simply everything to simple concept study) Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education) 45 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Is it sufficient? Can we feel confident about how Web Search Engine works? No, it was just a summary for the day 46 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Guess! what next you would see? ? 47 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Our search engine Yes we are making it 48 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Appendix 49 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Outline What is Research? How to prepare yourself for IR research? 50 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Research? 51 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What is Research? Research Discover new knowledge  Seek answers to questions Basic research Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, …  Applied research Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors, vaccinations, … The boundary is vague; distinction isn’t important 52 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Why Research? Funding Curiosity Utility of  Applications Advancement of  Technology Amount of  knowledge Application Development Applied Research Basic Research 53 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Where’s IR Research? Information Science Funding Quality of  Life Utility of  Applications Advancement of  Technology Amount of  knowledge Computer Science Application Development Applied Research Basic Research 54 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Research Process Identification of the topic (e.g., Web search) Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art) Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data) Test hypothesis (e.g., compare X and Y on the data) Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?) 55 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Typical IR Research Process Look for a high-impact topic (basic or applied) New problem: define/frame the problem  Identify weakness of existing solutions if any Propose new methods  Choose data sets (often a main challenge) Design evaluation measures (can be very difficult) Run many experiments (need to have clear research hypotheses) Analyze results and repeat the steps above if necessary Publish research results 56 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Research Methods Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”) Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”) Empirical research: evaluate and compare existing solutions  (e.g., “a comparative evaluation of link analysis methods for web search”)  The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory… 57 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Types of Research Questions and Results Exploratory (Framework): What’s out there?  Descriptive (Principles): What does it look like? How does it work? Evaluative (Empirical results): How well does a method solve a problem?  Explanatory (Causes): Why does something happen the way it happens?  Predictive (Models): What would happen if xxx ? 58 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Solid and High Impact Research Solid work:  A clear hypothesis (research question) with conclusive result (either positive or negative) Clearly adds to our knowledge base (what can we learn from this work?) Implications: a solid, focused contribution is often better than a non-conclusive broad exploration High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best  59 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
How to Prepare Yourself for IR Research? 60 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge assumptions Learning: take you to the frontier of knowledge Persistence: so that you don’t give up Respect data and truth: ensure your research is solid Communication: allow you to publish your work … 61 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning about IR (1/2) Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…) Then read “Readings in IR” by Karen Sparck Jones,  Peter Willett  And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf Read other papers published in recent IR/IR-related conferences 62 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Learning about IR (2/2) Getting more focused  Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization) Stay in frontier: Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on  Industry activities  Read about industry trends Try out novel prototype systems Funding trends Read request for proposals 63 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Critical Thinking Develop a habit of asking questions, especially why questions Always try to make sense of what you have read/heard; don’t let any question pass by Get used to challenging everything Practical advice Question every claim made in a paper or a talk (can you argue the other way?)  Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it) Force yourself to challenge one point in every talk that you attend and raise a question 64 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Respect Data and Truth Be honest with the experiment results  Don’t throw away negative results!  Try to learn from negative results Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data Be objective in data analysis and interpretation; don’t mislead readers  Aim at understanding/explanation instead of just good results Be careful not to over-generalize (for both good and bad results); you may be far from the truth 65 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Communications General communication skills:  Oral and written Formal and informal Talk to people with different level of backgrounds Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction)  English proficiency Get used to talking to people from different fields 66 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Persistence Work only on topics that you are passionate about Work only on hypotheses that you believe in Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative results In many cases, negative results don’t completely reject a hypothesis  Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper) Think of possibilities of repositioning a work 67 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Optimize Your Training Know your strengths and weaknesses strong in math vs. strong in system development creative vs. thorough … Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of your strengths 68 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk
Thank You Reach me on Twitter: @matifq Email me: maqureshi@iba.edu.pk 69 Reach me on Twitter: @matifq     Email maqureshi@iba.edu.pk

Más contenido relacionado

La actualidad más candente

Ppt evaluation of information retrieval system
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval systemsilambu111
 
Information storage and retrieval
Information storage and  retrievalInformation storage and  retrieval
Information storage and retrievalDr. Utpal Das
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval modelbaradhimarch81
 
Kwic
KwicKwic
KwicPU
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluationNidhirBiswas
 
Introduction of search databases
Introduction of search databases Introduction of search databases
Introduction of search databases Youssef2000
 
Classaurus classification
Classaurus classificationClassaurus classification
Classaurus classificationavid
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challengeGan Keng Hoon
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalssbd6985
 
Current trends in library science research
Current trends in library science researchCurrent trends in library science research
Current trends in library science researchVISHNUMAYA R S
 
Reasons for information repackaging in library
Reasons for information repackaging in libraryReasons for information repackaging in library
Reasons for information repackaging in libraryGabrielkipkoech2015
 
Ontology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyOntology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyDebashisnaskar
 
House keeeping operations .pptx
House keeeping operations .pptxHouse keeeping operations .pptx
House keeeping operations .pptxlisbala
 

La actualidad más candente (20)

Controlled Vocabullary.pptx
Controlled Vocabullary.pptxControlled Vocabullary.pptx
Controlled Vocabullary.pptx
 
Ppt evaluation of information retrieval system
Ppt evaluation of information retrieval systemPpt evaluation of information retrieval system
Ppt evaluation of information retrieval system
 
Information storage and retrieval
Information storage and  retrievalInformation storage and  retrieval
Information storage and retrieval
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Probabilistic retrieval model
Probabilistic retrieval modelProbabilistic retrieval model
Probabilistic retrieval model
 
Kwic
KwicKwic
Kwic
 
Bibliometrics law
Bibliometrics lawBibliometrics law
Bibliometrics law
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Informatio retrival evaluation
Informatio retrival evaluationInformatio retrival evaluation
Informatio retrival evaluation
 
Introduction of search databases
Introduction of search databases Introduction of search databases
Introduction of search databases
 
Precis
PrecisPrecis
Precis
 
Classaurus classification
Classaurus classificationClassaurus classification
Classaurus classification
 
Library Networks
Library NetworksLibrary Networks
Library Networks
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challenge
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Current trends in library science research
Current trends in library science researchCurrent trends in library science research
Current trends in library science research
 
Reasons for information repackaging in library
Reasons for information repackaging in libraryReasons for information repackaging in library
Reasons for information repackaging in library
 
Ontology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyOntology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical Study
 
POPSI
POPSIPOPSI
POPSI
 
House keeeping operations .pptx
House keeeping operations .pptxHouse keeeping operations .pptx
House keeeping operations .pptx
 

Similar a Information Retrieval

Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migrationpetermurrayrust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
Unesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalUnesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalMOVING Project
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysisLuke Czarnecki
 
Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...Mathieu d'Aquin
 
Big, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBig, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBiplav Srivastava
 
COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011Laksamee Putnam
 
The Internet, Part II: The Internet, Higher Education, and the Next 25 Years
The Internet, Part II:  The Internet, Higher Education, and the Next 25 YearsThe Internet, Part II:  The Internet, Higher Education, and the Next 25 Years
The Internet, Part II: The Internet, Higher Education, and the Next 25 YearsKristen T
 
The Generation Game He Forum
The Generation Game He ForumThe Generation Game He Forum
The Generation Game He ForumHAROLDFRICKER
 
Report on hacking crime and workable solution
Report on hacking crime and workable solutionReport on hacking crime and workable solution
Report on hacking crime and workable solutionShohag Prodhan
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingIla Group
 
The End(s) of e-Research
The End(s) of e-ResearchThe End(s) of e-Research
The End(s) of e-ResearchEric Meyer
 
Let the trumpet sound 2007
Let the trumpet sound 2007Let the trumpet sound 2007
Let the trumpet sound 2007Johan Koren
 
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge ExtractionFrom Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge ExtractionSTLab
 
Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)Paul Royster
 
Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.Debra Perea
 
Ensuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published HeritageEnsuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published HeritageEDINA, University of Edinburgh
 

Similar a Information Retrieval (20)

Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Unesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-finalUnesco mobileweek 2019_frontier_tech_oer-final
Unesco mobileweek 2019_frontier_tech_oer-final
 
I want to know more about compuerized text analysis
I want to know more about   compuerized text analysisI want to know more about   compuerized text analysis
I want to know more about compuerized text analysis
 
Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...Open Web Data for Education - Linked Data technologies for connecting open ed...
Open Web Data for Education - Linked Data technologies for connecting open ed...
 
Big, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near YouBig, Open, Data and Semantics for Real-World Application Near You
Big, Open, Data and Semantics for Real-World Application Near You
 
COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011COSC 111 Searching Spring 2011
COSC 111 Searching Spring 2011
 
Conference Report Final 11.18
Conference Report Final 11.18Conference Report Final 11.18
Conference Report Final 11.18
 
The Internet, Part II: The Internet, Higher Education, and the Next 25 Years
The Internet, Part II:  The Internet, Higher Education, and the Next 25 YearsThe Internet, Part II:  The Internet, Higher Education, and the Next 25 Years
The Internet, Part II: The Internet, Higher Education, and the Next 25 Years
 
The Generation Game He Forum
The Generation Game He ForumThe Generation Game He Forum
The Generation Game He Forum
 
Report on hacking crime and workable solution
Report on hacking crime and workable solutionReport on hacking crime and workable solution
Report on hacking crime and workable solution
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
The End(s) of e-Research
The End(s) of e-ResearchThe End(s) of e-Research
The End(s) of e-Research
 
Let the trumpet sound 2007
Let the trumpet sound 2007Let the trumpet sound 2007
Let the trumpet sound 2007
 
Connecting Museums with Linked Data
Connecting Museums with Linked DataConnecting Museums with Linked Data
Connecting Museums with Linked Data
 
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge ExtractionFrom Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
From Hyperlinks to Semantic Web Properties using Open Knowledge Extraction
 
Kick-off meeting Linkflows project
Kick-off meeting Linkflows projectKick-off meeting Linkflows project
Kick-off meeting Linkflows project
 
Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)Institutional Repositories (NLA 2011)
Institutional Repositories (NLA 2011)
 
Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.Word Essay Professional Writin. Online assignment writing service.
Word Essay Professional Writin. Online assignment writing service.
 
Ensuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published HeritageEnsuring Continuity of Access To Our Published Heritage
Ensuring Continuity of Access To Our Published Heritage
 

Más de Web Science Research Group at Institute of Business Administration, Karachi, Pakistan (8)

ReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for PakistanReThinking CS Curriculum for Pakistan
ReThinking CS Curriculum for Pakistan
 
Social Media Mining and Analytics
Social Media Mining and AnalyticsSocial Media Mining and Analytics
Social Media Mining and Analytics
 
CSE509 Lecture 6
CSE509 Lecture 6CSE509 Lecture 6
CSE509 Lecture 6
 
CSE509 Lecture 5
CSE509 Lecture 5CSE509 Lecture 5
CSE509 Lecture 5
 
CSE509 Lecture 4
CSE509 Lecture 4CSE509 Lecture 4
CSE509 Lecture 4
 
CSE509 Lecture 3
CSE509 Lecture 3CSE509 Lecture 3
CSE509 Lecture 3
 
CSE509 Lecture 2
CSE509 Lecture 2CSE509 Lecture 2
CSE509 Lecture 2
 
CSE509 Lecture 1
CSE509 Lecture 1CSE509 Lecture 1
CSE509 Lecture 1
 

Último

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Último (20)

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Information Retrieval

  • 1. INFORMATION RETRIEVAL A Look into the Science of Web Search Engines 1 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk Muhammad AtifQureshi
  • 2. Contents Story Mode Learning Learning by Imagination Appendix 2 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 3. Story Mode Learning (Borrowed from Prof. Jimmy Lin, University of Maryland, Scientist in Twitter) 3 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 4. Information Retrieval Systems Information What is “information”? Retrieval What do we mean by “retrieval”? What are different types information needs? Systems How do computer systems fit into the human information seeking process? 4 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 5. What is Information? What do you think? There is no “correct” definition Cookie Monster’s definition: “news or facts about something” Different approaches: Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science 5 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 6. Dictionary says… Oxford English Dictionary information: informing, telling; thing told, knowledge, items of knowledge, news knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known Random House Dictionary information: knowledge communicated or received concerning a particular fact or circumstance; news 6 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 7. Intuitive Notions Information must Be something, although the exact nature (substance, energy, or abstract concept) is not clear; Be “new”: repetition of previously received messages is not informative Be “true”: false or counterfactual information is “mis-information” Be “about” something Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269. 7 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 8. Three Views of Information Information as process Information as communication Information as message transmission and reception 8 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 9. One View Information = characteristics of the output of a process Tells us something about the process and the input Information-generating process do not occur in isolation Input Output Process Input Output Input Output Process1 Process2 Input Output … 9 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 10. Where’s the human? If a tree falls in the forest, and no one is around to hear it, is information transmitted? In the “information as process”: Yes, but that’s not very interesting to us We’re concerned about information for human consumption Transmission of information from one person to another Recording of information Reconstruction of stored information 10 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 11. Another View Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the recipient's structure Text = “a collection of signs purposefully structured by a sender with the intention of changing image-structure of a recipient” Information = “the structure of any text which is capable of changing the image-structure of a recipient” 11 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 12. Transfer of Information Communication = transmission of information Thoughts Thoughts Telepathy? Words Words Writing Sounds Sounds Speech Encoding Decoding 12 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 13. Information Theory Better called “communication theory” Developed by Claude Shannon in 1940’s Concerned with the transmission of electrical signals over wires How do we send information quickly and reliably? Underlies modern electronic communication: Voice and data traffic… Over copper, fiber optic, wireless, etc. Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy Information = “reduction in surprise” 13 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 14. The Noisy Channel Model Communication = producing the same message at the destination that was sent at the source The message must be encoded for transmission across a medium (called channel) But the channel is noisy and can distort the message Semantics (meaning) is irrelevant channel Receiver message Transmitter noise Source Destination message 14 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 15. A Synthesis Information retrieval as communication over time and space, across a noisy channel Sender Recipient Encoding Decoding Transmitter Receiver channel storage message message indexing/writing retrieval/reading noise Source Destination message message noise 15 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 16. “Retrieval?” “Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find some messages relevant to the task at hand Encoding Decoding storage Sender Recipient message message indexing/writing Retrieval/reading noise 16 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 17. What is IR? Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143. 17 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 18. Types of Information Needs Retrospective “Searching the past” Different queries posed against a static collection Time invariant Prospective “Searching the future” Static query posed against a dynamic collection Time dependent 18 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 19. Retrospective Searches (I) Ad hoc retrieval: find documents “about this” Known item search Directed exploration Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Find Jimmy Lin’s homepage. What’s the ISBN number of “Modern Information Retrieval”? Who makes the best chocolates? What video conferencing systems exist for digital reference desk services? 19 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 20. Retrospective Searches (II) Question answering Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992? “Factoid” What countries export oil? Name U.S. cities that have a “Shubert” theater. “List” Who is Aaron Copland? What is a quasar? “Definition” 20 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 21. Prospective “Searches” Filtering Make a binary decision about each incoming document Routing Sort incoming documents into different bins? Spam or not spam? Categorize news headlines: World? Nation? Metro? Sports? 21 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 22. What types of information? Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.) Video Source code Applications/Web services 22 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 23. Content-Based Search This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications? 23 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 24. Interesting Examples Google image search Google video search Query by humming http://images.google.com/ http://video.google.com/ http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html 24 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 25. What about databases? What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world” 25 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 26. A (Simple) Database Example Student Table Department Table Course Table Enrollment Table 26 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 27. Database Queries What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC 690 with a last name shorter than six characters and were born on a Monday, who has the longest email address? 27 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 28. Databases vs. IR 28 IR Databases What we’re retrieving Mostly unstructured. Free text with some metadata. Structured data. Clear semantics based on a formal model. Queries we’re posing Vague, imprecise information needs (often expressed in natural language). Formally (mathematically) defined queries. Unambiguous. Results we get Sometimes relevant, often not. Exact. Always correct in a formal sense. Interaction with system Interaction is important. One-shot queries. Other issues Issues downplayed. Concurrency, recovery, atomicity are all critical. Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 29. The Big Picture The four components of the information retrieval environment: User Process System Collection What computer geeks care about! What we care about! 29 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 30. The Information Retrieval Cycle Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Query Formulation Search Selection Examination Delivery 30 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 31. Supporting the Search Process Source Selection Resource Query Formulation Query Search Ranked List Selection Indexing Documents Index Examination Acquisition Documents Collection Delivery 31 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 32. Simplification? Resource Query Ranked List Documents query reformulation, vocabulary learning, relevance feedback Documents source reselection Source Selection Is this itself a vast simplification? Query Formulation Search Selection Examination Delivery 32 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 33. Tackling the IR Challenge Divide and conquer! Strategy: use encapsulation to limit complexity Approach: Define interfaces (input and output) for each component Define the functions performed by each component Study each component in isolation Repeat the process within components as needed Make sure that this decomposition makes sense Result: a hierarchical decomposition 33 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 34. Where do we make the cut? Study the IR black box in isolation Simple behavior: in goes query, out comes documents Optimize the quality of documents that come out Study everything else around the black box Put the human back in the loop! Search Query Ranked List 34 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 35. The IR Black Box Documents Query Hits 35 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 36. Inside The IR Black Box Documents Query Representation Function Representation Function Query Representation Document Representation Index Comparison Function Hits 36 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 37. The Central Problem in IR Information Seeker Authors Concepts Concepts Query Terms Document Terms Do these represent the same concepts? 37 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 38. Learning by Imagination 38 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 39. Imagine a System We have 1000s of web pages, what make these web pages different? May be different key terms or key words occurring in different web pages (e.g., sports, education, video sharing) 39 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 40. Realize Query Needs What do we expect when query? Query can be single word (no order), collection of words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan") How to manage data of web pages Bag of words data structure with/without position of words/terms (simply, posting list of words/terms) What’s the best match? We have many matching results, but what’s the order? 40 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 41. Order of Matching Results How could we rank web pages? Via query content matching score against web pages i.e., content based methods Via importance of web pages i.e., link based methods 41 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 42. What does Content Tell? Content Information: Rare terms give more information than frequent terms as common terms do not differentiate well between the content of documents (Information entropy) So what does common words make? Stop words (extreme case, e.g., it, a, the) or words with lesser importance (e.g., word science inside scientific documents) 42 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 43. Ranking Methods Content based methods: Examples: Tf-idf with cosine similarity, bm25, etc. Link based methods: Examples: PageRank, HITS, etc. 43 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 44. What is More in Ranking? What other measures we can take for ranking better? Combining content based methods with link based methods How about learning to rank by user click through data (apply machine learning) How about learning from social web (apply social science theories) 44 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 45. Lots of Web Pages How about scalability? We have too many words, can we limit them? Example: Is Studying conceptually different from study or studies? may be not (concept called stemming could simply everything to simple concept study) Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education) 45 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 46. Is it sufficient? Can we feel confident about how Web Search Engine works? No, it was just a summary for the day 46 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 47. Guess! what next you would see? ? 47 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 48. Our search engine Yes we are making it 48 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 49. Appendix 49 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 50. Outline What is Research? How to prepare yourself for IR research? 50 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 51. What is Research? 51 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 52. What is Research? Research Discover new knowledge Seek answers to questions Basic research Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, … Applied research Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors, vaccinations, … The boundary is vague; distinction isn’t important 52 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 53. Why Research? Funding Curiosity Utility of Applications Advancement of Technology Amount of knowledge Application Development Applied Research Basic Research 53 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 54. Where’s IR Research? Information Science Funding Quality of Life Utility of Applications Advancement of Technology Amount of knowledge Computer Science Application Development Applied Research Basic Research 54 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 55. Research Process Identification of the topic (e.g., Web search) Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art) Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data) Test hypothesis (e.g., compare X and Y on the data) Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?) 55 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 56. Typical IR Research Process Look for a high-impact topic (basic or applied) New problem: define/frame the problem Identify weakness of existing solutions if any Propose new methods Choose data sets (often a main challenge) Design evaluation measures (can be very difficult) Run many experiments (need to have clear research hypotheses) Analyze results and repeat the steps above if necessary Publish research results 56 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 57. Research Methods Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”) Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”) Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”) The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory… 57 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 58. Types of Research Questions and Results Exploratory (Framework): What’s out there? Descriptive (Principles): What does it look like? How does it work? Evaluative (Empirical results): How well does a method solve a problem? Explanatory (Causes): Why does something happen the way it happens? Predictive (Models): What would happen if xxx ? 58 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 59. Solid and High Impact Research Solid work: A clear hypothesis (research question) with conclusive result (either positive or negative) Clearly adds to our knowledge base (what can we learn from this work?) Implications: a solid, focused contribution is often better than a non-conclusive broad exploration High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best 59 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 60. How to Prepare Yourself for IR Research? 60 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 61. What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge assumptions Learning: take you to the frontier of knowledge Persistence: so that you don’t give up Respect data and truth: ensure your research is solid Communication: allow you to publish your work … 61 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 62. Learning about IR (1/2) Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…) Then read “Readings in IR” by Karen Sparck Jones, Peter Willett And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf Read other papers published in recent IR/IR-related conferences 62 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 63. Learning about IR (2/2) Getting more focused Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization) Stay in frontier: Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on Industry activities Read about industry trends Try out novel prototype systems Funding trends Read request for proposals 63 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 64. Critical Thinking Develop a habit of asking questions, especially why questions Always try to make sense of what you have read/heard; don’t let any question pass by Get used to challenging everything Practical advice Question every claim made in a paper or a talk (can you argue the other way?) Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it) Force yourself to challenge one point in every talk that you attend and raise a question 64 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 65. Respect Data and Truth Be honest with the experiment results Don’t throw away negative results! Try to learn from negative results Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data Be objective in data analysis and interpretation; don’t mislead readers Aim at understanding/explanation instead of just good results Be careful not to over-generalize (for both good and bad results); you may be far from the truth 65 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 66. Communications General communication skills: Oral and written Formal and informal Talk to people with different level of backgrounds Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction) English proficiency Get used to talking to people from different fields 66 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 67. Persistence Work only on topics that you are passionate about Work only on hypotheses that you believe in Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative results In many cases, negative results don’t completely reject a hypothesis Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper) Think of possibilities of repositioning a work 67 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 68. Optimize Your Training Know your strengths and weaknesses strong in math vs. strong in system development creative vs. thorough … Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of your strengths 68 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk
  • 69. Thank You Reach me on Twitter: @matifq Email me: maqureshi@iba.edu.pk 69 Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk