4. Resources and achievements
Search engines
Databases for property owners in Europe & USA
List of Deputies of State Duma
Man-hours invested in manual search and exploration
Results: 500+ news items, 150 articles, 20 interviews and videos; Pekhtin resigned from the Committee of Ethics
5/24/2013 Sergey Chernov, Information Retrieval Basics
5. Outline for today
Sources of Information
Search strategies and tools
Search Cases
Assignments and Q&A Session
7. Information in numbers
Facebook – 900 mln users
Twitter – 500 mln users
Flickr – 50 mln users
Delicious – 5 mln users
Web – 1 trln pages
8. Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
9. Information Domains
Desktop: Disk, DVD, File Share, E-mail
Enterprise Web (Intranet): DB, CMS, People
Public Web (Internet): Web Sites, Online Libraries, Online Shops, Social Networks
10. Information Retrieval System
Crawler – downloads/collects the data
Indexer – processes the data and builds the inverted index
Ranker – evaluates user queries against the index and computes a list of (ranked) results
Display – organizes and displays the results to the user, facilitates navigation through the result set
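The crawl → index → rank → display pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not a real engine: a toy in-memory corpus stands in for the crawler's output, query terms are AND-matched, and ranking is by raw term frequency (a deliberate simplification of scoring functions such as TF-IDF or BM25).

```python
from collections import defaultdict

def build_index(docs):
    """Indexer: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, docs, query):
    """Ranker: require all query terms (boolean AND), then order the
    matches by total term frequency, highest first."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(candidates,
                  key=lambda d: sum(docs[d].lower().split().count(t) for t in terms),
                  reverse=True)

# Toy corpus standing in for the crawler's output.
docs = {1: "web search basics",
        2: "web crawler builds the index",
        3: "web search web index"}
index = build_index(docs)
print(search(index, docs, "web index"))  # → [3, 2]
```

Document 3 ranks first because it matches both query terms three times in total; document 1 is excluded because it lacks the term "index".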
11. User Needs
Need [Broder 2002; Rose and Levinson 2004]
Informational – want to learn about something ("low hemoglobin")
Navigational – want to go to that page ("United Airlines")
Transactional – want to do something (web-mediated):
  Access a service ("Seattle weather", "Mars surface images")
  Downloads
  Shop ("Canon S410")
Gray areas:
  Find a good hub ("car rental Brasil")
  Exploratory search – "see what's there"
Sec. 19.4.1
12. How far do people look for results?
(Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf)
13. How to evaluate results? CRAAP
Currency – How old is the material? Does the age matter? For history, older information may be better; for medicine, you want fresh material.
Relevance – How well does it fit? Does it answer my question? Is it detailed enough?
Authority – Who wrote it? Is the author qualified to write on the topic? Is contact information provided?
Accuracy – Is it supported by evidence? Refereed? Verifiable? Unbiased? Clearly written?
Purpose – What can you infer about the authors' message? Is it fact, opinion, or propaganda?
http://www.csuchico.edu/lins/handouts/eval_websites.pdf
California State University, Chico
14. Where to search?
Web
Subject directories
Intranet and Desktop
Digital libraries
Social platforms
Databases and Hidden Web
Business analytics
Wikipedia
Photo stocks
Open datasets and Linked Data
Open Gov Data
27. Outline for today
Sources of Information
Search strategies and tools
Search Cases
Assignments and Q&A Session
5/24/2013 Sergey Chernov, Information Retrieval Basics
28. Search is a journey
Is that all?
http://www.flickr.com/photos/morville
33. Exploratory search
Lookup: question answering, fact retrieval, known-item search, navigational search. Lasts for seconds.
Exploratory search (learn, investigate): knowledge acquisition, comprehension, comparison, discovery, serendipity. Incremental search, driven by uncertainty, non-linear behavior, result analysis. Lasts for hours.
34. Exploratory behavior
Learn – about the search topic, about the collection
Reformulate query – broadening, narrowing, changing the focus
Socialize – looking for experts, collaborative search
37. Web search engine (2)
38. Web search engine (3)
Search for pages that link to a URL – "link:" operator
link:google.com/images
Search for pages that are similar to a URL – "related:" operator
related:nytimes.com
Search for results from a specific site – "site:" operator
site:strelkainstitute.com
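Such operators are just prefixes appended to the query string, so query scoping is easy to automate. The sketch below is a hypothetical helper (the function name and its keyword arguments are ours, not from the slides), following the Google-style operator syntax shown above; actual operator support varies by search engine.

```python
def scoped_query(terms, site=None, related=None, link=None):
    """Compose a web-search query string with optional scoping operators.

    Hypothetical helper: appends Google-style operators to plain query
    terms. Operator support varies by engine.
    """
    parts = [terms]
    if link:
        parts.append(f"link:{link}")
    if related:
        parts.append(f"related:{related}")
    if site:
        parts.append(f"site:{site}")
    return " ".join(p for p in parts if p)

print(scoped_query("lecture slides", site="strelkainstitute.com"))
# → lecture slides site:strelkainstitute.com
```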
39. Personalized search
Personalization models a user's preferences from previous interactions:
queries, click-through analysis, eye tracking, …
Personalized search is usually implemented as:
re-ranking and filtering of the search results
personalized query expansion
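A minimal sketch of the re-ranking approach: each result's base relevance score is boosted by the user's accumulated preference for the terms it contains. The term-weight profile here is a toy stand-in for what a real system would learn from queries and clicks; real personalization models are far richer.

```python
def rerank(results, profile):
    """Personalized re-ranking: boost each result's base relevance score
    by the user's accumulated preference for terms in its title."""
    def personalized_score(result):
        boost = sum(profile.get(term, 0.0)
                    for term in result["title"].lower().split())
        return result["score"] + boost
    return sorted(results, key=personalized_score, reverse=True)

# Toy profile, e.g. derived from past queries and click-through data.
profile = {"python": 0.5, "tutorial": 0.3}
results = [{"title": "Java tutorial",   "score": 1.0},
           {"title": "Python tutorial", "score": 0.9}]
print([r["title"] for r in rerank(results, profile)])
# → ['Python tutorial', 'Java tutorial']
```

The lower-scored result wins after personalization because its terms match the user's profile more strongly (0.9 + 0.8 = 1.7 versus 1.0 + 0.3 = 1.3).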
50. Outline for today
Sources of Information
Search strategies and tools
Search Cases
Assignments and Q&A Session
51. Case 1: finding a research paper
52. Case 2: planning a trip
53. Case 3: looking for an expert
54. Case 4: market analysis
55. Outline for today
Sources of Information
Search strategies and tools
Search Cases
Assignments and Q&A Session
56. Practical assignment
Construct 3 information needs relevant to your everyday experience (preparing for an interview, choosing a learning course, doing homework, etc.)
Search for the information using the maximum number of sources and tools
Share your experience
Editor's notes
Here is what a search environment for a company employee looks like
This slide is needed in case some people are not familiar with how an IR system works. This is a very simplified standard architecture; in different scenarios some of these components may be absent. Depending on the level of the participants, you may spend some time explaining how each component works.
Currency: the timeliness of the information. When was the information published or posted? Has the information been revised or updated? Is the information current or out-of-date for your topic? Are the links functional?
Relevance: the importance of the information for your needs. Does the information relate to your topic or answer your question? Who is the intended audience? Is the information at an appropriate level (i.e., not too elementary or advanced for your needs)? Have you looked at a variety of sources before determining this is one you will use? Would you be comfortable using this source for a research paper?
Authority: the source of the information. Who is the author/publisher/source/sponsor? Are the author's credentials or organizational affiliations given? What are the author's qualifications to write on the topic? Is there contact information, such as a publisher or e-mail address? Does the URL reveal anything about the author or source? Examples: .com (commercial), .edu (educational), .gov (U.S. government), .org (nonprofit organization), or .net (network).
Accuracy: the reliability, truthfulness, and correctness of the content. Where does the information come from? Is the information supported by evidence? Has the information been reviewed or refereed? Can you verify any of the information in another source or from personal knowledge? Does the language or tone seem unbiased and free of emotion? Are there spelling, grammar, or other typographical errors?
Purpose: the reason the information exists. What is the purpose of the information: to inform, teach, sell, entertain, persuade? Do the authors/sponsors make their intentions or purpose clear? Is the information fact, opinion, or propaganda? Does the point of view appear objective and impartial? Are there political, ideological, cultural, religious, institutional, or personal biases?
By scoring each category on a scale from 1 to 10 (1 = worst, 10 = best possible) you can give each site a grade on a 50-point scale for how high-quality it is: 45-50 Excellent | 40-44 Good | 35-39 Average | 30-34 Borderline Acceptable | below 30 Unacceptable.
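The 50-point grading scheme above is mechanical enough to express directly in code; this sketch simply sums the five 1-10 scores and maps the total to the grade bands listed in the notes.

```python
def craap_grade(currency, relevance, authority, accuracy, purpose):
    """Sum the five CRAAP scores (1-10 each) and map the 50-point total
    to the grade bands: 45-50 Excellent, 40-44 Good, 35-39 Average,
    30-34 Borderline Acceptable, below 30 Unacceptable."""
    total = currency + relevance + authority + accuracy + purpose
    if total >= 45:
        grade = "Excellent"
    elif total >= 40:
        grade = "Good"
    elif total >= 35:
        grade = "Average"
    elif total >= 30:
        grade = "Borderline Acceptable"
    else:
        grade = "Unacceptable"
    return total, grade

print(craap_grade(9, 8, 9, 8, 8))  # → (42, 'Good')
```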
Subject directories can help one find more in-depth information on a certain subject than just a plain search engine. Whether one is looking for medical or academic articles, or is just plain curious, one way to find information is a basic search engine; however, if one is searching for information on a specific topic and wants direct, to-the-point information, one needs a subject directory. Which ones to choose, and why, can be difficult, so here is a list of the most commonly used ones and a few hidden gems:
Librarians' Internet Index (LII) – over 20,000 articles compiled by public librarians, with completely reliable sources.
INFOMINE – over 250,000 articles compiled by academic librarians, all reliable sources. We are talking college-level information here; want an A or a raise, this is a great site for well-researched information, all written by experts.
About.com – with nearly 2 million articles, About.com is one of the leading subject directories. These articles are written by people with experience in the area in which they write.
Google Directory – with well over 5 million articles, this is by far the leader in subject directories. This is of course enhanced by the Google search engine, which means more results on the chosen topic of research.
Yahoo Directory – with just over 4 million articles, Yahoo offers up lots of useful information. The only drawback is that this subject directory really works best with popular topics, not vague ones.
Read more: http://webupon.com/search-engines/top-five-subject-directories-and-how-to-use-them/#ixzz2LHYbMsJ7
The Million Book Project (or the Universal Library) was a book digitization project led by Carnegie Mellon University School of Computer Science and University Libraries.[1] Working with government and research partners in India (Digital Library of India) and China, the project scanned books in many languages, using OCR to enable full-text searching and providing free-to-read access to the books on the web. As of 2007, they had completed the scanning of 1 million books and made the entire database accessible from http://www.ulib.org.
The Internet Archive is a non-profit digital library with the stated mission of "universal access to all knowledge."[2][3] It offers permanent storage of, and free public access to, collections of digitized materials, including websites, music, moving images, and nearly three million public-domain books; as of October 2012 it held over 10 petabytes of cultural material.[4]
CiteSeer was a public search engine and digital library for scientific and academic papers, primarily in the fields of computer and information science. It became public in 1998 and had many new features unavailable in academic search engines at that time.
The arXiv (pronounced "archive", as if the "X" were the Greek letter chi, χ) is an archive for electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online. In many fields of mathematics and physics, almost all scientific papers are self-archived on the arXiv. On October 3, 2008, arXiv.org passed the half-million-article milestone.[2] The preprint archive turned 20 years old on August 14, 2011.[3] By 2012 the submission rate had grown to more than 7000 per month.[4]
Web 2.0
The Deep Web (also called the Deepnet, the Invisible Web, the Undernet, or the hidden Web) is World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines. http://www.makeuseof.com/tag/10-search-engines-explore-deep-invisible-web/ It should not be confused with the dark Internet, the computers that can no longer be reached via the Internet, or with the distributed file-sharing network Darknet, which could be classified as a smaller part of the Deep Web. Mike Bergman, founder of BrightPlanet and credited with coining the phrase,[1] said that searching on the Internet today can be compared to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed.[2] Most of the Web's information is buried far down on dynamically generated sites, and standard search engines do not find it. Traditional search engines cannot "see" or retrieve content in the deep Web; those pages do not exist until they are created dynamically as the result of a specific search. The deep Web is several orders of magnitude larger than the surface Web.[3]
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
Private Web: sites that require registration and login (password-protected resources).
Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers which prohibit search engines from browsing them and creating cached copies).[8]
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
Business analytics (BA) refers to the skills, technologies, applications, and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.[1] Business analytics focuses on developing new insights and understanding of business performance based on data and statistical methods. In contrast, business intelligence traditionally focuses on using a consistent set of metrics to both measure past performance and guide business planning, which is also based on data and statistical methods. Business analytics makes extensive use of data, statistical and quantitative analysis, explanatory and predictive modeling,[2] and fact-based management to drive decision making. Analytics may be used as input for human decisions or may drive fully automated decisions. Business intelligence is querying, reporting, OLAP, and "alerts".
The Semantic Web is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C).[1] The standard promotes common data formats on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents, into a "web of data". The Semantic Web stack builds on the W3C's Resource Description Framework (RDF).[2] According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."[2]
YAGO2s is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames. Currently, YAGO2s has knowledge of more than 10 million entities (like persons, organizations, cities, etc.) and contains more than 120 million facts about these entities.
In computing, linked data describes a method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.[1] Tim Berners-Lee, director of the World Wide Web Consortium, coined the term in a design note discussing issues around the Semantic Web project.[2] However, the idea is very old and is closely related to concepts including database network models, citations between scholarly articles, and controlled headings in library catalogs.[citation needed]
Tim Berners-Lee gave a presentation on linked data at the TED 2009 conference.[4] In it, he restated the linked data principles as three "extremely simple" rules:
1. All kinds of conceptual things now have names that start with HTTP.
2. I get important information back: data in a standard format, useful data that somebody might like to know about that thing, about that event.
3. The information I get back is not just somebody's height and weight and when they were born; it has relationships. And whenever it expresses a relationship, the other thing that it is related to is given one of those names that starts with HTTP.
FOAF (an acronym of Friend Of A Friend) is a machine-readable ontology describing persons, their activities, and their relations to other people and objects. Anyone can use FOAF to describe him- or herself. FOAF allows groups of people to describe social networks without the need for a centralised database. FOAF is a descriptive vocabulary expressed using the Resource Description Framework (RDF) and the Web Ontology Language (OWL). Computers may use these FOAF profiles to find, for example, all people living in Europe, or to list all people both you and a friend of yours know.[1][2] This is accomplished by defining relationships between people. Each profile has a unique identifier (such as the person's e-mail address, a Jabber ID, or a URI of the person's homepage or weblog), which is used when defining these relationships.
The GeoNames geographical database is available for download free of charge under a Creative Commons attribution license. It contains over 10 million geographical names and consists of over 8 million unique features, whereof 2.8 million are populated places and 5.5 million are alternate names. All features are categorized into one of nine feature classes and further subcategorized into one of 645 feature codes. The data is accessible free of charge through a number of web services and a daily database export. GeoNames is already serving up to over 30 million web service requests per day.
Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents, or other mechanisms of control. The goals of the open data movement are similar to those of other "open" movements such as open source, open hardware, open content, and open access. The philosophy behind open data has long been established (for example, in the Mertonian tradition of science), but the term "open data" itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives such as Data.gov.
Open data is often focused on non-textual material such as maps, genomes, connectomes, chemical compounds, mathematical and scientific formulae, medical data and practice, bioscience, and biodiversity. Problems often arise because these are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data is controlled by organisations, both public and private. Control may be through access restrictions, licenses, copyright, patents, and charges for access or re-use. Advocates of open data argue that these restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license.
Data.gov is a U.S. government website launched in late May 2009 by the then Federal Chief Information Officer (CIO) of the United States, Vivek Kundra. According to its website, "The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government."[1]
Open Data Commons is the home of a set of legal tools to help you provide and use open data.
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3's emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' a user would give to an item (such as music, books, or movies) or social element (e.g., people or groups) they have not yet considered, using a model built from the characteristics of an item (content-based approaches) or the user's social environment (collaborative filtering approaches).[1][2]
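The collaborative filtering approach mentioned above can be sketched in a few lines: predict a user's rating for an unseen item as a similarity-weighted average of other users' ratings. This is a toy user-based variant with cosine similarity and made-up data; production systems use far larger matrices and techniques such as matrix factorization.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    overlap = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in overlap)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict(ratings, user, item):
    """Predict user's rating for item as a similarity-weighted average of
    other users' ratings (user-based collaborative filtering)."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other != user and item in theirs:
            sim = cosine(ratings[user], theirs)
            num += sim * theirs[item]
            den += abs(sim)
    return num / den if den else 0.0

# Toy rating data: ann's tastes track bob's, so bob's rating dominates.
ratings = {"ann": {"matrix": 5, "dune": 4},
           "bob": {"matrix": 5, "dune": 4, "brazil": 5},
           "eve": {"matrix": 1, "brazil": 2}}
print(round(predict(ratings, "ann", "brazil"), 2))
```

The prediction for ann lands close to bob's rating of 5 rather than eve's 2, because ann's rating vector is far more similar to bob's.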