The document proposes the creation of an Open Web Index (OWI) to address the lack of a comprehensive, public index of web content. It argues that current initiatives like Common Crawl are insufficient as they are not kept fully up-to-date, lack search functionality, and do not address spam removal. The OWI would separate the crawling and indexing of web content from proprietary search services built on top of the index. Building such a major public project requires political and financial support as well as technical expertise. The goal is an independent index that serves as a public library of web content.
The Need for and Fundamentals of an Open Web Index
1. THE NEED FOR AND FUNDAMENTALS OF AN OPEN WEB INDEX
Prof. Dr. Dirk Lewandowski
Hamburg University of Applied Sciences, Hamburg, Germany
dirk.lewandowski@haw-hamburg.de
First International Symposium on Open Search Technology
Garching, 23 October, 2019
2. Proposal for an Open Web Index (OWI)
Prof. Dr. Dirk Lewandowski
ABOUT ME
• Professor of Information Research and Information Retrieval at Hamburg University of Applied Sciences
• Author of 100+ scholarly articles on search engines
• German-language book “Suchmaschinen verstehen” (Springer, 2nd edition, 2018)
• Editor, Aslib Journal of Information Management (Emerald Publishing)
• Served as expert for the High Court of Justice (UK) and Deutscher Bundestag (German parliament)
https://searchstudies.org/dirk
5. PROBLEM STATEMENT
• As there is no central directory of the Web, private search engine companies have built large indexes of its contents
• Companies operating Web-scale indexes do not give other interested parties sufficient access to their data
• The difficulties in building a Web index lie in technical issues, operating costs, Web size, and freshness
• Due to these difficulties, there is no Web index built by a European company (or other entity)
6. IDEA
VISION
To build a public library of the Web
TECHNICAL IDEA
Separate the index from the services that are built on the index
PUBLIC VS. PRIVATE
While the index should be public, the services can be proprietary
7. STRUCTURE
[Diagram: layered structure of the OWI]
OWI components:
• OWI Crawler
• OWI Basic Indexer
• OWI Advanced Indexer
• OWI Web Index
• OWI Usage Data Index
OWI Interface / API
Services built on the interface: Service 1, Service 2, Service 3, Service 4
Users access the index through these services
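The separation of index and services shown in the structure slide can be illustrated with a minimal Python sketch. It is not the OWI design: the class names (`OpenIndex`, `WebSearchService`, `TermCountService`), the in-memory inverted index, and the example URLs are all assumptions made for illustration. The point is only that several independent services read from one shared index through a single narrow interface.

```python
from collections import defaultdict
import re


class OpenIndex:
    """Toy stand-in for the shared index: maps terms to the URLs containing them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_document(self, url, text):
        # "Basic indexing" here is just lowercased word tokenization (an assumption).
        for term in re.findall(r"\w+", text.lower()):
            self._postings[term].add(url)

    def lookup(self, term):
        # The only access path offered to services (the "interface / API").
        # Returns a copy so services cannot modify the index.
        return set(self._postings.get(term.lower(), set()))


class WebSearchService:
    """Hypothetical service 1: returns URLs that contain all query terms."""

    def __init__(self, index):
        self._index = index

    def search(self, query):
        terms = query.split()
        if not terms:
            return []
        hits = set.intersection(*(self._index.lookup(t) for t in terms))
        return sorted(hits)


class TermCountService:
    """Hypothetical service 2: reports how many documents mention a term."""

    def __init__(self, index):
        self._index = index

    def document_frequency(self, term):
        return len(self._index.lookup(term))


index = OpenIndex()
index.add_document("https://example.org/a", "Open web index proposal")
index.add_document("https://example.org/b", "Web search engines and the open web")

search = WebSearchService(index)
stats = TermCountService(index)
print(search.search("open web"))
print(stats.document_frequency("index"))
```

Both services depend only on `lookup`, so either could be proprietary while the index itself stays public.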
8. POSSIBLE APPLICATIONS
N.B.: This list of ideas is far from complete and serves illustrative purposes only.
SEARCH
• Web search
• Vertical search, e.g., video or scholarly content
SCIENCE / RESEARCH
• Trend analysis, e.g., political trends
• Language use on the Web
• Research evaluation, e.g., altmetrics
DATA ANALYSIS
• Data aggregation, e.g., company or person dossiers
• Opinion mining (“Who says what about whom?”)
• Market research
ARTIFICIAL INTELLIGENCE
OWI could form the foundation for large-scale AI applications, e.g.,
• Machine translation
• Question answering
COMBINING OWI DATA WITH PROPRIETARY DATA
• Company profiles + OWI data = enriched company dossiers
• Product data + OWI data = enriched product descriptions
• Geospatial data + OWI data = enriched map applications
10. WHAT SIZE SHOULD A WEB INDEX HAVE?
• 1.71 billion websites
• How many pages/URLs does this mean?
→ There is no such thing as a complete index.
→ However, without representing a major part of the Web, an index is more or less useless.
11. WHY ARE INITIATIVES LIKE COMMON CRAWL NOT ENOUGH?
They are not comprehensive
- Common Crawl: 2.6 billion pages (not websites!)
They are static
- Crawling once a month is very different from keeping an index current at all times
They do not provide search functionality
- No (basic) indexing as needed to build applications on top of the index
- No spam control as needed to build applications
- No human raters to control for the quality of the index
→ The use of initiatives like Common Crawl is more or less restricted to analysing Web content. Due to the sampling procedure applied, they may not even be very useful for that.
12. CRAWLING IS NOT THE PROBLEM, ANYWAY
Crawling is just the beginning of a long process. Indexing is required to make the crawled content searchable.
The real problems are
1) Avoiding spam (= excluding it from the index); spam makes up A LOT of the Web’s content
2) Keeping the index fresh
3) Providing indexing (basic and advanced)
4) Making the index searchable
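The first two problems in the list above can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the `CrawledPage` record, the idea of a numeric `spam_score` produced by some classifier, the 0.8 threshold, and the one-day freshness window are hypothetical and do not come from the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrawledPage:
    url: str
    fetched_at: datetime
    spam_score: float  # hypothetical score in [0, 1] from some spam classifier


def admit_to_index(pages, spam_threshold=0.8):
    """Problem 1: keep only pages whose spam score is below the threshold."""
    return [p for p in pages if p.spam_score < spam_threshold]


def due_for_recrawl(pages, now, max_age=timedelta(days=1)):
    """Problem 2: list pages whose last fetch is older than max_age."""
    return [p for p in pages if now - p.fetched_at > max_age]


now = datetime(2019, 10, 23, 12, 0)
pages = [
    CrawledPage("https://example.org/news", now - timedelta(hours=2), 0.05),
    CrawledPage("https://example.org/old", now - timedelta(days=30), 0.10),
    CrawledPage("https://example.org/spam", now - timedelta(hours=1), 0.95),
]

indexable = admit_to_index(pages)
stale = due_for_recrawl(indexable, now)
print([p.url for p in indexable])
print([p.url for p in stale])
```

A page fetched a month ago passes the spam check but is flagged as stale, which is the gap between a monthly crawl and an index that is kept current.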
13. BIAS ON THE WEB
Baeza-Yates, R. (2018). Bias on the web. Communications of the ACM, 61(6), 54–61. https://doi.org/10.1145/3209581
14. WHO CONTROLS THE RESULT RANKINGS?
[Diagram: the search engine result page and the groups of actors around it]
• Search engine providers
• Content providers
• Users
• Search engine optimizers
16. HOW TO PROCEED
- A comprehensive and fresh Web index is a societal/political project, not a mere technical problem.
- Therefore, we need to approach policymakers, who should decide to build the index (and to finance it).
- To make the index independent from governments, a European foundation should be established to govern it.
- The technical implementation of the index should lie in the hands of those (companies/institutions) best capable of building it.
17. THANK YOU
Dirk Lewandowski
Hamburg University of Applied Sciences, Hamburg, Germany
dirk.lewandowski@haw-hamburg.de
www.searchstudies.org/dirk