WebART in 10 minutes
1. Web Archive Retrieval Tools
Paul Doorenbosch, Jaap Kamps, Richard Rogers, Arjen de Vries, René Voorburg
CATCH Meeting HiTime e-History, November 1, 2011
2. Information Access, Web Archive, New Media
Paul Doorenbosch, Arjen de Vries, René Voorburg, Jaap Kamps, Richard Rogers
16. Supporting Complex Search Tasks
Nick Belkin, Charlie Clarke, Ning Gao, Jaap Kamps, Jussi Karlgren
Thanks!
SIGIR 2011 Workshop, July 28, 2011
Editor's notes
Good afternoon. My name is Jaap Kamps and it is my pleasure to introduce the WebART (Web Archive Retrieval Tools) project.
The project is a collaboration of three groups of researchers:
1. Specialists working on Information Access (Computer Science, Arjen de Vries);
2. New media scholars working on the Web and the Web Archive (Humanities, Richard Rogers); and
3. Web archivists from the Dutch Web Archive (Heritage Sector, René Voorburg and Paul Doorenbosch).
What is special is that all three groups are actively building technical tools -- the Koninklijke Bibliotheek does large-scale crawling; the new media scholars build dedicated crawlers, screen scrapers, and analysis tools; and the computer scientists think they know the next generation of search tools.
The Web is a unique object with an unprecedented size and growth curve, and by far the largest source of information on -- basically -- everything. The Web has had a revolutionary impact on how we publish, access, and share information.
In fact, it has a fundamental impact on our daily lives, which increasingly take place “on the Web.”
However, this increasing dependence on the Web comes at a price: the ease of publishing on the Web also results in the easy loss of information, because Web content tends to be ephemeral. This project addresses the problem of our future cultural heritage. Globally this has been addressed head-on by the Internet Archive, now supplemented by many national initiatives.
We don’t want to focus on preserving the Web archive, but on using it. That is, we critically assess the value of Web archives for realistic research scenarios, and develop information access tools and methods that maximize the archive’s utility for research. Web research tends to require complex selections and manipulations of the data.
Search technology has advanced at an insane rate over the last decade. Who is still old enough to remember the early days of the Web, when people spent a considerable part of their time collecting and organizing bookmarks?
Despite the progress, complex tasks are still poorly supported by modern search engines! The best strategy is to slice-and-dice the complex information need into many small sub-requests, and combine all the information post hoc and outside the search engine into a coherent answer.
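To make the slice-and-dice strategy concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy index, the `search` function standing in for a search engine, and the choice to combine results by counting how many sub-requests each document matched. It is not how any particular engine or the WebART tools work.

```python
from collections import defaultdict

# Hypothetical toy collection standing in for a search engine's index.
TOY_INDEX = {
    "doc1": "dutch web archive crawl 2009 news",
    "doc2": "news site election coverage 2010",
    "doc3": "dutch election blog archive",
}


def search(query):
    """Toy engine: return ids of documents that contain every query term."""
    terms = query.lower().split()
    return {
        doc_id
        for doc_id, text in TOY_INDEX.items()
        if all(term in text.split() for term in terms)
    }


def combine_post_hoc(sub_queries):
    """Run each sub-request separately, then merge the results outside the engine.

    The merge rule (count matched sub-requests per document) is an
    illustrative assumption; a researcher could combine results differently.
    """
    hits = defaultdict(int)
    for query in sub_queries:
        for doc_id in search(query):
            hits[doc_id] += 1
    # Most sub-requests matched first; ties broken by document id.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# A complex need ("Dutch archived election material") sliced into sub-requests.
ranked = combine_post_hoc(["dutch", "election", "archive"])
print(ranked)
```

Note that all the combination logic lives outside the engine, which is exactly the burden on the searcher that the talk points at.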
Some systems allow for more complex interaction -- for example, systems catering for exploratory or faceted search.
Such systems create complex search queries in the back end -- and on restricted domains much of the complexity can be hidden from the searcher.
What if we had a way to open up this box, and allow searchers to manipulate complex requests or search strategies directly by combining several building blocks in unconstrained ways? Modern structured DB/IR technology allows for powerful, declarative queries or search strategies, turning a collection of Web pages into a high-dimensional data space.
Each building block corresponds to a particular data source or manipulation of the data. The search strategy effectively builds a dedicated search engine “on the fly”.
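As a rough illustration of the building-block idea, the sketch below chains small Python functions into a search strategy. The block names (`source`, `keep`, `rank_by`, `strategy`), the page fields, and the example URLs are all hypothetical; the actual WebART tooling is not described in these notes and may look quite different.

```python
from functools import reduce

# Hypothetical archived pages; the field names are illustrative assumptions.
PAGES = [
    {"url": "http://example.nl/a", "year": 2009, "text": "election news", "outlinks": 3},
    {"url": "http://example.nl/b", "year": 2010, "text": "election blog", "outlinks": 12},
    {"url": "http://example.org/c", "year": 2010, "text": "sports news", "outlinks": 7},
]


def source(pages):
    """Building block: a data source. It ignores its input and yields the collection."""
    return lambda _: list(pages)


def keep(predicate):
    """Building block: a manipulation that keeps only pages satisfying the predicate."""
    return lambda pages: [page for page in pages if predicate(page)]


def rank_by(key, descending=True):
    """Building block: a manipulation that orders pages by a key function."""
    return lambda pages: sorted(pages, key=key, reverse=descending)


def strategy(*blocks):
    """Chain building blocks left to right into one reusable search strategy."""
    return lambda: reduce(lambda data, block: block(data), blocks, None)


# Two strategies assembled from the same blocks.
election_2010 = strategy(
    source(PAGES),
    keep(lambda page: "election" in page["text"].split()),
    keep(lambda page: page["year"] == 2010),
    rank_by(lambda page: page["outlinks"]),
)
news_by_links = strategy(
    source(PAGES),
    keep(lambda page: "news" in page["text"].split()),
    rank_by(lambda page: page["outlinks"]),
)

results = [page["url"] for page in election_2010()]
news_results = [page["url"] for page in news_by_links()]
print(results, news_results)
```

Because a strategy is just a value built from blocks, it can be stored, refined by swapping one block, or handed to a colleague.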
What will happen if we put these tools in the hands of the Web researchers? We will develop the appropriate building blocks and incrementally let them construct complex search strategies. Effectively, this means they can do their research on the fly, rather than face a turnaround time of weeks or months for developing the right kind of crawler and the right kind of analysis tool, and then executing them. Moreover, researchers can store the search strategies, reuse and refine them, and share them with colleagues. In essence, the research methods will evolve in parallel with the search strategies, at a much faster pace and scale than ever before...
However, the chosen selection and archiving strategies for Web material will have a crucial impact on its future value as cultural heritage. What choices are made or forced upon us? What is the missing Web? The broken Web? The banned Web? We will critically evaluate the state of the Web Archive; the resulting recommendations may prevent the loss of digital heritage.
Progress is particularly thorny since we combine radically different research paradigms -- the truth-finding paradigm of the exact sciences and the interpretative paradigm of the humanities. We are in the unique situation of three disciplines (Computer Science, Media Studies, Heritage Field) looking at the same object of study, although each sees it in a different way.