Crawling the Web for Structured Documents

Structured Information Retrieval has gained a lot of interest in recent years, as this kind of information is becoming an invaluable asset for professional communities such as Software Engineering. Most of the research has focused on XML documents, with initiatives like INEX bringing together and evaluating new techniques for structured information. Although XML documents are the immediate choice, the Web is filled with several other types of structured information, accounting for millions of other documents. These documents may be collected directly using standard Web search engines like Google and Yahoo, or by following specific search patterns in online repositories like Sourceforge. This demo describes a distributed and focused Web crawler for any kind of structured document, and we use it to show how to exploit general-purpose resources to gather large amounts of real-world structured documents off the Web. This kind of tool could help build large test collections of other types of documents, such as Java source code for software-oriented search engines or RDF for semantic search.



  1. Crawling the Web for Structured Documents
     Julián Urbano, Juan Lloréns, Yorgos Andreadakis and Mónica Marrero
     University Carlos III of Madrid · Department of Computer Science

Motivation
Structured Information Retrieval has gained a lot of interest recently, yet almost all research focuses solely on XML documents, with initiatives like INEX. But what about other types of documents, like SQL, DTD, Java source code, RDF or UML? How can we easily gather real-world structured documents off the Web? And can we use them to develop collections and search engines for specific structured information?

Ask General-Purpose Web Search Engines

Type   Google (P@20)   Yahoo (P@20)
XML    25M  (0.85)     238K (0.8)
DTD    48K  (0.95)     48K  (1)
XSD    134K (1)        181K (1)
SQL    104K (1)        152K (0.95)
JAVA   3M   (1)        1.6M (1)

Are there really that many documents? Not everything is relevant (notably for SQL), and a search engine returns only about 1,000 results per query… So we have to develop filters, because:
• Query terms are not always relevant (they may appear in comments)
• There are many problems with MIME types
• File types are hierarchical (an XSD document is also XML)

Follow Link Patterns in Web Repositories
[Architecture diagram: a sample query ("account bank deposit") yields URLs plus additional info; a Scheduler hands these to several Crawlers, each with its own configuration, an HTML Processor and per-type Processors (SQL, Java, XML, ...), which store the downloaded files.]

How It Works
Built for the Microsoft .NET Framework and the free SQL Server Express. Collaborative, multi-computer and multi-threaded, with hot plug-in. The core is detached from the GUI, so it can be used programmatically. New file types and meta-data can be added on the fly with no effort (see the first sketch below).

What do Processors do?
There is one processor per file type, which decides (see the second sketch below):
1. What additional info we want for these files (e.g. number of FK definitions, DBMS)
2. Which files to filter out (e.g. an SQL script without table definitions)
3. How to process files (e.g. parse the SQL script and index the table names, fields and relationships)

Intelligent HTML Processor
Configured per domain:
• Discover URLs, to collect in the DB and download
• Follow URLs, just to navigate through (no need to download everything)
URL patterns are defined in terms of:
• The actual links in webpages
• The HTML structure of webpages
Highly customizable (see back page)
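To make the architecture concrete, here is a minimal Python sketch of the Scheduler/Crawler split, assuming an in-memory queue in place of the shared SQL Server database and plain threads in place of the distributed machines; all names are illustrative, not the tool's actual .NET API.

# Minimal sketch of the Scheduler/Crawler split (assumed names; the real
# tool is a .NET application backed by SQL Server, not an in-memory queue).
import queue
import threading
import urllib.request

url_queue = queue.Queue()

def crawler(processors):
    # Each crawler thread pulls a URL, downloads it, and hands the payload
    # to every registered per-type processor.
    while True:
        url = url_queue.get()
        if url is None:              # sentinel: the scheduler has no more work
            return
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            continue                 # download failed; the real crawler retries
        for process in processors:
            process(url, data)

def scheduler(seed_urls, processors, n_threads=4):
    # The scheduler feeds URLs to the crawler threads and waits for them.
    threads = [threading.Thread(target=crawler, args=(processors,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for url in seed_urls:
        url_queue.put(url)
    for _ in threads:                # one sentinel per crawler thread
        url_queue.put(None)
    for t in threads:
        t.join()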
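Along the same lines, a hypothetical per-type processor for SQL scripts, following the three steps listed above; the function signature and return convention are assumptions for illustration, not the tool's plug-in interface.

# Hypothetical SQL processor: collects extra meta-data, filters out scripts
# without table definitions, and leaves indexing to the caller.
import re

CREATE_TABLE = re.compile(r'\bCREATE\s+TABLE\b', re.IGNORECASE)
FOREIGN_KEY = re.compile(r'\bFOREIGN\s+KEY\b', re.IGNORECASE)

def process_sql(url, data):
    text = data.decode('utf-8', errors='replace')
    if not CREATE_TABLE.search(text):
        return None                  # filter: no table definitions, discard
    return {                         # meta-data to store alongside the file
        'url': url,
        'tables': len(CREATE_TABLE.findall(text)),
        'foreign_keys': len(FOREIGN_KEY.findall(text)),  # number of FK definitions
    }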
  2. Crawler configuration (schema outline):

<CrawlerSettings>
  <CrawlerId>
  <Threads>
    <Thread>
      <Priority>
      <UriType>
      <Target>
        <Uri> ...
      <Avoid>
        <Uri> ...
      <TryAnyUriTypeOnEmpty>
      <TryAnyUriOnEmpty>
    ...
  <DatabaseHost>
  <DatabaseName>
  <BatchSize>
  <WaitTimeForUris>
  <DownloadDirectory>
  <DownloadDirectoryDepth>
  <DownloadDirectoryWidth>
  <DownloadDirectoryPerUriType>
  <DownloadDirectoryFullPath>
  <UriTypes>
    <Type>
      <Name>
      <CanBeProcessed>
      <ProcessorAssembly>
      <ProcessorFullname>
      <ProcessorConfig>
    ...
  <Keywords>
    <Uri> ...
  <Notification>
    <Server>
    <From>
    <To>
      <Address> ...
  <HTMLSettings>
    <SpamWords>
      <Word> ...
    <SpamUris>
      <Uri> ...
    <UserAgents>
      <String> ...
    <MaxInMemoryFileSize>
    <DownloadBufferLength>
    <NormalizeUris>
    <UnescapeUris>
    <RemoveAnchors>
    <Domains>
      <Domain>
        <Uri>
        <MaxLevels>
        <CheckNoscript>
        <MaxQueueSize>
        <MaxTimeoutWait>
        <MaxDownloadAttempts>
        <MinTimeBetweenRequests>
        <MaxTimeBetweenRequests>
        <MaxRedirections>
        <UseSessions>
        <KeepAlive>
        <IgnoreCertificate>
        <AllowDeflate>
        <AllowGZIP>
        <InLinkFollow>
          <Uri> ...
        <InPageFollow>
          <Uri> ...
        <InLinkDiscover>
          <Uri> ...
        <InPageDiscover>
          <Uri> ...
      ...
  <FileTypes>
    <Type>
      <UriTypeName>
      <MinLength>
      <MaxLength>
      <Extensions>
        <Extension> ...
      <MIMETypes>
        <Type> ...
    ...

Create target URLs with patterns and keywords:

<Keywords>
  <Uri><![CDATA[http://www.google.com/search?q=(?<key>+)(?<key>+)(?<key>+)(?<key>+)+%2B"create+table"+filetype:sql&filter=0]]></Uri>
  <Uri><![CDATA[http://sourceforge.net/search/?type_of_search=soft&words=(?<key>+)(?<key>+)(?<key>+)(?<key>+)]]></Uri>
</Keywords>

Get results from Google Search:

<InPageFollow>
  <Uri><![CDATA[<a href="(?<uri>[^"]+)"[^>]+id=pnnext]]></Uri>
</InPageFollow>
<InPageDiscover>
  <Uri><![CDATA[<h3.+?<a href="(?<uri>[^"]+)".+?</a></h3>]]></Uri>
</InPageDiscover>

Navigate through Sourceforge's projects and get project files:

<InLinkFollow>
  <Uri><![CDATA[(?<uri>http://sourceforge.net/projects/[^"]+/download)]]></Uri>
</InLinkFollow>
<InPageFollow>
  <Uri><![CDATA[<a href="(?<uri>[^"]+)">Next &rarr;</a>]]></Uri>
</InPageFollow>
<InLinkDiscover>
  <Uri><![CDATA[(?<uri>http://sourceforge.net/projects/[^/]+/)]]></Uri>
</InLinkDiscover>
<InPageDiscover>
  <Uri><![CDATA[Please use this <a href="(?<uri>[^"]+)"]]></Uri>
</InPageDiscover>
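The (?<key>+) tokens in the <Keywords> templates are placeholder slots that the crawler fills in with query keywords before fetching the seed pages. The exact substitution rule is not shown on the poster, so the following Python sketch only illustrates the idea under that assumption:

# Illustrative keyword expansion for the <Keywords> templates above. The
# one-keyword-per-(?<key>+)-slot rule is an assumption; only the template
# itself comes from the actual configuration.
import re

GOOGLE_TEMPLATE = ('http://www.google.com/search?q=(?<key>+)(?<key>+)'
                   '(?<key>+)(?<key>+)+%2B"create+table"+filetype:sql&filter=0')

def expand(template, keywords):
    out = template
    for kw in keywords:
        out = out.replace('(?<key>+)', kw + '+', 1)  # fill one slot per keyword
    out = out.replace('(?<key>+)', '')               # drop unused slots
    return re.sub(r'\+{2,}', '+', out)               # tidy duplicate separators

print(expand(GOOGLE_TEMPLATE, ['account', 'bank', 'deposit']))
# http://www.google.com/search?q=account+bank+deposit+%2B"create+table"+filetype:sql&filter=0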
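And a sketch of how the Google patterns could be applied by the HTML Processor: InPageFollow links are only navigated (here, the "next page" button), while InPageDiscover links (the result entries) are collected and downloaded. The driver code is assumed; the regexes are the configured ones, with .NET-style (?<uri>...) groups rewritten as Python's (?P<uri>...).

# Assumed driver for the HTML Processor: split a fetched page into URLs to
# follow (navigate only) and URLs to discover (collect in the DB, download).
import re

IN_PAGE_FOLLOW = re.compile(r'<a href="(?P<uri>[^"]+)"[^>]+id=pnnext')
IN_PAGE_DISCOVER = re.compile(r'<h3.+?<a href="(?P<uri>[^"]+)".+?</a></h3>')

def classify_links(html):
    follow = [m.group('uri') for m in IN_PAGE_FOLLOW.finditer(html)]
    discover = [m.group('uri') for m in IN_PAGE_DISCOVER.finditer(html)]
    return follow, discover

# Hypothetical fragment of a Google result page:
page = ('<h3 class="r"><a href="http://example.org/schema.sql">x</a></h3>'
        '<a href="/search?q=...&start=20" class="pn" id=pnnext>Next</a>')
print(classify_links(page))
# (['/search?q=...&start=20'], ['http://example.org/schema.sql'])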
