Information Retrieval and Extraction

 Information is being generated at a faster
rate than ever before
 The speed at which information can be
generated is continually increasing
 Continuous improvements in computers,
storage, and networking make much of this
information readily available to individuals

 Most search engines use a keyword-based
approach
 If a document contains all of the specified
keywords, it is returned as a match
 Ranking algorithms (e.g. PageRank) are used
to put the most relevant results at the top of
the list and the least relevant at the bottom
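
The all-keywords matching rule described above can be sketched in a few lines of Perl. This is only an illustration, not how any particular search engine is implemented: the `%docs` mini-corpus and the `matches_all_keywords` helper are hypothetical names invented for the example, and no ranking step is shown.

```perl
use strict;
use warnings;

# Hypothetical mini-corpus standing in for indexed documents
my %docs = (
    doc1 => 'Perl regular expressions make pattern matching easy',
    doc2 => 'Search engines rank pages by relevance',
    doc3 => 'Pattern matching in Perl with regular expressions and search engines',
);

# Returns true only if every keyword occurs in the text (case-insensitive)
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $kw (@keywords) {
        return 0 unless $text =~ /\b\Q$kw\E\b/i;
    }
    return 1;
}

my @keywords = ('perl', 'search');
my @hits = grep { matches_all_keywords($docs{$_}, @keywords) } sort keys %docs;
print "$_\n" for @hits;
```

Only `doc3` contains both keywords, so it is the only document returned as a match.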

 Not everything can be easily expressed as a
keyword
 Suppose you want to search for phone numbers
that you do not know in advance. How can you
do this with keywords?
 How do we recognize a phone number when
we see one?

 We recognize a phone number by recognizing
the pattern of digits
◦ (XXX) XXX-XXXX
 While it is hard to express such a pattern in
the form of a keyword, it is really easy to
express it in the form of a regular expression
 (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})

#!/usr/bin/perl
use strict;
use warnings;
# sample data: one name and phone number per line
my $string = <<'LIST';
John (555) 555-5555 fits pattern
Bob 234 567-8901
Mary 734-234-9873
Tom 999 999-9999
Harry 111 111 1111 does not fit pattern
LIST
# prints every substring that matches the phone number pattern
while ($string =~ /(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g) {
    print "$1\n";
}

 Conduct a broad keyword search using an
existing search engine
 Use your custom-coded application to take
the returned search results and perform
regular-expression-based pattern matching
 The results that match your regular
expression are your refined search results
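
The refinement step above might look roughly like the following sketch. The `@results` array is a hypothetical stand-in for snippets returned by a search engine, and the pattern is the phone number expression from the earlier slide with the optional leading whitespace dropped; a real application would fetch the results through a search API first.

```perl
use strict;
use warnings;

# Hypothetical snippets standing in for results returned by a search engine
my @results = (
    'Call our office at (555) 555-5555 for details',
    'No contact information on this page',
    'Fax: 734-234-9873',
);

# Phone number pattern, precompiled once
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Keep only the substrings that match the pattern
my @refined;
for my $snippet (@results) {
    while ($snippet =~ /($phone)/g) {
        push @refined, $1;
    }
}
print "$_\n" for @refined;
```

The second snippet contains no match and drops out, which is the sense in which the regular expression refines the broad keyword search.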

General Search APIs
 Bing
 Yahoo BOSS
 Blekko
 Yandex
 Twitter

Specialized Search APIs
 Medicine – PubMed
 Physics – arXiv
 Government – GovTrack
 Finance – Yahoo Finance
 etc.

Seeking to Extract: DFN [A-Z]\d+
Script described in:
http://www.biomedcentral.com/1472-6947/7/32

#!/usr/bin/perl
use LWP::UserAgent;
use strict;
use warnings;
# sets query and congress session
my $query = 'fracking';
my $congress = 112;
my $ua = LWP::UserAgent->new;
my $url = "http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
my $response = $ua->get($url);
# stops with the HTTP status if the request failed
die $response->status_line unless $response->is_success;
my $result = $response->content;
print $result;

Returns JSON-formatted output
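
Once the JSON text is in hand, it can be decoded into Perl data structures before any pattern matching is applied. The sketch below uses the core `JSON::PP` module on a hard-coded string; the `objects` and `title` field names and the sample values are assumptions made for illustration and should be checked against the actual API response.

```perl
use strict;
use warnings;
use JSON::PP;   # core module since Perl 5.14

# Hypothetical response body; field names and values are illustrative assumptions
my $result = '{"objects":[{"title":"Example bill A"},{"title":"Example bill B"}]}';

# Decode the JSON text into a hash reference
my $data = decode_json($result);

# Collect one field from each returned record
my @titles = map { $_->{title} } @{ $data->{objects} };
print "$_\n" for @titles;
```

In the script above, `$result` would come from `$response->content` rather than from a literal string.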

#!/usr/bin/perl

use LWP::UserAgent;
use XML::LibXML;
use strict;
use warnings;

my $ua = LWP::UserAgent->new();
my $query = 'perl programming';
my $url = "http://blekko.com/ws/?q=$query+/rss";
my $response = $ua->get($url);
# stops before parsing if the request failed
die $response->status_line unless $response->is_success;
my $results = $response->content;

# parses the RSS feed and prints the link and description of each item
my $parser = XML::LibXML->new;
my $domtree = $parser->parse_string($results);
my @Records = $domtree->getElementsByTagName("item");
my $i = 0;
foreach my $record (@Records) {
    my $link = $record->getChildrenByTagName("link");
    print "$i $link\n";
    my $description = $record->getChildrenByTagName("description");
    print "$description\n\n";
    $i++;
}

 Allows programmers to extract code samples
pertaining to a set of keywords
 Recognizes the patterns associated with
C/C++ functions and C/C++ control
structures
int myfunc ( ){
//code here
}
while ( ) {
//code here
}

use Text::Balanced qw(extract_codeblock);

# fragment: @links, $request (assumed to be an LWP::UserAgent) and the
# OFile output handle are set up elsewhere in the script

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim = '{}';

# regex used to match keywords/patterns that precede code blocks
my $regex = '(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

foreach my $link (@links) {
    my $response = $request->get($link);   # gets Web page
    my $results = $response->content;
    $results =~ s/<script.*?>.*?<\/script>//gsi;   # filters out JavaScript
    pos($results) = 0;
    # strips everything up to the next construct, leaving its opening brace first
    while ($results =~ s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s) {
        my $code = $1 . extract_codeblock($results, $delim);
        print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
        print OFile "$code\n\n";
    }
}

 A common challenge to performing
information extraction and text mining on
many Web pages or parts of Web pages is
that the content is served up by JavaScript
 This can be dealt with by running the
JavaScript that serves up the content through
a JavaScript engine such as V8

 <title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">

</script><noscript>Sorry, you need Javascript on to email me.</noscript>

#!/usr/bin/perl
use JavaScript::V8;
use LWP::UserAgent;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;
# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim = '{}';
# downloads Web page
my $ua = LWP::UserAgent->new;
my $response = $ua->get('http://localhost/email.html');
my $result = $response->content;
#print "$result\n\n";
# extracts JavaScript
my $js;
if ($result =~ s#.*?http://www\.jottings\.com/obfuscator/\s*\{#{#s) {
    $js = extract_codeblock($result, $delim);
}
die "No obfuscated script found\n" unless defined $js;
# modifies JS to make it processable by V8 module
$js =~ s/document\.write/write/;
$js=~s/'/'/g;
#print "$js\n\n";
# processes JS
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });
my $mail = $context->eval($js);
print "$mail\n\n";

 cfrenz@gmail.com
 http://www.linkedin.com/in/christopherfrenz/
