2. Information is being generated at a faster
rate than ever before
The speed at which information can be
generated is continually increasing
Continuous improvements in computers,
storage, and networking make much of this
information readily available to individuals
5. Most search engines use a keyword-based
approach
If a document contains all of the keywords
specified, it is returned as a match
Ranking algorithms (e.g. PageRank) are used
to put the most relevant results at the top of
the list and the least relevant at the bottom
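The matching rule above (a document is a hit only if it contains every keyword) can be sketched in a few lines of Perl. This is an illustration only, not how any particular search engine is implemented: the `matches_all_keywords` helper and the sample documents are hypothetical, and ranking is left out entirely.

```perl
use strict;
use warnings;

# Returns true when $text contains every keyword
# (case-insensitive, whole-word match).
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $kw (@keywords) {
        return 0 unless $text =~ /\b\Q$kw\E\b/i;
    }
    return 1;
}

# Hypothetical documents, for illustration only
my @documents = (
    'Perl makes regular expression matching easy',
    'Regular exercise is healthy',
    'A regular expression describes a pattern',
);

# Keep only the documents that contain both keywords
my @hits = grep { matches_all_keywords($_, 'regular', 'expression') } @documents;
print "$_\n" for @hits;
```

The second document contains "regular" but not "expression", so it is not returned.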
6. Not everything can be easily expressed as a
keyword
Suppose you want to search for unknown
phone numbers? How can you do this with
keywords?
How do we recognize a phone number when
we see one?
8. We recognize a phone number by recognizing
the pattern of digits
◦ (XXX) XXX-XXXX
While it is hard to express such a pattern in
the form of a keyword, it is really easy to
express it in the form of a regular expression
(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
9.
#!/usr/bin/perl
use strict;
use warnings;

# Sample text: the first four lines fit the pattern, the last does not
my $string = <<'LIST';
John (555) 555-5555 fits pattern
Bob 234 567-8901
Mary 734-234-9873
Tom 999 999-9999
Harry 111 111 1111 does not fit pattern
LIST

# Prints every phone number found in the text
while ($string =~ /(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g) {
    print "$1\n";
}
10. Conduct a broad keyword search using an
existing search engine
Use your custom-coded application to take
the returned search results and perform
regular-expression-based pattern matching
The results that match your regular
expression are your refined search results
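The refinement step above can be sketched as a small filter. This is a minimal illustration that assumes the search results have already been fetched into a list of strings. The `refine_results` helper and the sample results are hypothetical, and the pattern is the phone-number expression from the earlier slide.

```perl
use strict;
use warnings;

# Phone-number pattern from the earlier slide
my $phone_re = qr/\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Keeps only the results that match $pattern and returns
# [result, matched text] pairs.
sub refine_results {
    my ($pattern, @results) = @_;
    my @refined;
    for my $result (@results) {
        if (my ($match) = $result =~ /($pattern)/) {
            $match =~ s/^\s+//;    # drop the optional leading space
            push @refined, [$result, $match];
        }
    }
    return @refined;
}

# Hypothetical results of a broad keyword search
my @results = (
    'Call our office at (555) 555-5555 for details',
    'Our office is open Monday through Friday',
    'Fax: 734-234-9873',
);

my @refined = refine_results($phone_re, @results);
print "$_->[1]\n" for @refined;
```

Only the first and third results survive the filter, since the second contains no text that fits the pattern.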
11. General Search APIs
◦ Bing
◦ Yahoo BOSS
◦ Blekko
◦ Yandex
Specialized Search APIs
◦ Twitter
◦ Medicine – PubMed
◦ Physics – arXiv
◦ Government – GovTrack
◦ Finance – Yahoo Finance
◦ etc.
12. Seeking to Extract: DFN [A-Z]\d+
Script described in:
http://www.biomedcentral.com/1472-6947/7/32
13.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Sets query and congress session
my $query    = 'fracking';
my $congress = 112;

# Requests matching bills from the GovTrack API
my $ua       = LWP::UserAgent->new;
my $url      = "http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
my $response = $ua->get($url);
die $response->status_line unless $response->is_success;

my $result = $response->content;
print $result;
Returns JSON-formatted output
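Once a JSON response like this has been retrieved, it can be decoded into Perl data structures and then filtered with a regular expression. The sketch below uses the core `JSON::PP` module on a hard-coded sample string. The sample body and its `objects` and `title` field names are assumptions made for illustration, so check them against the actual API output before relying on them.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical response body; the real field names may differ
my $result = '{"objects":[{"title":"Bill A on fracking"},{"title":"Bill B on water safety"}]}';

# Decodes the JSON text into a Perl hash reference
my $data = decode_json($result);

# Collects the title of every returned record
my @titles = map { $_->{title} } @{ $data->{objects} };

# Refines the titles with a regular expression
my @refined = grep { /fracking/i } @titles;
print "$_\n" for @refined;
```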
14.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;

my $ua    = LWP::UserAgent->new;
my $query = 'perl programming';
my $url   = "http://blekko.com/ws/?q=$query+/rss";

# Fetches the RSS-formatted search results
my $response = $ua->get($url);
die $response->status_line unless $response->is_success;
my $results = $response->content;

# Parses the RSS and collects each <item> element
my $parser  = XML::LibXML->new;
my $domtree = $parser->parse_string($results);
my @records = $domtree->getElementsByTagName('item');

# Prints the link and description of each result
my $i = 0;
foreach my $record (@records) {
    my $link        = $record->findvalue('link');
    my $description = $record->findvalue('description');
    print "$i $link\n";
    print "$description\n\n";
    $i++;
}
16. Allows programmers to extract code samples
pertaining to a set of keywords
Recognizes the patterns associated with
C/C++ functions and C/C++ control
structures
int myfunc ( ){
//code here
}
while ( ) {
//code here
}
17.
use Text::Balanced qw(extract_codeblock);

# Delimiter used to distinguish code blocks for use with Text::Balanced
my $delim = '{}';
# Regex used to match keywords/patterns that precede code blocks
my $regex = '(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

# @links, $request, and the OFile filehandle are set up elsewhere in the script
foreach my $link (@links) {
    my $response = $request->get($link);    # gets Web page
    my $results  = $response->content;
    $results =~ s/<script.*?>.*?<\/script>//gsi;    # filters out JavaScript
    pos($results) = 0;
    # Trims everything up to the next code-block header, keeping the header in $1
    while ($results =~ s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s) {
        my $code = $1 . extract_codeblock($results, $delim);
        print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
        print OFile "$code" . "\n" . "\n";
    }
}
19. A common challenge when performing
information extraction and text mining on
Web pages, or parts of Web pages, is
that the content is generated by JavaScript
This can be dealt with by running the
JavaScript that generates the content through
a JavaScript engine such as V8
20. <title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
<!--
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
//-->
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
22.
#!/usr/bin/perl
use strict;
use warnings;
use JavaScript::V8;
use LWP::UserAgent;
use Text::Balanced qw(extract_codeblock);

# Delimiter used to distinguish code blocks for use with Text::Balanced
my $delim = '{}';

# Downloads Web page
my $ua       = LWP::UserAgent->new;
my $response = $ua->get('http://localhost/email.html');
die $response->status_line unless $response->is_success;
my $result = $response->content;
#print "$result\n\n";

# Extracts JavaScript: trims everything up to the block that follows the obfuscator URL
my $js;
if ($result =~ s!.*?http://www\.jottings\.com/obfuscator/\s*\{!{!s) {
    $js = extract_codeblock($result, $delim);
}
die "No obfuscator script found\n" unless defined $js;

# Modifies JS to make it processable by V8 module
$js =~ s/document\.write/write/;
$js=~s/'/'/g;
#print "$js\n\n";

# Processes JS; the bound write() prints the decoded mailto link
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });
my $mail = $context->eval($js);
print "$mail\n\n" if defined $mail;