Christopher M. Frenz
 Information is being generated at a faster
rate than ever before
 The speed at which information can be
generated is continually increasing
 Continuous improvements in computers,
storage, and networking make much of this
information readily available to individuals
54,000 hits
 Most search engines use a keyword-based
approach
 If a document contains all of the specified
keywords, it is returned as a match
 Ranking algorithms (e.g., PageRank) are used
to put the most relevant results at the top of
the list and the least relevant at the bottom
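As a rough illustration of the keyword approach, the Perl sketch below returns a document only when it contains every keyword in the query. The documents, the query, and the `matches_all_keywords` helper are made up for illustration; a real search engine would use an index and a ranking step rather than a linear scan.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains every keyword (case-insensitive).
sub matches_all_keywords {
    my ($text, @keywords) = @_;
    for my $kw (@keywords) {
        return 0 unless $text =~ /\b\Q$kw\E\b/i;
    }
    return 1;
}

# Made-up documents standing in for indexed Web pages.
my %docs = (
    doc1 => 'Perl regular expressions make pattern matching easy',
    doc2 => 'Keyword search is the basis of most search engines',
    doc3 => 'Pattern matching with Perl and other languages',
);

# A document is a match only if both keywords are present.
my @query = ('perl', 'pattern');
for my $id (sort keys %docs) {
    print "$id matches\n" if matches_all_keywords($docs{$id}, @query);
}
```

Here doc1 and doc3 contain both keywords, while doc2 contains neither and is left out.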
 Not everything can be easily expressed as a
keyword
 Suppose you want to search for unknown
phone numbers. How can you do this with
keywords?
 How do we recognize a phone number when
we see one?
 We recognize a phone number by recognizing
the pattern of digits
◦ (XXX) XXX-XXXX
 While it is hard to express such a pattern in
the form of a keyword, it is really easy to
express it in the form of a regular expression
 (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
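#!/usr/bin/perl
use strict;
use warnings;
# sample text; the last line has no dash or dot before its final four digits
my $string=<<'LIST';
John (555) 555-5555 fits pattern
Bob 234 567-8901
Mary 734-234-9873
Tom 999 999-9999
Harry 111 111 1111 does not fit pattern
LIST
# prints every phone number found in the text
while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
 print "$1\n";
}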
 Conduct a broad keyword search using an
existing search engine
 Use your custom-coded application to take
the returned search results and perform
regular-expression-based pattern matching
 The results that match your regular
expression are your refined search results
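The refinement step can be sketched in a few lines of Perl. This is a minimal sketch that works on made-up search results rather than a live search API; the `refine_results` helper, the result records, and the example.com URLs are all hypothetical, and the pattern is the phone-number expression with its optional leading whitespace dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Phone-number pattern (optional leading whitespace omitted).
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Hypothetical helper: keep only the results whose text matches $pattern.
sub refine_results {
    my ($pattern, @results) = @_;
    return grep { $_->{text} =~ $pattern } @results;
}

# Made-up stand-ins for the results returned by a broad keyword search.
my @results = (
    { url => 'http://example.com/a', text => 'Call our office at (555) 555-5555' },
    { url => 'http://example.com/b', text => 'No contact details listed here' },
    { url => 'http://example.com/c', text => 'Support line: 734-234-9873' },
);

# Print the URL and the matched number for each refined result.
for my $hit (refine_results($phone, @results)) {
    my ($number) = $hit->{text} =~ /($phone)/;
    print "$hit->{url}: $number\n";
}
```

Only the first and third results survive the filter, since the second contains no text that fits the pattern.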
General Search APIs
 Bing
 Yahoo BOSS
 Blekko
 Yandex
 Twitter
Specialized Search APIs
 Medicine – PubMed
 Physics – arXiv
 Government – GovTrack
 Finance – Yahoo Finance
 etc.
Seeking to extract: DFN [A-Z]\d+
Script described in:
http://www.biomedcentral.com/1472-6947/7/32
#!/usr/bin/perl
use LWP;
use strict;
use warnings;
# sets query and congress session
my $query='fracking';
my $congress=112;
my $ua = LWP::UserAgent->new;
my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
# requests the matching bills and stops if the request failed
my $response=$ua->get($url);
die $response->status_line unless $response->is_success;
my $result=$response->content;
print $result;
Returns JSON-formatted output
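Once the JSON has been decoded, regular expressions can be applied to individual fields instead of the raw text. The sketch below uses the core JSON::PP module on a made-up sample string; the `objects` and `title` field names are assumptions about the layout of the response, the bill titles are invented, and `matching_titles` is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: return the titles in a JSON response that match $pattern.
# The "objects" and "title" field names are assumptions about the response layout.
sub matching_titles {
    my ($json_text, $pattern) = @_;
    my $data = decode_json($json_text);
    return grep { $_ =~ $pattern }
           map  { $_->{title} // '' } @{ $data->{objects} || [] };
}

# Made-up sample standing in for $response->content.
my $sample = '{"objects":[{"title":"H.R. 1: Example Fracking Act"},'
           . '{"title":"S. 2: Example Drilling Act"},'
           . '{"title":"H.R. 3: Another Example Act"}]}';

# Keep only the titles that start with a House bill number.
print "$_\n" for matching_titles($sample, qr/^H\.R\.\s\d+/);
```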
#!/usr/bin/perl

use LWP;
use XML::LibXML;
use strict;
use warnings;

my $ua=LWP::UserAgent->new();
my $query='perl programming';
my $url="http://blekko.com/ws/?q=$query+/rss";
# requests the results and stops if the request failed
my $response=$ua->get($url);
die $response->status_line unless $response->is_success;
my $results=$response->content;

# parses the RSS and prints the link and description of each item
my $parser=XML::LibXML->new;
my $domtree=$parser->parse_string($results);
my @Records=$domtree->getElementsByTagName("item");
my $i=0;
foreach my $record (@Records){
 my $link=$record->findvalue("link");
 print "$i $link\n";
 my $description=$record->findvalue("description");
 print "$description\n\n";
 $i++;
}
 Allows programmers to extract code samples
pertaining to a set of keywords
 Recognizes the patterns associated with
C/C++ functions and C/C++ control
structures
int myfunc ( ){
//code here
}
while ( ) {
//code here
}
use Text::Balanced qw(extract_codeblock);

# @links, $request (an LWP::UserAgent object) and the OFile filehandle
# are assumed to be set up earlier in the script

# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';

# regex used to match keywords/patterns that precede code blocks
my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

foreach my $link (@links){
 my $response=$request->get("$link"); # gets Web page
 my $results=$response->content;
 while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out JavaScript
 pos($results)=0;
 # strips the text before each match, then extracts the balanced {...} block
 while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
 my $code=$1 . extract_codeblock($results,$delim);
 print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
 print OFile "$code" . "\n" . "\n";
 }
}
 A common challenge to performing
information extraction and text mining on
many Web pages or parts of Web pages is
that the content is served up by JavaScript
 This can be dealt with by putting the
JavaScript that serves up the content through
a JavaScript engine such as V8
 <title>Contact XYZ inc</title>
<H1>Contact XYZ inc</H1><br>
<p>For more information about XYZ inc, please contact us at the following Email address</p>
<script type="text/javascript" language="javascript">
<!--
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
{ coded = "OKUxkq@KwtoO2K.0ko"
key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
shift=coded.length
link=""
for (i=0; i<coded.length; i++) {
if (key.indexOf(coded.charAt(i))==-1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:"+link+"'>"+link+"</a>")
}
//-->
</script><noscript>Sorry, you need Javascript on to email me.</noscript>
#!/usr/bin/perl
use JavaScript::V8;
use LWP;
use Text::Balanced qw(extract_codeblock);
use strict;
use warnings;
# delimiter used to distinguish code blocks for use with Text::Balanced
my $delim='{}';
# downloads Web page
my $ua=LWP::UserAgent->new;
my $response=$ua->get('http://localhost/email.html');
my $result=$response->content;
#print "$result\n\n";
# extracts JavaScript
my $js;
if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
 $js=extract_codeblock($result,$delim);
}
die "No obfuscator script found\n" unless defined $js;
# modifies JS to make it processable by V8 module
$js=~s/document\.write/write/;
$js=~s/'/'/g;
#print "$js\n\n";
# processes JS
my $context = JavaScript::V8::Context->new();
$context->bind_function(write => sub { print @_ });
my $mail=$context->eval("$js");
print "$mail\n\n";
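As a cross-check on what the V8 run should write, the decoding loop in the sample page is simple enough to port to Perl by hand. The sketch below is such a port and is not part of the original script; `decode_obfuscated` is a hypothetical helper name. A hand port only works when the obfuscation scheme is known in advance, which is the limitation the JavaScript engine approach avoids.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: Perl port of the decoding loop in the sample page's script.
sub decode_obfuscated {
    my ($coded, $key) = @_;
    my $shift  = length $coded;   # the script shifts by the length of the coded string
    my $keylen = length $key;
    my $link   = '';
    for my $ch (split //, $coded) {
        my $idx = index($key, $ch);
        if ($idx == -1) {
            $link .= $ch;         # characters not in the key pass through unchanged
        }
        else {
            $link .= substr($key, ($idx - $shift + $keylen) % $keylen, 1);
        }
    }
    return $link;
}

# Values copied from the sample page.
my $coded = 'OKUxkq@KwtoO2K.0ko';
my $key   = 'l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn';

print decode_obfuscated($coded, $key), "\n";   # prints person@example.com
```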
 cfrenz@gmail.com
 http://www.linkedin.com/in/christopherfrenz/

Information Retrieval and Extraction

  • 2. Information is being generated at a faster rate than ever before, and the speed at which it can be generated is continually increasing. Continuous improvements in computers, storage, and networking make much of this information readily available to individuals.
  • 4.
  • 5. Most search engines use a keyword-based approach: if a document contains all of the specified keywords, it is returned as a match. Ranking algorithms (e.g. PageRank) are then used to put the most relevant results at the top of the list and the least relevant at the bottom.
  • 6. Not everything can be easily expressed as a keyword. Suppose you want to search for unknown phone numbers: how can you do this with keywords? How do we recognize a phone number when we see one?
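As a rough illustration of the keyword approach described above, the following Perl sketch returns only the documents that contain every keyword and orders them by a toy score (total keyword occurrences). The corpus, the `keyword_search` helper, and the scoring rule are assumptions made for this example; the score is a stand-in and is not PageRank.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mini-corpus: document name => text.
my %docs = (
    'a.txt' => 'Perl regular expressions make pattern matching easy',
    'b.txt' => 'Search engines rank documents that match keywords',
    'c.txt' => 'Pattern matching with Perl; Perl is good at matching text',
);

# Returns the names of documents that contain every keyword,
# ordered by total keyword occurrences (a toy relevance score).
sub keyword_search {
    my ($docs, @keywords) = @_;
    my %score;
    DOC: for my $name (keys %$docs) {
        my $total = 0;
        for my $word (@keywords) {
            # Count whole-word, case-insensitive occurrences of this keyword.
            my $count = () = $docs->{$name} =~ /\b\Q$word\E\b/gi;
            next DOC unless $count;    # a missing keyword disqualifies the document
            $total += $count;
        }
        $score{$name} = $total;
    }
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    return @ranked;
}

print "$_\n" for keyword_search(\%docs, 'perl', 'matching');
```

With this sample data, `c.txt` is listed before `a.txt`, and `b.txt` is excluded because it lacks both keywords.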
  • 7.
  • 8. We recognize a phone number by recognizing the pattern of digits, e.g. (XXX) XXX-XXXX. While it is hard to express such a pattern in the form of a keyword, it is easy to express it in the form of a regular expression: (\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})
  • 9.
      #!/usr/bin/perl
      use strict;
      use warnings;

      # sample text: the first four lines fit the pattern, the last does not
      (my $string=<<'LIST');
      John (555) 555-5555 fits pattern
      Bob 234 567-8901
      Mary 734-234-9873
      Tom 999 999-9999
      Harry 111 111 1111 does not fit pattern
      LIST

      # print every substring that looks like a phone number
      while($string=~/(\s?\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4})/g){
        print "$1\n";
      }
  • 10. Conduct a broad keyword search using an existing search engine. Use your custom-coded application to take the returned search results and perform regular-expression-based pattern matching. The results that match your regular expression are your refined search results.
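To make the phone-number pattern above easier to read, here is a sketch that spells it out with Perl's `/x` modifier, which allows whitespace and comments inside the expression. The `extract_phones` helper and the trimming of leading whitespace are additions for this example and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern, written with /x so each piece can carry a comment.
my $phone = qr/
    (               # capture the whole number
      \s?           # optional leading whitespace
      \(?\d{3}\)?   # area code, optionally in parentheses
      [-\s.]?       # optional separator: dash, whitespace, or dot
      \d{3}         # next three digits
      [-.]          # required separator: dash or dot
      \d{4}         # last four digits
    )
/x;

# Hypothetical helper: returns every phone-like substring in $text,
# with the optional leading whitespace removed.
sub extract_phones {
    my ($text) = @_;
    my @found;
    while ($text =~ /$phone/g) {
        (my $number = $1) =~ s/^\s+//;
        push @found, $number;
    }
    return @found;
}

my $list = "John (555) 555-5555\nBob 234 567-8901\nMary 734-234-9873\n"
         . "Tom 999 999-9999\nHarry 111 111 1111\n";
print "$_\n" for extract_phones($list);
```

The last entry (`111 111 1111`) is skipped because the pattern requires a dash or dot before the final four digits.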
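The refinement step described above can be sketched as a filter over search results. In this minimal example the result records, their `url` and `text` fields, and the `refine_results` helper are all hypothetical; a real application would fill the list from a search API response.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical search hits, standing in for what a search API would return.
my @results = (
    { url => 'http://example.com/contact', text => 'Call our office at (555) 555-0100 today.' },
    { url => 'http://example.com/about',   text => 'We have been in business since 1999.' },
    { url => 'http://example.com/staff',   text => 'Front desk: 555-555-0199, fax: 555.555-0123' },
);

# Phone-number pattern without capture groups, so a list-context /g
# match returns each full match.
my $phone = qr/\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}/;

# Keeps only the hits whose text matches $pattern, along with the matches.
sub refine_results {
    my ($pattern, @hits) = @_;
    my @refined;
    for my $hit (@hits) {
        my @matches = $hit->{text} =~ /$pattern/g;
        push @refined, { url => $hit->{url}, matches => \@matches } if @matches;
    }
    return @refined;
}

for my $hit (refine_results($phone, @results)) {
    print "$hit->{url}: @{ $hit->{matches} }\n";
}
```

Only the first and third hits survive the filter; the second contains no phone-like text.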
  • 11. General Search APIs: Bing, Yahoo BOSS, Blekko, Yandex, Twitter. Specialized Search APIs: Medicine – PubMed; Physics – arXiv; Government – GovTrack; Finance – Yahoo Finance; etc.
  • 12. Seeking to extract: DFN [A-Z]\d+ (script described in: http://www.biomedcentral.com/1472-6947/7/32)
  • 13.
      #!/usr/bin/perl
      use LWP;
      use strict;
      use warnings;

      # sets query and congress session
      my $query='fracking';
      my $congress=112;

      # queries the GovTrack API and prints the raw response
      my $ua = LWP::UserAgent->new;
      my $url="http://www.govtrack.us/api/v2/bill?q=$query&congress=$congress";
      my $response=$ua->get($url);
      die $response->status_line unless $response->is_success;
      my $result=$response->content;
      print $result;

    Returns JSON-formatted output.
  • 14.
      #!/usr/bin/perl
      use LWP;
      use XML::LibXML;
      use strict;
      use warnings;

      # runs the search and requests the results as RSS
      my $ua=LWP::UserAgent->new();
      my $query='perl programming';
      my $url="http://blekko.com/ws/?q=$query+/rss";
      my $response=$ua->get($url);
      die unless $response->is_success;
      my $results=$response->content;

      # parses the RSS and prints the link and description of each item
      my $parser=XML::LibXML->new;
      my $domtree=$parser->parse_string($results);
      my @Records=$domtree->getElementsByTagName("item");
      my $i=0;
      foreach(@Records){
        my $link=$Records[$i]->getChildrenByTagName("link");
        print "$i $link\n";
        my $description=$Records[$i]->getChildrenByTagName("description");
        print "$description\n\n";
        $i++;
      }
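Once JSON output like this has been retrieved, it can be decoded and pattern-matched in the same way as plain text. The sketch below uses the core `JSON::PP` module on a made-up response; the `objects` and `title` field names, the sample titles, and the `bill_numbers` helper are assumptions for illustration and should be checked against the real API output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical response body; the field names are assumed, not confirmed.
my $json = <<'JSON';
{"objects":[
  {"title":"H.R. 1001: Example Energy Bill"},
  {"title":"S. 2002: Example Water Bill"},
  {"title":"Example item without a bill number"}
]}
JSON

# Decodes the JSON text and pulls a bill number (e.g. "H.R. 1001")
# out of each title that contains one.
sub bill_numbers {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    my @numbers;
    for my $object (@{ $data->{objects} || [] }) {
        my $title = $object->{title};
        next unless defined $title;
        push @numbers, $1 if $title =~ /\b((?:H\.R\.|S\.)\s?\d+)/;
    }
    return @numbers;
}

print "$_\n" for bill_numbers($json);
```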
  • 15.
  • 16. Allows programmers to extract code samples pertaining to a set of keywords. Recognizes the patterns associated with C/C++ functions and C/C++ control structures:
      int myfunc ( ){
        //code here
      }

      while ( ) {
        //code here
      }
  • 17.
      use Text::Balanced qw(extract_codeblock);

      #delimiter used to distinguish code blocks for use with Text::Balanced
      $delim='{}';

      #regex used to match keywords/patterns that precede code blocks
      my $regex='(((int|long|double|float|void)\s*?\w{1,25})|if|while|for)';

      foreach $link(@links){
        $response=$request->get("$link"); # gets Web page
        $results=$response->content;
        while($results=~s/<script.*?>.*?<\/script>//gsi){}; # filters out Javascript
        pos($results)=0;
        # strips everything before the next match, leaving the text at its opening brace
        while($results=~s/.*?($regex\s*?\(.*?\)\s*?)\{/{/s){
          $code=$1 . extract_codeblock($results,$delim);
          print OFile "<h3><a href=\"$link\">$link</a></h3> \n";
          print OFile "$code" . "\n" . "\n";
        }
      }
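The fragment above depends on variables that are set up elsewhere (`@links`, `$request`, `OFile`). As a self-contained sketch of the same idea, the example below applies `extract_codeblock` from the core `Text::Balanced` module to a small C snippet held in a string. The sample source, the simplified header pattern, and the `extract_blocks` helper are assumptions for this example. Because `Text::Balanced` parses block contents with Perl-oriented rules, the sample is deliberately kept free of comments, quotes, and slashes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Hypothetical C source to mine for code blocks.
my $c_source = <<'SRC';
int add(int a, int b) {
    return a + b;
}

void count_up(int n) {
    int i = 0;
    while (i < n) {
        i = i + 1;
    }
}
SRC

# Simplified pattern for what precedes a block: a function header or a
# control keyword, followed by a parenthesised list.
my $header = qr/(?:(?:int|long|double|float|void)\s+\w+|if|while|for)\s*\(.*?\)\s*/s;

# Returns each header together with its balanced {...} block.
sub extract_blocks {
    my ($source) = @_;
    my @blocks;
    # Strip everything up to the next header so the text starts at "{".
    while ($source =~ s/.*?($header)\{/{/s) {
        my $head = $1;
        # In scalar context the extracted block is removed from $source.
        my $body = extract_codeblock($source, '{}');
        last unless defined $body;
        push @blocks, $head . $body;
    }
    return @blocks;
}

print "$_\n\n" for extract_blocks($c_source);
```

Note that the nested `while` loop is returned as part of the enclosing `count_up` block, not as a separate result, since the whole balanced block is consumed at once.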
  • 18.
  • 19. A common challenge in performing information extraction and text mining on many Web pages, or parts of Web pages, is that the content is served up by JavaScript. This can be dealt with by putting the JavaScript that serves up the content through a JavaScript engine like V8.
  • 20.
      <title>Contact XYZ inc</title>
      <H1>Contact XYZ inc</H1><br>
      <p>For more information about XYZ inc, please contact us at the following Email address</p>
      <script type="text/javascript" language="javascript">
      <!--
      // Email obfuscator script 2.1 by Tim Williams, University of Arizona
      // Random encryption key feature by Andrew Moulden, Site Engineering Ltd
      // This code is freeware provided these four comment lines remain intact
      // A wizard to generate this code is at http://www.jottings.com/obfuscator/
      { coded = "OKUxkq@KwtoO2K.0ko"
        key = "l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn"
        shift=coded.length
        link=""
        for (i=0; i<coded.length; i++) {
          if (key.indexOf(coded.charAt(i))==-1) {
            ltr = coded.charAt(i)
            link += (ltr)
          }
          else {
            ltr = (key.indexOf(coded.charAt(i))-shift+key.length) % key.length
            link += (key.charAt(ltr))
          }
        }
      document.write("<a href='mailto:"+link+"'>"+link+"</a>")
      }
      //-->
      </script><noscript>Sorry, you need Javascript on to email me.</noscript>
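To see what the obfuscator script above computes, its decoding loop can be ported to Perl by hand. This is a different technique from running the JavaScript through an engine such as V8: it only works because the algorithm is known in advance. The `decode_address` helper name is made up for this sketch; the `coded` and `key` strings are copied from the page above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand port of the page's decoding loop: each character found in $key is
# shifted back by length($coded) positions (wrapping around); characters
# that are not in $key, such as "@" and ".", pass through unchanged.
sub decode_address {
    my ($coded, $key) = @_;
    my $shift = length($coded);
    my $plain = '';
    for my $char (split //, $coded) {
        my $pos = index($key, $char);
        if ($pos == -1) {
            $plain .= $char;
        }
        else {
            my $new_pos = ($pos - $shift + length($key)) % length($key);
            $plain .= substr($key, $new_pos, 1);
        }
    }
    return $plain;
}

my $coded = 'OKUxkq@KwtoO2K.0ko';
my $key   = 'l7rE9B41VmIKiFwOLq2uUGYCQaWoMfzNASycJj3Ds8dtRkPv6XTHg0beh5xpZn';

print decode_address($coded, $key), "\n";    # prints person@example.com
```

A hand port like this is useful for checking an engine-based extraction, but it has to be rewritten for every obfuscation scheme, which is why executing the page's own JavaScript is the more general approach.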
  • 21.
  • 22.
      #!/usr/bin/perl
      use JavaScript::V8;
      use LWP;
      use Text::Balanced qw(extract_codeblock);
      use strict;
      use warnings;

      #delimiter used to distinguish code blocks for use with Text::Balanced
      my $delim='{}';

      #downloads Web page
      my $ua=LWP::UserAgent->new;
      my $response=$ua->get('http://localhost/email.html');
      my $result=$response->content;
      #print "$result\n\n";

      #extracts JavaScript
      my $js;
      if($result=~s/.*?http:\/\/www\.jottings\.com\/obfuscator\/\s*\{/{/s){
        $js=extract_codeblock($result,$delim);
      }

      #modified JS to make it processable by V8 module
      $js=~s/document\.write/write/;
      $js=~s/'/'/g;
      #print "$js\n\n";

      #processes JS
      my $context = JavaScript::V8::Context->new();
      $context->bind_function(write => sub { print @_ });
      my $mail=$context->eval("$js");
      print "$mail\n\n";
  • 23.