Searching: Needles and Haystacks
- Why it's important
- How it's done
- Technical difficulties
- Prospects, what's happening
- Social and ethical difficulties
Search and Hugh
- Involved in search here and there since about 1982, mainly on EEC projects, especially Celex
- Recent work for Vienna U on a multimedia database for stored manuscripts etc.
- Professionally in computing since about 1974. Actually, wrote a small FORTRAN program to calculate π in about 1966!
- Slides available at:
http://www.slideshare.net/hughbar/
hugh.barnard@gmail.com
As a Human Activity
- Looking for keys
- Remembering names and birthdays
- Looking up in a book
- And [the subject of this] making tools for the intertubes
- Getting a clue, from the above...
Why do/understand this at all?
- Since Google, Bing and Yahoo already did it for you:
- Lots of interesting technical pieces
- Self-education
- Fun and profit, do it 'better'
- Internal search engines, intranet search engines
- Domain-specific engines
- School or research projects
How it's Done: 1
- Health warning: this explanation is simplified!
- Let's take Google/Bing, for example
- How does it find one zillion documents/images with 'lolcat' in them, within a few seconds?
How it's Done: 2
- It did the work already, before you asked
- A key concept: indexing
- Another key concept: the inverted index [see Wikipedia]: https://en.wikipedia.org/wiki/Inverted_index
  - lolcat in document x at position y
  - highlighted cat in document x at position y (we'll come back to this, for rank)
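As a rough illustration of "lolcat in document x at position y", here is a minimal sketch of a positional inverted index. The function name, the toy documents and the whitespace tokenizer are assumptions made for this example; a real engine would tokenize far more carefully.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Naive tokenizer (an assumption): lowercase, split on whitespace.
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index


# Hypothetical toy documents, keyed by document id.
documents = {
    "x": "my lolcat is a funny cat",
    "z": "no cat pictures here",
}
index = build_inverted_index(documents)
print(index["lolcat"])  # [('x', 1)]
print(index["cat"])     # [('x', 5), ('z', 1)]
```

The lookup at query time is a single dictionary access, which is why the expensive part (reading every document) can be done ahead of time.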
Some Indexing Problems
- Bandwidth and data transfer: 4.73 billion 'visible pages' as of November
- Coverage: darkweb, dynamic creation of pages, JavaScript etc.
- Speed of refresh, 'freshness' of indexed data
- Storage sizes
These don't apply, or apply 'less', to domain-specific projects
Parts of a Search Engine
- 1: Spidering, harvesting and directory processing (we'll come back to the differences)
- 2: Parser/Indexer
- 3: Index/data storage
- 4: Retrieval [the bit of Google/Bing that we see!]
At any stage there are 'algorithms' for spam detection, for ranking etc.: the secret sauce of search
I'm going to go through these in order...
1: Spidering/Harvesting/Darkweb
- Spidering: start with a seed, follow links
- Directory based: index a load of things in a given
directory
- Harvesting and metasearch: academic harvester interfaces, domain specific. I do this at present:
https://en.wikipedia.org/wiki/Bioinformatic_Harvester
- Robots.txt: courtesy, the interactive web and the
darkweb
- Resource: https://commoncrawl.org/what-we-do/
Can people think about problems with any of these?
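To make "start with a seed, follow links" and the robots.txt courtesy concrete, here is a small sketch of a breadth-first spider. It runs over a hypothetical in-memory "web" rather than the network; the URLs, the `toy-spider` agent name and the robots.txt content are all invented for illustration. Only `urllib.robotparser` from the standard library is real.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical toy web: URL -> outgoing links (no network access needed).
TOY_WEB = {
    "http://example.org/": ["http://example.org/cats",
                            "http://example.org/private/diary"],
    "http://example.org/cats": ["http://example.org/lolcat",
                                "http://example.org/"],
    "http://example.org/lolcat": [],
    "http://example.org/private/diary": ["http://example.org/secret"],
    "http://example.org/secret": [],
}

# Hypothetical robots.txt asking every spider to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(seed, web, robots_txt, agent="toy-spider"):
    """Breadth-first crawl from seed, skipping URLs that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue:
        url = queue.popleft()
        if not robots.can_fetch(agent, url):
            continue  # be courteous: do not fetch, do not follow its links
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("http://example.org/", TOY_WEB, ROBOTS_TXT))
```

Note that the page at `/secret` is never found: it is only linked from a disallowed page, which is one small way parts of the web stay 'dark' to a polite spider.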
2: Parser/Indexer
- Now the fun begins!
- Parsing: breaking down the stuff you get into tokens
- <b>lolcat</b> can be indexed as 'lolcat', for example
- Some tagging could be preserved as meta-information, <h1>Cat</h1> and <p>Cat</p> (see later)
- Parsing document types: text, HTML, PDF etc.
- Swish-e: http://www.swish-e.org/ has a small-scale harvester/parser, for example
- Some problems/opportunities
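The parsing step above can be sketched with Python's standard `html.parser`. This is only an illustration under simple assumptions: the class name `TokenParser`, the punctuation stripping and the "innermost open tag" notion of context are choices made for this example, not how any particular engine does it.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Collect (token, position, enclosing_tag) triples from HTML."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching tag (tolerates sloppy HTML).
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        context = self.open_tags[-1] if self.open_tags else None
        for word in data.split():
            token = word.strip(".,!?;:").lower()
            if token:
                self.tokens.append((token, len(self.tokens), context))


def parse_html(html):
    parser = TokenParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


for triple in parse_html("<h1>Cat</h1><p>My <b>lolcat</b> is a cat!</p>"):
    print(triple)
```

Keeping the enclosing tag alongside each token is what lets a later stage treat <h1>Cat</h1> differently from <p>Cat</p>.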
3: Indexing: A 'plain vanilla' inverted index
3: Indexing: 'chocolate' indexing
- Semantic-aware/context storage:
For example: cat and <b>cat</b> and CAT! and <h1>cat</h1> and cat, cat, cat may have different 'values'
PageRank is the most [in]famous: https://en.wikipedia.org/wiki/PageRank for estimating the 'value' of a resource
- Spam (cat, cat, cat) detection/SEO gaming detection (germ warfare)
- In general: 'algorithms', these are the bits that will probably give competitive advantage...
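For a feel of how PageRank estimates the 'value' of a resource, here is a simplified power-iteration sketch. The damping factor of 0.85 is the commonly quoted value; the function name, the toy link graph and the handling of pages with no outgoing links are assumptions of this example, and production ranking is certainly far more elaborate.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share ("random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a, b and d all link to c; c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c comes out on top because three pages point at it, and a inherits most of c's score through the single link back. That inheritance is also what link spammers try to exploit.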
Storage
- This used to be 'easy', now lots of options
- Sparse data: some terms, such as 'lolcats', have lots of entries, pyx [wait for it!] won't have many
- It's a 'lot' of data, Google came about by misspelling googol
- Relational databases are rather unsuitable, in general
- Sparse, roll-your-own: https://en.wikipedia.org/wiki/Bigtable
- NoSQL and ready-mades: http://solr-vs-elasticsearch.com/ for example
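One way to picture 'sparse' storage is a map that only stores the cells that exist, loosely in the spirit of a Bigtable-style (row, column) map. The sketch below is a toy under stated assumptions: the class `SparseTable` and its methods are invented here, and it ignores everything that makes the real thing hard (timestamps, sorting, distribution, persistence).

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse map: (row, column) -> value; absent cells cost nothing."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        self.rows[row][column] = value

    def get(self, row, column, default=None):
        # Use .get on the outer map so lookups never create empty rows.
        return self.rows.get(row, {}).get(column, default)

    def cell_count(self):
        return sum(len(columns) for columns in self.rows.values())


# Rows are terms, columns are (hypothetical) document ids, values are positions.
table = SparseTable()
table.put("lolcats", "doc1", [3, 17])
table.put("lolcats", "doc2", [5])
table.put("lolcats", "doc3", [1])
table.put("pyx", "doc9", [42])

print(table.cell_count())          # 4
print(table.get("pyx", "doc1"))    # None
```

A relational table with one column per document would be almost entirely empty for a term like 'pyx', which is roughly why fixed-schema tables fit this data badly.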
This is a (pretty nice) Pyx!
4: Retrieval
- Here you get results of all this work
- Simple, one field, one button = Google
- Booleans and implied booleans [lots of words ANDed together]
- Relevant results, this is the main thing and links back to the storage and parsing
- Cookies/user awareness
- Let's look at a few problems with retrieval
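The 'implied boolean' idea (every word in the box is ANDed together) can be sketched as a set intersection over the index. The function name and the toy index below are assumptions for illustration, and the sketch leaves out ranking altogether.

```python
def search_all(index, query):
    """Return the doc ids that contain every query word (implied AND)."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    # Start from the rarest word so the working set stays small.
    postings.sort(key=len)
    result = set(postings[0])
    for other in postings[1:]:
        result &= other
    return result


# Hypothetical toy index: term -> set of document ids.
toy_index = {
    "funny": {"doc1", "doc3"},
    "lolcat": {"doc1", "doc2", "doc3"},
    "pyx": {"doc2"},
}
print(sorted(search_all(toy_index, "funny lolcat")))  # ['doc1', 'doc3']
print(sorted(search_all(toy_index, "funny pyx")))     # []
```

This only answers "which documents match"; deciding which of them are relevant enough to show first is the harder part.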
Problems with Retrieval
- General relevancy, the 'Paris Hilton' problem
- Contextual relevancy, a zookeeper will want
pythons, a programmer Python (cookies etc.)
- Purpose, buying, researching, looking
(remember you are the product!)
- Diacritics, non-Roman scripts (a problem for both
indexing and retrieval, actually)
- Non-textual, images, chemical structures
- Relevant synonyms
Etc.
Technical Difficulties: Examples
- Looking for André
- Looking for 中国 with a UK keyboard [for example: zhong1 + guo2]
- Looking for cat [furry] and cat [computer command]
- Speed of index refresh [days, usually]
- Storage and computation, everything is 'big'
- Semantic search vs. search for 'words'
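For the 'André' case, one common approach is to fold diacritics away at both indexing and query time, so that 'André', 'ANDRÉ' and 'andre' all land on the same index key. A minimal sketch using the standard `unicodedata` module follows; the function name `fold` is an assumption, and folding is lossy, so it may be wrong for languages where the accent changes the word.

```python
import unicodedata


def fold(text):
    """Lowercase and strip combining marks so 'André' matches 'andre'."""
    # NFKD splits 'é' into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("André"))  # andre
print(fold("中国"))    # 中国
```

Folding leaves 中国 unchanged, so it does nothing for the UK-keyboard problem: typing zhong1 + guo2 needs an input method or a transliteration table, which is a separate piece of work.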
Social Difficulties
- Right to be forgotten, search tracking
- Security services and data mining (queries, history, for example)
- Privacy and doxing, see visual tagging too
- Linking 'unlinked' data, informal 'joins', generally
- Automatic visual tagging [Facebook, ugh!]
- Automatic geolocation [most smartphones]
- Any more?
Opportunities
This is speculation, don't take it too seriously:
- expansion of domain-specific engines: https://www.shodan.io/
- above example was for IoT, clearly there'll be more
- 'honest' engines, non-profit etc. but how to finance?
- expanded and specialised metasearch
- improved semantics and synonyms
- better 'understanding' (Siri, see: http://sirius.clarity-lab.org/), translation, thesauri (where I came in, in fact)
Rounding Up
- It's a central human activity
- It's a vital activity for the [intra|inter]tubes (web, IoT, internal applications)
- Very simple central idea(s), but lots of evolution possible
- There's a huge societal debate to go with the technical evolution
- Questions and (possibly) some answers
Thanks!
Thanks for listening! As a reward, here is a nice picture of a
bathtime duck:
Questions
Also, I'm happy to give another talk this term:
- Perl?
- Raspberry Pi?
- Threats and opportunities in AI?
- Other talks that I am almost certainly
unqualified to give...

Searching tech2

  • 1. Searching: Needles and Haystacks  Why it's important  How it's done  Technical difficulties  Prospects, what's happening  Social and ethical difficulties
  • 2. Search and Hugh - Involved in search here and there since about 1982, mainly EEC projects, especially Celex - Recent work for Vienna U on a multimedia database for stored manuscripts etc. - Professionally in computing since about 1974. Actually, a small FORTRAN program to calculate π in about 1966! - Slides available at: http://www.slideshare.net/hughbar/ - hugh.barnard@gmail.com
  • 3. As a Human Activity - Looking for keys - Remembering names and birthdays - Looking up in a book - And [the subject of this] making tools for the intertubes - Getting a clue, from the above...
  • 4. Why do/understand this at all? - Since Google, Bing, Yahoo already did it, for you: - Lots of interesting technical pieces - Self education - Fun and profit, do it 'better' - Internal search engines, intranet search engines - Domain specific engines - School or research projects
  • 5. How it's Done: 1 - Health warning: this explanation is simplified! - Let's take Google/Bing for example - How does it find one zillion documents/images with 'lolcat' in them, within a few seconds?
  • 6. How it's Done: 2 - It did it already - A key concept: Indexing - Another key concept: Inverted index [see wikipedia]: https://en.wikipedia.org/wiki/Inverted_index - lolcat in document x at position y - highlighted cat in document x at position y (we'll come back to this, for rank)
  • 7. Some Indexing Problems - Bandwidth and data transfer, 4.73 billion 'visible pages' as of November - Coverage, darkweb, dynamic creation of pages, javascript etc. etc. - Speed of refresh, 'freshness' of indexed data - Storage sizes - These don't apply, or apply less, to domain specific projects
  • 8. Parts of a Search Engine - 1: Spidering, harvesting and directory processing (we'll come back to the differences) - 2: Parser/Indexer - 3: Index/data Storage - 4: Retrieval [the bit of Google/Bing that we see!] - At any stage, 'algorithms' for spam detection, for ranking etc. etc.: the secret sauce of search - I'm going to go through these in order...
  • 9. 1: Spidering/Harvesting/Darkweb - Spidering: start with a seed, follow links - Directory based: index a load of things in a given directory - Harvesting and metasearch: academic harvester interfaces, domain specific, I do this at present: https://en.wikipedia.org/wiki/Bioinformatic_Harvester - Robots.txt: courtesy, the interactive web and the darkweb - Resource: https://commoncrawl.org/what-we-do/ Can people think about problems with any of these?
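A minimal sketch of the "start with a seed, follow links" idea. To keep it runnable without a network, the link extractor is passed in as a function; `crawl`, `get_links` and `max_pages` are names invented for this illustration, and a real spider would also check robots.txt and rate-limit itself.

```python
from collections import deque


def crawl(seed, get_links, max_pages=100):
    """Breadth-first spider: visit the seed, then everything it links to.

    get_links is a caller-supplied function (an assumption of this sketch)
    that returns the outgoing links of a page.
    """
    seen = {seed}          # pages already queued, so loops don't trap us
    queue = deque([seed])  # pages waiting to be visited
    visited = []           # pages in the order we visited them
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set is what stops two pages that link to each other from being fetched forever, and `max_pages` is a crude answer to "when do we stop?".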
  • 10. 2: Parser/Indexer - Now the fun begins! - Parsing: breaking down the stuff you get into tokens - <b>lolcat</b> can be indexed as 'lolcat', for example - Some tagging could be preserved as meta-information, <h1>Cat</h1> and <p>Cat</p> (see later) - Parsing document types: text, html, pdf etc. - Swish-e: http://www.swish-e.org/ has a small scale harvester/parser, for example - Some problems/opportunities
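The parsing step above might look roughly like this, using only Python's standard `html.parser`. It strips the tags but remembers which tag each word sat in. The class and function names are made up for this sketch, and it ignores awkward cases such as unclosed `<br>` tags.

```python
import re
from html.parser import HTMLParser


class TagAwareTokenizer(HTMLParser):
    """Collect (token, enclosing_tag, position) triples from HTML text."""

    def __init__(self):
        super().__init__()
        self.stack = []    # tags currently open
        self.tokens = []   # (token, enclosing tag or None, position)
        self.position = 0  # running word counter

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else None
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, tag, self.position))
            self.position += 1


def tokenize(html):
    parser = TagAwareTokenizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.tokens
```

So `<b>lolcat</b>` is indexed as plain 'lolcat', but the 'b' is kept alongside it as meta-information for later ranking.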
  • 11. 3: Indexing: A 'plain vanilla' inverted index
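As a rough illustration of "lolcat in document x at position y", here is a tiny positional inverted index. The function name and the sample documents are invented for the example; real engines add compression, sharding and much more.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((doc_id, position))
    return index


# Two made-up documents, keyed by a made-up id.
docs = {"x": "my lolcat is a cat", "y": "the cat sat"}
index = build_index(docs)
```

Looking up `index["cat"]` is then a single dictionary access, however many documents there are. That is the sense in which the engine "did it already".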
  • 12. 3: Indexing: 'chocolate' indexing - Semantic aware/context storage. For example: cat and <b>cat</b> and CAT! and <h1>cat</h1> and cat, cat, cat may have different 'values' - PageRank is the most [in]famous, for estimating the 'value' of a resource: https://en.wikipedia.org/wiki/PageRank - Spam (cat, cat, cat) detection/SEO gaming detection (germ warfare) - In general: 'algorithms', these are the bits that will probably give competitive advantage...
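One hypothetical way to give cat, <b>cat</b> and <h1>cat</h1> different 'values', with a very crude guard against 'cat, cat, cat' stuffing. The weights and the cap are invented numbers for illustration only; this is not PageRank and not how any particular engine scores pages.

```python
# Hypothetical weights: a heading counts more than bold, bold more than plain.
TAG_WEIGHTS = {"h1": 3.0, "b": 2.0}
DEFAULT_WEIGHT = 1.0
# Occurrences beyond this many add nothing, so repetition stops paying off.
MAX_COUNTED = 3


def term_score(occurrences):
    """Score one term in one document.

    occurrences is a list with one entry per occurrence of the term:
    the name of the enclosing tag, or None for plain text.
    """
    weights = sorted(
        (TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in occurrences),
        reverse=True,
    )
    # Keep only the best few occurrences.
    return sum(weights[:MAX_COUNTED])
```

With these numbers a single `<h1>cat</h1>` scores the same as ten plain repetitions of 'cat', which is the kind of trade-off the 'secret sauce' has to tune.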
  • 13. Storage - This used to be 'easy', now lots of options - Sparse data: some terms ('lolcats') have lots of entries, 'pyx' [wait for it!] won't have many - It's a 'lot' of data, google came about by misspelling googol - Relationals are rather unsuitable, in general - Sparse, roll-your-own: https://en.wikipedia.org/wiki/Bigtable - Nosql and ready-mades: http://solr-vs-elasticsearch.com/ for example
  • 14. This is a (pretty nice) Pyx!
  • 15. 4: Retrieval - Here you get results of all this work - Simple, one field, one button = Google - Booleans and implied booleans [lots of words ANDed together] - Relevant results, this is the main thing and links back to the storage and parsing - Cookies/user awareness - Let's look at a few problems with retrieval
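A sketch of the 'implied boolean' idea: typing several words into the one box is treated as word1 AND word2 AND so on. It assumes an index that maps each word to the set of documents containing it; the function name and sample data are invented for the example, and there is no ranking here at all.

```python
def search_all(index, query):
    """Return the documents that contain every word in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Fetch the document set for each term; unknown words match nothing.
    postings = [index.get(term, set()) for term in terms]
    # Start from the rarest term so the working set stays small.
    postings.sort(key=len)
    result = set(postings[0])
    for docs in postings[1:]:
        result &= docs
    return result


# Made-up index: word -> ids of the documents that contain it.
sample_index = {"lolcat": {"x"}, "cat": {"x", "y"}, "duck": {"z"}}
```

Here `search_all(sample_index, "cat lolcat")` finds only document x, because y has 'cat' but no 'lolcat'.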
  • 16. Problems with Retrieval - General relevancy, the 'Paris Hilton' problem - Contextual relevancy, a zookeeper will want pythons, a programmer Python (cookies etc.) - Purpose, buying, researching, looking (remember you are the product!) - Diacritics, non-Roman (problem for both indexing and retrieval, actually) - Non-textual, images, chemical structures - Relevant synonyms - Etc.
  • 17. Technical Difficulties: Examples - Looking for André - Looking for 中国 with UK keyboard [for example: zhong1 + guo2] - Looking for cat [furry] and cat [computer command] - Speed of index refresh [days, usually] - Storage and computation, everything is 'big' - Semantic search vs. search for 'words'
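For the 'Looking for André' case, one common trick is to fold accents and case the same way at index time and at query time, so 'Andre' and 'André' meet in the middle. A minimal sketch with Python's standard `unicodedata` module; the function name `fold` is made up here, and folding is lossy, so it can merge words that a native speaker would keep apart.

```python
import unicodedata


def fold(text):
    """Strip accents and case so that 'André' and 'ANDRE' compare equal."""
    # NFKD splits 'é' into a plain 'e' followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep everything else.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()
```

Note that this does nothing for 中国 typed as 'zhong1 guo2' on a UK keyboard: that needs a transliteration table or an input method, not just normalisation.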
  • 18. Social Difficulties - Right to be forgotten, search tracking - Security services and data mining (queries, history, for example) - Privacy and doxing, see visual tagging too - Linking 'unlinked' data, informal 'joins', generally - Automatic visual tagging [facebook, ugh!] - Automatic geolocation [most smartphones] - Any more?
  • 19. Opportunities This is speculation, don't take too seriously: - expansion of domain specific: https://www.shodan.io/ - above example was for IoT, clearly there'll be more - 'honest' engines, non-profit etc. but how to finance? - expanded and specialised metasearch - improved semantics and synonyms - better 'understanding' (Siri, see: http://sirius.clarity-lab.org/), translation, thesauri (where I came in, in fact)
  • 20. Rounding Up - It's a central human activity - It's a vital activity for the [intra|inter]tubes (web, IoT, internal applications) - Very simple central idea(s), but lots of evolution possible - There's a huge societal debate to go with the technical evolution - Question and (possibly) some answers
  • 21. Thanks! Thanks for listening! As a reward, here is a nice picture of a bathtime duck:
  • 22. Questions Also, I'm happy to give another talk this term: - Perl? - Raspberry Pi? - Threats and opportunities in AI? - Other talks that I am almost certainly unqualified to give...