Ordering the chaos: 
creating websites using 
imperfect data 
Andrew Stretton 
Oxford University Web SIG November 2014
Who am I, what is ChemBio Hub? 
• Andrew Stretton – Data Architect and Developer 
github.com/strets123 
@strets123 
linkedin (google me) 
• Chembio Hub 
http://chembiohub.ox.ac.uk (feel free to link to us!) 
@oxchembiohub 
github.com/thesgc
Chembio Hub exists to 
support research at the 
interface of chemistry and 
biology 
by enabling sharing of reagents, expertise 
and data across 20+ departments
Who are we trying to connect and how? 
User 1: 
Scientist at Oxford 
User 2: 
Potential collaborator 
Could be in industry 
or anywhere in academia 
Stored and curated by ChemBio Hub 
Unpublished 
results 
Negative Data 
Methods 
Equipment 
Reagents 
? Not sure yet 
Areas of 
expertise 
Questions 
and answers 
Contacts 
Publications 
Held on other sites or social networks 
Organised/linked to by ChemBio Hub
All of these parts require tagging entities in text. How can we do it cheaply and sustainably?
What sorts of messy data are we working with? 
• Full text from procedures, biographies, web sites 
• Raw CSV/ Excel formats from multiple machines 
or departmental processes 
• “Standard” XML and JSON formats from various 
sources that do not map perfectly to our 
application 
• Multiple external databases to submit data to
How do most of our users like their web-based tools? 
Simple Search 
Flexible data 
management 
Comprehensive, 
overlapping tagging 
Clear progress, seamless experience
What do we sometimes give them? 
• Incomplete or many-to-one tagging 
• Hyperlinks instead of the right information 
from the other site 
• Dumb search 
• Inflexible schemas 
• Lack of linking between datasets
What strategies do we have to deal with messy data? 
• Create more helpful data management apps 
• Fill in gaps in tagging by using search engines 
• Consider creating databases of flat files 
• Create map reduce / database views for schema normalisation and data analysis 
• Web crawling - not as hard or messy as it used to be
Let’s look at filling in gaps in tagging first; happy to discuss the other areas later…
How do we fill in gaps on un-tagged 
data? 
Let’s do an experiment… 
github.com/strets123/web-sig-2014/
Elasticsearch - information extraction on-the-fly 
• Take a dataset of 18801 companies 
– ~50% tagged 
– > 80% have some text data 
[Bar chart: share of companies (0% to 100%) that have Tags, Description, Overview, and Overview or description] 
Source data: http://jsonstudio.com/resources/ 
github.com/strets123/web-sig-2014/
Use the “significant terms” feature… 
• Which description/overview words are most strongly linked to each tag? 
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] 
education: [students, teachers, learning, education, student, educational] 
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] 
realestate: [estate, real, agents, property, listings] 
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] 
jobs: [job, jobs, employers, career] 
onlinemarketing: [marketing, seo, agency, optimization] 
projectmanagement: [project, projects, task, collaboration, teams, management]
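The “significant terms” idea can be sketched in plain Python, as a stand-in for the Elasticsearch aggregation rather than a copy of it: a word is significant for a tag when it appears in a larger share of the tagged documents than of the whole collection. The score below is a JLH-style heuristic (absolute change times relative change). The toy corpus, the `text` and `tags` keys and the function name are all assumptions made for illustration; they are not the talk's dataset or code.

```python
from collections import Counter


def significant_terms(docs, tag, top_n=3):
    """Rank words that are over-represented in documents carrying `tag`.

    Each doc is a dict with the hypothetical keys "text" and "tags".
    Score = (fg - bg) * (fg / bg), where fg and bg are the fractions of
    tagged documents and of all documents that contain the word.
    """
    foreground = [d for d in docs if tag in d["tags"]]
    if not foreground:
        return []

    def doc_freq(subset):
        # Count each word once per document.
        counts = Counter()
        for d in subset:
            counts.update(set(d["text"].lower().split()))
        return counts

    fg_counts = doc_freq(foreground)
    bg_counts = doc_freq(docs)

    scored = []
    for word, fg_n in fg_counts.items():
        fg = fg_n / len(foreground)
        bg = bg_counts[word] / len(docs)
        if fg > bg:
            scored.append(((fg - bg) * (fg / bg), word))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_n]]


# Invented toy corpus, for illustration only.
docs = [
    {"text": "a marketplace for musicians and fans", "tags": ["music"]},
    {"text": "tools for musicians and labels", "tags": ["music"]},
    {"text": "cheap flights and hotels", "tags": ["travel"]},
    {"text": "a marketplace for hotels", "tags": []},
]
music_terms = significant_terms(docs, "music")
print(music_terms)
```

Common words such as “and” or “for” score low because they are almost as frequent in the background as in the tagged subset, which is why no stop-word list is needed in this sketch.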
Now let’s test these queries 
• Which companies have no tag but are most likely to need tagging with “music”… 
uPlaya 
– Description: uPlaya provides independent or unsigned musicians with immediate feedback on their music…. 
– Category: games_video 
– Tags: - 
Webceleb 
– Description: Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.…. 
– Category: games_video 
– Tags: -
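One way to picture this test, sketched in plain Python rather than as an Elasticsearch query: keep only the records with no tags and rank them by how many of the tag's significant terms appear in their description. The term set is the one listed for music earlier. The record layout, the function name and the two `Example…` records are assumptions added for illustration.

```python
import re

MUSIC_TERMS = {"music", "artists", "musicians", "songs", "labels",
               "playlists", "bands", "song", "artist", "fans"}


def suggest_untagged(records, terms, min_hits=1):
    """Return names of untagged records, best match for `terms` first."""
    ranked = []
    for rec in records:
        if rec["tags"]:
            continue  # already tagged, nothing to fill in
        words = set(re.findall(r"[a-z]+", rec["description"].lower()))
        hits = len(words & terms)
        if hits >= min_hits:
            ranked.append((hits, rec["name"]))
    # Most matching terms first; ties broken by name.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked]


records = [
    {"name": "uPlaya", "tags": [],
     "description": "uPlaya provides independent or unsigned musicians "
                    "with immediate feedback on their music"},
    {"name": "Webceleb", "tags": [],
     "description": "Webceleb is music marketplace and community where "
                    "musicians and fans engage and profit from discovering, "
                    "purchasing and downloading the latest independent music."},
    # Hypothetical records, not from the source dataset.
    {"name": "ExampleAir", "tags": [], "description": "Cheap flights and hotels"},
    {"name": "ExampleRadio", "tags": ["music"], "description": "Playlists from indie bands"},
]
suggestions = suggest_untagged(records, MUSIC_TERMS)
print(suggestions)
```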
But what if we have 
NO TAGS?
A process to extract tags from text… 
• Index data 
• Assign resources (e.g. Amazon spot instance for large dataset) 
• List word counts with the least frequent first 
• Exclude lowest counts 
• Aggregate the significant terms for each word 
• Filter words that have a lot of high scoring significant terms
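The steps above can be sketched in plain Python on a toy corpus. This is a stand-in for the Elasticsearch-based process, not the code from the talk's repository: document frequencies replace the index, and a JLH-style score (absolute change times relative change) replaces the significant terms aggregation. The thresholds, function names and example texts are assumptions.

```python
from collections import Counter


def tokenize(text):
    return set(text.lower().split())


def candidate_tags(texts, min_docs=2, min_score=1.5, min_related=2):
    """Propose tag words from untagged text.

    A word is kept when at least `min_related` other words score
    `min_score` or more as significant terms for the documents that
    contain it. All thresholds are illustrative assumptions.
    """
    docs = [tokenize(t) for t in texts]
    total = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)

    # Least frequent first, excluding words seen in fewer than min_docs documents.
    ordered = sorted(doc_freq.items(), key=lambda kv: (kv[1], kv[0]))
    words = [w for w, n in ordered if n >= min_docs]

    tags = {}
    for word in words:
        foreground = [doc for doc in docs if word in doc]
        co_counts = Counter(w for doc in foreground for w in doc if w != word)
        related = []
        for other, n in co_counts.items():
            fg = n / len(foreground)        # share of foreground docs with `other`
            bg = doc_freq[other] / total    # share of all docs with `other`
            score = (fg - bg) * (fg / bg)
            if score >= min_score:
                related.append((score, other))
        if len(related) >= min_related:
            related.sort(key=lambda pair: (-pair[0], pair[1]))
            tags[word] = [other for _, other in related]
    return tags


# Invented example texts.
texts = [
    "airline flights and fares",
    "airline flights and fares online",
    "forex traders and trading",
    "forex traders and trading online",
    "local news and weather",
    "yellow pages directory online",
]
tags = candidate_tags(texts)
for word, related in tags.items():
    print(word, related)
```

In this toy run, generic words such as “and” and “online” drop out because nothing co-occurs with them more than it does in the corpus as a whole, while words seen only once are excluded before scoring.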
What does this give us? 
athletes: [athletes, coaches, athlete, coach, sports, fans] 
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] 
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] 
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] 
dial: [dial, calling, calls, voip, number, call, voice, phone] 
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] 
indie: [indie, labels, artists, music] 
logos: [logos, branding, flash, design] 
pci: [pci, dss, hipaa, compliance, sensitive, compliant] 
portland: [portland, oregon, inc, founded] 
ringtones: [ringtones, ringtone, personalization, games] 
traders: [traders, forex, trader, trading, quotes, stock, trade] 
yellow: [yellow, pages, directory, local] 
abc: [abc, cnn, nbc, television] 
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] 
aviation: [aviation, aircraft, aerospace, defense, transportation] 
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
What else can we do with this? 
• Filter words that have a lot of high scoring significant terms 
• De-duplicate where large overlaps exist 
• Assign levels of tags in order of frequency 
• Use to categorise new data on the fly using percolate 
• Curate manually 
• Generate a sidebar menu 
• Use Elasticsearch phrase suggester to create phrase tags 
github.com/strets123/web-sig-2014/
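The de-duplication step can be sketched as a set-overlap check: a candidate tag is dropped when its term list shares most of its members with a tag that has already been kept. Jaccard similarity and the 0.5 threshold are assumptions chosen for illustration; the slides do not say which overlap measure was used. The `airlines` entry is a hypothetical near-duplicate, while the other two lists are taken from the output shown earlier.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def deduplicate(tag_terms, threshold=0.5):
    """Keep one tag from each group whose term lists overlap heavily.

    `tag_terms` maps tag -> list of terms. Tags are visited in insertion
    order, so the first tag of an overlapping group is the one kept.
    """
    kept = {}
    for tag, terms in tag_terms.items():
        if any(jaccard(terms, other) >= threshold for other in kept.values()):
            continue
        kept[tag] = terms
    return kept


tag_terms = {
    "airline": ["airline", "fares", "airlines", "flights", "flight",
                "travel", "tickets", "hotel", "air"],
    # Hypothetical near-duplicate, not from the slides.
    "airlines": ["airlines", "airline", "flights", "fares", "tickets", "air"],
    "indie": ["indie", "labels", "artists", "music"],
}
unique_tags = deduplicate(tag_terms)
print(list(unique_tags))
```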
Advantages over direct curation / supervised learning: 
• Simplicity and pragmatism 
• Applicable to novel domains 
– e.g. Chemical Biology 
• Auto generated tags choose more appropriate 
word combinations than manual curators 
• No need for complex data formats like RDF 
• Data from many sources can be mixed 
– e.g. categories from other universities’ sites…
Where might this technology lead? 
• How about a tag-based file system? 
• How about an implicit social network? 
• Elasticsearch is really easy to scale… 
• Which websites, filesystems and datasets do 
you need to categorise? 
– Do you really need RDF ontologies, curators etc. or 
can you just do something simple?
Summary 
• We now have many options to categorise and 
tidy up messy data 
• Managing variations on schemas takes a lot of 
resources – leave it to the data owners if you 
can! 
• When it comes to tagging… 
– Perfection is in the eye of the beholder 
– Sustainability is really important
Thanks 
• Thanks to the Research 
informatics team at the NDM 
Structural Genomics 
Consortium 
– Paul Barrett 
– Karen Porter 
– Michael O’Hagan 
– Brian Marsden 
– David Damerell 
– Sefa Garsot 
– Anthony Bradley 
• Thanks to the InfoDev team 
at IT services for answering 
my endless questions about 
webauth 
• Funders: 
– John Fell Fund 
– NDM Strategic 
– Wellcome Trust 
– Higher Education Funding 
Council 
• To everyone here for listening
Any Questions? 
• Andrew Stretton 
github.com/strets123 
@strets123 
linkedin (google me) 
• Chembio Hub 
http://chembiohub.ox.ac.uk 
@oxchembiohub 
github.com/thesgc 
Simple example categorisation 
code available here in python 
github.com/strets123/web-sig-2014/
Appendix of other messy 
data techniques
How do we make it easy to 
add spreadsheet data to a 
system?
Working with flat files 
• Sometimes a flat file is the right schema for a 
dataset 
– User defined formats 
– Different types of research 
– Only some of the fields are relevant when 
comparing experiments 
– Data is not in memory unless needed 
• Pandas and HDF allow SQL-like queries on flat 
files
Helpful data management 
• Data Wrangler 
– https://player.vimeo.com/video/19185801 
• Raw 
– http://raw.densitydesign.org 
• Take these as inspiration for our tool for re-shaping 
biochemistry data
Simplifying web crawling 
• Modern web crawling patterns use class 
selectors instead of XPath 
– Less likelihood of change 
• Content can be crawled using a backend web 
browser 
– Dynamic JavaScript elements are included 
• Using a website’s data for classification is 
more acceptable than wholesale reproduction
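As a rough illustration of selecting by class rather than by position in the document tree, the sketch below uses only Python's standard `html.parser` module to collect the text of every element carrying a given class. The class name `company-description` and the sample markup are invented for the example; a real crawler would more likely use a dedicated selector library.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element that has `target_class`."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # > 0 while inside a matching element
        self.results = []
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the collected text and normalise whitespace.
            self.results.append(" ".join("".join(self._chunks).split()))
            self._chunks = []

    def handle_data(self, data):
        if self.depth:
            self._chunks.append(data)


# Invented sample page.
page = """
<html><body>
  <div id="x1"><p class="intro">Welcome</p></div>
  <section>
    <p class="company-description">Webceleb is a <b>music</b> marketplace.</p>
  </section>
  <p class="company-description highlighted">uPlaya gives feedback to musicians.</p>
</body></html>
"""
extractor = ClassTextExtractor("company-description")
extractor.feed(page)
descriptions = extractor.results
print(descriptions)
```

Both descriptions are found even though they sit at different depths in the page, which is the property that makes class selectors less likely to break when a layout changes.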
Managing multiple JSON schemas with views 
PostgreSQL – also supported by Rails/ActiveRecord 
Couchbase
Why views over JSON can be useful 
• Expose only required fields from e.g. RDF 
• Input format may change but we don’t want 
crawler to break 
• Required fields may change 
• Versions are easy to support if format 
normalisation is in the database layer 
• Storage is cheap 
• View code is executed only once
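The idea of normalising formats in a view layer can be sketched in plain Python, standing in for a PostgreSQL or Couchbase view: raw documents are stored untouched, and one small mapping function exposes only the required fields, whatever the input version. The field names, the two input layouts and the `Example` records below are hypothetical.

```python
import json


def company_view(raw_json):
    """Map a stored JSON document to the fields the application needs.

    Handles two hypothetical input layouts: an older flat one
    ({"name", "tags"}) and a newer nested one
    ({"company": {"title", "tag_list"}}).
    """
    doc = json.loads(raw_json)
    if "company" in doc:                 # newer, nested layout
        inner = doc["company"]
        name = inner.get("title")
        tags = inner.get("tag_list", "")
    else:                                # older, flat layout
        name = doc.get("name")
        tags = doc.get("tags", "")
    if isinstance(tags, str):            # comma-separated string -> list
        tags = [t.strip() for t in tags.split(",") if t.strip()]
    return {"name": name, "tags": tags}


# Raw documents as they might arrive from two versions of a source.
stored = [
    '{"name": "Example A", "tags": "music, analytics", "extra": 1}',
    '{"company": {"title": "Example B", "tag_list": ["music"]}}',
]
rows = [company_view(raw) for raw in stored]
print(rows)
```

Because the stored documents are never rewritten, supporting a new input version means adding a branch to the mapping rather than migrating data.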

Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 

Ordering the chaos: Creating websites with imperfect data

  • 1. Ordering the chaos: creating websites using imperfect data Andrew Stretton Oxford University Web SIG November 2014
  • 2. Who am I, what is ChemBio Hub? • Andrew Stretton – Data Architect and Developer github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk (feel free to link to us!) @oxchembiohub github.com/thesgc
  • 3. Chembio Hub exists to support research at the interface of chemistry and biology by enabling sharing of reagents, expertise and data across 20+ departments
  • 4. Who are we trying to connect and how? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 5. Who are we trying to connect and how? All of these parts require tagging entities in text. How can we do it cheaply and sustainably? User 1: Scientist at Oxford User 2: Potential collaborator Could be in industry or anywhere in academia Stored and curated by ChemBio Hub Unpublished results Negative Data Methods Equipment Reagents ? Not sure yet Areas of expertise Questions and answers Contacts Publications Held on other sites or social networks Organised/linked to by ChemBio Hub
  • 6. What sorts of messy data are we working with? • Full text from procedures, biographies, web sites • Raw CSV/ Excel formats from multiple machines or departmental processes • “Standard” XML and JSON formats from various sources that do not map perfectly to our application • Multiple external databases to submit data to
  • 7. How do most of our users like their web-based tools? Simple Search Flexible data management Comprehensive, overlapping tagging Clear progress, seamless experience
  • 8. What do we sometimes give them? • Incomplete or many-to-one tagging • Hyperlinks instead of the right information from the other site • Dumb search • Inflexible schemas • Lack of linking between datasets
  • 9. What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 10. Let’s look at this one first, happy to discuss other areas later… What strategies do we have to deal with messy data? Create more helpful data management apps Fill in gaps in tagging by using search engines Consider creating databases of flat files Create map reduce / Database views for schema Normalisation and data analysis Web crawling - not as hard or messy as it used to be
  • 11. How do we fill in gaps on un-tagged data? Let’s do an experiment… github.com/strets123/web-sig-2014/
  • 12. Elasticsearch - information extraction on-the-fly • Take a dataset of 18801 companies: ~50% tagged, >80% have some text data [Bar chart, 0% to 100%: share of companies with Tags, Description, Overview, Overview or description] Source data: http://jsonstudio.com/resources/ github.com/strets123/web-sig-2014/
  • 13. Use the “significant terms” feature… • Which description/overview words are most strongly linked to each tag? travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel] education: [students, teachers, learning, education, student, educational] music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans] realestate: [estate, real, agents, property, listings] Search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay] jobs: [job, jobs, employers, career] onlinemarketing: [marketing, seo, agency, optimization] projectmanagement: [project, projects, task, collaboration, teams, management]
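As a rough illustration of slide 13, the request below asks Elasticsearch which description words are unusually common among companies carrying a given tag. This is a minimal sketch, not the code from the talk's repository: the field names `tag_list` and `description` and the aggregation name `linked_words` are assumptions, and the body would still need to be sent to a running cluster with a client of your choice.

```python
import json


def significant_terms_query(tag, tag_field="tag_list",
                            text_field="description", size=10):
    """Build a search body asking which words in `text_field` are unusually
    common among documents carrying `tag`, compared with the whole index.

    The field names are assumptions about how the companies were indexed.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {  # hypothetical aggregation name
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_query("music")
print(json.dumps(body, indent=2))
```

On recent Elasticsearch versions an analysed text field may need the `significant_text` aggregation instead, so treat the exact aggregation type as something to check against the version in use.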
  • 14. Now let’s test these queries • Which companies have no tag but are most likely to need tagging with “music”… uPlaya Description uPlaya provides independent or unsigned musicians with immediate feedback on their music…. Category games_video Tags - Webceleb Description Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.…. Category games_video Tags -
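The test on slide 14 could be phrased as a query along the following lines: keep only companies with no tag, then rank them by how well their text matches the words linked to "music". This is a sketch under assumptions: the field names `tag_list`, `description` and `overview` are guesses at the index mapping, and companies whose tag field is an empty string rather than missing would need a different filter.

```python
def untagged_candidates_query(terms, tag_field="tag_list",
                              text_fields=("description", "overview"), size=10):
    """Build a search body for untagged documents whose text matches `terms`.

    Documents matching more of the terms score higher, so the top hits are
    the most likely candidates for the tag.
    """
    should = [
        {"multi_match": {"query": term, "fields": list(text_fields)}}
        for term in terms
    ]
    return {
        "size": size,
        "query": {
            "bool": {
                # Exclude anything that already has a value in the tag field.
                "must_not": [{"exists": {"field": tag_field}}],
                "should": should,
                "minimum_should_match": 1,
            }
        },
    }


music_terms = ["music", "artists", "musicians", "songs", "labels"]
music_query = untagged_candidates_query(music_terms)
```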
  • 15. But what if we have NO TAGS?
  • 16. A process to extract tags from text… Index Data Assign resources (e.g. Amazon spot instance for large dataset) List word counts with the least frequent first Exclude lowest counts Aggregate the significant terms for each word Filter words that have a lot of high scoring significant terms
  • 17. What does this give us? athletes: [athletes, coaches, athlete, coach, sports, fans] avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game] clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure] dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features] dial: [dial, calling, calls, voip, number, call, voice, phone] exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health] indie: [indie, labels, artists, music] logos: [logos, branding, flash, design] pci: [pci, dss, hipaa, compliance, sensitive, compliant] portland: [portland, oregon, inc, founded] ringtones: [ringtones, ringtone, personalization, games] traders: [traders, forex, trader, trading, quotes, stock, trade] yellow: [yellow, pages, directory, local] abc: [abc, cnn, nbc, television] argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin] aviation: [aviation, aircraft, aerospace, defense, transportation] airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
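The process on slide 16, which produces lists like those on slide 17, can be imitated in plain Python on a toy corpus. This is only a sketch of the idea: it swaps Elasticsearch's significance scoring for a simple lift ratio (how much more often a word appears alongside the seed word than in the corpus overall), and the thresholds and sample sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Return the set of lower-cased words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_tags(docs, min_docs=2, min_lift=2.0, min_related=2):
    """Map each candidate tag word to the words most strongly linked to it."""
    doc_words = [tokenize(d) for d in docs]
    n_docs = len(doc_words)
    doc_freq = Counter(w for words in doc_words for w in words)

    # List words with the least frequent first, excluding the lowest counts.
    seeds = sorted((w for w, c in doc_freq.items() if c >= min_docs),
                   key=lambda w: (doc_freq[w], w))

    tags = {}
    for seed in seeds:
        foreground = [words for words in doc_words if seed in words]
        fg_freq = Counter(w for words in foreground for w in words)
        related = []
        for word, count in fg_freq.items():
            if word == seed or count < min_docs:
                continue
            # Lift: rate among documents containing the seed / rate overall.
            lift = (count * n_docs) / (len(foreground) * doc_freq[word])
            if lift >= min_lift:
                related.append((lift, word))
        # Keep only words that have several strongly linked terms.
        if len(related) >= min_related:
            ranked = sorted(related, key=lambda pair: (-pair[0], pair[1]))
            tags[seed] = [seed] + [word for _, word in ranked]
    return tags


docs = [
    "indie music labels and artists",
    "indie artists sell music via labels",
    "forex traders and stock trading",
    "stock traders watch forex trading quotes",
    "cloud hosting music",
    "cloud hosting stock data",
]
tags = candidate_tags(docs)
print(tags["indie"])
```

Note that the filler word "and" drops out without a stop-word list, because no other word co-occurs with it often enough to count as linked.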
  • 18. What else can we do with this? Filter words that have a lot of high scoring significant terms De-duplicate where large overlaps exist Assign levels of tags in order of frequency Use to categorise new data on the fly using percolate Curate manually Generate a sidebar menu github.com/strets123/web-sig-2014/ Use Elasticsearch phrase suggester to create phrase tags
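One way to read "de-duplicate where large overlaps exist" on slide 18 is to compare the term lists of candidate tags and drop any list that mostly repeats one already kept. The sketch below uses Jaccard overlap with an arbitrary 0.6 threshold. Both the measure and the threshold are assumptions rather than the talk's method, and the `forex` group is a hypothetical near-duplicate added next to two groups taken from slide 17.

```python
def jaccard(a, b):
    """Share of distinct terms that two lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def deduplicate(tag_terms, threshold=0.6):
    """Keep one tag from each cluster of heavily overlapping term lists."""
    kept = {}
    # Visit larger groups first so the richer group is the one that survives.
    for tag in sorted(tag_terms, key=lambda t: (-len(tag_terms[t]), t)):
        terms = tag_terms[tag]
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


groups = {
    "traders": ["traders", "forex", "trader", "trading", "quotes", "stock", "trade"],
    "forex": ["forex", "traders", "trading", "quotes", "stock"],  # hypothetical
    "indie": ["indie", "labels", "artists", "music"],
}
unique_groups = deduplicate(groups)
print(list(unique_groups))
```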
  • 19. Advantages over direct curation / supervised learning: • Simplicity and pragmatism • Applicable to novel domains – e.g. Chemical Biology • Auto-generated tags choose more appropriate word combinations than manual curators • No need for complex data formats like RDF • Data from many sources can be mixed – e.g. categories from other universities’ sites…
  • 20. Where might this technology lead? • How about a tag-based file system? • How about an implicit social network? • Elasticsearch is really easy to scale… • Which websites, filesystems and datasets do you need to categorise? – Do you really need RDF ontologies, curators etc. or can you just do something simple?
  • 21. Summary • We now have many options to categorise and tidy up messy data • Managing variations on schemas takes a lot of resources – leave it to the data owners if you can! • When it comes to tagging… – Perfection is in the eye of the beholder – Sustainability is really important
  • 22. Thanks • Thanks to the Research informatics team at the NDM Structural Genomics Consortium – Paul Barrett – Karen Porter – Michael O’Hagan – Brian Marsden – David Damerell – Sefa Garsot – Anthony Bradley • Thanks to the InfoDev team at IT services for answering my endless questions about webauth • Funders: – John Fell Fund – NDM Strategic – Wellcome Trust – Higher Education Funding Council • To everyone here for listening
  • 23. Any Questions? • Andrew Stretton github.com/strets123 @strets123 linkedin (google me) • Chembio Hub http://chembiohub.ox.ac.uk @oxchembiohub github.com/thesgc Simple example categorisation code in Python available here: github.com/strets123/web-sig-2014/
  • 24. Appendix of other messy data techniques
  • 25. How do we make it easy to add spreadsheet data to a system?
  • 26. Working with flat files • Sometimes a flat file is the right schema for a dataset – User defined formats – Different types of research – Only some of the fields are relevant when comparing experiments – Data is not in memory unless needed • Pandas and HDF allow SQL-like queries on flat files
  • 27. Helpful data management • Data Wrangler – https://player.vimeo.com/video/19185801 • Raw – http://raw.densitydesign.org • Take these as inspiration for our tool for re-shaping biochemistry data
  • 28. Simplifying web crawling • Modern web crawling patterns use class selectors instead of XPath – Less likelihood of change • Content can be crawled using a backend web browser – Dynamic JavaScript elements are included • Using a website’s data for classification is more acceptable than wholesale reproduction
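To illustrate the class-selector point on slide 28, the sketch below pulls the text of every element carrying a given CSS class, wherever it sits in the page. It uses only the standard library's `html.parser` so that it is self-contained; a real crawler would more likely use a library with full CSS selector support. The markup and the class name `company-description` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries `css_class`."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside a matching element
        self.chunks = []    # text pieces of the element being read
        self.results = []   # finished, whitespace-normalised texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append(" ".join("".join(self.chunks).split()))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


page = """
<html><body>
<div id="main"><div><p class="company-description lead">uPlaya gives
  <b>musicians</b><br> feedback.</p></div></div>
<p class="footer">Contact us</p>
<p class="company-description">Webceleb is a music marketplace.</p>
</body></html>
"""
extractor = ClassTextExtractor("company-description")
extractor.feed(page)
print(extractor.results)
```

The extractor never mentions the wrapping `div` elements, so the page layout around the description can change without breaking it, which is the robustness argument the slide makes against position-based XPath expressions.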
  • 29. Managing multiple JSON schemas with views • PostgreSQL – also supported by Rails/ActiveRecord • Couchbase
  • 30. Why views over JSON can be useful • Expose only required fields from e.g. RDF • Input format may change but we don’t want crawler to break • Required fields may change • Versions are easy to support if format normalisation is in the database layer • Storage is cheap • View code is executed only once

Editor's notes

  1. Real-world data is not: perfectly tagged, in one place, in one format, in one technology stack. Spreadsheet processes don’t just disappear when you build a tool.