SlideShare una empresa de Scribd logo
1 de 21
Descargar para leer sin conexión
How we built the largest
                          open database of
                        companies in the world




Thursday, 7 June 2012
A simple (huge) goal: an entry (and URI) for
       every corporate legal entity in the world
                                            URI is based on the company register
                                              ID, meaning it’s open and IP-free




        Also i
    trade       mpor
          marks       ting p
   officia         , gove ublic data
          l regis        rnme
                 ters &       nt spe –
                         gazet         nding
                               te not         ,
                                      ices..
                                             .




Thursday, 7 June 2012
All Op
                                 enly L
                        free re         icens
                                use, e        ed, al
                                       ven c         lowin
                                             omm           g
                                                   ercial
                                                          ly

Thursday, 7 June 2012
5 core uses




Thursday, 7 June 2012
1. An open identifying system
               URIs can be used as common identifiers among a
               variety of organisations
               Can be used without reference to OpenCorporates
               Because they map to the id issued by the company
               register the corresponding entry in the registry (and
               associated info) can be found, and vice versa
               Fits the new EU Business Vocabulary
               Can even by used for companies in jurisdiction we
               haven’t yet imported

Thursday, 7 June 2012
2. The simple search

               Not to be underestimated
               Massively reduces friction
               (how long will it take you
               to find and search
               multiple jurisdictions)
               Allows what if questions
               Potentially generates
               stories in its own right

Thursday, 7 June 2012
3. Source for additional info
               Addresses, filings,
               status, websites...
               Intl trademarks, UK
               govt spending, official
               notices, health & safety
               violations...
               Other IDs: SEC, CAGE,
               etc – allows reverse
               mapping queries, e.g.
               show me legal entitity
               mapped to a CIK code
Thursday, 7 June 2012
4. Reconciliation
         (matching names to legal entities)

         Clean up messy
         company names
         (& prev names)
         to legal entity,
         and from there
         to other data
         Google Refine
         reconciliation
         service (specific
         to jurisdiction)

Thursday, 7 June 2012
5. The platform

               API: allows all
               information to be
               retrieved as data,
               even searches
               Users can now
               add data too
               Coming soon: the
               option to match
               data to
               companies
Thursday, 7 June 2012
New feature: directors/officers

         We’ve just
         started
         importing &
         indexing
         company
         directors &
         officers,
         allowing search
         by name, &
                                other resources
         finding links
         between them
         and other         similarly named
         companies

Thursday, 7 June 2012
How have we done it?
         1. Started small,
         with just three
         countries and
         3 million
         companies
         2. Increasingly
         using official
         sources, where
         this is possible (i.e.
         the company
         registers are open
         and make data
         available)

Thursday, 7 June 2012
How have we done it?
          3. Leveraged the
          open data
          community and
          ScraperWiki to
          scrape company
          registers around
          the world
          4. Worked with
          governments to
          help understand
          the problems – EU,
          World Bank, G20
          Financial Stability
          Board, etc

Thursday, 7 June 2012
The technology
         Vanilla, commodity open-source software, hosted on our
         own UK-based servers

         Database                        MySQL
                              (but considering PostgreSQL)
         Search                           Solr
                             (but considering ElasticSearch)
         Code                              Ruby
                          (RubyOnRails main app, Sinatra API,
                         vanilla Ruby for various internal libraries)
         Webserver          Nginx (webserver) + Memcached
                         (caching) + Redis (queue + persistence)

Thursday, 7 June 2012
How do we pay for all this?


               Unlike many open data projects, we’re a for-profit
               company – the open data movement needs successful
               companies if it’s going to have a diverse ecosystem
               But we’re a company whose business model is
               dependent on making more data open, and an
               advisory board to make sure we do the right thing
               Not yet looking for customers, but...


Thursday, 7 June 2012
How do we pay for all this?
         Two projected sources of income

               Services model, especially around cleansing data/
               reconciliation. Of course, you can use our API,
               reconciliation service without asking us, but it may be
               cheaper to pay us to do it. Ditto custom extracts, and
               verticals
               Dual-licence model – contribute back to the community
               either with data, or financial support, e.g. if you have a
               proprietary database you may not want to be bound by
               the share-alike attribution restrictions
               And we already have some (small) customers

Thursday, 7 June 2012
The problems




         Getting the data Company registers have forgotten their
         main role is as public record, and actively work to prohibit
         free and open access to the data
Thursday, 7 June 2012
The problems




         Understanding the data Language, legal and cultural
         issues, not to mention the complexity of the subject
Thursday, 7 June 2012
The problems




         Normalising the data How do we abstract company
         types, status, industry codes, addresses, etc
Thursday, 7 June 2012
W3C Business Vocabulary

               What are
               we doing?
               Why are we
               doing it?
               What does
               it mean?
               Where is it
               going?

Thursday, 7 June 2012
The problems




         Handling the data Over 150 million rows in some tables
         (slow schema changes), heavy reading and writing,
         evolving understanding of the problems and solutions
Thursday, 7 June 2012
tions
                                                          isdic tes
                                                     0 jur
                                              nies in 5 23 US sta
                                        compa     clud ing
                               3million         In
     wo                 v er 4
  No




Thursday, 7 June 2012

Más contenido relacionado

Similar a EDF2012 Chris Taggart - How the biggest Open Database of Companies was built

Open Data 4 Startups
Open Data 4 StartupsOpen Data 4 Startups
Open Data 4 StartupsCSI Piemonte
 
Open data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival TorinoOpen data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival Torinomzaglio
 
Lessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas TribuneLessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas TribuneElise Hu-Stiles
 
You rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LODYou rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LODMateja Verlic
 
M12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part TwoM12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part TwoMER Conference
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project GuidanceVarad Meru
 
Code sharing at MediaEval
Code sharing at MediaEvalCode sharing at MediaEval
Code sharing at MediaEvalAdam Rae
 
CII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingCII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingAnand Deshpande
 
Looking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SMELooking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SMEsmespire
 
Learn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data CollectionLearn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data CollectionIQPC Exchange
 
Automated indexing - Hyland Onbase
Automated indexing - Hyland OnbaseAutomated indexing - Hyland Onbase
Automated indexing - Hyland OnbaseAMS Imaging
 
Productivity Future Vision
Productivity Future VisionProductivity Future Vision
Productivity Future VisionMicro Focus SRL
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation AgenciesNovavia Solutions
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And FootballAmanda Gray
 
Open data: what's in it for business?
Open data: what's in it for business?Open data: what's in it for business?
Open data: what's in it for business?Chris Taggart
 
Website Usability | Class 1
Website Usability | Class 1Website Usability | Class 1
Website Usability | Class 1studiokandm
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Jari Koister
 
Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...Peter Wells
 
ORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesEDINA, University of Edinburgh
 

Similar a EDF2012 Chris Taggart - How the biggest Open Database of Companies was built (20)

Open Data 4 Startups
Open Data 4 StartupsOpen Data 4 Startups
Open Data 4 Startups
 
Open data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival TorinoOpen data 4 Startups @ Digital Festival Torino
Open data 4 Startups @ Digital Festival Torino
 
Lessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas TribuneLessons from Launching NPR StateImpact and The Texas Tribune
Lessons from Launching NPR StateImpact and The Texas Tribune
 
You rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LODYou rang, M’LOD? Google Refine in the world of LOD
You rang, M’LOD? Google Refine in the world of LOD
 
Story spaces pitch
Story spaces pitchStory spaces pitch
Story spaces pitch
 
M12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part TwoM12S07 - Retention & ESI - Paths to Success - Part Two
M12S07 - Retention & ESI - Paths to Success - Part Two
 
Final Year Project Guidance
Final Year Project GuidanceFinal Year Project Guidance
Final Year Project Guidance
 
Code sharing at MediaEval
Code sharing at MediaEvalCode sharing at MediaEval
Code sharing at MediaEval
 
CII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud ComputingCII Panel Discussion on Cloud Computing
CII Panel Discussion on Cloud Computing
 
Looking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SMELooking at INSPIRE from an Open Source obsessed SME
Looking at INSPIRE from an Open Source obsessed SME
 
Learn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data CollectionLearn from the Experts: The Do's and Don'ts of Data Collection
Learn from the Experts: The Do's and Don'ts of Data Collection
 
Automated indexing - Hyland Onbase
Automated indexing - Hyland OnbaseAutomated indexing - Hyland Onbase
Automated indexing - Hyland Onbase
 
Productivity Future Vision
Productivity Future VisionProductivity Future Vision
Productivity Future Vision
 
Open Data for Transportation Agencies
Open Data for Transportation AgenciesOpen Data for Transportation Agencies
Open Data for Transportation Agencies
 
Questions On The And Football
Questions On The And FootballQuestions On The And Football
Questions On The And Football
 
Open data: what's in it for business?
Open data: what's in it for business?Open data: what's in it for business?
Open data: what's in it for business?
 
Website Usability | Class 1
Website Usability | Class 1Website Usability | Class 1
Website Usability | Class 1
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
 
Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...Annual centre for competition policy conference - access to data, and more 20...
Annual centre for competition policy conference - access to data, and more 20...
 
ORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple RepositoriesORI & RJ Broker: Automating Deposition to Multiple Repositories
ORI & RJ Broker: Automating Deposition to Multiple Repositories
 

Más de European Data Forum

EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...European Data Forum
 
EDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro PresentationEDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro PresentationEuropean Data Forum
 
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...European Data Forum
 
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...European Data Forum
 
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...European Data Forum
 
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...European Data Forum
 
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...European Data Forum
 
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...European Data Forum
 
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...European Data Forum
 
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...European Data Forum
 
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...European Data Forum
 
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...European Data Forum
 
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...European Data Forum
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...European Data Forum
 
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...European Data Forum
 
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...European Data Forum
 
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...European Data Forum
 

Más de European Data Forum (20)

EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
EDF2014: Ralf-Peter Schaefer, Head of Traffic Product Unit, TomTom, Germany: ...
 
Barbato leit ict 15-16-17
Barbato leit ict 15-16-17Barbato leit ict 15-16-17
Barbato leit ict 15-16-17
 
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
EDF2014: BIG - NESSI Networking Session: Edward Curry, National University of...
 
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
EDF2014: BIG - NESSI Networking Session: Nuria de Lama, Representative to the...
 
EDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro PresentationEDF2014: BIG - NESSI Networking Session: Intro Presentation
EDF2014: BIG - NESSI Networking Session: Intro Presentation
 
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
EDF2014: Kush Wadhwa, Senior Partner, Trilateral Research & Consulting: Addre...
 
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
EDF2014: Adrian Cristal, Barcelona Supercomputing Center, RETHINK big Project...
 
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
EDF2014: Dimitris Vassiliadis, Head of Unit, EXUS Innovation Attractor: From ...
 
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
EDF2014: Rüdiger Eichin, Research Manager at SAP AG, Germany: Deriving Value ...
 
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
EDF2014: Paul Groth, Department of Computer Science & The Network Institute, ...
 
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
EDF2014: Christian Lindemann, Wolters Kluwer Germany & Christian Dirschl, Wol...
 
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
EDF2014: Marta Nagy-Rothengass, Head of Unit Data Value Chain, Directorate Ge...
 
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
EDF2014: Stefan Wrobel, Institute Director, Fraunhofer IAIS / Member of the b...
 
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
EDF2014: Michele Vescovi, Researcher, Semantic & Knowledge Innovation Lab, It...
 
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
EDF2014: Allan Hanbury, Senior Researcher, Vienna University of Technology, A...
 
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
EDF2014: Nikolaos Loutas, Manager at PwC Belgium, Business Models for Linked ...
 
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
EDF2014: Vedran Sabol, Head of the Knowledge Visualisation Area, Know-Center,...
 
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid...
 
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
EDF2014: Piek Vossen, Professor Computational Lexicology, VU University Amste...
 
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
EDF2014: Taru Rastas, Senior Advisor, Ministry of Communications of Finland: ...
 

Último

UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 

Último (20)

UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 

EDF2012 Chris Taggart - How the biggest Open Database of Companies was built

  • 1. How we built the largest open database of companies in the world Thursday, 7 June 2012
  • 2. A simple (huge) goal: an entry (and URI) for every corporate legal entity in the world URI is based on the company register ID, meaning it’s open and IP-free Also i trade mpor marks ting p officia , gove ublic data l regis rnme ters & nt spe – gazet nding te not , ices.. . Thursday, 7 June 2012
  • 3. All Op enly L free re icens use, e ed, al ven c lowin omm g ercial ly Thursday, 7 June 2012
  • 4. 5 core uses Thursday, 7 June 2012
  • 5. 1. An open identifying system URIs can be used as common identifiers among a variety of organisations Can be used without reference to OpenCorporates Because they map to the id issued by the company register the corresponding entry in the registry (and associated info) can be found, and vice versa Fits the new EU Business Vocabulary Can even by used for companies in jurisdiction we haven’t yet imported Thursday, 7 June 2012
  • 6. 2. The simple search Not to be underestimated Massively reduces friction (how long will it take you to find and search multiple jurisdictions) Allows what if questions Potentially generates stories in its own right Thursday, 7 June 2012
  • 7. 3. Source for additional info Addresses, filings, status, websites... Intl trademarks, UK govt spending, official notices, health & safety violations... Other IDs: SEC, CAGE, etc – allows reverse mapping queries, e.g. show me legal entitity mapped to a CIK code Thursday, 7 June 2012
  • 8. 4. Reconciliation (matching names to legal entities) Clean up messy company names (& prev names) to legal entity, and from there to other data Google Refine reconciliation service (specific to jurisdiction) Thursday, 7 June 2012
  • 9. 5. The platform API: allows all information to be retrieved as data, even searches Users can now add data too Coming soon: the option to match data to companies Thursday, 7 June 2012
  • 10. New feature: directors/officers We’ve just started importing & indexing company directors & officers, allowing search by name, & other resources finding links between them and other similarly named companies Thursday, 7 June 2012
  • 11. How have we done it? 1. Started small, with just three countries and 3 million companies 2. Increasingly using official sources, where this is possible (i.e. the company registers are open and make data available) Thursday, 7 June 2012
  • 12. How have we done it? 3. Leveraged the open data community and ScraperWiki to scrape company registers around the world 4. Worked with governments to help understand the problems – EU, World Bank, G20 Financial Stability Board, etc Thursday, 7 June 2012
  • 13. The technology Vanilla, commodity open-source software, hosted on our own UK-based servers Database MySQL (but considering PostgreSQL) Search Solr (but considering ElasticSearch) Code Ruby (RubyOnRails main app, Sinatra API, vanilla Ruby for various internal libraries) Webserver Nginx (webserver) + Memcached (caching) + Redis (queue + persistence) Thursday, 7 June 2012
  • 14. How do we pay for all this? Unlike many open data projects, we’re a for-profit company – the open data movement needs successful companies if it’s going to have a diverse ecosystem But we’re a company whose business model is dependent on making more data open, and an advisory board to make sure we do the right thing Not yet looking for customers, but... Thursday, 7 June 2012
  • 15. How do we pay for all this? Two projected sources of income Services model, especially around cleansing data/ reconciliation. Of course, you can use our API, reconciliation service without asking us, but it may be cheaper to pay us to do it. Ditto custom extracts, and verticals Dual-licence model – contribute back to the community either with data, or financial support, e.g. if you have a proprietary database you may not want to be bound by the share-alike attribution restrictions And we already have some (small) customers Thursday, 7 June 2012
  • 16. The problems Getting the data Company registers have forgotten their main role is as public record, and actively work to prohibit free and open access to the data Thursday, 7 June 2012
  • 17. The problems Understanding the data Language, legal and cultural issues, not to mention the complexity of the subject Thursday, 7 June 2012
  • 18. The problems Normalising the data How do we abstract company types, status, industry codes, addresses, etc Thursday, 7 June 2012
  • 19. W3C Business Vocabulary What are we doing? Why are we doing it? What does it mean? Where is it going? Thursday, 7 June 2012
  • 20. The problems Handling the data Over 150 million rows in some tables (slow schema changes), heavy reading and writing, evolving understanding of the problems and solutions Thursday, 7 June 2012
  • 21. tions isdic tes 0 jur nies in 5 23 US sta compa clud ing 3million In wo v er 4 No Thursday, 7 June 2012