
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes



This presentation provides a brief introduction to logical reasoning and an overview of the most popular semantic schema and ontology languages: RDFS and the profiles of OWL 2.

While automatic reasoning has always inspired the imagination, numerous projects have failed to deliver on its promises. The typical pitfalls related to ontologies and symbolic reasoning fall into three categories:

- Over-engineered ontologies. The selected ontology language and modeling patterns can be too expressive. This can make the results of inference hard to understand and verify, which in turn makes the KG hard to evolve and maintain. It can also impose performance penalties far greater than the benefits.

- Inappropriate reasoning support. Many inference algorithms and implementation approaches work well with taxonomies and conceptual models of a few thousand concepts, but cannot cope with KGs of millions of entities.

- Inappropriate data layer architecture. One such example is reasoning with virtual KGs, which is often infeasible.



  1. Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
     Atanas Kiryakov, Webinar, July 2020
     making sense of text and data
  2. Who are we?
     - Leader
       - Semantic technology vendor established in 2000
       - Part of Sirma Group: 400 persons, listed on the Sofia Stock Exchange
     - Profitable and growing
       - Global: 80% of revenue from London and New York
       - Clients: S&P, BBC, FT, Top-5 US Bank, UK Parliament, Fujitsu, …
       - Verticals: Financial services, Health care and Life sciences, Publishing, Manufacturing
     - Innovator
       - Attracted over $15M in innovation funding
       - Member of W3C, EDMC, ODI, STI and LDBC, developing next-generation standards
  3. Ontotext GraphDB™ - the Flagship Product
     "…, the market leaders in this space continue to be Neo4J and Ontotext (GraphDB), which are graph and RDF database providers respectively. These are the longest established vendors in this space (both founded in 2000) so they have a longevity and experience that other suppliers cannot yet match."
     Bloor Research, Graph Database Market Update 2020
  4. Ontotext Portfolio
  5. Presentation Outline
     - Reasoning Introduction: Benefits and Pitfalls
     - Reasoning Use Cases and Demos
     - RDFS and OWL 2 Profiles
     - Reasoning Implementation Choices
     - Reasoning With GraphDB
  6. Knowledge Graphs = Rich Data in Context
     KGs put data in context via linking and semantic metadata. We help enterprises get profound insights via interlinking, analyzing and exploring:
     - diverse databases
     - text documents and other content
     - proprietary & global data
  7. What is a Knowledge Graph?
     - The KG represents a collection of interlinked descriptions of concepts and entities
       - Concepts describe each other
       - Connections provide context
       - Context helps comprehension!
     - A KG can be used as:
       - Database: can be queried
       - Graph: can be analyzed as a network
       - Knowledge base: new facts can be inferred
     Read more: https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph/
  8. What is Semantics?
     - Formal semantics allows new valid facts to be inferred
       - Both data and schema can be interpreted
       - Semantic schema = ontology
       - Languages: RDF Schema (RDFS), OWL
     - Only the relevant semantics is formalized in the schema
       - The meaning of relativeOf is not fully described by defining it as owl:SymmetricProperty
       - The best model is the simplest one that can do the work. But not simpler!
     [Diagram: myData:Maria and myData:Ivan with the classes ptop:Agent, ptop:Person and ptop:Woman and the properties ptop:childOf, ptop:parentOf and owl:relativeOf; edges labeled rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf and owl:SymmetricProperty, some of them marked as inferred]
  9. Reasoning Benefits
     - Schema alignment and easy querying in diverse datasets
       - Across sources, similar relationships can be modeled in different ways: one source can use parentOf, another childOf and a third just the more general relativeOf
       - The database will return Ivan as a result of the query (Maria relativeOf ?x) when the fact derived from the source and asserted is (Ivan childOf Maria)
     - Getting deeper and more complete results
       - Finding patterns and inferring new relationships
       - Instant discovery of hidden relationships scattered across multiple sources
     - Consistency checking and quality validation
       - RDF Shapes ensure graph consistency and quality
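The schema-alignment example on slide 9 can be sketched as a small fixpoint computation. This is a minimal illustration in Python, not GraphDB's implementation: triples are plain tuples, and the three schema statements (childOf is the inverse of parentOf, both are sub-properties of relativeOf, relativeOf is symmetric) are assumptions read off the slide 8 diagram.

```python
# Toy forward-chaining over (subject, predicate, object) tuples.
# The schema below is an assumption made for illustration only.
INVERSE_OF = {("childOf", "parentOf")}
SUB_PROPERTY_OF = {("childOf", "relativeOf"), ("parentOf", "relativeOf")}
SYMMETRIC = {"relativeOf"}


def materialize(explicit):
    """Apply the inverse, sub-property and symmetry rules until nothing new appears."""
    closure = set(explicit)
    while True:
        new = set()
        for s, p, o in closure:
            for a, b in INVERSE_OF:
                if p == a:
                    new.add((o, b, s))
                if p == b:
                    new.add((o, a, s))
            for sub, sup in SUB_PROPERTY_OF:
                if p == sub:
                    new.add((s, sup, o))
            if p in SYMMETRIC:
                new.add((o, p, s))
        if new <= closure:
            return closure
        closure |= new


# Only one fact is asserted by the source.
closure = materialize({("Ivan", "childOf", "Maria")})

# Answer the query (Maria relativeOf ?x) against the materialized closure.
answers = sorted(o for s, p, o in closure if s == "Maria" and p == "relativeOf")
print(answers)
```

The query itself stays a plain lookup; all the alignment work happens when the closure is computed.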
  10. The Pitfalls of Reasoning
      - Over-engineered ontologies
        - Too expressive ontology language
        - Results of inference hard to understand and verify
        - Performance penalties far greater than the benefits
      - Inappropriate reasoning support
        - Inference implementations that work well with taxonomies and conceptual models of a few thousand concepts, but cannot cope with KGs of millions of entities
      - Inappropriate data layer architecture
        - One such example is reasoning with virtual KGs, which is often infeasible
  12. Search in British Museum’s Collection
      - Artefacts are described via the granular ontology CIDOC CRM
      - Searching in such a collection requires Fundamental Relations (FR)
        - Aggregation of a large number of paths through CRM data into a smaller number of searchable relations
      - E.g.: FR "Thing from Place"
  13. British Museum’s Collection: Volumetrics
      - Museum objects: 2,051,797
        - Thesaurus entries: 415,509
      - Explicit statements: 195,208,156
      - Total statements: 916,735,486
        - Expansion ratio is 4.7x, i.e., for each explicit statement, 3.7 more are inferred
        - Nodes (unique URLs and literals): 53,803,189
      - Loading time (including materialization):
        - 22.2h on RAM drive
        - 32.9h on non-SSD hard drives
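The expansion ratio quoted on slide 13 follows directly from the two statement counts. A quick check in Python, using only the figures from the slide:

```python
explicit = 195_208_156
total = 916_735_486

inferred = total - explicit
expansion_ratio = total / explicit          # total statements per explicit one
inferred_per_explicit = inferred / explicit  # inferred statements per explicit one

print(f"{expansion_ratio:.1f}x total, {inferred_per_explicit:.1f} inferred per explicit statement")
```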
  14. GraphDB Benchmarking
      - LDBC: TPC-like benchmarks for graph databases
      - Members include: Ontotext, OpenLink, neo4j, CWI, UPM, ORACLE, IBM, *Sparsity
      - LDBC Semantic Publishing Benchmark
        - Based on BBC’s Dynamic Semantic Publishing editorial workflow
        - Updates, adding new content metadata or updating the reference knowledge (e.g., new people)
        - Aggregation queries retrieve content according to various criteria (e.g., to generate a topic web page)
        - The only benchmark that involves reasoning and updates
  15. LDBC SPB Results of GraphDB
      Clients (reading / writing)    Reads/s    Writes/s
       0 /  1                         0.0000     11.4067
       0 /  2                         0.0000     14.3033
       0 /  4                         0.0000     14.6700
       0 /  8                         0.0000     15.1067
       1 /  0                        17.8258      0.0000
       4 /  0                        43.0833      0.0000
       8 /  0                        70.3767      0.0000
      16 /  0                        83.2633      0.0000
       8 /  2                        52.5667      9.2867
       8 /  4                        54.0233      9.6167
       8 /  8                        54.9067      9.5733
      10 /  2                        59.9467      8.5333
      10 /  4                        62.2867      8.4767
      10 /  8                        61.7167      8.6067
      16 /  2                        68.8100      5.0600
      16 /  4                        70.3900      5.1067
      16 /  8                        70.2300      4.9967
      16 / 16                        70.9467      5.0567
      - CPU: 1 x E5-1650
      - RAM: 20G heap
      - Dataset: LDBC SPB 256
      - DB: GraphDB SE 8.0
      - RDF statements: 254,948,985 (explicit), 480,405,141 (total), OWL-Horst-optimized rule set
      - Creative works: 8,821,535
  16. FactForge: Data Integration
      - DBpedia (the English version): 496M
      - GeoNames (all geographic features on Earth): 150M
      - owl:sameAs links between DBpedia and GeoNames: 471K
      - GLEI (global company register data): 3M
      - Panama Papers DB (#LinkedLeaks): 20M
      - Other datasets and ontologies: WordNet, WorldFacts, FIBO
      - News metadata (2000 articles/day enriched by NOW): 1 023M
      - Total size (2.2B explicit + 328M inferred statements): 2 522M
  17. FIBO: Financial Industry Business Ontology
      - Developed by EDMC, https://spec.edmcouncil.org/fibo/
      - We loaded FIBO Foundations and BE
        - About 35 RDF files altogether (old version)
      - Reasoning profile: OWL 2 RL
      - Loading takes 2-3 sec.
      - Number of explicit statements: 5 696
      - Number of total statements, including inferred: 15 713
        - About 10k statements materialized
  18. FIBO-PROTON Mapping
      - PROTON is an upper-level ontology
        - 500 classes, 200 properties; developed by Ontotext since 2004
        - Used in semantic annotation and LOD integration services, e.g., FactForge
        - Mapped to DBpedia, Freebase, GeoNames
      - A very basic mapping for public companies and a few related properties was loaded in 4 hours in FactForge:
          fb:business.issuer rdfs:subClassOf pext:PublicCompany.
          pext:PublicCompany rdfs:subClassOf fibo-be-corp-corp:PubliclyHeldCompany.
          ptop:Organization rdfs:subClassOf fibo-fnd-org-fm:FormalOrganization.
          dbp-prop:industry rdfs:subPropertyOf pext:industryOf.
          pext:industryOf rdfs:subPropertyOf fibo-fnd-rel-rel:isClassifiedBy.
          dbp-ont:subsidiary rdfs:subPropertyOf ptop:controls.
          ptop:controls rdfs:subPropertyOf fibo-fnd-rel-rel:controls.
  20. Rule-Based Reasoning
      - Description Logic (DL) doesn’t scale
        - Satisfiability checking is not tractable
        - Complexity grows exponentially with size
      - Rule-based inference engine
        - R-Entailment rules, PROLOG-style, as defined in [1]
      - Sound and complete in PSPACE
        - Under some constraints: do not introduce blank nodes, bound size of the rule bodies, ground RDF graph [1]
      [Diagram: complexity versus expressivity for DL, Rules/LP, OWL Full, OWL DL, OWL Lite, RDFS, SWRL, Datalog, OWL 2 QL, OWL 2 RL and OWL Horst, marking the expressivity supported by GraphDB]
      [1] Herman J. ter Horst, Combining RDF and Part of OWL with Rules: Semantics, Decidability, Complexity. International Semantic Web Conference 2005.
      More at: http://graphdb.ontotext.com/documentation/standard/reasoning.html
  22. Forward-Chaining and Materialization
      - All possible inferences are made upon update and are stored
        - The inferred statements are stored and indexed alongside the explicit ones
        - Inferences that are no longer supported upon delete are retracted
      - Forward-chaining works, subject to conscious modeling
        - The overheads of the materialization approach are bearable
        - E.g., 2x index size and 2x slower loading and updates
        - Marginal (if any) slowdown of queries
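To make the trade-off on slide 22 concrete, here is a minimal materialization sketch in Python. It is not GraphDB code: it applies only two RDFS rules (subclass transitivity and type inheritance), and the class hierarchy Woman, Person, Agent is an assumption taken from the slide 8 diagram.

```python
SUB = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(explicit):
    """Forward-chain two RDFS rules to a fixpoint and return the full closure.

    (s subClassOf c), (c subClassOf d)  =>  (s subClassOf d)
    (s type c),       (c subClassOf d)  =>  (s type d)
    """
    closure = set(explicit)
    changed = True
    while changed:
        changed = False
        sub_class = [(s, o) for s, p, o in closure if p == SUB]
        for s, p, o in list(closure):
            if p not in (SUB, TYPE):
                continue
            for c, d in sub_class:
                if o == c and (s, p, d) not in closure:
                    closure.add((s, p, d))
                    changed = True
    return closure


explicit = {
    ("ptop:Woman", SUB, "ptop:Person"),
    ("ptop:Person", SUB, "ptop:Agent"),
    ("myData:Maria", TYPE, "ptop:Woman"),
}
closure = materialize(explicit)
inferred = closure - explicit

# The cost is paid at load time (a larger index); the query is a plain lookup.
is_agent = ("myData:Maria", TYPE, "ptop:Agent") in closure
print(len(explicit), len(inferred), is_agent)
```

In this toy case the index doubles in size, which is in line with the overhead the slide calls bearable.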
  23. Query-time Reasoning and Backward-Chaining
      - Perform reasoning at query time
        - No overhead upon data loading and updates
        - Two basic approaches: backward-chaining and query rewriting
      - Backward-chaining slows down query evaluation dramatically
        - As with PROLOG unification, the engine “dives” recursively in order to exhaust all alternative ways to find bindings for each separate triple pattern in the query
        - There is no way to guess the cardinality of the results for each triple pattern before the actual evaluation
        - This makes query plan optimization impossible and ruins query performance
  24. Query Rewriting
      - Each pattern in the query is rewritten as a disjunction of several alternatives, based on reasoning on the schema/ontology/TBox
        - The query pattern <?a rdf:type ptop:Person> will be expanded to something like:
            <?a rdf:type ptop:Person>
            OR (<?p rdfs:range ptop:Person> AND <?b ?p ?a>)
            OR (<?a rdf:type ?c> AND <?c rdfs:subClassOf ptop:Person>)
            …
      - Executing tens of combinations of variants is slow
        - Imagine a query with two patterns: the first one expands into 5 variants and the second into 6 variants. The engine will have to evaluate 30 alternative combinations
        - Think of implementing the semantics of owl:sameAs via query rewriting
      - Query rewriting also delivers incomplete results
        - Recursion is not possible with SPARQL query rewriting
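The count of 30 on slide 24 is the size of a Cartesian product: every rewritten query picks one variant per pattern. A small Python sketch, with made-up placeholder names standing in for the actual rewritings:

```python
from itertools import product
from math import prod

# Hypothetical placeholders: each triple pattern expands into a list of variants.
pattern_variants = [
    [f"pattern1_variant{i}" for i in range(5)],
    [f"pattern2_variant{i}" for i in range(6)],
]

# One rewritten query = one choice of variant for every pattern.
combinations = list(product(*pattern_variants))
print(len(combinations))

# The count grows multiplicatively with every extra pattern.
assert len(combinations) == prod(len(v) for v in pattern_variants)
```

Adding a third pattern with, say, 4 variants would already give 120 combinations, which is why the approach degrades quickly on realistic queries.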
  25. Presentation Outline
      - Reasoning Introduction: Benefits and Pitfalls
      - Reasoning Use Cases and Demos
      - RDFS and OWL 2 Profiles
      - Reasoning Implementation Choices
      - GraphDB
      - Reasoning with GraphDB
      - Reasoning Optimizations in GraphDB
  26. GraphDB Essentials
      - Scalable RDF / SPARQL engine
        - W3C standards support
        - NEW: RDF* support, property annotations
      - Platform independent (100% Java)
      - Open source API
        - Main contributor to the RDF4J project
      - Reasoning and consistency checking
        - UNIQUE! Efficient reasoning support for big data sets across the full lifecycle of the data: load, query, updates
  27. Architecture
      - GraphDB Workbench: user friendly interface for database administration
      - GraphDB Engine: REST API for database access
      - Plugin / Connectors
  28. GraphDB Workbench
      - SPARQL editor & autocomplete
      - Schema visualization
      - Graph exploration
      - Database monitoring and administration
  29. Visual Graph
  30. GraphDB Enterprise: Resilience & Availability
      Features compared across the Free, Standard and Enterprise editions:
      - RDF 1.1 support
      - SPARQL 1.1 support
      - RDFS, OWL2 RL and QL reasoning
      - Efficient query execution
      - Workbench interface
      - Community support
      - Unlimited number of CPU cores
      - Commercial support
      - Connectors for Elasticsearch & SOLR
      - High-availability cluster
      - Managed service
  32. Reasoning in GraphDB
      - Fast forward-chaining materialization
        - Allows for efficient query evaluation on big datasets
      - Incremental for both inserts and deletes
        - Inferred closure is updated transparently upon commit of transaction
      - Sample rules:
          ENTAILMENT
            p <rdf:type> <owl:FunctionalProperty>
            x p y
            x p z
            ------------------------
            y <owl:sameAs> z

          CONSISTENCY
            x owl:sameAs y
            x owl:differentFrom y
            -------------------------------
            (no conclusion: matching premises signal an inconsistency)
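The two sample rules on slide 32 can be read as follows. This Python sketch is only an illustration of the rule semantics over tuples, not GraphDB's rule engine or its rule file syntax; the `ex:` data is hypothetical.

```python
def functional_property_rule(triples):
    """Entailment: p is functional, (x p y), (x p z)  =>  (y owl:sameAs z)."""
    functional = {s for s, p, o in triples
                  if p == "rdf:type" and o == "owl:FunctionalProperty"}
    inferred = set()
    for x, p, y in triples:
        if p not in functional:
            continue
        for x2, p2, z in triples:
            # Trivial y == z matches are skipped to keep the output readable.
            if x2 == x and p2 == p and z != y:
                inferred.add((y, "owl:sameAs", z))
    return inferred


def consistency_violations(triples):
    """Consistency: (x owl:sameAs y) together with (x owl:differentFrom y)."""
    return {(x, y) for x, p, y in triples
            if p == "owl:sameAs" and (x, "owl:differentFrom", y) in triples}


# Hypothetical data: two values for a functional property, declared different.
triples = {
    ("ex:hasMother", "rdf:type", "owl:FunctionalProperty"),
    ("ex:Ivan", "ex:hasMother", "ex:Maria"),
    ("ex:Ivan", "ex:hasMother", "ex:Mary"),
    ("ex:Maria", "owl:differentFrom", "ex:Mary"),
}
inferred = functional_property_rule(triples)
violations = consistency_violations(triples | inferred)
print(sorted(inferred))
print(sorted(violations))
```

The entailment rule adds statements, while the consistency rule only reports a clash between what was inferred and what was asserted.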
  33. OWL 2 Reasoning
      - Built-in rule-sets for: RDFS, OWL-Horst, OWL2-RL, OWL2-QL
      - Custom rule-sets easily defined
        - Ruleset optimizer/profiler
      - Configurations with multiple rule-sets
        - E.g., one with consistency checking to be used for internal data and another one with "open-world" semantics for LOD and other external datasets
      - NEW: Proof plug-in provides inference explanation
  34. Predefined Rule-Sets
      Ruleset: Description
      - Empty: No reasoning
      - rdfs: Standard RDFS: subClassOf, subPropertyOf, domain and range of properties
      - rdfs-plus: RDFS plus symmetric, transitive and inverse properties
      - owl-horst (pD*): sameAs, equivalentClass, equivalentProperty, SymmetricProperty, TransitiveProperty, inverseOf, FunctionalProperty, InverseFunctionalProperty. Partial support for: intersectionOf, someValuesFrom, hasValue, allValuesFrom
      - owl-max: See the spec http://graphdb.ontotext.com/documentation/standard/reasoning.html
      - owl-rl (DL-LiteR): AsymmetricProperty, IrreflexiveProperty, propertyChainAxiom, AllDisjointProperties, hasKey, unionOf, complementOf, oneOf, differentFrom, AllDisjointClasses and all the property cardinality primitives. Adds more complete support for intersectionOf, someValuesFrom, hasValue, allValuesFrom
      - owl-ql: Partial compliance. See the spec https://www.w3.org/TR/owl2-profiles
  35. Optimized Rule-Sets
      - These versions exclude some RDFS reasoning rules that are not useful for most applications but add substantial reasoning overheads
      - “Optimized” ruleset versions suppress this rule:
          Id: rdf1_rdfs4a_4b
            x a y
            -------------------------------
            x <rdf:type> <rdfs:Resource>
            a <rdf:type> <rdfs:Resource>
            y <rdf:type> <rdfs:Resource>
  37. Efficient Retraction of Inferred Facts
      - Materialization causes trouble upon delete
        - It is not trivial to figure out which inferred statements are no longer supported
      - Deletion without recomputing the inference closure is needed
        - Without it, forward-chaining is not feasible for dynamic environments
      - GraphDB retracts statements via a unique algorithm
        - Forward-chaining to find potentially affected inferences
        - Backward-chaining to test which inferences are still supported
        - No truth maintenance information overheads
        - Fast: the same order of magnitude as materialization upon insert
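The slide does not spell out GraphDB's retraction algorithm, so the sketch below uses the textbook delete-and-rederive idea as a stand-in that follows the same two phases: a forward pass that over-estimates what may be affected, then a backward check of what is still supported. It is written in Python over tuples, covers only two RDFS rules, and all function names are hypothetical.

```python
SUB, TYPE = "rdfs:subClassOf", "rdf:type"


def consequences(t, facts):
    """Triples derivable in one rule step with t as one of the two premises."""
    s, p, o = t
    out = set()
    for s2, p2, o2 in facts:
        if p in (TYPE, SUB) and p2 == SUB and s2 == o:
            out.add((s, p, o2))    # (s p o), (o subClassOf o2) => (s p o2)
        if p == SUB and p2 in (TYPE, SUB) and o2 == s:
            out.add((s2, p2, o))   # (s2 p2 s), (s subClassOf o) => (s2 p2 o)
    return out


def closure(explicit):
    """Forward-chain to a fixpoint, expanding only newly added triples."""
    facts, frontier = set(explicit), set(explicit)
    while frontier:
        new = set()
        for t in frontier:
            new |= consequences(t, facts)
        frontier = new - facts
        facts |= frontier
    return facts


def supported(t, facts):
    """Backward check: is t the conclusion of a rule whose premises are in facts?"""
    s, p, o = t
    if p not in (TYPE, SUB):
        return False
    return any((s, p, m) in facts for m, p2, o2 in facts if p2 == SUB and o2 == o)


def retract(explicit, facts, deleted):
    """Remove one explicit triple and repair the closure without a full recompute."""
    explicit = explicit - {deleted}
    suspects, frontier = {deleted}, {deleted}
    while frontier:  # forward pass: everything that may depend on `deleted`
        new = set()
        for t in frontier:
            new |= consequences(t, facts)
        frontier = new - suspects - explicit
        suspects |= frontier
    remaining = facts - suspects
    changed = True
    while changed:  # backward pass: restore suspects that are still supported
        changed = False
        for t in list(suspects):
            if supported(t, remaining):
                remaining.add(t)
                suspects.discard(t)
                changed = True
    return explicit, remaining


explicit = {
    ("ptop:Woman", SUB, "ptop:Person"),
    ("ptop:Person", SUB, "ptop:Agent"),
    ("myData:Maria", TYPE, "ptop:Woman"),
    ("myData:Maria", TYPE, "ptop:Person"),
}
facts = closure(explicit)
new_explicit, new_facts = retract(explicit, facts, ("myData:Maria", TYPE, "ptop:Woman"))
print(("myData:Maria", TYPE, "ptop:Agent") in new_facts)
```

In this example the statement that Maria is an Agent survives the delete, because it is still supported by the explicit statement that Maria is a Person.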
  38. The Honey of owl:sameAs Equivalence
      - owl:sameAs links the datasets in the Linked Open Data cloud
      - owl:sameAs declares that two different URIs denote one and the same object
        - Aligns different identifiers of the same real-world entity used in different data sources
      - For example, let’s say that we have three different URIs for Bulgaria and two for Sofia (its capital):
          dbpedia:Sofia owl:sameAs geonames:727011
          geonames:727011 geo-ont:parentFeature geonames:732800
          dbpedia:Bulgaria owl:sameAs geonames:732800
          dbpedia:Bulgaria owl:sameAs opencyc-en:Bulgaria
  39. The Sting of owl:sameAs Equivalence
      - According to the standard semantics of owl:sameAs:
        - It is a transitive and symmetric relationship
        - Statements asserted using one of the equivalent URIs should be inferred to appear with all equivalent URIs placed in the same position
        - Thus the 4 statements in the example lead to 10 inferred statements:
            geonames:727011 owl:sameAs dbpedia:Sofia
            geonames:732800 owl:sameAs dbpedia:Bulgaria
            geonames:732800 owl:sameAs opencyc-en:Bulgaria
            opencyc-en:Bulgaria owl:sameAs dbpedia:Bulgaria
            opencyc-en:Bulgaria owl:sameAs geonames:732800
            dbpedia:Sofia geo-ont:parentFeature geonames:732800
            dbpedia:Sofia geo-ont:parentFeature opencyc-en:Bulgaria
            dbpedia:Sofia geo-ont:parentFeature dbpedia:Bulgaria
            geonames:727011 geo-ont:parentFeature opencyc-en:Bulgaria
            geonames:727011 geo-ont:parentFeature dbpedia:Bulgaria
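The count of 10 inferred statements on slide 39 can be reproduced with a naive closure. The Python sketch below is an illustration only: it applies symmetry, transitivity and subject/object replacement, leaves out reflexive sameAs statements (as the slide's list does) and does not replace predicates, since the example has no equivalent predicates.

```python
SAME_AS = "owl:sameAs"

explicit = {
    ("dbpedia:Sofia", SAME_AS, "geonames:727011"),
    ("geonames:727011", "geo-ont:parentFeature", "geonames:732800"),
    ("dbpedia:Bulgaria", SAME_AS, "geonames:732800"),
    ("dbpedia:Bulgaria", SAME_AS, "opencyc-en:Bulgaria"),
}


def same_as_closure(triples):
    """Naively expand owl:sameAs semantics until no new statement appears."""
    facts = set(triples)
    while True:
        new = set()
        same = [(s, o) for s, p, o in facts if p == SAME_AS]
        for a, b in same:
            new.add((b, SAME_AS, a))              # symmetry
            for c, d in same:
                if b == c and a != d:
                    new.add((a, SAME_AS, d))      # transitivity, no reflexive pairs
            for s, p, o in facts:
                if p == SAME_AS:
                    continue
                if s == a:
                    new.add((b, p, o))            # equivalent URI in subject position
                if o == a:
                    new.add((s, p, b))            # equivalent URI in object position
        if new <= facts:
            return facts
        facts |= new


inferred = same_as_closure(explicit) - explicit
for triple in sorted(inferred):
    print(*triple)
print(len(inferred))
```

Five of the inferred statements are sameAs links and five are copies of the single parentFeature statement, so most of the growth comes from replicating ordinary data.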
  40. The Honey and the Sting of owl:sameAs
      [Diagram, repeated on slide 41: nodes E11, E12 (geonames:727011, dbpedia:Sofia) and E21, E22, E23 (geonames:732800, dbpedia:Bulgaria, opencyc-en:Bulgaria) connected by geo-ont:parentFeature]
  42. owl:sameAs Optimization
      - GraphDB features an optimization of owl:sameAs
        - It can use a single master node in its indices to represent a class of sameAs-equivalent URIs
      - Avoids inflating the indices with multiple equivalent statements
        - Imagine a statement that has 5 sameAs-equivalents of its subject, 2 of its predicate and 3 of its object. Such a statement would have 30 replicas in the indices after forward-chaining if such an optimization is not used
      - Helps present compact query results
        - The owl:sameAs equivalence can result in multiplication of the bindings of the variables in the process of query evaluation with both forward- and backward-chaining. This leads to expansion of the result set with rows that differ only by referring to different URIs, which are sameAs-equivalent
        - Optionally, query results can be expanded, as if there is no optimization
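One common way to keep a single representative per equivalence class is a union-find structure. The Python sketch below illustrates the master-node idea from slide 42 under that assumption; the slide does not say how GraphDB implements it, and the class and function names are hypothetical. Predicate equivalents are left out for brevity.

```python
class SameAsClasses:
    """Union-find over URIs: each sameAs-equivalence class has one master node."""

    def __init__(self):
        self.parent = {}

    def find(self, uri):
        """Return the master node of the class that uri belongs to."""
        self.parent.setdefault(uri, uri)
        while self.parent[uri] != uri:
            self.parent[uri] = self.parent[self.parent[uri]]  # path halving
            uri = self.parent[uri]
        return uri

    def union(self, a, b):
        """Record (a owl:sameAs b) by merging the two classes."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def members(self, uri):
        """All URIs in the same class as uri."""
        root = self.find(uri)
        return sorted(u for u in list(self.parent) if self.find(u) == root)


classes = SameAsClasses()
classes.union("dbpedia:Sofia", "geonames:727011")
classes.union("dbpedia:Bulgaria", "geonames:732800")
classes.union("dbpedia:Bulgaria", "opencyc-en:Bulgaria")

# The index stores each statement once, using master nodes only.
index = {(classes.find("geonames:727011"),
          "geo-ont:parentFeature",
          classes.find("geonames:732800"))}


def expand(triple):
    """Optional expansion at query time, as if there were no optimization."""
    s, p, o = triple
    return {(s2, p, o2) for s2 in classes.members(s) for o2 in classes.members(o)}


expanded = set().union(*(expand(t) for t in index))
print(len(index), len(expanded))

# The slide's arithmetic: 5 subject, 2 predicate and 3 object equivalents.
replicas_without_optimization = 5 * 2 * 3
```

Storing one statement and expanding on demand keeps the index compact while still allowing the full, unoptimized result set to be produced when asked for.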
  43. Questions?
      Experience the technology with our demonstrators:
      - FactForge: Knowledge graph of linked open data and news about People and Organizations, http://factforge.net
      - RANK: News popularity ranking for companies, http://rank.ontotext.com
      - NOW: Semantic News Portal, http://now.ontotext.com
