A HYBRID FRAMEWORK FOR QUERYING
     LINKED DATA DYNAMICALLY

          JÜRGEN UMBRICH

               PhD Viva
           November 26th, 2012
Classical Query Approach
  MATERIALISED STORE (centralised data warehousing)
    fast query times




 26/11/2012     PhD Viva, Jürgen Umbrich             Slide 1 of 39
MOTIVATING EXAMPLE

GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.




Research Questions

How dynamic is linked data and what is the impact for store based
query processing?


Can the performance of live querying be improved by applying
lightweight reasoning?

 How effective are hash-based data summaries for source
 selection in live query processing?

How can live and store based processing be combined to obtain a
trade-off between fast and fresh results?

How dynamic is linked data and what is the impact
       for store based query processing?




                                                [LDOW 2010]
                                              [DESWEB 2010]
                                                 [COLD 2011]
DYNAMIC LINKED DATA OBSERVATORY

 Allows studying and assessing the dynamics of Linked Data


               DataHub and BTC
               95K static URIs
               95K dynamic (2 hops)



               once a week
               started in March 2012

                                                                        [http://world.yale.edu]

                                                        weekly dumps are freely available at
                                                                 http://swse.deri.org/dyldo

DYNAMICS OF LINKED DATA

How fast does a source change? (observed over 15 weeks)

  no changes    58%
  every week    17%
  only once      8%
  rest          17%
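
The four categories above can be derived from the weekly snapshots of a source. The following is a minimal sketch, assuming each snapshot is available as a set of triples; the function name `change_category` and the rule that any difference between consecutive snapshots counts as one change are assumptions made for illustration, not the observatory's actual method.

```python
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
Snapshot = FrozenSet[Triple]


def change_category(snapshots: List[Snapshot]) -> str:
    """Classify one source by how often consecutive weekly snapshots differ.

    Assumption: two snapshots "differ" if their triple sets are not equal.
    """
    # Count the week-to-week transitions in which the content changed.
    changes = sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)
    intervals = len(snapshots) - 1
    if changes == 0:
        return "no changes"
    if changes == 1:
        return "only once"
    if changes == intervals:
        return "every week"
    return "rest"


# Hypothetical example: a source whose single value flips every week.
v1 = frozenset({("ex:s", "ex:p", "1")})
v2 = frozenset({("ex:s", "ex:p", "2")})
print(change_category([v1, v2, v1, v2]))
```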



DYNAMICS OF LINKED DATA

Can we observe different types of changes? (observed over 15 weeks)

  only value updates            24%
  value updates & adds/dels     23%
  only adds                     20%
  adds/dels                     19%
  others                        14%
IMPLICATIONS FOR CENTRALISED QUERYING




  How coherent are the results?




COHERENCE OF QUERIES


  [Figure: two pie charts of query coherence with the segments
   "complete coherent", "complete incoherent" and "partially coherent"]

    LOD cache:          1%, 43%, 56%
    SPARQL endpoints:   15%, 35%, 50%



PROBLEM WITH CLASSICAL QUERY APPROACH
  MATERIALISED STORE (centralised data warehousing)
    outdated results
    limited coverage
    fast query times




LTBQE: LINK TRAVERSAL BASED QUERY EXECUTION
  Exploiting Linked Data principles:
    dereferencing URIs
    following links

  Query:
    SELECT ?f ?img
    WHERE {
      oh:olaf foaf:knows ?f .
      ?f foaf:depiction ?img .
    }

  Result:
    ?f          ?img
    cb:chris    http://..

  [Figure: RDF graphs of the documents ohDoc: and cbDoc:.
   ohDoc: nodes oh:olaf, "Olaf Hartig", dblpA:Olaf_Hartig, http://..., cb:chris, cbDoc:
          with edges foaf:name, owl:sameAs, foaf:img, foaf:knows, rdfs:seeAlso.
   cbDoc: nodes cb:chris, http://..., dblpA:Christian_Bizer, "Chris Bizer"
          with edges foaf:depiction, owl:sameAs, foaf:name.]
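
The two principles above (dereference a URI, then follow the links found) can be sketched in a few lines. This is a minimal illustration, not the engine evaluated in the thesis: the Web is replaced by two in-memory dictionaries, the names `DEREF`, `DOCS`, `match` and `ltbqe` are hypothetical, and the image URL is a placeholder because the slide elides it.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical stand-in for the Web: which document a URI dereferences to,
# and which triples that document contains.
DEREF: Dict[str, str] = {"oh:olaf": "ohDoc:", "cb:chris": "cbDoc:"}
DOCS: Dict[str, List[Triple]] = {
    "ohDoc:": [
        ("oh:olaf", "foaf:name", "Olaf Hartig"),
        ("oh:olaf", "foaf:knows", "cb:chris"),
    ],
    "cbDoc:": [
        ("cb:chris", "foaf:name", "Chris Bizer"),
        # Placeholder URL; the real value is not given on the slide.
        ("cb:chris", "foaf:depiction", "http://example.org/chris.jpg"),
    ],
}


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def ltbqe(patterns: List[Triple]) -> List[Binding]:
    """Evaluate patterns left to right, dereferencing URIs as they become bound."""
    local: List[Triple] = []   # triples retrieved so far
    fetched: Set[str] = set()  # documents already looked up

    def lookup(uri: str) -> None:
        doc = DEREF.get(uri)
        if doc is not None and doc not in fetched:
            fetched.add(doc)
            local.extend(DOCS[doc])

    solutions: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for binding in solutions:
            # Substitute the variables that earlier patterns already bound.
            bound = tuple(binding.get(term, term) for term in pattern)
            # Dereference subject and object if they are no longer variables.
            for term in (bound[0], bound[2]):
                if not term.startswith("?"):
                    lookup(term)
            for triple in local:
                result = match(bound, triple, binding)
                if result is not None:
                    extended.append(result)
        solutions = extended
    return solutions


query = [("oh:olaf", "foaf:knows", "?f"), ("?f", "foaf:depiction", "?img")]
print(ltbqe(query))
```

The second pattern can only be answered because binding ?f to cb:chris triggers a lookup of cbDoc:, which is the link-following step of the approach.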

PERFORMANCE FACTORS OF LTBQE

  query time is influenced by
    source selection
    number of sequential lookups


  result recall is influenced by
    dereferenceability
    execution order
    connectivity


Can the performance of live querying be improved
       by applying lightweight reasoning?




                                                    [RR 2012]
                                             [SWJ submission]
OUR CONTRIBUTION TO LTBQE

    Improved recall with reasoning extensions
     to make more raw data available
      subset of RDFS
      explicit owl:sameAs




HOW REASONING CAN HELP LTBQE
  Query:
    SELECT ?label WHERE {
      oh:olaf foaf:knows ?f .
      ?f rdfs:label ?label .
    }

  Result:
    ?label
    Christian Bizer
    Chris Bizer

  Schema: foaf:name rdfs:subPropertyOf rdfs:label

  [Figure: RDF graphs of the documents ohDoc:, cbDoc: and dblpADoc:Christian_Bizer.
   ohDoc: and cbDoc: as on the LTBQE slide.
   dblpADoc:Christian_Bizer: nodes dblpA:Christian_Bizer, "Christian Bizer", dblpP:Hartig09
          with edges rdfs:label, foaf:maker.]
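
A minimal sketch of how the two extensions produce both labels, assuming an in-memory stand-in for the three documents. The names `DEREF`, `DOCS`, `SUB_PROPERTY_OF`, `same_as_closure`, `objects` and `friend_labels` are hypothetical, the data is a simplified subset of the figure, and only one level of rdfs:subPropertyOf is considered.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical, simplified stand-in for the documents in the figure.
DEREF: Dict[str, str] = {
    "oh:olaf": "ohDoc:",
    "cb:chris": "cbDoc:",
    "dblpA:Christian_Bizer": "dblpADoc:Christian_Bizer",
}
DOCS: Dict[str, List[Triple]] = {
    "ohDoc:": [("oh:olaf", "foaf:knows", "cb:chris")],
    "cbDoc:": [
        ("cb:chris", "foaf:name", "Chris Bizer"),
        ("cb:chris", "owl:sameAs", "dblpA:Christian_Bizer"),
    ],
    "dblpADoc:Christian_Bizer": [
        ("dblpA:Christian_Bizer", "rdfs:label", "Christian Bizer"),
    ],
}
# Schema knowledge, assumed to be known to the engine in advance.
SUB_PROPERTY_OF: Dict[str, str] = {"foaf:name": "rdfs:label"}


def dereference(uri: str) -> List[Triple]:
    """Return the triples of the document the URI resolves to (empty if none)."""
    return DOCS.get(DEREF.get(uri, ""), [])


def same_as_closure(uri: str) -> Set[str]:
    """The URI plus every alias reachable through explicit owl:sameAs links."""
    seen: Set[str] = set()
    todo = [uri]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        for s, p, o in dereference(current):
            if s == current and p == "owl:sameAs":
                todo.append(o)
    return seen


def objects(subject: str, predicate: str, reasoning: bool) -> Set[str]:
    """Objects of (subject, predicate, ?o), optionally with the two extensions."""
    subjects = same_as_closure(subject) if reasoning else {subject}
    found: Set[str] = set()
    for subj in subjects:
        for s, p, o in dereference(subj):
            if s != subj:
                continue
            if p == predicate or (reasoning and SUB_PROPERTY_OF.get(p) == predicate):
                found.add(o)
    return found


def friend_labels(person: str, reasoning: bool) -> Set[str]:
    """Answer: person foaf:knows ?f . ?f rdfs:label ?label ."""
    labels: Set[str] = set()
    for friend in objects(person, "foaf:knows", reasoning):
        labels |= objects(friend, "rdfs:label", reasoning)
    return labels


print(friend_labels("oh:olaf", reasoning=False))
print(friend_labels("oh:olaf", reasoning=True))
```

Without the extensions the query returns nothing, because cbDoc: contains no rdfs:label triple. Following owl:sameAs makes the DBLP document reachable, and the subproperty rule turns the foaf:name value into a label.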
LTBQE ANALYSIS


 Investigate how practical LTBQE is and how much more raw
 data and results can be made available with our extensions.



   How many URIs can be dereferenced?
   How much additional data with our extensions?
   How do our extensions perform in practice?




LTBQE ANALYSIS: EXPERIMENTS

How many URIs can be dereferenced?


   BTC 2011, 25.4m URIs

     position                 %URIs    available data
     <URI> ?p ?o .            85%      95%
     ?s ?p <URI> .            46%      44%
     ?s <URI> ?o .            1%       0.00…%
     ?s rdf:type <URI> .      10%      0.2%
     <URI>                    44%      51%

   Schema data: improved query time by around 50%
                by reducing the number of lookups
LTBQE ANALYSIS: EXPERIMENTS

How much additional data with our extensions?



   BTC 2011, 18.65m URIs

     position                     %URIs    available data
     <URI> rdfs:seeAlso ?o .      2%       1.006x
     <URI> owl:sameAs ?o .        16%      2.5x
     RDFS reasoning*              81%      1.78x

   * rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range;
     authoritative TBox [Bonatti] extracted from BTC 2011
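
The four RDFS properties listed in the footnote correspond to four simple inference rules. The sketch below applies them until nothing new is derived; it is an illustration only. The function name `rdfs_closure` and the small TBox are chosen for the example, authoritativeness checks are omitted, and the range rule does not test whether the object is a literal.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def rdfs_closure(data: Set[Triple], tbox: Set[Triple]) -> Set[Triple]:
    """Apply the subPropertyOf, domain, range and subClassOf rules to a fixpoint."""
    inferred = set(data)
    changed = True
    while changed:
        changed = False
        new: Set[Triple] = set()
        for s, p, o in inferred:
            for ts, tp, to in tbox:
                if tp == "rdfs:subPropertyOf" and p == ts:
                    new.add((s, to, o))            # s p o  =>  s superProperty o
                elif tp == "rdfs:domain" and p == ts:
                    new.add((s, "rdf:type", to))   # s p o  =>  s a Domain
                elif tp == "rdfs:range" and p == ts:
                    new.add((o, "rdf:type", to))   # s p o  =>  o a Range
                elif tp == "rdfs:subClassOf" and p == "rdf:type" and o == ts:
                    new.add((s, "rdf:type", to))   # s a C  =>  s a SuperClass
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


# Illustrative TBox and data; only the first axiom appears on the slides.
tbox = {
    ("foaf:name", "rdfs:subPropertyOf", "rdfs:label"),
    ("foaf:knows", "rdfs:domain", "foaf:Person"),
    ("foaf:knows", "rdfs:range", "foaf:Person"),
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
}
data = {
    ("oh:olaf", "foaf:knows", "cb:chris"),
    ("cb:chris", "foaf:name", "Chris Bizer"),
}
for triple in sorted(rdfs_closure(data, tbox)):
    print(triple)
```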

QUERY GENERATION

How do our extensions perform in practice?

Existing benchmarks target either a single domain or provide
only a few queries.


QWalk: random walk based query generation.
  BTC 2011
  1100 queries
  100 each for 11 “typical” shapes




THROUGHPUT: AVERAGE RESULT/TIME RATIO
                            worst                                   best

                     LTBQE      Core-          seeAlso            sameAs   RDFS    Comb
      entity-s       1          1.68           1.67               2.15     1.29    1.53
      entity-o       3.97       6.48           6.16               5.7      5.37    4.33
      entity-so      2.02       2.82           2.66               3.71     3.73    4.8
      star-3-0       0.11       0.16           0.15               0.15     0.24    0.2
      star-2-1       0.58       1.12           1                  1.04     2.14    1.75
      star-1-2       0.17       1.6            1.35               1.6      70.97   58.85
      star-0-3       0.18       0.35           0.33               0.94     0.24    0.68
      s-path-2       0.44       0.72           0.68               0.7      0.83    0.78
      s-path-3       1.76       2.45           2.56               2.46     2.43    2.1
      o-path-2       1.38       8.39           7.76               10.55    6.36    6.89
      o-path-3       0.95       5.7            5.84               6.08     5.04    4.68

                  Overall average query time of ~12 seconds.
LIMITATION OF LTBQE: JOIN OVER LITERALS
  Join over a literal:
    SELECT ?p2
    WHERE {
      oh:olaf foaf:name ?name .
      ?p2 foaf:name ?name .
    }

  [Figure: ohDoc: and dblpADoc:Olaf_Hartig both contain a foaf:name edge
   to the literal "Olaf Hartig" (from oh:olaf and dblpA:Olaf_Hartig respectively),
   so the join runs over a literal.
   LTBQE: ?
   materialised store: outdated results]


ALTERNATIVE: SOURCE SELECTION
  [Figure: the documents ohDoc: and dblpADoc:Olaf_Hartig, a SOURCE INDEX and a QUERY ENGINE]




How effective are hash-based data summaries for
    source selection in live query processing?




                                               [WWW 2010]
                                              [WWWJ 2011]

APPROXIMATE DATA SUMMARIES
   Combined description of
      schema and
      instance data

   Use approximation to reduce index size
    (incurs false positives)

   Hash-based approach
      Space complexity: O(buckets * #sources)

   QTree: Combination of histograms and R-tree inheriting the
       benefit of both data structures
         optimal for sparse data


HASH-BASED DATA SUMMARIES
 Input:  triple + source
 Hash:   triple
 Insert: 3D point and save source information

 Example
   Data:   oh:olaf foaf:name “Olaf Hartig” . ohDoc:
   Hash:   [ 24 , 5 , 2 ] , ohDoc:
   Insert: ([ 24 , 5 , 2 ] , ohDoc: )

 [Figure: coordinate space with axes s, p and o (ticks 1, 10, 20, 30)
  holding the point inserted for ohDoc:]
EFFICIENT SOURCE SELECTION
 Summarise data with buckets and store cardinality and source
  information
 Query: Lookup
          { oh:olaf ?p ?o }    hash    ( 24 , ? , ? )

 [Figure: the lookup in the s/o plane (axis ticks 1, 10, 20, 30),
  shown for an equi-width histogram and for a QTree; the matching region points to ohDoc:]
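
The equi-width histogram variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the QTree and not the thesis implementation: the class `HistogramSummary`, the coordinate range of 30, the bucket width of 10 and the use of CRC32 as hash function are all choices made for the example, so the concrete coordinates differ from the [ 24 , 5 , 2 ] shown on the slide.

```python
import zlib
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

SPACE = 30  # each coordinate lies in 1..SPACE (assumed, mirrors the axis range)
WIDTH = 10  # bucket width of the equi-width histogram (assumed)


def h(term: str) -> int:
    """Deterministic hash of an RDF term into 1..SPACE (CRC32 is an arbitrary choice)."""
    return zlib.crc32(term.encode("utf-8")) % SPACE + 1


def bucket(coord: int) -> int:
    """Index of the equi-width bucket that contains the coordinate."""
    return (coord - 1) // WIDTH


class HistogramSummary:
    """Buckets over the hashed (s, p, o) space with per-source cardinalities."""

    def __init__(self) -> None:
        # (bucket_s, bucket_p, bucket_o) -> {source: number of triples}
        self.buckets: Dict[Tuple[int, int, int], Dict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def insert(self, s: str, p: str, o: str, source: str) -> None:
        key = (bucket(h(s)), bucket(h(p)), bucket(h(o)))
        self.buckets[key][source] += 1

    def sources(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Set[str]:
        """Sources that may hold matching triples; None stands for a variable."""
        want = tuple(None if t is None else bucket(h(t)) for t in (s, p, o))
        found: Set[str] = set()
        for key, per_source in self.buckets.items():
            if all(w is None or w == k for w, k in zip(want, key)):
                found.update(per_source)
        return found


summary = HistogramSummary()
summary.insert("oh:olaf", "foaf:name", "Olaf Hartig", "ohDoc:")
summary.insert("cb:chris", "foaf:name", "Chris Bizer", "cbDoc:")
# Lookup { oh:olaf ?p ?o }: ohDoc: is always returned; cbDoc: may appear
# as a false positive if both subjects hash into the same bucket.
print(summary.sources(s="oh:olaf"))
```

A relevant source is never missed, because a triple and a matching lookup hash to the same bucket. The price of the approximation is false positives: every source in a matching bucket is returned, whether or not it holds a matching triple.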
EVALUATION
 Number of estimated sources as the crucial performance factor
 [Figure: number of sources (log scale) estimated by other approaches and by the QTree,
  compared with the number of actually relevant sources]




TRADE-OFF: FRESH OR FAST
    ACCESSING DATA AT RUNTIME
      fresh results
      slow query times

    MATERIALISED STORE
      fast query times
      outdated results
      limited coverage




How can live and store query processing be
  combined to obtain a trade-off between fast and
                  fresh results?



                                               [DESWEB 2012]
                                                 [EKAW 2012]
                                                  [ISWC 2012]
HYBRID SPARQL EXECUTION IDEA
GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.




    dynamic: fresh results
    static:  fast query times




HYBRID SPARQL: ARCHITECTURE

 [Figure: architecture with four components (live query interface, index query interface,
  coherence monitor, query planner) and two arrows labelled "update"]




COHERENCE MONITOR

 [Figure: architecture diagram (live query interface, index query interface,
  coherence monitor, query planner)]


 computes and stores statistics about the freshness and coverage
 of cache for individual query patterns

  store independent: can be applied to any store; no indication
  of specific coverage or update rates
  store specific: more sensitive to the update patterns and
  coverage of the store
COHERENCE OF PREDICATES

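 [Figure: pie charts of predicate coherence for the LOD cache and SPARQL endpoints,
  with the segments "complete coherent", "complete incoherent" and "partially coherent";
  predicates named: sioc:account_of, swivt:creationDate, foaf:knows]

   values shown: 10%, 23%, 67% and 30%, 46%, 24%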
COHERENCE ESTIMATES




QUERY PLANNER

 [Figure: architecture diagram (live query interface, index query interface,
  coherence monitor, query planner)]




   finding best query plan
   identifying dynamic/static patterns
   delegation and merging



QUERY PLANNING
   Pattern    Selectivity    Coherence
   tp1        0.98           0.86
   tp2        0.43           0.32
   tp3        0.21           0.00
   tp4        0.15           0.91

   [Figure: two join trees, read bottom-up.
    selectivity-based: tp1, tp2, tp3, tp4
    coherence-based:   tp4, tp1, tp2, tp3]
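
Both join orders in the figure follow from sorting the patterns by the respective score in descending order. The sketch below reproduces them from the table values and adds a single split at a threshold of 0.5, the example value named in the experiment methodology. Treating patterns at or above the threshold as static (answered from the store) is an assumption of this sketch, as are the function names `order` and `split`.

```python
from typing import Dict, List, Tuple

# Values taken from the table above.
SELECTIVITY: Dict[str, float] = {"tp1": 0.98, "tp2": 0.43, "tp3": 0.21, "tp4": 0.15}
COHERENCE: Dict[str, float] = {"tp1": 0.86, "tp2": 0.32, "tp3": 0.00, "tp4": 0.91}


def order(scores: Dict[str, float]) -> List[str]:
    """Order triple patterns by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


def split(
    ordered: List[str], coherence: Dict[str, float], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split a plan into a static part (store) and a dynamic part (live).

    Assumption: a pattern whose coherence reaches the threshold is considered
    static; all other patterns are delegated to live querying.
    """
    static = [tp for tp in ordered if coherence[tp] >= threshold]
    dynamic = [tp for tp in ordered if coherence[tp] < threshold]
    return static, dynamic


print(order(SELECTIVITY))                     # selectivity-based order
print(order(COHERENCE))                       # coherence-based order
print(split(order(COHERENCE), COHERENCE))     # static part first, then dynamic part
```

With the coherence-based order, the split falls between tp1 and tp2, so tp4 and tp1 form the static part and tp2 and tp3 the dynamic part.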
REAL WORLD EXPERIMENTS
  Evaluation of different hybrid query plan strategies

  Methodology
   QWalk: Various types of SPARQL SELECT queries
     star-shaped, path-shaped, mixed
     different numbers of patterns
     at least one static and dynamic pattern
   Variable counting ordering
   Single split with threshold (e.g. 0.5)
   Static part is executed first
   Link traversal based query execution

REAL WORLD EXPERIMENTS
 [Figure: live recall (y-axis, ticks 0.3, 0.4, 0.8, 1) against speedup (x-axis, ticks 1, 2, 6, 12),
  average of 43 queries.
  Reference points: live, store
  Ordering: coh, sel
  Split: rnd., thres., fixed, opt]
CONCLUSION
 How dynamic is Linked Data and what is the impact for store based query
  processing?
   We verified that Linked Data is dynamic and that it impacts the result
     freshness and completeness of cache based query engines.
 Can the performance of live querying be improved by applying lightweight
  reasoning?
   Our source selection and reasoning optimisations improve query time and
     result recall compared to the state of the art.
 How effective are hash-based data summaries for source selection in live
  query processing?
   The QTree loosens the query restrictions of pure live querying and
     outperforms similar source selection approaches.
 How can live and cache query processing be combined to obtain a trade-off
  between fast and fresh results?
     Hybrid query execution with the knowledge of data dynamics for fast and
      fresh results.
FUTURE WORK
 Dynamic Linked Data Observatory
    Extended experiments
    Data mining to discover dynamic relations

 Hybrid Query Execution
    Develop a cost model which combines selectivity and
     coherence
    Automatically find best plan and split
    Combination of different query approaches

 SPARQL as the query language for the Web
    Navigational features





Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 

A HYBRID FRAMEWORK FOR QUERYING

  • 1. A HYBRID FRAMEWORK FOR QUERYING LINKED DATA DYNAMICALLY JÜRGEN UMBRICH PhD Viva November 26th, 2012
  • 2. Classical Query Approach MATERIALISED STORE centralised data warehousing fast query times 26/11/2012 PhD Viva, Jürgen Umbrich Slide 1 of 39 1
  • 3. MOTIVATING EXAMPLE: GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS.
  • 4. Research Questions: How dynamic is Linked Data and what is the impact on store-based query processing? Can the performance of live querying be improved by applying lightweight reasoning? How effective are hash-based data summaries for source selection in live query processing? How can live and store-based processing be combined to obtain a trade-off between fast and fresh results?
  • 5. How dynamic is Linked Data and what is the impact on store-based query processing? [LDOW 2010] [DESWEB 2010] [COLD 2011]
  • 6. DYNAMIC LINKED DATA OBSERVATORY: allows the dynamics of Linked Data to be studied and assessed. DataHub and BTC; 95K static URIs; 95K dynamic (2 hops); once a week; started in March 2012. [http://world.yale.edu] Weekly dumps are freely available at http://swse.deri.org/dyldo
  • 7. DYNAMICS OF LINKED DATA: How fast does a source change? Over 15 weeks: rest 17%, only once 8%, no changes 58%, every week 17%.
  • 8. DYNAMICS OF LINKED DATA: Can we observe different types of changes? [Pie chart, 15 weeks; segments: others, only value updates, adds/dels, value updates & adds/dels, only adds; shares shown: 14%, 24%, 19%, 20%, 23%]
  • 9. IMPLICATIONS FOR CENTRALISED QUERYING: How coherent are the results?
  • 10. COHERENCE OF QUERIES [Two pie charts, LOD cache and SPARQL endpoints; segments: complete coherent, complete incoherent, partially coherent; shares shown: 1%, 15%, 35%, 43%, 56%, 50%]
  • 11. PROBLEM WITH CLASSICAL QUERY APPROACH: MATERIALISED STORE (centralised data warehousing): fast query times, but outdated results and limited coverage
  • 12. LTBQE: LINK TRAVERSAL BASED QUERY EXECUTION Exploiting Linked Data principles: dereferencing URIs and following links. Example query: SELECT ?f ?img WHERE { oh:olaf foaf:knows ?f . ?f foaf:depiction ?img . } [Figure: RDF graphs of the documents ohDoc: and cbDoc:, connected through oh:olaf foaf:knows cb:chris; cbDoc: provides cb:chris foaf:depiction http://...]
  • 13. PERFORMANCE FACTORS OF LTBQE: query time is influenced by source selection and the number of sequential lookups; result recall is influenced by dereferenceability, execution order, and connectivity
  • 14. Can the performance of live querying be improved by applying lightweight reasoning? [RR 2012] [SWJ submission]
  • 15. OUR CONTRIBUTION TO LTBQE: improved recall with reasoning extensions that make more raw data available: a subset of RDFS and explicit owl:sameAs
  • 16. HOW REASONING CAN HELP LTBQE Example query: SELECT ?label WHERE { oh:olaf foaf:knows ?f . ?f rdfs:label ?label . } [Figure: RDF graphs of ohDoc:, cbDoc: and dblpADoc:Christian_Bizer; ?label binds to "Christian Bizer" and "Chris Bizer" using foaf:name rdfs:subPropertyOf rdfs:label and cb:chris owl:sameAs dblpA:Christian_Bizer]
  • 17. LTBQE ANALYSIS: Investigate how practical LTBQE is and how much more raw data and how many more results can be made available with our extensions. How many URIs can be dereferenced? How much additional data with our extensions? How do our extensions perform in practice?
  • 18. LTBQE ANALYSIS: EXPERIMENTS How many URIs can be dereferenced? (BTC 2011, 25.4m URIs)
        position               %URIs   available data
        <URI> ?p ?o .          85%     95%
        ?s ?p <URI> .          46%     44%
        ?s <URI> ?o .          1%      0.00…%
        ?s rdf:type <URI> .    10%     0.2%
        <URI>                  44%     51%
        Schema data. Improved query time by around 50% by reducing the number of lookups.
  • 19. LTBQE ANALYSIS: EXPERIMENTS How much additional data with our extensions? (BTC 2011, 18.65m URIs)
        position                  %URIs   available data
        <URI> rdfs:seeAlso ?o .   2%      1.006x
        <URI> owl:sameAs ?o .     16%     2.5x
        RDFS reasoning*           81%     1.78x
        * rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range; authoritative TBox [Bonatti] extracted from BTC 2011
  • 20. QUERY GENERATION: How do our extensions perform in practice? Existing benchmarks either target a single domain or provide only a few queries. QWalk: random-walk-based query generation over BTC 2011; 1100 queries, 100 each for 11 “typical” query shapes.
  • 21. THROUGHPUT: AVERAGE RESULT/TIME RATIO (cells shaded from worst to best)
        query shape   LTBQE   Core-   seeAlso   sameAs   RDFS    Comb
        entity-s      1       1.68    1.67      2.15     1.29    1.53
        entity-o      3.97    6.48    6.16      5.7      5.37    4.33
        entity-so     2.02    2.82    2.66      3.71     3.73    4.8
        star-3-0      0.11    0.16    0.15      0.15     0.24    0.2
        star-2-1      0.58    1.12    1         1.04     2.14    1.75
        star-1-2      0.17    1.6     1.35      1.6      70.97   58.85
        star-0-3      0.18    0.35    0.33      0.94     0.24    0.68
        s-path-2      0.44    0.72    0.68      0.7      0.83    0.78
        s-path-3      1.76    2.45    2.56      2.46     2.43    2.1
        o-path-2      1.38    8.39    7.76      10.55    6.36    6.89
        o-path-3      0.95    5.7     5.84      6.08     5.04    4.68
        Overall average query time of ~12 seconds.
  • 22. LIMITATION OF LTBQE: JOIN OVER LITERALS Example query: SELECT ?p2 WHERE { oh:olaf foaf:name ?name . ?p2 foaf:name ?name . } The join is over the literal "Olaf Hartig", which occurs as a foaf:name value in both ohDoc: and dblpADoc:Olaf_Hartig. Materialised store: outdated results. LTBQE: ? [Figure: RDF graphs of ohDoc:, cbDoc: and dblpADoc:Olaf_Hartig]
  • 23. ALTERNATIVE: SOURCE SELECTION [Figure: the sources ohDoc: and dblpADoc:Olaf_Hartig, a SOURCE INDEX, and the QUERY ENGINE]
  • 24. How effective are hash-based data summaries for source selection in live query processing? [WWW 2010] [WWWJ 2011]
  • 25. APPROXIMATE DATA SUMMARIES: combined description of schema and instance data; use approximation to reduce index size (incurs false positives); hash-based approach; space complexity: O(buckets * #sources); QTree: combination of histograms and R-tree inheriting the benefits of both data structures; optimal for sparse data
  • 26. HASH-BASED DATA SUMMARIES Input: triple + source. Hash: triple. Insert: 3D point and save source information. Example: Data: oh:olaf foaf:name “Olaf Hartig” . (source ohDoc:); Hash: [ 24 , 5 , 2 ] , ohDoc:; Insert: ([ 24 , 5 , 2 ] , ohDoc:) [Figure: 3D data space with axes s, p and o, ticks from 1 to 30]
  • 27. EFFICIENT SOURCE SELECTION: Summarise data with buckets and store cardinality and source information. Query: Lookup { oh:olaf ?p ?o }, hash ( 24 , ? , ? ) [Figure: equi-width histogram and QTree over the s and o axes, ticks from 1 to 30, with the source ohDoc:]
  • 28. EVALUATION: Number of estimated sources as the crucial performance factor [Figure: number of sources (log scale) for other approaches, QTree, and the actually relevant sources]
  • 29. TRADE-OFF: FRESH OR FAST. ACCESSING DATA AT RUNTIME: fresh results, slow query times. MATERIALISED STORE: fast query times, outdated results, limited coverage.
  • 30. How can live and store query processing be combined to obtain a trade-off between fast and fresh results? [DESWEB 2012] [EKAW 2012] [ISWC 2012]
  • 31. HYBRID SPARQL EXECUTION IDEA: GIVE ME THE CURRENT TEMPERATURE OF THE EUROPEAN CAPITALS. Dynamic: fresh results. Static: fast query times.
  • 32. HYBRID SPARQL: ARCHITECTURE [Figure: coherence monitor, index query interface, live query interface, and query planner, with update arrows]
  • 33. COHERENCE MONITOR: computes and stores statistics about the freshness and coverage of the cache for individual query patterns. Store independent: can be applied to any store; no indication of specific coverage or update rates. Store specific: more sensitive to the update patterns and coverage of the store.
  • 34. COHERENCE OF PREDICATES [Two pie charts, LOD cache and SPARQL endpoints; segments: complete coherent, complete incoherent, partially coherent; shares shown: 10%, 30%, 23%, 46%, 67%, 24%; predicates named: sioc:account_of, swivt:creationDate, foaf:knows]
  • 35. COHERENCE ESTIMATES
  • 36. QUERY PLANNER: finding the best query plan; identifying dynamic/static patterns; delegation and merging
  • 37. QUERY PLANNING [Figure: two query plans over tp1 to tp4, one selectivity-based and one coherence-based]
        Pattern   Selectivity   Coherence
        tp1       0.98          0.86
        tp2       0.43          0.32
        tp3       0.21          0.00
        tp4       0.15          0.91
  • 38. REAL WORLD EXPERIMENTS: Evaluation of different hybrid query plan strategies. Methodology: QWalk: various types of SPARQL SELECT queries (star-shaped, path-shaped, mixed; different numbers of patterns; at least one static and one dynamic pattern); variable counting ordering; single split with threshold (e.g. 0.5); static part is executed first; link traversal based query execution
  • 39. REAL WORLD EXPERIMENTS: Avg. of 43 queries [Figure: live recall (ticks 0.3 to 1) against speedup (ticks 1 to 12); plot labels: live, store, ordering, coh, sel, split, rnd., thres., fixed, opt]
  • 40. CONCLUSION How dynamic is Linked Data and what is the impact on store-based query processing? We verified that Linked Data is dynamic and that this affects the result freshness and completeness of cache-based query engines. Can the performance of live querying be improved by applying lightweight reasoning? Our source selection and reasoning optimisations improve query time and result recall compared to the state of the art. How effective are hash-based data summaries for source selection in live query processing? The QTree loosens the query restrictions of pure live querying and outperforms similar source selection approaches. How can live and cache query processing be combined to obtain a trade-off between fast and fresh results? Hybrid query execution uses knowledge of data dynamics to deliver fast and fresh results.
  • 41. FUTURE WORK Dynamic Linked Data Observatory: extended experiments; data mining to discover dynamic relations. Hybrid Query Execution: develop a cost model which combines selectivity and coherence; automatically find the best plan and split; combination of different query approaches. SPARQL as the query language for the Web: navigational features.
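
Slides 26 and 27 describe hashing each triple to a 3D point, summarising the points in buckets that keep cardinality and source information, and answering a lookup such as { oh:olaf ?p ?o } by hashing only the bound positions. The following is a minimal Python sketch of that idea using an equi-width histogram, not the QTree from the talk. The hash function (CRC32), the coordinate range of 30, the bucket width of 10, and all class and function names are assumptions made for illustration; the coordinates it produces will not match the [ 24 , 5 , 2 ] example on the slide.

```python
import zlib
from collections import defaultdict

SPACE = 30         # assumed coordinate range per dimension (slide axes run 1..30)
BUCKET_WIDTH = 10  # assumed equi-width bucket size


def term_hash(term: str) -> int:
    """Map an RDF term to a coordinate in 1..SPACE (CRC32 is a stand-in hash)."""
    return zlib.crc32(term.encode("utf-8")) % SPACE + 1


def bucket_of(coord: int) -> int:
    """Index of the equi-width bucket that contains a coordinate."""
    return (coord - 1) // BUCKET_WIDTH


class HistogramSummary:
    """Hypothetical summary: 3D buckets holding per-source triple counts."""

    def __init__(self):
        # (bucket_s, bucket_p, bucket_o) -> {source: number of triples}
        self.buckets = defaultdict(lambda: defaultdict(int))

    def insert(self, s, p, o, source):
        key = tuple(bucket_of(term_hash(t)) for t in (s, p, o))
        self.buckets[key][source] += 1

    def candidate_sources(self, s=None, p=None, o=None):
        """Sources whose buckets match the bound positions; None is a variable."""
        wanted = [None if t is None else bucket_of(term_hash(t)) for t in (s, p, o)]
        found = set()
        for key, sources in self.buckets.items():
            if all(w is None or w == k for w, k in zip(wanted, key)):
                found.update(sources)
        return found


summary = HistogramSummary()
summary.insert("oh:olaf", "foaf:name", '"Olaf Hartig"', "ohDoc:")
summary.insert("cb:chris", "foaf:name", '"Chris Bizer"', "cbDoc:")

# Lookup { oh:olaf ?p ?o }: only the subject position is hashed.
print(summary.candidate_sources(s="oh:olaf"))
```

Because terms are hashed and grouped into buckets, the lookup never misses a source that contains a matching triple, but it may return sources that do not (the false positives mentioned on slide 25). Smaller buckets reduce false positives at the cost of a larger index.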

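Slides 37 and 38 contrast selectivity-based and coherence-based planning and mention a single split with a threshold (e.g. 0.5), with the static part executed first. The Python sketch below applies both ideas to the four triple patterns from the table on slide 37. It rests on assumptions the slides do not state: that a lower selectivity value means the pattern is evaluated earlier, that a pattern whose coherence is at or above the threshold counts as static (answered from the store) and otherwise as dynamic (answered live), and that patterns within each part are ordered by descending coherence. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternStats:
    name: str
    selectivity: float
    coherence: float


# Values taken from the table on slide 37.
STATS = [
    PatternStats("tp1", 0.98, 0.86),
    PatternStats("tp2", 0.43, 0.32),
    PatternStats("tp3", 0.21, 0.00),
    PatternStats("tp4", 0.15, 0.91),
]


def order_by_selectivity(stats):
    """Assumed convention: the lowest selectivity value is evaluated first."""
    return [s.name for s in sorted(stats, key=lambda s: s.selectivity)]


def split_by_coherence(stats, threshold=0.5):
    """Single split: coherence >= threshold is treated as static, else dynamic."""
    static = sorted((s for s in stats if s.coherence >= threshold),
                    key=lambda s: s.coherence, reverse=True)
    dynamic = sorted((s for s in stats if s.coherence < threshold),
                     key=lambda s: s.coherence, reverse=True)
    return [s.name for s in static], [s.name for s in dynamic]


print(order_by_selectivity(STATS))  # ['tp4', 'tp3', 'tp2', 'tp1']
static_part, dynamic_part = split_by_coherence(STATS)
print(static_part, dynamic_part)    # ['tp4', 'tp1'] ['tp2', 'tp3']
```

Under these assumptions, a hybrid engine would send tp4 and tp1 to the materialised store first and then resolve tp2 and tp3 through live lookups, joining the two partial results.
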
Editor's notes

  1. e.g. Sindice, Watson, SWSE, Virtuoso
  2. No mention of stream processing. No infrastructure needed; not asking for events. Ad hoc. How to do query processing.
  3. This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  4. This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  5. Make it clearer: and the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  6. Don’t mention two stores
  7. We proved for two prominent stores that the problem exists.
  8. e.g. Sindice, Watson, SWSE, Virtuoso
  9. denote
  10. More links and connect more parts of the graph
  11. Snapshot live
  12. Overlay, dereferencing schema knowledge
  13. Reasonable increase. Most inferences look reasonable.
  14. We run it live. Query generation to the slide.
  15. Add some query times here. If you would assume linear query times. Use a table with ratios.
  16. Materialised store, outdated results, we need to check them again. But that means we do not use the data, only the source information.
  17. Shrink the source index, compared to the materialised index.
  18. Introduce example query to show that LTBQE is limited and we can fix it by doing source selection. We do not need a full materialised index, since we retrieve the source and compute the query over it. If we do live lookup.
  19. investigate several lightweight source selection approaches to further improve the query times, increase the result recall and loosen the query type restriction of pure link traversal based query approaches
  20. Could combine with previous. Attach source to point. If we would store the source information for each point, we would end up with a full index with dic. So we split the numerical data space into buckets.
  21. QTree optimal for sparse data. Same number of buckets, but more fine-grained source selection.
  22. Show experiments in a different way. Use the diagram again.
  23. Introduce bit by bit: interfaces, coherence, query planner.
  24. involve monitoring a large range of Linked Data sources to build a comprehensive, global picture of the dynamicity of the Web of Data. Previous empirical studies [17,15] have shown varying levels of dynamicity across Linked Data sources; furthermore, we speculate that dynamicity varies by the schema of data [17]. In terms of benefits, cache-independent estimates can be applied generically to any store (and indeed to other use-cases) [6]; however, they give no indication as to the specific coverage or update rates, etc., of the cache engine at hand.
  25. Materialised stores: LOD cache, Sindice SPARQL. Use store icons.
  26. Triple pattern estimates. Centered predicates. Query sampling URIs. Distinct predicates for caches.
  27. More details
  28. Filter out queries which produced empty results (offline sources)
  29. We verified that Linked Data is dynamic, which has an impact on the results of materialised engines. LTBQE approaches offer fresh results but work only for dereferenceable URIs, and we can improve the recall through reasoning extensions. A compact data summary such as the QTree poses no query restrictions and can find more sources that can answer the query than LTBQE. Materialised caches and LTBQE can be combined in a hybrid execution framework to deliver fresh and fast results by integrating the knowledge about data dynamics.
  30. Image not optimal
  31. This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  32. This setup allows for studying (i) dynamics within the datasets, (ii) dynamics between datasets (esp. links), and (iii) [make it clearer] the growth of Linked Data and the arrival of new sources (although to a lesser extent).
  33. Add label and delete two sources
  34. Explain to Claudio – maybe remove it
  35. Triple pattern estimates. Centered predicates. Query sampling URIs. Distinct predicates for caches.