
12 Knowledge Mining


Lecture 12 - Information Service Engineering, FIZ Karlsruhe, Leibniz Institute for Information Infrastructure, KIT Karlsruhe

Published in: Education

12 Knowledge Mining

  1. Information Service Engineering, Lecture 12: Knowledge Mining
      Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology
      Summer Semester 2017
      This file is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)
  2. Last lecture: Information Retrieval - 2
      4.1 A Brief History of Libraries and IR
      4.2 Fundamental Concepts of IR
      4.3 Information Retrieval Models
      4.4 Retrieval Evaluation
      4.5 Web Information Retrieval
      4.6 Indexing
      4.7 Query Processing and Ranking
      ● Information Retrieval Models
      ● Average Precision @ rank
      ● Mean Average Precision (MAP)
      ● Web Crawler
      ● Inverted Index
      ● TF-IDF
  3. TF-IDF
      ● Term Frequency - Inverse Document Frequency is tf multiplied by idf
      ● Term frequency tf
         ○ +1 to avoid negative results
         ○ log to lower the impact of frequent words
      ● Inverse document frequency idf
      ● Normalization is usually done via cosine similarity, but there are more weighting variants
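The weighting described on the TF-IDF slide can be sketched in a few lines. The slide's own formulas are not part of this transcript, so the concrete variant used here, tf = log(1 + f) and idf = log(N / df), is an assumption chosen to match the two hints "+1" and "log"; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = log(1 + raw frequency), idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: math.log(1 + freq) * math.log(n_docs / df[term])
            for term, freq in counts.items()
        })
    return weights

# Toy corpus, invented for illustration only.
docs = [
    ["linked", "data", "query"],
    ["linked", "data", "data", "mining"],
    ["knowledge", "mining"],
]
weights = tf_idf(docs)
```

In this sketch a term that occurs twice in a document ("data" in the second document) gets a higher weight than an equally common term that occurs once, and a term that appeared in every document would get weight 0.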
  4. PageRank
      ● For Web search, besides the relevance of a search term for a document, the "popularity" of a document can also be taken into account
      ● Google's PageRank is a graph algorithm that determines the "importance" of a web page via its link graph
      ● PageRank basic assumptions:
         ○ The importance of a page depends on the number of incoming links, i.e. other web pages referring to the web page
         ○ If a web page links to another web page, the importance of the link is determined by the importance of the originating web page
         ○ The more outgoing links a page has, the less important a single outgoing link is
  5. PageRank (Brin, Page, 1998)
      PR_i = (1 - d)/N + d · Σ_{j=1}^{n} PR_j / c_j
      ● d: damping factor, d ∈ [0,1]
      ● Σ: sum over all incoming links of page i
      ● PR_j / c_j: importance of incoming link j
      ● PR_i: PageRank of page i
      ● PR_j: PageRank of page j that is linking to page i
      ● c_j: number of different pages that page j is linking to
      ● n: number of incoming links to page i
      ● N: number of all pages
  6. PageRank: iteration of the PageRank computation for four pages A, B, C, D
      Iteration   r(A)     r(B)     r(C)     r(D)
      1           1.0      1.0      1.0      1.0
      2           1.0      0.575    2.275    0.15
      3           2.083    0.575    1.1912   0.15
      …           …        …        …        …
      n           1.49     0.7833   1.577    0.15
      Initial state: A = 1.0, B = 1.0, C = 1.0, D = 1.0
      Final state: A = 1.49, B = 0.78, C = 1.57, D = 0.15
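The iteration table can be reproduced with a short sketch. Three things here are assumptions, because the link diagram is not part of this transcript: the link graph (A→B, A→C, B→C, C→A, D→C) is inferred from the table values, the damping factor is taken as d = 0.85, and the constant term is (1 - d) rather than (1 - d)/N, since page D without incoming links stays at 0.15. All function and variable names are hypothetical.

```python
# Link graph inferred from the table values (assumption): page -> pages it links to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank_step(ranks, links, d=0.85):
    """One synchronous update: every page is recomputed from the previous ranks."""
    new_ranks = {}
    for page in links:
        # Each linking page passes on its rank divided by its number of outgoing links.
        incoming = sum(
            ranks[src] / len(targets)
            for src, targets in links.items()
            if page in targets
        )
        new_ranks[page] = (1 - d) + d * incoming
    return new_ranks

def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate from r = 1.0 for every page until the ranks stop changing."""
    ranks = {page: 1.0 for page in links}
    for _ in range(max_iter):
        new_ranks = pagerank_step(ranks, links, d)
        if max(abs(new_ranks[p] - ranks[p]) for p in links) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks

second = pagerank_step({page: 1.0 for page in LINKS}, LINKS)  # row "2" of the table
third = pagerank_step(second, LINKS)                          # row "3" of the table
final = pagerank(LINKS)                                       # row "n" of the table
```

Under these assumptions `second`, `third` and `final` agree with rows 2, 3 and n of the table up to rounding.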
  8. Lecture Overview
      1. Information, Natural Language and the Web
      2. Natural Language Processing
      3. Linked Data Engineering
      4. Information Retrieval
      5. Knowledge Mining
      6. Exploratory Search and Recommender Systems
  9. 5. Knowledge Mining
      5.1 From Data to Knowledge
      5.2 Linked Data based Information Visualization
      5.3 Linked Data based Knowledge Mining
      5.4 Linked Data Analytics
  10. DIKW Pyramid, Ackoff 1989 [1]
      ● Wisdom: collective application of knowledge in context; understanding principles (why?, what is best?, doing things right); evaluated understanding
      ● Knowledge: experience, context, value applied to a message; understanding patterns (principles: how to?); information enriched with semantics
      ● Information: a message meant to change the receiver's perception; understanding relations (description: what?); data in usable form
      ● Data: discrete objective facts about an event; raw characters and symbols
      ● Figure labels: Knowledge Mining; axis from past experience to future novelty
  11. Data
      ● Data is raw.
      ● It simply exists and has no significance beyond its existence (in and of itself).
      ● It can exist in any form, usable or not.
  12. Information
      ● Information is data that has been given meaning by way of relational connection.
      ● This "meaning" can be useful, but does not have to be.
      ● Information is contained in descriptions and in answers to questions that begin with such words as who, what, when, where, and how many.
  13. Knowledge
      ● Knowledge is the appropriate collection of information, such that its intent is to be useful.
      ● Understanding is a continuum that leads from data, through information and knowledge, and ultimately to wisdom.
      ● Data transforms to information by convention, information to knowledge by cognition, and knowledge to wisdom by contemplation.
  14. Knowledge
      ● How do we transform data into knowledge?
         1. Collect Data
         2. Organize Data
         3. Analyze Data
  15. 5. Knowledge Mining: 5.2 Linked Data based Information Visualization
  16. Charles Joseph Minard (1781-1870) - Napoleon's Russian Campaign
      http://patrimoine.enpc.fr/document/ENPC01_Fol_10975?image=54#bibnum
  17. https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard_map_of_napoleon.png
  18. A Picture is worth a 1000 words...
      ● Pictures were used to convey information long before the development of writing
      ● A single picture can be processed ("understood") much faster than a (linear) text page
      ● Human perception processes in parallel, whereas text analysis is limited by the sequential process of reading
      https://commons.wikimedia.org/wiki/File%3AA_picture_is_worth_a_thousand_words.jpg
  19. Information Visualization
      ● Information Visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition
      ● Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present information quickly and clearly
         ○ a static form of information visualization
         ○ aims to emphasize specific findings gained from the visualized data
         ○ Mandatory precondition: Data Analysis
  20. Hands On Data Visualization Example
      ● Dataset: DBpedia, queried with SPARQL
      ● Task: Draw a map chart which indicates the number of soccer players per country
      ● Visualization: e.g. via GoogleDoc
  21. Task: Draw a Map Chart...
      ● Look for a representative (prominent) example
  22. Task: Draw a Map Chart...
      ● Dataset: DBpedia
         ○ Examine a representative example from DBpedia, e.g. http://dbpedia.org/page/Lionel_Messi
  23. Task: Draw a Map Chart...
      ● SPARQL Query:
         ○ Country and number of soccer players per country
            ■ ?s rdf:type dbo:SoccerPlayer .
            ■ ?s dbo:birthPlace ?birthplace .
            ■ ?birthplace dbo:country ?country .
            ■ ?country rdfs:label ?country_name .
            ■ FILTER (lang(?country_name) = "en")
            ■ GROUP BY ?country_name
            ■ COUNT(DISTINCT ?s)
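The query fragments on this slide can be assembled into one complete query. The sketch below does this in Python and builds a request URL that asks for a CSV result. The endpoint address `https://dbpedia.org/sparql`, the `format=text/csv` request parameter, the result column name `?players` and the `ORDER BY` clause are assumptions that do not appear on the slide; no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed address of the public DBpedia endpoint

# Triple patterns, FILTER, GROUP BY and COUNT as listed on the slide;
# the PREFIX lines, ?players and ORDER BY are additions for this sketch.
QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?country_name (COUNT(DISTINCT ?s) AS ?players)
WHERE {
  ?s rdf:type dbo:SoccerPlayer .
  ?s dbo:birthPlace ?birthplace .
  ?birthplace dbo:country ?country .
  ?country rdfs:label ?country_name .
  FILTER (lang(?country_name) = "en")
}
GROUP BY ?country_name
ORDER BY DESC(?players)
"""

def csv_request_url(query, endpoint=ENDPOINT):
    """Build a GET URL; passing the result format as a parameter is an assumption."""
    return endpoint + "?" + urlencode({"query": query, "format": "text/csv"})

url = csv_request_url(QUERY)
```

Opening `url` in a browser or with an HTTP client would be one way to obtain the CSV file used in the following steps.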
  24. Task: Draw a Map Chart...
      ● SPARQL Query: DBpedia SPARQL query
  25. Task: Draw a Map Chart...
      ● SPARQL Result:
         ○ Save as CSV
  26. Task: Draw a Map Chart...
      ● Import the CSV data into a spreadsheet, e.g. in GoogleDocs
  27. Task: Draw a Map Chart...
      ● Display the spreadsheet data in a diagram
  28. Task: Draw a Map Chart...
      ● Important step: Data Cleansing
         ○ Is the displayed data correct?
         ○ Outlier detection
         ○ In our example: are all displayed countries really existing countries?
         ○ Potential solutions:
            ■ Adapt your original query (new SPARQL query)
            ■ Remove evidently wrong data, manually or procedurally
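Removing evidently wrong data procedurally could look like the following sketch: rows whose country label is not in a reference list are set aside for inspection instead of being plotted. The CSV excerpt, its column names and the reference set `KNOWN_COUNTRIES` are made up for illustration; a real reference list would have to come from another source.

```python
import csv
import io

# Hypothetical excerpt of a saved query result; names and counts are invented.
RAW_CSV = """country_name,players
Germany,120
Brazil,95
West Germany,7
Soviet Union,4
"""

# Hypothetical reference list of currently existing countries.
KNOWN_COUNTRIES = {"Germany", "Brazil"}

def split_rows(csv_text, known):
    """Separate rows with a known country from rows that need a closer look."""
    kept, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["players"] = int(row["players"])
        (kept if row["country_name"] in known else rejected).append(row)
    return kept, rejected

kept, rejected = split_rows(RAW_CSV, KNOWN_COUNTRIES)
```

Keeping the rejected rows rather than dropping them silently makes it possible to decide whether the original query should be adapted instead.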
  29. Task: Draw a Map Chart...
      ● Redraw after Data Cleansing
  30. 5. Knowledge Mining: 5.3 Linked Data based Knowledge Mining
  31. Knowledge Mining and Knowledge Discovery
      Knowledge Discovery [in Databases] (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in (massive) data sources. (Fayyad et al, 1996 [2])
      ● valid: to a certain degree the discovered patterns should also hold for new, previously unseen problem instances.
      ● novel: at least to the system and preferably to the user.
      ● potentially useful: they should lead to some benefit to the user or task.
      ● ultimately understandable: the end user should be able to interpret the patterns either immediately or after some postprocessing.
  32. Knowledge Mining and Knowledge Discovery
      ● Goals:
         ○ Descriptive Modelling: explains the characteristics and the behaviour of the observed data
         ○ Predictive Modelling: predicts the behaviour of new data based on some model
      ● Important:
         ○ The extracted model/pattern does not have to apply in 100% of the cases
  33. The Knowledge Discovery Process
      1. Selection (Data → Target Data): select a relevant dataset or focus on a subset of a dataset
      2. Preprocessing/Cleaning (Target Data → Preprocessed Data): data integration from different sources, data cleaning
      3. Transformation (Preprocessed Data → Transformed Data): select useful features, feature transformation, dimensionality reduction
      4. Data Mining (Transformed Data → Patterns): search for patterns of interest
      5. Evaluation (Patterns → Knowledge): evaluate patterns based on interestingness measures, model validation
  34. Data Cleaning
      ● "Dirty" Data:
         ○ Dummy values, absence of data, contradicting data, etc.
      ● Steps in Data Cleaning
         ○ Parsing: locates and identifies individual data elements in raw data
         ○ Correcting: corrects parsed individual data components using sophisticated data algorithms
         ○ Normalization: applies conversion routines to transform data into standard formats
         ○ Matching: searching and matching records within and across data based on predefined rules
         ○ Consolidating: merges data into one representation
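The five cleaning steps can be illustrated on a toy example. This is only a sketch: the raw record format, the dummy-value list and the matching rule (same normalized name) are invented for illustration and are far simpler than what real data requires.

```python
import re

# Hypothetical raw records, invented for illustration only.
RAW = [
    "Lionel  Messi; height=1.70 m",
    "lionel messi; height=170 cm",
    "Thomas Müller; height=N/A",
]

def parse(line):
    """Parsing: locate the individual data elements in a raw line."""
    name, _, rest = line.partition(";")
    return {"name": name, "height": rest.split("=", 1)[1].strip()}

def correct(rec):
    """Correcting: replace dummy values by an explicit missing marker."""
    if rec["height"] in {"N/A", "", "?"}:
        rec["height"] = None
    return rec

def normalize(rec):
    """Normalization: one name spelling, heights as integer centimetres."""
    rec["name"] = re.sub(r"\s+", " ", rec["name"]).strip().title()
    if rec["height"] is not None:
        value, unit = rec["height"].split()
        rec["height"] = round(float(value) * (100 if unit == "m" else 1))
    return rec

def consolidate(records):
    """Matching and consolidating: records with the same name are merged into one."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(rec["name"], {"name": rec["name"], "height": None})
        if entry["height"] is None:
            entry["height"] = rec["height"]
    return list(merged.values())

clean = consolidate([normalize(correct(parse(line))) for line in RAW])
```

Here the two spellings of the same player end up as one record with a height of 170 cm, while the record with the dummy value keeps an explicit missing height.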
  35. Knowledge Mining Functionality
      ● Characterization: summarizing general features of objects in a target class (concept description)
      ● Discrimination: comparing general features of objects between a target class and a contrasting class (concept comparison)
      ● Association: studying the frequency of items occurring together
      ● Prediction: predicting some unknown or missing attribute values
      ● Classification: organizing data in given classes based on attribute values (supervised)
      ● Clustering: organizing data in classes based on attribute values (unsupervised)
      ● Outlier analysis: identifying and explaining exceptions (surprises)
      ● Time-series analysis: analyzing trends and deviations
  36. Data Analysis in the Knowledge Mining Process
      ● Data Analysis is a fundamental iterative process:
         1. Formulation and execution of a query
         2. Analysis of the results
         3. Formulation of a consecutive query based on the achieved results
      ● Goals of Data Analysis:
         ○ maximize understanding of analyzed data
         ○ uncover hidden structures/patterns
         ○ extraction of important variables
         ○ detection of anomalies and outliers
         ○ testing of hypotheses
         ○ development of a simple model
  37. 5. Knowledge Mining: 5.4 Linked Data Analytics
  38. HandsOn: Linked Data Analytics
      ● Available Toolset:
         ○ Data 1: DBpedia SPARQL endpoint
         ○ Data 2: Wikidata SPARQL endpoint
      ● Query language:
         ○ SPARQL
      ● Simple statistics and visualization:
         ○ R https://www.r-project.org/
  39. HandsOn: Linked Data Analytics
      https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg
  40. Linked Data Analytics
      ● First example: Analyse the number of achieved goals of soccer players with DBpedia
      ● Solution
         1. Create appropriate SPARQL query
         2. Save result as csv file
         3. Analyze data with R
  41. Linked Data Analytics
      1. Create appropriate SPARQL query
         SELECT ?goals WHERE {
           ?s rdf:type dbo:SoccerPlayer ;
              dbp:totalgoals ?goals .
           FILTER (DATATYPE(?goals) = xsd:integer)
         }
      2. Save the result as csv file
  42. Linked Data Analytics
      3. Analyse data with R
         ○ read data: goals <- read.csv("dbpedia-goals", header=TRUE)
         ○ summarize data: summary(goals)
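For readers without R at hand, the read-and-summarize step can be approximated in Python. The sketch assumes the saved result is a CSV with a single `goals` column; the sample values are made up. The `"inclusive"` method of `statistics.quantiles` is used because, to my understanding, it corresponds to the default quantile definition behind R's `summary()`.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for the saved query result.
CSV_TEXT = """goals
0
3
12
-1
45
7
"""

def read_goals(csv_text):
    """Read the 'goals' column as integers."""
    return [int(row["goals"]) for row in csv.DictReader(io.StringIO(csv_text))]

def summary(values):
    """Minimum, quartiles, mean and maximum, similar in spirit to R's summary()."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": median,
        "mean": statistics.mean(values), "q3": q3, "max": max(values),
    }

goals = read_goals(CSV_TEXT)
stats = summary(goals)
```

A negative minimum, as in this made-up sample, is the kind of value that hints at a data quality problem before any plot is drawn.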
  43. Linked Data Analytics, first example: Analyse the number of achieved goals of soccer players
      3. Analyse data with R
         ○ analyze data via boxplot: boxplot(goals)
  44. Linked Data Analytics, first example
      3. Analyse data with R
         ○ adapt ranges for visibility: boxplot(goals,ylim=c(-90,200))
  46. Linked Data Analytics, first example
      3. Analyse data with R
         ○ the box spans the Inter Quartile Range (IQR)
  47. Linked Data Analytics, first example
      3. Analyse data with R
         ○ IQR = Q3 - Q1
         ○ Whiskers: IQR x 1.5 = (Q3 - Q1) x 1.5
         ○ Upper whisker limit: Q3 + 1.5 IQR
         ○ Lower whisker limit: Q1 - 1.5 IQR
         ○ Values beyond the whiskers are outliers
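The whisker rule can be checked numerically with a small sketch. The input values are made up, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`; R's `boxplot` derives its box from hinges, so its numbers may differ slightly on some samples.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, IQR, whisker ends and outliers according to the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that still lie inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "q3": q3, "iqr": iqr,
        "whiskers": (min(inside), max(inside)),
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }

goals = [0, 1, 2, 3, 5, 8, 13, 21, 200, -90]  # made-up values
result = boxplot_stats(goals)
```

For this sample the fences are -14.5 and 27.5, so -90 and 200 are reported as outliers and the whiskers end at 0 and 21.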
  48. Linked Data Analytics, first example
      3. Analyse data with R
         ○ Data Cleaning
            ■ remove negative values
            ■ examine outliers (SPARQL query)
         ○ New data analysis
  49. Linked Data Analytics
      Second example:
      ● Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)?
      https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg
  50. Linked Data Analytics
      1. Create appropriate SPARQL query
         SELECT (SAMPLE(?g) AS ?goals)
                (SAMPLE(xsd:integer(?h)) AS ?height)
                (SAMPLE(xsd:date(?b)) AS ?bday)
         WHERE {
           ?s rdf:type dbo:SoccerPlayer ;
              <http://dbpedia.org/ontology/Person/height> ?h ;
              dbo:birthDate ?b ;
              dbp:totalgoals ?g .
           FILTER (DATATYPE(?g) = xsd:integer)
         }
         GROUP BY ?s
      2. Save the result as csv file
  51. Linked Data Analytics, second example: Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in DBpedia)?
      3. Analyse data with R
  52. Linked Data Analytics, second example
      3. Analyse data with R
         > boxplot(goals2$height)
  53. Linked Data Analytics, second example
      3. Analyse data with R
         > boxplot(goals2$bday)
  54. Linked Data Analytics, second example
      3. Analyse data with R
         ○ How are the values distributed?
         ○ > hist(goals2$goals)
  55. Linked Data Analytics, second example
      3. Analyse data with R
         ○ > hist(goals2$height)
  56. Linked Data Analytics, second example
      3. Analyse data with R
         ○ > hist(goals2$bday)
  57. Linked Data Analytics, second example
      3. Analyse data with R
         ○ Data Cleaning and 2D plot
            1. height vs. goals
               > plot(goals2$height,goals2$goals)
  58. Linked Data Analytics, second example
      3. Analyse data with R
         ○ Data Cleaning and 2D plot
            2. birthday vs. goals
               > plot(goals2$bday,goals2$goals)
  59. Linked Data Analytics, second example
      3. Analyse data with R
         ○ Data Cleaning and 2D plot
            3. birthday vs. height
               > plot(goals2$bday,goals2$height)
  60. Linked Data Analytics
      ● But is there really a relationship between the number of achieved goals, the birthdate, or the height of a soccer player?
      ● Statistics to the Rescue:
         ○ The correlation coefficient quantifies the correlation and dependence of two variables X and Y:
            ρ(X,Y) = cov(X,Y) / (σ_X · σ_Y)
            with cov(X,Y) the covariance of X and Y, and σ_X, σ_Y the standard deviations of X and Y
         ○ Applied to our problem in R:
            ■ cor(goals2$height,goals2$goals) = -0.04350283
            ■ cor(goals2$bday,goals2$goals) = -0.1098645
            ■ cor(goals2$bday,goals2$height) = 0.2881508
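The correlation coefficient, covariance divided by the product of the two standard deviations, can be computed by hand in a few lines. The sketch below uses made-up height and goal values, not the DBpedia data, and the function name `pearson` is hypothetical. It divides by n throughout; dividing by n - 1 instead would give the same coefficient because the factor cancels.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient: cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

heights = [170, 175, 180, 185, 190]  # made-up values in cm
goals = [30, 12, 25, 8, 20]          # made-up values
r = pearson(heights, goals)
```

Values close to 0, like the first two results on the slide, indicate that there is hardly any linear relationship; values close to +1 or -1 would indicate a strong one.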
  61. 61. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology 61 Linked Data Analytics ● Let’s have a closer look at the distribution of heights looks like a Normal distribution might probably be an anomaly 5. Knowledge Engineering/ 5.4 Linked Data Analytics
  62. 62. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology 62 Linked Data Analytics ● Let’s have a closer look at the distribution of heights ○ Use SPARQL for further analysis SPARQL query might probably be an anomaly select ?s ?height WHERE { ?s rdf:type dbo:SoccerPlayer ; dbo:height ?height . } ORDER BY ?height 5. Knowledge Engineering/ 5.4 Linked Data Analytics
  63. 63. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology 63 Linked Data Analytics ● Let’s have a closer look at the distribution of heights ○ Use SPARQL for further analysis ○ For DBpedia, you always have to consider the extraction process from the original Wikipedia infoboxes... 5. Knowledge Engineering/ 5.4 Linked Data Analytics
  64. 64. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology 64 Linked Data Analytics Third example: ● Use another data source: ● Is there a relationship between the number of achieved goals, the birthdate, or the height of a soccer player (in Wikidata) https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg 5. Knowledge Engineering/ 5.4 Linked Data Analytics
  65. 65. Information Service Engineering , Prof. Dr. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure & AIFB - Karlsruhe Institute of Technology 65 ● WIKIDATA follows a different data schema qualifiers https://www.wikidata.org/wiki/Q39444 statement
  66. 66. Linked Data Analytics ● Wikidata uses a different data schema. Access via different namespaces for properties: ● wdt: connects an item to a value: wd:Q39444 wdt:P54 ?team . (figure annotation: object = subject/context for statement) Image: https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg Item: https://www.wikidata.org/wiki/Q39444
  67. 67. Linked Data Analytics ● Wikidata uses a different data schema. Access via different namespaces for properties: ● wdt: connects an item to a value: wd:Q39444 wdt:P54 ?team . ● p: connects a subject to a statement: wd:Q39444 p:P54 ?team_statement . (figure label: statement) Image: https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg Item: https://www.wikidata.org/wiki/Q39444
  68. 68. Linked Data Analytics ● Wikidata uses a different data schema. Access via different namespaces for properties: ● wdt: connects an item to a value: wd:Q39444 wdt:P54 ?team . ● p: connects a subject to a statement: wd:Q39444 p:P54 ?team_statement . ● pq: connects a statement to a qualifier value: ?team_statement pq:P1351 ?statement_value . (figure annotation: property and object/value of statement) Image: https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg Item: https://www.wikidata.org/wiki/Q39444
  69. 69. Linked Data Analytics Access via different namespaces for properties: ● pq: connects a statement to a qualifier value: wd:Q39444 p:P54 ?team_statement . ?team_statement pq:P1351 ?goals . (figure annotation: pq:P1351 ?goals) Image: https://commons.wikimedia.org/wiki/File:Fu%C3%9Fballgeschichte_(1892).jpg Item: https://www.wikidata.org/wiki/Q39444
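The three access paths can be illustrated with a toy in-memory triple list. This is a sketch under stated assumptions, not Wikidata's actual storage or API: the team identifiers, statement node names, and goal counts are hypothetical placeholders, and `objects` is a helper invented for this example.

```python
# Toy triple list; every value below is a hypothetical placeholder.
TRIPLES = [
    ("wd:Q39444", "wdt:P54", "wd:TEAM_A"),    # wdt: item -> value directly
    ("wd:Q39444", "wdt:P54", "wd:TEAM_B"),
    ("wd:Q39444", "p:P54", "statement_1"),    # p: item -> statement node
    ("wd:Q39444", "p:P54", "statement_2"),
    ("statement_1", "pq:P1351", 10),          # pq: statement -> qualifier value
    ("statement_2", "pq:P1351", 25),
]

def objects(subject, predicate):
    """Return all objects matching the pattern (subject, predicate, ?o)."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

teams = objects("wd:Q39444", "wdt:P54")       # wd:Q39444 wdt:P54 ?team .
statements = objects("wd:Q39444", "p:P54")    # wd:Q39444 p:P54 ?team_statement .
goals = [g for st in statements for g in objects(st, "pq:P1351")]
print(teams, statements, goals)
```

The point of the indirection is visible in the last step: the qualifier hangs off the statement node, so it cannot be reached through the direct `wdt:` path.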
  70. 70. Linked Data Analytics 1. Create an appropriate SPARQL query: SELECT (SUM(?goals) as ?total_goals) (SAMPLE(?height) as ?height) (SAMPLE(xsd:date(?birthdate)) as ?bday) WHERE { ?s wdt:P106 wd:Q937857 ; p:P54 ?team_statement ; wdt:P2048 ?height ; wdt:P569 ?birthdate . ?team_statement pq:P1351 ?goals . } GROUP BY ?s 2. Save the result as a CSV file
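What the GROUP BY with SUM and SAMPLE does, and what the saved CSV might look like, can be sketched offline. The rows below are hypothetical stand-ins for the ungrouped query results (one row per player and team statement); the function names are invented for this example, and "sample" is implemented here simply as the first value seen per player.

```python
import csv
import io
from collections import defaultdict

# Hypothetical ungrouped rows: one per (player, team statement).
rows = [
    {"s": "player_a", "goals": 10, "height": 180, "birthdate": "1990-01-15"},
    {"s": "player_a", "goals": 25, "height": 180, "birthdate": "1990-01-15"},
    {"s": "player_b", "goals": 3, "height": 172, "birthdate": "1985-07-02"},
]

def aggregate(rows):
    """Group by player: sum the goals, keep one height and birthdate each."""
    total = defaultdict(int)
    sample = {}
    for r in rows:
        total[r["s"]] += r["goals"]
        sample.setdefault(r["s"], (r["height"], r["birthdate"]))
    return [{"total_goals": total[s], "height": h, "bday": b}
            for s, (h, b) in sample.items()]

def to_csv(records):
    """Serialize the aggregated records as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["total_goals", "height", "bday"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

result = aggregate(rows)
csv_text = to_csv(result)
print(csv_text)
```

Writing `csv_text` to a file gives one row per player, which is the shape the subsequent R analysis works on.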
  71. 71. Linked Data Analytics Is there a relationship between the number of goals scored, the birth date, and the height of a soccer player (in Wikidata)?
  72. 72. Linked Data Analytics Is there a relationship between the number of goals scored, the birth date, and the height of a soccer player (in Wikidata)? ● height vs. goals > plot(goals3$height,goals3$goals) > cor(goals3$height,goals3$goals) [1] -0.046644459
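R's `cor` computes the Pearson correlation coefficient by default, so a value of about -0.047 indicates essentially no linear relationship between height and goals. The sketch below shows the computation itself; the function name `pearson` and the two small data vectors are hypothetical and chosen only to produce an uncorrelated example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: goals do not rise or fall with height.
heights = [170, 175, 180, 185, 190]
goals = [10, 30, 5, 30, 10]
r = pearson(heights, goals)
print(r)
```

A coefficient near zero only rules out a linear trend; a scatter plot, as on the slide, is still needed to spot other patterns.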
  73. 73. Linked Data Analytics Is there a relationship between the number of goals scored, the birth date, and the height of a soccer player (in Wikidata)? ● Distribution of birthdays > hist(goals3$bday)
  74. 74. Linked Data Analytics Are there any relationships between the number of goals and other properties? SPARQL query
  75. 75. Information Service Engineering 5. Knowledge Mining 5.1 From Data to Knowledge 5.2 Linked Data based Information Visualization 5.3 Linked Data based Knowledge Mining 5.4 Linked Data Analytics
  76. 76. 5. Knowledge Mining Bibliography [1] Ackoff, R. L. (1989). From data to wisdom. Journal of Applied Systems Analysis 15: 3-9. [2] Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, Menlo Park, CA, USA, 1-34. [3] The R Project for Statistical Computing, https://www.r-project.org/
  77. 77. 5. Knowledge Mining Syllabus Questions ● Explain the concept of PageRank. What are its basic assumptions? ● What is the difference between Data, Information, Knowledge, and Wisdom? ● What is Knowledge Discovery? ● What are the goals of Knowledge Discovery? ● Explain the process of Knowledge Discovery. ● Why do we need a “Data Cleaning” step in Knowledge Mining?
