
# Matching Conceptual Models Using Multivariate Analysis



1. Matching Conceptual Models (part of the ‘IBIOSEARCH’ project). June 9, 2008. Quantitative Methods. Ritu Khare
2. Order of the Presentation
    - Problem and Background
    - Research Questions
    - Initial Dataset
    - Overall Methodology
        - Representation of Dataset A
        - Criteria to compare two entities
        - Generation of Dataset B
        - Multivariate Analysis of Dataset B
    - Results
        - Case I
        - Case II
        - Case III
        - Case IV
    - Inferences
    - Future Work
    - References
3. Problem and Background (section 1)
    - A search interface is represented as a conceptual model.
    - The aim is to combine all search interfaces, i.e., to combine several conceptual models. Hence, matching of models is required.
    - In this project, the focus is on matching of entities.
    - [Figure: example search interfaces and the corresponding conceptual models; labels A, B, C, X, Y]
4. Research Questions (section 2)
    - Find one or more entity matching techniques to match the entities of two models.
    - Does this technique (or combination of techniques) provide a good way to compare two entities?
    - What other bases of comparison can be used?
5. Initial Dataset A (section 3)
    - 20 conceptual models
    - [Figures: Example 1 and Example 2, two conceptual-model diagrams. Label text as extracted: Expect; Matrix Domain DB; BLASTP Alignments Accession No. Gene ID Title Sequence Gene Patent Patent Sequence Number Gene Name]
6. Overall Methodology (section 4)
    1. Conceptual models
    2. Representation of Dataset A in structured tables
    3. Criteria to compare entities from different models (entity name, attribute set, relationship set)
    4. Generation of Dataset B
    5. Multivariate analysis of Dataset B
    6. Analysis results
7. Representation of Dataset A (section 4.1)
    - Every model is represented as a list of entities.
    - Every entity in a model is represented as:
        - Entity name
        - List of attributes
        - List of relationships
    - Dataset A has the following columns: (Model_ID, Entity_name, Attribute_set, Relationship_set)
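One way to picture a row of Dataset A is as a small record type. The following is a minimal Python sketch, not the author's implementation: the class name `EntityRow`, the helper `entities_of`, and the sample rows are hypothetical, and it assumes that a relationship is recorded as the name of the related entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRow:
    """One row of Dataset A: a single entity from a single conceptual model."""
    model_id: int
    entity_name: str
    attribute_set: frozenset
    relationship_set: frozenset  # assumption: names of related entities


# Hypothetical rows, for illustration only (not taken from the real dataset).
dataset_a = [
    EntityRow(1, "Gene", frozenset({"Gene ID", "Gene Name"}), frozenset({"Sequence"})),
    EntityRow(2, "Patent", frozenset({"Title", "Sequence Number"}), frozenset({"Gene"})),
]


def entities_of(model_id, rows):
    """Return the entity names that belong to the given model."""
    return [row.entity_name for row in rows if row.model_id == model_id]
```

Using frozen records and frozensets keeps each row hashable, which is convenient when pairs of rows are compared later.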
8. Criteria to Compare Two Entities (section 4.2)
    - All entities from two different models are compared.
    - Criteria to compare two entities:
        - Entity name similarity: exact string matching or substring matching. Output: Boolean (True, False).
        - Attribute set similarity: Jaccard coefficient. Output: decimal number between 0 and 1.
        - Relationship set similarity: Jaccard coefficient. Output: decimal number between 0 and 1.
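The two kinds of criteria on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the name comparison is assumed to be case-insensitive, and the Jaccard coefficient of two empty sets is assumed to be 0.

```python
def name_similarity(name_a: str, name_b: str) -> bool:
    """True on an exact match or when one name is a substring of the other."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if not a or not b:
        return False
    return a == b or a in b or b in a


def jaccard(set_a, set_b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

For example, two attribute sets that share two of four distinct attributes have a Jaccard coefficient of 2/4 = 0.5.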
9. Generation of Dataset B (section 4.3)
    - Input: 20 conceptual models
    - Algorithm:
        - Stem entity names and attribute names (Porter stemmer).
        - Compare each pair of entities from different models based on the three criteria (slide 8).
    - Output: table (598 records), for example:

    | Pair# | Name Similarity | Attribute Similarity | Relationship Similarity |
    | --- | --- | --- | --- |
    | XYZ | Yes | 0.657 | 0.004 |
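The pairing step on this slide can be sketched as follows. This is a hedged Python illustration: `crude_stem` is only a stand-in for the Porter stemmer (it lowercases and drops a trailing plural "s"), and `cross_model_pairs` and the sample entities are hypothetical. The three similarity scores would then be computed for each pair that is produced.

```python
from itertools import combinations


def crude_stem(word: str) -> str:
    """Stand-in for the Porter stemmer: lowercase and drop a plural 's'."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


def cross_model_pairs(entities):
    """Yield every pair of entities that come from different models.

    `entities` is a list of (model_id, entity_name) tuples; names are
    stemmed before they are paired.
    """
    for (m1, e1), (m2, e2) in combinations(entities, 2):
        if m1 != m2:  # entities of the same model are never compared
            yield (m1, crude_stem(e1)), (m2, crude_stem(e2))


# Hypothetical input: two models with two entities each.
entities = [(1, "Genes"), (1, "Sequence"), (2, "Gene"), (2, "Alignments")]
pairs = list(cross_model_pairs(entities))
```

With two models of two entities each, this yields four cross-model pairs and skips the two same-model pairs.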
10. Multivariate Analysis of Dataset B (section 4.4)
    - Manually annotate whether each pair represents similar entities (“Match” column).
    - 60 matches and 538 mismatches were found.

    | Pair# | Name Sim. | Attribute Sim. | Relationship Sim. | Match |
    | --- | --- | --- | --- | --- |
    | XYZ | Yes | 0.657 | 0.004 | Yes |

    - Is this a good classification model? Can it correctly identify matching and non-matching pairs?
    - Which technique is suitable to answer these questions? Binary logistic regression.
        - The predictor variables are a combination of continuous and categorical variables.
        - Name_Sim (categorical), Attr_Sim (continuous), Rel_Sim (continuous)
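For reference, a binary logistic regression with these three predictors has the standard form below. The coefficient symbols $\beta_0, \dots, \beta_3$ are notation introduced here, and coding Name_Sim as 0/1 is an assumption; the slides do not state how the variable was coded.

$$
\ln\frac{P(\text{Match}=1)}{1-P(\text{Match}=1)} = \beta_0 + \beta_1\,\text{Name\_Sim} + \beta_2\,\text{Attr\_Sim} + \beta_3\,\text{Rel\_Sim}
$$

A pair is then classified as a match when the fitted probability exceeds a cutoff, commonly 0.5.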
11. Results (section 5)
    - Binary logistic regression
        - Independent variables (IV): Name_Sim, Attr_Sim, Rel_Sim
        - Dependent variable (DV): Match
    - Case I: IV = Name_Sim
    - Case II: IV = Name_Sim, Attr_Sim
    - Case III: IV = Name_Sim, Rel_Sim
    - Case IV: IV = Name_Sim, Attr_Sim, Rel_Sim
12. Results: Case I and Case II (section 5.1)
    - Case I: DV = Match, IV = Name_Sim
        - (+) Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.
        - (+) In the “Variables in the Equation” table, the constant and Sim_name are both significant.
        - (+) Nagelkerke R square = .469
        - (-) Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.
        - (-) The -2 Log Likelihood is very high (309.673).
        - (-) Cox and Snell R square = .263
    - Case II: DV = Match, IV = Name_Sim, Attr_Sim
        - (+) Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.
        - (+) In the “Variables in the Equation” table, the constant and Sim_name are both significant.
        - (+) Nagelkerke R square = .470
        - (-) Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.
        - (-) The -2 Log Likelihood is very high (309.622).
        - (-) Cox and Snell R square = .264
        - (-) In the “Variables in the Equation” table, Sim_Attr is not significant.
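The rates quoted on this slide all derive from a 2×2 classification table. The Python sketch below shows the definitions. The four counts are an assumption: the slides do not give the classification table, and these values were chosen only because they reproduce the reported percentages for 598 pairs, so they should not be read as the study's actual counts. The function name is hypothetical.

```python
def classification_rates(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard rates derived from a 2x2 classification table."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # correct predictions overall
        "sensitivity": tp / (tp + fn),      # true matches that were found
        "specificity": tn / (tn + fp),      # true mismatches that were found
        "fn_rate": fn / (tp + fn),          # equals 1 - sensitivity
        "fp_rate": fp / (tn + fp),          # equals 1 - specificity
    }


# Illustrative counts only (assumed, not from the slides); they sum to 598.
rates = classification_rates(tp=51, fn=35, tn=503, fp=9)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Because the FN rate is the complement of sensitivity and the FP rate is the complement of specificity, a gain in one of each pair always appears as an equal drop in the other.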
13. Results: Case III and Case IV (section 5.2)
    - Case III: DV = Match, IV = Name_Sim, Rel_Sim
        - (+) Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.
        - (+) In the “Variables in the Equation” table, the constant and Sim_name are both significant.
        - (+) Nagelkerke R square = .470
        - (-) Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.
        - (-) The -2 Log Likelihood is very high (309.622).
        - (-) Cox and Snell R square = .264
        - (-) In the “Variables in the Equation” table, Sim_Rel is not significant.
    - Case IV: DV = Match, IV = Name_Sim, Attr_Sim, Rel_Sim
        - (+) Accuracy increased from 85.6% to 92.6%, sensitivity increased from 0 to 59.3%, and the FN rate dropped from 100% to 40.7%.
        - (+) In the “Variables in the Equation” table, the constant and Sim_name are both significant.
        - (+) Nagelkerke R square = .471
        - (-) Specificity decreased from 100% to 98.24%, and the FP rate increased from 0 to 1.75%.
        - (-) The -2 Log Likelihood is very high (308.818).
        - (-) Cox and Snell R square = .265
        - (-) In the “Variables in the Equation” table, Sim_Attr and Sim_rel are not significant.
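For readers unfamiliar with the two pseudo R-square statistics reported in these cases, their usual definitions are given below. The notation is introduced here and does not come from the slides: $L_0$ is the likelihood of the intercept-only model, $L_M$ that of the fitted model, $D = -2\ln L$ is the corresponding -2 Log Likelihood, and $n$ is the number of observations.

$$
R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n} = 1 - \exp\!\left(-\frac{D_0 - D_M}{n}\right)
$$

$$
R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}} = \frac{R^2_{CS}}{1 - \exp\!\left(-D_0/n\right)}
$$

The Nagelkerke statistic rescales the Cox and Snell value so that its maximum is 1, which is why it is always the larger of the two.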
14. Inferences (section 6)
    - Of the three predictor variables (Name_Sim, Rel_Sim, and Attr_Sim), only Name_Sim is a good predictor of the actual class of an observation.
    - The misclassified cases are mainly observations that require some domain knowledge, e.g., BLASTP is the same as Protein Sequence, and TBLASTX is the same as Nucleotide Sequence.
15. Future Work (section 7)
    - Improve the similarity function
    - Use domain dictionaries
    - Include more models
    - Generate a new classification function
    - Cluster entities that are found to be similar
16. References
    - NAR Journal dataset
    - Porter’s Stemming Algorithm: http://tartarus.org/~martin/PorterStemmer/
    - Sharma, S. (1995). Applied Multivariate Techniques. John Wiley & Sons, Inc., New York, NY, USA.
    - INFO 692 Lecture Handouts
17. Thank You. Questions, Comments, Ideas…?