Development of health measurement scales

If you cannot express in numbers something that you are describing, you probably have little knowledge about it.
RELIABILITY AND VALIDITY
15/03/2013

Dr. Priyamadhaba Behera
Junior Resident, AIIMS
Why do you need to worry about reliability and validity?
What happens with low reliability and validity?

What is the relationship between reliability and validity?
Do you always need validity? Or always reliability? Or both?

What is the minimum reliability needed for a scale?
Validity & Reliability

No matter how well the objectives are written, or how clever the items, the quality and usefulness of an examination is predicated on validity and reliability.
We don't say "an exam is valid and reliable."
We do say "the exam score is reliable and valid for a specified purpose"; that qualifier is the key element.
Reliability vs. Validity
Validity
• Two steps to determine the usefulness of a scale
   – Reliability – necessary but not sufficient
   – Validity – the next step
• Validity – is the test measuring what it is meant to measure?
• Two important issues
   – The nature of what is being measured
   – The relationship of that variable to its purported cause
      • Serum creatinine is a measure of kidney function because we know it is regulated by the kidneys
      • But will students who do volunteer work become better doctors?
• Since our understanding of human behaviour is far from perfect, such predictions have to be validated against actual performance
Types of validity
• Three Cs (conventionally)
   – Content
   – Criterion
      • Concurrent
      • Predictive
   – Construct
      • Convergent, discriminant, trait, etc.
• Others (face validity)

All types of validity address the same issue: the degree of confidence we can place in the inferences we can draw from the scales.
Differing perspectives
• Previously, validity was seen as demonstrating the properties of the scale
• Current thinking – what inferences can be made about the people who have given rise to the scores on these scales?
   – Thus validation is a process of hypothesis testing (someone who scores high on test A will do worse on test B, and will differ from people who do better on tests C and D)
   – Researchers are limited only by their imagination in devising experiments to test such hypotheses
Validity & Reliability
• Face validity
   – On the face of it, the tool appears to be measuring what it is supposed to measure
   – A subjective judgment by one or more experts, rarely by any empirical means

• Content validity
   – Measures whether the tool includes all relevant domains or not
   – Closely related to face validity
   – Also known as 'validity by assumption', because an expert says so

• There are certain situations where these may not be desired
Content validity
• Example – a cardiology exam
   – Assume it covers all aspects of the circulatory system (physiology, anatomy, pathology, pharmacology, etc.)
   – If a person scores high on this test, we can infer that he knows much about the subject (i.e., our inferences about the person will hold right across various situations)
   – In contrast, if the exam did not contain anything about circulation, the inferences we make about a high scorer may be wrong most of the time, and vice versa
• Generally, a measure that includes a more representative sample of the target behaviour will have more content validity and hence lead to more accurate inferences

• Reliability places an upper limit on validity (the maximum validity is the square root of the reliability coefficient): the higher the reliability, the higher the maximum possible validity
   – One exception is the relationship between internal consistency and validity (it is better to sacrifice internal consistency for content validity)
   – The ultimate aim of a scale is inferential, which depends more on content validity than on internal consistency
Criterion validity
• Correlation of a scale with an accepted 'gold standard'
• Two types
   – Concurrent – both the new scale and the standard scale are given at the same time
   – Predictive – the gold-standard result will be available some time in the future (e.g., an entrance test for college admission used to assess whether a person will graduate or not)
• Why develop a new scale when we already have a criterion scale?
   – Diagnostic utility/substitutability (the criterion is expensive, invasive, dangerous or time-consuming)
   – Predictive utility (the decision cannot wait for the criterion, so it is made on the basis of the new scale)
• Criterion contamination
   – If the result of the gold standard is in part determined in some way by the result of the new test, it may lead to an artificially high correlation
Construct validity
• Height and weight are readily observable
• Psychological attributes – anxiety, pain, intelligence – are abstract variables and can't be directly observed
• For example, anxiety – we say that a person has anxiety if he has sweaty palms, tachycardia, pacing back and forth, difficulty in concentrating, etc. (i.e., we hypothesize that these symptoms are the result of anxiety)
• Such proposed underlying factors are called hypothetical constructs, or simply constructs (e.g., anxiety, illness behaviour)
• Such constructs arise from larger theories or clinical observations
• Most psychological instruments tap some aspect of a construct
Establishing construct validity
• IBS is a construct rather than a disease – it is a diagnosis of exclusion
• A large vocabulary, wide knowledge and problem-solving skills – what is the underlying construct?
• Many clinical syndromes are constructs rather than actual entities (schizophrenia, SLE)
• Initial scales for IBS relied on ruling out other organic diseases plus some physical signs and symptoms
   – These scales were inadequate because they led to many missed and wrong diagnoses
   – New scales were developed incorporating demographic and personality features
• Now, how do we assess the validity of this new scale?
   – Based on theory, high scorers on this scale should have
      • Symptoms that will not clear with conventional therapy
      • A lower prevalence of organic bowel disease on autopsy
Differences from other types
1. Content and criterion validity can be established in one or two studies, but there is no single experiment that can prove a construct
   • Construct validation is an ongoing process: learning more about the construct, making new predictions and then testing them
   • Each supportive study strengthens the construct, but one well-designed negative study can question the entire construct
2. We are assessing the theory as well as the measure at the same time
IBS example
• We had predicted that IBS patients will not respond to conventional therapy
• Assume that we gave the test to a sample of patients with GI symptoms and treated them with conventional therapy
• If high-scoring patients responded in the same proportion as low scorers, then there are 3 possibilities
   – Our scale is good but the theory is wrong
   – Our theory is good but the scale is bad
   – Both scale and theory are bad
• We can identify the reason only from further studies
• If an experimental design is used to test the construct, then in addition to the above possibilities our experiment may be flawed
• Ultimately, construct validity doesn't differ conceptually from other types of validity
   – "All validity is at its base some form of construct validity… it is the basic meaning of validity" (Guion)
Establishing construct validity
• Extreme groups
• Convergent and discriminant validity
• Multitrait-multimethod matrix
Extreme groups
• Two groups – as decided by clinicians
   – One with IBS and the other with some other GI disease
   – Equivocal diagnoses eliminated
• Two problems
   – That we are able to separate two extreme groups implies that we already have a tool which meets our needs (however, we can do bootstrapping)
   – This is not sufficient; the real use of a scale is in making much finer discriminations. But such studies can be a first step: if the scale fails this, it will probably be useless in practical situations
Multitrait-multimethod matrix
•   Two unrelated traits/constructs, each measured by two different methods
•   E.g. two traits – anxiety, intelligence; two methods – a rater, an exam

                                 Anxiety            Intelligence
                              Rater    Exam       Rater    Exam
    Anxiety        Rater      0.53
                   Exam       0.42     0.79
    Intelligence   Rater      0.18     0.17       0.58
                   Exam       0.15     0.23       0.49     0.88

     –   Diagonal entries (0.53, 0.79, 0.58, 0.88) – reliabilities of the four instruments (should be the highest values)
     –   Same trait, different method (0.42, 0.49) – homotrait-heteromethod correlations (convergent validity)
     –   Different trait, same method (0.18, 0.23) – heterotrait-homomethod correlations (divergent validity)
     –   Different trait, different method (0.17, 0.15) – heterotrait-heteromethod correlations (should be the lowest)

•   A very powerful method, but it is very difficult to get such a combination
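A minimal Python sketch of the idea, using simulated data rather than the figures above: two uncorrelated latent traits are each measured by two noisy methods, and the resulting 4 × 4 correlation matrix shows the expected convergent/divergent pattern. All names and noise levels are illustrative.

```python
# Minimal sketch (not from the original slides): simulating a multitrait-
# multimethod matrix for two traits (anxiety, intelligence) measured by two
# methods (rater, exam). All variable names and noise levels are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Latent, roughly uncorrelated traits
anxiety = rng.normal(size=n)
intelligence = rng.normal(size=n)

# Each method = trait signal + method-specific noise (rater noisier than exam)
measures = np.column_stack([
    anxiety + rng.normal(scale=0.9, size=n),        # anxiety, rater
    anxiety + rng.normal(scale=0.5, size=n),        # anxiety, exam
    intelligence + rng.normal(scale=0.8, size=n),   # intelligence, rater
    intelligence + rng.normal(scale=0.4, size=n),   # intelligence, exam
])

labels = ["Anx-Rater", "Anx-Exam", "Int-Rater", "Int-Exam"]
corr = np.corrcoef(measures, rowvar=False)

# Same-trait/different-method cells should be high (convergent validity);
# different-trait cells should be near zero (divergent validity).
print(labels)
print(np.round(corr, 2))
```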
• Convergent validity – if there are two measures of the same construct, they should correlate with each other, but not too highly (otherwise the new measure is redundant)
  E.g. an index of anxiety and an ANS awareness index

• Divergent validity – the measure should not correlate with a measure of a different construct, e.g. an anxiety index and an intelligence index
Biases in validity assessment
• Restriction in range
   – May be in the new scale (e.g., MAO level)
   – May be in the criterion (e.g., depression score)
   – May be in a third variable correlated with both (e.g., severity)
• E.g., a high correlation was found between MAO levels and depression scores in a community-based study, but on replicating the study in hospital the correlation was low
Validity & Reliability

Content/Action  +  Error
   – Content/Action: the information we seek and our best hope for obtaining it
   – Error: our human frailty and inability to write effective questions
The maximum validity of a test is the square root of its reliability coefficient. Reliability places an upper limit on validity, so the higher the reliability, the higher the maximum possible validity.

Variance = sum of (individual value − mean value)² / number of values
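A minimal Python sketch of this variance formula, using made-up values; note that it divides by the number of values, as above, whereas the sample variance divides by n − 1.

```python
# Minimal sketch of the variance formula above (illustrative values).
import numpy as np

values = np.array([4.0, 7.0, 6.0, 5.0, 8.0])
mean = values.mean()

variance = ((values - mean) ** 2).sum() / len(values)   # divide by n, as in the formula
print(variance)                                          # same as np.var(values), which uses n by default
print(np.var(values, ddof=1))                            # sample variance divides by n - 1 instead
```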
Reliability
• Is our tool measuring the attribute in a reproducible fashion or not?
• A way to show the amount of error (random and systematic) in any measurement
• Sources of error – observers, instruments, instability of the attribute
• Day-to-day encounters
   – Weighing machine, watch, thermometer
Assessing Reliability
• Internal Consistency
   – The average correlation among all the items in the tool
      • Item-total correlation
      • Split half reliability
      • Kuder-Richardson 20 & Cronbach’s alpha
      • Multifactor inventories
• Stability
   – Reproducibility of a measure on different occasions
      • Inter-Observer reliability
      • Test-Retest reliability (Intra-Observer reliability)
Internal consistency
• All items in a scale should tap different aspects of the same attribute and not different traits
• Items should be moderately correlated with each other, and each item with the total
• Two schools of thought
   – If the aim is to describe a trait/behaviour/disorder
   – If the aim is to discriminate people with the trait from those without
• The trend is towards scales that are more internally consistent
• Internal consistency doesn't apply to multidimensional scales
Item-total correlation
• The oldest approach, still used
• Correlation of each item with the total score without that item
• For k items we have to calculate k correlations, which is laborious
• An item should be discarded if r < 0.20 (Kline, 1986)
• Pearson's r is best; for dichotomous items, use the point-biserial correlation
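A minimal Python sketch of corrected item-total correlations on simulated data (all values illustrative): each item is correlated with the total of the remaining items, and items with r < 0.20 are flagged.

```python
# Minimal sketch (illustrative data): corrected item-total correlations,
# i.e. each item against the total of the remaining items.
import numpy as np

rng = np.random.default_rng(1)
true_score = rng.normal(size=200)
# Five items tapping the same attribute, the last one mostly noise
items = np.column_stack([true_score + rng.normal(scale=s, size=200)
                         for s in (0.8, 0.9, 1.0, 1.1, 5.0)])

for j in range(items.shape[1]):
    rest_total = items.sum(axis=1) - items[:, j]          # total without this item
    r = np.corrcoef(items[:, j], rest_total)[0, 1]
    flag = "  <- consider discarding (r < 0.20)" if r < 0.20 else ""
    print(f"item {j + 1}: r = {r:.2f}{flag}")
```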
Split-half reliability
• Divide the items into two halves and calculate the correlation between them
• Underestimates the true reliability because we are reducing the length of the scale by half (r is directly related to the number of items)
   – Corrected by the Spearman-Brown formula
• Should not be used with chained items
• Difficulties – there are many ways to divide a test, and the result doesn't point to which item is contributing to poor reliability
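A minimal Python sketch on simulated items: an odd-even split, the correlation between the two half-scores, and the Spearman-Brown correction back to full length.

```python
# Minimal sketch (illustrative data): odd-even split-half reliability with the
# Spearman-Brown correction for halving the test length.
import numpy as np

rng = np.random.default_rng(2)
true_score = rng.normal(size=300)
items = np.column_stack([true_score + rng.normal(size=300) for _ in range(10)])

half_a = items[:, 0::2].sum(axis=1)      # odd-numbered items
half_b = items[:, 1::2].sum(axis=1)      # even-numbered items

r_half = np.corrcoef(half_a, half_b)[0, 1]
r_full = 2 * r_half / (1 + r_half)       # Spearman-Brown: reliability of the full-length test

print(f"half-test correlation: {r_half:.2f}")
print(f"Spearman-Brown corrected: {r_full:.2f}")
```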
KR-20/Cronbach's alpha
• KR-20 is used for dichotomous responses
• Cronbach's alpha is used for more than two response options
• They give the average of all possible split-half reliabilities of a scale
• If removing an item increases the coefficient, that item should be discarded
• Problems
   – Depends on the number of items
   – A scale with two different sub-scales will probably still yield a high alpha
   – A very high alpha denotes redundancy (asking the same question in slightly different ways)
   – Thus alpha should be more than 0.70 but not more than 0.90
• Cronbach's basic equation for alpha

      α = (n / (n − 1)) × (1 − ΣVi / Vtest)

   – n = number of questions
   – Vi = variance of scores on each question
   – Vtest = total variance of overall scores on the entire test
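A minimal Python sketch computing Cronbach's alpha directly from this equation on simulated item scores (data and values illustrative).

```python
# Minimal sketch (illustrative data): Cronbach's alpha computed directly from
# the equation above, alpha = n/(n-1) * (1 - sum(Vi)/Vtest).
import numpy as np

rng = np.random.default_rng(3)
true_score = rng.normal(size=250)
items = np.column_stack([true_score + rng.normal(size=250) for _ in range(8)])

n_items = items.shape[1]
item_variances = items.var(axis=0, ddof=1)          # Vi for each question
total_variance = items.sum(axis=1).var(ddof=1)      # Vtest for the summed score

alpha = (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")             # aim for roughly 0.70-0.90
```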
Multifactor inventories
• More sophisticated techniques
• Item-total procedure – each item should correlate with the total of its own scale and with the total of all the scales
• Factor analysis
   – Determining the underlying factors
   – For example, if there are five tests
      • Vocabulary, fluency, phonetics, reasoning and arithmetic
      • We can theorize that the first three would be correlated under a factor called a 'verbal factor' and the last two under a 'logic factor'
Stability/Measuring error
• A weighing machine shows weights in a range of, say, 40-80 kg; an error of ±1 kg is therefore judged against that range of observations
• In reality we calculate the ratio
   variability between subjects / total variability
   (total variability includes both subject variability and measurement error)
• So that a ratio of
   – 1 indicates no measurement error (perfect reliability)
   – 0 indicates that all of the observed variability is measurement error
• Reliability =
  subject variability / (subject variability + measurement error)

• Statistically, 'variance' is the measure of variability, so
• Reliability =
  SD² of subjects / (SD² of subjects + SD² of error)

• Thus reliability is the proportion of the total variance that is due to the 'true' differences between the subjects

• Reliability has meaning only when applied to specific populations
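A minimal sketch of this ratio in Python, with made-up standard deviations, showing why the same instrument yields different reliabilities in homogeneous and heterogeneous populations.

```python
# Minimal sketch: reliability as the ratio of subject variance to total variance,
# using made-up standard deviations to show the dependence on the population.
sd_error = 5.0                                   # same measurement error in both settings

for label, sd_subjects in [("homogeneous population", 4.0),
                           ("heterogeneous population", 15.0)]:
    reliability = sd_subjects**2 / (sd_subjects**2 + sd_error**2)
    print(f"{label}: reliability = {reliability:.2f}")
# The identical instrument looks far more reliable in the heterogeneous group,
# which is why reliability only has meaning for a specified population.
```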
Calculation of reliability
• The statistical technique used is ANOVA, and since we have repeated measurements in reliability studies, the method is repeated-measures ANOVA
Example
• By the classical definition of reliability, the interpretation is that 88% of the variance is due to the true variance among patients; this proportion is the intraclass correlation coefficient (ICC)
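A minimal Python sketch (not the worked example from the original slides): the two-way repeated-measures ANOVA mean squares are computed by hand for a simulated patients × observers matrix and combined into the Shrout-Fleiss ICC(2,1). All numbers are illustrative.

```python
# Minimal sketch (not the slide's worked example): ICC(2,1) from a two-way
# repeated-measures ANOVA decomposition of a patients x observers rating matrix.
import numpy as np

rng = np.random.default_rng(4)
n_patients, n_obs = 20, 3
true = rng.normal(50, 10, size=(n_patients, 1))              # true patient scores
ratings = true + rng.normal(0, 4, size=(n_patients, n_obs))  # plus rating error

grand = ratings.mean()
row_means = ratings.mean(axis=1, keepdims=True)   # per patient
col_means = ratings.mean(axis=0, keepdims=True)   # per observer

bms = n_obs * ((row_means - grand) ** 2).sum() / (n_patients - 1)    # between patients
jms = n_patients * ((col_means - grand) ** 2).sum() / (n_obs - 1)    # between observers
ems = ((ratings - row_means - col_means + grand) ** 2).sum() / ((n_patients - 1) * (n_obs - 1))

# Shrout & Fleiss ICC(2,1): observers treated as a random sample
icc = (bms - ems) / (bms + (n_obs - 1) * ems + n_obs * (jms - ems) / n_patients)
print(f"ICC(2,1) = {icc:.2f}")   # proportion of variance due to true patient differences
```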
Fixed/random factor
• What happened to the variance due to observers?
• Will the same observers be used in practice, or are they a random sample of possible observers?
• Another situation where observations may be treated as fixed is subjects answering the same items on a scale
Other types of reliability
• So far we have only examined the effect of different observers rating the same behaviour
• But there can also be error due to day-to-day differences; if we measure the same behaviour a week or two apart we can calculate an 'intra-observer reliability coefficient'
• If there are no observers (self-rated tests) we can still calculate 'test-retest reliability'
• Usually high inter-observer reliability is sufficient, but if it is low then we may have to calculate intra-observer reliability to determine the source of unreliability

• Mostly, measures of internal consistency are reported as 'reliability', because they are easily computed in a single sitting
   – Hence caution is required, as they may not capture variability due to day-to-day differences
Different forms of reliability coefficient
• So far we have seen forms of the ICC
• Others
   – Pearson product-moment correlation
   – Cohen's kappa
   – Bland-Altman analysis
Pearson's correlation
• Based on regression – the extent to which the relation between two variables can be described by a straight line
Limitations of Pearson's r
• A perfect fit of 1.0 may be obtained even if the intercept is non-zero and the slope is not equal to one, unlike with the ICC
• So Pearson's r will be higher than the true reliability, but in practice it is usually close to the ICC because the predominant source of error is random variation
• If there are multiple observers, multiple pairwise rs are required, unlike the single ICC

• For example, with 10 observers there will be 45 Pearson's rs but only one ICC
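A minimal Python sketch with simulated raters: a systematic offset and scaling between observers leaves Pearson's r near 1.0, while a one-way ICC, computed here by hand, is clearly lower.

```python
# Minimal sketch (illustrative data): Pearson's r ignores a systematic offset
# and scaling between two observers, while a one-way ICC penalises it.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
obs_a = rng.normal(50, 10, size=40)
obs_b = 1.2 * obs_a + 5 + rng.normal(0, 1, size=40)   # same ranking, different calibration

r, _ = pearsonr(obs_a, obs_b)

# One-way random-effects ICC(1,1) from the ANOVA mean squares
x = np.column_stack([obs_a, obs_b])
n, k = x.shape
subj_means = x.mean(axis=1, keepdims=True)
bms = k * ((subj_means - x.mean()) ** 2).sum() / (n - 1)
wms = ((x - subj_means) ** 2).sum() / (n * (k - 1))
icc = (bms - wms) / (bms + (k - 1) * wms)

print(f"Pearson r = {r:.2f}, ICC(1,1) = {icc:.2f}")   # r is near 1.0, ICC is noticeably lower
```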
Kappa coefficient

• Used when responses are dichotomous/categorical

• When the frequency of positive results is very low or very high, chance agreement is high and kappa becomes low and unstable, even when raw agreement is high

• Weighted kappa focuses on disagreement; cells are weighted according to their distance from the diagonal of agreement

• Weighting can be arbitrary or can use quadratic weights (based on the square of the amount of discrepancy)

• The quadratic scheme of weighted kappa is equivalent to the ICC
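A minimal Python sketch with made-up ratings: Cohen's kappa and a quadratically weighted kappa computed by hand from the observed and chance-expected confusion matrices.

```python
# Minimal sketch (illustrative ratings): Cohen's kappa and quadratically
# weighted kappa computed from a confusion matrix of two raters.
import numpy as np

rater1 = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1, 2, 1])
rater2 = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 1, 2, 0])
k = 3                                                  # number of categories

obs = np.zeros((k, k))
for a, b in zip(rater1, rater2):
    obs[a, b] += 1
obs /= obs.sum()                                       # observed proportions
exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))       # expected under chance

# Unweighted kappa
po, pe = np.trace(obs), np.trace(exp)
kappa = (po - pe) / (1 - pe)

# Quadratic disagreement weights: 0 on the diagonal, largest far from it
i, j = np.indices((k, k))
w = (i - j) ** 2 / (k - 1) ** 2
kappa_w = 1 - (w * obs).sum() / (w * exp).sum()

print(f"kappa = {kappa:.2f}, quadratic weighted kappa = {kappa_w:.2f}")
```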
Bland and Altman method
• A plot of the difference between two observations against the mean of the two observations
• Agreement is expressed as the 'limits of agreement'. The presentation of the 95% limits of agreement allows visual judgement of how well two methods of measurement agree: the smaller the range between these two limits, the better the agreement.

• The question of how small is small depends on the clinical context: would a difference between measurement methods as extreme as that described by the 95% limits of agreement meaningfully affect the interpretation of the results?

• Limitation – the onus is placed on the reader to juxtapose the calculated error against some implicit notion of true variability
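A minimal Python sketch with simulated measurements: the bias and 95% limits of agreement are computed as the mean difference ± 1.96 SD of the differences; the plot itself is omitted.

```python
# Minimal sketch (illustrative data): bias and 95% limits of agreement for two
# measurement methods, as in a Bland-Altman analysis.
import numpy as np

rng = np.random.default_rng(6)
method_a = rng.normal(100, 15, size=50)
method_b = method_a + 2 + rng.normal(0, 5, size=50)   # small bias plus random disagreement

diff = method_b - method_a
mean_pair = (method_a + method_b) / 2                  # x-axis of the Bland-Altman plot

bias = diff.mean()
sd_diff = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

print(f"bias = {bias:.1f}")
print(f"95% limits of agreement: {loa_low:.1f} to {loa_high:.1f}")
# Whether that range is acceptably narrow is a clinical judgement, not a statistical one.
```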
Standards for the magnitude of reliability coefficients
• How much reliability is good? Suggested standards include 0.94 (Kelly) and 0.85 (Stewart)
• A test used for judgments about individuals should have a higher reliability than one used for research on groups
• For research purposes
   – Using mean scores and a larger sample size will reduce the error
   – Conclusions are usually made after a series of studies
   – Acceptable reliability therefore depends on the sample size: in a sample of 1,000 a lower reliability may be tolerable than in a sample of 10
Reliability and probability of misclassification
• Depends on the properties of the instrument and the choice of cut-point
• Relation between reliability and the likelihood of misclassification
   – E.g., in a sample of 100, one person is ranked 25th and another 50th
   – If R is 0, there is a 50% chance that the two will reverse order on retesting
   – If R is 0.5, a 37% chance; with R = 0.8, a 2.2% chance
• Hence an R of 0.75 is the minimum requirement for a useful instrument
Improving reliability
• Increase the subject variance relative to the error variance (by legitimate means and otherwise)
• Reducing error variance
   – Observer/rater training
   – Removing consistently extreme observers
   – Designing better scales
• Increasing true variance
   – In case of a 'floor' or 'ceiling' effect, introduce items that bring performance to the middle of the scale (thus increasing true variance)
      • E.g., fair - good - very good - excellent
• Ways that are not legitimate
   – Testing the scale in a heterogeneous population (e.g., normal subjects and bedridden arthritics)
   – A scale developed in a homogeneous population will show a higher reliability when used in a heterogeneous population
      • correct for attenuation
• The simplest way to increase R is to increase the number of items (statistical theory)
• True variance increases as the square of the number of items, whereas error variance increases only in proportion to the number of items
• If the length of the test is tripled
   – then R(Spearman-Brown) = 3R / (1 + 2R)
• In reality the equation overestimates the new reliability
• We can also use this equation to determine the length of test needed to achieve a pre-decided reliability (see the sketch after this list)
• To improve test-retest reliability, shorten the interval between the tests
• An ideal approach is to examine all the sources of variation and try to reduce the larger ones (generalizability theory)
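A minimal Python sketch of the Spearman-Brown prophecy formula, used both ways: predicting the reliability of a lengthened test and finding the length factor needed to reach a target reliability (values illustrative).

```python
# Minimal sketch: the Spearman-Brown prophecy formula, used both to predict the
# reliability of a lengthened test and to find the length needed for a target.
def lengthened_reliability(r, k):
    """Reliability after making the test k times as long."""
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r, r_target):
    """How many times longer the test must be to reach r_target."""
    return r_target * (1 - r) / (r * (1 - r_target))

r = 0.60
print(lengthened_reliability(r, 3))        # tripling the test: 3R / (1 + 2R) = 0.818...
print(length_factor_needed(r, 0.80))       # factor needed to reach 0.80 (about 2.7)
```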
Summary for Reliability
• Pearson's r is theoretically incorrect but in practice fairly close
• The Bland and Altman method is analogous to the error variance of the ICC but doesn't relate it to the range of observations
• Kappa (with quadratic weights) and the ICC are essentially identical and most appropriate
THANK YOU




Speaker notes

1. Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations covary with some other observation which possesses greater reliability, or is more intimately correlated with relevant experimental changes, than is the original measure.