1. Tests, evaluation and teacher dismissal
John Cronin, Ph.D.
Director
The Kingsbury Center @ NWEA
2. Tests, evaluation and teacher dismissal
Presenter - John Cronin, Ph.D.
Contacting us:
Rebecca Moore: 503-548-5129
E-mail: rebecca.moore@nwea.org
This presentation can be viewed at:
http://www.slideshare.net/JFCronin/ed-reform-lecture-university-of-arkansas
3. If one objective of evaluation reform was
to make it easier to dismiss ineffective
teachers, in most states the reforms are
likely to make dismissal more difficult.
4. Problems
• If tests are the controlling evidence in a dismissal, expect
expensive battles of experts.
• Title VII claims are likely if evaluation systems have disparate
impact, especially in states using less robust models like
the Colorado Growth Model.
• Many states implementing evaluation reform have enacted
stricter procedural requirements, particularly around
classroom observation.
• Rating systems can be manipulated, in favor of and against
educators.
• The threats of cheating and gaming are underestimated, and
risks are greater as we move to growth measurement.
5. How tests are used to evaluate teachers and
principals
6. Measurement Issues
Measuring a teacher’s
contribution to learning is
inexact.
10. Reliability of teacher value-added
estimates
Teachers with growth scores in lowest and
highest quintile over two years using NWEA’s
Measures of Academic Progress
            Bottom quintile    Top quintile
            Y1 & Y2            Y1 & Y2
Number      59/493             63/493
Percent     12%                13%

r = .64, r² = .41
Typical r values for measures of teaching effectiveness range
between .30 and .60 (Brown Center on Education Policy, 2010)
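The arithmetic behind the table can be checked with a short sketch. The counts and the r value come from the slide; treating a quintile as exactly one fifth of the 493 teachers, and the resulting "repeat rate", are assumptions added here for illustration.

```python
# Figures taken from the table above.
TOTAL_TEACHERS = 493
BOTTOM_BOTH_YEARS = 59
TOP_BOTH_YEARS = 63
R = 0.64  # year-to-year correlation reported on the slide

# Share of all teachers who landed in the same extreme quintile in both years.
bottom_share = BOTTOM_BOTH_YEARS / TOTAL_TEACHERS
top_share = TOP_BOTH_YEARS / TOTAL_TEACHERS

# r squared: proportion of variance in one year's scores shared with the other.
r_squared = R ** 2

# Assumption: a quintile holds exactly one fifth of the teachers (98.6 here).
quintile_size = TOTAL_TEACHERS / 5
bottom_repeat_rate = BOTTOM_BOTH_YEARS / quintile_size
top_repeat_rate = TOP_BOTH_YEARS / quintile_size

print(f"bottom both years: {bottom_share:.0%}, top both years: {top_share:.0%}")
print(f"r^2 = {r_squared:.2f}")
print(f"bottom-quintile repeat rate: {bottom_repeat_rate:.0%}")
print(f"top-quintile repeat rate: {top_repeat_rate:.0%}")
```

Under that assumption, roughly 60% of bottom-quintile teachers and 64% of top-quintile teachers stayed in the same quintile the following year, so a substantial minority moved out of it.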
12. Issues in the use of growth and value-
added measures
“Among those who ranked in the top
category on the TAKS reading test, more
than 17% ranked among the lowest two
categories on the Stanford. Similarly
more than 15% of the lowest value-added
teachers on the TAKS were in the highest
two categories on the Stanford.”
Corcoran, S., Jennings, J., & Beveridge, A., Teacher Effectiveness on High and Low Stakes
Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI
(2010).
13. Measurement Issues
It’s about the measurement…
AND conditions...
AND the model.
14. Los Angeles Unified
• Teachers can easily rate in multiple categories
• The choice of model can have a large impact
• Models affect English more than Math
• Teachers do better in some subjects than others
• More complex models don't necessarily favor the t
15. Possible racial bias in models
“Significant evidence of bias plagued the value-added model
estimated for the Los Angeles Times in 2010, including significant
patterns of racial disparities in teacher ratings both by the race of
the student served and by the race of the teachers (see Green,
Baker and Oluwole, 2012). These model biases raise the possibility
that Title VII disparate impact claims might also be filed by teachers
dismissed on the basis of their value-added estimates.
Additional analyses of the data, including richer models using
additional variables mitigated substantial portions of the bias in the
LA Times models (Briggs & Domingue, 2010).”
Baker, B. (2012, April 28).
If it’s not valid, reliability doesn’t matter so much! More on VAM-ing
16. Instability at the tails of the
distribution
“The findings indicate that these modeling
choices can significantly influence outcomes
for individual teachers, particularly those in
the tails of the performance distribution who
are most likely to be targeted by high-stakes
policies.”
Ballou, D., Mokher, C. and Cavalluzzo, L. (2012)
Using Value-Added Assessment for Personnel Decisions: How Omitted Variables and Model Specif
LA Times Teacher #1
LA Times Teacher #2
17. New York City
• Margins of error can be very large
• Increasing n doesn't always decrease the
margin of error
• The margin of error in math is typically smaller
than in reading
18. The problem with spring-spring testing
[Figure: timeline from 3/11 to 3/12 in monthly steps, divided into three spans labeled Teacher 1, Summer, and Teacher 2]
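A rough sketch of why a spring-to-spring growth interval is hard to attribute to one teacher. The test months follow the 3/11 to 3/12 timeline; the exact test days and the school-year end and start dates are hypothetical values chosen only for illustration.

```python
from datetime import date

# Test dates follow the 3/11 to 3/12 timeline; the day of month is assumed.
SPRING_TEST_1 = date(2011, 3, 1)
SPRING_TEST_2 = date(2012, 3, 1)
# Assumed school-year boundaries (not given in the source).
SCHOOL_YEAR_END = date(2011, 6, 15)
SCHOOL_YEAR_START = date(2011, 9, 1)

# Each span of the growth interval and who (if anyone) was teaching.
segments = {
    "Teacher 1": (SPRING_TEST_1, SCHOOL_YEAR_END),
    "Summer": (SCHOOL_YEAR_END, SCHOOL_YEAR_START),
    "Teacher 2": (SCHOOL_YEAR_START, SPRING_TEST_2),
}

total_days = (SPRING_TEST_2 - SPRING_TEST_1).days
days = {label: (end - start).days for label, (start, end) in segments.items()}

for label, n in days.items():
    print(f"{label}: {n} days ({n / total_days:.0%} of the growth interval)")
```

With these assumed dates, only about half of the interval used to score Teacher 2 falls in Teacher 2's classroom; the rest is time with Teacher 1 plus the summer.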
21. Characteristics of value-added metrics
• Value-added metrics always produce winners and
losers.
• Value-added metrics can’t measure progress of the
larger group.
• Extreme performance is more likely to have alternate
explanations.
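A minimal sketch of the first two points, assuming the metric is reported as a relative rank among teachers. The teacher names and growth scores are hypothetical.

```python
def percentile_ranks(scores):
    """Map each teacher to the percent of teachers scoring strictly lower."""
    n = len(scores)
    return {
        name: 100 * sum(other < score for other in scores.values()) / n
        for name, score in scores.items()
    }


# Hypothetical mean growth scores for five teachers.
year1 = {"A": 2.0, "B": 3.5, "C": 5.0, "D": 6.5, "E": 8.0}
# Suppose every teacher's students gain 3 more points the next year.
year2 = {name: score + 3.0 for name, score in year1.items()}

ranks1 = percentile_ranks(year1)
ranks2 = percentile_ranks(year2)

# The whole group improved, yet every teacher's rank is unchanged.
print(ranks1 == ranks2)
```

Because a rank only compares teachers with each other, someone is always at the bottom, and improvement shared by the whole group does not show up.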
23. Translating ranked data to ratings -
principles
• There is no “science” per se around translating a
ranking to a rating. If you call a bottom 40% teacher
ineffective, that is a judgment.
• The rating process can be politicized.
• The process is easy to over-engineer.
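A small sketch of why the cut point is a judgment rather than a measurement. Both schemes below are hypothetical; only the "bottom 40%" cut echoes the example above.

```python
def rate(percentile, cut_points):
    """Return the label of the first band whose upper bound exceeds the percentile."""
    for upper_bound, label in cut_points:
        if percentile < upper_bound:
            return label
    raise ValueError("percentile must be below the last upper bound")


# Two hypothetical ways to turn the same ranking into ratings.
SCHEME_A = [(40, "ineffective"), (80, "effective"), (101, "highly effective")]
SCHEME_B = [(10, "ineffective"), (90, "effective"), (101, "highly effective")]

teacher_percentile = 35
print(rate(teacher_percentile, SCHEME_A))
print(rate(teacher_percentile, SCHEME_B))
```

The same teacher at the 35th percentile is "ineffective" under one scheme and "effective" under the other; nothing in the data chooses between them.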
24. New York Rating System
• 60 points assigned from classroom observation
• 20 points assigned from state assessment
• 20 points assigned from local assessment
• A score of 64 or less is rated ineffective.
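The point allocation above can be sketched as a simple sum. The function names are hypothetical, and only the "ineffective" cutoff is taken from the slide; other rating bands are not modeled.

```python
MAX_OBSERVATION = 60   # classroom observation
MAX_STATE = 20         # state assessment
MAX_LOCAL = 20         # local assessment
INEFFECTIVE_CUTOFF = 64  # a composite at or below this is rated ineffective


def composite_score(observation, state, local):
    """Sum the three components after checking each against its maximum."""
    limits = (
        (observation, MAX_OBSERVATION),
        (state, MAX_STATE),
        (local, MAX_LOCAL),
    )
    for value, maximum in limits:
        if not 0 <= value <= maximum:
            raise ValueError(f"component {value} is outside 0..{maximum}")
    return observation + state + local


def is_ineffective(observation, state, local):
    """True when the composite falls at or below the ineffective cutoff."""
    return composite_score(observation, state, local) <= INEFFECTIVE_CUTOFF


# A perfect observation score with 4 test points in total is still ineffective.
print(is_ineffective(60, 2, 2))
# One more test point moves the same teacher above the cutoff.
print(is_ineffective(60, 3, 2))
```

Taken at face value, these numbers mean that the 60 observation points alone cannot lift a teacher above the cutoff: at least 5 points must come from the two assessment components.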
26. Connecticut requirements
• Criteria for student growth indicator
– Fair to students
• The indicator of academic growth and development is used in such a way as to provide
students an opportunity to show that they have met or are making progress in meeting the
learning objective. The use of the indicator of academic growth and development is as free as
possible from bias and stereotype.
– Fair to teachers
• The use of an indicator of academic growth and development is fair when a teacher has the
professional resources and opportunity to show that his/her students have made growth and
when the indicator is appropriate to the teacher’s content, assignment and class composition.
– Reliable
– Valid
– Useful
• The indicator may be used to provide the teacher with meaningful feedback about student
knowledge, skills, perspective and classroom experience that may be used to enhance student
learning and provide opportunities for teacher professional growth and development.
27. Connecticut requirements
• Components of the evaluation
– Student growth (45%) - including the state test, one non-standardized
indicator, and (optional) one other standardized indicator.
• Requires a beginning of the year, mid-year, and end-of-year conference
– Teacher practice and performance (40%) –
• First and second year teachers – 3 in-class observations
• Developing or below standard – 3 in-class observations
• Proficient or exemplary – 3 observations of practice, one in-class
– Whole-school learning indicator or student feedback (5%)
– Parent or peer feedback (10%)
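A sketch of how the four weights could combine into one rating. The weights come from the slide; the 1 to 4 component scale, the linear combination, and the component names are assumptions, since the source does not say how component results are scored or merged.

```python
# Component weights in percent, as listed above.
WEIGHTS = {
    "student_growth": 45,
    "teacher_practice": 40,
    "whole_school_or_student_feedback": 5,
    "parent_or_peer_feedback": 10,
}


def weighted_rating(component_scores):
    """Combine component scores (assumed 1-4 scale) using the percent weights."""
    if set(component_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four components")
    total = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    return total / 100


# Hypothetical teacher: strong observed practice, weak measured growth.
example = {
    "student_growth": 2,
    "teacher_practice": 4,
    "whole_school_or_student_feedback": 3,
    "parent_or_peer_feedback": 3,
}
print(weighted_rating(example))
```

Under these assumptions, student growth and teacher practice together carry 85% of the result, so the two feedback components can shift the rating only slightly.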
28. Connecticut requirements
Requirements for observations
1. Facilitate and encourage effective means for multiple in-class visits necessary
for gathering evidence of the quality of teacher practice;
2. Provide constructive oral and written feedback of observations in a timely and
useful manner;
3. Provide on-going calibration of evaluators in the district;
4. Use a combination of formal, informal, announced, and unannounced
observation;
5. Consider differentiating the number of observations related to experience,
prior ratings, needs and goals;
6. Include pre- and post-conferences that include deep professional
conversations that allow evaluators and teachers to set goals, allow
administrators to gain insight into the teacher’s progress in addressing issues
and working toward their goals, and share evidence each has gathered during
the year.
29. Cheating
Atlanta Public Schools
Crescendo Charter Schools
Philadelphia Public Schools
Washington DC Public Schools
Houston Independent School
District
Michigan Public Schools
30. Unintended Consequences?
• Principals and teachers may game the system,
inadvertently or intentionally.
• Many principals and teachers (including good ones)
will seek schools or teaching assignments that they
think will improve their results.
• Many teachers will seek opportunities to avoid
grades with standardized tests.
• Ranking metrics can discourage cooperation among
principals and teachers – finding ways to reward
teamwork and cooperation is important.
31. Case Study #1 - Mean value-added performance in mathematics by
school – fall to spring
32. Case Study #1 - Mean spring and fall test duration in minutes by
school
33. Case Study #1 - Mean value-added growth by school and test
duration
34. Case Study #2
Panel 1: Differences in fall-spring test durations
Panel 2: Differences in growth index score based on fall-spring test durations
35. Case Study #2
How much of summer loss is really summer loss?
Panel 1: Differences in spring-fall test durations
Panel 2: Differences in raw growth by spring-fall test duration
36. Case Study #2
Differences in fall-spring test duration (yellow-black) and
Differences in growth index scores (green) by school
37. Negotiated goals – Student Learning
Objectives
• Negotiated goals are not likely to be
challenging
• Negotiated goals leave a potential for
discrimination charges if teachers at a grade
level have different improvement
expectations.
38. An alternate approach
• Give primacy to evaluator observation for judging teachers.
• Focus mandatory observations on low performers.
• Use assessments and value-added measurement to validate
observations.
• Require reassessment when observations and assessment
data are in significant misalignment.
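The last bullet could be sketched as a simple flag. The shared 1 to 4 rating scale and the two-level threshold for "significant misalignment" are hypothetical; the source does not define either.

```python
# Assumed: both ratings use the same 1-4 scale, and a gap of 2 or more
# levels counts as "significant misalignment".
MISALIGNMENT_THRESHOLD = 2


def needs_reassessment(observation_rating, value_added_rating,
                       threshold=MISALIGNMENT_THRESHOLD):
    """Flag a teacher whose observation and test-based ratings diverge widely."""
    return abs(observation_rating - value_added_rating) >= threshold


# Observed as exemplary (4) but rated lowest (1) on value-added: reassess.
print(needs_reassessment(4, 1))
# Ratings one level apart: the test data broadly validates the observation.
print(needs_reassessment(3, 2))
```

In this design the test data never sets the rating by itself; it only triggers a second look when it disagrees sharply with the evaluator.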