1. OBSERVATIONAL STUDIES IN A LEARNING HEALTH SYSTEM
An Institute of Medicine Workshop sponsored by the Patient-Centered Outcomes Research Institute
April 25-26, 2013
The National Academies
2101 Constitution Avenue, NW
Washington, DC 20418
OBSERVATIONAL STUDIES IN A LEARNING HEALTH SYSTEM
Table of Contents
SECTION 1: WORKSHOP FRAMING MATERIALS
• Agenda
• Planning Committee Roster
• Participant List
SECTION 2: ENGAGING THE ISSUE OF BIAS
• Hernan, Miguel A. and Hernandez-Diaz, Sonia. Beyond the intention-to-treat in comparative
effectiveness research. Clinical Trials. 9:48–55. 2012.
• Prasad, Vinay and Jena, Anupam. Prespecified falsification end points: Can they validate true
observational associations? JAMA. 309(3). 2013.
• Walker, Alexander A. Orthogonal predictions: Follow-up questions for suggestive data.
Pharmacoepidemiology and Drug Safety. 19: 529–532. 2010.
• Ryan, PB, et al. Empirical assessment of methods for risk identification in healthcare data:
results from the experiments of the Observational Medical Outcomes Partnership. Statistics
in Medicine. 2012.
• Lorch, SA, et al. The differential impact of delivery hospital on the outcomes of premature
infants. Pediatrics. 130(2). 2012.
• Small, Dylan S. and Rosenbaum, Paul R. War and wages: The strength of instrumental
variables and their sensitivity to unobserved biases. Journal of the American Statistical Association.
103(483). 2008.
• Brookhart, MA, et al. Comparative Mortality Risk of Anemia Management Practices in
Incident Hemodialysis Patients. JAMA. 303(9). 2010.
• Cornfield, J. Principles of Research. Statistics in Medicine. 31:2760-2768. 2012.
SECTION 3: GENERALIZING RCT RESULTS TO BROADER POPULATIONS
• Kaizar, Eloise E. Estimating treatment effect via simple cross design synthesis. Statistics in
Medicine. 30:2986–3009. 2011.
• Go, AS, et al. Anticoagulation therapy for stroke prevention in atrial fibrillation: How well
do randomized trials translate into clinical practice? JAMA. 290(20). 2003.
• Hernan, MA, et al. Observational studies analyzed like randomized experiments: An
application to postmenopausal hormone therapy and Coronary Heart Disease. Epidemiology.
19(6). 2008.
• Weintraub, WS, et al. Comparative effectiveness of revascularization strategies. The New
England Journal of Medicine. 366(16). 2012.
SECTION 4: DETECTING TREATMENT EFFECT HETEROGENEITY
• Hlatky, MA, et al. Coronary artery bypass surgery compared with percutaneous coronary
interventions for multivessel disease: A collaborative analysis of individual patient data from
ten randomised trials. The Lancet. 373: 1190–97. 2009.
• Kent, DM, et al. Assessing and reporting heterogeneity in treatment effects in clinical trials:
A proposal. Trials. 11:85. 2010.
• Kent, David M. and Hayward, Rodney A. Limitations of applying summary results of clinical
trials to individual patients: The need for risk stratification. JAMA. 298(10):1209-1212. 2007.
• Basu, A, et al. Heterogeneity in Action: The Role of Passive Personalization in Comparative
Effectiveness Research. 2012
• Basu, Anirban. Estimating Person-centered Treatment (PeT) Effects using Instrumental
Variables: An application to evaluating prostate cancer treatments. 2013.
SECTION 5: PREDICTING INDIVIDUAL RESPONSES
• Byar, David P. Why Databases should not replace Randomized Clinical Trials. Biometrics.
36(2): 337-342. 1980.
• Lee, KL, et al. Clinical judgment and statistics. Lessons from a simulated randomized trial in
coronary artery disease. Circulation. 61:508-515. 1980.
• Pencina, Michael J. and D’Agostino, Ralph B. Thoroughly modern risk prediction? Science
Translational Medicine. 4(131). 2012.
• Tatonetti, NP, et al. Data-driven prediction of drug effects and interactions. Science
Translational Medicine. 4(125). 2012.
SECTION 6: ORGANIZATIONAL BACKGROUND
• IOM Roundtable on Value & Science-Driven Health Care (VSRT)
1. VSRT Background Information and Roster
2. VSRT Charter and Vision
3. Clinical Effectiveness Research Innovation Collaborative Background Information
• Patient-Centered Outcomes Research Institute (PCORI)
1. National Priorities for Research and Research Agenda
SECTION 7: BIOGRAPHIES AND MEETING LOGISTICS
• Planning Committee Biographies
• Speaker Biographies
• Location, Hotel, and Travel
OBSERVATIONAL STUDIES IN A LEARNING HEALTH SYSTEM
An Institute of Medicine Workshop
Sponsored by the Patient-Centered Outcomes Research Institute
A LEARNING HEALTH SYSTEM ACTIVITY
IOM ROUNDTABLE ON VALUE & SCIENCE-DRIVEN HEALTH CARE
APRIL 25-26, 2013
THE NATIONAL ACADEMY OF SCIENCES
2101 CONSTITUTION AVENUE, NW
WASHINGTON, DC
Day 1: Thursday, April 25th
8:00 am Coffee and light breakfast available
8:30 am Welcome, introductions and overview
Welcome, framing of the meeting and agenda overview
Welcome from the IOM
Michael McGinnis, Institute of Medicine
Opening remarks and meeting overview
Joe Selby, Patient-Centered Outcomes Research Institute
Ralph Horwitz, GlaxoSmithKline
Meeting objectives
1. Explore the role of observational studies (OS) in the generation of evidence to guide clinical and
health policy decisions, with a focus on individual patient care, in a learning health system;
2. Consider concepts of OS design and analysis, emerging statistical methods, use of OSs to
supplement evidence from experimental methods, identifying treatment heterogeneity, and
providing effectiveness estimates tailored for individual patients;
3. Engage colleagues from disciplines typically underrepresented in clinical evidence discussions;
4. Identify strategies for accelerating progress in the appropriate use of OS for evidence generation.
9:00 am Workshop stage-setting
Session format
o Workshop overview and stage-setting
Steve Goodman, Stanford University
Q&A and open discussion
Session questions:
o How do OS contribute to building valid evidence to support effective
decision making by patients and clinicians? When are their findings
useful, when are they not?
o What are the major challenges (study design, methodological, data
collection/management/analysis, cultural etc.) facing the field in the use
of OS data for decision making? Please include consideration of the
following issues: bias, methodological standards, publishing requirements.
o What can workshop participants expect from the following sessions?
9:45 am Engaging the issue of bias
Moderator: Michael Lauer, National Heart, Lung, and Blood Institute
Session format
o Introduction to issue
Sebastian Schneeweiss, Harvard University
o Presentations:
Instrumental variables and their sensitivity to unobserved biases
Dylan Small, University of Pennsylvania
An empirical approach to measuring and calibrating for error in
observational analyses
Patrick Ryan, Johnson & Johnson
o Respondents and panel discussion:
John Wong, Tufts University
Joel Greenhouse, Carnegie Mellon University
Q&A and open discussion
Session questions:
o What are the major bias-related concerns with the use of observational
study methods? What are the sources of bias?
o How much of these concerns relates to methods and how much to the
quality and availability of suitable data? What barriers have these concerns
created for the use of the results of observational studies to drive
decision-making?
o What are the most promising approaches to reducing bias through
the use of statistical methods? Through study design (e.g., dealing with
issues of multiplicity)?
o What are the circumstances under which administrative (claims) data can
be used to assess treatment benefits? What data are needed from EHRs
to strengthen the value of administrative data?
o What methods are best to adjust for the changes in treatment and clinical
conditions among patients followed longitudinally?
o What are the implications of these promising approaches for the use of
observational study methods moving forward?
11:30 am Lunch
Participants will be asked to identify, among their lunch table, what they think are the
most critical questions for PCOR in the topics covered by the workshop. These
questions will then be circulated to the moderators of the subsequent sessions.
12:30 pm Generalizing RCT results to broader populations
Moderator: Harold Sox, Dartmouth
Session format
o Introduction to issue
Robert Califf, Duke
o Presentations:
Generalizing the right question
Miguel Hernan, Harvard University
Using observational studies to determine RCT generalizability
Eloise Kaizar, Ohio State
o Respondents and panel discussion:
William Weintraub, Christiana Care Health Services
Constantine Frangakis, Johns Hopkins University
Q&A and open discussion
Session questions:
o What are the most cogent methodological and clinical considerations in
using observational study methods to test the external validity of findings
from RCTs?
o How do data collection, management, and analysis approaches impact
generalizability?
o What are the generalizability questions of greatest interest? Or, where
does the greatest doubt arise? (Age, concomitant illness, concomitant
treatment) What examples represent well established differences?
o What statistical methods are needed to generalize RCT results?
o Are the standards for causal inference from OS different when prior
RCTs have been performed? How does statistical methodology vary in
this case?
o What are the implications when treatment results for patients
not included in the RCT differ from the overall results reported in the
original RCT?
o What makes an observed difference in outcome credible? Finding the
RCT-shown effect on the narrower population? Replication in >1
environment? Confidence interval of the result? Size of the effect in the
RCT?
o Can subset analyses in the RCT, even if underpowered, be used to
support or rebut the OS finding?
2:15 pm Break
2:30 pm Detecting treatment-effect heterogeneity
Moderator: Richard Platt, Harvard Pilgrim Health Care Institute
Session format
o Introduction to issue
David Kent, Tufts University
o Presentations:
Comparative effectiveness of coronary artery bypass grafting and
percutaneous coronary intervention
Mark Hlatky, Stanford University
Identification of effect heterogeneity using instrumental variables
Anirban Basu, University of Washington
o Respondents and panel discussion:
Mary Charlson, Cornell University
Mark Cullen, Stanford University
Q&A and open discussion
Session questions:
o What is the potential for OS in assessing treatment response
heterogeneity and individual patient decision-making?
o What clinical and other data can be collected routinely to affect this
potential?
o How can longitudinal information on change in treatment categories and
clinical condition be used to assess variation in treatment response and
individual patient decision-making?
What are the statistical methods for time-varying changes in
treatment (including co-therapies) and clinical condition?
o What are the best methods to form distinctive patient subgroups in
which to examine for heterogeneity of treatment response?
What data elements are necessary to define these distinctive
patient subgroups?
o What are the best methods to assess heterogeneity in multi-dimensional
outcomes?
o How could further implementation of best practices in data collection,
management, and analysis impact treatment response heterogeneity?
o What is needed in order for information about treatment response
heterogeneity to be validated and used in practice?
4:15 pm Summary and preview of next day
4:45 pm Reception
5:45 pm Adjourn
*********************************************
Day 2: Friday, April 26th
8:00 am Coffee and light breakfast available
8:30 am Welcome, brief agenda overview, summary of previous day
Welcome, framing of the meeting and agenda overview
9:00 am Predicting individual responses
Moderator: Ralph Horwitz, GSK
Session format
o Introduction to issue
Burton Singer, University of Florida
o Presentations:
Data-driven prediction models
Nicholas Tatonetti, Columbia University
Individual prediction
Michael Kattan, Cleveland Clinic
o Respondents and panel discussion:
Peter Bach, Sloan Kettering
Mitchell Gail, National Cancer Institute
Q&A and open discussion
Session questions:
o How can patient-level observational data be used to create predictive
models of treatment response in individual patients? What statistical
methodologies are needed?
o How can predictive analytic methods be used to study the interactions of
treatment with multiple patient characteristics?
o How should the clinical history (longitudinal information) for a given
patient be utilized in the creation of prediction rules for responses of that
patient to one or more candidate treatment regimens?
o What are effective methodologies for producing prediction rules to guide
the management of an individual patient based on their comparability to
results of RCTs, OS, and archived patient records?
o How can we blend predictive models, which can predict impact of
treatment choices, and causal modeling, that compare predictions under
different treatments?
11:00 am Conclusions and strategies going forward
Panel members will be charged with highlighting very specific next steps laid out in
the course of workshop presentations and discussions and/or suggesting some of
their own.
Panel:
o Robert Califf, Duke University
o Cynthia Mulrow, University of Texas
o Jean Slutsky, Agency for Healthcare Research and Quality
o Steve Goodman, Stanford University
Session questions:
o What are the major themes and conclusions from the workshop’s
presentations and discussions?
o How can these themes be translated into actionable strategies with
designated stakeholders?
o What are the critical next steps in terms of advancing analytic methods?
o What are the critical next steps in developing databases that will generate
evidence to guide clinical decision making?
o What are critical next steps in disseminating information on new methods
to increase their appropriate use?
10:45 am Break
12:15 pm Summary and next steps
Comments from the Chairs
Joe Selby, Patient-Centered Outcomes Research Institute
Ralph Horwitz, GlaxoSmithKline
Comments and thanks from the IOM
Michael McGinnis, Institute of Medicine
12:45 pm Adjourn
*******************************************
Planning Committee
Co–Chairs
Ralph Horwitz, GlaxoSmithKline
Joe Selby, Patient-Centered Outcomes Research Institute
Members
Anirban Basu, University of Washington
Troy Brennan, CVS/Caremark
Louis Jacques, Centers for Medicare & Medicaid Services
Steve Goodman, Stanford University
Jerry Kassirer, Tufts University
Michael Lauer, National Heart, Lung, and Blood Institute
David Madigan, Columbia University
Sharon-Lise Normand, Harvard University
Richard Platt, Harvard Pilgrim Health Care Institute
Robert Temple, Food and Drug Administration
Burton Singer, University of Florida
Jean Slutsky, Agency for Healthcare Research and Quality
Staff officer:
Claudia Grossmann
cgrossmann@nas.edu
202.334.3867
OBSERVATIONAL STUDIES IN A LEARNING HEALTH SYSTEM
Workshop Planning Committee
Co–Chairs
Ralph I. Horwitz, MD
Senior Vice President, Clinical Sciences Evaluation
GlaxoSmithKline
Joe V. Selby, MD, MPH
Executive Director
PCORI
Members
Anirban Basu, MS, PhD
Associate Professor and Director
Health Economics and Outcomes Methodology
University of Washington
Troyen A. Brennan, MD, JD, MPH
Executive Vice President and Chief Medical Officer
CVS Caremark
Steven N. Goodman, MD, PhD
Associate Dean for Clinical & Translational Research
Stanford University School of Medicine
Louis B. Jacques, MD
Director
Coverage and Analysis Group
Centers for Medicare & Medicaid Services
Jerome P. Kassirer, MD
Distinguished Professor
Tufts University School of Medicine
Michael S. Lauer, MD, FACC, FAHA
Director
Division of Prevention and Population Sciences
National Heart, Lung, and Blood Institute
David Madigan, PhD
Chair of Statistics
Columbia University
Sharon-Lise T. Normand, PhD, MSc
Professor
Department of Biostatistics and Health Care Policy
Harvard Medical School
Richard Platt, MD, MS
Chair, Ambulatory Care and Prevention
Chair, Population Medicine
Harvard University
Burton H. Singer, PhD, MS
Professor
Emerging Pathogens Institute
University of Florida
Jean Slutsky, PA, MS
Director
Center for Outcomes and Evidence
Agency for Healthcare Research and Quality
Robert Temple, MD
Deputy Director for Clinical Science
Centers for Drug Evaluation and Research
Food and Drug Administration
Current as of 12pm, April 24
OBSERVATIONAL STUDIES IN A LEARNING HEALTH SYSTEM
April 25-26, 2013
Workshop Participants
Jill Abell, PhD, MPH
Senior Director, Clinical Effectiveness
and Safety
GlaxoSmithKline
Joseph Alper
Writer and Technology Analyst
LSN Consulting
Naomi Aronson
Executive Director
Blue Cross and Blue Shield Association
Peter Bach, MD, MAPP
Attending Physician
Department of Epidemiology & Biostatistics
Memorial Sloan-Kettering Cancer Center
Anirban Basu, MS, PhD
Associate Professor and Director
Program in Health Economics
and Outcomes Methodology
University of Washington
Lawrence Becker
Director, Benefits
Xerox Corporation
Marc L. Berger, MD
Vice President, Real World Data and Analytics
Pfizer Inc.
Robert M. Califf, MD
Vice Chancellor for Clinical Research
Duke University Medical Center
Mary E. Charlson, MD
Chief, Clinical Epidemiology and
Evaluative Sciences Research
Weill Cornell Medical College
Jennifer B. Christian, PharmD, MPH, PhD
Senior Director, Clinical Effectiveness
and Safety
GlaxoSmithKline
Michael L. Cohen, PhD
Senior Program Officer
Committee on National Statistics
Mark R. Cullen, MD
Professor of Medicine
Stanford School of Medicine
Steven R. Cummings, MD, FACP
Professor Emeritus, Department of Medicine
University of California, San Francisco
Robert W. Dubois, MD, PhD
Chief Science Officer
National Pharmaceutical Council
Rachael L. Fleurence, PhD
Acting Director, Accelerating PCOR
Methods Program
PCORI
Dean Follmann, PhD
Branch Chief and Associate Director for Biostatistics
National Institutes of Health
Constantine Frangakis, PhD
Professor, Department of Biostatistics
Johns Hopkins Bloomberg School
of Public Health
Mitchell H. Gail, MD, PhD
Senior Investigator
National Cancer Institute
Kathleen R. Gans-Brangs, PhD
Senior Director, Medical Affairs
AstraZeneca
Steven N. Goodman, MD, PhD
Associate Dean for Clinical and
Translational Research
Stanford University School of Medicine
Sheldon Greenfield, MD
Executive Co-Director, Health Policy
Research Institute
University of California, Irvine
Joel B. Greenhouse, PhD
Professor of Statistics
Carnegie Mellon University
Sean Hennessy, PharmD, PhD
Associate Professor of Epidemiology
University of Pennsylvania
Miguel Hernan, MD, DrPH, ScM, MPH
Professor of Epidemiology
Harvard University
Mark A. Hlatky, MD
Professor of Health Research & Policy,
Professor of Medicine
Stanford University
Ralph I. Horwitz, M.D.
Senior Vice President, Clinical
Science Evaluation
GlaxoSmithKline
Gail Hunt
President and CEO
National Alliance for Caregiving
Robert Jesse, MD, PhD
Principal Deputy Under Secretary for Health
Department of Veterans Affairs
Eloise E. Kaizar, PhD
Associate Professor
Department of Statistics
The Ohio State University
Jerome P. Kassirer, MD
Distinguished Professor
Tufts University School of Medicine
Michael Kattan, PhD
Quantitative Health Sciences Department Chair
Cleveland Clinic
David M. Kent, MD, MSc
Director, Clinical and Translational
Science Program
Tufts University Sackler School
of Graduate Biomedical Sciences
Michael S. Lauer, MD, FACC, FAHA
Director, Division of Prevention
and Population Sciences
National Heart, Lung, and Blood Institute
J. Michael McGinnis, MD, MPP, MA
Senior Scholar
Institute of Medicine
David O. Meltzer, PhD
Associate Professor
University of Chicago
Nancy E. Miller, PhD
Senior Science Policy Analyst
Office of Science Policy
National Institutes of Health
Sally Morton, PhD
Professor and Chair, Department of Biostatistics
Graduate School of Public Health
University of Pittsburgh
Cynthia D. Mulrow, MD, MSc
Senior Deputy Editor
Annals of Internal Medicine
Robin Newhouse
Chair and Professor
University of Maryland School of Nursing
Perry D. Nisen, MD, PhD
SVP, Science and Innovation
GlaxoSmithKline
Michael Pencina, PhD
Associate Professor
Boston University
Richard Platt, MD, MS
Chair, Ambulatory Care and Prevention
Chair, Population Medicine
Harvard University
James Robins, MD
Mitchell L. and Robin LaFoley Dong
Professor of Epidemiology
Harvard University
Patrick Ryan, PhD
Head of Epidemiology Analytics
Janssen Research and Development
Nancy Santanello, MD, MS
Vice President, Epidemiology
Merck
Richard L. Schilsky, MD, FASCO
Chief Medical Officer
American Society of Clinical Oncology
Sebastian Schneeweiss, MD
Associate Professor, Epidemiology
Division of Pharmacoepidemiology
and Pharmacoeconomics
Brigham and Women's Hospital
Michelle K. Schwalbe, PhD
Program Officer
Board on Mathematical Sciences
and Their Applications
National Research Council
Jodi Segal, MD, MPH
Director, Pharmacoepidemiology Program
The Johns Hopkins Medical Institutions
Joe V. Selby, MD, MPH
Executive Director
PCORI
Burton H. Singer, PhD, MS
Professor, Emerging Pathogens Institute
University of Florida
Jean Slutsky, PA, MS
Director, Center for Outcomes and Evidence
Agency for Healthcare Research and Quality
Dylan Small, PhD
Associate Professor of Statistics
University of Pennsylvania
Harold C. Sox, MD
Professor of Medicine (emeritus, active)
The Dartmouth Institute for Health Policy
and Clinical Practice
Dartmouth Geisel School of Medicine
Elizabeth A. Stuart
Associate Professor, Department of Biostatistics
Johns Hopkins Bloomberg School
of Public Health
Nicholas Tatonetti, PhD
Assistant Professor of Biomedical Informatics
Columbia University
Robert Temple, MD
Deputy Center Director for Clinical Science
Food and Drug Administration
Scott T. Weidman, PhD
Director, Board on Mathematical Sciences
and their Applications
National Research Council
William S. Weintraub, MD, FACC
John H. Ammon Chair of Cardiology
Christiana Care Health Services
Harlan Weisman
Managing Director
And-One Consulting, LLC
Ashley E. Wivel, MD, MSc
Senior Director in Clinical Effectiveness
and Safety
GlaxoSmithKline
John B. Wong, MD
Professor of Medicine
Tufts University Sackler School
of Graduate Biomedical Sciences
IOM Staff
Claudia Grossmann, PhD
Senior Program Officer
Diedtra Henderson
Program Officer
Elizabeth Johnston
Program Assistant
Valerie Rohrbach
Senior Program Assistant
Julia Sanders
Senior Program Assistant
Robert Saunders, PhD
Senior Program Officer
Barret Zimmermann
Program Assistant
CLINICAL TRIALS
Clinical Trials 2012; 9: 48–55
ARTICLE
Beyond the intention-to-treat in comparative
effectiveness research
Miguel A. Hernán and Sonia Hernández-Díaz
Background The intention-to-treat comparison is the primary, if not the only,
analytic approach of many randomized clinical trials.
Purpose To review the shortcomings of intention-to-treat analyses, and of ‘as
treated’ and ‘per protocol’ analyses as commonly implemented, with an emphasis
on problems that are especially relevant for comparative effectiveness research.
Methods and Results In placebo-controlled randomized clinical trials, intention-to-treat
analyses underestimate the treatment effect and are therefore nonconservative
for both safety trials and noninferiority trials. In randomized clinical trials with
an active comparator, intention-to-treat estimates can overestimate a treatment’s
effect in the presence of differential adherence. In either case, there is no guarantee
that an intention-to-treat analysis estimates the clinical effectiveness of treatment.
Inverse probability weighting, g-estimation, and instrumental variable estimation
can reduce the bias introduced by nonadherence and loss to follow-up in ‘as treated’
and ‘per protocol’ analyses.
Limitations These analyses require untestable assumptions, a dose-response model,
and time-varying data on confounders and adherence.
Conclusions We recommend that all randomized clinical trials with substantial lack
of adherence or loss to follow-up be analyzed using different methods. These
include an intention-to-treat analysis to estimate the effect of assigned treatment
and ‘as treated’ and ‘per protocol’ analyses to estimate the effect of treatment after
appropriate adjustment via inverse probability weighting or g-estimation. Clinical
Trials 2012; 9: 48–55. http://ctj.sagepub.com
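The inverse probability weighting mentioned in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration (not from the article): a single binary confounder L affects both the treatment actually received, A, and the outcome, Y, so a naive ‘as treated’ comparison is confounded; weighting each subject by 1/P(A | L) recovers the true effect of A.

```python
# Minimal IPW sketch with one binary confounder L.
# All parameter values are hypothetical, chosen for illustration.
import random

random.seed(0)
n = 50_000
data = []
for _ in range(n):
    L = random.random() < 0.5               # confounder (e.g., disease severity)
    A = random.random() < (0.4 if L else 0.8)  # sicker patients adhere less often
    # true effect of A: lowers risk by 0.10; L raises risk by 0.20
    Y = random.random() < (0.30 + 0.20 * L - 0.10 * A)
    data.append((L, A, Y))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# Naive 'as treated' risk difference is confounded by L
naive = (mean(y for l, a, y in data if a)
         - mean(y for l, a, y in data if not a))

# Estimate P(A = 1 | L) from the data, then weight by 1 / P(A = a | L)
p_a1 = {l: mean(a for l2, a, _ in data if l2 == l) for l in (False, True)}

def weight(a, l):
    return 1 / (p_a1[l] if a else 1 - p_a1[l])

ipw = (sum(y * weight(True, l) for l, a, y in data if a)
       - sum(y * weight(False, l) for l, a, y in data if not a)) / n

print(f"naive risk difference: {naive:+.3f}")   # pulled away from the truth by L
print(f"IPW risk difference:   {ipw:+.3f}  (truth: -0.100)")
```

The same reweighting idea underlies the adjusted ‘per protocol’ analyses the authors recommend, though the article's time-varying setting requires weights over treatment and confounder histories rather than a single baseline covariate.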
Introduction
Randomized clinical trials (RCTs) are widely viewed
as a key tool for comparative effectiveness research
[1], and the intention-to-treat (ITT) comparison has
long been regarded as the preferred analytic
approach for many RCTs [2].
Indeed, the ITT, or ‘as randomized,’ analysis has
two crucial advantages over other common alternatives
– for example, an ‘as treated’ analysis. First, in
double-blind RCTs, an ITT comparison provides a
valid statistical test of the hypothesis of null effect of
treatment [3,4]. Second, in placebo-controlled trials,
an ITT comparison is regarded as conservative
because it underestimates the treatment effect
when participants do not fully adhere to their
assigned treatment.
Yet excessive reliance on the ITT approach is
problematic, as has been argued by others before us
[5]. In this paper, we review the problems of ITT
comparisons with an emphasis on those that are
especially relevant for comparative effectiveness
research. We also review the shortcomings of ‘as
treated’ and ‘per protocol’ analyses as commonly
implemented in RCTs and recommend the routine
use of analytic approaches that address some of
Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA, and Harvard-MIT Division of Health
Sciences and Technology, Boston, MA, USA
Author for correspondence: Miguel Hernán, Department of Epidemiology, Harvard School of Public Health, Boston,
MA 02115, USA
E-mail: miguel_hernan@post.harvard.edu
© The Author(s), 2011
Reprints and permissions: http://www.sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/1740774511420743
those shortcomings. Let us start by defining two
types of causal effects that can be estimated in RCTs.
The effect of assigned treatment
versus the effect of treatment
Consider a double-blind clinical trial in which participants
are randomly assigned to either active
treatment (Z = 1) or placebo (Z = 0) and are then
followed for 5 years or until they die (Y = 1 if they die
within 5 years, Y = 0 otherwise). An ITT analysis
would compare the 5-year risk of death in those
assigned to treatment with the 5-year risk of death in
those assigned to placebo. An ITT comparison
unbiasedly estimates the average causal effect of
treatment assignment Z on the outcome Y. For
brevity, we will refer to this as the effect of assigned
treatment.
Trial participants may not adhere to, or comply
with, the assigned treatment Z. Some of those
assigned to placebo may decide to take treatment,
and some of those assigned to active treatment may
decide not to take it. We use A to refer to the
treatment actually received. Thus, regardless of their
assigned treatment Z, some subjects will take treatment
(A = 1) and others will not take it (A = 0). The
use of ITT comparisons is sometimes criticized when
not all trial participants adhere to their assigned
treatment Z, that is, when Z is not equal to A for
every trial participant. For example, consider two
RCTs: in the first trial, half of the participants in
the Z = 1 group decide not to take treatment; in the
second trial, all participants assigned to Z = 1 decide
to take the treatment. An ITT comparison will
correctly estimate the effect of assigned treatment
Z in both trials, but the effects will be different even
if the two trials are otherwise identical. The direction
and magnitude of the effect of assigned treatment
depends on the adherence pattern.
Now suppose that, in each of the two trials with
different adherence, we could estimate the effect
that would have been observed if all participants had
fully adhered to the value of treatment A (1 or 0)
originally assigned to them. We will refer to such
effect as the average causal effect of treatment A on
the outcome Y or, for brevity, the effect of treatment.
The effect of treatment A appears to be an attractive
choice to summarize the findings of RCTs with
substantial nonadherence because it will be the
same in two trials that differ only in their adherence
pattern.
However, estimating the magnitude of the effect
of treatment A without bias requires assumptions
grounded on expert knowledge (see below). No
matter how sophisticated the statistical analysis,
the estimate of the effect of A will be biased if one
makes incorrect assumptions.
The effect of assigned treatment
may be misleading
An ITT comparison is simple and therefore very
attractive [4]. It bypasses the need for assumptions
regarding adherence and dose-response by focusing
on estimating the effect of assigned treatment Z
rather than the effect of treatment A. However, there
is a price to pay for this simplicity, as reviewed in this
section.
We start by considering placebo-controlled
double-blind RCTs. It is well known that if treatment
A has a null effect on the outcome, then both the
effect of assigned treatment Z and the effect of treat-
ment A will be null. This is a key advantage of the ITT
analysis: it correctly estimates the effect of treatment
A under the null, regardless of the adherence pattern.
It is also well known that if treatment A has a non-
null effect (that is, either increases or decreases the
risk of the outcome) and some participants do not
adhere to their assigned treatment, then the effect of
assigned treatment Z will be closer to the null than
the actual effect of treatment A [3]. This bias toward
the null is due to contamination of the treatment
groups: some subjects assigned to treatment (Z = 1)
may not take it (A = 0) whereas some subjects
assigned to placebo (Z = 0) may find a way to take
treatment (A = 1). As long as the proportion of
patients who end up taking treatment (A = 1) is
greater in the group assigned to treatment (Z = 1)
than in the group assigned to placebo (Z = 0), the
effect of assigned treatment Z will be in between the
effect of treatment A and the null value.
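The contamination argument can be checked with a short simulation. The risks and adherence rates below are hypothetical (not from the article); the point is only that the ITT contrast of assignment Z is attenuated by roughly the difference in the proportions actually treated in the two arms.

```python
# Sketch: nonadherence in both arms pulls the ITT estimate toward the null.
# All numbers are hypothetical illustration.
import random

random.seed(1)
n = 100_000
risk_untreated, risk_treated = 0.30, 0.20   # true effect of treatment A: -0.10

def arm_risk(assigned, p_take_assigned=0.7, p_take_placebo=0.1):
    """5-year death risk in one arm, given imperfect adherence."""
    deaths = 0
    for _ in range(n):
        p_take = p_take_assigned if assigned else p_take_placebo
        took = random.random() < p_take      # treatment actually received (A)
        deaths += random.random() < (risk_treated if took else risk_untreated)
    return deaths / n

itt = arm_risk(True) - arm_risk(False)       # contrast by assignment Z
true_effect = risk_treated - risk_untreated

print(f"true effect of A: {true_effect:+.3f}")
print(f"ITT estimate:     {itt:+.3f}")
# Expected attenuation: (0.7 - 0.1) * (-0.10) = -0.060, between -0.10 and 0
```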
The practical effect of this bias varies depending
on the goal of the trial. Some placebo-controlled
RCTs are designed to quantify a treatment’s beneficial
effects – for example, a trial to determine
whether sildenafil reduces the risk of erectile dysfunction.
An ITT analysis of these trials is said to be
‘conservative’ because the effect of assigned treatment
Z is biased toward the null. That is, if an ITT
analysis finds a beneficial effect for treatment assignment
Z, then the true beneficial effect of treatment A
must be even greater. The makers of treatment A
have a great incentive to design a high-quality study
with high levels of adherence. Otherwise, a small
beneficial effect of treatment might be missed by the
ITT analysis.
Other trials are designed to quantify a treatment’s
harmful effects – for example, a trial to determine
whether sildenafil increases the risk of cardiovascular
disease. An ITT analysis of these trials is anticonservative
precisely because the effect of assigned
treatment Z is biased toward the null. That is, if an
ITT analysis fails to find a toxic effect, there is no
guarantee that treatment A is safe. A trial designed to
quantify harm and whose protocol foresees only an
ITT analysis could be referred to as a ‘randomized
cynical trial.’
Now let us consider double-blind RCTs that compare
two active treatments. These trials are often
designed to show that a new treatment (A = 1) is not
inferior to a reference treatment (A = 0) in terms of
either benefits or harms. An example of a noninferiority
trial would be one that compares the reduction
in blood glucose between a new inhaled insulin and
regular injectable insulin. The protocol of the trial
would specify a noninferiority margin, that is, the
maximum average difference in blood glucose that is
considered equivalent (e.g., 10 mg/dL). Using an ITT
comparison, the new insulin (A = 1) will be declared
not inferior to classical insulin (A = 0) if the average
reduction in blood glucose in the group assigned to
the new treatment (Z = 1) is within 10 mg/dL of the
average reduction in blood glucose in the group
assigned to the reference treatment (Z = 0) plus/
minus random variability. Such ITT analysis may be
misleading in the presence of imperfect adherence.
To see this, consider the following scenario.
Scenario 1
The new treatment A = 1 is actually inferior to the
reference treatment A = 0; for example, the average
reduction in blood glucose is 10 mg/dL under treatment
A = 1 and 22 mg/dL under treatment A = 0. The
type and magnitude of adherence are equal in the two
groups; for example, 30% of subjects in each group
decided not to take insulin. As a result, the average
reduction is, say, 7 mg/dL in the group assigned to
the new treatment (Z = 1) and 15 mg/dL in the group
assigned to the reference treatment (Z = 0). An ITT
analysis, which is biased toward the null in this
scenario, may incorrectly suggest that the new
treatment A = 1 is not inferior to the reference
treatment A = 0.
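The arithmetic of Scenario 1 can be checked directly. The following sketch uses the numbers given in the text; the assumption that nonadherers obtain a 0 mg/dL reduction is ours, and it makes the arm means come out at 7.0 and 15.4 mg/dL (the text rounds the latter to 15).

```python
# Scenario 1: equal nonadherence in both arms of a noninferiority trial.
# Illustrative check of the text's numbers; the 0 mg/dL reduction among
# nonadherers is an assumption made for this sketch.
true_effect_new = 10.0        # mg/dL reduction under new insulin (A = 1)
true_effect_ref = 22.0        # mg/dL reduction under reference insulin (A = 0)
nonadherence = 0.30           # 30% of each arm takes no insulin
reduction_if_untreated = 0.0  # assumed reduction among nonadherers

# Expected arm means under an ITT comparison: each assigned group is a
# mixture of adherers and nonadherers.
itt_new = (1 - nonadherence) * true_effect_new + nonadherence * reduction_if_untreated
itt_ref = (1 - nonadherence) * true_effect_ref + nonadherence * reduction_if_untreated

margin = 10.0                                  # noninferiority margin, mg/dL
true_diff = true_effect_ref - true_effect_new  # 12.0 mg/dL: truly inferior
itt_diff = itt_ref - itt_new                   # 8.4 mg/dL: appears noninferior

print(f"true difference {true_diff} mg/dL, ITT difference {itt_diff:.1f} mg/dL")
print("truly inferior beyond margin:", true_diff > margin)
print("ITT declares noninferior:", itt_diff <= margin)
```

The ITT contrast shrinks from 12 to 8.4 mg/dL, inside the 10 mg/dL margin, so the truly inferior product would be declared noninferior.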
Other double-blind RCTs with an active comparator
are designed to show that a new treatment
(A = 1) is superior to the reference treatment (A = 0)
in terms of either benefits or harms. An example of a
superiority trial would be one that compares the risk
of heart disease between two antiretroviral regimens.
Using an ITT comparison, the new regimen (A = 1)
will be declared superior to the reference regimen
(A = 0) if the heart disease risk is lower in the group
assigned to the new regimen (Z = 1) than in the group
assigned to the reference regimen (Z = 0), plus or minus
random variability. Again, such an ITT analysis may be
misleading in the presence of imperfect adherence.
Consider the following scenario.
Scenario 2
The new treatment A = 1 is actually equivalent to the
reference treatment A = 0; for example, the 5-year
risk of heart disease is 3% under either treatment
A = 1 or treatment A = 0, and the risk in the absence
of either treatment is 1%. The type or magnitude of
adherence differs between the two groups; for example,
50% of subjects assigned to the new regimen and
10% of those assigned to the reference regimen
decided not to take their treatment because of
minor side effects. As a result, the risk is, say, 2% in
the group assigned to the new regimen (Z = 1) and
2.8% in the group assigned to the reference regimen
(Z = 0). An ITT analysis, which is biased away from
the null in this scenario, may incorrectly suggest that
treatment A = 1 is superior to treatment A = 0.
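Scenario 2 can be verified the same way. This sketch assumes, as the scenario implies, that nonadherers in either arm experience the 1% untreated risk.

```python
# Scenario 2: differential nonadherence biases the ITT contrast away
# from the null even though the two regimens are truly equivalent.
risk_treated = 0.03      # 5-year heart disease risk under either regimen
risk_untreated = 0.01    # risk with no antiretroviral treatment
nonadherence_new = 0.50  # share of the new-regimen arm (Z = 1) who quit
nonadherence_ref = 0.10  # share of the reference arm (Z = 0) who quit

itt_risk_new = (1 - nonadherence_new) * risk_treated + nonadherence_new * risk_untreated
itt_risk_ref = (1 - nonadherence_ref) * risk_treated + nonadherence_ref * risk_untreated

print(f"ITT risk: new arm {itt_risk_new:.3f}, reference arm {itt_risk_ref:.3f}")
```

The ITT comparison (2.0% versus 2.8%) favors the new regimen even though both regimens carry the same 3% risk among takers; the apparent benefit is produced entirely by the extra nonadherers in the new-regimen arm.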
An ITT analysis of RCTs with an active comparator
may result in effect estimates that are biased toward
(Scenario 1) or away from (Scenario 2) the null. In
other words, the magnitude of the effect of assigned
treatment Z may be greater than or less than the
effect of treatment A. The direction of the bias
depends on the proportion of subjects that do not
adhere to treatment in each group, and on the
reasons for nonadherence.
Yet, a common justification for ITT comparisons is
the following: Adherence is not perfect in clinical
practice. Therefore, clinicians may be more inter-
ested in consistently estimating the effect of
assigned treatment Z, which already incorporates
the impact of nonadherence, than the effect of
treatment A in the absence of nonadherence. That
is, the effect of assigned treatment Z reflects a
treatment’s clinical effectiveness and therefore
should be privileged over the effect of treatment A.
In the next section, we summarize the reasons why
this is not necessarily true.
The effect of assigned treatment
is not the same as the effectiveness
of treatment
Effectiveness is usually defined as ‘how well a treat-
ment works in everyday practice,’ and efficacy as
‘how well a treatment works under perfect adherence
and highly controlled conditions.’ Thus, the effect of
assigned treatment Z in postapproval settings is
often equated with effectiveness, whereas the effect
of treatment Z in preapproval settings (which is close
to the effect of A when adherence is high) is often
equated with efficacy. There is, however, no guarantee
that the effect of assigned treatment Z matches
the treatment’s effectiveness in routine medical
practice. A discrepancy may arise for multiple rea-
sons, including differences in patient characteristics,
monitoring, or blinding, as we now briefly review.
The eligibility criteria for participants in RCTs are
shaped by methodologic and ethical considerations.
To maximize adherence to the protocol, many RCTs
exclude individuals with severe disease, comorbid-
ities, or polypharmacy. To minimize risks to vulner-
able populations, many RCTs exclude pregnant
women, children, or institutionalized populations.
As a consequence, the characteristics of participants
in an RCT may be, on average, different from those
of the individuals who will receive the treatment in
clinical practice. If the effect of the treatment under
study varies by those characteristics (e.g., treatment
is more effective for those using certain concomitant
treatments) then the effect of assigned treatment Z
in the trial will differ from the treatment’s effective-
ness in clinical practice.
Patients in RCTs are often more intensely moni-
tored than patients in clinical practice. This greater
intensity of monitoring may lead to earlier detection
of problems (i.e., toxicity, inadequate dosing) in
RCTs compared with clinical practice. Thus, a treat-
ment’s effectiveness may be greater in RCTs because
the earlier detection of problems results in more
timely therapeutic modifications, including modifi-
cations in treatment dosing, switching to less toxic
treatments, or addition of concomitant treatments.
Blinding is a useful approach to prevent bias from
differential ascertainment of the outcome [6]. There
is, however, an inherent contradiction in conduct-
ing a double-blind study while arguing that the goal
of the study is estimating the effectiveness in routine
medical practice. In real life, both patients and
doctors are aware of the assigned treatment. A true
effectiveness measure should incorporate the effects
of assignment awareness (e.g., behavioral changes)
that are eliminated in ITT comparisons of double-
blind RCTs.
Some RCTs, commonly referred to as pragmatic
trials [7–9], are specifically designed to guide decisions
in clinical practice. Compared with highly controlled
trials, pragmatic trials include less selected partici-
pants and are conducted under more realistic condi-
tions, which may result in lower adherence to the
assigned treatment. It is often argued that an ITT
analysis of pragmatic trials is particularly appropriate
to measure the treatment’s effectiveness, and
thus that pragmatic trials are the best design for
comparative effectiveness research. However, this
argument raises at least two concerns.
First, the effect of assigned treatment Z is influ-
enced by the adherence patterns observed in the
trial, regardless of whether the trial is a pragmatic
one. Compared with clinical practice, trial partici-
pants may have a greater adherence because they are
closely monitored (see above), or simply because
they are the selected group who received informed
consent and accepted to participate. Patients outside
the trial may have a greater adherence after they
learn, perhaps based on the trial’s findings, that
treatment is beneficial. Therefore, the effect of
assigned treatment estimated by an ITT analysis
may under- or overestimate the effectiveness of the
treatment.
Second, the effect of assigned treatment Z is
inadequate for patients who are interested in initi-
ating and fully adhering to a treatment A that has
been shown to be efficacious in previous RCTs. In
order to make the best informed decision, these
patients would like to know the effect of treatment A
rather than an effect of assigned treatment Z, which
is contaminated by other patients’ nonadherence
[5]. For example, to decide whether to use certain
contraception method, a couple may want to know
the failure rate if they use the method as indicated,
rather than the failure rate in a population that
included a substantial proportion of nonadherers.
Therefore, the effect of assigned treatment Z may be
an insufficient summary measure of the trial data,
even if it actually measures the treatment’s
effectiveness.
In summary, the effect of assigned treatment Z –
estimated via an ITT comparison – may not be a valid
measure of the effectiveness of treatment A in
clinical practice. And even if it were, effectiveness
is not always the most interesting effect measure.
These considerations, together with the inappropri-
ateness of ITT comparisons for safety and noninfer-
iority trials, make it necessary to expand the
reporting of results from RCTs beyond ITT analyses.
The next section reviews other analytic approaches
for data from RCTs.
Conventional ‘as treated’ and ‘per
protocol’ analyses
Two common attempts to estimate the effect of
treatment A are ‘as treated’ and ‘per protocol’ com-
parisons. Neither is generally valid.
An ‘as treated’ analysis classifies RCT participants
according to the treatment that they took (either
A = 1 or A = 0) rather than according to the treatment
that they were assigned to (either Z = 1 or
Z = 0). Then an 'as treated' analysis compares the risk
(or the mean) of the outcome Y among those who
took treatment (A = 1) with that among those who
did not take treatment (A = 0), regardless of their
treatment assignment Z. That is, an ‘as treated’
comparison ignores that the data come from an
RCT and rather treats them as coming from an
observational study. As a result, an ‘as treated’
comparison will be confounded if the reasons that
moved participants to take treatment were associ-
ated with prognostic factors. The causal diagram in
Figure 1 represents the confounding as a noncausal
association between A and Y when there exist
prognostic factors L that also affect the decision to
take treatment A (U is an unmeasured common cause
of L and Y). Confounding arises in an ‘as treated’
analysis when not all prognostic factors L are appro-
priately measured and adjusted for.
A ‘per protocol’ analysis – also referred to as an ‘on
treatment’ analysis – only includes individuals who
adhered to the clinical trial instructions as specified
in the study protocol. The subset of trial participants
included in a ‘per protocol’ analysis, referred to as
the per protocol population, includes only partici-
pants with A equal to Z: those who were assigned to
treatment (Z ¼ 1) and took it (A ¼ 1), and those who
were not assigned to treatment (Z ¼ 0) and did not
take it (A ¼ 0). A ‘per protocol’ analysis compares the
risk (or the mean) of the outcome Y among those
who were assigned to treatment (Z ¼ 1) with that
among those who were not assigned to treatment
(Z ¼ 0) in the per protocol population. That is, a ‘per
protocol’ analysis is an ITT analysis in the per
protocol population. This contrast will be affected
by selection bias [10] if the reasons that moved
participants to adhere to their assigned treatment
were associated with prognostic factors L. The causal
diagram in Figure 2 includes S as an indicator of
selection into the ‘per protocol’ population. The
selection indicator S is fully determined by the values
of Z and A, that is, S = 1 when A = Z, and S = 0
otherwise. The selection bias is a noncausal associ-
ation between Z and Y that arises when the analysis
is restricted to the 'per protocol' population (S = 1)
and not all prognostic factors L are appropriately
measured and adjusted for.
As an example of biased ‘as treated’ and ‘per
protocol’ estimates of the effect of treatment A,
consider the following scenario.
Scenario 3
An RCT assigns men to either colonoscopy (Z = 1) or
no colonoscopy (Z = 0). Suppose that undergoing a
colonoscopy (A = 1) does not affect the 10-year risk
of death from colon cancer (Y) compared with not
undergoing a colonoscopy (A = 0), that is, the effect
of treatment A is null. Further suppose that, among
men assigned to Z = 1, those with a family history of
colon cancer (L = 1) are more likely to adhere to their
assigned treatment and undergo the colonoscopy
(A = 1).
Even though A has a null effect, an 'as treated'
analysis will find that men undergoing colonoscopy
(A = 1) are more likely to die from colon cancer,
because they include a greater proportion of men
with a predisposition to colon cancer than the others
(A = 0). This is the situation depicted in Figure 1.
Similarly, a 'per protocol' analysis will find a greater
risk of death from colon cancer in the group Z = 1
than in the group Z = 0 because the per protocol
restriction A = Z overloads the group assigned to
colonoscopy with men with a family history of colon
cancer. This is the situation depicted in Figure 2.
The confounding bias in the ‘as treated’ analysis
and the selection bias in the ‘per protocol’ analysis
can go in either direction – for example, suppose
that L represents healthy diet rather than family
history of colon cancer. In general, the direction of
the bias is hard to predict because it is possible that
the proportions of people with a family history,
healthy diet, and any other prognostic factor will
vary between the groups A = 1 and A = 0 condi-
tional on Z.
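Scenario 3 is straightforward to reproduce by simulation. In the sketch below the effect of colonoscopy A on death Y is exactly null, while family history L raises both adherence under Z = 1 and the death risk; all numerical probabilities are illustrative assumptions, not values from the text.

```python
import random

# Scenario 3: null effect of colonoscopy (A) on 10-year colon-cancer
# death (Y), with family history (L) driving both adherence and risk.
# All probabilities are illustrative assumptions.
random.seed(0)
N = 200_000
counts = {}  # analysis group -> [deaths, subjects]

def tally(key, y):
    d = counts.setdefault(key, [0, 0])
    d[0] += y
    d[1] += 1

for _ in range(N):
    L = random.random() < 0.20                       # family history
    Z = 1 if random.random() < 0.50 else 0           # random assignment
    if Z:
        A = 1 if random.random() < (0.90 if L else 0.50) else 0
    else:
        A = 1 if random.random() < 0.05 else 0       # little crossover
    Y = 1 if random.random() < (0.10 if L else 0.02) else 0  # L only
    tally(("ITT", Z), Y)                             # by assignment
    tally(("AT", A), Y)                              # by received treatment
    if A == Z:
        tally(("PP", Z), Y)                          # per protocol population

def risk(key):
    d, n = counts[key]
    return d / n

itt = risk(("ITT", 1)) - risk(("ITT", 0))  # ~0: unbiased under the null
at = risk(("AT", 1)) - risk(("AT", 0))     # > 0: confounding by L
pp = risk(("PP", 1)) - risk(("PP", 0))     # > 0: selection on A == Z
print(f"ITT {itt:+.4f}  as-treated {at:+.4f}  per-protocol {pp:+.4f}")
```

The as-treated and per-protocol contrasts come out positive even though A does nothing, while the ITT contrast stays near zero; stratifying the first two contrasts on L would remove the bias, as Figures 1 and 2 indicate.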
In summary, ‘as treated’ and ‘per protocol’ anal-
yses transform RCTs into observational studies for all
practical purposes. The estimates from these analyses
Figure 1. Simplified causal diagram for a randomized clinical
trial with assigned treatment Z, received treatment A, and
outcome Y. U represents the unmeasured common causes of A
and Y. An 'as treated' analysis of the A-Y association will be
confounded unless all prognostic factors L are adjusted for.
Figure 2. Simplified causal diagram for a randomized clinical
trial with assigned treatment Z, received treatment A, and
outcome Y. U represents the unmeasured common causes of A
and Y, and S an indicator for selection into the 'per protocol'
population. The Z-Y association in the 'per protocol' population
(a restriction represented by the box around S) will be affected
by selection bias unless all prognostic factors L are adjusted for.
can only be interpreted as the effect of treatment A if
the analysis is appropriately adjusted for the con-
founders L. If the intended analysis of the RCT is ‘as
treated’ or ‘per protocol,’ then the protocol of the
trial should describe the potential confounders and
how they will be measured, just like the protocol of
an observational study would do.
More general ‘as treated’ and ‘per
protocol’ analyses to estimate the
effect of treatment
So far we have made the simplifying assumption that
adherence is all or nothing. But in reality, RCT
participants may adhere to their assigned treatment
intermittently. For example, they may take their
assigned treatment for 2 months, discontinue it for
the next 3 months, and then resume it until the end
of the study. Or subjects may take treatment con-
stantly but at a lower dose than assigned. For
example, they may take only one pill per day when
they should take two. Treatment A is generally a
time-varying variable – each day you may take it or
not take it – rather than a time-fixed variable – you
either always take it or never take it during the
follow-up.
An ‘as treated’ analysis with a time-varying treat-
ment A usually involves some sort of dose-response
model. A ‘per protocol’ analysis with a time-varying
treatment A includes all RCT participants but censors
them if/when they deviate from their assigned
treatment. The censoring usually occurs at a fixed
time after nonadherence, say, 6 months. The per
protocol population in this variation refers to the
adherent person-time rather than to the adherent
persons.
Because previous sections were only concerned
with introducing some basic problems of ITT, ‘as
treated’ and ‘per protocol’ analyses, we considered A
as a time-fixed variable. However, this simplification
may be unrealistic and misleading in practice.
When treatment A is truly time-varying (i) the
effect of treatment needs to be redefined and (ii)
appropriate adjustment for the measured confoun-
ders L cannot generally be achieved by using con-
ventional methods such as stratification, regression,
or matching.
The definition of the average causal effect of a
time-fixed treatment involves the contrast between
two clinical regimes. For example, we defined the
causal effect of a time-fixed treatment as a contrast
between the average outcome that would be
observed if all participants took treatment A = 1
versus treatment A = 0. The two regimes are 'taking
treatment A = 1' and 'taking treatment A = 0'. The
definition of the causal effect of a time-varying
treatment also involves a contrast between two
clinical regimes. For example, we can define the
causal effect of a time-varying treatment as a contrast
between the average outcome that would be
observed if all participants had continuous treat-
ment with A = 1 versus continuous treatment with
A = 0 during the entire follow-up. We sometimes
refer to this causal effect as the effect of continuous
treatment.
When the treatment is time-varying, so are the
confounders. For example, the probability of taking
antiretroviral therapy increases in the presence of
symptoms of HIV disease. Both therapy and con-
founders evolve together during the follow-up.
When the time-varying confounders are affected by
previous treatment – for example, antiretroviral
therapy use reduces the frequency of symptoms –
conventional methods cannot appropriately adjust
for the measured confounders [10]. Rather, inverse
probability (IP) weighting or g-estimation are gener-
ally needed for confounding adjustment in ‘as
treated’ and ‘per protocol’ analyses involving time-
varying treatments [11–13].
Both IP weighting and g-estimation require that
time-varying confounders and time-varying treat-
ments are measured during the entire follow-up.
Thus, if planning to use these adjustment methods,
the protocol of the trial should describe the potential
confounders and how they will be measured.
Unfortunately, like in any observational study,
there is no guarantee that all confounders will be
identified and correctly measured, which may result
in biased estimates of the effect of continuous
treatment in ‘as treated’ and ‘per protocol’ analyses
involving time-varying treatments.
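The mechanics of IP weighting can be illustrated with a single time-fixed treatment: weighting each subject by 1/P(A | L) creates a pseudo-population in which L no longer predicts treatment, and with a time-varying treatment the weight becomes the product of such factors over follow-up. The sketch below is a minimal illustration under assumed numbers, with the treatment probabilities known rather than estimated.

```python
import random

# IP weighting for a time-fixed treatment A confounded by a measured
# covariate L. All numbers are illustrative; p_a is known here, whereas
# in practice it would be estimated from the data.
random.seed(1)
N = 500_000
true_effect = -0.05                  # A lowers the outcome risk

num = {0: 0.0, 1: 0.0}               # weighted outcome totals by A
den = {0: 0.0, 1: 0.0}               # weighted subject totals by A
raw = {0: [0, 0], 1: [0, 0]}         # unweighted totals by A

for _ in range(N):
    L = 1 if random.random() < 0.5 else 0
    p_a = 0.8 if L else 0.2                    # L drives treatment...
    A = 1 if random.random() < p_a else 0
    p_y = 0.10 + 0.10 * L + true_effect * A    # ...and outcome risk
    Y = 1 if random.random() < p_y else 0
    w = 1.0 / (p_a if A else 1.0 - p_a)        # inverse probability weight
    num[A] += w * Y
    den[A] += w
    raw[A][0] += Y
    raw[A][1] += 1

naive = raw[1][0] / raw[1][1] - raw[0][0] / raw[0][1]  # confounded
ipw = num[1] / den[1] - num[0] / den[0]                # adjusted
print(f"naive {naive:+.3f}  IP-weighted {ipw:+.3f}  truth {true_effect:+.3f}")
```

Here the unweighted contrast even has the wrong sign, while the IP-weighted contrast recovers the true -0.05; when time-varying confounders are affected by past treatment, this reweighting succeeds where stratification, regression, and matching fail.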
An alternative adjustment method is instrumen-
tal variable (IV) estimation, a particular form of
g-estimation that does not require measurement of
any confounders [14–17]. In double-blind RCTs, IV
estimation eliminates confounding for the effect of
continuous treatment A by exploiting the fact that
the initial treatment assignment Z was random.
Thus, if the time-varying treatment A is measured
and a correctly specified structural model used, IV
estimation adjusts for confounding without measur-
ing, or even knowing, the confounders.
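The simplest instance of IV estimation in this setting is the Wald estimator for a time-fixed binary treatment: divide the ITT contrast by the difference in treatment uptake between arms, (E[Y | Z = 1] - E[Y | Z = 0]) / (E[A | Z = 1] - E[A | Z = 0]). The sketch below assumes a homogeneous additive effect and illustrative numbers; the confounder U is never used by the estimator.

```python
import random

# Wald IV estimator in a double-blind RCT: randomization Z serves as an
# instrument for received treatment A. U confounds A and Y but is never
# measured; all numbers are illustrative assumptions.
random.seed(2)
N = 500_000
effect = -0.10                        # homogeneous additive effect of A

y_sum = {0: 0, 1: 0}
a_sum = {0: 0, 1: 0}
n = {0: 0, 1: 0}
for _ in range(N):
    Z = 1 if random.random() < 0.5 else 0
    U = 1 if random.random() < 0.5 else 0               # unmeasured
    A = 1 if random.random() < 0.1 + 0.6 * Z + 0.2 * U else 0
    Y = 1 if random.random() < 0.3 + 0.2 * U + effect * A else 0
    y_sum[Z] += Y
    a_sum[Z] += A
    n[Z] += 1

itt = y_sum[1] / n[1] - y_sum[0] / n[0]     # effect of assignment Z
uptake = a_sum[1] / n[1] - a_sum[0] / n[0]  # adherence contrast, ~0.6
iv = itt / uptake                           # Wald IV estimate of effect of A
print(f"ITT {itt:+.3f}  IV {iv:+.3f}  truth {effect:+.3f}")
```

The ITT contrast is attenuated toward the null (roughly -0.06 here), and dividing by the uptake difference restores the true -0.10 without measuring, or even knowing, U.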
A detailed description of IP weighting,
g-estimation, and IV estimation is beyond the
scope of this paper. Toh and Hernán review
these methods for RCTs [18]. IP weighting and
g-estimation can also be used to estimate the
effect of treatment regimes that may be more clin-
ically relevant than the effect of continuous treat-
ment [19,20]. For example, it may be more
interesting to estimate the effect of treatment taken
continuously unless toxic effects or counterindica-
tions arise.
Discussion
An ITT analysis of RCTs is appealing for the same
reason it may be appalling: simplicity. As described
above, ITT estimates may be inadequate for the
assessment of comparative effectiveness or safety.
In the presence of nonadherence, the ITT effect is a
biased estimate of treatment’s effects such as the
effect of continuous treatment. This bias can be
corrected in an appropriately adjusted ‘as treated’
analysis via IP weighting, g-estimation, or IV estima-
tion. However, IP weighting and g-estimation
require untestable assumptions similar to those
made for causal inference from observational data.
IV estimation generally requires a dose-response
model and its validity is questionable for nonblinded
RCTs.
The ITT approach is also problematic if a large
proportion of participants drop out or are otherwise
lost to follow-up, or if the outcomes are incompletely
ascertained among those completing the study. In
these studies, an ITT comparison cannot be con-
ducted because the value of the outcome is missing
for some individuals. To circumvent this problem,
the ITT analysis is often replaced by a pseudo-ITT
analysis that is restricted to subjects with complete
data or in which the last observation is carried
forward. These pseudo-ITT analyses may be affected
by selection bias in either direction. Adjusting for
this bias is possible via IP weighting if information
on the time-varying determinants of loss to follow-
up is available, but again, the validity of the adjust-
ment relies on untestable assumptions about the
unmeasured variables [18].
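The contrast between a pseudo-ITT (complete-case) analysis and an IP weighted ITT analysis can also be sketched. Below, dropout depends on a measured covariate L that also predicts the outcome, Z truly has no effect, and the censoring probabilities are known rather than estimated; all numbers are illustrative assumptions.

```python
import random

# Complete-case (pseudo-ITT) analysis vs IP-of-censoring weighting.
# Dropout depends on L, which also predicts Y, so restricting to
# observed subjects is selection-biased; weighting the observed by
# 1 / P(observed | Z, L) corrects it. Z truly has a null effect.
random.seed(3)
N = 500_000

cc = {0: [0, 0], 1: [0, 0]}              # complete-case deaths, counts
wnum = {0: 0.0, 1: 0.0}                  # weighted deaths by Z
wden = {0: 0.0, 1: 0.0}                  # weighted counts by Z
for _ in range(N):
    Z = 1 if random.random() < 0.5 else 0
    L = 1 if random.random() < 0.5 else 0
    p_obs = 0.6 if (Z == 1 and L == 1) else 0.9      # differential dropout
    Y = 1 if random.random() < 0.1 + 0.2 * L else 0  # Z plays no role
    if random.random() < p_obs:                      # outcome observed
        cc[Z][0] += Y
        cc[Z][1] += 1
        w = 1.0 / p_obs                  # known here; estimated in practice
        wnum[Z] += w * Y
        wden[Z] += w

pseudo_itt = cc[1][0] / cc[1][1] - cc[0][0] / cc[0][1]  # selection-biased
ipcw = wnum[1] / wden[1] - wnum[0] / wden[0]            # near the true null
print(f"pseudo-ITT {pseudo_itt:+.3f}  IP-weighted ITT {ipcw:+.3f}")
```

The complete-case contrast suggests a benefit of assignment that is pure selection bias, while the weighted contrast returns to the null; the correction is valid only if loss to follow-up is random within levels of the measured Z and L.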
RCTs with long follow-up periods, as expected in
many comparative effectiveness research settings,
are especially susceptible to bias due to nonadher-
ence and loss to follow-up. As these problems accu-
mulate over time, the RCT starts to resemble a
prospective observational study, and the ITT analysis
yields an increasingly biased estimate of the effect of
continuous treatment. Consider, for example, a
Women’s Health Initiative randomized trial that
assigned postmenopausal women to either estrogen
plus progestin hormone therapy or placebo [21].
About 40% of women had stopped taking at least
80% of their assigned treatment by the 6th year of
follow-up. The ITT hazard ratio of breast cancer was
1.25 (95% CI: 1.01, 1.54) for hormone therapy versus
placebo. The IP weighted hazard ratio of breast
cancer was 1.68 (1.24, 2.28) for 8 years of continuous
hormone therapy versus no hormone therapy [22].
These findings suggest that the effect of continuous
treatment was more than twofold greater than the
effect of assigned treatment. Of course, neither of
these estimates reflects the long-term effect of hor-
mone therapy in clinical practice (e.g., the adherence
to hormone therapy was much higher in the trial
than in the real world).
When analyzing data from RCTs, the question is
not whether assumptions are made but rather which
assumptions are made. In an RCT with incomplete
follow-up or outcome ascertainment, a pseudo-ITT
analysis assumes that the loss to follow-up occurs
completely at random whereas an IP weighted ITT
analysis makes less strong assumptions (e.g., loss to
follow-up occurs at random conditional on the
measured covariates). In an RCT with incomplete
adherence, an ITT analysis shifts the burden of
assessing the actual magnitude of the effect from
the data analysts to the clinicians and other decision
makers, who will need to make assumptions about
the potential bias introduced by lack of adherence.
Supplementing the ITT effects with ‘as treated’ or
‘per protocol’ effects can help decision makers [23],
but only if a reasonable attempt is made to appro-
priately adjust for confounding and selection bias.
In summary, we recommend that all RCTs with
substantial lack of adherence or loss to follow-up
be analyzed using different methods, including an
ITT analysis to estimate the effect of assigned treat-
ment, and appropriately adjusted ‘per protocol’ and
‘as treated’ analyses (i.e., via IP weighting or g-
estimation) to estimate the effect of received treat-
ment. Each approach has relative advantages and
disadvantages, and depends on a different combi-
nation of assumptions [18]. To implement this
recommendation, RCT protocols should include a
more sophisticated statistical analysis plan, as well as
plans to measure adherence and other postrando-
mization variables. This added complexity is neces-
sary to take full advantage of the substantial societal
resources that are invested in RCTs.
Acknowledgement
We thank Goodarz Danaei for his comments on an
earlier version of this manuscript.
Funding
This study was funded by National Institutes of
Health grants R01 HL080644-01 and R01 HD056940.
References
1. Luce BR, Kramer JM, Goodman SN, et al. Rethinking
randomized clinical trials for comparative effectiveness
research: the need for transformational change. Ann Intern
Med 2009; 151: 206–09.
2. Food and Drug Administration. International
Conference on Harmonisation; Guidance on Statistical
Principles for Clinical Trials. Federal Register 1998; 63:
49583–98.
3. Rosenberger WF, Lachin JM. Randomization in Clinical
Trials: Theory and Practice. Wiley-Interscience, New York,
NY, 2002.
4. Piantadosi S. Clinical Trials: A Methodologic Perspective
(2nd edn). Wiley-Interscience, Hoboken, NJ, 2005.
5. Sheiner LB, Rubin DB. Intention-to-treat analysis and the
goals of clinical trials. Clin Pharmacol Ther 1995; 57: 6–15.
6. Psaty BM, Prentice RL. Minimizing bias in randomized
trials: the importance of blinding. JAMA 2010; 304:
793–94.
7. McMahon AD. Study control, violators, inclusion criteria
and defining explanatory and pragmatic trials. Stat Med
2002; 21: 1365–76.
8. Schwartz D, Lellouch J. Explanatory and pragmatic
attitudes in therapeutical trials. J Chronic Dis 1967; 20:
637–48.
9. Tunis SR, Stryer DB, Clancy CM. Practical clinical trials:
increasing the value of clinical research for decision
making in clinical and health policy. JAMA 2003; 290:
1624–32.
10. Hernán MA, Hernández-Díaz S, Robins JM. A structural
approach to selection bias. Epidemiology 2004; 15: 615–25.
11. Robins JM. Correcting for non-compliance in randomized
trials using structural nested mean models. Commun Stat
1994; 23: 2379–412.
12. Robins JM. Correction for non-compliance in equivalence
trials. Stat Med 1998; 17: 269–302.
13. Robins JM, Finkelstein D. Correcting for non-
compliance and dependent censoring in an AIDS clinical
trial with inverse probability of censoring weighted
(IPCW) Log-rank tests. Biometrics 2000; 56: 779–88.
14. Hernán MA, Robins JM. Instruments for causal inference:
an epidemiologist’s dream? Epidemiology 2006; 17:
360–72.
15. Ten Have TR, Normand SL, Marcus SM, et al. Intent-to-
treat vs. non-intent-to-treat analyses under treatment
non-adherence in mental health randomized trials.
Psychiatr Ann 2008; 38: 772–83.
16. Cole SR, Chu H. Effect of acyclovir on herpetic ocular
recurrence using a structural nested model. Contemp Clin
Trials 2005; 26: 300–10.
17. Mark SD, Robins JM. A method for the analysis of
randomized trials with compliance information: an appli-
cation to the Multiple Risk Factor Intervention Trial. Contr
Clin Trials 1993; 14: 79–97.
18. Toh S, Hernán MA. Causal inference from longitudinal
studies with baseline randomization. Int J Biostat 2008; 4:
Article 22.
19. Hernán MA, Lanoy E, Costagliola D, Robins JM.
Comparison of dynamic treatment regimes via inverse
probability weighting. Basic Clin Pharmacol Toxicol 2006;
98: 237–42.
20. Cain LE, Robins JM, Lanoy E, et al. When to start
treatment? A systematic approach to the comparison of
dynamic regimes using observational data. Int J Biostat
2010; 6(2): Article 18.
21. Writing group for the Women’s Health Initiative
Investigators. Risks and benefits of estrogen plus proges-
tin in healthy postmenopausal women: principal results
from the women’s health initiative randomized controlled
trial. JAMA 2002; 288: 321–33.
22. Toh S, Hernández-Díaz S, Logan R, et al. Estimating
absolute risks in the presence of nonadherence: an appli-
cation to a follow-up study with baseline randomization.
Epidemiology 2010; 21: 528–39.
23. Thorpe KE, Zwarenstein M, Oxman AD, et al. A prag-
matic-explanatory continuum indicator summary
(PRECIS): a tool to help trial designers. J Clin Epidemiol
2009; 62: 464–75.
PERSPECTIVE
Orthogonal predictions: follow-up questions for suggestive data†
Alexander M. Walker MD, DrPH1,2*
1 World Health Information Science Consultants, LLC, Newton, MA 02466, USA
2 Department of Epidemiology, Harvard School of Public Health, Boston, MA 02115, USA
SUMMARY
When a biological hypothesis of causal effect can be inferred, the hypothesis can sometimes be tested in the selfsame database that gave rise to
the study data from which the hypothesis grew. Valid testing happens when the inferred biological hypothesis has scientific implications that
predict new relations between observations already recorded. Testing for the existence of the new relations is a valid assessment of the
biological hypothesis, so long as the newly predicted relations are not a logical correlate of the observations that stimulated the hypothesis in
the first place. These predictions that lead to valid tests might be called ‘orthogonal’ predictions in the data, and stand in marked contrast to
‘scrawny’ hypotheses with no biological content, which predict simply that the same data relations will be seen in a new database. The
Universal Data Warehouse will shortly render moot searches for new databases in which to test. Copyright © 2010 John Wiley & Sons, Ltd.
key words — databases; hypothesis testing; induction; inference
Received 2 October 2009; Accepted 13 January 2010
INTRODUCTION
In 2000, the Food and Drug Administration’s (FDA)
Manette Niu and her colleagues had found something
that might have been predicted by medicine, but not by
statistics.1
They were looking for infants who had
gotten into trouble after a dose of Wyeth’s RotaShield
vaccine in the Centers for Disease Control’s Vaccine
Adverse Event Reporting System.
A vaccine against rotavirus infection in infants,
RotaShield was already off the market.2
In the United
States, rotavirus causes diarrhea so severe that it can
lead to hospitalization. By contrast, the infection is
deadly in poor countries. The 1999 withdrawal had
arguably cost hundreds of thousands of lives of
children whose death from rotavirus-induced diarrhea
could have been avoided through widespread vaccina-
tion with RotaShield.3,4
The enormity of the con-
sequences of the withdrawal made it important that the
decision had been based at least in sound biology.
Wyeth suspended sales of RotaShield because the
vaccine appeared to cause intussusception, an infant
bowel disorder in which a portion of the colon slips
inside of itself. The range of manifestations of
intussusception varies enormously. It can resolve on
its own, with little more by way of signs than the baby’s
fussiness from abdominal pain. Sometimes tissue
damage causes bloody diarrhea. Sometimes the bowel
infarcts and must be removed, or the baby dies.
Dr Niu had used a powerful data-mining tool, Bill
DuMouchel's Multi-Item Gamma Poisson Shrinker, to
sift through the Vaccine Adverse Event Reporting
System (VAERS) data, and she found that intussuscep-
tion was not alone in its association with RotaShield.5
So too were gastrointestinal hemorrhage, intestinal
obstruction, gastroenteritis, and abdominal pain.
My argument here is that those correlations
represented independent tests of the biological
hypothesis that had already killed the vaccine. The
observations were sufficient to discriminate hypotheses
of biological causation from those of chance, though
competing (and testable) hypotheses of artifact may
have remained.
The biologic hypothesis was that if RotaShield had caused intussusception, it was likely to have caused cases that did not present as fully recognized instances of the disease, but which nonetheless represented the same pathology. Looking for these other conditions was a test of the biologic hypothesis raised by the occurrence of the severest cases. Like the original observations, the test data resided in VAERS, but were nonetheless independent, in that different physicians in different places, acting more or less concurrently, reported them about different patients.

Pharmacoepidemiology and Drug Safety 2010; 19: 529–532. Published online 22 March 2010 in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/pds.1929
* Correspondence to: A. M. Walker, World Health Information Science Consultants, LLC, 275 Grove St., Suite 2-400, Newton, MA 02466, USA. E-mail: Alec.Walker@WHISCON.com
The author declared no conflict of interest.
Copyright © 2010 John Wiley & Sons, Ltd.
INDUCTION AND TESTING
The key step in Niu’s activity was induction of a
biological hypothesis of cause from an observation of
association. Testing the biological hypothesis differs
fundamentally from testing the data-restatement that
‘There is an association between RotaShield and
intussusception.’ The latter, by itself a scrawny
hypothesis if you could call it a hypothesis at all,
might be examined in other environments, though
probably not in real time, since VAERS is a national
system and RotaShield had been marketed only in the
United States. Scrawny hypotheses have no meat to
them, that is they do no more than predict more of the
same, and even then only when the circumstances of
observation are identical. The biological hypothesis, by
contrast, was immediately testable through its implica-
tions in VAERS, and could produce a host of other
empiric tests.
From the perspective of Wyeth, the FDA, and the Centers for Disease Control and Prevention (CDC), the parties who had to act in that summer of crisis, only biologic causation really mattered.
Biologic causation was not the only theory that
predicted reports of multiple related diseases in
association with RotaShield. Most of the reports came
in after the CDC had announced the association and
Wyeth had suspended distribution of RotaShield. Physicians who did not know one another might have been
similarly sensitized to the idea that symptom com-
plexes compatible with intussusception should be
reported. Stimulated reporting is therefore another
theory that competes with biological causation to
account for the findings.
For the present discussion, the key point is not how
well the competing hypotheses (biological causation,
stimulated reporting, and chance) explain the newly
found data. The key is whether one can rationally look
at the non-intussusception diagnoses in VAERS to test
theories about the RotaShield-intussusception associ-
ation, and whether such looks ‘into the same data’ are
logically suspect.
Trudy Murphy and collaborators offered another
example of testing implications of the biological
hypothesis of causation in a subsequent case-control
study of RotaShield and intussusception.6
Looking at
any prior receipt of RotaShield, they found an
adjusted odds ratio of 2.2 (95% CI 1.5–3.3). Murphy’s
data also provided a test of the theory of biological
causation, no form of which would predict a uniform
distribution of cases over time after vaccination.
Indeed there were pronounced aggregations of cases
3–7 days following first and second immunization.
Interestingly, a theory of stimulated reporting would
not have produced time clustering, at least not without
secondary theories added on top, and so the Murphy
data weighed against the leading non-biologic theory
for the Niu observations.
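The arithmetic behind a case-control estimate of this kind can be sketched as follows. The 2×2 counts here are invented for illustration, not Murphy's data, and the confidence interval uses the standard Woolf log-odds method rather than whatever adjustment Murphy's group applied.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI from a 2x2 table (Woolf's log method).

    a: exposed cases,    b: unexposed cases
    c: exposed controls, d: unexposed controls
    """
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Invented counts: prior RotaShield receipt among cases vs. controls.
or_, lo, hi = odds_ratio_ci(a=30, b=100, c=50, d=400)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

An interval whose lower bound sits well above 1, as in Murphy's 2.2 (1.5-3.3), is what distinguishes an elevated odds ratio from a chance fluctuation.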
ORTHOGONAL PREDICTIONS
Niu’s and Murphy’s findings share a common element.
In neither case did the original observation (case reports of intussusception for Niu, or an association between intussusception and ever-immunization with RotaShield for Murphy) imply the follow-up observations (other
diagnoses and time-clusters) as a matter of logic, on
the null hypothesis. That is, neither set of follow-up
observations was predicted by the corresponding
scrawny hypothesis, since neither was simply a
restatement of the initiating finding. In this sense, I
propose that we call the predictions that Niu and
Murphy tested ‘orthogonal’ to the original observation.
In the very high-dimensional space of medical
observations, the predicted data are not simply a
rotation of the original findings.
Where did the orthogonal predictions come from?
The investigators stepped out of the data and into the
physical world. We do not know about the world
directly, but we can have theories about how it works,
and we can test those theories against what we see.
Reasoning about the nature of the relations that gave
rise to observed data, we can look for opportunities to
test the theories. With discipline, we can restrict our
‘predictions’ to relations that are genuinely new, and
yet implied by our theories.
SHOCKED, SHOCKED
‘I’m shocked, shocked to find that gambling is going on in here!’ says Captain Renault in Casablanca, just before he discreetly accepts his winnings and closes down Rick’s Café Américain to appease his Nazi
minders. Advocates for finding new data sources to test
hypotheses might feel kinship with the captain. While
sincerely believing in the importance of independent
replication, they find that they too examine different
dimensions of outcomes in suggestive data to evaluate
important hypotheses, particularly those hypotheses
that would require immediate action if true. This is
already the core of regulatory epidemiology, which
concerns itself with the best decision on the available
data. The necessity to act sometimes plays havoc with
prescriptions that cannot be implemented quickly.
Exploration of data in hand is not limited to public
health epidemiologists, regulators among them. In fact
most epidemiologists check causal hypotheses in the
data that generated them. Whenever observational
researchers see an important effect, they worry (or
should do) whether they have missed some confound-
ing factor. Confounding is a causal alternative
hypothesis for an observed association, and the
hypothesis of confounding often has testable implica-
tions in the data at hand. Will the crude effect disappear
when we control for age? It would be hard to describe
the search for confounders as anything other than
testing alternative causal hypotheses in the data that
gave rise to them.
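The check described above, whether a crude effect survives control for age, can be sketched with a Mantel-Haenszel summary over age strata. All counts below are invented, constructed so that age is associated with both exposure and outcome:

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel summary odds ratio over 2x2 strata.

    Each stratum is (a, b, c, d):
    a exposed cases, b unexposed cases, c exposed controls, d unexposed controls.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented strata: within each age group the exposure does nothing (OR = 1),
# but exposure and outcome are both more common in the old stratum.
young = (10, 20, 100, 200)
old   = (40, 20,  80,  40)

a, b, c, d = (x + y for x, y in zip(young, old))  # collapse over age
print("crude OR:", (a * d) / (b * c))             # inflated by confounding
print("age-adjusted OR:", mh_odds_ratio([young, old]))
```

Here the crude odds ratio is about 1.67 while the age-adjusted summary is exactly 1.0: the "effect" disappears when we control for age, which is precisely the testable implication of the confounding hypothesis.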
Far from public health, sciences in which there is
little opportunity for experiment, such as geology,
regularly test hypotheses in existing data. Ebel and
Grossman, for example, could ‘predict for the first
time’ (their words) events 65 million years ago, in a
headline-grabbing theory that explained a world-wide
layer of iridium at just the geological stratum that
coincided with the disappearance of the dinosaurs.7
There is nothing illegitimate in the exercise.
THE UNIVERSAL DATA WAREHOUSE
The question that motivated this Symposium, ‘One
Database or Two?’ was whether it is necessary to seek
out a new database to test theories derived from a
database at hand. Above I have argued that the issue is
not the separation of the databases, but rather the
independence of the test and hypothesis-generating
data. Clearly, two physically separate databases whose
information was independently derived by different
investigators working in different sources meet the
criterion of independence, but so do independently
derived domains of single databases. Fortunately, the
question may shortly be moot, because there will be in
the future only one database.
Let me explain.
In 1993, Philip Cole, a professor at the University of
Alabama at Birmingham, provided a radical solution to
the repeated critique that epidemiologists were finding
unanticipated relations in data, and that the researchers
were presuming to make statements about hypotheses
that had not been specified in advance. In ‘The
Hypothesis Generating Machine’, Cole announced the
creation of the HGM, a machine that had integrated data
on every agent, every means of exposure and every time
relation together with every disease. From these, the
HGM had formed every possible hypothesis about every
possible relationship.8
Never again would a hypothesis
be denigrated for having been newly inferred from data.
In the same elegant paper, Cole also likened the idea that
studies generate hypotheses to the once widely held view
that piles of rags generate mouse pups. People generate
hypotheses; inanimate studies do not.
With acknowledgment to Cole, couldn’t we imagine
a Universal Data Warehouse consisting of all data ever
recorded? Some twists of relativity theory might even
get us to postulate that the UDW could contain all
future data as well.9
Henceforward, all tests of all
hypotheses would occur by necessity in the UDW,
whether or not the investigator was aware that his or her
data were simply a view into the warehouse.
Researchers would evermore test and measure the
impact of hypotheses in the data that suggested them.
The new procedure of resorting to the UDW will not
constitute a departure from current practice, and may
result in more efficient discussion.
The reluctance of statisticians and philosophers to
test a hypothesis in the data that generated it makes
rigorous sense. I think that our disagreement, if there
was one, on the enjoyable morning of our Symposium
was definitional rather than scientific. In an earlier
era, when dedicated, expensive collection was the
only source of data, the sensible analysis plan
extracted everything to be learned the first time
through.
Overwhelmed by information from public and private
data streams, researchers now select out the pieces that
seem right to answer the questions they pose. The answers raise new questions, different ones, and it makes sense to pick out (from the same fire hose spurting facts) new data that will help us make sense of what we think we may have learned the first time through.

KEY POINTS
• When examination of complex data leads to a biological hypothesis, that hypothesis may have implications that are testable in the original data.
• The test data need to be independent of the hypothesis-generating data.
• The "Universal Data Warehouse" reminds us of the futility of substituting data location for data independence.
Recorded experience is the database through which
we observe, theorize, test, theorize, observe again, test
again, and so on for as long as we have stamina and
means. We certainly should have standards as to when
data test a theory, but the standard does not need to be
that the originating databases are different.
ACKNOWLEDGEMENTS
This paper owes much to many people, none of whom should be held accountable for its shortcomings, as the author did not always agree with his friends' good advice. The author is indebted to his co-participants in the Symposium, Larry Gould particularly, Patrick Ryan and Sebastian Schneeweiss, and to the deft organizers, Susan Sacks and Nancy Santanello, for their valuable advice. He also thanks Phil Cole, Ken Rothman and Paul Stang for their careful reading and to-the-point commentary. There are no relevant financial considerations to disclose.
REFERENCES
1. Niu MT, Erwin DE, Braun MM. Data mining in the US Vaccine Adverse
Event Reporting System (VAERS): early detection of intussusception and
other events after rotavirus vaccination. Vaccine 2001; 19: 4627–4634.
2. Centers for Disease Control and Prevention (CDC). Suspension of
rotavirus vaccine after reports of intussusception—United States,
1999. MMWR Morb Mortal Wkly Rep 2004; 53(34): 786–789. Erratum
in: MMWR Morb Mortal Wkly Rep 2004; 53(37): 879.
3. World Health Organization. Report of the Meeting on Future Directions for Rotavirus Vaccine Research in Developing Countries, 9–11 February 2000. Geneva: WHO (Publication WHO/VB/00.23).
4. Linhares AC, Bresee JS. Rotavirus vaccines and vaccination in Latin
America. Rev Panam Salud Publica 2000; 8(5): 305–331.
5. DuMouchel W. Bayesian data mining in large frequency tables, with an
application to the FDA Spontaneous Reporting System. Am Stat 1999;
53: 177–190.
6. Murphy TV, Gargiullo PM, Massoudi MS et al. Intussusception among
infants given an oral rotavirus vaccine. N Engl J Med 2001; 344: 564–
572.
7. Ebel DS, Grossman L. Spinel-bearing spherules condensed from the
Chicxulub impact-vapor plume. Geology 2005; 33(4): 293–296.
8. Cole P. The hypothesis generating machine. Epidemiology 1993; 4(3):
271–273.
9. Rindler W. Essential Relativity (rev. 2nd edn). Springer Verlag: Berlin,
1977. See Section 2.4. ‘The Relativity of Simultaneity’ for a particularly
lucid presentation of this phenomenon. The warehouse does not of
course contain all future data, as we will restrict it to information
generated by and about humans.