This document provides an overview of using the R statistical package in clinical research applications. It discusses R's history and increasing use in evidence-based medicine and clinical trials. It also addresses common myths and challenges regarding using R in regulated clinical research and drug development settings, such as concerns about validation and compliance. The presenter outlines R's capabilities for statistical analysis tasks and its ability to interface with other packages like SAS. He discusses steps for preparing R for use in industry, including validation of results and creating a controlled environment.
1. The use of R statistical package in
controlled infrastructure
The case of Clinical Research industry
Adrian Olszewski
Senior Biostatistician at 2KMM
22nd Jun 2018
Poland • Sosnowiec
www.2kmm.eu
Polish National Group of the
International Society for Clinical Biostatistics
http://www.iscb.pl
40min
www.r-clinical-research.com
r.clin.res@gmail.com
PART I
2. DISCLAIMER
All trademarks, logos of companies and names of products
used in this document
are the sole property of their respective owners
and are included here for informational, illustrative purposes only,
which falls within the nominative fair use.
This presentation is based exclusively on information
publicly available on the Internet under provided hyperlinks.
If you believe your rights are violated, please email me: r.clin.res@gmail.com
3. Agenda
► Quick introduction to R
o Description
o History. Events important for the use of R in EBM*
o Who uses R?
* Evidence-Based Medicine
4. Agenda
► R in Evidence-Based Medicine
o Capabilities
o A brief overview of common tasks
o Cooperation and compliance with SAS
o www.r-clinical-research.com or CRAN Task Views
5. Agenda
► R in Clinical Research
o Status of R on the Clinical Research market
o Myths and Facts
o What does FDA say?
o What does it mean „to validate”? Why do we want this?
o Preparing R to enter the industry
6. Agenda
► Validation
o Validation of installation vs. numerical validation
o Numerical validation
o Methods
o Reference data
► Fixing the environment and controlling for changes
► How does R support the creation of a controlled environment?
8. Quick introduction to R ► Description
R is an open-source software environment, widely used in the scientific world for:
statistical computing
data manipulation
data presentation
and other general programming tasks
https://www.r-project.org
It’s also the name of a high-level, Turing-complete, interpreted, multi-paradigm
programming language used within the environment.
9. Quick introduction to R ► Description
Short characteristics:
► Description: computational environment + programming language
► Developer: R Development Core Team
► Operating systems: cross-platform (Windows, Unix, Linux, OS X); mobile (Android, Maemo, Raspbian)
► Form: command line + third-party IDEs and editors
► Infrastructure: R core library + shell + libraries (base and third-party)
► Model of work: 1) standalone application, 2) standalone server, 3) server process
► Programming language: Turing-complete, domain-specific, interpreted, high-level with dynamic typing
► Paradigm: 1) array, 2) object-oriented (S3, S4, R5, R6 models), 3) imperative, 4) functional, 5) procedural, 6) reflective
► Source of libraries: mirrored repository (CRAN), users' sites, third-party repositories (GitHub, R-Forge)
► License of the core: GNU General Public License ver. 2
► License of libraries: 99.9% open-source; 0.1% is licensed (free for non-commercial use)
model <- lm(y ~ x1 * x2)
13. Quick introduction to R ► History
1976 1998
1993
R was born
1997
R Core Team
was formed
1988
S-PLUS was born
Statistical Sciences, Inc.
R. Douglas Martin
University of Washington
Univ. of Auckland
Ross Ihaka, Robert Gentleman
1980
First
commercial
release
via AT&T
1988
New S Language
First statistical system to receive the
Software System Award, the top
software award from the Association
for Computing Machinery
The last version
2008
IC acquired
1993
S code bought
for $2 mln
2004
Exclusive license
to develop and sell
the S language
2007
2003
R Foundation
was formed
R Consortium
was founded
First release
CRAN
S was born
Bell Laboratories
Rick Becker,
Allan Wilks,
John Chambers
from Bell Labs Insightful Corporation
from AT&T Lucent
TIBCO
2013
TERR - TIBCO Enterprise Runtime for R
2007
Revolution was born
Revolution Analytics
2015
R Open was born
Microsoft
2008
R Enterprise
Oracle
2015
Revolution
acquired
by Microsoft
v 1.0.0
2000
TIBCO Spotfire
14. Quick introduction to R ► History ► (few) Events important for the use of R in EBM
1997 The first release of R; FDA 21 CFR Part 11; CRAN
1998 nlme
1999 FDA “Off-The-Shelf Software Use in Medical Device”
2000 xtable
2001 DBI survival
2002 multcomp; FDA “General Principles of Software Validation – Final”; Bioconductor
2003 lme4 nlmeODE; The R Core Team
2004
2005 drc (Dose-Response) PKfit PK ggplot2 ROCR
2006 gsDesign meta mice tdm ivivc blockrand pwr
2007 SASxport Rtools
"Using R: Perspectives of a FDA Statistical RevieweR"
"R - Regulatory Compliance and Validation Issues"
"Use of R in C.T. & Industry-Sponsored Medical Res."
"Op. Sour. Stat. Soft. in Pharma Developm.: A case study with R"
The R Foundation
2008 MCPMod bear rjags epiR plyr DanteR
2009 SAS IML studio supports R; SAS7bdat metafor gamm4
2010 PKGraph pROC oro.nifti oro.dicom PowerTOST
2011 RStudio devtools ggbio RISmed rplos
2012 Shiny knitr Pmetrics TrialSize stargazer OpenCPU; FDA: “Sponsors may use R in their submissions”
2013 cpk; The SAS® versus R Debate in Industry and Academia
2014 Tidyverse ValidR Checkpoint Packrat Rmarkdown rclinicaltrials pubmed.miner ReporteRs greport dplyr MRAN
2015 rxODE gfd ThreeArmedTrials randomizeR; FDA: “Statistical Software Clarifying Statement”; The R Consortium
2016 R Tools for Visual Studio; rankFD; The R Epid. Cons.
2017 dfpk - Bayesian Dose-Finding Designs; officer
2018 Mediana - general framework for CT simulations
15. Quick introduction to R ► Who uses R?
Other Business & Science Tycoons
► American Express
► Bank of America
► BBC
► Capgemini
► Deloitte
► eBay
► Facebook
► Fermi National Accelerator Laboratory
► Ford
► Goldman Sachs
► Google
► HP
► IBM
► J.P. Morgan
► Kickstarter
► Microsoft
► Monsanto
► Mozilla
► New York Times
► NIST - National Institute of Standards & Technology
► NOAA
► Oracle
► Twitter
► Uber
► UK Government
► Wells Fargo
Medicine and Pharmacy
► 2KMM
► Amgen
► AstraZeneca
► Bayer
► CardioDX
► Dr. Reddy’s Laboratories
► FDA
► GCE
► KCR (2014-2017)
► Medtronic
► Merck
► Novartis
► Pfizer
► Roche
16. Quick introduction to R ► Who uses R?
The list is built based exclusively on publicly available information:
lists of users provided by Revolution, RStudio and others
articles (example, example) and interviews (example)
published documents in which a name of a company is visible (example)
job advertisements (LinkedIn, Google, PharmiWeb, etc.)
names of companies supporting / organizing events (conferences, courses, etc)
other sources (example)
That is to say, a company's logo is included in the list only if there is clear evidence that the
company uses or supports (or used or supported) R, based on information shared on the Internet
and thus available to everyone.
Please note that I do not know whether all listed companies are still using any version of R at the time the
presentation is being viewed. If you want me to remove your logo, please send me an email at
r.clin.res@gmail.com
17. Quick introduction to R ► Who uses R?
“We use R for adaptive designs frequently because it’s the fastest tool to explore designs that interest
us. Off-the-shelf software, gives you off-the-shelf options. Those are a good first order approximation,
but if you really want to nail down a design, R is going to be the fastest way to do that.”
Keaven Anderson
Executive Director, Late Stage Biostatistics
Merck
“De facto, R is already a significant component of Pfizer core technology. Access to a supported
version of R will allow us to keep pace with the growing use of R in the organization, and provides a
path forward to use of R in regulated applications.”
James A. Rogers Ph.D.
Associate Director, Nonclinical Statistics Group
Pfizer
Publicly available sources:
https://pharma-life-sciences.cioreview.com/news/gsdesign-explorer-to-optimize-merck-s-clinical-trial-process-nid-1305-cid-36.html
Google Books: Big Data for Big Pharma: An Accelerator for The Research and Development Engine?
https://www.featuredcustomers.com/vendor/revolution-analytics-1/customers/pfizer
18. Quick introduction to R ► Who uses R?
“We use R for all of our analysis,” says Elashoff. “I think it’s fair to say that R really is the
foundation of a lot of the work that we do.” To speed up the process without sacrificing
accuracy, the team also uses Revolution R analytic products. “We use R seven or eight
hours per day, so any improvement in speed is helpful, particularly when you’re looking at a
million biomarkers and wondering if you’ll need to re-run a million analyses.”
Open-source R packages enable the biostatisticians at CardioDX to run a broad range of
analyses, accurately and effectively, on a routine basis. Adding Revolution R products to the
mix improves processing speeds and makes it easier to crunch large data sets. Accelerating
the analytic process reduces overall project time, increasing the team’s efficiency. “Revolution
R is faster than regular R,” says Elashoff. “The faster we can analyze data, the less time it
takes us to build our diagnostic algorithms.”
Michael Elashoff
The company’s director of biostatistics
CardioDX
Publicly available sources:
https://www.featuredcustomers.com/media/CustomerCaseStudy.document/revolution-analytics-1_cardiodx_8284.pdf
https://www.businesswire.com/news/home/20110118006656/en/CardioDX-Revolution-Analytics-Develop-Non-Intrusive-Test-Predicting
20. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
descriptive analysis
& summarizing
errors-in-variables modeling
comparison of methods
Deming, Passing-Bablok, Bland-Altman
time-to-event / survival
Kaplan-Meier, Nelson-Aalen,
Cox regression, Weibull
design of experiments
parallel, cross-over, adaptive,
group-sequential, multi-arm
ROC analysis
categorical data
analysis
planned
& post-factum analysis
advanced plotting
sample size & power
meta-analysis
non-inferiority
superiority
(bio) equivalence
PK, PD,
Dose-Response
randomization
repeated measures &
longitudinal trials
parametric / non-parametric
modeling
(non) parametric (non) linear models
with mixed effects
resampling
bootstrap, permutation, exact
factorial design analysis
parametric / non-parametric
robust methods
regularized, M-estimators
detection of outliers
univariate / multivariate
missing data imputation
*OCF, kNN, LI, MI, censored (KM)
21. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
logging processes
pure ascii, html, pdf, doc
cooperation with SAS
interoperability
.NET, Java, Scala, Python, C++, Fortran,
PHP, Perl, DDE, COM, TCP, WebServices
accessing registers
clinicaltrials.gov, PubMed, PLOS
accessing databases
ODBC, JDBC, Oracle, MS SQL, MySQL,
dBase, PostgreSQL, SQLite, DB/2,
Informix, Firebird, H2, MongoDB, more…
reproducible research
producing documents
doc(x), ppt(x), pdf, rtf, odf, ps
exchanging data
Excel, OO Calc, GNumeric, SPSS, Weka,
Systat, Stata, EpiInfo, SAS, SAS XPT,
Minitab, Octave, Matlab, DBF, CSV, XML,
HTML, JSON, DICOM, NIFTI
production tools and
unit testing
GUI desktop
& server applications
interactive
presentations
advanced data
querying &
transforming
22. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
Descriptive stats
Data review
23. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
Linear regression
ANOVA
post-hoc
24. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
GLM modelling
25. R in Evidence-Based Medicine ► Capabilities ► A brief overview of common tasks
NLM modelling
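The slides above showed screenshots of these tasks, which are not reproduced here. As a rough illustration only, the following base-R sketch runs each named task on datasets shipped with R. The choice of datasets (`mtcars`, `PlantGrowth`, `Puromycin`) and of models is an assumption made for this example, not taken from the presentation.

```r
# Descriptive statistics and data review
summary(mtcars$mpg)
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Linear regression
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# One-way ANOVA followed by Tukey post-hoc comparisons
fit_aov <- aov(weight ~ group, data = PlantGrowth)
posthoc <- TukeyHSD(fit_aov)

# Generalized linear model (logistic regression)
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# Non-linear model (Michaelis-Menten) using a self-starting model
fit_nls <- nls(rate ~ SSmicmen(conc, Vm, K), data = Puromycin,
               subset = state == "treated")
print(coef(fit_nls))
```

Each fitted object can then be inspected with `summary()`, which is typically where the review of results starts.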
34. R in Evidence-Based Medicine ► Capabilities ► Cooperation & compliance with SAS
SAS and R Team in Clinical Research (Adrian Olszewski)
Differences in:
origin of dates
default contrasts
type of sums of squares used
calculation of quantiles
generation of random numbers
implementation of advanced models
representation of floating point numbers
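Some of the differences listed above can be made visible from the R side alone. The sketch below is a minimal illustration using base R only. The sample vector is made up, and which quantile definition matches a given SAS setting is not asserted here and should be checked against the SAS documentation.

```r
# Origin of dates: R counts days from 1970-01-01, SAS from 1960-01-01
as.numeric(as.Date("1970-01-01"))   # 0 in R
sas_day <- as.numeric(as.Date("1970-01-01") - as.Date("1960-01-01"))  # 3653
from_sas <- as.Date(sas_day, origin = "1960-01-01")  # convert a SAS day count

# Default contrasts: contr.treatment() uses the first level as reference,
# contr.SAS() uses the last level instead
ct_r   <- contr.treatment(3)
ct_sas <- contr.SAS(3)

# Quantiles: quantile() implements nine definitions; the default is type 7
x <- c(1, 2, 3, 4, 10)   # made-up values
q_all <- sapply(1:9, function(t) quantile(x, probs = 0.25, type = t,
                                          names = FALSE))

# Floating point: compare with a tolerance rather than with ==
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```

When results from the two systems are compared, settings like these have to be aligned first, otherwise small discrepancies may be mistaken for errors.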
SAS base, SAS IML, SAS module #1, SAS module #2
Missing or expensive functionality, or a different method of communication
Required algorithm or functionality, e.g.
$$\hat{f}_h(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
Bi-directional communication
36. R in Clinical Research ► Status of R on the Clinical Research market
In general bioscience and academia, R has built over the years a
position as one of the industry standards.
In clinical research, however, SAS reigns par excellence.
Pharmaceutical companies, CROs and even FDA do use R “internally”.
But they resist (or hesitate) to use it in submissions (to FDA).
Clinical Programmer or Biostatistician ≝ SAS Programmer. Period.
OK, but how did it come to this?
37. R in Clinical Research ► Status of R on the Clinical Research market
We can only speculate why R users are so often told the mantra:
Too many myths have accumulated, but we cannot ignore the facts.
38. R in Clinical Research ► Myths and Facts
Facts:
► FDA requires software to be validated
► R is not validated out-of-the-box
► R doesn’t facilitate the creation of CDISC datasets
► R doesn’t have a metadata layer
► R doesn’t have a paid hot-line
► Nobody takes responsibility if something goes wrong
► Packages change over time. What works today may not work tomorrow. Packages happen to be removed
► Validation of software is a challenging and time-consuming process, so not everyone can afford it
Myths / objections:
► FDA demands SAS for both the analysis and producing datasets. No other software is allowed.
► R cannot generate datasets in SAS Transport format
► R cannot cooperate with SAS, including reading and writing SAS binary files
► R cannot be validated as well as commercial software
► Commercial software doesn’t have errors
► R is full of bugs (errors) as nobody controls it
► R poorly supports the generation of TFLs
39. R in Clinical Research ► Myths and Facts
Facts:
► Errors happen often in non-commercial software
► Announcing errors publicly doesn’t make people calm
► FDA: “Results should be reproducible and independent of the software used to derive them”
► Creators of R packages don’t have to provide (good) unit tests. It’s a matter of good will.
Myths / objections:
► R is limited in terms of implemented statistical methods
► Nobody uses R (or Open Source in general) in pharma industry (or in “serious business”). Maybe in academia, which is not a kind of a serious business.
► R doesn’t meet 21 CFR Part 11, which is a must
► R has no SAS-like “LOG”, which records everything
► There is no commercial support (product and/or validation)
► The entire software (all packages and functions) must be validated
► Commercial software releases the end-user from any responsibility regarding the validation.
40. R in Clinical Research ► Myths and Facts
Facts Myths
Who is right and…
…is it possible to use R in controlled environment?
41. R in Clinical Research ► Myths and Facts
First, let us briefly address all points in the “table of shame”. Facts first.
FDA requires software to be validated
Yes. This is a mandatory process. And that’s good! It protects not
only the sponsor from serious trouble but also the patients!
R is not validated out-of-the-box
Yes, the official release is not guaranteed to be error-free. A
disclaimer note confirming that is displayed every time R is
launched. But validation is fully possible.
R doesn’t facilitate the creation of CDISC
datasets
True. There are no easy GUI tools to map fields between CDASH
and SDTM, nor easy-to-use ways to generate define.xml.
R doesn’t have a metadata layer
Partially true. R supports attributes on every level of a data
structure. With a little effort a metadata layer can be implemented effectively. I
plan to release a package allowing datasets to be annotated and
printed in line with the assigned formats.
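A minimal sketch of how attributes could serve as a simple metadata layer, using base R only. The dataset, the attribute names (`label`, `units`) and the helper `var_labels()` are assumptions made for illustration and do not represent the package mentioned above.

```r
# A small example dataset (made-up values, for illustration only)
dm <- data.frame(AGE = c(34, 51, 47), SEX = c("F", "M", "F"))

# Attach metadata as attributes at the dataset and the variable level
attr(dm, "label")     <- "Demographics"
attr(dm$AGE, "label") <- "Age"
attr(dm$AGE, "units") <- "years"
attr(dm$SEX, "label") <- "Sex"

# Hypothetical helper: collect the variable labels into a named vector
var_labels <- function(df) {
  vapply(df, function(col) {
    lab <- attr(col, "label")
    if (is.null(lab)) NA_character_ else lab
  }, character(1))
}
print(var_labels(dm))
```

One caveat: base R drops most attributes when a vector is subset, so a real implementation would need to take care to preserve them.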
R doesn’t have a paid hot-line
True. This is not a commercial project. But the R community is
vibrant and provides a huge amount of knowledge (Stack, Github)
42. R in Clinical Research ► Myths and Facts
Nobody takes the responsibility if something
goes wrong
True. This is not a commercial project. By the way, to what
extent exactly do commercial companies take responsibility? Do
you have the conditions (and $$$) written down on paper and
signed?
Packages change over time. What works
today, may not work tomorrow. Packages
happen to be removed
Very true, but this can be managed effectively in many ways,
as addressed later in this presentation.
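One simple building block for controlling such changes is to record the environment an analysis ran on and to refuse to run on an unexpected one. The sketch below uses base R only. The helper `check_r_version()` and the version threshold are hypothetical, and the sketch is not a replacement for dedicated tools.

```r
# Record what the analysis actually ran on
env_record <- list(
  r_version = R.version.string,
  platform  = R.version$platform,
  packages  = vapply(c("stats", "utils"),
                     function(p) as.character(packageVersion(p)),
                     character(1))
)
str(env_record)

# Hypothetical guard: stop if the running R is older than the qualified one
check_r_version <- function(minimum) {
  if (getRversion() < minimum) {
    stop("R ", as.character(getRversion()),
         " is older than the qualified version ", minimum)
  }
  invisible(TRUE)
}
check_r_version("3.0.0")   # example threshold, chosen for illustration
```

Storing `env_record` (or the output of `sessionInfo()`) next to the results makes later differences in output easier to trace back to a changed package or R version.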
Validation of software is a challenging and
time-consuming process, so not everyone
can afford it.
Very true. Time is money. One has to analyze the profitability and
then make a decision.
43. R in Clinical Research ► Myths and Facts
Errors happen often in non-commercial
software
That is true. There are no “paid testers”, only volunteers. This does
not mean they perform any worse, but it gives no
guarantee that they perform well either.
Errors happen in all software, including commercial software. Even in
top-quality medical devices (FDA notes this in its
guidance), radiation therapy devices (the Therac-25 medical accelerator
case), power plants, space rockets, and even in Mars rovers or
the Mariner 1 space probe.
There is no error-free software. There is only software that has not been
tested well enough.
Announcing errors publicly doesn’t make
people calm
Well, that is true. But hiding issues doesn’t make them less
dangerous.
“Transparency” is the most reliable way of cooperating with
software users. Programmers or end-users publicly announce
errors so the whole community can learn about them and react
quickly. Nothing is hidden, all the more so as this is Open Source.
How often are you informed about errors in your favorite
software with full details and the source code?
44. R in Clinical Research ► Myths and Facts
FDA: “Results should be reproducible and
independent of the software used to derive
them”
That is true. Results may differ between statistical packages.
Only a little, but still.
If FDA uses SAS for checking, we may get into trouble with
resampling methods, even with the same seed set.
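Within R itself, a resampling analysis can be made repeatable by fixing the seed and recording which random number generator was in use. The sketch below illustrates this with a simple bootstrap of the mean. The data values and the helper `boot_mean()` are made up for the example. The same seed in another package would not be expected to give the same draws, because the generators and sampling algorithms differ.

```r
# Record the generator in use alongside the seed
rng_used <- RNGkind()

# Hypothetical helper: bootstrap distribution of the mean for a fixed seed
boot_mean <- function(x, n_boot, seed) {
  set.seed(seed)
  replicate(n_boot, mean(sample(x, replace = TRUE)))
}

x <- c(5.1, 4.9, 6.2, 5.8, 5.5, 6.0)   # made-up measurements
run1 <- boot_mean(x, n_boot = 200, seed = 2018)
run2 <- boot_mean(x, n_boot = 200, seed = 2018)
identical(run1, run2)   # TRUE: same seed and same generator settings
```

For cross-software checks, a common fallback is to compare summary results within a tolerance, or to pass the same resampled indices to both systems.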
Creators of R packages don’t have to provide
(good) unit tests. It’s a matter of good will.
Yes. Even if authors were forced to write tests, nobody can guarantee the tests
are defined properly and bring any advantage.
45. R in Clinical Research ► Myths and Facts
Now myths.
FDA demands SAS for both the analysis and
producing datasets. No other software is
allowed.
No. FDA has never claimed that. This myth is repeated so often
that FDA issued an official “Statistical Software Clarifying Statement”.
R cannot generate datasets in SAS Transport
format
False. R can generate XPT files using the SASxport package.
The SAS Transport Format is an open format, published by
SAS Institute a long time ago:
1. https://www.loc.gov/preservation/digital/formats/....
2. http://documentation.sas.com....
R cannot cooperate with SAS, including
reading and writing SAS binary files
False. R can be combined with SAS in many ways. Check this out:
https://www.quora.com/How-can-I-integrate-SAS-with-R
SAS enabled direct communication between R and SAS in the
IML module in 2009.
R can read SAS7 binary data files and both read and write XPT files.
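As a rough sketch of an XPT round trip, the example below writes a small data frame to a transport file and reads it back. Note the assumption: it uses the `haven` package rather than the SASxport package named above, and it only runs if `haven` is installed. The dataset is made up for illustration.

```r
dm <- data.frame(USUBJID = c("001", "002"), AGE = c(34, 51))  # made-up data
xpt_path <- tempfile(fileext = ".xpt")
round_trip_ok <- NA   # stays NA when haven is not installed

if (requireNamespace("haven", quietly = TRUE)) {
  # version = 5 requests the classic transport format
  haven::write_xpt(dm, xpt_path, version = 5)
  back <- haven::read_xpt(xpt_path)
  round_trip_ok <- isTRUE(all.equal(as.data.frame(back)$AGE, dm$AGE))
  unlink(xpt_path)
}
print(round_trip_ok)
```

The same package also offers `haven::read_sas()` for reading SAS binary data files.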
R cannot be validated as well as commercial
software
False. R can be validated just as well. In fact, there is at least one
company offering a validated version of R: Mango.
Commercial software doesn’t have errors
The facts clearly contradict this claim.
R is full of bugs (errors) as nobody controls it
Errors do happen in third-party packages, but there is no sign
of an increased rate of bug reports.
R poorly supports the generation of TFLs
False. There are packages for creating advanced graphs
(ggplot2), Word documents, OpenDocument files, RTF and PDF.
All tasks can be automated since R is a programming language.
46. R in Clinical Research ► Myths and Facts
R is limited in terms of implemented statistical
methods
We have just seen how rich the R statistical library is. It is the
most complete library after SAS (plus a few more routines).
Nobody uses R (or Open Source in general)
in pharma industry (or in “serious business”).
Maybe in academia, which is not a kind of a
serious business.
False. We saw a few slides ago that pharmaceutical
companies do use R.
Not to mention the non-clinical representatives of “serious
business”.
47. R in Clinical Research ► Myths and Facts
R doesn’t meet 21 CFR
Part 11, which is a must
Let me quote this: “Whoever told you that is not well-informed. CFR Part 11 has to do
with critical software that runs medical devices and about certain primary data
management software. It does not apply to statistical analysis software. We use R all
the time in industry-sponsored and NIH sponsored clinical trials. You do not need to
seek FDA's approval. FDA accepts all comers and does not dictate software policy for
analysis. They even accept Excel and Minitab for NDAs. There are many messages
related to this in the r-help archive; please look at them.”
Frank E Harrell Jr
Professor and Chair School of Medicine, Department of Biostatistics
Vanderbilt University
Source
And this: “Records submitted to FDA, under predicate rules in electronic format [are Part
11 records]. However, a record that is not itself submitted, but is used in generating a
submission, is not a part 11 record unless it is otherwise required to be maintained under
a predicate rule and it is maintained in electronic format.”
Therefore, it is not mandated that 21 CFR Part 11 is appropriate to data analysis
software systems that are not primarily intended for storage and transmission of
electronic medical records. It remains the responsibility of an individual organization
however to define the applicability of Part 11 and validation to their systems.
R: Regulatory Compliance and Validation, 11 March 25, 2018
Source
Formal confirmation: Statistical Software Clarifying Statement by FDA Source
48. R in Clinical Research ► Myths and Facts
R has no SAS-like “LOG”, which records
everything
Yes, R doesn’t have a “LOG”, but with RMarkdown (or knitr,
Sweave, odfWeave) and following the Reproducible Research
paradigm, the “LOG” can easily be reproduced.
The generated HTML (or PDF, DOCx) document contains both
the code and the corresponding results combined.
In addition, storing the “LOG” in a version control repository (SVN, Git)
allows the analyst to version it and track
changes. This gives a high level of confidence.
A less sophisticated, yet fully valid, method can be implemented
with the “sink()” function.
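A minimal sketch of the `sink()` approach, using base R only. The log location (a temporary file) and the analysis inside it (a regression on the built-in `cars` dataset) are assumptions made for illustration.

```r
log_file <- tempfile(fileext = ".log")

con <- file(log_file, open = "wt")
sink(con)                     # divert printed output to the log
sink(con, type = "message")   # divert messages and warnings as well

cat("Analysis started:", format(Sys.time()), "\n")
fit <- lm(dist ~ speed, data = cars)
print(summary(fit))
message("Model fitted on ", nrow(cars), " observations")

sink(type = "message")        # restore the message stream first
sink()                        # then restore normal output
close(con)

log_lines <- readLines(log_file)
```

Unlike a SAS log, this captures output and messages but not the code itself. Running a script with `source(file, echo = TRUE)` while the sink is active is one way to have the statements echoed into the log too.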
There is no commercial support (product
and/or validation)
False. The Mango ValidR product is a good example. Revolution also
offered paid support. This refers only to certain packages (mostly
from the “base” set).
The entire software (all packages and
functions) must be validated
No. It has to be done properly and sufficiently.
Practice shows that only the part of R code that is actually used must be
validated. If a package contains 1000 functions, while only two of
them are used, only those two functions have to be validated. If a
validated function X calls an unvalidated function Y, the result
subject to validation is still the one returned by function X under
the given parameters and conditions.
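Validating only the functions in use can be sketched as a small table of checks, each comparing a function's result with a reference value under stated inputs. In the example below the helper `validate_result()` is hypothetical, and the reference values are derived by hand for a made-up vector. In practice they would come from an independent, documented source.

```r
# Hypothetical helper: compare a computed value with a reference value
validate_result <- function(name, computed, reference, tolerance = 1e-8) {
  ok <- isTRUE(all.equal(computed, reference, tolerance = tolerance))
  data.frame(test = name, computed = computed, reference = reference,
             passed = ok, stringsAsFactors = FALSE)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # mean 5, sum of squared deviations 32

report <- rbind(
  validate_result("mean(1:3)", mean(1:3), 2),
  validate_result("mean(x)",   mean(x),   5),
  validate_result("var(x)",    var(x),    32 / 7),
  validate_result("sd(x)",     sd(x),     sqrt(32 / 7))
)
print(report)
stopifnot(all(report$passed))
```

The printed report, stored together with the R and package versions, is the kind of objective evidence a validation record can refer to.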
49. R in Clinical Research ► Myths and Facts
Commercial software releases the end-user
from any responsibility regarding the
validation.
No. Let’s quote FDA: “All production and/or quality system
software, even if purchased off-the-shelf, should have
documented requirements that fully define its intended use, and
information against which testing results and other evidence can
be compared, to show that the software is validated for its
intended use.”
Source: General Principles of Software Validation - Final Guidance
for Industry and FDA Staff
50. R in Clinical Research ► What does FDA say?
Now, let us see what FDA has said about:
The use of any software in clinical research. This is the KEY.
The process of validation of the software
Then let us look at what some FDA-related people say about R
51. R in Clinical Research ► What does FDA say?
The use of any software in clinical research ( + 21 CFR part 11 status)
https://www.fda.gov/downloads/forindustry/datastandards/studydatastandards/ucm587506.pdf
52. R in Clinical Research ► What does FDA say?
The process of validation of the software
https://www.fda.gov/downloads/medicaldevices/.../ucm085371.pdf
[…] FDA considers software validation to be: “confirmation by examination and
provision of objective evidence that software specifications conform to user
needs and intended uses, and that the particular requirements implemented
through software can be consistently fulfilled.”
General Principles of Software Validation
Final Guidance for Industry and FDA Staff
53. R in Clinical Research ► What does FDA say?
This document […] can be applied to any software.
[…]
This document does not specifically identify which software is or is not regulated
[…]
The management and control of the software validation process should not be
confused with any other validation requirements, such as process validation for an
automated manufacturing process (so the regular validation of clinical programs doesn’t count)
[…]
design input requirements must be documented, and that specified requirements
must be verified
[…]
Success in accurately and completely documenting software requirements is a crucial
factor in successful validation of the resulting software.
54. R in Clinical Research ► What does FDA say?
A specification is defined as “a document that states requirements.”
[…]
There are many different kinds of written specifications, e.g., system requirements
specification, software requirements specification, software design specification,
software test specification, software integration specification, etc
[…]
Software verification provides objective evidence that the design outputs of a
particular phase of the software development life cycle meet all of the specified
requirements for that phase. Software verification looks for consistency,
completeness, and correctness of the software and its supporting
documentation, as it is being developed, and provides support for a subsequent
conclusion that software is validated.
55. R in Clinical Research ► What does FDA say?
Software validation is a part of the design validation for a finished device, but is not
separately defined in the Quality System regulation. For purposes of this guidance,
FDA considers software validation to be “confirmation by examination and
provision of objective evidence that software specifications conform to user
needs and intended uses, and that the particular requirements implemented
through software can be consistently fulfilled.”
SOFTWARE VERIFICATION ≠ SOFTWARE VALIDATION
Production: are R and all packages done well?
Installation and work: mean( 1:3 ) == 2 ?
56. R in Clinical Research ► What does FDA say?
Software validation includes confirmation of conformance to all software
specifications and confirmation that all software requirements are traceable to the
system specifications.
requirements
documentation
specification
of the system
( verification ) + validation
The system confirmation
57. R in Clinical Research ► What does FDA say?
Because of its complexity, the development process for software should be even
more tightly controlled than for hardware, in order to prevent problems that cannot
be easily detected later in the development process.
[…]
Seemingly insignificant changes in software code can create unexpected and
very significant problems elsewhere in the software program. The software
development process should be sufficiently well planned, controlled, and documented
to detect and correct unexpected results from software changes.
58. R in Clinical Research ► What does FDA say?
SECTION 4. PRINCIPLES OF SOFTWARE VALIDATION
4.9. INDEPENDENCE OF REVIEW
Validation activities should be conducted using the basic quality
assurance precept of “independence of review.” Self-validation is
extremely difficult. When possible, an independent evaluation is
always better, especially for higher risk applications.
Validator Builder
59. R in Clinical Research ► What does FDA say?
The software requirements specification document should contain a written definition of the software functions.
It is not possible to validate software without predetermined and documented software requirements.
Typical software requirements specify the following:
All software system inputs
All software system outputs
All functions that the software system will perform
All performance requirements that the software will meet, (e.g., data throughput, reliability, and timing)
The definition of all external and user interfaces, as well as any internal software-to-system interfaces
How users will interact with the system
What constitutes an error and how errors should be handled
Required response times
The intended operating environment for the software, if this is a design constraint (e.g. hardware platform,
operating system)
All ranges, limits, defaults, and specific values that the software will accept
All safety related requirements, specifications, features, or functions that will be implemented in software
60. R in Clinical Research ► What does FDA say?
The vendor’s life cycle documentation, such as testing protocols and results, source code, design
specification, and requirements specification, can be useful in establishing that the software has
been validated. However, such documentation is frequently not available from commercial
equipment vendors, or the vendor may refuse to share their proprietary information.
Now let’s stop for a moment and quickly summarize what we have learned so far:
commercial software | open-source software
No source code | Source code provided
No proprietary technical information | Full documentation provided (if available)
Assurance: “we did our best” | Assurance: “we did our best”
No guarantee | No guarantee
Support | No hot-line, but a very active community
“…millions of people use that” | “…millions of people use that”
Full trust: “it’s paid = validated well” | Low trust: “free things are poorly made”
61. R in Clinical Research ► What does FDA say?
The process of validation of the software
https://www.fda.gov/downloads/MedicalDevices/.../ucm073779.pdf
Guidance for Industry, FDA Reviewers
and Compliance on
Off-The-Shelf Software Use in Medical Devices
This is another essential document. A must-read.
We are not going to analyze it thoroughly, but you are strongly
encouraged to familiarize yourself with it.
62. R in Clinical Research ► What does FDA say?
Introduction to the controlled environment
https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm070266.pdf
Guidance for Industry
Computerized Systems Used in Clinical Investigations
S.O.P
Dependability System Documentation System Controls
Change Control Documentation Training of Personnel
63. R in Clinical Research ► What do FDA-related people say?
http://user2007.org/program/presentations/soukup.pdf
64. R in Clinical Research ► What do FDA-related people say?
Hey! Do read it again!
(see the next page)
65. R in Clinical Research ► What do FDA-related people say?
1. Use of R functions without proper validation is at the organization's risk.
It seems that nobody forces us to validate R. It is up to us, so we had
better do it (see the next page).
2. Results should be reproducible and independent of the software used
to derive them.
Taken literally, this is impossible “by definition”: SAS, R, Stata and SPSS may return
different results even for quantiles, or because of floating-point representation!
The results should be as close to each other as possible, but what about resampling
methods (SAS and R give different random numbers for the same seed)?
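The remarks about quantiles and floating-point representation can be checked in R itself. Below is a minimal sketch using a made-up data vector. It only shows that the nine definitions available through the `type` argument of `quantile()` disagree with each other; which definition another package uses by default should be looked up in that package's documentation rather than assumed.

```r
# Illustrative data only (not taken from any trial)
x <- 1:10

# The same 25th percentile under three of the nine definitions in quantile()
q_type7 <- quantile(x, probs = 0.25, type = 7, names = FALSE)  # R's default
q_type2 <- quantile(x, probs = 0.25, type = 2, names = FALSE)
q_type6 <- quantile(x, probs = 0.25, type = 6, names = FALSE)
print(c(type7 = q_type7, type2 = q_type2, type6 = q_type6))

# Floating-point representation: the exact comparison fails,
# the comparison with a tolerance passes
exact_equal    <- (0.1 + 0.2) == 0.3
tolerant_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(c(exact = exact_equal, tolerant = tolerant_equal))
```

On this vector the three definitions give 3.25, 3 and 2.75, so a "difference between packages" may simply be a difference between definitions.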
66. R in Clinical Research ► What do FDA-related people say?
Another argument
for validating R
67. R in Clinical Research ► What do FDA-related people say?
68. R in Clinical Research ► What do FDA-related people say?
69. R in Clinical Research ► What does it mean “to validate”? Why do we want this?
Finally, we have reached this point. Let us now try to answer this question in layman's terms:
“To validate” means to ensure that R does all the calculations properly.
But to confirm this, we need to check dozens of components, packages and functions.
Remember:
FDA doesn’t tell you what exactly should be validated (which functions). You decide.
The analysis of risk and validation coverage is entirely up to you.
It is our responsibility to do it WELL.
Why? Validation is also there to protect you and let you sleep well.
Try to think of it this way: once done properly, it gives you a reliable, powerful tool.
70. R in Clinical Research ► Preparing R to enter the industry
• We know FDA allows us to use R in submissions
• We know what FDA wants from us and have some advice on how to do it
• We have the source code for both R Core and every package
• Most packages refer to textbooks and point to specific formulas
• The R Core Team prepared a very important document on this topic
• R has tools for unit testing
• Reference data for testing are available on the Internet or can be obtained
• There are tools allowing the system maintainer to protect (“to freeze”) the newly validated environment against changes.
What tools do we have and what is to be done?
71. R in Clinical Research ► Preparing R to enter the industry
Validation is incremental. Once validated, a function doesn’t have to be re-validated
until it is updated. Of course we can validate it many times (which I
recommend), and that is easy with automated tools.
Only the functions actually used have to be tested. Unused code means non-existent code.
Accumulating test cases over time significantly improves the validation
process. Every new trial is a source of new, real data, perfect for testing.
And a bonus
72. R in Clinical Research ► Preparing R to enter the industry
The R-FDA.PDF document is a giant milestone. It makes a perfect starting point for
establishing one's own controlled R-based environment.
For obvious reasons it is limited to a small subset of packages, labelled “Base” and
“Recommended”.
These packages don’t cover the complete set of statistical routines used in clinical
research, but will definitely allow one to start with advanced analysis employing:
• linear mixed models (with given covariance structure), generalized additive models,
• survival analysis,
• accessing data generated by external statistical packages,
• resampling (bootstrap),
• and tons of statistical tests,
• plotting (low-level and quite advanced via the “lattice” package) and much more.
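To see which packages in a given installation carry the “Base” and “Recommended” labels, one can query the installed library. This is a minimal sketch using base R only; the variable name `core_pkgs` is just an illustration, and the output depends on the local installation.

```r
# List the packages labelled "base" or "recommended" in this R installation,
# with their versions. priority = "high" selects both labels at once.
pkg_matrix <- installed.packages(priority = "high")[
  , c("Package", "Version", "Priority"), drop = FALSE
]
rownames(pkg_matrix) <- NULL  # drop row names before converting to a data frame
core_pkgs <- as.data.frame(pkg_matrix, stringsAsFactors = FALSE)
print(core_pkgs)
```

Such a listing can serve as the first inventory of what the R-FDA.PDF document actually covers on a particular machine.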
73. R in Clinical Research ► Preparing R to enter the industry
https://www.r-project.org/doc/R-FDA.pdf
74. Validation ► Validation of installation vs. numerical validation
What aspects of an R-based computing environment can be validated?
• The process of installation of the core R
• The process of installation of required packages (version)
• The quality of code in installed packages (code metrics)
• Coverage by unit tests defined in installed packages
• The outcome of these unit tests
Thought #1: an incorrectly installed R or package will not work properly, or will not even
launch. It is useless.
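A basic check of the package installation can be sketched as a comparison of installed versions against a manifest. The manifest and the function name `check_installation` are hypothetical. In this sketch the expected versions are read from the running session so that the example passes anywhere; in a real setting they would be fixed values kept under change control.

```r
# Hypothetical manifest of what the validated environment should contain
manifest <- data.frame(
  package  = c("stats", "utils"),
  expected = c(as.character(packageVersion("stats")),
               as.character(packageVersion("utils"))),
  stringsAsFactors = FALSE
)

check_installation <- function(manifest) {
  installed <- vapply(manifest$package, function(pkg) {
    # requireNamespace() confirms that the package can actually be loaded
    if (requireNamespace(pkg, quietly = TRUE)) {
      as.character(packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1), USE.NAMES = FALSE)
  manifest$installed <- installed
  manifest$ok <- !is.na(installed) & installed == manifest$expected
  manifest
}

report <- check_installation(manifest)
print(report)
cat("R version in use:", as.character(getRversion()), "\n")
```

A package that is missing or cannot be loaded shows up with `ok = FALSE`, which is the situation described in Thought #1.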
75. Validation ► Validation of installation vs. numerical validation
What aspects of an R-based computing environment can be validated?
• The correctness of calculations performed by selected functions in selected packages.
Thought #2: a correctly installed R or package that returns wrong
results is not merely useless, it is extremely dangerous!
Well-done Validation = Validation of installation + Numerical validation
76. Validation ► Numerical validation ► Methods
How to validate a module numerically?
• By comparing results with reference data obtained from a trusted source (good!):
o trial versions (if the license permits) of other statistical packages
o asking someone who holds a valid license to run a certain analysis on given data
o publicly available documentation with examples
• By comparing results with calculations done by hand, step by step (makes sense only for easy methods)
• By inspecting the code and comparing the implemented formula with the reference in the corresponding textbook (so-so, but allows one to find issues)
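The "calculation by hand" approach can be illustrated with an easy method, the one-sample t-test. This is a sketch on made-up numbers: the statistic and the two-sided p-value are computed step by step from the textbook formula and then compared with what `t.test()` returns.

```r
# Illustrative data only
x  <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1)
mu <- 5

# Step-by-step calculation of the one-sample t-test
n      <- length(x)
t_hand <- (mean(x) - mu) / (sd(x) / sqrt(n))
p_hand <- 2 * pt(-abs(t_hand), df = n - 1)

# The function under validation
res <- t.test(x, mu = mu)

# Both the statistic and the p-value should agree within numerical tolerance
agree <- isTRUE(all.equal(unname(res$statistic), t_hand)) &&
  isTRUE(all.equal(res$p.value, p_hand))
print(agree)
```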
77. Validation ► Numerical validation ► Methods
How to validate a module numerically?
Comparison has to be done with some tolerance, as two statistical packages
are likely to differ slightly in their results for numerous reasons, such as:
• Different ways of storing floating-point numbers
• Different approaches to calculating quantiles
• Different rounding algorithms
• Different default contrasts
• Different type of Sums of Squares used
• Different random number generators (for the same seed)
• Different corrections applied to a method (different rules of choice)
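Three of the sources of discrepancy listed above can be reproduced within R alone. The sketch below is illustrative: the helper names `matches_reference` and `round_half_away` are made up, the tolerance of 1e-6 is an arbitrary choice, and the two random number generators stand in for "two different packages".

```r
# 1. Comparison with a tolerance (all.equal uses a relative difference)
matches_reference <- function(result, reference, tolerance = 1e-6) {
  isTRUE(all.equal(result, reference, tolerance = tolerance))
}
close_enough <- matches_reference(0.1234567, 0.1234568)
too_far      <- matches_reference(1, 1.01)

# 2. Rounding: R rounds halves to the even digit, while rounding
#    half away from zero is a common alternative rule
round_half_away <- function(x) sign(x) * floor(abs(x) + 0.5)
r_rounding    <- round(2.5)
away_rounding <- round_half_away(2.5)

# 3. Random numbers: same seed, different generator, different stream
old_kind <- RNGkind()
set.seed(123, kind = "Mersenne-Twister"); u_mt <- runif(3)
set.seed(123, kind = "Wichmann-Hill");    u_wh <- runif(3)
RNGkind(kind = old_kind[1])  # restore the generator that was active before

print(c(close_enough = close_enough, too_far = too_far))
print(c(r = r_rounding, half_away = away_rounding))
print(rbind(u_mt, u_wh))
```

For resampling methods this means that reference results can only be matched in distribution, not digit by digit, unless the same generator and seed are used on both sides.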
78. Validation ► Numerical validation ► Methods
How to validate a module numerically?
The collected items:
• Statistical method name
• Values of relevant parameters
• Input data set provided to the reference software
• The outcome returned by the reference software
…can then be wrapped in so-called “unit tests” and stored in a
repository. A unit-testing engine queries the repository, fetches the test
definitions and passes them to the appropriate functions in a fully automated
manner. Each tested function returns a result, which is compared to the
reference. At the end the engine generates a validation report.
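The repository-and-engine idea can be sketched in a few lines of base R. Everything here is hypothetical: the repository is a plain list, `run_validation` is a made-up name, and the reference values were computed by hand for illustration rather than taken from any reference software. A real setup would more likely build on a dedicated unit-testing package.

```r
# Hypothetical test repository: method name, parameters, input data and
# the reference outcome (hand-computed here, for illustration only)
test_repository <- list(
  list(method = "mean",   params = list(),           input = 1:10,          reference = 5.5),
  list(method = "var",    params = list(),           input = 1:10,          reference = 82.5 / 9),
  list(method = "median", params = list(),           input = c(1, 3, 5, 7), reference = 4),
  list(method = "mean",   params = list(trim = 0.1), input = c(1:9, 100),   reference = 5.5)
)

# The "engine": fetch each definition, call the function, compare, report
run_validation <- function(repository, tolerance = 1e-8) {
  rows <- lapply(repository, function(tc) {
    result <- do.call(tc$method, c(list(tc$input), tc$params))
    data.frame(
      method    = tc$method,
      result    = result,
      reference = tc$reference,
      passed    = isTRUE(all.equal(result, tc$reference, tolerance = tolerance)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

validation_report <- run_validation(test_repository)
print(validation_report)
```

Adding a test case then means adding one entry to the repository, which is what makes the accumulation of cases over successive trials cheap.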
79. Validation ► Validation of installation
https://www.londonr.org/wp-content/uploads/sites/2/presentations/LondonR_-_Challenges_Of_Validating_R_-_Chris_Campbell_-_20140617.pdf
80. Fixing the environment and controlling for changes
How to prevent the environment from being “invalidated”?
• Prevent users from updating the R core
• Prevent users from installing “illegal” (not validated) packages:
o “foreign” packages (not in local use)
o packages in a different version
BUT!
• Each project may require a different set of packages in different versions
• Certain projects may require installation of new (not yet validated) packages
• New packages are created within the company
81. Fixing the environment and controlling for changes
How to prevent the environment from being “invalidated”?
• Docker containers (Rocker)
• Read-only environment on a CD or DVD (slow!)
• Portable version of R with “broken” .libPaths
• Isolation of the workstation from the Internet (so cruel!)
• Local repository of packages (in different versions): miniCRAN
• The checkpoint solution, based on MRAN
• The packrat solution, combined with miniCRAN
• Employing a version control system, like SVN or Git
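Whichever of these mechanisms is chosen, it can be complemented by a simple audit that compares what is installed with what has been approved. The sketch below is an assumption-laden illustration: `audit_library` is a made-up helper, and the package names and versions in both tables are invented. In practice the `installed` table could be built from `installed.packages()` for the library in question.

```r
# Hypothetical approved list for a project (versions are made up)
approved <- data.frame(
  package = c("survival", "nlme"),
  version = c("2.42-3", "3.1-137"),
  stringsAsFactors = FALSE
)

# Made-up snapshot of a library to be audited
installed <- data.frame(
  package = c("survival", "nlme", "ggplot2"),
  version = c("2.42-3", "3.1-131", "3.0.0"),
  stringsAsFactors = FALSE
)

# Flag packages that are not approved or are present in a different version
audit_library <- function(installed, approved) {
  merged <- merge(installed, approved, by = "package", all.x = TRUE,
                  suffixes = c("_installed", "_approved"))
  merged$status <- ifelse(
    is.na(merged$version_approved), "not approved",
    ifelse(merged$version_installed == merged$version_approved,
           "ok", "wrong version")
  )
  merged
}

audit <- audit_library(installed, approved)
print(audit[audit$status != "ok", ])
```

Keeping the approved list itself under version control ties this check to the change-control items above.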