R and Reproducibility
A Proposal
David Smith
useR! 2014
What is Reproducibility?
“The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood and verified.”
CRAN Task View on Reproducible Research (Kuhn)
• Method + Environment -> Results
• A process for:
– Sharing the method
– Describing the environment
– Recreating the results
xkcd.com/242/
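"Describing the environment" can start with what base R already reports. The following is a minimal sketch, not part of the talk: `describe_environment` is a hypothetical helper name, and it only records the R version, the platform, and the versions of the packages attached in the current session.

```r
# Hypothetical helper: summarise the environment a result was produced in.
describe_environment <- function() {
  info <- sessionInfo()
  # Base packages plus any packages attached with library()/require()
  attached <- c(info$basePkgs, names(info$otherPkgs))
  versions <- vapply(
    attached,
    function(p) as.character(packageVersion(p)),
    character(1)
  )
  list(
    r_version = as.character(getRversion()),
    platform  = info$platform,
    packages  = versions
  )
}

env <- describe_environment()
str(env)
```

Saving this list next to the results gives a reader the "Environment" half of Method + Environment, although it does not by itself make the same package versions installable later.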
Why care about reproducibility?
Academic / Research
• Verify results
• Advance Research
Business
• Production code
• Reliability
• Reusability
• Regulation
www.nytimes.com/2011/07/08/health/research/08genes.html
http://arxiv.org/pdf/1010.1092.pdf
R and Reproducibility
• Results
– Hand-assembled
– Sweave/knitr/DeployR/Shiny
• Interfaces
– R GUI / DevelopR / RStudio
– Batch / Web Services
• Platform
– OS / Virtualization
– Hardware Architecture
• Packages
– CRAN
– BioConductor / GitHub / …
• R Engine
– R Version
– Base + Recommended pkgs
Observations
• R versions are pretty manageable
– Major versions just once a year
– Patches rarely introduce incompatible changes
• Good solutions for literate programming
– Interfaces help
• OS/Hardware not the major cause of problems
• The big problem is with packages
– CRAN is in a state of continual flux
Package Problem #1: The User
http://xkcd.com/234/
I heard you need to create a TPS Report. Here, I’ve got an R script that does that already.
Oh, you need to download these 5 packages first.
I already did, and it still doesn’t work!
Well, it worked when I wrote it 3 weeks ago.
Grr.
Package updates…
Package Problem #2: The Author
http://xkcd.com/970/
Time to update my package on CRAN!
>> Dependent packages that now fail to build: 67
>> Resubmit your package and try again
Crap.
Package Problem #3: The Update
http://xkcd.com/664/
3 days later…
Woot! A new version of R is out! I have 10 minutes now, time to download and install!
… package not found …
… can’t install package …
… error …
The Proposal
• Change the default way R handles packages
– Install packages local to projects
• “Snapshot” CRAN daily
– Make it easy to get & use package versions used in script development
Not a new idea!
– Ooms, “Possible Directions for Improving Dependency Versioning in R”, R Journal 5/1
– BioConductor Project
– Revolution R Enterprise
– Linux distros
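The first half of the proposal, installing packages local to a project, can be approximated with base R alone. This is a minimal sketch under stated assumptions, not the mechanism proposed in the talk: the project folder name is hypothetical and a temporary directory stands in for a real project.

```r
# Hypothetical project folder; tempdir() stands in for a real project path.
project_dir <- file.path(tempdir(), "tps-report")
project_lib <- file.path(project_dir, "library")
dir.create(project_lib, recursive = TRUE, showWarnings = FALSE)

# Put the project library first on the search path. install.packages()
# installs into the first entry of .libPaths() by default, and library()
# searches the entries in order.
.libPaths(c(project_lib, .libPaths()))

.libPaths()[1]
# install.packages("ggplot2")  # would now install into project_lib
```

Because the change only lasts for the session, a real project would need to repeat it at startup, which is the kind of default the proposal argues R should handle itself.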
Example
• R script file using the 6 most popular packages
Sharing a script reproducibly … and simply
# Run with R 3.1.0
require(RRT)
mran_set(snapshot="2014-06-27")
# find packages used in this project
# get package versions used by script author
# install locally to this project
require(ggplot2)
require(data.table)
require(knitr) …
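The "# Run with R 3.1.0" comment in the script above is only advice to the reader. As a hedged sketch (the helper name `check_r_version` is hypothetical and is not part of RRT), the same expectation can be expressed as a check using base R's version utilities:

```r
# Hypothetical helper: warn when the running R is older than the version
# the script was written for.
check_r_version <- function(required, current = getRversion()) {
  ok <- as.numeric_version(current) >= as.numeric_version(required)
  if (!ok) {
    warning(sprintf(
      "script was written for R %s, but this is R %s",
      as.character(required), as.character(current)
    ))
  }
  invisible(ok)
}

check_r_version("3.1.0")
```

The check only guards against an older R; it makes no claim that a newer R behaves identically.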
RRT: The R Reproducibility Toolkit
• Open Source R Package (GPLv2)
• From an R project folder:
– Detect packages & dependencies used in project
– Download and install from MRAN
– Versions selected according to script date
– Find and use packages from local install
github.com/RevolutionAnalytics/RRT
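The "detect packages used in project" step can be illustrated with a naive scan for `library()` and `require()` calls. This is an illustrative sketch only: `find_packages` is a hypothetical function, it is not how RRT itself is implemented, and it misses packages referenced as `pkg::fun` as well as indirect dependencies.

```r
# Hypothetical helper: list packages named in library()/require() calls.
find_packages <- function(lines) {
  # Naively drop comments (everything after '#')
  code <- sub("#.*$", "", lines)
  pattern <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  calls <- unlist(regmatches(code, gregexpr(pattern, code, perl = TRUE)))
  # Keep only the captured package name from each matched call
  pkgs <- sub(pattern, "\\1", calls, perl = TRUE)
  sort(unique(pkgs))
}

script <- c(
  "# Run with R 3.1.0",
  "require(ggplot2)",
  "require(data.table)",
  "require(knitr)"
)
find_packages(script)
```

In a project folder, the same function could be applied to `readLines()` of each `.R` file.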
MRAN - Implementation
A downstream CRAN mirror with daily snapshots
• Use rsync to mirror CRAN daily
– Only downloads changed packages
• Use zfs to store incremental snapshots
– Storage only required for new packages
• Organize snapshots into a labelled hierarchy
– Access package versions by date of use
• CRAN snapshot server hosted by cloud provider
– Provisioned for availability and latency
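"Access package versions by date of use" implies mapping a requested date onto a stored snapshot. The sketch below shows one possible rule: pick the latest snapshot on or before the requested date. The function name, the fallback rule, and the `snapshot/<date>` path layout are all assumptions made for illustration; the talk does not describe MRAN's actual layout or lookup.

```r
# Hypothetical lookup: latest available snapshot on or before a given date.
resolve_snapshot <- function(requested, available) {
  requested <- as.Date(requested)
  available <- sort(as.Date(available))
  eligible <- available[available <= requested]
  if (length(eligible) == 0) {
    stop("no snapshot on or before ", format(requested))
  }
  format(max(eligible))
}

snapshots <- c("2014-06-26", "2014-06-27", "2014-06-28")
chosen <- resolve_snapshot("2014-06-27", snapshots)
file.path("snapshot", chosen)  # assumed directory layout, for illustration
```

With daily snapshots the requested date would normally match exactly; the fallback only matters if a day is missing.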
Future work
• Just getting started!
• Snapshot binaries and source packages
• Other repos (BioConductor, GitHub, user)
• Institution-level package duplication
– CRAN “behind the firewall”
• User-defined package versions
• Checks on R versions
• Suggestions welcome!
github.com/RevolutionAnalytics/RRT
Thank You!
David Smith
david@revolutionanalytics.com
blog.revolutionanalytics.com
Possible Solution
• Bundle all packages with scripts
• Packrat solves this very well
– Project + package dependencies stored in GitHub
• But:
– Contributes to package fragmentation
– Adds friction to the sharing process
– Doesn’t address the problem for R generally
CRAN vs GitHub
CRAN
• “Repository of Record”
– Default for R users
• Strict quality checking
• Handles dependencies
• Binaries built
– But only current versions saved
• Manual update process
• Dependent on volunteer support
GitHub
• Frictionless publishing / updates
– RStudio integration
• Social development
– Pull requests FTW
• Ease of updates
• Fragmented – no unified directory of packages
• Permanence – accounts closed / repos deleted
A downstream CRAN solution?
“I don't see why CRAN needs to be involved in this effort at all. A third party could take snapshots of CRAN at R release dates, and make those available to package users in a separate repository. It is not hard to set a different repository than CRAN as the default location from which to obtain packages.”
-- R-core member, r-devel, March 2014
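The quote's last point, that setting a different default repository is not hard, corresponds to the `repos` option in base R. A minimal sketch follows; the snapshot URL is a made-up placeholder, not a real MRAN address.

```r
# Placeholder URL for a dated snapshot repository (hypothetical).
snapshot_repo <- "https://snapshots.example.org/2014-06-27"

# Make the snapshot the default repository for this session,
# remembering the previous setting so it can be restored.
old <- options(repos = c(CRAN = snapshot_repo))
active <- getOption("repos")[["CRAN"]]
active
# install.packages("data.table")  # would now query snapshot_repo

options(old)  # restore the previous default
```

Placing the `options()` call in a project's `.Rprofile` would make the choice persistent for that project.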
Snapshot CRAN repository: requirements
• Availability
• Latency
• Bandwidth
• Storage
• Binary package archives
• Other enhancements?
Proposal
• CRAN: “Development Branch”
• MRAN: “Stable Branch” (downstream, reproducible)
Defaults are important!


Editor's notes

  1. http://xkcd.com/242/
  2. https://stat.ethz.ch/pipermail/r-devel/2014-March/068552.html