Managing the analysis of high-throughput data

It’s not so much about the tools, it’s the attitude
Javier Quilez1,2, Enrique Vidal1,2, François Le Dily1,2, François Serra1,2,3, Yasmina Cuartero1,2,3, Ralph Stadhouders1,2, Thomas Graf1,2, Marc A. Marti-Renom1,2,3,4, Miguel Beato1,2 and Guillaume Filion1,2
1Gene Regulation, Stem Cells and Cancer Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Spain
2Universitat Pompeu Fabra (UPF), Barcelona, Spain
3CNAG-CRG, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology (BIST), Dr. Aiguader 88, 08003 Barcelona, Spain
4ICREA, Pg. Lluis Companys 23, 08010 Barcelona, Spain
• High-throughput sequencing (HTS) experiments are pervasive in the life sciences; from small research groups to large-scale projects, HTS data accumulates at a rapid pace
• The human factor is the greatest hurdle to (i) analysing HTS data efficiently and (ii) meeting the FAIR (Findable, Accessible, Interoperable and Reusable) Principles
• To overcome these limitations we propose that: (i) crucial questions need to be addressed at an early stage of the project; (ii) scientific groups must develop habits and tools for sharing data and analyses; and (iii) data-producing teams focus on Documentation, Automation, Traceability and Autonomy
• Interested but don’t have the time or energy to keep reading? Check out our parable “Parallel sequencing lives, or what makes large sequencing projects successful”
What, when, how and who will have access to the sample metadata?
Systematically collect the metadata of the experiments
• Sequencing reads are not the only information derived from an HTS experiment

• Metadata provide information about HTS experiments that is required for analysing and sharing the data and for reproducing the results

• Very often, however, metadata are scattered, inaccurate, insufficient or even missing (especially for older samples)

• Collect the metadata systematically and before the processing of the data starts (Fig. 1a and Box 1)
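A minimal sketch of what systematic collection can look like once form entries reach a database: each record is validated before it is stored, and metadata are then retrieved programmatically by sample ID. The required fields, the table name `metadata` and the helper names are assumptions chosen for illustration (the column names follow the example table of Fig. 1a), and an in-memory SQLite database stands in for the local SQL database.

```python
import sqlite3

# Hypothetical set of required fields (names taken from the Fig. 1a example table).
REQUIRED_FIELDS = ("SAMPLE_ID", "CELL_TYPE", "TREATMENT", "TREATMENT_TIME")


def validate_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing value for {field}")
    return problems


def store_records(connection, records):
    """Insert valid records into the 'metadata' table; return the rejected ones."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "SAMPLE_ID TEXT PRIMARY KEY, CELL_TYPE TEXT, "
        "TREATMENT TEXT, TREATMENT_TIME INTEGER)"
    )
    rejected = {}
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected[record.get("SAMPLE_ID", "<no ID>")] = problems
            continue
        connection.execute(
            "INSERT INTO metadata VALUES (?, ?, ?, ?)",
            [record[field] for field in REQUIRED_FIELDS],
        )
    connection.commit()
    return rejected


def get_metadata(connection, sample_id):
    """Retrieve the metadata of one sample as a dictionary, or None if absent."""
    row = connection.execute(
        "SELECT SAMPLE_ID, CELL_TYPE, TREATMENT, TREATMENT_TIME "
        "FROM metadata WHERE SAMPLE_ID = ?",
        (sample_id,),
    ).fetchone()
    return dict(zip(REQUIRED_FIELDS, row)) if row else None


# In-memory database used here only so that the sketch runs anywhere.
connection = sqlite3.connect(":memory:")
rejected = store_records(connection, [
    {"SAMPLE_ID": "sample001", "CELL_TYPE": "T47D",
     "TREATMENT": "Untreated", "TREATMENT_TIME": 0},
    {"SAMPLE_ID": "sample002", "CELL_TYPE": "T47D",
     "TREATMENT": "", "TREATMENT_TIME": 60},  # incomplete on purpose
])
print(rejected)
print(get_metadata(connection, "sample001"))
```

Rejecting incomplete records at entry time, rather than discovering gaps during the analysis, is the design choice this sketch is meant to illustrate.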
Can samples be identified unambiguously?
Establish a system: give each sample a unique identifier (ID)
• Samples are often given names that are easy to remember for the person who performed the experiment

• This leads to sample swaps, as similar names can refer to different experiments; in addition, unsystematic naming prevents accessing samples programmatically, which invites errors and undermines the capability to automate the analysis

• Establish a scheme to uniquely identify samples and the associated (meta)data (Box 2)
Where are data and results?
Structured and hierarchical organisation of the data
• Data and results derived from HTS experiments are typically stored in an untidy manner

• Organise data in a structured and hierarchical manner reflecting the way data are generated and analysed: (1)
raw data, (2) processed data and (3) analysis results (Fig. 1b)
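The three levels can be encoded once in small path helpers so that every script finds files in the same place. This is a sketch under assumptions: the function names are hypothetical, and the `samples/` parent directory is our addition for illustration (the directory trees of Fig. 1b show per-sample directories without naming their parent).

```python
import tempfile
from pathlib import Path


def raw_data_dir(root, run_date):
    """(1) Raw data: one directory per sequencing run, named by its date."""
    return Path(root) / "runs" / run_date


def processed_data_dir(root, sample_id, kind, assembly=None):
    """(2) Processed data: one directory per sample, then per type of data.

    The 'samples' parent directory is an assumption of this sketch.
    """
    path = Path(root) / "samples" / sample_id / kind
    return path / assembly if assembly else path


def analysis_dir(root, project, date, name):
    """(3) Analysis results: one dated directory per analysis in a project."""
    return Path(root) / "projects" / project / f"{date}_{name}"


def create_analysis(root, project, date, name):
    """Create the analysis directory with its standard sub-directories."""
    base = analysis_dir(root, project, date, name)
    for sub in ("data", "figures", "tables", "scripts"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base


# A temporary directory is used so that the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    base = create_analysis(root, "project1", "2017-10-09", "diff_expression")
    created = sorted(path.name for path in base.iterdir())

print(created)
print(processed_data_dir("/data", "sample001", "alignments", "hg38"))
```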
Can multiple samples be seamlessly processed?
Scalability, parallelisation, automatic configuration and modularity of the code
• Data analysis is rarely a one-time task: (i) samples are sequenced at different time points (Fig. 1b), so core analysis pipelines have to be executed for every new sequencing batch; (ii) samples need to be re-processed when analysis pipelines are modified substantially; and (iii) downstream analyses are often repeated with different datasets or variables

• Automate the data processing as much as possible (Box 3 and Fig. 2a)
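As an illustration of the submission step described in Fig. 2a, the sketch below reads a configuration file and writes one pipeline script per listed sample. It is written in Python rather than shell for brevity; the configuration layout, the parameter names and the script name `pipeline_seq.sh` are assumptions, not the actual files behind this poster.

```python
import configparser
import shlex
import tempfile
from pathlib import Path

# Hypothetical configuration: shared parameters plus the list of samples.
EXAMPLE_CONFIG = """
[parameters]
processors = 8
assembly = hg38
modules = full

[samples]
ids = sample001 sample002 sample003
"""


def read_config(text):
    """Return (shared parameters, list of sample IDs) from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["parameters"]), parser["samples"]["ids"].split()


def write_pipeline_scripts(parameters, samples, out_dir):
    """Write one shell script per sample and return the script paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    scripts = []
    for sample in samples:
        lines = ["#!/bin/bash", f"# Pipeline script for {sample} (generated)"]
        # Shared parameters become shell variables, quoted for safety.
        lines += [f"{name.upper()}={shlex.quote(value)}"
                  for name, value in sorted(parameters.items())]
        lines.append(f"SAMPLE_ID={shlex.quote(sample)}")
        # 'pipeline_seq.sh' is a placeholder for the actual pipeline code.
        lines.append('bash pipeline_seq.sh "$SAMPLE_ID" "$ASSEMBLY" '
                     '"$PROCESSORS" "$MODULES"')
        script = out_dir / f"{sample}.sh"
        script.write_text("\n".join(lines) + "\n")
        scripts.append(script)
    return scripts


parameters, samples = read_config(EXAMPLE_CONFIG)
with tempfile.TemporaryDirectory() as out_dir:
    scripts = write_pipeline_scripts(parameters, samples, out_dir)
    script_names = [script.name for script in scripts]
    first_script = scripts[0].read_text()

print(script_names)
print(first_script)
```

Each generated script could then be submitted as an independent job to whatever scheduler the cluster provides; that step is left out because it depends on the local setup.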
Does anybody have the information to reproduce the results?
Documentation, documentation and documentation
• Results without documentation lead to poor understanding of the analysis and to irreproducibility, and make errors harder to identify

• Document all the parts involved in the analysis (from the raw data to the results) (Box 4)
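Two of the items listed in Box 4, checking the integrity of important files and recording the non-default variable values used, can be combined in a small run record written next to the results. The sketch below is one possible way to do it; the record layout, the file names and the `min_quality` parameter are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so that large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_record(record_path, sample_id, input_files, parameters):
    """Save the parameters used and the checksums of the input files as JSON."""
    record = {
        "sample_id": sample_id,
        "parameters": parameters,
        "checksums": {str(f): file_checksum(f) for f in input_files},
    }
    Path(record_path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


def changed_files(record_path):
    """List the recorded files whose content no longer matches the record."""
    record = json.loads(Path(record_path).read_text())
    return [name for name, expected in record["checksums"].items()
            if file_checksum(name) != expected]


with tempfile.TemporaryDirectory() as workdir:
    reads = Path(workdir) / "sample001_read1.fastq"
    reads.write_text("@read1\nACGT\n+\nFFFF\n")  # toy read, not real data
    record_path = Path(workdir) / "sample001.run.json"
    write_run_record(record_path, "sample001", [reads],
                     {"assembly": "hg38", "min_quality": 30})
    changed_before = changed_files(record_path)
    reads.write_text("@read1\nACGA\n+\nFFFF\n")  # simulate a corrupted file
    changed_after = changed_files(record_path)

print(changed_before)
print(changed_after)
```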
Can anybody make use of the data generated?
Empower experimenters to perform basic analysis via web applications
• Analysis workflows generate many files, which may not be accessible to users (too big to open or too difficult to manipulate) (Box 5)

• Implement interactive web applications to display the processed data and to perform specific analyses in a
user-friendly manner (Fig. 2b)

• Building interfaces for standard analyses frees bioinformaticians to focus on the most technical parts of the
project, while allowing all the members to contribute to the analyses (Box 5)

• The features of such web applications must be discussed with their potential users, because implementing
them requires effort and time
References
1. Sci Data. 2016;3:160018. doi:10.1038/sdata.2016.18

2. GigaScience. 2017;gix100. doi:10.1093/gigascience/gix100

3. https://daringfireball.net/projects/markdown/

4. http://jupyter.org/

5. https://www.rstudio.com/

6. https://shiny.rstudio.com/
Funding
We received funding from the European Research Council under the European Union's Seventh Framework
Programme (FP7/2007-2013)/ERC Synergy grant agreement 609989 (4DGenome). The content of this poster
reflects only the authors’ views and the Union is not liable for any use that may be made of the information
contained therein. We acknowledge support of the Spanish Ministry of Economy and Competitiveness,
‘Centro de Excelencia Severo Ochoa 2013-2017’ and Plan Nacional (SAF2016-75006-P), as well as support
of the CERCA Programme / Generalitat de Catalunya. Ralph Stadhouders was supported by an EMBO
Long-term Fellowship (ALTF 1201-2014) and a Marie Curie Individual Fellowship (H2020-MSCA-IF-2014).
@jaquol
@4DGenome
javier.quilez@crg.eu
Box 1. Features of a good metadata collection system
• Easy to parse for humans & computers
• Responsible for maintenance & metadata validation
• Future-aware & flexible
• Agreed & understood by people using it
Box 2. Unique IDs: connecting tubes, metadata & data
The Sample ID connects each tube to its metadata and its data:
• Metadata: Biological, Technical, Logistics, Application, User, Experiment (e.g. cell type, treatment, target protein, facility, run date, read length, species, SE/PE)
• Data: sequencing reads (FASTQ)
Computer-friendly IDs:
• Fixed length & pattern
• All lower or upper case
• Anticipate the maximum number of samples
Examples:
• #1: simple auto-incremental (sample001, sample002, …)
• #2: hash function applied to metadata (b1913e6c1_51720e9cf)
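Both schemes named in Box 2 are easy to generate programmatically. The sketch below is illustrative only: the function names and metadata fields are hypothetical, and building the hashed ID from one hash of biological fields and one of technical fields is an assumption that merely reproduces the shape of the Box 2 example (two 9-character blocks joined by an underscore).

```python
import hashlib


def incremental_id(number, prefix="sample", width=3):
    """Scheme #1: fixed-length auto-incremental ID, e.g. sample001."""
    if number < 0 or number >= 10 ** width:
        raise ValueError(f"{number} does not fit in {width} digits")
    return f"{prefix}{number:0{width}d}"


def short_hash(values, length=9):
    """Hash the normalised values and keep the first `length` hex characters."""
    text = ";".join(str(value).strip().lower() for value in values)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:length]


def hashed_id(metadata, biological_fields, technical_fields):
    """Scheme #2: ID derived from the metadata themselves.

    Assumption of this sketch: one hash for the biological fields and one
    for the technical fields, joined by an underscore.
    """
    biological = short_hash(metadata[field] for field in biological_fields)
    technical = short_hash(metadata[field] for field in technical_fields)
    return f"{biological}_{technical}"


# Hypothetical metadata record and field grouping.
metadata = {
    "CELL_TYPE": "T47D", "TREATMENT": "Progesterone", "TREATMENT_TIME": 60,
    "READ_LENGTH": 50, "SEQUENCING_TYPE": "PE",
}
biological = ("CELL_TYPE", "TREATMENT", "TREATMENT_TIME")
technical = ("READ_LENGTH", "SEQUENCING_TYPE")

print(incremental_id(1))
print(hashed_id(metadata, biological, technical))
```

With the second scheme, identical metadata always yield the same ID, so the ID itself reveals accidental duplicates; the price is that it is harder for people to read than an incremental one.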
Fig. 2. Automating data analysis and visualisation
(a) Scalability is achieved by having a submission script (‘*.submit.sh’) that generates as many
pipeline scripts as samples listed in the configuration file (‘*.config’), so that a pipeline is executed
simultaneously for multiple samples with a single command (gray rectangle). The configuration file
also contains the hard-coded parameters shared by all samples (e.g. number of processors or
genome assembly version). Parallelisation is obtained by (i) submitting each sample pipeline script as
an independent job in the computing cluster, if there is one, where it will be queued (orange) and
eventually executed (green), and (ii) adapting the pipeline code in ‘*seq.sh’ to be suitable for running
on multiple processors. Each pipeline script is automatically configured by retrieving the pipeline
variable values (e.g. species, read length) from the metadata SQL database; in addition, selected
metadata generated by the pipeline (e.g. running time, number of aligned reads) are recorded into the
database. For further flexibility, the pipeline code is grouped into modules that can be executed either all
in sequence or individually, as specified in the configuration file. (b) We take advantage of our
structured and hierarchical data organisation as well as the available metadata to deploy a web
application to visualise processed data using Shiny [6].
[Fig. 2a diagram: the command ‘> *submit.sh *.config’ reads the configuration file (‘*.config’: samples & parameters) and, using the pipeline code in ‘*seq.sh’ ([full] or [module 1], [module 2], [module 3]), generates one pipeline script (‘*.sh’) per sample (sample A1, sample A2, …, sample N); the pipeline scripts exchange metadata with the SQL database]
[Fig. 2b diagram: processed data (sample001, sample002, sample003, …) are displayed through a Shiny server running ‘app.R’]
Fig. 1. Framework for the management of HTS data
(a) Metadata collection. In our projects, metadata are collected via an online Google Form and stored
both online (Google Sheet) and in a local SQL database. We design forms to be short and easy to
complete, and Google Sheets provide instant access to the metadata by authorised users. The SQL
database works both as a backup and as the source for retrieving metadata programmatically. (b)
The stages of HTS data. In general, experiments are sequenced in different multi-sample runs
separated in time. HTS data are usually analysed in two steps. First, raw data are processed sample-
wise with standard but tunable core analysis pipelines which generate a variety of files. Second,
processed data from one or more samples are combined to perform downstream analyses.
runs/
|-2017-10-09/
|--sample001_read1.fastq.gz
|--fastqc/
|---sample001_read1_fastqc.txt
[Fig. 1a diagram: a Google Form with fields such as PROJECT, APPLICATION, SAMPLE_ID, SAMPLE_NAME, … feeds a Google Spreadsheet (ONLINE) and an SQL database (CLUSTER)]
[Fig. 1b diagram: sequencing run A (sample001, sample002, sample003, …) and sequencing run B (sample004, sample005, sample006, …) produce (1) raw data; the core analysis pipeline turns them into (2) processed data; downstream analysis (e.g. Analysis 1) yields (3) analysis results]
Timestamp | SAMPLE_ID | CELL_TYPE | TREATMENT | TREATMENT_TIME
08/10/15 14:13 | sample001 | T47D | Untreated | 0
08/10/15 14:35 | sample002 | T47D | Progesterone | 60
08/10/15 14:38 | sample003 | T47D | Untreated | 0
2/22/16 12:35:00 | sample004 | B-cell | Untreated | 0
sample001/
|-alignments/
|--hg19/
|--hg38/
|-profiles/
|-logs/
|--program1.out
projects/
|-project1/
|--2017-10-09_diff_expression/
|---data/
|---figures/
|---tables/
|---scripts/
Box 3. Analysis code’s wish-list
• Scalability: 1 sample or 100s
• Parallelisation: run all samples simultaneously; speed up individual tasks
• Automatic configuration: no need to set variables for each sample
• Pipeline modularity: execute it all or individually
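A toy sketch of how modularity and parallelisation fit together. The three module functions are placeholders that only return a message, and a thread pool stands in for the computing cluster, where each sample would instead be submitted as an independent job; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


# Placeholder modules: real ones would call the trimming, alignment and
# quantification programs of the pipeline.
def trim(sample):
    return f"{sample}: trimmed"


def align(sample):
    return f"{sample}: aligned"


def quantify(sample):
    return f"{sample}: quantified"


# Modules in execution order (dictionaries preserve insertion order).
MODULES = {"trim": trim, "align": align, "quantify": quantify}


def run_pipeline(sample, modules="full"):
    """Run all modules ('full') or only the named ones, in the given order."""
    selected = list(MODULES) if modules == "full" else list(modules)
    unknown = [name for name in selected if name not in MODULES]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    return [MODULES[name](sample) for name in selected]


def run_all(samples, modules="full", workers=4):
    """Process several samples concurrently; results are keyed by sample."""
    samples = list(samples)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(lambda sample: run_pipeline(sample, modules), samples)
        return dict(zip(samples, outputs))


full_run = run_all(["sample001", "sample002"])
align_only = run_all(["sample001", "sample002"], modules=["align"])
print(full_run)
print(align_only)
```

The same call works for one sample or hundreds, which is the scalability item of the wish-list.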
Box 4. The multiple pieces of reproducibility
• Every task is a directory
• Analysis pipelines: a log monitors the progress of the pipeline; keep the logs of the programs used; check the integrity of important files (e.g. raw reads)
• Use Markdown [3], Jupyter Notebook [4], RStudio [5] or alike to document procedures
• Specify the non-default variable values used
• Version control: code repositories (e.g. GitHub)
• Virtual machines or containers (e.g. Docker)
Box 5. Web applications: a win-win situation
“I ran the pipeline on your 10 samples.”
“Can you send me the interaction matrix of chr2 for all of them? Excel crashes and, yet, I’d have to do it many times…”
“I wish I could focus on the more technical aspects…”
“I wish I could be more autonomous with the data analysis…”
