This document summarizes changes between versions v3.3.2 and v4 of the Genome in a Bottle small variant benchmark, as evaluated against a PacBio CCS callset (CCS-GATK4HC). Version v4 increases the number of true positive variant calls by more than 400k, with a slight decrease in recall and an increase in precision. It also resolves all discrepancies reported in Wenger & Peluso et al. Most of the discrepant sites were resolved by removing them from the high confidence region; some may be added back after closer inspection, because other data such as linked reads, long reads, and mate pairs support the CCS calls.
3. SUPPLEMENTARY TABLE 9
- Supplementary Table 9. Manual curation of small variant discrepancies between CCS callset and Genome in a Bottle benchmark. For the “Discrepancy” column, “AM” means genotype difference, “FN” means false negative (in benchmark but not callset), and “FP” means false positive (in callset but not benchmark). “Repeat family” column is from the RepeatMasker track from the UCSC Genome Browser. “Correct Call” column is “GIAB” when the benchmark was deemed correct by expert curators, and “CCS” when the CCS callset was deemed correct. Rows where the correct call is from the CCS callset are colored blue.
bioRxiv 519025
doi:10.1101/519025
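The three discrepancy labels in the caption can be sketched as a small decision rule. This is a minimal illustration, not the curators' procedure: the function name `classify_discrepancy`, the encoding of genotypes as plain strings (with `None` for "variant absent"), and the extra `MATCH` label are all assumptions made for the example.

```python
from typing import Optional


def classify_discrepancy(benchmark_gt: Optional[str],
                         callset_gt: Optional[str]) -> str:
    """Label one site with the categories used in the table caption.

    Genotypes are plain strings such as "HET" or "HOMALT"; None means the
    variant is absent from that set (an encoding assumed for illustration).
    """
    if benchmark_gt is None and callset_gt is None:
        raise ValueError("site has no variant in either set")
    if callset_gt is None:
        return "FN"     # in benchmark but not callset
    if benchmark_gt is None:
        return "FP"     # in callset but not benchmark
    if benchmark_gt != callset_gt:
        return "AM"     # present in both, genotypes differ
    return "MATCH"      # hypothetical label for concordant sites


if __name__ == "__main__":
    # A benchmark HOMALT call against a callset HET call is a genotype difference.
    print(classify_discrepancy("HOMALT", "HET"))
```

Under this rule, a site that the benchmark genotypes as HOMALT while the callset reports HET is labeled AM.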
7. SOME OF THESE REGIONS MAY BE ADDED BACK TO HIGH CONFIDENCE AFTER CLOSER INSPECTION
CHR  POS        Discrepancy  High conf  Notes
4    11468804   AM           BORDER     fixed
5    42740225   AM           FALSE      complex variant; L1PA2
2    5143996    AM           FALSE      mis-mapped short reads; HERVH-int
13   48291499   AM           FALSE      highly variable region; L1PA3
8    5930728    FN           FALSE      long reads identify long insertion
15   41943823   FN           FALSE      simple repeat with some variability causes alignment issues
6    9737425    FN           FALSE      segmental dup
7    157385671  FN           TRUE       segmental dup, fixed
17   32064214   FN           FALSE
1    94256825   FP           TRUE       fixed; L1PA2
2    153864971  FP           FALSE      supported by CCS, mate pairs, 10X, ONT; L1HS
4    112819087  FP           FALSE      L1HS
4    165026074  FP           TRUE       fixed; L1PA2
11   23338682   FP           TRUE       fixed; L1P1
1    35034071   FP           FALSE      L1HS
3    79181734   FP           TRUE       fixed; L1HS
4    94532444   FP           FALSE      supported by CCS, mate pairs, ONT; L1HS
8    46873565   FP           FALSE      ALR/Alpha
9    22350168   FP           TRUE       fixed; L1PA2
21   42288851   FP           FALSE      supported by CCS, mate pairs, ONT; L1PA2
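To see how the curated sites break down, the table can be tallied by discrepancy type and by its high conf value. The sketch below transcribes the first four columns of the slide by hand; the tuple layout and the helper name `tally` are assumptions for illustration, and the high conf values are taken as printed without interpreting them further.

```python
from collections import Counter

# Rows transcribed from the slide: (chrom, pos, discrepancy, high_conf).
ROWS = [
    ("4", 11468804, "AM", "BORDER"),
    ("5", 42740225, "AM", "FALSE"),
    ("2", 5143996, "AM", "FALSE"),
    ("13", 48291499, "AM", "FALSE"),
    ("8", 5930728, "FN", "FALSE"),
    ("15", 41943823, "FN", "FALSE"),
    ("6", 9737425, "FN", "FALSE"),
    ("7", 157385671, "FN", "TRUE"),
    ("17", 32064214, "FN", "FALSE"),
    ("1", 94256825, "FP", "TRUE"),
    ("2", 153864971, "FP", "FALSE"),
    ("4", 112819087, "FP", "FALSE"),
    ("4", 165026074, "FP", "TRUE"),
    ("11", 23338682, "FP", "TRUE"),
    ("1", 35034071, "FP", "FALSE"),
    ("3", 79181734, "FP", "TRUE"),
    ("4", 94532444, "FP", "FALSE"),
    ("8", 46873565, "FP", "FALSE"),
    ("9", 22350168, "FP", "TRUE"),
    ("21", 42288851, "FP", "FALSE"),
]


def tally(rows):
    """Count rows by discrepancy type, by high conf value, and by both."""
    by_type = Counter(row[2] for row in rows)
    by_conf = Counter(row[3] for row in rows)
    by_both = Counter((row[2], row[3]) for row in rows)
    return by_type, by_conf, by_both


if __name__ == "__main__":
    by_type, by_conf, by_both = tally(ROWS)
    print("by discrepancy:", dict(by_type))
    print("by high conf:  ", dict(by_conf))
    for key in sorted(by_both):
        print(key, by_both[key])
```

The cross-tabulation in `by_both` makes it easy to pick out, for example, the FP sites that are marked FALSE and could be candidates for re-inclusion.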
8. CHR6:9737425
-V3.3.2 - HET
-V4 - removed from high confidence region
-segmental duplication
[Figure tracks: 15kb CCS, 2x250, 6kb mate pair, Linked-read, Ultralong, Seg dup, V3.3.2 high conf, V4 high conf]
9. CHR13:48291499
-V3.3.2 - HOMALT
-V4 - removed from high confidence region
-Support for HET in CCS, linked-reads, mate pairs, and ultralong reads
-L1PA3
[Figure tracks: 15kb CCS, 2x250, 6kb mate pair, Linked-read, Ultralong, Seg dup, V3.3.2 high conf, V4 high conf]
10. CONCLUSION
-v4 increases TP variant calls in the CCS-GATK4HC call set by > 400k
-slight decreases in recall, but increases in precision
-v4 resolves all discrepancies reported in Wenger & Peluso et al.
-There may be some cases where the high confidence region could be further expanded, based on agreement between CCS, 10X, ONT, and 6kb mate pairs.
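The first two conclusions can hold together: when a benchmark grows to cover more (and harder) sites, TP can rise while recall falls slightly. The sketch below uses the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). All counts are hypothetical and chosen only to show the direction of the effect; they are not the values behind these slides.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of callset variants that match the benchmark."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of benchmark variants that the callset recovers."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only (not from the slides).
old = {"tp": 900, "fp": 10, "fn": 10}   # smaller benchmark
new = {"tp": 1000, "fp": 8, "fn": 20}   # larger benchmark with harder sites

if __name__ == "__main__":
    for name, c in (("old", old), ("new", new)):
        p = precision(c["tp"], c["fp"])
        r = recall(c["tp"], c["fn"])
        print(f"{name}: TP={c['tp']} precision={p:.4f} recall={r:.4f}")
```

In this toy case TP increases and precision improves, while recall drops because the added benchmark sites contribute more FN than before.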