TRGT-denovo: accurate detection of de novo tandem repeat mutations (2024)


Version 1. bioRxiv. Preprint. 2024 Jul 19.

doi:10.1101/2024.07.16.600745

PMCID: PMC11275785

PMID: 39071386

This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

T. Mokveld,¹ E. Dolzhenko,¹ H. Dashnow,² T. J. Nicholas,² T. Sasani,² B. van der Sanden,³ B. Jadhav,⁵ B. Pedersen,² Z. Kronenberg,¹ A. Tucci,⁶ A. J. Sharp,⁵ A. R. Quinlan,² C. Gilissen,³ A. Hoischen,³,⁴ and M. A. Eberle¹


The complete version history of this preprint is available at bioRxiv.


Abstract

Motivation

Identifying de novo tandem repeat (TR) mutations on a genome-wide scale is essential for understanding genetic variability and its implications in rare diseases. While PacBio HiFi sequencing data enhances the accessibility of the genome’s TR regions for genotyping, simple de novo calling strategies often generate an excess of likely false positives, which can obscure true positive findings, particularly as the number of surveyed genomic regions increases.

Results

We developed TRGT-denovo, a computational method designed to accurately identify all types of de novo TR mutations (expansions, contractions, and compositional changes) within family trios. TRGT-denovo directly interrogates read evidence, allowing for the detection of subtle variations often overlooked in variant call format (VCF) files. TRGT-denovo improves the precision and specificity of de novo mutation (DNM) identification, reducing the number of de novo candidates by an order of magnitude compared to genotype-based approaches. In our experiments involving eight rare disease trios previously studied, TRGT-denovo correctly reclassified all false positive DNM candidates as true negatives. Using an expanded repeat catalog, it identified new candidates, of which 95% (19/20) were experimentally validated, demonstrating its effectiveness in minimizing likely false positives while maintaining high sensitivity for true discoveries.

Availability and implementation

Built in Rust, TRGT-denovo is available as source code and a pre-compiled Linux binary along with a user guide at: https://github.com/PacificBiosciences/trgt-denovo.

Introduction

Tandem repeats (TRs) are DNA sequences composed of variably recurring, (nearly) identical subunits that contribute significantly to both intra-sample and population-level genomic variation [¹]. De novo expansions of TRs, present in both coding and non-coding regions [²], are associated with over 60 monogenic disorders [³] and linked to conditions such as cancer [⁴,⁵] and neurological disorders [⁶]. The mutability of TRs, influenced by their repeat length and sequence context [⁷,⁸], is several orders of magnitude higher than that of non-repetitive DNA [⁸].

Accurately sizing large TR loci is challenging, particularly for pathogenic TR loci, whose size often expands significantly from normal to premutation to pathogenic repeat ranges [³]. The Tandem Repeat Genotyping Tool (TRGT) [⁹] was recently developed to characterize TR loci in PacBio HiFi sequencing data. TRGT calculates repeat length, composition, mosaicism, and CpG methylation state while also providing visualization. TRGT demonstrates a high Mendelian consistency rate exceeding 98.38% when excluding off-by-one errors, indicating its high accuracy. However, despite this high accuracy, the presence of millions of repeat loci in the genome means that trio analysis can still generate tens of thousands of false positive de novo calls. Thus, to integrate de novo TR analysis into rare disease studies effectively, a strategy is needed that filters out a substantial portion of these false positives without missing true positives.

We present TRGT-denovo, a novel method for detecting DNMs in TR regions by integrating TRGT genotyping results with read-level data from family members. This approach significantly reduces the number of likely false positive de novo candidates compared to genotype-based de novo TR calling. In a follow-up to earlier research surveying DNMs in eight rare disease trios [¹⁰], we used the same data to demonstrate that TRGT-denovo would have accurately classified all high-quality candidate de novo calls—later experimentally validated as false positives—as true negatives. Moreover, by expanding de novo analysis using a larger repeat catalog in the same dataset, targeted sequencing confirmed 95% (19/20) of the selected de novo candidates detected by TRGT-denovo.

Results

Throughout, we consider TR genotyping obtained by using TRGT in 9 family trios and the GRCh38 reference genome, along with various repeat catalogs [¹¹,¹²]. TRGT-denovo performs de novo TR calling, as detailed in the Methods section. Briefly, TRGT-denovo analyzes both the genotyping outcomes and reads spanning the TRs generated by TRGT, as shown in Fig. 1. TRGT-denovo compares the alleles and supporting reads from the child against those of the parents, enabling the identification and quantification of variations exclusive to the child’s data as potential DNMs. As a result, TRGT-denovo can detect both changes in TR length and compositional variations (e.g., sequence changes such as SNVs and larger changes).


Fig 1.

Overview of TRGT-denovo (full details in Methods). (a) TRGT pre-processing, which requires aligned PacBio HiFi reads, a repeat definition catalog, and a reference genome. (b) TRGT-denovo uses TRGT output, specifically spanning reads and genotyping data, along with the reference genome and repeat definitions. (c) By matching repeat definitions and corresponding allele sequences, reads are partitioned and assigned to alleles. This is achieved via TRGT-obtained classifications, consensus allele alignment, or phasing, thus determining the allele sequence each read best supports. (d) Allele partitioned reads are realigned to child allele consensus sequences for comparison purposes. (e) Potential DNMs are identified by examining discrepancies in alignment score distributions among candidate de novo alleles.

Comparing genotype-based de novo TR calling to TRGT-denovo

We assessed de novo TR calling by conducting a comparative analysis on the HG002, HG003, and HG004 trio with 30x HiFi data sequenced on the PacBio Revio system. Using a repeat catalog of 891,328 loci (excluding those in segmental duplications), TRGT generated genotype calls across the trio at 888,711 (99.7%) loci. TRGT provides various metrics, including the repeat counts per allele and allele depth (number of reads spanning each allele) for each genotyped TR locus. Our analysis focused only on sites supported with a minimum of ten reads, with at least five reads supporting each allele in each trio member, resulting in 864,990 loci for evaluation. De novo candidates were identified by detecting deviations from expected Mendelian inheritance patterns, with 81,308 (9.40%) loci displaying Mendelian inheritance inconsistencies indicative of potential de novo mutations. Of these, 89.08% and 7.04% correspond to homopolymer and dinucleotide repeats, respectively. When allowing for variations of one motif count, 2,582 loci (0.29%) were flagged as potential de novo mutations, with 63.07% and 22.22% corresponding to homopolymer and dinucleotide repeats, respectively.
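The genotype-based check described above can be illustrated with a minimal sketch. It assumes each genotype is a pair of motif counts; the function name `is_mendelian_consistent` and the `tolerance` parameter are hypothetical and are not part of TRGT. A locus counts as a de novo candidate when no combination of one paternal and one maternal allele explains the child's genotype.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother, tolerance=0):
    """Return True if the child's two alleles can be explained by one allele
    from each parent, with each motif count allowed to differ by at most
    `tolerance` repeat units (hypothetical helper, for illustration only)."""
    c0, c1 = child
    for f, m in product(father, mother):
        # Try both assignments of the child's alleles to (father, mother).
        if abs(c0 - f) <= tolerance and abs(c1 - m) <= tolerance:
            return True
        if abs(c1 - f) <= tolerance and abs(c0 - m) <= tolerance:
            return True
    return False


# Hypothetical motif counts for one locus in a trio.
child, father, mother = (10, 15), (10, 12), (14, 20)
strict = is_mendelian_consistent(child, father, mother)
relaxed = is_mendelian_consistent(child, father, mother, tolerance=1)
print(strict, relaxed)
```

With these made-up counts the strict check flags the locus as inconsistent, while allowing a difference of one motif count accepts it, which mirrors the drop from 81,308 to 2,582 candidates reported above.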

TRGT-denovo enhances the detection of true de novo tandem repeat expansions, contractions, or compositional changes while reducing likely false positives by analyzing reads from all trio members (see Methods). In this dataset, TRGT-denovo found that 4,214 (0.49%) of analyzed loci showed some degree of de novo evidence when testing both alleles of the child. De novo evidence is defined as at least one read inconsistent with both parents’ called genotypes. The discrepancy between the 81,308 Mendelian inheritance errors identified from genotype calls and the 4,214 loci showing de novo evidence stems from TRGT-denovo’s consideration of underlying read data in addition to inferred genotypes. Genotyping can be ambiguous; however, by analyzing read and allele length distributions within the context of a trio (i.e., comparing the child with both parents), we can get a more reliable indication of de novo mutations than by just counting motifs. Note that de novo evidence does not always indicate true DNMs; it could also result from artifacts like parental allele dropout, mosaicism, somatic instability, or stutter, particularly in low coverage scenarios. To address this uncertainty, TRGT-denovo assesses the amount of de novo evidence using several metrics, including the allele de novo ratio and the child de novo ratio, which respectively measure the proportion of de novo evidence within an individual allele and across the entire locus. Furthermore, the potential size of a de novo event is evaluated by the mean absolute difference between the de novo reads and the closest parental read data. This measurement provides insight into the event’s magnitude, with smaller values under low coverage conditions likely indicative of artifacts. Fig. 2 shows how de novo coverage (the number of reads supporting the de novo event) varies in relation to these metrics for alleles exhibiting any de novo coverage.
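As a rough illustration of the two ratios, the sketch below divides the de novo coverage by the read count of the candidate allele and by the child's total read count at the locus. This reading of "within an individual allele" and "across the entire locus" is an assumption, and the function name and read counts are hypothetical.

```python
def de_novo_ratios(denovo_coverage, allele_coverage, locus_coverage):
    """Return (allele de novo ratio, child de novo ratio).

    Assumed definitions, for illustration only:
      allele ratio = de novo reads / reads assigned to the candidate allele
      child ratio  = de novo reads / all child reads spanning the locus
    """
    if allele_coverage <= 0 or locus_coverage <= 0:
        raise ValueError("coverage must be positive")
    return denovo_coverage / allele_coverage, denovo_coverage / locus_coverage


# Hypothetical heterozygous locus: 40 child reads in total, 20 on the
# candidate allele, 18 of which show de novo evidence.
allele_ratio, child_ratio = de_novo_ratios(18, 20, 40)
print(allele_ratio, child_ratio)
```

Under these assumptions a clean heterozygous de novo allele gives an allele ratio near 1.0 and a child ratio near 0.5.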


Fig 2.

TRGT-denovo metrics.

De novo coverage relative to the (a) allele de novo ratio; (b) child de novo ratio; (c) mean absolute difference between the reads with de novo evidence and P_U. Each point represents a potential de novo allele. Horizontal and vertical lines indicate thresholds for minimal de novo coverage, allele de novo ratio, and a range for the child de novo ratio, creating shaded boxes where true de novo mutations are more likely.

Using these metrics, we established thresholds to identify potential de novo alleles (Methods). After filtering, we excluded 4,077 of the 4,214 putative de novo sites, leaving 137 candidates. Comparison with candidates identified by the genotype-based approach using strict Mendelian consistency checks revealed an overlap of 95 candidates. This partial overlap suggests that DNMs may not always alter the allele length or may involve sub-motif changes that do not affect the motif count, such as SNVs within a repeat motif. Furthermore, small insertions, deletions, or substitutions within the repeat motifs themselves can introduce variability without altering the overall motif count. For example, a 5 bp change in a 20 bp motif is proportionally too small to affect the measured motif count. While the simple genotype-based method might miss these subtle changes because it relies on counting motifs, TRGT-denovo can detect them through read-level analysis. When allowing for single repeat unit motif variations, the overlap further reduced to 29 candidates. This indicates that such small changes, typically expected in de novo events, were effectively excluded. In summary, TRGT-denovo outperforms genotype-based methods in detecting DNMs by capturing complex changes not reflected by simple motif counts. The simple genotype-based method identified 81,308 candidates, of which only 95 (0.12%) overlapped with TRGT-denovo’s results, indicating that TRGT-denovo effectively filters out most likely false positives. Even when the genotype-based strategy adjusts for off-by-one motif counts, it still retains 2,582 candidates out of 81,308, with only 29 (1.12%) overlapping with TRGT-denovo’s results.

Validation of de novo TR calls

Previously, an in-depth analysis was performed on eight rare disease trios to identify DNMs in PacBio HiFi sequencing data, focusing on various types of variations, including TRs [¹⁰]. The study used a repeat definition catalog of 171,146 loci [¹¹], and TRGT was used for genotyping, followed by analyzing the VCF file to detect TR DNMs. That study identified 28 de novo candidates, 18 of which were amenable to targeted sequencing; however, none of these putative de novo events were confirmed as true DNMs. Misclassifications generally occurred either because the child allele size was incorrect, or the allele was actually present in one of the parents. Using the same repeat catalog, TRGT-denovo correctly reclassified all 18 candidates as true negatives. Of the 10 unsequenced candidates, TRGT-denovo suggested that only one might be a de novo mutation. Note that this single unsequenced candidate was found with short read data, supporting its classification as a true de novo event.

To identify true de novo candidates in these trios, we used a larger catalog of 937,122 loci [¹²] and applied both TRGT and TRGT-denovo across all trios. After post-filtering (Methods), we observed 60–120 candidate DNMs per trio. A subset of these candidates was selected for validation using targeted long-range PCR and sequencing on a PacBio Sequel IIe system [¹⁰]. Candidates were categorized into three groups: large (9) and small (17), based on repeat size, and those overlapping with genes associated with neurodevelopmental disorders (NDD) (6). Primer design, amplification, and sequencing were possible for 7 large, 9 small, and 4 NDD-overlapping candidates. We successfully validated 95% (19/20) of these candidate de novo mutations (Table 1). The one small candidate that could not be validated (shown in Supplemental Fig. 1) was a dinucleotide repeat with a single unit contraction, which, at high depth, displayed variability exceeding the de novo event size.

Table 1.

Validation results.

Targeted sequencing results of a subset of 20 candidate de novo TR calls detected by TRGT-denovo in eight trios. Calls are categorized by their size (large or small) or their genomic location, specifically in genes associated with neurodevelopmental disorders (NDD).

Type	TR size range (bp)	Mean TR size in bp (±SD)	Validated/Candidates
Large	[40, 1200]	301 (±372)	7/7
NDD	[8, 24]	19.25 (±6.53)	4/4
Small	[2, 5]	3.6 (±0.94)	8/9


Discussion

The discovery of de novo TR mutations is highly relevant for understanding genetic disorders. However, accurately identifying these mutations presents significant challenges due to the high mutability of TRs and the vast number of repeat loci in the genome. Standard genotype filtering methods are often insufficient in this context, necessitating more sophisticated approaches. To minimize the risk of likely false positive de novo candidates in genotype-based de novo calling, an extremely high variant calling accuracy is required, reflected as a high level of Mendelian consistency (Supplemental Fig 2.). As the size of repeat catalogs grows, so does the requirement for such consistency. In this study, we introduce TRGT-denovo, a method that addresses these challenges by integrating TRGT genotyping results with read-level data from family members. We demonstrate that TRGT-denovo significantly reduces the number of de novo candidates compared to using raw TRGT genotypes alone. Specifically, TRGT plus TRGT-denovo identified ~100 de novo TRs per trio, compared to the thousands or tens of thousands identified when using TRGT alone. This reduction in likely false positives is important for making de novo TR analysis more feasible and reliable in rare disease studies.

TRGT-denovo achieves high validation rates in addition to lowering the number of likely false positives. Specifically, we validated the de novo calls identified by TRGT-denovo, achieving a 95% validation rate (19/20 de novo calls). Maintaining high validation rates while reducing likely false positives not only yields a more manageable set of candidates for validation but also supports the use of larger repeat catalogs, enabling broader genomic surveys. By revisiting the genotyping-supporting sequencing data, TRGT-denovo can recapture details that might be lost in a VCF file, thereby improving the reliability of the de novo candidates obtained without altering the underlying genotyping process. Furthermore, we demonstrate TRGT-denovo’s utility in separate work in cases involving two trios with suspected GCC repeat expansions in the AFF3 gene, associated with intellectual disabilities, identifying a DNM in AFF3 as the most significant DNM across the genome, further substantiating its pathogenic significance [¹³].

Although TRGT-denovo significantly reduces (likely) false positives in trio analysis, there are still areas for improvement. De novo assessment of TRs remains notably challenging, despite the advancements enabled by long-read sequencing and TRGT-denovo. Current work aims to introduce haplotype matching across samples to enhance the reliability of inheritance inference, further reducing false positives by addressing issues like parental allele dropout. TRGT-denovo enables more accurate TR mutation studies, potentially leading to new insights in genetic disorders and genome dynamics.

Methods

TRGT-denovo

TRGT-denovo is a method for identifying DNMs within TR loci. It is designed to work in tandem with TRGT, a tool for targeted TR genotyping using PacBio HiFi sequencing data. TRGT requires aligned HiFi reads and a set of repeat definitions (Fig. 1a,b). The generated outputs include a VCF file (with full-length repeat consensus allele sequences and genotyping information) and a BAMlet file with segments of HiFi reads spanning each repeat allele. TRGT also integrates haplotype phasing tags in the BAM from tools such as HiPhase [¹⁴] or WhatsHap [¹⁵], in addition to using mismatches surrounding the TRs for allele phasing. TRGT-denovo analyzes TRGT VCFs and BAMlets from family trios. Each repeat locus is considered independently, requiring successful genotyping from each family member, proceeding as follows:

Within-sample read partitioning

TRGT-denovo extracts reads spanning each repeat allele in each family member for a specific locus of interest (Fig. 1c). For each allele a consensus sequence is generated, consisting of the TRGT-derived consensus allele sequence and 50 bp of flanking genome sequence. Typically, with two alleles per family member, this results in six such sequences per trio, representing the alleles of the child ( $C_{0}$ , $C_{1}$ ) and those of the parents ( $F_{0}$ , $F_{1}$ from the father, and $M_{0}$ , $M_{1}$ from the mother). Subsequently, within each family member, the reads are partitioned to their corresponding alleles using the TRGT-obtained allele assignments by default (Fig. 1d). Partitioning may optionally be based on the read’s alignment to consensus sequences or available phasing data. When partitioning relies on alignment, reads are aligned to their sample-specific consensus sequences through end-to-end, gap-affine, two-piece alignment, and each read is assigned to the consensus sequence with the highest alignment score. Ties in scores are resolved by random selection, phasing data, or TRGT allele assignment. If phasing is used for partitioning, reads are assigned to alleles based on their phasing tags; in the absence of such tags, alignment as previously described serves as the fallback method. Alignment plays a fundamental role in capturing the inherent variation across the reads. This is true regardless of the partitioning strategy used: for each allele, alignment scores are always computed for all reads assigned to that allele. These scores indicate how closely the reads resemble the allele and reflect the distribution of uncertainty across all assigned reads within the alleles of each family member. To efficiently manage millions of alignments of highly similar sequences, we have used the Wavefront alignment algorithm (WFA) [¹⁶], which exploits similarities between sequences to accelerate the computation of the optimal alignment.
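The alignment-based partitioning option can be sketched as follows. This is only an illustration: it substitutes negative unit-cost edit distance for the gap-affine, two-piece WFA scoring used by TRGT-denovo, breaks ties by taking the first allele (the tool resolves ties by random selection, phasing data, or TRGT assignment), and all names and sequences are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance; a stand-in for WFA gap-affine scoring."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def partition_reads(reads, consensus_seqs):
    """Assign each read to the allele consensus with the highest score.

    Scores are negative edit distances, so 0 means a perfect match.
    Ties go to the first allele here, a simplification of the tie-breaking
    described in the text.
    """
    partitions = [[] for _ in consensus_seqs]
    for read in reads:
        scores = [-edit_distance(read, c) for c in consensus_seqs]
        best = max(range(len(consensus_seqs)), key=lambda k: scores[k])
        partitions[best].append((read, scores[best]))
    return partitions


# Toy alleles: 3 and 5 copies of CAG with one flanking base on each side.
short = "A" + "CAG" * 3 + "T"
long_ = "A" + "CAG" * 5 + "T"
long_snv = "A" + "CAG" * 2 + "CTG" + "CAG" * 2 + "T"  # one substitution

partitions = partition_reads([short, long_, long_snv], [short, long_])
print(partitions)
```

The retained per-read scores are the quantity the later steps operate on: they form the alignment score distributions compared between child and parents.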

Candidate de novo allele alignment and analysis

Each child allele is evaluated as a potential de novo allele by examining its similarity to the parental alleles. This is achieved by aligning each read (previously partitioned into the alleles $F_{0}$ , $F_{1}$ , $M_{0}$ , $M_{1}$ , $C_{0}$ , and $C_{1}$ ) against the child’s allele consensus sequences, generating alignment score distributions for reads relative to the child alleles $C_{0}$ and $C_{1}$ , as shown in Fig. 4.


Fig 4.

Alignment score distributions.

Distributions of alignment scores for reads spanning alleles ( $M_{0}$ , $M_{1}$ , $F_{0}$ , $F_{1}$ , $C_{0}$ , $C_{1}$ ) when aligned to alleles $C_{0}$ (a) and $C_{1}$ (b). WFA alignment scores range from negative, less similar, to zero, perfect match. Inheritance patterns, as inferred from surrounding genetic variation, are: $M_{0} \to C_{0}$ (inherited) and $F_{1} \to C_{1}$ (inherited + de novo). Symbols P_U and C_L denote the parental upper bound and candidate de novo allele lower bound respectively. Each point corresponds to an individual read aligned to $C_{0}$ or $C_{1}$ . Red-outlined points highlight two observations: a read from $F_{1}$ , exceeding C_L, showing overlap with $C_{1}$ , and a read from $C_{1}$ that falls below P_U, and overlaps with $F_{1}$ . In $C_{1}$ there are 19 reads, of which 18 exceed P_U, contributing to the de novo coverage.

Alignment score distributions, denoted by $X_{i}^{r} \to Y_{j}^{c}$ for some $i$ , $j$ , reflect the scores from aligning reads of allele $X_{i}$ , $X_{i}^{r}$ , to the consensus sequence of allele $Y_{j}$ , $Y_{j}^{c}$ . For instance, $C_{0}^{r} \to C_{1}^{c}$ represents the scores of reads in allele $C_{0}$ when aligned to allele $C_{1}$ . Self-alignments, such as $C_{0}^{r} \to C_{0}^{c}$ and $C_{1}^{r} \to C_{1}^{c}$ , typically show maximum WFA alignment scores (zero scores), with deviations recorded as negative values. An exception is mosaic alleles, such as some FMR1 expansions, where numerous reads differ significantly from the consensus sequence. In instances where a child inherits both alleles, the majority of the $M_{i}^{r} \to C_{0}^{c}$ distribution is expected to overlap with $C_{0}^{r} \to C_{0}^{c}$ and $F_{j}^{r} \to C_{1}^{c}$ with $C_{1}^{r} \to C_{1}^{c}$ for some $i$ , $j$ or vice versa (Fig. 4a). Conversely, if allele $C_{1}$ is de novo, the $C_{1}^{r} \to C_{1}^{c}$ distribution will separate, diverging from both $M_{i}^{r} \to C_{1}^{c}$ and $F_{i}^{r} \to C_{1}^{c}$ for all $i$ (Fig. 4b). Note that this divergence is always unidirectional; for instance, if $C_{1}$ is de novo relative to $F_{1}$ , then $F_{1}^{r} \to C_{1}^{c}$ will shift left relative to $C_{1}^{r} \to C_{1}^{c}$ . Absence of such a shift would imply no underlying difference and, consequently, no DNM. Note that any divergence, regardless of changes in sequence length or composition, can signal a DNM and will cause a corresponding shift in the distribution.

Identification of DNMs relies on analyzing shifts in alignment score distributions, based on the assumption that the closest matching parental allele distribution to the child’s allele usually reflects an inherited allele without mutation. Significant shifts in these distributions suggest the presence of a DNM, indicating that the allele, although likely derived from the closest parental allele, has undergone changes and is not inherited as is. A larger divergence in distributions suggests greater dissimilarity, which acts as a measure of the magnitude of DNMs, enabling differentiation based on the size of these mutations. These distributions also facilitate estimates of inheritance patterns, where reads from an inherited allele are expected to closely match those of the corresponding child allele, though this can be complicated by scenarios such as multiple identical parental alleles. To quantify divergence, a parental upper bound, P_U, is defined, derived from the 1.0 quantile of the parental distribution, and is used as a reference point against which all child allele reads are tested. Reads with scores exceeding P_U indicate a potential DNM, shown as de novo coverage in Fig. 4b, with nearly all $C_{1}$ reads surpassing this threshold. Conversely, comparing parental read scores relative to the child allele’s lower median bound, C_L, reveals shared traits. This overlap coverage aids in differentiating characteristics between parents and child. For example, overlap in reads, such as seen in $M_{0}^{r} \to C_{0}^{c}$ and $M_{1}^{r} \to C_{0}^{c}$ with $C_{0}^{r} \to C_{0}^{c}$ , indicates inheritance of $C_{0}$ without mutation. In contrast, $C_{1}$ represents a DNM, as demonstrated by the lack of overlap in nearly all parental allele distributions ( $M_{0}^{r} \to C_{1}^{c}$ , $M_{1}^{r} \to C_{1}^{c}$ , $F_{0}^{r} \to C_{1}^{c}$ ) with $C_{1}^{r} \to C_{1}^{c}$ . An exception is the single read showing overlap in $F_{1}^{r} \to C_{1}^{c}$ ; however, this is proportionally negligible against the background of all other reads in $F_{1}^{r} \to C_{1}^{c}$ that do not overlap with $C_{1}^{r} \to C_{1}^{c}$ . The magnitude of DNMs is further quantified by calculating the mean absolute difference between the de novo reads and the P_U threshold. In cases of inheritance (Fig. 4a), this value is zero, reflecting (near)-perfect alignment between parental and child alleles, with any deviation denoting the presence and scale of DNMs.
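The threshold test described above can be sketched in a few lines. This is a simplified illustration, not the tool's implementation: it assumes P_U is the largest per-allele quantile across the parental score distributions (for the 1.0 quantile this is simply the maximum parental score), and the function names and scores are hypothetical.

```python
import math
from statistics import mean


def quantile(scores, q):
    """Nearest-rank quantile; q = 1.0 returns the maximum score."""
    ordered = sorted(scores)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def de_novo_evidence(child_scores, parental_scores, q=1.0):
    """Return (P_U, de novo coverage, mean absolute difference to P_U).

    child_scores: scores of the candidate allele's reads against its own
        consensus.
    parental_scores: dict mapping each parental allele to the scores of its
        reads against the same child consensus.
    """
    p_u = max(quantile(s, q) for s in parental_scores.values())
    denovo_reads = [s for s in child_scores if s > p_u]
    magnitude = mean(abs(s - p_u) for s in denovo_reads) if denovo_reads else 0.0
    return p_u, len(denovo_reads), magnitude


# Hypothetical scores of reads aligned to the child's C1 consensus.
parental = {
    "M0": [-40, -38, -42],
    "M1": [-35, -36],
    "F0": [-30, -29],
    "F1": [-12, -10, -11],
}
child_c1 = [0, 0, -1, 0, -12]

p_u, coverage, magnitude = de_novo_evidence(child_c1, parental)
print(p_u, coverage, magnitude)
```

In this toy case four of the five child reads score above the best parental read, and the one child read scoring below P_U is excluded from the de novo coverage, analogous to the red-outlined $C_{1}$ read in Fig. 4b.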

Detecting de novo TR mutations

TRGT-denovo measures and outputs metrics for every single child allele, generating metrics such as the de novo coverage, overlap coverage, and the magnitude of potential DNMs. Note that the de novo and overlap coverage should be considered in the context of the allele-specific and total coverage at the locus for each family member, i.e., as ratios. For example, as shown in Fig. 4b, $C_{1}$ has 19 reads, with 18 exceeding the P_U threshold, yielding an allele de novo ratio of ~0.95. A high de novo ratio strongly suggests a DNM, assuming sufficient overall coverage. Conversely, a low ratio may indicate weaker evidence or suggest complexities like segmental duplications, mosaicism, or stutter. De novo coverage is a strong predictor for detecting de novo variants, correlating with increased allele de novo ratios, approaching 1.0, and child de novo ratios converging to 0.5 (Fig. 2). This indicates that higher coverage helps in accurately identifying true de novo mutations by ensuring that these ratios align more closely with the expected balance. Note that all loci should be compared against the entire set of tested loci, creating a distribution that aids in defining thresholds for identifying likely DNMs. This comparative approach also helps to identify coverage imbalances or sample mix-ups, ensuring reliable results. Based on empirical data, the following filtering criteria have been established as a starting point:

Minimum de novo coverage of 5: This threshold ensures enough read support to confidently call a DNM. Lower coverage may result in insufficient evidence to distinguish true mutations from sequencing noise or errors.
Allele de novo ratio of at least 0.7: A high allele de novo ratio indicates that the majority of reads for an allele exceed the parental upper bound threshold. Observations show that true DNMs exhibit a strong divergence from parental alleles, and ratios below 0.7 often suggest weaker evidence or potential artifacts.
Child de novo ratio between 0.3 and 0.7: This range is chosen to balance the need to detect de novo events while accounting for the possibility of allelic dropout or partial mosaicism. Ratios below 0.3 or above 0.7 could indicate sample anomalies or errors in read assignment.
Low probability of allelic dropout: We require at least one read supporting each haplotype, as obtained from available within-sample phasing. This helps to avoid likely false positives due to incomplete haplotype representation.

These criteria were determined through a combination of empirical testing on known de novo mutations and analysis of genomic datasets. However, they should be considered as guidelines rather than strict rules and may need to be adjusted based on specific study designs or sequencing depths.
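The starting-point criteria listed above can be written as a single predicate. The sketch below is illustrative only: the function and parameter names are hypothetical, the metrics are assumed to be computed beforehand, and the dropout check is reduced to a list of per-haplotype read counts.

```python
def passes_filters(denovo_coverage, allele_ratio, child_ratio,
                   haplotype_read_counts,
                   min_denovo_coverage=5,
                   min_allele_ratio=0.7,
                   child_ratio_range=(0.3, 0.7)):
    """Apply the four suggested criteria to one candidate child allele.

    haplotype_read_counts: number of reads supporting each phased haplotype
        (hypothetical representation of the allelic dropout check).
    The default thresholds are the guideline values from the text and are
    expected to be tuned per study design and sequencing depth.
    """
    low, high = child_ratio_range
    return (denovo_coverage >= min_denovo_coverage
            and allele_ratio >= min_allele_ratio
            and low <= child_ratio <= high
            and all(count >= 1 for count in haplotype_read_counts))


# Hypothetical candidates.
strong = passes_filters(18, 18 / 19, 18 / 38, [20, 17, 19, 21])
weak = passes_filters(3, 0.9, 0.45, [20, 17, 19, 21])     # too few reads
dropout = passes_filters(18, 0.95, 0.47, [20, 0, 19, 21])  # missing haplotype
print(strong, weak, dropout)
```

Keeping the thresholds as keyword arguments reflects the point made above that they are guidelines to adjust rather than fixed rules.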


Funding

T. J. Nicholas, T. Sasani, and A. Quinlan were supported by the following awards from the National Institutes of Health: RC2TR004391 and R01HG010757. H. Dashnow was supported by the following award from the National Institutes of Health: K99HG012796.


Competing interests

T. Mokveld, E. Dolzhenko, Z. Kronenberg, and M. A. Eberle are employees and shareholders of Pacific Biosciences.

Bibliography

1. English A, Dolzhenko E, Jam HZ, Mckenzie S, Olson ND, De Coster W, et al. Benchmarking of small and large variants across tandem repeats. bioRxiv. 2023. doi: 10.1101/2023.10.29.564632

2. Subramanian S, Mishra RK, Singh L. Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol. 2003;4: R13.

3. Depienne C, Mandel J-L. 30 years of repeat expansion disorders: What have we learned and what are the remaining challenges? Am J Hum Genet. 2021;108: 764–785.

4. Erwin GS, Gürsoy G, Al-Abri R, Suriyaprakash A, Dolzhenko E, Zhu K, et al. Recurrent repeat expansions in human cancer genomes. Nature. 2023;613: 96–102.

5. Verbiest MA, Lundström O, Xia F, Baudis M, Bilgin Sonay T, Anisimova M. Short tandem repeat mutations regulate gene expression in colorectal cancer. Sci Rep. 2024;14: 3331.

6. Nicolas G, Veltman JA. The role of de novo mutations in adult-onset neurodegenerative disorders. Acta Neuropathol. 2019;137: 183–207.

7. Chakraborty R, Kimmel M, Stivers DN, Davison LJ, Deka R. Relative mutation rates at di-, tri-, and tetranucleotide microsatellite loci. Proc Natl Acad Sci U S A. 1997;94: 1041–1046.

8. Fan H, Chu J-Y. A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics. 2007;5: 7–14.

9. Dolzhenko E, English A, Dashnow H, De Sena Brandine G, Mokveld T, Rowell WJ, et al. Characterization and visualization of tandem repeats at genome scale. Nat Biotechnol. 2024. doi: 10.1038/s41587-023-02057-3

10. Kucuk E, van der Sanden BPGH, O’Gorman L, Kwint M, Derks R, Wenger AM, et al. Comprehensive de novo mutation discovery with HiFi long-read sequencing. Genome Med. 2023;15: 34.

11. trgt/repeats/repeat_catalog.hg38.bed at main · PacificBiosciences/trgt. In: GitHub [Internet]. [cited 8 Mar 2024]. Available: https://github.com/PacificBiosciences/trgt/blob/main/repeats/repeat_catalog.hg38.bed

12. Repeat catalogs for TRGT. [cited 8 Mar 2024]. doi: 10.5281/zenodo.8329210

13. Jadhav B, Garg P, van Vugt JJFA, Ibanez K, Gagliardi D, Lee, et al. A phenome-wide association study of methylated GC-rich repeats identifies a GCC repeat expansion in AFF3 as a significant cause of intellectual disability. medRxiv. 2023. doi: 10.1101/2023.05.03.23289461

14. Holt JM, Saunders CT, Rowell WJ, Kronenberg Z, Wenger AM, Eberle M. HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing. Bioinformatics. 2024;40. doi: 10.1093/bioinformatics/btae042

15. Patterson M, Marschall T, Pisanti N, van Iersel L, Stougie L, Klau GW, et al. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J Comput Biol. 2015;22: 498–509.

16. Marco-Sola S, Moure JC, Moreto M, Espinosa A. Fast gap-affine pairwise alignment using the wavefront algorithm. Bioinformatics. 2020;37: 456–463.
