I wanted to write a post about ClinVar and its numerous faults, not only in curation and implementation, but also in how bioinformaticians unabashedly use it as a stand-in for real-world data.
When creating a new variant score to add to the cornucopia of preexisting metrics, a bioinformatician may discover there is an extremely limited pool of well-validated evaluation sets. The Human Gene Mutation Database (HGMD) is one such resource, but its latest releases are not open to the public without paying first. While industry professionals rely on HGMD to some extent, it is generally distrusted in the academic community. For these reasons, ClinVar, a dataset curated by clinicians and diagnostic labs and hosted by the NIH, is the primary, and often only, choice of many academicians, genetic diagnosticians, and bioinformaticians seeking to assess different metrics’ abilities to detect variant pathogenicity.
ClinVar has numerous faults. For example, many so-called “pathogenic” variants are found in gnomAD at high frequencies. In addition, some variants are incorrectly annotated in either the VCF or the XML, often as a result of being data-mined from OMIM or a similar source. As a consequence, several variants carry an incorrect pathogenic significance or review status. Several labs have published papers correcting or disputing (Figure 2 in this paper is a great example) the pathogenic status of variants, and the results are often not reflected in ClinVar. Furthermore, the very definition of pathogenic is subjective, and certain companies are allowed to give a variant a pathogenic classification with a one-star review status (single submitter, criteria provided) even if the variant merely appears in a gene panel. No validation or further testing needs to be done on such a variant. This can lead to many false positives.
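To make the gnomAD point concrete, here is a minimal sketch of how one might flag “pathogenic” entries that are suspiciously common in the population. It assumes the ClinVar records have already been joined to gnomAD allele frequencies; the record layout (`variant_id`, `clnsig`, `gnomad_af`), the function name, and the default cutoff are all my own illustrative choices, not anything ClinVar or gnomAD provides. The 5% default mirrors the ACMG stand-alone benign frequency criterion, but a disease-specific cutoff would usually be far lower.

```python
from typing import List, Optional, TypedDict


class Variant(TypedDict):
    """Hypothetical record: a ClinVar entry already joined to a gnomAD frequency."""
    variant_id: str
    clnsig: str                  # e.g. the CLNSIG value from the ClinVar VCF
    gnomad_af: Optional[float]   # None when the variant is absent from gnomAD


# Labels treated as pathogenic in this sketch (an assumption about CLNSIG spelling).
PATHOGENIC_LABELS = {
    "Pathogenic",
    "Likely_pathogenic",
    "Pathogenic/Likely_pathogenic",
}


def flag_common_pathogenic(variants: List[Variant], max_af: float = 0.05) -> List[Variant]:
    """Return pathogenic-labelled variants whose population frequency exceeds max_af."""
    flagged = []
    for v in variants:
        af = v["gnomad_af"]
        if v["clnsig"] in PATHOGENIC_LABELS and af is not None and af > max_af:
            flagged.append(v)
    return flagged


# Toy data, invented purely for illustration.
example: List[Variant] = [
    {"variant_id": "v1", "clnsig": "Pathogenic", "gnomad_af": 0.12},
    {"variant_id": "v2", "clnsig": "Pathogenic", "gnomad_af": 0.00001},
    {"variant_id": "v3", "clnsig": "Benign", "gnomad_af": 0.30},
    {"variant_id": "v4", "clnsig": "Likely_pathogenic", "gnomad_af": None},
]
suspicious = flag_common_pathogenic(example)
```

Anything this returns is not automatically wrong (founder variants and low-penetrance alleles exist), but it is a reasonable list to review by hand before using the set as ground truth.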
As an obvious consequence of these facts, nearly all papers promoting a particular variant pathogenicity metric evaluate on a different subset of ClinVar (not to mention against different metrics or versions thereof), each claiming to have the best filters. The resultant evaluations are inconsistent and not at all reproducible. Some evaluations are restricted to variants with a review status of one star or higher. However, these make up the bulk of ClinVar, so other evaluations use a smaller subset, such as variants reviewed by an expert panel. Many eschew variants that have conflicting interpretations, and some even drop those classified as “Pathogenic/Likely pathogenic”, which should still be considered pathogenic. And even though each paper uses a different iteration of ClinVar, many metrics are guilty of overfitting to ClinVar as a whole because of how some models are trained. Several variant-scoring metrics use allele frequency as a feature, and many ClinVar benign variants inherently encode allele frequency, because they were classified as “stand-alone” benign from population-scale datasets like gnomAD according to ACMG guidelines. Because of this limitation of ClinVar benigns, metrics that rely too heavily on allele frequency are overfitted to ClinVar and do not perform well when evaluating de novo variant data or rare VUS. Additionally, there are single-submitter “validated variants” (here's an example) based solely on absence from gnomAD.
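The subset problem is easy to see in code. Below is a rough sketch, not any paper's actual pipeline, of how two reasonable-sounding filters produce different evaluation sets from the same records. The star mapping reflects my understanding of the review-status strings in the ClinVar VCF (`CLNREVSTAT`), which have changed across releases, so treat it as an assumption to check against the release you use; the function names and record layout are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed mapping from CLNREVSTAT strings to review stars; verify per release.
REVIEW_STARS: Dict[str, int] = {
    "practice_guideline": 4,
    "reviewed_by_expert_panel": 3,
    "criteria_provided,_multiple_submitters,_no_conflicts": 2,
    "criteria_provided,_single_submitter": 1,
    "criteria_provided,_conflicting_interpretations": 1,
    "no_assertion_criteria_provided": 0,
    "no_assertion_provided": 0,
}

PATHOGENIC = {"Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic"}
BENIGN = {"Benign", "Likely_benign", "Benign/Likely_benign"}


def stars(clnrevstat: str) -> int:
    """Unknown review-status strings are conservatively treated as zero stars."""
    return REVIEW_STARS.get(clnrevstat, 0)


def binary_label(clnsig: str) -> Optional[int]:
    """1 = pathogenic, 0 = benign, None = excluded (VUS, conflicting, other)."""
    if clnsig in PATHOGENIC:
        return 1  # "Pathogenic/Likely_pathogenic" stays in the positive class
    if clnsig in BENIGN:
        return 0
    return None


def evaluation_subset(variants: List[dict], min_stars: int,
                      drop_conflicting: bool = True) -> List[dict]:
    """Keep variants at or above min_stars, optionally dropping conflicting ones."""
    kept = []
    for v in variants:
        if stars(v["clnrevstat"]) < min_stars:
            continue
        if drop_conflicting and "conflicting" in v["clnrevstat"]:
            continue
        kept.append(v)
    return kept


# Toy records, invented for illustration only.
demo = [
    {"variant_id": "v1", "clnsig": "Pathogenic",
     "clnrevstat": "reviewed_by_expert_panel"},
    {"variant_id": "v2", "clnsig": "Pathogenic/Likely_pathogenic",
     "clnrevstat": "criteria_provided,_multiple_submitters,_no_conflicts"},
    {"variant_id": "v3", "clnsig": "Benign",
     "clnrevstat": "criteria_provided,_single_submitter"},
    {"variant_id": "v4", "clnsig": "Conflicting_interpretations_of_pathogenicity",
     "clnrevstat": "criteria_provided,_conflicting_interpretations"},
]

one_star_plus = evaluation_subset(demo, min_stars=1)   # the "bulk of ClinVar" choice
expert_only = evaluation_subset(demo, min_stars=3)     # the expert-panel choice
```

Two papers applying these two filters would report performance on different variants, which is why stating the ClinVar release date, the exact filter, and the label mapping matters for reproducibility.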