Filtering Variants the Smart Way: Allele Frequency, Quality, and Functional Impact

A single exome or genome can yield millions of variants, but only a handful have real biological or clinical significance. The real skill lies not in calling variants, but in filtering them intelligently.

This article explores the three essential layers of smart variant filtering: allele frequency, variant quality, and functional impact. It also summarizes key references and tools that make this process reproducible and interpretable.


1. Allele Frequency: Understanding Rarity in Context

Allele frequency (AF) helps distinguish between benign polymorphisms and potentially pathogenic variants. A variant observed in 30% of a general population database is unlikely to cause a rare genetic disorder. Conversely, variants absent or extremely rare (<0.1%) in global datasets often deserve closer inspection.

Major population frequency databases include:

  • gnomAD (Genome Aggregation Database) — over 800,000 exomes and genomes (1); the most comprehensive population database to date, so it provides the most reliable allele frequency estimates
  • 1000 Genomes Project — representing broad population diversity (2)
  • ExAC — an older but still widely used exome dataset (3)

The frequency threshold depends on disease model:

  • Rare Mendelian disorders: AF < 0.001 (0.1%); a more permissive cutoff of AF < 0.05 (5%) carries little risk of discarding true positives
  • Complex traits: AF < 0.05 (5%)
  • Dominant vs. recessive inheritance: for recessive disorders, each variant can tolerate a slightly higher frequency if compound heterozygosity is possible (4). In practice, AF < 0.001 suits dominant inheritance and AF < 0.05 the other modes.

Filtering based on population data reduces false positives and enriches variants with potential functional relevance.
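As a rough sketch in R (assuming a variant table that already carries a gnomAD allele-frequency column, here called gnomad_af; both the column name and the helper function are illustrative), an AF filter keyed to the disease model might look like this:

```r
# Minimal sketch: allele-frequency filtering of an annotated variant table.
# Assumes a data frame `variants` with a numeric column `gnomad_af`
# (gnomAD allele frequency; NA means the variant is absent from gnomAD).

af_filter <- function(variants, disease_model = c("mendelian", "complex")) {
  disease_model <- match.arg(disease_model)
  # Pick the threshold according to the disease model (see the list above).
  threshold <- if (disease_model == "mendelian") 0.001 else 0.05
  # Treat variants absent from the population database as rare (AF = 0).
  af <- ifelse(is.na(variants$gnomad_af), 0, variants$gnomad_af)
  variants[af < threshold, ]
}

# Example usage:
# rare_variants <- af_filter(my_variants, disease_model = "mendelian")
```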


2. Variant Quality: Eliminating Technical Noise

Not all variants are created equal. Many are artifacts caused by poor base quality, misalignment, or strand bias. Quality control metrics help remove unreliable calls.

Common quality indicators in VCF files:

  • QUAL: Phred-scaled confidence score (>30 is often acceptable)
  • DP (Depth): Total reads covering the position; <8 is typically too low for confidence
  • MQ (Mapping Quality): Reflects the accuracy of read placement on the genome
  • FS and SOR: Measure strand bias; high values may indicate sequencing artifacts
  • QD (Quality by Depth): Combines confidence and coverage to normalize across depth

The Genome Analysis Toolkit (GATK) provides best-practice guidelines for variant quality filtering, emphasizing hard filters and variant recalibration (5). Poor-quality variants inflate false discovery rates and can derail downstream analyses, especially in rare disease or somatic variant studies.
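As a minimal sketch of this kind of hard filtering in R, assuming the metrics above have been exported to columns of a data frame (for example with GATK's VariantsToTable), and using cutoffs that mirror commonly cited GATK defaults for SNPs plus the thresholds mentioned above (tune them for your own data):

```r
# Minimal sketch: hard-filtering a variant table on common quality metrics.
# Assumes a data frame `variants` with numeric columns QUAL, DP, MQ, FS, SOR, QD,
# e.g. exported from a VCF with GATK VariantsToTable. Thresholds are illustrative.

pass_quality <- with(variants,
  QUAL > 30 &   # Phred-scaled call confidence
  DP   >= 8 &   # minimum read depth at the site
  MQ   >= 40 &  # mapping quality of supporting reads
  FS   <= 60 &  # FisherStrand: strand bias
  SOR  <= 3 &   # StrandOddsRatio: another strand-bias measure
  QD   >= 2     # call confidence normalized by depth
)

# Drop calls that fail any threshold or have missing metrics.
high_quality <- variants[pass_quality & !is.na(pass_quality), ]
```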


3. Functional Impact: Prioritizing What Matters Biologically

After removing common and low-quality variants, the next goal is identifying those with meaningful biological consequences.

Functional annotation predicts how each variant might affect a gene or protein.

Major annotation tools include:

  • ANNOVAR – flexible and widely used for gene-based, region-based, and filter-based annotations (6)
  • Ensembl VEP (Variant Effect Predictor) – provides consequences, conservation, and cross-reference annotations (7)
  • SnpEff – predicts variant effects based on transcript models (8)

Functional predictions:

  • REVEL – meta-predictor for missense pathogenicity (9)
  • CADD (Combined Annotation Dependent Depletion) – integrates multiple features to score deleteriousness (10)
  • SIFT and PolyPhen-2 – classic tools predicting amino acid substitution impact (11,12)
  • phyloP / phastCons – conservation-based measures identifying evolutionarily constrained sites (13)

Smart filtering often combines several predictors. For example, researchers may retain variants that are:

  • Located within the sequencing target regions
  • Rare (AF < 0.01)
  • Protein-altering (missense, nonsense, splicing, frameshift)
  • Functionally damaging (REVEL > 0.5)

Integrating these layers improves yield of plausible disease-associated variants while minimizing false leads.
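A minimal R sketch of such a combined filter is shown below; the column names (on_target, gnomad_af, consequence, revel) and the consequence terms are assumptions about how your annotation tool labels its output, so adjust them to match ANNOVAR, VEP, or SnpEff conventions:

```r
# Minimal sketch: combining target region, rarity, consequence class,
# and a functional score into one candidate filter. Column names are
# assumptions; adapt them to your annotation output.

protein_altering <- c("missense_variant", "stop_gained", "frameshift_variant",
                      "splice_acceptor_variant", "splice_donor_variant")

candidates <- subset(
  variants,
  on_target &                               # inside the sequencing target regions
  (is.na(gnomad_af) | gnomad_af < 0.01) &   # rare or absent in gnomAD
  consequence %in% protein_altering &       # protein-altering consequence
  (!is.na(revel) & revel > 0.5)             # predicted damaging by REVEL
)
```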


4. Integrative Filtering Strategy Example

We have developed a filtering tool, 123VCF, published in BMC Bioinformatics, which simplifies the process of confidently filtering the variants in a VCF file.

Additional filters can incorporate inheritance modeling (dominant/recessive) or gene-specific prioritization using HPO (Human Phenotype Ontology) terms.

Here, we have also prepared an R script that you can use as template code for filtering a TSV file in the R environment.
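For orientation, here is a minimal sketch of what such a template might look like; the file names, column names, and the HPO-derived gene list are placeholders for illustration and do not reproduce the published 123VCF tool or our script:

```r
# Minimal sketch of a TSV filtering template in R. File and column names
# are placeholders; replace them with those produced by your annotation step.

variants <- read.delim("annotated_variants.tsv", stringsAsFactors = FALSE)

# Optional: restrict to genes prioritized from HPO terms (hypothetical file,
# one gene symbol per line).
hpo_genes <- readLines("hpo_gene_list.txt")

filtered <- subset(
  variants,
  (is.na(gnomad_af) | gnomad_af < 0.001) &   # rare in the population
  QUAL > 30 & DP >= 8 &                      # basic quality thresholds
  consequence %in% c("missense_variant", "stop_gained", "frameshift_variant") &
  gene %in% hpo_genes                        # phenotype-driven gene prioritization
)

write.table(filtered, "filtered_variants.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)
```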

To get familiar with the operating system (Linux) that you will need in order to analyze your data independently, see this page.


5. Beyond Filtering: Biological Context Matters

Filtering reduces noise, but interpretation requires context. A rare variant in a gene irrelevant to your phenotype is not a candidate, no matter how “perfect” the score. Likewise, borderline-quality variants in highly relevant genes may still warrant investigation.

Ultimately, smart filtering is about combining computational rigor with biological reasoning.


Conclusion

Effective variant filtering transforms raw sequencing chaos into biologically meaningful insight. By leveraging population frequency data, quality metrics, and functional annotation tools, researchers can dramatically improve variant interpretation and reproducibility. The best pipelines are not the most aggressive—they are the most informed.


References

  1. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–43.
  2. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.
  3. Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536(7616):285–91.
  4. MacArthur DG, et al. Guidelines for investigating causality of sequence variants in human disease. Nature. 2014;508(7497):469–76.
  5. Van der Auwera GA, et al. From FastQ data to high-confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43(1):11.10.1–11.10.33.
  6. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.
  7. McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122.
  8. Cingolani P, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly. 2012;6(2):80–92.
  9. Ioannidis NM, et al. REVEL: an ensemble method for predicting the pathogenicity of missense variants. Am J Hum Genet. 2016;99(4):877–85.
  10. Rentzsch P, et al. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019;47(D1):D886–D894.
  11. Kumar P, Henikoff S, Ng PC. Predicting the effects of missense mutations using SIFT. Nat Protoc. 2009;4(7):1073–81.
  12. Adzhubei IA, et al. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–9.
  13. Siepel A, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50.
