Decoding Genomics Data: The Complete Guide to Sanger and NGS File Formats
By the Yaazh Xenomics Bioinformatics Team
Part 1: Sanger Sequencing Data Formats
Sanger sequencing is the gold standard for targeted amplicon sequencing. Because the data output is smaller and more focused than NGS, the file formats are relatively straightforward.
.ab1 (Chromatogram / Trace File)
What it is: The raw, primary data file generated by the genetic analyzer (capillary electrophoresis equipment).
Specific Purpose: It contains the visual chromatogram (the colored peaks representing the fluorescence signal for each nucleotide: A, T, C, G), together with the base calls and, when the basecaller records them, a Phred quality score for every base call. Trace viewers (like SnapGene) use this file so researchers can manually verify heterozygous mutations or resolve ambiguous base calls.
.fasta or .seq (Sequence File)
What it is: A plain text file extracted directly from the AB1 file.
Specific Purpose: It strips away the visual data and quality scores, leaving only the string of nucleotide letters (in a FASTA file, preceded by a header line that begins with ">"). This makes it convenient for rapid downstream analyses, such as pasting into NCBI BLAST for identification or running simple ClustalW alignments.
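To make the FASTA layout concrete, here is a minimal sketch of a reader written with the Python standard library only. The function name parse_fasta and the example record are hypothetical, and real projects would normally rely on an established parser instead.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else "unnamed"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so join them into one string
            records[current_id] += line.upper()
    return records


# Hypothetical record, for illustration only
example = """>sample01_16S hypothetical Sanger read
ACGTTGCA
ggatcc
"""
sequences = parse_fasta(example.splitlines())
print(sequences)
```

The wrapped sequence lines are joined and upper-cased, which is the form most downstream tools expect.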
Part 2: Next-Generation Sequencing (NGS) Formats
NGS platforms (such as Illumina or PacBio) generate millions of reads simultaneously. The bioinformatics pipeline requires specialized, highly structured file formats to process this massive scale of data.
.fastq (The Raw Data)
What it is: The foundational text-based format for NGS data.
Specific Purpose: It stores each read as a four-line record: an identifier line, the biological sequence (the read), a separator line, and an ASCII character string encoding the Phred quality score of each individual base. FASTQ files are the starting point of most NGS pipelines. Before alignment, these files undergo Quality Control (QC) to trim adapters and remove low-quality bases.
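The quality string is easier to understand with a small worked example. The sketch below assumes the common Phred+33 encoding (ASCII code minus 33) and the standard relation Q = -10 * log10(P); the record and the function names are made up for illustration.

```python
from typing import Dict, List


def parse_fastq_record(lines: List[str]) -> Dict[str, str]:
    """Split one four-line FASTQ record into named parts."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {"id": header[1:], "sequence": sequence, "quality": quality}


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string (Phred+33 assumed by default)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Convert a Phred score to the estimated base-call error probability."""
    return 10 ** (-q / 10)


# Hypothetical read, for illustration only
record = parse_fastq_record(["@read1", "GATTACA", "+", "IIII#5?"])
scores = phred_scores(record["quality"])
print(scores)  # [40, 40, 40, 40, 2, 20, 30]
print(error_probability(20))  # Q20 corresponds to roughly a 1 in 100 error rate
```

A QC step that trims low-quality bases is, in essence, applying a threshold to scores like these.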
.sam, .bam, and .cram (The Alignment Files)
What they are: Files containing mapped sequencing data.
Specific Purpose: Once FASTQ reads are mapped to a reference genome (like GRCh38), the alignment data is stored here.
- SAM (Sequence Alignment/Map): A human-readable text file detailing exactly where reads map to the genome.
- BAM (Binary Alignment/Map): The compressed, binary version of the SAM file. It is not human-readable but is significantly smaller and faster for computers to process. BAM is the standard input for variant callers and visualization tools like IGV.
- CRAM: An even more compressed format that saves space by storing reads as differences from the reference genome, so the same reference is generally needed to decode the file.
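As an illustration of what a SAM record holds, the following sketch splits one alignment line into the eleven mandatory columns defined by the SAM specification and checks two of the FLAG bits. The example line is invented, and production code would normally use a dedicated library instead of hand-parsing.

```python
from typing import Dict, Union

# The eleven mandatory SAM columns, in order
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Union[int, str]]:
    """Parse the mandatory columns of one SAM alignment line (optional tags ignored)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record: Dict[str, Union[int, str]] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    return record


def is_unmapped(flag: int) -> bool:
    """FLAG bit 0x4 marks a read that did not align."""
    return bool(flag & 0x4)


def is_reverse_strand(flag: int) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(flag & 0x10)


# Hypothetical alignment, for illustration only (POS is 1-based)
sam_line = "read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
alignment = parse_sam_line(sam_line)
print(alignment["RNAME"], alignment["POS"], alignment["CIGAR"])
print(is_unmapped(alignment["FLAG"]), is_reverse_strand(alignment["FLAG"]))
```

BAM and CRAM carry the same fields in binary form, which is why they cannot be read line by line like this.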
.vcf (Variant Call Format)
What it is: A specialized text file containing genomic variations.
Specific Purpose: It strips away all the normal, matching genomic data and stores only the differences (SNPs, Indels, Structural Variants) found in the sample compared to the reference genome. VCF files are typically the primary clinical deliverable; they are then annotated to determine whether a variant is pathogenic.
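A short sketch can show how the variant types mentioned above appear in a VCF data line. It reads the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and applies a deliberately simplified classification based on allele lengths. The example lines and the classification rules are illustrative assumptions, not a substitute for a real variant annotator.

```python
from typing import Dict, List, Optional, Union


def classify_allele(ref: str, alt: str) -> str:
    """Very simplified variant typing based on REF and ALT strings."""
    if alt.startswith("<") or "[" in alt or "]" in alt or alt == "*":
        return "symbolic"  # e.g. <DEL>, breakends, spanning deletions
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "MNP"  # same length, more than one base


def parse_vcf_line(line: str) -> Optional[Dict[str, Union[int, str, List[str]]]]:
    """Parse the eight fixed VCF columns; return None for header lines."""
    if line.startswith("#"):
        return None
    chrom, pos, var_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    alts = alt.split(",")  # ALT may list several alleles
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "ref": ref,
        "alts": alts,
        "types": [classify_allele(ref, a) for a in alts],
        "filter": filt,
    }


# Hypothetical variants, for illustration only
snp = parse_vcf_line("chr1\t12345\t.\tA\tG\t50\tPASS\t.")
deletion = parse_vcf_line("chr2\t200\t.\tCT\tC,CTT\t99\tPASS\t.")
print(snp["types"], deletion["types"])
```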
.bed and .gtf (Genomic Annotations)
What they are: Coordinate maps for the reference genome.
Specific Purpose: These files do not contain sample data. Instead, a BED file defines specific genomic regions (for example, telling the pipeline which exonic regions to target). A GTF file records gene-structure features such as genes, transcripts, exons, and coding sequences (introns are usually implied by the gaps between exons), which is critical for RNA-Seq pipelines that calculate gene expression.
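One practical detail when combining these files is that they count coordinates differently: BED uses 0-based starts with an exclusive end, while GTF (like VCF) uses 1-based, inclusive positions. The sketch below shows the conversion and a simple target-region check; the interval and function names are hypothetical.

```python
from typing import Tuple


def bed_to_gtf(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive GTF coordinates."""
    return bed_start + 1, bed_end


def gtf_to_bed(gtf_start: int, gtf_end: int) -> Tuple[int, int]:
    """Convert 1-based, inclusive GTF coordinates to a 0-based, half-open BED interval."""
    return gtf_start - 1, gtf_end


def bed_length(bed_start: int, bed_end: int) -> int:
    """Number of bases covered by a BED interval."""
    return bed_end - bed_start


def position_in_bed(pos_1based: int, bed_start: int, bed_end: int) -> bool:
    """Check whether a 1-based position (e.g. VCF POS) lies inside a BED interval."""
    return bed_start < pos_1based <= bed_end


# Hypothetical target region: BED line "chr1  99  200"
print(bed_to_gtf(99, 200))            # (100, 200)
print(bed_length(99, 200))            # 101
print(position_in_bed(100, 99, 200))  # True
print(position_in_bed(99, 99, 200))   # False
```

Off-by-one errors between these conventions are a common source of variants that appear to fall just outside a target region.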
The NGS Workflow Summary
At Yaazh Xenomics, we process data through these formats daily. Here is how they connect in a standard Whole Exome or Genome pipeline:
Step 1: Sequencing. The sequencer generates raw biological reads and outputs them as FASTQ files.
Step 2: Alignment. Bioinformatics algorithms map the FASTQ reads against a reference genome, saving the results in a BAM file.
Step 3: Variant Calling. The software scans the BAM file for mutations and outputs the biological differences into a VCF file.
Step 4: Annotation. Using BED and GTF files, the pipeline places the VCF variants in their genomic and gene context so they can be linked to known diseases and biological functions.
