File Formats
PED
Describes individuals and their genetic data.
- Space or tab delimited file
- One line for each individual
Columns:
- Family ID [string]
- Individual ID [string] – unique, containing only alphanumeric characters
- Father ID [string]
- Mother ID [string]
- Sex [integer] – 1 for male, 2 for female
- Phenotype [float]
- SNP1 first allele
- SNP1 second allele
- SNP2 first allele
- SNP2 second allele
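The column layout above can be read with a few lines of Python. This is a minimal sketch, not a full PED parser: the `PedRecord` type, the `parse_ped_line` helper, and the sample line are made up for illustration, and it assumes the phenotype is always numeric and that alleles come in pairs after the six leading columns.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: int
    phenotype: float
    genotypes: List[Tuple[str, str]]  # one (first allele, second allele) pair per SNP


def parse_ped_line(line: str) -> PedRecord:
    """Parse one space- or tab-delimited PED line (hypothetical helper)."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 leading columns followed by allele pairs")
    alleles = fields[6:]
    # Pair up consecutive allele columns: (SNP1 first, SNP1 second), ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(
        fields[0], fields[1], fields[2], fields[3],
        int(fields[4]), float(fields[5]), genotypes,
    )


# Illustrative line with two SNPs; the values are invented.
record = parse_ped_line("FAM001 IND001 0 0 1 2 A A G T")
print(record.individual_id, record.genotypes)
```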
VCF
- meta-information lines
- header line
- data lines (each for a different position in the genome)
Meta-information lines
- start with “##”
- must include a ‘fileformat’ field
- ##fileformat=VCFv4.0
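As a rough illustration of the rules above, the sketch below collects the leading `##key=value` lines and checks that a `fileformat` entry is present. The function name `read_meta_information` and the `##source` line in the sample are assumptions made for this example.

```python
def read_meta_information(lines):
    """Collect leading '##key=value' lines into a dict of lists (hypothetical helper)."""
    meta = {}
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines come before the header line
        key, sep, value = line[2:].rstrip("\n").partition("=")
        if sep:
            # A key such as INFO can appear several times, so keep a list.
            meta.setdefault(key, []).append(value)
    if "fileformat" not in meta:
        raise ValueError("missing required ##fileformat line")
    return meta


# Invented sample input: two meta-information lines and the header line.
example = [
    "##fileformat=VCFv4.0\n",
    "##source=exampleCaller\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
meta = read_meta_information(example)
print(meta["fileformat"])
```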
### Header line
- names the 8 fixed, mandatory columns
- tab-delimited
The columns are:
- #CHROM
- POS
- ID
- REF
- ALT
- QUAL
- FILTER
- INFO
Data lines
- tab-delimited
Fields:
- CHROM: chromosome identifier from the reference genome
- POS: position on the reference (1-based)
- ID: unique identifier where available, such as a dbSNP rs number
- REF: reference base(s), each one of A, C, G, T or N; for indels, includes the base before the event
- ALT: non-reference alleles called on at least one of the samples; A, C, G, T, N or
- QUAL: Phred-scaled quality score; higher scores indicate higher confidence in the call
- FILTER: PASS if the position passed all filters, otherwise the codes of the filters it failed
- INFO: additional info
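To make the field list concrete, here is a minimal sketch that splits one tab-delimited data line into the eight fixed columns. The helper name `parse_vcf_data_line` and the sample line are illustrative. It assumes multiple ALT alleles are comma-separated and that a missing QUAL is written as `.`, and it leaves INFO as a raw string.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_data_line(line):
    """Split one VCF data line into its 8 fixed fields (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(VCF_COLUMNS):
        raise ValueError("a data line needs at least 8 tab-delimited fields")
    record = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])
    # Assumption: several ALT alleles are separated by commas.
    record["ALT"] = record["ALT"].split(",")
    # Assumption: '.' marks a missing quality score.
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    return record


# Illustrative data line; the values are for demonstration only.
record = parse_vcf_data_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\n")
print(record["CHROM"], record["POS"], record["ALT"], record["FILTER"])
```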
BAM/CRAM
### BAM
- binary version of a SAM file
- stores sequence alignment data
### SAM
- header lines start with ‘@’
- alignment lines have 11 mandatory fields holding the essential alignment information
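The 11 mandatory SAM fields are, in order, QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ and QUAL. The sketch below shows one way to split an alignment line into them. The helper name `parse_sam_line` and the sample read are invented, and optional tags after the 11th column are kept as raw strings.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return a dict for an alignment line, or None for a header line (hypothetical helper)."""
    if line.startswith("@"):
        return None  # header lines start with '@'
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs 11 mandatory fields")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = columns[11:]  # optional fields, left unparsed
    return record


# Invented alignment: a 4-base read mapped to chr1 at position 100.
alignment = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n")
print(alignment["QNAME"], alignment["RNAME"], alignment["POS"], alignment["CIGAR"])
```

In practice a library such as pysam or the samtools command line is the usual route for BAM and CRAM, since both are binary; the text-based sketch applies only to SAM.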
### CRAM
- like BAM
- compressed version of the alignment
FASTQ
- stores sequences and Phred qualities
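As an illustration of how sequences and Phred qualities sit together, the sketch below reads FASTQ text and converts each quality character into a score. It assumes the common four-line record layout (no wrapped sequence lines) and the Phred+33 encoding; older files may use a different offset. The function name `parse_fastq` and the sample read are made up.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into (name, sequence, scores) tuples (hypothetical helper)."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33, so the score is the ASCII code minus 33.
        scores = [ord(ch) - offset for ch in quality]
        records.append((header[1:], sequence, scores))
    return records


# Invented record: 'I' decodes to 40, '5' to 20 and '!' to 0.
records = parse_fastq("@read1\nACGT\n+\nII5!\n")
print(records[0])  # ('read1', 'ACGT', [40, 40, 20, 0])
```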
23andMe
| SNP (rs ID) | chromosome | position | genotype |
|---|---|---|---|
| rs4477212 | 1 | 82154 | AA |
| rs3094315 | 1 | 752566 | AG |
| rs3131972 | 1 | 752721 | AG |
| rs12124819 | 1 | 776546 | AA |
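Rows in this four-column layout can be loaded into a dictionary keyed by rs ID. This is a minimal sketch: the function name `parse_23andme` is invented, and skipping lines that start with `#` is an assumption about comment lines in the raw export.

```python
def parse_23andme(lines):
    """Map each rs ID to (chromosome, position, genotype) (hypothetical helper)."""
    genotypes = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines carry no genotype data.
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split()
        # Chromosome stays a string because it can be non-numeric (X, Y, MT).
        genotypes[rsid] = (chromosome, int(position), genotype)
    return genotypes


# The four example rows from this section.
sample = [
    "rs4477212 1 82154 AA",
    "rs3094315 1 752566 AG",
    "rs3131972 1 752721 AG",
    "rs12124819 1 776546 AA",
]
calls = parse_23andme(sample)
print(calls["rs3094315"])  # ('1', 752566, 'AG')
```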