VCF

Maya Anand bio photo By Maya Anand

1KGP-phase3-vcf

1000 genomes project phase 3 data

Commands Used to Generate Files in this directory

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
gzip -d ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 
gzip -d ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bgzip -c ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf > ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bgzip -c ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 
tabix ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
tabix ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
for i in chrMT.phase3_callmom-v0_4 chrX.phase3_shapeit2_mvncall_integrated_v1b chrY.phase3_integrated_v2a; do 
     wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.${i}.20130502.genotypes.vcf.gz; 
     gzip -d ALL.${i}.20130502.genotypes.vcf.gz;
     bgzip -c ALL.${i}.20130502.genotypes.vcf > ALL.${i}.20130502.genotypes.vcf.gz;
     tabix ALL.${i}.20130502.genotypes.vcf.gz;
done
for i in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 Y X MT; do 
     wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr${i}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz; 
     gzip -d ALL.chr${i}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz; 
     bgzip -c ALL.chr${i}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf > ALL.chr${i}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz ; 
     tabix ALL.chr${i}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz; 
done

bgzip and tabix

To download .vcf.gz files from 1000genomes:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

To compress .vcf.gz files to .vcf files:

gzip -d ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 

gzip -d ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

To re-compress using bgzip:

bgzip -c ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf > ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 
bgzip -c ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf > ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 

To index with tabix:

tabix ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
tabix ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

To query with tabix:

tabix ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 21:10000000-12000000 | wc -l  Output: 22287

tabix ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 1:10000000-12000000 | wc -l  Output: 62245

bcftools

How many samples

bcftools query -l ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | wc -l Output: 2504
bcftools query -l ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | wc -l Output: 2504

How many SNPs

bcftools query -f '%ID \n' ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | wc -l Output: 1105538
bcftools query -f '%ID \n' ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | wc -l Output: 6468094

Are there duplicate SNP ids

bcftools query -f '%ID \n' ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | sort| uniq -d

uniq -d shows only duplicates, -c shows the count of each > filename.txt saves the output into a textfile | wc -l to get the count of duplicates

Are there duplicate positions

bcftools query -f '%POS \n' ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | sort| uniq -d 

How are samples stored? How are genotypes stored? Extract one sample. Extract one attribute.

In the header row, there will be a format field giving the format for genotype and other data types. The format field is followed by unique sample ids for each sample. Basically each sample gets a column and each Position/SNP gets a row. Details on how genotypes are encoded is below.

To extract one sample: bcftools view -s SAMPLEID Filename

To extract one attribute/tag: bcftools query -f ‘%INFO/TAG \n’ filename

Format: * %CHROM The CHROM column (similarly also other columns, such as POS, * ID, QUAL, etc.) * %INFO/TAG Any tag in the INFO column * %TYPE Variant type (REF, SNP, MNP, INDEL, OTHER) * %MASK Indicates presence of the site in other files (with multiple files) * %TAG{INT} Curly brackets to subscript vectors (0-based) * [] The brackets loop over all samples * %GT Genotype (e.g. 0/1) * %TGT Translated genotype (e.g. C/A) * %LINE Prints the whole line * %SAMPLE Sample name

Can pipe view query to get specific attributes for specific samples

How are complicated variations encoded?

  • 0 for ref allele
  • 1 for ALT1
  • 2 for ALT2, etc
  • sep
    • / for unphased
    • for phased
  • . for missing values
  • if there more nucleotides for either the reference or the alternate, that signifies an insertion or deletion

What are each of the Info tags? Why do they appear only once even though there are thousands of samples?

  • AA ancestral allele
  • AC allele count in genotypes, for each ALT allele, in the same order as listed
  • AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
  • AN total number of alleles in called genotypes
  • BQ RMS base quality at this position
  • CIGAR cigar string describing how to align an alternate allele to the reference allele
  • DB dbSNP membership
  • DP combined depth across samples, e.g. DP=154
  • END end position of the variant described in this record (esp. for CNVs)
  • H2 membership in hapmap2
  • MQ RMS mapping quality, e.g. MQ=52
  • MQ0 Number of MAPQ == 0 reads covering this record
  • NS Number of samples with data
  • SB strand bias at this position
  • SOMATIC indicates that the record is a somatic mutation, for cancer genomics
  • VALIDATED validated by follow-up experiment

The info tags only appear once because they are about the all the alleles/summary, not specific to each sample

For the DNA.Land imputed VCF:

What’s the difference between the NS and GT:GL fields? * the NS field is the number of samples and is part of INFO; it is an aggregate/summary value and isn’t associated with a specific sample. GT:GL is the genotype and genotype likelihood. It is part of FORMAT and has a different value for each sample

How are GL/GT fields encoded? * GT is encoded as allele values separated by “/” (unphased). 0 is the ref allele and 1 is the alternate. * GL is the log10-scaled likelihood for AA, AB, BB genotypes where A is the reference allele and B is the alternate.

What are the GT values for snp rs200013390 ? grep -s rs200013390 /data2/agordon/imputed-vcf-example/yaniv_erlich.imputed.vcf * the GT values are 0/0 which corresponds to C/C

Last updated: 06/25/16 by Maya