Week 3

Maya Anand bio photo By Maya Anand

This week I learned about shell job controls (fg, bg, wait, ctrl-z) and how check the load on the server before running jobs using multiple cpus (df, du, average load, top, uptime). We also went over for loops in shell and parallelization with xargs. Since my script for converted a VCF from the 1000 Genomes Project to multiple 23&Me type files (one for each sample present in the VCF) was taking so long, we hypothesized it could be due to the number of lines in the VCF file and decided to modify the script so it only included SNPs that were part of dbSNP Common. It took me a while to figure out how to modify my script to do this because I ran into the issue of two seemingly similar options for a bcftools command that weren’t documented very well, so I had to generate a lot of small text cases and look at the results to understand what the commands were doing. When I finally got the script working and ran it on the data for a few chromosomes, it seemed like the resulting 23&Me files were much shorter than I expected and I realized that the script had trouble matching the positions listed in the dbSNP file and the VCF file since files downloaded from UCSC are 0-based even though the online genome browser is 1-based. The VCF files are 1-based. This discovery cleared up the issue and once I converted the dbSNP file to 1-based the script seemed to work but still took a long time to run. I tried converting the vcf to a bcf file and the script ran SO much faster, which was another helpful discovery.