My Project

Overview

My project focuses on developing techniques for working with large, publicly available datasets containing anonymized genomic data. As DNA sequencing becomes more affordable (the cost has already reached the milestone of the $1,000 genome), there is a growing need for efficient and affordable computing to process and analyze this data for research purposes. My project will use and develop open-source programs to convert between common genomic file formats, verify the reference genome build and dbSNP version of a dataset, and analyze the data. To work efficiently with “big data”, I will use cloud computing (Amazon Web Services EC2 and S3) for horizontal scaling.

Final Report