Hitaxi Kalaria's DREU 2010 webpage




Recent advances in information technology have made it possible for data providers (governments and organizations) to collect and store many types of information about individuals in statistical databases. Governments and organizations have shown a growing interest in sharing these large datasets because they recognize the value of the data stored in them. Statistics obtained from statistical databases are used to understand the traits and features of a population, the spread of a disease, economic trends, and much more. However, releasing such statistics poses a threat to the privacy of the individuals represented in the datasets, so data providers must protect the privacy of those individuals while still releasing accurate statistics. This is called the problem of Statistical Disclosure Control: a data provider must release accurate information about a dataset while preserving the privacy of the individuals represented in it.

My project goals are:
--To understand prior work in the field of data privacy that addresses the Statistical Disclosure Control problem. I will be learning about several privacy-preserving mechanisms used in privacy-preserving data publishing. The study includes the k-anonymity and l-diversity principles and the differential privacy mechanism.
--To study a differentially private algorithm for histogram release, run experiments that vary the key parameters of the algorithm, and analyze the outcomes.
--To study the HIDE framework, install the software, and test it on sample files. HIDE™: Health Information DE-identification is a framework for publishing and sharing health data while preserving data privacy. The HIDE software is used for de-identifying sensitive information, such as names and medical record numbers, in medical text before the text is released for research and public use. Usually the anonymization and removal of sensitive data is done manually and is very time consuming, but HIDE automates the process of identifying and anonymizing sensitive information using a trained conditional random field (CRF) model. To learn more about HIDE, please visit the project page and code page.
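
To make the second goal more concrete, here is a minimal sketch of a differentially private histogram release using the standard Laplace mechanism. It is only a baseline illustration and not necessarily the algorithm studied in this project. The function names (`laplace_noise`, `dp_histogram`), the sample ages, and the age-group bins are all hypothetical. The sketch assumes that neighboring datasets differ by adding or removing one record, so each record changes exactly one count by 1 and the noise scale is 1/epsilon. It also assumes the bins are fixed in advance and do not depend on the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, bins, epsilon, rng=None):
    """Return noisy counts for the given bins (hypothetical helper).

    Assumption: add/remove-one-record neighbors, so the L1 sensitivity
    of the histogram is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    counts = Counter(values)
    scale = 1.0 / epsilon
    # Every predefined bin gets noise, including empty ones; values that
    # fall outside the predefined bins are ignored.
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in bins}


if __name__ == "__main__":
    # Hypothetical data: age group of each individual in a dataset.
    ages = ["20-29", "30-39", "20-29", "40-49", "30-39", "20-29"]
    bins = ["20-29", "30-39", "40-49", "50-59"]
    for eps in (0.1, 1.0, 10.0):
        noisy = dp_histogram(ages, bins, eps, rng=random.Random(0))
        print(eps, {b: round(c, 2) for b, c in noisy.items()})
```

Running the loop with different values of epsilon is one simple way to vary a key parameter: a smaller epsilon gives stronger privacy but noisier counts, and a larger epsilon gives counts closer to the true histogram.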