Intern Bio
Heidy Khlaaf is an undergraduate student at Florida State University. She is double majoring in Computer Science and Philosophy and plans to graduate in Fall 2012.
Heidy can be contacted at hak09 at fsu dot edu
Mentor Bio
Carla E. Brodley is a professor in the Department of Computer Science at Tufts University. She received her PhD in computer science from the University of Massachusetts at Amherst in 1994. From 1994 to 2004, she was on the faculty of the School of Electrical Engineering at Purdue University, West Lafayette, Indiana. Professor Brodley's research interests include machine learning, knowledge discovery in databases, and computer security. She has worked in the areas of intrusion detection, anomaly detection in networks, hardware support for security, classifier formation, and unsupervised learning, and on applications of machine learning to remote sensing, computer security, digital libraries, astrophysics, content-based image retrieval of medical images, computational biology, saliva diagnostics, and chemistry.
Carla E. Brodley's site can be found here.
Professor Brodley can be contacted at brodley at cs dot tufts dot edu
Research Problem
Machine Learning and Information Retrieval for Finding Potential CRA-W and CDC Participants
The CRA-W and CDC BPC alliance is reaching a large number of people, but we can do better. We propose to research how to automatically create mailing lists of potential participants. To this end, we will build a system that crawls the web and finds potential participants who are in computer science or engineering and who are women or members of under-represented minorities. We anticipate that female computer scientists and engineers will be far easier to identify than under-represented minorities, because individual bios use the pronoun "she". The target group is graduate students, professors, and researchers in industry and academia, since undergraduates for the most part do not have home pages.
We can frame this challenge as a text-based machine-learning problem: given a web page in text form, can we automatically predict whether it describes a computer scientist and, further, whether the person is female or male? The state of the art in supervised machine learning applied to text classification is to create a vector representation of the text (perhaps using n-grams) and then apply a support vector machine (SVM) to form an automated classifier. The learned SVM can then be applied to classify new text. Text-based classification has been applied to many domains, including newspaper story classification, spam filtering, and information retrieval.
In the proposed project, we will treat the two problems separately, forming one classifier for classifying a web page as belonging to a female computer scientist and another for classifying a web page as belonging to a computer scientist who is a member of an under-represented minority. Forming a classifier (known as supervised learning) requires labeled training data from which to learn the classification rule. CRA-W and the CDC have lists of current and past participants that can be used as "positive" examples of computer scientists to include.
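The n-gram representation and SVM classifier described above can be sketched in a few lines. This is a minimal illustration only, not the project's implementation: the function names (ngram_features, train_linear_svm, predict), the four toy bios, and the settings lam, eta, and iterations are all assumptions made for the example, and the linear SVM is trained with plain batch sub-gradient descent on the hinge loss rather than with a production solver.

```python
import re
from collections import Counter


def ngram_features(text, n_max=2):
    """Return a sparse bag of word n-grams (n = 1..n_max) as feature -> count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def score(weights, feats):
    """Linear score w . x for a sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_linear_svm(examples, lam=0.01, eta=0.1, iterations=200):
    """Batch sub-gradient descent on the L2-regularized hinge loss.

    examples: list of (features, label) pairs with label in {+1, -1}.
    """
    weights = {}
    m = len(examples)
    for _ in range(iterations):
        # Examples whose margin is below 1 under the current weights.
        violators = [(x, y) for x, y in examples if y * score(weights, x) < 1]
        # Regularization step: shrink every weight toward zero.
        for f in weights:
            weights[f] *= 1.0 - eta * lam
        # Hinge-loss step: move the weights toward each violating example.
        for x, y in violators:
            for f, v in x.items():
                weights[f] = weights.get(f, 0.0) + eta * y * v / m
    return weights


def predict(weights, text):
    """Classify raw text: +1 (female) or -1 (male) in this toy setup."""
    return 1 if score(weights, ngram_features(text)) >= 0 else -1


# Hypothetical, hand-written training bios (label +1 = female, -1 = male).
training_bios = [
    ("She is a professor of computer science. Her research is in machine learning.", 1),
    ("He is a professor of computer science. His research is in machine learning.", -1),
    ("She received her PhD in computer engineering.", 1),
    ("He received his PhD in computer engineering.", -1),
]
examples = [(ngram_features(text), label) for text, label in training_bios]
weights = train_linear_svm(examples)

print(predict(weights, "She is a researcher in computer vision."))
print(predict(weights, "He received his PhD in physics."))
```

In this toy data the only n-grams that separate the two classes are those containing pronouns, which mirrors the observation above that "she" makes pages of female computer scientists comparatively easy to identify. A real system would need far more training pages and would likely rely on an established SVM library.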
Finding non-minority, male computer scientists is not difficult, and it is likely that we can create a large set of these "negative" examples by hand in just one day. An interesting research issue arises because we can view the construction of a training dataset as an active learning problem, in which we selectively ask to have more web pages labeled to create more training data. It is not hard for a human to determine whether a web page belongs to a man or a woman (pictures, pronouns, etc.).
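One common way to decide which pages to send to a human labeler is uncertainty sampling: ask for labels on the pages the current classifier is least sure about, that is, those whose score lies closest to the decision boundary. The sketch below assumes this strategy purely for illustration; the text above does not say which selection rule the project would use, and the function name select_for_labeling, the page identifiers, and the scores are hypothetical.

```python
def select_for_labeling(scores, k=2):
    """Return the ids of the k pages whose classifier score is closest to zero.

    scores: dict mapping page id -> signed classifier score (e.g. an SVM's w . x).
    """
    return sorted(scores, key=lambda page: abs(scores[page]))[:k]


# Hypothetical scores for four unlabeled pages under the current classifier.
pool_scores = {"page_a": 2.3, "page_b": -0.1, "page_c": 0.4, "page_d": -1.7}

to_label = select_for_labeling(pool_scores, k=2)
print(to_label)  # ['page_b', 'page_c']
```

After a person labels the selected pages, they would be added to the training set and the classifier retrained, and the cycle repeated until the labeling budget runs out.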
Research Journal
The weekly research blog can be found here.
Final Report
The PDF of our final report is located here.