Machine Learning and Information Retrieval for Finding Potential CRA-W and CDC Participants

The CRA-W and CDC BPC alliance is reaching a large number of people, but we can do better. We propose to research how to automatically create mailing lists of potential participants. To this end, we will build a system that crawls the web and finds people who are in computer science or engineering and who are women or under-represented minorities. We anticipate that female computer scientists and engineers will be far easier to identify than under-represented minorities because individual bios often use the pronoun "she". The target group is graduate students, professors, and researchers in industry and academia, since undergraduates for the most part do not have home pages.

We can frame this challenge as a text-based machine-learning problem. Given a web page in text form, can we automatically predict whether it describes a computer scientist and, further, whether that person is female or male? The state of the art in supervised machine learning applied to text classification is to create a vector representation of the text (perhaps using n-grams) and then train a support vector machine (SVM) to form an automated classifier. The learned SVM can then be applied to classify new text. Text-based classification has been applied to many domains, including newspaper story classification, spam filtering, and information retrieval. In the proposed project, we will treat the two problems separately, forming one classifier that decides whether a web page belongs to a female computer scientist and another that decides whether a web page belongs to a computer scientist who is an under-represented minority.
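To make the pipeline concrete, here is a minimal sketch of the n-gram representation followed by a linear SVM, applied to the simpler question of whether a page describes a computer scientist. It is illustrative only: the function names, the toy training pages, and the hyperparameters are assumptions invented for this example, and the SVM is trained with a plain stochastic subgradient method on the hinge loss rather than with a production solver.

```python
from collections import Counter

def ngram_features(text, n_max=2):
    """Map text to a sparse bag of word n-grams for n = 1..n_max."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

def score(weights, feats):
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_linear_svm(examples, epochs=20, eta=0.1, lam=0.01):
    """Linear SVM objective (hinge loss plus L2 penalty), minimized by
    stochastic subgradient descent. Labels must be +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            violated = label * score(weights, feats) < 1.0
            # Subgradient of the L2 penalty: shrink every weight slightly.
            for f in weights:
                weights[f] *= 1.0 - eta * lam
            # Subgradient of the hinge loss: only when the margin is violated.
            if violated:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + eta * label * v
    return weights

def predict(weights, text):
    """Return +1 (computer scientist) or -1 (not) for a page's text."""
    return 1 if score(weights, ngram_features(text)) > 0 else -1

# Toy, invented training pages: +1 = computer scientist, -1 = not.
training_pages = [
    ("professor of computer science researching machine learning", 1),
    ("graduate student in computer engineering studying compilers", 1),
    ("chef at a restaurant serving italian food", -1),
    ("professional football player and coach", -1),
]
examples = [(ngram_features(text), label) for text, label in training_pages]
weights = train_linear_svm(examples)

print(predict(weights, "assistant professor of computer science"))
print(predict(weights, "head chef at a new restaurant"))
```

A real system would train on thousands of crawled pages and would likely use an established SVM implementation; the sketch only shows how the vector representation and the learned classifier fit together.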

Forming a classifier (known as supervised learning) requires labeled training data from which to learn the classification rule. CRA-W and the CDC have lists of current and past participants that can be used as "positive" examples of computer scientists to include. Finding "negative" examples, that is, non-minority male computer scientists, is not difficult, and it is likely that we can create a large set of them by hand in just one day. An interesting research issue arises because we can view the construction of the training dataset as an active learning problem, in which we selectively ask a human to label additional web pages to create more training data. It is not hard for a human to determine whether a web page belongs to a man or a woman (pictures, pronouns, etc.).
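One common active learning strategy is uncertainty sampling: ask the human to label the pages the current classifier is least sure about. The proposal does not commit to a particular strategy, so the sketch below is only an assumption about how the selection step might look; the function name, page ids, and scores are hypothetical.

```python
def select_for_labeling(scores, k=2):
    """Return the ids of the k unlabeled pages whose classifier score is
    closest to zero, i.e. closest to the decision boundary."""
    return sorted(scores, key=lambda page_id: abs(scores[page_id]))[:k]

# Hypothetical classifier scores for four unlabeled pages.
unlabeled_scores = {"page_a": 2.3, "page_b": -0.1, "page_c": 0.4, "page_d": -1.7}
queries = select_for_labeling(unlabeled_scores, k=2)
print(queries)  # ['page_b', 'page_c']
```

After the selected pages are labeled by hand, they are added to the training set, the classifier is retrained, and the loop repeats until the labeling budget is spent.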

A final complication is finding the contact information on the web page. Because email addresses are often obfuscated to prevent spam, we anticipate that we may be able to harvest only some of them. Recognizing an obfuscated address is a separate classification problem. However, aside from using images, people seem to use a small set of obfuscation rules (e.g., rewriting brodley@cs.tufts.edu in the form lastname - at - cs.institution.edu), so we should be able to recover the majority.
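A small set of rewrite rules could be expressed as regular expressions. The sketch below handles only a few obfuscation styles ("- at -", "(at)", "(dot)", and the bare words "at" and "dot"); the rule set, the function name, and the validity check are assumptions for illustration, and the function expects a short candidate string rather than a whole page.

```python
import re

# Hypothetical rewrite rules: a bracketed or dashed "at"/"dot", or the bare
# word surrounded by whitespace.
AT_RULE = re.compile(r"\s*[\[\(\{\-]\s*at\s*[\]\)\}\-]\s*|\s+at\s+", re.IGNORECASE)
DOT_RULE = re.compile(r"\s*[\[\(\{\-]\s*dot\s*[\]\)\}\-]\s*|\s+dot\s+", re.IGNORECASE)

# Deliberately simple shape check, not a full address validator.
EMAIL_SHAPE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def deobfuscate(candidate):
    """Undo common obfuscations in a short candidate string.
    Returns the recovered address, or None if the result is not email-shaped."""
    text = AT_RULE.sub("@", candidate.strip(), count=1)
    text = DOT_RULE.sub(".", text)
    return text if EMAIL_SHAPE.fullmatch(text) else None

print(deobfuscate("lastname - at - cs.institution.edu"))
print(deobfuscate("lastname (at) cs (dot) institution (dot) edu"))
print(deobfuscate("no address here"))
```

Addresses embedded in images are out of reach for rules like these, which is consistent with the expectation above that only the majority, not all, can be recovered.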
