News Item

Rich Context Competition | The Coleridge Initiative

Researchers and analysts who want to use data for evidence and policy cannot easily find out who else has worked with the data, on what topics, and with what results. As a result, good research is underused, valuable data go undiscovered and undervalued, and time and resources are wasted redoing empirical work.

We want you to help us develop and identify the best text analysis and machine learning techniques to discover relationships between datasets, researchers, publications, research methods, and fields. We will use the results to create a rich context for empirical research – and to build new metrics describing data use.

This challenge is the first step in that discovery process.

The goal of this competition is to automate the discovery of research datasets, and the associated research methods and fields, in social science publications. Participants may use any combination of machine learning and data analysis methods to identify the datasets used in a corpus of social science publications and to infer the research methods and fields applied in each analysis.
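To make the task concrete, here is a deliberately simple sketch of a baseline for the dataset-identification part of the problem: scan a publication's full text for known dataset titles. The dataset names and publication text below are hypothetical examples, and this is not the competition's reference method; a competitive system would also need fuzzy matching, alias and acronym handling, and classifiers to infer methods and research fields.

```python
import re

# Hypothetical listing of known dataset titles (illustrative only).
datasets = [
    "National Longitudinal Survey of Youth",
    "Current Population Survey",
]

# Hypothetical excerpt from a publication's full text.
publication_text = (
    "We analyze wage trajectories using the National Longitudinal "
    "Survey of Youth and standard panel regression methods."
)

def find_dataset_mentions(text, titles):
    """Return the titles that appear verbatim (case-insensitively) in the text."""
    normalized = re.sub(r"\s+", " ", text).lower()
    return [t for t in titles if t.lower() in normalized]

print(find_dataset_mentions(publication_text, datasets))
```

Even this naive exact-match baseline illustrates why the problem is hard: datasets are often cited by acronym, partial name, or citation rather than full title, which is exactly where text analysis and machine learning come in.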

The competition has two phases (details below).

First Phase: You will be provided labeled data consisting of a listing of datasets and a labeled corpus of 2,500 publications, with an additional dev fold of 100 publications. The provided labels indicate which of the datasets are used in each publication; you can use this data to train and tune your algorithms. A separate corpus of 2,500 publications will serve as the test corpus, on which we will run your submissions on our own servers; you can validate your algorithm against this test corpus up to 5 times. On submission, you will be scored primarily on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm – and also on your ability to infer methods and research fields in the associated passage retrieval.

Second Phase: Up to five teams will be invited to participate in the second phase. If selected, you will be provided with a large corpus of unlabeled publications and asked to discover which of the datasets were used in each publication, as well as the associated research methods and fields. As in the first phase, you will be scored on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of the algorithm – and also on your ability to infer methods and research fields in the associated passage retrieval.

Teams reaching the second phase will be awarded a prize of $2,000 plus economy-class travel costs for one participant to attend the finalist workshop in New York City. A stipend of $20,000 will be awarded to the winning team, which will work with the sponsors on the subsequent implementation of the algorithm.

All submitted algorithms will be made publicly available as open source tools.

New York University’s Coleridge Initiative

Source: Rich Context Competition | The Coleridge Initiative