Seminar: ML to Augment Scholarly Infrastructure
|Dates:||14 October 2019|
|Times:||14:00 - 15:00|
|What is it:||Seminar|
|Organiser:||Department of Computer Science|
|Who is it for:||University staff, Adults, Current University students|
Join us for the next Computer Science Seminar with speaker Paco Nathan.
ML to Augment Scholarly Infrastructure
This talk is about Rich Context and the associated new features for data governance and collaboration which are going into JupyterLab. Rich Context is a research effort within the Coleridge Initiative at NYU Wagner, developing a knowledge graph built from metadata about the use of curated datasets and related research. As a foundation, the Administrative Data Research Facility (ADRF) is a platform used across 15 government agencies in the US for social science research with sensitive data. ADRF promotes evidence-based policymaking and provides support for data stewardship practices among the agencies. Rich Context, in turn, is used to represent several kinds of entities involved: datasets, data providers, researchers, research publications, subject headings, etc. Machine learning applications leverage this graph to perform entity linking, e.g., identifying dataset attribution within open access research publications.
Collaboration between Coleridge Initiative and Project Jupyter has led to a new feature set for JupyterLab that supports Rich Context and other scholarly infrastructure. Developed as extensions to JupyterLab, these new data governance features support: dataset registry, metadata exchange, usage telemetry, and comments/annotations about datasets which offer feedback for data stewards. For example, a researcher working with a new dataset can use new features in Jupyter to explore the metadata describing that dataset.
Other approaches explored by the Rich Context project include:
- a machine learning leaderboard competition on GitHub for the ML apps that leverage the Rich Context knowledge graph (AllenAI won the first round, LARC at Singapore Management University is currently leading)
- semi-supervised learning: inferred metadata from the ML competition gets presented back to authors through scholarly infrastructure platforms such as RePEc for their feedback.
An immediate goal is to provide recommendations for researchers working with well-known datasets, e.g., those curated by government agencies. A long-term goal is to collect workflow configurations associated with those datasets, then provide meta-learning services (AutoML), e.g., suggested configurations to help optimize research workflows and support reproducible research.
Overall, Coleridge Initiative is collaborating with Bundesbank, USDA, Digital Science, SAGE Pub, ResearchGate, RePEc, and GESIS on this work, which is funded by Schmidt Futures, Sloan, and Overdeck.