The Data Nutrition Project

Sarah Newman (US), Kasia Chmielinski (US), Matthew Taylor (US)

Artificial intelligence models and algorithms are increasingly used to make decisions about people, often with unintended consequences, particularly for communities that are already marginalized, underserved, and underrepresented. One cause of harm is the data used to train the models that make these decisions: models trained on problematic, incomplete, or otherwise biased datasets will replicate the very issues found in those datasets.

The Data Nutrition Project is an initiative run by technologists, artists, scholars, and practitioners to enable the quick evaluation and interrogation of datasets through educational practices and a tool: the Dataset Nutrition Label. Like a nutrition label for food, the Dataset Nutrition Label presents important information about a dataset and can help mitigate harm caused by using poorly chosen data for a particular use case. In addition to building labels and a forthcoming label-maker ingestion engine, the team works on educational initiatives, with a children's book and a podcast in the works. These initiatives are intended to drive cultural awareness of algorithmic risks and of current interventions to improve the AI ecosystem.

The Labels created by the Data Nutrition Project team are visual, interactive, and easy to digest, and are intended to give practitioners and researchers easy access to dataset contents. Too often, "found data" is shared and reused with insufficient documentation: there is no documentation at all, the documentation is incomplete, or it is written within the context of a domain’s epistemic norms, which may be unfamiliar to data practitioners. There is no agreement on standards for dataset quality or documentation, a problem exacerbated by the "move fast" culture endemic in the tech industry, which often prioritizes rapid deployment above all else. The Data Nutrition Project seeks to address these challenges through a user-friendly tool (the Label) and through our creative-educational work.

Software 

Our project uses two software components: (1) an Ingestion Engine, a logic-conditional questionnaire currently in development that will collect from dataset owners the information needed for Dataset Nutrition Labels, and (2) the Dataset Nutrition Label, a user experience in which data practitioners can easily view important information about a dataset and the implications of using it, based on the information submitted by dataset owners. All our software is custom built, and all our code is currently open source. Our Ingestion Engine and the newest version of the Dataset Nutrition Label (V3) will be publicly released in late 2022.
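
To make the idea of a logic-conditional questionnaire concrete, here is a minimal sketch in Python. It is an illustration only and not the project's actual code: the question keys, the `show_if` rule, and the functions `run_questionnaire` and `render_label` are all hypothetical names invented for this example. It shows how a later question can depend on an earlier answer, and how the collected answers might then be turned into a simple text label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Answers = Dict[str, str]


@dataclass
class Question:
    key: str
    prompt: str
    # Hypothetical rule: ask this question only if the predicate
    # holds for the answers collected so far (None means always ask).
    show_if: Optional[Callable[[Answers], bool]] = None


# Illustrative questions only; not taken from the real Ingestion Engine.
QUESTIONS: List[Question] = [
    Question("contains_people", "Does the dataset describe people? (yes/no)"),
    Question(
        "demographics",
        "Which demographic fields are included?",
        show_if=lambda a: a.get("contains_people") == "yes",
    ),
    Question("known_gaps", "What known gaps or limitations does the dataset have?"),
]


def run_questionnaire(
    questions: List[Question], respond: Callable[[Question], str]
) -> Answers:
    """Ask each question in order, skipping those whose condition is not met."""
    answers: Answers = {}
    for q in questions:
        if q.show_if is None or q.show_if(answers):
            answers[q.key] = respond(q)
    return answers


def render_label(name: str, answers: Answers) -> str:
    """Format the collected answers as a plain-text label."""
    lines = [f"Dataset: {name}"]
    for key, value in answers.items():
        lines.append(f"  {key}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Scripted answers stand in for a dataset owner filling in the form.
    scripted = {"contains_people": "no", "known_gaps": "covers 2019 only"}
    collected = run_questionnaire(QUESTIONS, lambda q: scripted[q.key])
    print(render_label("example-dataset", collected))
```

In this sketch the demographics question is skipped because the scripted owner answered "no" to the first question. Keeping the conditions as small functions attached to each question is one possible design choice; it keeps the branching logic next to the question it controls.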

datanutrition.org
twitter.com/makedatahealthy

Data Nutrition Project team:  

Project lead: Kasia Chmielinski  

Research lead: Sarah Newman 

Tech lead: Matt Taylor  

Engineers: Kemi Thomas, HG King

Data science advisor: Chris Kranzinger 

Designer: Carine Teyrouz  

Children's book illustrator: Michael Sherman 

Data Nutrition Project Board: Jessica Fjeld, Mary Gray, Josh Josephs, and James Mickens 

With previous support from: The Harvard Data Science Initiative at Harvard University, CR Digital Labs, The Assembly Fellowship at the Berkman Klein Center for Internet & Society at Harvard University, and The Miami Foundation 

Kasia Chmielinski (US) is Co-Founder of the Data Nutrition Project and a technologist focused on building responsible data systems across industry, academia, government, and non-profit domains. When not thinking about data, Kasia is usually cycling or birdwatching around the Northeastern US.

Sarah Newman (US) is Co-Founder of the Data Nutrition Project and the Director of Art & Education at metaLAB at Harvard. Working at the intersection of research and art, Newman's work explores technology’s role in human experience. Newman is an avid napper and seashell collector.

Matthew Taylor (US) is a learning experience designer with a background in AI. Previously worked at the MIT Media Lab, and in Boston Public Schools. Currently creating curricula to demystify AI, building mutual aid networks, and organizing tech workers for social justice. Seasoned pun specialist.