Ensuring Dataset Health: Strategies for Effective Maintenance and Support

cover
11 Jun 2024

Authors:

(1) TIMNIT GEBRU, Black in AI;

(2) JAMIE MORGENSTERN, University of Washington;

(3) BRIANA VECCHIONE, Cornell University;

(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;

(5) HANNA WALLACH, Microsoft Research;

(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;

(7) KATE CRAWFORD, Microsoft Research.

1 Introduction

1.1 Objectives

2 Development Process

3 Questions and Workflow

3.1 Motivation

3.2 Composition

3.3 Collection Process

3.4 Preprocessing/cleaning/labeling

3.5 Uses

3.6 Distribution

3.7 Maintenance

4 Impact and Challenges

Acknowledgments and References

Appendix

3.7 Maintenance

As with the questions in the previous section, dataset creators should provide answers to these questions prior to distributing the dataset. The questions in this section are intended to encourage dataset creators to plan for dataset maintenance and communicate this plan to dataset consumers.

• Who will be supporting/hosting/maintaining the dataset?

• How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

• Is there an erratum? If so, please provide a link or other access point.

• Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

• If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

• Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

• If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

• Any other comments?

This paper is available on arxiv under CC 4.0 license.