This is a prototype tool for authors of NLP research papers to report biases related to their datasets. As you add information about your dataset, the tool adapts to suggest possible biases.
If there is a research paper that investigates bias or a statistic you think should be added to the tool, you can suggest it through this form. For example, Wikipedia articles are edited by 81% men.
All code for the prototype is published on our GitHub.
You can also read the paper to understand the research that motivated this prototype and to find a discussion of related projects.
Select only one item. If your data come from more than one source, we recommend filling out multiple forms.
Because this is a prototype, only some of the following options are populated in our database. Currently, Reddit and Wikipedia have the most complete entries.
If biases are in our database, they will be populated below as "suggested biases." If bias information is not in our database, an empty text box is created for that category. You can also create a new bias category and enter the information.
Please select a major language group (sometimes called macro languages, or language families).
This list contains the 20 most spoken languages according to Ethnologue. We recognize that this list is far from inclusive, but we hope to include more one day.
Because this is a prototype, only some of the following options are populated in our database. Currently, Arabic and Chinese have the most complete entries.
If your dialect is more finely grained, you can further edit it and the country list in the next section.
Formality is one aspect of the social context in which language is exchanged, and it often influences the structure and complexity of the language. For example, a speech presented at a conference would be formal, whereas a conversation between friends that includes lots of slang would be informal.
Here context refers to the genre or type of platform in which the text was originally written or spoken.
Timestamps are the dates when the data itself was generated. For example, the day and time at which a user wrote a social media post.