Intersectional Bias in Hate Speech and Abusive Language Datasets
Algorithms are widely applied to detect hate speech and abusive language on popular social media platforms such as YouTube, Facebook, Instagram, and Twitter. Using algorithms helps identify, at scale, which posts contain socially undesirable content. This computational method is efficient but not perfect. Most algorithms are trained with labeled data. What if the training data used to detect hate speech and abusive language in social media were itself biased?
For transparency and reproducibility, we used only publicly available datasets. The annotated Twitter dataset (N = 99,996) on hate speech and abusive language was created through a crowd-sourcing platform, and its quality was ensured by several rounds of validation. (The dataset is also part of the public data shared for the first International Conference on Web and Social Media data challenge. In the data wrangling process, we discovered that 8,045 tweets were duplicates and removed them. Consequently, the size of the final dataset was reduced to 91,951 tweets.) Founta et al. (2018), who generated the aforementioned dataset, defined abusive language as “any strongly impolite, rude, or hurtful language using profanity” and hate speech as “language used to express hatred towards a targeted individual or group” (5). We followed their definitions.
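As an illustration of the deduplication step, the sketch below assumes the dataset is loaded from a CSV file with a text column; the file name and column names are assumptions, not the authors' code.

```python
import pandas as pd

# Load the annotated dataset; the file name and column names are assumptions.
tweets = pd.read_csv("founta_et_al_2018.csv")
print(len(tweets))  # 99,996 tweets before cleaning

# Remove duplicate tweets, keeping the first occurrence of each text.
tweets = tweets.drop_duplicates(subset="text", keep="first")
print(len(tweets))  # 91,951 tweets after dropping 8,045 duplicates
```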
The Twitter dataset does not contain any information on the racial, gender, or partisan identities of the authors of these tweets. We utilized additional public datasets to classify and fill in the missing information. Our primary interest was the interaction between race and gender and whether it generated a biased distribution of hateful and abusive labels. Such an underlying data distribution would produce uneven false positive and false negative rates across groups. However, human annotators could be biased not only in terms of race and gender but also in terms of political affiliation. This is especially likely if annotators were recruited in the United States, where political polarization is extreme. For this reason, we also classified party identification, the degree to which a person identifies with a particular political party. To be clear, what we classified were not the actual racial, gender, or partisan identities of the authors of these tweets. The objective was to classify whether the language features expressed in a tweet were closer to those commonly expressed by one racial/gender/partisan group than to those of other groups.
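To make this labeling step concrete, the sketch below assigns each tweet to the racial language category with the higher inferred probability. The probability columns and their values are hypothetical; in the actual pipeline, such scores would come from a model that measures how closely a tweet's language resembles each group's typical usage.

```python
import pandas as pd

# Hypothetical probabilities that a tweet's language features resemble
# African American or White English; column names and values are
# assumptions for illustration only.
tweets = pd.DataFrame({
    "text": ["example tweet one", "example tweet two"],
    "p_african_american": [0.81, 0.12],
    "p_white": [0.19, 0.88],
})

# Assign the racial language category with the higher probability.
tweets["race"] = (
    tweets[["p_african_american", "p_white"]]
    .idxmax(axis=1)
    .str.replace("p_", "", regex=False)
)

print(tweets[["text", "race"]])
```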
The data analysis was correlational and thus descriptive. We first described the bivariate relationship between race and gender and then quantified the uncertainty of the measures using bootstrapping. We further investigated how the interaction between race and gender influences the distribution of hateful and abusive labels using logistic regression models. By taking a statistical modeling approach, we estimated the partial effect of the interaction bias on the outcomes while controlling for partisan bias.
Racial bias: The first hypothesis concerns between-group differences. Consistent with prior research, we expect that tweets more closely associated with African American than White English language features would be more likely to be labeled as abusive and hateful.
Intersectional bias: The second hypothesis concerns within-group differences. Informed by the broader social science literature, we argue that tweets more closely associated with African American male language features than those of other groups are more likely to be labeled as hateful.
Figure 1. Descriptive analysis
Figure 1 displays the bivariate relationship between tweets classified by race and by gender.
Figure 2. Bootstrapping results
One limitation of Figure 1 is that it does not show the uncertainty of the measures. Figure 2 addresses this problem by randomly resampling the data 1,000 times with replacement, stratifying on race, gender, and label type (bootstrapping). This figure reaffirms what we found earlier: African American tweets are overwhelmingly more likely to be labeled as abusive than their White counterparts. The opposite pattern appears for the normal label: White tweets are far more likely to be labeled as normal than their African American counterparts. These patterns are statistically significant, as the group differences fall well outside the confidence intervals. Gender matters little in these cases. By contrast, the intersection between race and gender matters in hate speech annotation. African American male tweets are far more likely to be labeled as hateful than the rest of the groups are. African American female tweets are only slightly more likely to be labeled as hateful than their White counterparts are.
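A minimal sketch of this stratified bootstrap is shown below, assuming a DataFrame named tweets with columns race, gender, and label; the column names and the helper function are assumptions, not the authors' code.

```python
import pandas as pd

def stratified_bootstrap(tweets: pd.DataFrame, n_iter: int = 1000,
                         seed: int = 0) -> pd.DataFrame:
    """Resample tweets with replacement within race x gender strata and
    return the label proportions from every bootstrap replicate."""
    results = []
    for i in range(n_iter):
        # Resample each race x gender stratum to its original size.
        resampled = (
            tweets.groupby(["race", "gender"], group_keys=False)
                  .apply(lambda g: g.sample(frac=1.0, replace=True,
                                            random_state=seed + i))
        )
        # Share of each label within each race x gender group.
        shares = (
            resampled.groupby(["race", "gender"])["label"]
                     .value_counts(normalize=True)
                     .rename("proportion")
                     .reset_index()
        )
        shares["iteration"] = i
        results.append(shares)
    return pd.concat(results, ignore_index=True)

# The 2.5th and 97.5th percentiles of "proportion" within each
# race x gender x label cell give a 95% bootstrap confidence interval.
boot = stratified_bootstrap(tweets, n_iter=1000)
ci = (boot.groupby(["race", "gender", "label"])["proportion"]
          .quantile([0.025, 0.975]))
```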
Figure 3. Logistic regression analysis
Figure 3 extends the previous investigation by adding party identification as a control variable. We constructed two logistic regression models. In both models, the dependent variable was the abusive or hateful category coded as a dummy variable (yes = 1, no = 0). The first model did not include party controls; its predictors were race, gender, and their interaction. The second model included party controls; its predictors were race, gender, party identification, the interaction between race and gender, and the interaction between race and party identification. In the figure, the results of the first model are indicated by light blue dots and those of the second model by red dots.
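The sketch below shows how the two specifications could be written with statsmodels formulas; the column names (race, gender, party_id, label) are assumptions, and the formulas mirror the predictor sets described above rather than reproduce the authors' exact code.

```python
import statsmodels.formula.api as smf

# Binary outcome variables (yes = 1, no = 0); column names are assumptions.
tweets["abusive"] = (tweets["label"] == "abusive").astype(int)
tweets["hateful"] = (tweets["label"] == "hateful").astype(int)

# Model 1: race, gender, and their interaction.
model_1 = smf.logit("hateful ~ race * gender", data=tweets).fit()

# Model 2: adds party identification and its interaction with race.
model_2 = smf.logit("hateful ~ race * gender + race * party_id",
                    data=tweets).fit()

print(model_1.summary())
print(model_2.summary())
```

The `race * gender` term expands to the two main effects plus their interaction, so each formula lists every predictor named in the model description above.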