In this vignette, we will look for gender-associated words. We will do this by looking for words that follow the pronoun “she” more often than “he”, or vice versa. The vignette is based on Julia Silge’s blog post, “Gender roles with text mining and n-grams”.
We will use the corpus package, but not rely on any other external dependencies. The set-up is as follows.
library("corpus")
# colors from RColorBrewer::brewer.pal(6, "Set1")
palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33"))
# ensure consistent runs
set.seed(0)
We will be studying gender in the context of Jane Austen’s completed novels, downloaded from Project Gutenberg.
# Gutenberg IDs gotten from gutenbergr::gutenberg_works(author == "Austen, Jane")
(austen <- gutenberg_corpus(c(161, 1342, 141, 158, 121, 105)))
Determining mirror for Project Gutenberg from https://www.gutenberg.org/MIRRORS.ALL
Using mirror http://aleph.gutenberg.org/
Converting text from declared encoding "ISO-8859-1" to UTF-8
title author language text
1 Sense and Sensibility Jane Austen English SENSE AND SENSIBILITY\n\nby Jane Austen\n\n(1811)…
2 Pride and Prejudice Jane Austen English PRIDE AND PREJUDICE\n\nBy Jane Austen\n\n\n\nChap…
3 Mansfield Park Jane Austen English MANSFIELD PARK\n\n(1814)\n\n\nBy Jane Austen\n\n…
4 Emma Jane Austen English EMMA\n\nBy Jane Austen\n\n\n\n\nVOLUME I\n\n\n\nC…
5 Northanger Abbey Jane Austen English NORTHANGER ABBEY\n\n\nby\n\nJane Austen (1803)\n…
6 Persuasion Jane Austen English Persuasion\n\n\nby\n\nJane Austen\n\n(1818)\n\n\n…
For tokens, we will be using the lower-cased non-punctuation words:
text_filter(austen)$drop_punct <- TRUE
Specifically, we will look at the bigrams from these novels where the first type is a pronoun.
pronouns <- c("he", "she")
bigram_counts <- term_stats(austen, ngrams = 2, types = TRUE,
subset = type1 %in% pronouns)
print(bigram_counts)
term type1 type2 count support
1 she had she had 1472 6
2 she was she was 1377 6
3 he had he had 1023 6
4 he was he was 889 6
5 she could she could 817 6
6 he is he is 399 6
7 she would she would 383 6
8 she is she is 330 6
9 he could he could 307 6
10 he would he would 264 6
11 she did she did 227 6
12 he has he has 222 6
13 he did he did 195 6
14 she felt she felt 189 6
15 she must she must 184 6
16 she might she might 173 6
17 he should he should 158 6
18 she has she has 149 6
19 he must he must 143 6
20 he might he might 137 6
⋮ (1573 rows total)
Above, we can see the most common bigrams with gendered pronouns as their first types.
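To make the shape of this computation concrete, here is a minimal base-R sketch that counts pronoun-led bigrams in a made-up sentence. It does not use the corpus package. The crude tokenizer (lower-casing, then splitting on anything that is not a letter or apostrophe) is an assumption for illustration only, and unlike a real tokenizer it lets bigrams straddle sentence boundaries.

```r
# Toy text, invented for illustration (not taken from the novels)
toy_text <- "She was happy. He was not, and she had known it."

# Crude tokenizer: lower-case, then split on runs of non-letter characters
words <- strsplit(tolower(toy_text), "[^a-z']+")[[1]]

# Pair each word with its successor to form bigrams
type1 <- head(words, -1)
type2 <- tail(words, -1)

# Keep only the bigrams whose first word is a gendered pronoun
keep <- type1 %in% c("he", "she")
toy_counts <- table(paste(type1[keep], type2[keep]))
print(toy_counts)
```

Each of the three pronoun-led bigrams in the toy sentence occurs once; on the real corpus the same idea yields the table above.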
Next, we rearrange the data into tabular form, with one row for each term and one column for each pronoun:
terms <- with(bigram_counts,
tapply(count, list(type2, type1), identity, default = 0))
head(terms)
he she
_did_ 4 1
_had 1 0
_had_ 2 4
_is_ 0 1
_may_ 1 0
_meant_ 0 1
It looks like some terms get emphasized in the text by underscores. We can investigate some of these:
# inspect a random sample of the occurrences
text_sample(austen, c("_did_", "_had", "_meant_"))
text before instance after
1 3 …rprise to herself. And the\nnext day _did_ bring a surprise to her. Henry had sa…
2 2 …ome,” cried Elizabeth; “perhaps she\n _meant_ well, but, under such a misfortune as…
3 2 …ad been brought on by himself. If he _had another_\nmotive, I am sure it would …
4 2 … would never come back again. People _did_ say\nyou meant to quit the place enti…
5 3 … he\nis Henry, for a time.”\n\nJulia _did_ suffer, however, though Mrs. Grant di…
6 2 …red her--indeed I rather believe he\n _did_ --I heard something about it--but I ha…
7 2 …humour encouraged, yet, whenever she _did_ speak, she must be vulgar.\nNor was h…
8 3 …he others prepared to begin.\n\nThey _did_ begin; and being too much engaged in …
9 4 …in her own mind, determined that he _did_ know what he was talking\nabout, and …
10 2 …the\nfirst, you may remember.”\n\n“I _did_ hear, too, that there was a time, whe…
11 2 …ced with her twice. To be\nsure that _did_ seem as if he admired her--indeed I r…
12 3 …found, from Edmund’s manner, that he _did_ mean to\ngo with her. He too was taki…
13 2 …y evident whenever they met, that he _did_ admire her and\nto _her_ it was equal…
14 3 …e\nheard of beyond themselves. Julia _did_ seem inclined to admit that\nMaria’s …
Let’s clean the terms by removing underscores, and recompute the counts:
terms <- with(bigram_counts,
tapply(count, list(gsub("_", "", type2), type1),
sum, default = 0))
head(terms)
he she
a 5 6
able 0 1
abominates 1 0
absented 0 1
absolutely 3 1
abstained 0 1
Here, we changed the aggregation function from identity() to sum() to sum the counts over all types that are equal after removing underscores.
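The effect of switching the aggregation function can be seen on a small example. The sketch below uses a handful of rows whose counts are copied from the output above; the data frame toy and the other variable names are hypothetical, and because most rows are left out, any zero in the result is an artefact of the toy data rather than a statement about the novels.

```r
# Toy subset of the bigram counts (most rows omitted on purpose)
toy <- data.frame(
    type1 = c("he", "he", "she", "she", "she"),
    type2 = c("did", "_did_", "did", "_did_", "felt"),
    count = c(195, 4, 227, 1, 189)
)

# Strip the emphasis underscores so "_did_" and "did" share a label
clean <- gsub("_", "", toy$type2)

# With identity(), cells holding two counts cannot be simplified,
# so tapply() falls back to a list-mode array
as_list <- tapply(toy$count, list(clean, toy$type1), identity)
print(is.list(as_list))

# With sum(), each cell collapses to a single total
toy_terms <- tapply(toy$count, list(clean, toy$type1), sum, default = 0)
print(toy_terms)
```

After cleaning, the emphasized and plain forms of “did” are counted together, which is what we want for the analysis that follows.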
We want to find terms that are associated with particular gendered pronouns. As an example, take the term “loves”:
term <- "loves"
i <- match(term, rownames(terms))
tab <- cbind(terms[i,], colSums(terms[-i,]))
colnames(tab) <- c(term, paste0("\u00ac", term))
print(tab)
loves ¬loves
he 5 7287
she 7 10130
When the first type in a bigram is “he”, it is followed by “loves” in 5 instances and another type in 7287 instances. Here are the relative rates of “loves”, conditional on the first type in the bigram:
(rates <- tab[,"loves"] / rowSums(tab))
he she
0.0006856829 0.0006905396
The ratio of these rates is very close to one:
rates[["she"]] / rates[["he"]]
[1] 1.007083
The log (base-2) ratio of the rates is close to zero:
log2(rates[["she"]]) - log2(rates[["he"]])
[1] 0.01018254
This log ratio is a good measure of the strength of association. When the value is close to zero, there is no meaningful association. For large absolute log ratios, we would say that there is a meaningful difference in usage between the genders.
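As a rough guide to reading these values, the sketch below wraps the calculation in a small helper and evaluates it in three cases. The function log2_rate_ratio() is hypothetical (it is not part of the corpus package), and the second and third cases use made-up counts.

```r
# Hypothetical helper: log2 of (rate after "she") / (rate after "he")
log2_rate_ratio <- function(n_she, n_he, tot_she, tot_he) {
    log2(n_she / tot_she) - log2(n_he / tot_he)
}

# "loves", using the counts in the table above
tot_he <- 5 + 7287
tot_she <- 7 + 10130
loves <- log2_rate_ratio(7, 5, tot_she, tot_he)

# Made-up term used at twice the rate after "she": the score is 1
doubled <- log2_rate_ratio(20, 10, 1000, 1000)

# Swapping the two pronouns flips the sign
halved <- log2_rate_ratio(10, 20, 1000, 1000)

print(c(loves = loves, doubled = doubled, halved = halved))
```

Each unit on this scale corresponds to a doubling of the rate, and the scale is symmetric around zero, which makes “he”-slanted and “she”-slanted terms directly comparable.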
We compute the log ratios for all types. To avoid dividing by zero, and to smooth the estimates when the counts are small, we add 0.5 to the count when estimating the rate. (Many people smooth by adding 1, but I prefer 0.5 since that value has stronger theoretical justification.)
Some terms only appear once in the corpus. The rate estimates for these terms will be unreliable, so we discard them.
tot <- colSums(terms)
common_terms <- terms[rowSums(terms) > 1,]
he <- (common_terms[,"he"] + 0.5) / (tot[["he"]] + 1)
she <- (common_terms[,"she"] + 0.5) / (tot[["she"]] + 1)
log2_ratio <- log2(she) - log2(he)
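To see why the smoothing matters, consider a term that never follows one of the pronouns. The sketch below uses the pronoun totals implied by the “loves” table above and a hypothetical term with made-up counts.

```r
# Pronoun totals implied by the "loves" table above
tot_he <- 5 + 7287
tot_she <- 7 + 10130

# Hypothetical term: 6 uses after "she", none after "he"
n_he <- 0
n_she <- 6

# Unsmoothed: the zero count sends the log ratio to infinity
raw <- log2(n_she / tot_she) - log2(n_he / tot_he)

# Smoothed: add 0.5 to each count and 1 to each total
smoothed <- log2((n_she + 0.5) / (tot_she + 1)) -
    log2((n_he + 0.5) / (tot_he + 1))

print(c(raw = raw, smoothed = smoothed))
```

The unsmoothed value is infinite, while the smoothed value is large but finite, so the term can still be ranked alongside the others.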
Here are a histogram and normal probability plot of the estimates for the remaining terms.
par(mfrow = c(1, 2))
hist(log2_ratio, breaks = "Scott", col = 2, border = "white")
qqnorm(log2_ratio, col = 2)
qqline(log2_ratio, col = 1, lty = 2)
This doesn’t quite look like a normal, but the distribution is symmetric, and the tails are about as heavy as those of a normal. So, the z-score will be a reasonable measure of how typical or unusual each log rate is.
z <- (log2_ratio - mean(log2_ratio)) / sd(log2_ratio)
Here are the words more than two standard deviations away from the mean log ratio:
sort(log2_ratio[abs(z) > 2])
gravely smiling no indeed turning addressing believes bowed
-4.382079 -4.382079 -3.934620 -3.645114 -3.645114 -3.282544 -3.282544 -3.282544
intends next perhaps quite sought won't beheld lay
-3.282544 -3.282544 -3.282544 -3.282544 -3.282544 -3.282544 3.225251 3.225251
looks refused yet checked dreaded eagerly seized hurried
3.225251 3.225251 3.225251 3.431702 3.431702 3.431702 3.431702 3.612274
remembered
4.569205
The terms “remembered”, “hurried”, and “seized” have more “she” usages; the terms “gravely”, “believes”, and “bowed” have more “he” usages.
It’s hard to know which of these differences are meaningful without quantifying the error associated with the estimates. Some words are common, and we have reliable estimates of the log ratios. Other words are rare, and the estimates are based on a small number of occurrences. In the rare case, the estimates of the log ratios will be unreliable.
We need standard errors of our estimates. We can get these by starting with the proportion estimates he and she, and then applying the delta method.
The he and she rates are proportions. We get their standard errors by using the usual formula for the standard error of a proportion, based on the Binomial variance formula:
he_se <- sqrt(he * (1 - he) / tot[["he"]])
she_se <- sqrt(she * (1 - she) / tot[["she"]])
To find the standard errors for the logarithms of these quantities, we use the delta method. We multiply the standard error by the absolute value of the derivative of the logarithm function evaluated at the estimate:
log2_he_se <- abs(1 / (log(2) * he)) * he_se
log2_she_se <- abs(1 / (log(2) * she)) * she_se
These formulas follow from using a Taylor expansion of log2 around the estimate, using that the derivative of log2(x) is 1 / (log(2) * x).
Finally, for the standard error of log2_ratio, we assume that the log2_he and log2_she estimates are approximately independent, so that the variance of their difference is the sum of their variances. This gives a formula for the standard error of the log ratio:
log2_ratio_se <- sqrt(log2_he_se^2 + log2_she_se^2)
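One way to sanity-check this formula is by simulation. The sketch below is not part of the original analysis: it assumes two independent Binomial counts with illustrative “true” rates and pronoun totals, applies the same smoothed estimator, and compares the empirical spread of the log ratio with the delta-method value. All names and parameter values are assumptions chosen for the example.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Assumed "true" rates and pronoun totals (illustrative values)
p_he <- 0.005;  n_he <- 7292
p_she <- 0.019; n_she <- 10137

# Delta-method standard error of the log2 ratio at the true rates
se_log2 <- function(p, n) sqrt(p * (1 - p) / n) / (log(2) * p)
delta_se <- sqrt(se_log2(p_he, n_he)^2 + se_log2(p_she, n_she)^2)

# Simulate independent counts and recompute the smoothed estimate
reps <- 20000
x_he <- rbinom(reps, n_he, p_he)
x_she <- rbinom(reps, n_she, p_she)
sim <- log2((x_she + 0.5) / (n_she + 1)) - log2((x_he + 0.5) / (n_he + 1))

# The empirical spread should be close to the delta-method value
print(c(delta = delta_se, simulated = sd(sim)))
```

For counts of this size the two numbers should agree closely. The approximation can be expected to degrade for terms with only a handful of occurrences, which is one more reason to treat rare terms with caution.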
Here are all of the estimated log ratios, along with error bars for those that are statistically significantly different from zero.
# put the estimates in increasing order
r <- rank(log2_ratio, ties.method = "first")
# find the statistically significant cases
signif <- (abs(log2_ratio) / log2_ratio_se > 2)
i <- signif
# set up the plot
xlim <- range(r)
ylim <- range(log2_ratio,
(log2_ratio - log2_ratio_se)[i],
(log2_ratio + log2_ratio_se)[i])
plot(xlim, ylim, type = "n", xlab = "Rank",
ylab = expression(paste(Log[2], " (She / He) Usage Rate")))
# horizontal line at zero
abline(h = 0, col = "gray", lty = 2)
# standard error around interesting points
segments(r[i], (log2_ratio - log2_ratio_se)[i],
r[i], (log2_ratio + log2_ratio_se)[i])
segments((r - 0.4)[i], (log2_ratio - log2_ratio_se)[i],
(r + 0.4)[i], (log2_ratio - log2_ratio_se)[i])
segments((r - 0.4)[i], (log2_ratio + log2_ratio_se)[i],
(r + 0.4)[i], (log2_ratio + log2_ratio_se)[i])
# points at the estimates
points(r, log2_ratio, col = 2, cex = 0.5)
Notice that many of the more extreme estimates are not statistically significant. One word with a large estimate not deemed statistically significant is the word “expressed”. Here are the counts for that word, along with the log ratio estimate and standard error:
print(terms["expressed",])
he she
9 5
print(log2_ratio[["expressed"]])
[1] -1.263685
print(log2_ratio_se[["expressed"]])
[1] 0.7727217
This is a word with a large imbalance, but only in a relatively small number of samples. Due to the lack of data for the word “expressed”, we deem this imbalance to not be statistically significant. Compare this to the word “felt”:
print(terms["felt",])
he she
36 189
print(log2_ratio[["felt"]])
[1] 1.901041
print(log2_ratio_se[["felt"]])
[1] 0.2598566
This has an effect size estimate on the same order as “expressed”, but the estimate is much more reliable given the larger number of appearances of the word “felt”. We judge “felt” to have both a practically significant difference and a statistically significant difference.
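The numbers for “felt” can be reproduced by hand from the counts, which makes a useful check on the formulas. The sketch below takes the counts from the output above and the pronoun totals from the “loves” table; the variable names are chosen for this example only.

```r
# Counts for "felt" and pronoun totals, taken from the output above
n_he <- 36;   tot_he <- 5 + 7287
n_she <- 189; tot_she <- 7 + 10130

# Smoothed rate estimates
he <- (n_he + 0.5) / (tot_he + 1)
she <- (n_she + 0.5) / (tot_she + 1)

# Log ratio and its delta-method standard error
est <- log2(she) - log2(he)
se <- sqrt((sqrt(he * (1 - he) / tot_he) / (log(2) * he))^2 +
           (sqrt(she * (1 - she) / tot_she) / (log(2) * she))^2)

print(c(estimate = est, se = se, z = est / se))
```

The estimate and standard error should match the values printed above, and the estimate sits many standard errors away from zero.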
The word “was” exhibits one other extreme:
print(terms["was",])
he she
891 1378
print(log2_ratio[["was"]])
[1] 0.1536038
print(log2_ratio_se[["was"]])
[1] 0.0579161
This word has only a minor difference between “he” and “she”, but it shows up often enough for us to get a reliable estimate of the difference. The word “was” has a statistically significant difference between the genders, but probably not a practically significant difference.
In the context of our application, a difference is practically significant if the absolute value of its estimate is large. It is statistically significant if the estimate is large relative to its standard error. If we only care about Austen’s writing in these six novels, we may not care about statistical significance. If, however, we want to make a statement about Austen’s writing style in general, then we want to generalize beyond our data set, and we should only do so for words we can get reliable estimates for.
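The two notions can be put side by side for the three words discussed above. In the sketch below, the estimates and standard errors are copied from the printed output. The two-standard-error rule is the one used throughout this vignette, and the cutoff of 1 for practical significance (a two-fold difference in rate) is the one the plotting code later in this vignette uses; treat it as a convention rather than a fixed rule.

```r
# Estimates and standard errors printed above
est <- c(expressed = -1.263685, felt = 1.901041, was = 0.1536038)
se  <- c(expressed = 0.7727217, felt = 0.2598566, was = 0.0579161)

verdict <- data.frame(
    estimate    = est,
    z           = round(est / se, 2),
    statistical = abs(est) / se > 2,  # more than two standard errors from zero
    practical   = abs(est) > 1        # at least a two-fold difference in rate
)
print(verdict)
```

Only “felt” passes both tests: “expressed” is large but imprecise, and “was” is precise but small.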
Here are the estimates for the words with statistically significant gender imbalances, defined as having an estimated log ratio more than two standard errors away from zero.
# order by log ratio, then name
o <- order(log2_ratio, names(log2_ratio))
est <- log2_ratio[o]
se <- log2_ratio_se[o]
# only keep statistically significant cases
signif <- (abs(est) / se > 2)
est <- est[signif]
se <- se[signif]
# find practically significant cases
pract <- abs(est) > 1
# get the terms
term <- names(est)
i <- seq_along(term)
# set up the plot
par(mar = c(5, 6, 2, 2) + 0.1)
xlim <- range(est - se, est + se)
ylim <- range(i)
plot(xlim, ylim, type = "n",
xlab = "Relative appearance after 'she' compared to 'he'",
ylab = "", axes = FALSE)
# x-axis
ticks <- seq(-3, 6)
labels <- paste0(2 ^ ticks, "x")
labels[ticks == 0] <- "Same"
axis(1, at = ticks, labels = labels)
abline(v = ticks, col = "gray", lwd = 0.5)
# y-axis, with term labels
axis(2, at = i, labels = term, las = 1)
abline(h = i, lty = 3, col = "gray")
# frame the plot
box()
# standard errors
col <- ifelse(est > 0, 3, 4)
segments(est - se, i, est + se, i, lwd = 1.5, col = col)
segments(est - se, i - 0.2, est - se, i + 0.2, lwd = 1.5, col = col)
segments(est + se, i - 0.2, est + se, i + 0.2, lwd = 1.5, col = col)
# estimates
points(est, i, pch = 16, cex = 0.75, col = col)
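The x-axis of this plot is labelled in fold changes rather than raw log ratios. A minimal sketch of that conversion follows; fold_label() is a hypothetical helper written for this example, and the rounding to two decimals is an arbitrary choice.

```r
# Hypothetical helper: turn a log2 ratio into a fold-change label
fold_label <- function(x) {
    ifelse(x == 0, "Same", paste0(round(2 ^ x, 2), "x"))
}

# Tick positions map to clean powers of two
print(fold_label(c(-1, 0, 1, 2)))

# Estimates for "was" and "felt" from earlier in the vignette
print(fold_label(c(was = 0.1536038, felt = 1.901041)))
```

Read this way, “felt” follows “she” at a bit under four times the rate at which it follows “he”, while “was” differs by only about ten percent.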
Not all of these are practically significant. Words like “should”, “will”, and “was” show up because they are common, and it is possible to get precise estimates of their gender imbalances. Still, we can see many examples of words that have both statistically and practically meaningful gender imbalances, especially “she”-slanted words like “remembered”, “read”, “resolved”, and “felt”.
In Silge’s original analysis, she used an ad-hoc filter of statistical significance, considering only words that appeared more than 10 times in the corpus. Her filter does a reasonable job of removing spurious results. Indeed, her list of the most gender-associated words mostly agrees with those reported here. There are, however, some notable differences. Silge’s list of “he” words does not include “begged”, but it does include “stopt”, “shook”, “expressed”, and “wants”. Similarly, Silge’s list of top “she” words includes some words that are excluded here: “longed”, “instantly”, “expected”, “ran” and “caught”.
I suspect that many of the words here have gender imbalances simply because the protagonists are female. The novels give us insights into the protagonists’ emotional states, but not into those of the supporting characters. It would be nice if we could get a more balanced comparison by adjusting for the genders of the protagonists. Until we do that, it’s hard to know if there’s anything meaningful in these differences.