The Evolution of Gendered Language

Google does a lot of cool things. One project of theirs that deserves recognition is Google Books, a corpus of millions of scanned books, or about 4% of all books ever published (1,2). Furthermore, they have an algorithm that tags each word in each book with its part of speech (noun, verb, adjective, adverb, number, etc.). That’s a lot of data, and since metadata is all the rage, it’s a veritable playground for people who like crunching numbers.

Their Ngram Viewer shows the frequency of a word or short phrase over time, from 1800 to 2008*, as a percentage of total word usage in the books that they scanned. Ngram is presented as a way to study cultural phenomena, and the team behind it has published the results of its linguistic studies in Science (1).
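To make "a percentage of total word usage" concrete, here is a minimal sketch in Python of how such a relative frequency could be computed per year. The `relative_frequency` helper and the `toy_corpus` sentences are made up for illustration; this is not Google's code, and the real corpus is vastly larger and tokenized differently. Matching here is case-sensitive, which I believe is also the viewer's default.

```python
from collections import Counter


def relative_frequency(word, tokens_by_year):
    """Return {year: percentage of that year's tokens equal to `word`}."""
    result = {}
    for year, tokens in tokens_by_year.items():
        counts = Counter(tokens)      # occurrences of every token in that year
        total = sum(counts.values())  # total word count for that year
        result[year] = 100.0 * counts[word] / total if total else 0.0
    return result


# Made-up "books" for two years (illustrative only, not real data).
toy_corpus = {
    1965: "he said that he would go and she agreed".split(),
    2008: "she said that she would go and he agreed".split(),
}

freqs = relative_frequency("she", toy_corpus)
for year in sorted(freqs):
    print(year, round(freqs[year], 2))
```

Dividing by each year's total is what makes years comparable: far more books were printed in 2008 than in 1801, so raw counts alone would mostly show the growth of publishing.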

So the first thing I did was to look up words that appeared or gained popular usage after 1800. For instance, “google” and “DNA” flatline until 1996 and 1949, respectively, and then jump up. The next thing I looked at was words whose frequency I expected to stay relatively constant. Articles such as “a” and “the” fit this bill. I expected pronouns such as “he” and “she” to remain flat as well. But they don’t (the y-axis is the frequency of the word in the corpus; the x-axis is the year):

Figure 1: Ngram of the words, “he” and “she” over time. The y-axis is the frequency of those words in printed media. The x-axis is the year of publication.

This caught my eye, and I realized that variations in the use of gendered pronouns might reflect some interesting historical phenomena, such as wars or social movements (Ngram bills itself as a historical linguistic tool, so those sorts of trends might be noticeable).

The most visible feature is that the word “she” hit an all-time low in print in 1965. It was used less than half as frequently in 1965 as in 2008. Furthermore, the frequency of “she” in 1965 was lower than at any other point in the graph; the next-lowest point is, unsurprisingly, 1801. “He”, on the other hand, is much more common, but shows a general downward trend from 1825 to 2000.

I wanted to look at the causes of the dip in the use of “she”. A zoomed-in chart is shown below for clarity:

Figure 2: Ngram of the word “she” from 1800 to 2008.

At first I thought that, since Ngram looks at books, maybe a style guide was to blame. I remember reading a peculiar passage in Strunk & White’s The Elements of Style that dealt with gender. (With the exception of the two pages that I quote for this article, which may be considered sexist, I think it is a clear and useful style guide.) E.B. White, the same author who wrote children’s classics like Charlotte’s Web, The Trumpet of the Swan, and Stuart Little, thinks that when looking for a general pronoun, “he” is correct (emphasis is mine; in later editions, this advice was removed) (3):

The use of he as a pronoun for nouns embracing both genders is a simple, practical convention rooted in the beginnings of the English Language.  He has lost all suggestion of maleness in these circumstances. The word was unquestionably biased to begin with (the dominant male), but after hundreds of years it has become seemingly indispensable. It has no pejorative connotations, it is never incorrect. Substituting he or she in place of it is the logical thing to do if it works. But it often doesn’t work, if only repetition makes it sound boring or silly. […] No one need fear to use he if common sense supports it. The furor recently raised about he would be more impressive if there were a handy substitute for the word. Unfortunately there isn’t–or, at least, no one has come up with one yet. If you think she is a handy substitute for he, try it and see what happens. Alternatively put all controversial nouns in the plural and avoid the choice of sex altogether, and you may find your prose sounding general and diffuse as a result.

The first edition of The Elements of Style was published in 1959, which was certainly a different time in our cultural journey, but already well into the downtrend in the use of “she”. It is unclear what the “furor recently raised about he” specifically refers to, but it might be the second wave of the feminist movement, which gained a lot of steam in 1960, slightly after publication. This movement could certainly have stemmed the downward trend in the use of “she” and led to its steady rise starting in 1966. Another trend that might be a result of second-wave feminism is the adoption of the phrases “he or she” and (the less popular) “she or he”. Both phrases see almost constant (zero) use from 1800 to 1970, after which they explode, peaking in 1997:

Figure 3: Ngram of the phrases, “he or she”, and “she or he”

For clarity, a rescaled plot of “she or he” is below; it shows an almost identical trend to “he or she”:

Figure 4: Ngram of the phrase, “she or he”. It is very similar in shape to “he or she”, but is used less frequently in printed media.

I also tried to look at gender-neutral pronouns, such as “ze” and “one”, but “ze” is used so infrequently in published books that its curve is very noisy, and “one” is used as both a pronoun and a number, so one cannot tell how much of its use is (non-)gendered language.

Unfortunately, as a chemist by training, I don’t have the expertise to know which social events to look into to really dissect the trends in a rigorous manner (or at least I don’t have the time to investigate it). I pose it as a challenge to the interested reader to look into this, and let me know if you find anything interesting.

*A smoothing of 3 was used on the Ngram Viewer. Basically, smoothing replaces each year’s value with an average of that year and the years around it (with a smoothing of 3, the three years on either side), which reduces how noisy the curve looks and helps with visualizing trends. It does slightly alter the plotted numbers, and therefore the apparent frequencies, of word occurrences.
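As I understand the viewer's documentation, this smoothing is a centered moving average. Below is a minimal sketch in Python of that idea, not the viewer's actual code. The `smooth` function and the `raw` numbers are made up for illustration, and the edge handling (averaging over however many years are available) is my assumption about how the ends of the plot are treated.

```python
def smooth(values, smoothing=3):
    """Centered moving average over `smoothing` points on either side.

    Near the ends of the series the window is clipped, so fewer values
    are averaged there (an assumption about the viewer's edge handling).
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)                # first index in the window
        hi = min(len(values), i + smoothing + 1)  # one past the last index
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# A single made-up spike in an otherwise flat series.
raw = [0.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0]
print(smooth(raw, smoothing=3))
```

The spike gets spread over its neighbors, which is why a smoothed curve is easier to read but no longer shows the exact yearly frequencies. A smoothing of 0 returns the raw data unchanged.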

References:

  1. Michel et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science. 2010. 331, 176-182. 
  2. Lin et al. Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. 2012. 169-174.
  3. Strunk, William, Jr., and White, E.B. The Elements of Style. 3rd Ed. New York: MacMillan. 1979. Print.

Feature image courtesy https://fakecaptcha.com/