Mapping the Demographics of American English with Twitter


[This is a guest post by David Bamman.]

It took me a while to really make sense of Twitter. For the longest time, it was (to me) the stomping ground of 14-year-olds and Ashton Kutcher, each issuing a minute-by-minute feed of their lives. Around the time Twitter arrived, however, I had just had a breakthrough on YouTube's enormous popularity – it was only after watching a dozen different videos of the Super Mario Brothers theme song performed a dozen different ways that I finally got it: I may not care about cats playing the keyboard or wedding parties dancing down the aisle, but somebody does, and without a distribution system for people to broadcast whatever their hearts felt like, I never would have had my life improved by that kid with the beatboxing flute or the one with the double guitar.

So I waited for a similar breakthrough with Twitter. It came, at long last, after I realized that it was exactly what I first thought it was: 14-year-olds (and Ashton Kutcher) chronicling the minutiae of their lives. It is colloquial language, constrained by 140 characters: everyday conversations about waiting in line at the grocery store, your flight just landing at ORD, what to do this Saturday night, "omg did u see hr dress?" In spurts it is, of course, much more than that, as its use during the protests of the 2009 Iranian election proved, but in its unmarked use, it's the language of how millions of people across the world talk to their friends.

To say Twitter is colloquial is putting it lightly. "Brother," for example, occurs in Twitter data during the week of May 10-17, 2010 with an average frequency of once every 7,338 words, not too distant from its frequency in the closest comparable corpus, the Corpus of Contemporary American English, or COCA (once every 9,405 words). The difference for "bro," however, is much more dramatic: in the Twitter data during that same period, it occurs once every 5,833 words (more frequently, in fact, than "brother"), while in COCA it occurs once every 757,575 words – two orders of magnitude less frequently.
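To make the arithmetic concrete, here is a minimal sketch that turns the "once every N words" figures quoted above into Twitter-to-COCA ratios. The function names are only illustrative; the four numbers are the ones given in the text.

```python
import math

# "Once every N words" figures quoted in the post.
WORDS_PER_OCCURRENCE = {
    "brother": {"twitter": 7_338, "coca": 9_405},
    "bro": {"twitter": 5_833, "coca": 757_575},
}


def per_million(words_per_occurrence: int) -> float:
    """Convert 'once every N words' into occurrences per million words."""
    return 1_000_000 / words_per_occurrence


def twitter_to_coca_ratio(word: str) -> float:
    """How many times more frequent the word is on Twitter than in COCA."""
    entry = WORDS_PER_OCCURRENCE[word]
    return per_million(entry["twitter"]) / per_million(entry["coca"])


for word in WORDS_PER_OCCURRENCE:
    ratio = twitter_to_coca_ratio(word)
    magnitude = math.log10(ratio)
    print(f"{word}: {ratio:.1f}x more frequent on Twitter (10^{magnitude:.1f})")
```

For "brother" the ratio is a little under 1.3, while for "bro" it is roughly 130, which is where the "two orders of magnitude" comes from.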

In April 2010, Twitter had approximately 106M registered users. The volume of data that flows through the Twitter pipe dwarfs any other publicly available linguistic corpus in existence (except the web itself), and unlike fixed corpora, it still flows. Such a huge dataset has proven itself to be a fertile resource for a number of natural language processing tasks (such as trend detection and sentiment analysis), but its value as a collection of colloquial language begs to be used for lexicography as well: if the purpose of a dictionary is to record actual usage, then Twitter data allows us to broaden the scope of our corpus beyond newswire, literary works and other forms of privileged publication and include the unedited language of everyday folks as well.

Inducing language demographics

In addition to letting us capture the colloquial language of over one hundred million people, Twitter also provides us with a rich data source for inducing the demographics of that language community.

The data that Twitter releases as part of its public datastream includes a number of features for each "tweet." In addition to the content of the message itself, each tweet is accompanied by the creator's username, display information such as the URL of the profile picture and background color, follower count, and a wealth of other metadata, including a timestamp for its creation and user-defined geographical information. For me (@dbamman), this is "Boston, MA," but a user can write anything ("New York," "NYC," "in the intertubes") or choose to have their location tagged with precise latitude and longitude coordinates (though only a small fraction actually do so). The data source is ostensibly so rich in order to enable third-party development (e.g., location-enabled iPhone apps), but all of this information contains valuable demographic indicators for plotting language use across time, space and different populations.

Geography

The geographic information embedded in each tweet allows us to map language use across the US and, like the Dictionary of American Regional English, report on the nuances of language that are characteristic of certain communities.

The user-defined geographic information is noisy data: while "Boston, MA" can be automatically disambiguated relatively easily to a physical location on the earth (corresponding to coordinates 42.35843, -71.05977), others ("Springfield") are more difficult (there are many Springfields); others still are nearly impossible ("home of that boy Biggie," a reference to New York City quoted from Jay-Z's "Empire State of Mind"), and some ("in ur fridge eatin ur foodz") don't map to any space in physical reality. The sheer volume of data, however, gives us the flexibility to favor precision over coverage – we can throw away all tweets where we aren't over 99% sure of the physical location.
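The filtering step might look something like the sketch below. It is an illustration only, not the code behind the project: the gazetteer, its confidence scores, and the function names are all invented for the example, and a real disambiguator would be far more involved.

```python
from typing import NamedTuple, Optional


class Resolved(NamedTuple):
    state: str
    confidence: float  # estimated probability that the guess is right


# Hypothetical gazetteer with made-up confidence scores.
GAZETTEER = {
    "boston, ma": Resolved("MA", 0.999),
    "nyc": Resolved("NY", 0.995),
    "springfield": Resolved("IL", 0.30),  # many Springfields, so low confidence
}


def resolve(location: str, threshold: float = 0.99) -> Optional[str]:
    """Return a state only when we are more than `threshold` sure of it."""
    guess = GAZETTEER.get(location.strip().lower())
    if guess is None or guess.confidence <= threshold:
        return None
    return guess.state


def filter_tweets(tweets, threshold: float = 0.99):
    """Keep (state, text) pairs for tweets whose location resolves confidently."""
    kept = []
    for location, text in tweets:
        state = resolve(location, threshold)
        if state is not None:
            kept.append((state, text))
    return kept


sample = [
    ("Boston, MA", "go sox"),
    ("Springfield", "donuts"),
    ("in ur fridge eatin ur foodz", "lol"),
]
print(filter_tweets(sample))
```

Only the "Boston, MA" tweet survives here; the ambiguous and the imaginary locations are simply discarded, which is affordable when the stream is this large.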

With this disambiguated data, we can map the usage of words and phrases across the US by normalizing each word's count by the volume of total data coming out of each state (to avoid biasing the statistics toward populous states such as New York and California). Comparing these resulting ratios allows us to get a demographic picture of word use across the US. Here "grand canyon" is visualized on a map using the Google Charts API (lighter blue represents more characteristic usage).

Figure 1: Demographics of "grand canyon"

A sanity check reveals that, yes, it is in Massachusetts that the Red Sox are most characteristically talked about (this is different from "most talked about," which could well be a populous state like New York). Californians talk characteristically of earthquakes and Wisconsinites of the Green Bay Packers (even in May).
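The normalization described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: tweets arrive as (state, text) pairs, tokenization is a plain lowercase whitespace split, and the function name and sample tweets are invented for the example.

```python
from collections import Counter


def characteristic_usage(tweets, term):
    """Return {state: occurrences of term / total tokens from that state}."""
    term_tokens = term.lower().split()
    n = len(term_tokens)
    hits, totals = Counter(), Counter()
    for state, text in tweets:
        tokens = text.lower().split()
        totals[state] += len(tokens)
        # Count every position where the (possibly multi-word) term occurs.
        hits[state] += sum(
            tokens[i:i + n] == term_tokens for i in range(len(tokens) - n + 1)
        )
    return {state: hits[state] / totals[state] for state in totals if totals[state]}


sample_tweets = [
    ("AZ", "hiking the grand canyon today"),
    ("AZ", "grand canyon sunrise"),
    ("NY", "grand canyon trip someday"),
    ("NY", "dreaming of the grand canyon"),
    ("NY", "stuck on the subway again this morning"),
    ("NY", "coffee then work then more coffee"),
]
ratios = characteristic_usage(sample_tweets, "grand canyon")
for state, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{state}: {ratio:.3f}")
```

In this toy sample both states mention the phrase twice, but Arizona's much smaller volume makes the phrase far more characteristic there, which is the distinction between "most talked about" and "most characteristically talked about."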

The same method that detects prevalent topics in certain areas also lets us detect regionalisms in slang. The southern US is the focal point for words like "bruh" and "ima" (and its orthographic variants i'ma and imma), while "hella" is centered in California and "rad" in the Pacific Northwest. "Wicked" is more characteristic of New England (especially Massachusetts), but less strongly than the others (perhaps because of its polysemy – its meaning as "very" is likely a regionalism but its older sense of "evil" is almost certainly not). This gets to one limitation of this method: statistics are all computed on a token level (the word form), not at the level of individual senses, so the clear regional distinction between "pop" and "soda" that we would love to see gets blurred, since "pop" is used throughout the US not simply as a synonym of "soda" but in other senses as well ("pop music," "pop out," etc.).

Figure 2: Demographics of "bruh"

Age and gender

While Twitter doesn't explicitly ask for or subsequently publish any age or gender data for its users, we can approximate both on a large scale using common demographic indicators such as the user's first name. While some names like "David" have relatively even distributions across birth years (which we can compute using information from the US Social Security Administration), other names are heavily biased toward certain generations. Someone named "Jasmyn," for instance, is far more likely to be a teenager now than someone named "Pearl"; if your name is "Arsenio" and you were born in the US, it's over 99% likely that you're a male born between 1989 and 1991. With this statistical information, we can compute a probability distribution for the entire age range between 12 and 75 and increment the weight count of each word according to this distribution.

With these distributional counts for each word and phrase of interest in the corpus, we can chart out the demographics using the same normalizing technique used for mapping geographical distributions above: for each word or phrase, dividing the observed weight count in each age group by the total volume of tweets for that age (otherwise the statistics would be biased toward heavily tweeting age groups, such as 12-17 year olds) and then comparing those computed ratios.
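As a rough sketch of the two steps just described, the code below turns birth counts for a first name into a probability distribution over age groups, spreads each tweet's word counts across those groups, and then normalizes by each group's weighted volume. Everything specific here is an assumption made for illustration: the age-group boundaries, the birth counts (invented, not Social Security figures), the use of token volume as the denominator, and the function names.

```python
from collections import defaultdict

# Assumed age buckets covering the 12-75 range mentioned above.
AGE_GROUPS = [(12, 17), (18, 24), (25, 34), (35, 49), (50, 64), (65, 75)]

# Invented birth counts per year, for illustration only.
BIRTHS = {
    "jasmyn": {1994: 300, 1996: 400, 1988: 100},
    "pearl": {1940: 500, 1950: 300, 1995: 20},
}


def age_distribution(births_by_year, current_year=2010):
    """Estimate P(age group | first name) from birth counts per year."""
    weights = defaultdict(float)
    for year, count in births_by_year.items():
        age = current_year - year
        for low, high in AGE_GROUPS:
            if low <= age <= high:
                weights[(low, high)] += count
    total = sum(weights.values())
    return {group: w / total for group, w in weights.items()} if total else {}


def weighted_ratios(tweets, births, term):
    """Fractional count of `term` per age group, divided by that group's volume."""
    hits, totals = defaultdict(float), defaultdict(float)
    for first_name, text in tweets:
        dist = age_distribution(births.get(first_name.lower(), {}))
        tokens = text.lower().split()
        for group, p in dist.items():
            totals[group] += p * len(tokens)       # weighted volume
            hits[group] += p * tokens.count(term)  # weighted word count
    return {group: hits[group] / totals[group] for group in totals if totals[group]}


sample = [
    ("Jasmyn", "bruh that test was hard"),
    ("Pearl", "lovely weather for the garden today"),
]
for group, ratio in sorted(weighted_ratios(sample, BIRTHS, "bruh").items()):
    print(group, round(ratio, 3))
```

Users whose names are missing from the table contribute nothing, and a name with an even spread across birth years contributes a little to every group rather than being forced into one.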

A sanity check again reveals that, yes, women aged 12-24 who tweet do love Grey's Anatomy, Gossip Girl and 90210, while men aged 35+ like TV shows such as 24 and Mythbusters. To complete the demographic picture of "bruh" above, we can see that it's used predominantly by males aged 18-24. In contrast, a word like "bro" is used comparatively more frequently by females and by all age groups, while a formal word like "brother" is used with more or less equal geographic and gender distribution across the entire US.

Figure 3: Gender demographics of "bruh"
Figure 4: Age demographics of "bruh"

Inducing age demographics in this way results in a noisier picture than inducing gender and geographic information, for which the indicators are much clearer (it's my intuition, at least, that people over the age of 65 are probably using "bruh" a little less than reported). The sheer volume of information coming from this data source does, however, help to offset this noise – even if some fixed amount of noise is artificially inflating the probabilities of each age group, we can at least begin to see a picture emerging of the major age groups involved.

An evolving dictionary

The goal of the Lexicalist project is to develop a dictionary that depicts, in real time, the changing demographics of English in the United States, a dictionary that supplements the fundamental meaning of a word or phrase with the current cultural backdrop that's informing its use today. My work in the NEH-funded Dynamic Lexicon project has taught me that (for Ancient Greek and Latin at least) the language of a given era is not a homogeneous beast able to be captured in a single volume (or caged in a set of fascicles); it is the language of Caesar plus the language of Vergil and so on. English two thousand years later in the United States is no different: it is the sum of the hundreds of millions of people who use it, often in very different ways. By focusing on the demographics of contemporary usage, my hope is to shine a spotlight on all of those millions of individuals and see American English as the product of their distinct and discernable voices.




26 Comments

  1. Kevin said,

    May 18, 2010 @ 11:46 pm

    Nielsen uses spyware-like software to track what 250,000 internet users do. They found Twitter is disproportionately popular among people over 25. See: http://blog.nielsen.com/nielsenwire/online_mobile/teens-dont-tweet-twitters-growth-not-fueled-by-youth/

  2. John G said,

    May 19, 2010 @ 12:00 am

    Lovely, thanks. The analysis you are doing seems likely to minimize the distortion in usage caused by the need to stay within 140 characters, and the expectations of discourse generally that are created by that need.

  3. Clarissa at Talk to the Clouds said,

    May 19, 2010 @ 12:07 am

    Nifty analysis, geographically, but overall still, I think, an unfair characterization. It really depends on what you do with it–much like the rest of the internet. I use Twitter to connect with language teachers, linguists, journalists, writers, and so on, to share research, news, and ideas. (On another account, I connect with language learners, though there's some spillover.)

  4. YM said,

    May 19, 2010 @ 12:51 am

    What fun! I thought I'd try some Hawaiian Pidgin. Ono ('tasty'), as expected, is mostly in Hawaii, with Idaho (???) a significant second. Pau ('ready'), again, mostly in Hawaii. Middle aged people somewhat more than younger people. No 'kine', as in 'da kine', oddly.

    And the Beatles are big in…Indiana!

  5. David said,

    May 19, 2010 @ 12:53 am

    Funny, 'roflmao' is centered in North Dakota with very little in South Dakota, while 'rofl' is reversed!

    [(myl) I suspect that effects of this sort are sampling error. Remember that word counts are normalized by the state's overall tweet rate. North and South Dakota may have a small enough N that one user who is fond of a particular word can send its apparent geographical frequency soaring. (David Bamman will be able to tell us what the un-normalized statewise counts (of rofl/roflmao and overall words) were for this case.)

    It would probably be prudent to add something to the code, at least optionally, to discount displayed color-shades on the basis of their confidence intervals, or whatever.]

  6. Will said,

    May 19, 2010 @ 3:18 am

    David, I actually tried those exact two words before I even read your comment! I also tried entering many city/state names and it's really cool how accurate the graphs are at pinpointing the locations. I would expect people in a particular place to be talking more about that place than people elsewhere, but I wouldn't have expected as much polarization as these graphs suggest. I also wouldn't have expected the smooth tapering off correlating with distance that many of these locations show.

  7. Daan said,

    May 19, 2010 @ 3:51 am

    Fascinating approach! One question: your database seems to count singular and plural forms of nouns separately. I looked up both 'bro' and 'bros' and got different frequencies as well as gender distribution charts. Is this common in corpus linguistics?

  8. Jarek Weckwerth said,

    May 19, 2010 @ 7:47 am

    Wow. Idaho tops the list for "potato". It works!

  9. Jay Lake said,

    May 19, 2010 @ 7:58 am

    What JimG said above. I'm a genre fiction author and a reasonably heavy Twitter user (I'm in about the 99.8th percentile for follower count) and I work pretty hard to keep my Tweets in plain English. Have never used "UR" for "your" except perhaps ironically. But citing Twitter for instances of "bro" vs "brother" is likely deeply misleading, given the character count restrictions that drive people to use SMS (or l337sp33k) style truncations.

  10. Timothy Martin said,

    May 19, 2010 @ 8:24 am

    Hawaii, Mass, and of course New Jersey talk about the "shore" the most! Awesome.

  11. Jorge said,

    May 19, 2010 @ 10:18 am

    People who talk about "love" the most are female, aged 12-17, in Maine and Indiana.

  12. Jorge said,

    May 19, 2010 @ 10:22 am

    And about "war"? Male, over 65, and overwhelmingly in West Virginia!

  13. Lane said,

    May 19, 2010 @ 10:38 am

    This very entertainingly turns traditional dialect studies on its head – before, you went to the countryside to tease dialectal usages out of old people who hadn't spent a lot of time consuming national media. That wasn't exactly representative, but it did show you where you were most likely to hear regional words like "poke" for "bag".

    This is the opposite; you're surveying young, urban technophiles who tweet. Just as skewed, but in the opposite direction. Fun, but you can't exactly build an accurate atlas of regional English out of it.

  14. jrome said,

    May 19, 2010 @ 11:13 am

    Congratulations! And please, keep up this important work! Since August 2006, when I first signed up for Twitter from my desk at Merriam-Webster (in Springfield, Mass, "Go Sox!") it has been a dream of mine to one day see Twitter usage (in real-time, of course) continually analyzed like this.

    Because people do communicate on Twitter, sometimes more often than they email or would have occasion to speak, and because it is so accessible for research, we must study it, particularly since the language of tweets will evolve. Twitter usage and vocabulary are analogous in my mind to the language used in early telephone calls or telegrams. Before either technology, everyday language was different, not better or worse. 140 characters is certainly a tight space, a small box in which to put large thoughts and feelings, but it is large enough to communicate.

    Biz Stone once told me he had pondered "what would happen if all email was public?" That thought, lucky for us, came to be a fundamental part of Twitter. We can look into modern languages as we could at no other point in our history. And we can do it as it happens. Remarkable. Continually investigate, analyze and report. And don't forget to tweet your results!

  15. Ilana said,

    May 19, 2010 @ 12:05 pm

    Extremely cool! Although I find the idea of inferring age from username somewhat dubious. What do you do with names like "David" which have no clear demographic? What do you do with names that aren't clearly gendered? What do you do with blog posts and tweets that have only pseudonyms attached? I'd think that analyzing only input from Madisons and Mabels would leave out a lot of data.

  16. pinboard May 19, 2010 — arghh.net said,

    May 19, 2010 @ 2:11 pm

    […] Language Log » Mapping the Demographics of American English with Twitter Mapping the Demographics of American English with Twitter – this will be a very rich area for academics #twitter […]

  17. ohwilleke said,

    May 19, 2010 @ 4:19 pm

    One way to do brute force analysis for trends that exist but haven't been hypothesized would be to search the entire database for regional trends on every single term, sorting them between "national" and "localized" usages (with terms that are used at similar rates everywhere being national), and culling appearances too infrequent to constitute a statistically significant trend (which would cull lots of typos, for example), and then culling again to remove geographic place names that are proper nouns and official uses (like Grand Canyon) and abbreviations for them.

    Regional usages could then be broken up automatically into a variety of categories with an additional "other region" category to catch unconventional groupings, and the regional results could be ranked by the degree to which they differ from national average results.

    After breaking out regional usages, one could also look for subregion clusters, or cohabiting dialects by looking at the frequency with which common regional terms are used by the same user at a high level of statistical significance. For example, you'd probably find separate clusters of Spanish language and English language tweets in California or Texas.

    This is essentially the same kind of analysis done when seeking hypervariable regions in the human genome for DNA fingerprinting.

    A totally different kind of study worth doing would be to analyze tweets in language v. non-language statistical tests because tweets share with many disputed language v. non-language inscriptions stringent de facto length restrictions making them a better comparison than most control groups for language.

  18. Army1987 said,

    May 20, 2010 @ 11:48 am

    Couldn't the relatively large frequency of words such as "bro" have something to do with the 140-character limit?

  19. Pendejo said,

    May 20, 2010 @ 9:55 pm

    The most amusing result I found in five minutes of noodling with this site is that the state using the word "anal" the most is Utah, by a comfortable margin.

    http://www.lexicalist.com/search.cgi?s=anal

  20. NemaVeze said,

    May 21, 2010 @ 3:57 pm

    Facebook posts are now publicly searchable and often tagged with location and gender. Also, the profile pictures are more commonly pictures of the actual person, if you care to break things down by race/ethnicity ("bruh" seems to be overwhelmingly African American).

    http://www.youropenbook.org lets you search Facebook – it's meant as a privacy advocacy / "reality check" stunt but gives you a different slice of Internet people compared to what you get on Twitter.

  21. Chas Belov said,

    May 22, 2010 @ 4:52 am

    Interesting MYL comment @David re roflmao and the Dakotas. "nada" also turns up with South Dakota as the prime user at 8.4%, nearly twice that of California, while "manana" (mañana with "normalized" spelling) has the expected Southwest presence. (I'm using both as if they are nearly everyday English words here in California, so I would expect them both to show such a profile – yet "nada" doesn't.) And why is Iowa #1 for "burrito"? And Idaho #1 for "taco"?

    [(myl) Suppose during the sampling interval in question, there were just three tweets identified as coming from Iowa, one of which was "Had a Thai Chicken Burrito at Hot Harry's". Then burrito might turn out to have a lexical frequency of 4%, which would be 4-5 orders of magnitude above the background rate (which is less than one per million). But this wouldn't really be a stable fact about the English usage of people in Iowa.

    No doubt this thought experiment underestimates Iowa's tweet volume. And maybe in fact Iowa is in the grip of a Mexican food craze. But still, I suspect that we're looking at sampling error here.]

  22. Who is talking about LOST and where? « A Ruach Journey said,

    May 22, 2010 @ 12:00 pm

    […] which I have subscribed to for a couple of months (great for writers)  The following comes from a guest post by David Bamman on the languagelog blog if you want to read about how and why he is running […]

  23. Raymond Ho said,

    May 26, 2010 @ 4:32 pm

    Interesting post. I have included this post on Four Stone Hearth #93, an anthropology blog carnival that I'm hosting this week. You can check it out at http://theprancingpapio.blogspot.com/2010/05/four-stone-hearth-93.html

  24. The demographics of language via Twitter said,

    June 1, 2010 @ 2:53 pm

    […] is apparently a word people use. In the south. As demonstrated by this graph of its usage in Twitter posts: Via The Morning […]

  25. Mapping the Demographics of American English with Twitter « On Wings of Song said,

    June 25, 2010 @ 5:14 am

    […] link ] Posted in: Language, Social Media ← The Journey of a Thousand Steps Be […]

  26. A week in Linklog said,

    March 5, 2011 @ 11:20 am

    […] Fun with words (remember Zembla?): a comma in the wrong place; the origins of the fanboy; Twitter as a linguistic corpus; a truly remarkable pun on a famous line in […]
