Tax season is fast approaching. If you are like many Americans, you may think that you pay too much in taxes. So maybe you decide to fudge your tax return, and you use a random number generator to make some of the numbers up (taking care, of course, to make sure the total comes out below what you actually owe). You think you did a great job with it, but a few months later the IRS and FBI come knocking at your door and take you away.
“How did you catch me?” you might ask.
“Well,” they tell you, “the numbers were too random. That’s not what real records look like.”
This happens more often than you may think. Company financial records, and many other data sets, follow a useful mathematical oddity known as Benford’s law.
Benford’s law generally states that for a given data series, the probability of finding a number beginning with the digit ‘1’ is greater than the probability of finding a number beginning with the digit ‘2’, and so on. It is a common example of math in the ‘real world’ and is used by the SEC and IRS to find companies and people who fudge their ledgers and taxes.
Specifically, it follows a logarithmic relationship:
$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$
where P(d) is the probability of finding a number whose leading digit is d (d = 1, 2, ..., 9). The law looks like this in graphical form:
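For readers who want the numbers behind that curve, here is a minimal sketch that tabulates the expected share for each leading digit and draws a rough text bar chart. It assumes the standard base-10 form of the law; the names `benford_probability` and `expected` are made up for this illustration.

```python
import math


def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d, for d in 1..9."""
    if d not in range(1, 10):
        raise ValueError("leading digit must be between 1 and 9")
    # Standard base-10 form of Benford's law: P(d) = log10(1 + 1/d)
    return math.log10(1 + 1 / d)


# Expected share for every possible leading digit.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    bar = "#" * round(p * 100)  # one '#' per percentage point
    print(f"{d}  {p:6.1%}  {bar}")
```

The shares fall from about 30.1% for the digit 1 to about 4.6% for the digit 9, and together they add up to 1.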
So taxes and tax returns generally obey Benford’s law. But does Death escape the inevitability of Benford’s law? If the CDC exists for anything, it’s to give us data on how people died. So I went to their website and downloaded all the mortality rates (by age, in single years) of people who died in the US from 1999-2014. (A giant database ranging from mortality rates to births to cancer statistics can be found at http://wonder.cdc.gov/.) The first thing to notice is that with this data set, the digit “0” is meaningful, because during that 16-year period, there were 426,966 children who did not make it to their first birthday. This makes the data set slightly unusual for analysis with Benford’s law. The second thing to notice is that the most common ages at death are not those beginning with “1” (one-year-olds, 10-19-year-olds, and people aged 100 and over), but those beginning with “8” (8-year-olds and octogenarians).
So age-of-death clearly defies Benford’s law. And this makes sense. Most people make it out of childhood these days, and young, healthy people usually don’t die except in accidents or due to reckless behavior. After a while, the body starts to fail and modern medicine can only do so much.
So besides taxes, what else actually obeys Benford’s law? What are the assumptions we have to make about the data set for it to follow this strange phenomenon?
In my last article, I wrote about Google’s ngram viewer, which looks at the frequency of words, phrases, and numbers in books from the time period 1800 to 2008. Google has millions of scanned books in their corpus, so I thought it would be an interesting dataset to analyze with Benford’s law.
If you look up the numbers “one” through “nine” on Google Ngram, you will see that “one” is used a lot more than “two”, which is used a lot more than “three”, which appears more frequently than “four”, and so on. The convention “one_NUM” indicates that the word “one” is being used as a number in the context of the book, and not as a pronoun or other part of speech.
Frequency of numbers one through nine in books in 1891:
Interestingly, the word (and number) eight is used more frequently than the word (and number) seven in their corpus. While it looks close to obeying Benford’s law, a corpus of books isn’t necessarily a “clean” data set.
But what is a “clean” data set? For Benford’s law to hold, the data set should span several orders of magnitude, and should be large. While Ngram has a lot of data, the numbers one through nine only cover one order of magnitude. To put it another way, think of the collection of speeding tickets that drivers in the US receive. Probably very few people get tickets for driving in the 10s, 20s, or even the 30s (in miles per hour), but many more get them for going in the 80s, 90s, or 100s. In fact, for the stretch of I-15 between California and Las Vegas, the average speed of a ticketed vehicle is 95 miles per hour. So in reality, only two orders of magnitude are covered, and the data set doesn’t satisfy Benford’s law.
In order to find a good data set that both spans many orders of magnitude and has a lot of numbers to crunch, I took a look at the population of every country and dependency. The populations spanned eight orders of magnitude, from over 1 billion people residing in each of India and China to just 56 souls inhabiting the Pitcairn Islands, a dependency of the UK.
Surprisingly, this data set does not satisfy Benford’s law (eight appears more often than seven, just like in the Ngram case). But a growing population should satisfy Benford’s law. I figured that country borders are pretty arbitrary, and there are only 253 countries and dependencies in the list, so maybe that is the problem. Bigger is better as far as statistics go. So I pulled up the list of the populations of all US counties and county equivalents (some states have boroughs or the like) as of July 1, 2013, which had the populations for 3142 counties. This trove is more than ten times richer than the previous data set. The data spans a large range of populations as well, from 10,017,068 for Los Angeles County in California to only 90 for Kalawao County in Hawaii.
This obviously holds up much better, and follows the general trend of Benford’s law.
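A comparison like this boils down to tallying leading digits and setting the observed shares next to the Benford prediction. Here is a minimal sketch of that tally. The county figures are not reproduced in this article, so the example runs on a stand-in series, the first 1000 powers of two, which is a purely mathematical sequence and not population data. The helper names `leading_digit` and `leading_digit_shares` are made up for this illustration.

```python
import math
from collections import Counter


def leading_digit(value: int) -> int:
    """First decimal digit of a non-zero integer."""
    return int(str(abs(value))[0])


def leading_digit_shares(values):
    """Map each digit 1..9 to its share of the non-zero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Stand-in data: 2**0 .. 2**999 (NOT the county populations).
sample = [2 ** n for n in range(1000)]
observed = leading_digit_shares(sample)

for d in range(1, 10):
    benford = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed[d]:.3f}, Benford {benford:.3f}")
```

To run the same check on real figures, replace `sample` with a list of integer populations. Zeros are skipped here because a leading digit of 0 has no place in Benford’s law.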
So if taxes obey Benford’s law, but speeding tickets and age of death don’t, why should county population obey it? Well, the human population is currently growing (it is projected to double every 70 years or so for the next few doublings), and if we assume that the rate of population growth is uniform over the US (I’m ignoring migration and such, but it makes life easier), then we can start to rationalize it. If we begin with 100 people in a town, there will be somewhere between 100 and 199 people in that town for 70 years, until the 200th person shows up. The population will hit 300 only 41 years later, and a mere 29 years after that it will hit 400 (when the town finishes its second population doubling). This means that the first digit of the population will have been 1 for 70 years, 2 for 41, and 3 for only 29. Therefore, a snapshot in time of the populations of a few thousand counties, each caught at a random point in its growth, should obey Benford’s law.
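The arithmetic in that argument can be checked with a short sketch. It makes the same simplifying assumptions as the text (steady exponential growth with a 70-year doubling time, no migration), and the function name `years_with_leading_digit` is made up for this illustration.

```python
import math

DOUBLING_TIME_YEARS = 70  # doubling time quoted in the text


def years_with_leading_digit(d: int, doubling_time: float = DOUBLING_TIME_YEARS) -> float:
    """Years an exponentially growing population spends with leading digit d.

    Growing from d * 10**k to (d + 1) * 10**k takes
    doubling_time * log2((d + 1) / d) years, whatever k is.
    """
    return doubling_time * math.log2((d + 1) / d)


years = {d: years_with_leading_digit(d) for d in range(1, 10)}
total = sum(years.values())  # time needed to grow tenfold

for d, y in years.items():
    print(f"digit {d}: {y:5.1f} years, {y / total:.1%} of the time")
```

The first three digits come out at roughly 70, 41, and 29 years, matching the figures above. Dividing each by the time for a full tenfold increase (about 232.5 years) gives exactly the Benford shares, because the doubling time cancels out and leaves log10(1 + 1/d).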
While Benford’s law is certainly an interesting phenomenon, as far as life is concerned, only death and taxes are certain. Benford’s law is not universal: it holds only under stringent requirements, such as a large data set spanning many orders of magnitude.