Digging into the past with a computational shovel

When you think about the best tools that we have available for understanding the origins of civilization, you might imagine a pickaxe, a dusting brush, a shovel, no doubt all of them wielded by some Indiana Jones-esque adventurer standing chest-deep in the bowels of some excavated lost city.

The problem with these methods is that they involve a painstakingly detailed approach to understanding history, picking through evidence one piece at a time. While this can be an incredibly useful way to understand the world, it isn’t the only method for building our knowledge of the past. An increasingly popular strategy does the opposite: it analyzes large amounts of information at once in order to discover hidden complexities and patterns that the human eye would miss.

All I need to fend off danger is my trusty Monte Carlo resampling model.

In a paper published this week out of UC Berkeley, a team of researchers has leveraged the raw power of modern-era computers in order to understand the complicated process by which ancient languages morph into their present-day forms. While the evolution of language is an incredibly complicated process, it also has a lot of structure. This structure tends to be relatively consistent over time, with small changes being enacted across epochs of civilization.

At the heart of the researchers’ approach is an attempt to understand and model these tiny changes to language over time, a process that they accomplish using a class of models that explain the world probabilistically rather than giving a clear-cut “yes or no” answer. Using these models, it becomes possible to describe the probability that certain components of a word might change at a given moment in time. Fast forward the clock, and these tiny changes eventually accumulate, resulting in the marked differences between the languages that exist today.
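To make the idea concrete, here is a toy sketch of small, random sound changes piling up over time. It is not the researchers' model: the sound changes, their probabilities, and the names `CHANGE_PROBS` and `evolve_word` are all invented for illustration.

```python
import random

# Hypothetical per-generation sound changes (invented for illustration):
# each sound maps to a list of (new_sound, probability_per_generation).
CHANGE_PROBS = {
    "p": [("f", 0.02)],
    "t": [("d", 0.01)],
    "k": [("h", 0.015)],
}

def evolve_word(word, generations, rng):
    """Apply random sound changes to a word, one generation at a time."""
    chars = list(word)
    for _ in range(generations):
        for i, ch in enumerate(chars):
            for target, prob in CHANGE_PROBS.get(ch, []):
                # Each sound flips with a small probability per generation.
                if rng.random() < prob:
                    chars[i] = target
                    break
    return "".join(chars)

# Five "daughter languages" descending independently from one ancestral word.
rng = random.Random(0)
ancestor = "pater"
descendants = [evolve_word(ancestor, 200, rng) for _ in range(5)]
print(descendants)
```

Any single generation almost never changes the word, but over a couple of hundred generations the daughters drift away from the ancestor and from each other, which is the accumulation effect described above.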

Modern languages show common “parent” languages that can be depicted as branches on a tree.

However, understanding the statistical probabilities of language evolution isn’t just for understanding modern-day language; it also allows us to look into the past and investigate what kinds of words existed in the so-called “protolanguage” cultures.  To do so, the researchers began with modern language.  They then “ran back the clock” using their model of language evolution to gradually morph our languages into their ancestral forms.  Incredibly, the ancient languages that the model produced were comparable to those that linguistic scholars had assembled by hand.
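As a rough illustration of “running back the clock,” the sketch below guesses an ancestral word from a few descendants by asking, sound by sound, which ancestor makes the observed words most probable. Everything here is an assumption made for the example: the one-way sound changes in `CHANGES`, the fixed number of generations, the made-up descendant words, and the requirement that the words already line up letter for letter. The real model is certainly richer than this.

```python
# Hypothetical one-way sound changes (invented for illustration):
# ancestral sound -> (descendant sound, probability of changing per generation).
CHANGES = {"p": ("f", 0.02), "t": ("d", 0.01), "k": ("h", 0.015)}

def prob_observed(ancestor_ch, observed_ch, generations):
    """Probability of seeing observed_ch today, given ancestor_ch long ago."""
    if ancestor_ch in CHANGES:
        target, p = CHANGES[ancestor_ch]
        stay = (1 - p) ** generations  # chance the sound never changed
        if observed_ch == ancestor_ch:
            return stay
        if observed_ch == target:
            return 1 - stay
        return 0.0
    # Sounds without a listed change are assumed to stay fixed.
    return 1.0 if observed_ch == ancestor_ch else 0.0

def reconstruct(words, generations):
    """Pick, position by position, the ancestral sound that best explains the words."""
    candidates = sorted(set("".join(words)) | set(CHANGES))
    result = []
    for column in zip(*words):  # assumes equal-length, pre-aligned words
        def likelihood(ancestor_ch):
            total = 1.0
            for ch in column:
                total *= prob_observed(ancestor_ch, ch, generations)
            return total
        result.append(max(candidates, key=likelihood))
    return "".join(result)

print(reconstruct(["fader", "pader", "fater"], generations=50))  # prints "pater"
```

None of the three made-up descendants matches the ancestor exactly, yet the sketch still recovers it, because a surviving "p" or "t" in even one daughter rules out the alternatives under these assumed rules.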

Thousands of modern-day languages (the perimeter of the circle) gradually evolved from a common linguistic ancestor.

Such a data-driven approach is unparalleled in historical linguistics: these models learned how language changes by churning through mountains of present-day linguistic data, not as a result of explicit programming. That this probabilistic model was able to begin with modern language and “learn” how it evolved over thousands of years is a testament to the power of computational statistics. At its heart, the approach touches on a more fundamental truth about the world: there is structure everywhere; we need only use the right tools to discover it.
