An automated ‘time machine’ to reconstruct ancient languages
February 14, 2013

Computer scientists have reconstructed ancient Proto-Austronesian, which gave rise to languages spoken in Polynesia, among other places (credit: A. Bouchard-Cote et al./University of California – Berkeley)
Researchers from University of California, Berkeley and the University of British Columbia have created a computer program that can rapidly reconstruct “proto-languages” — the linguistic ancestors from which all modern languages have evolved.
These earliest-known languages include Proto-Indo-European, Proto-Afroasiatic and, in this case, Proto-Austronesian, which gave rise to languages spoken in Southeast Asia, parts of continental Asia, Australasia and the Pacific.
Ancient languages hold a treasure trove of information about the culture, politics and commerce of millennia past. Yet, reconstructing them to reveal clues into human history can require decades of painstaking work.
“What excites me about this system is that it takes so many of the great ideas that linguists have had about historical reconstruction, and it automates them at a new scale: more data, more words, more languages, but less time,” said Dan Klein, an associate professor of computer science at UC Berkeley and co-author of the paper published online in the journal Proceedings of the National Academy of Sciences.

Proto-Austronesian “genealogical tree” (credit: A. Bouchard-Cote et al./University of California – Berkeley)

Zooming in on a portion of the “genealogical tree” of Oceanic languages, where the Nuclear Polynesian family (i) and Polynesian family (ii) are visible (credit: A. Bouchard-Cote et al./University of California – Berkeley)
The research team’s computational model uses probabilistic reasoning, which draws on logic and statistics to predict an outcome, to reconstruct the proto-languages behind more than 600 modern Austronesian languages from an existing database of more than 140,000 words, replicating with 85 percent accuracy what linguists had done manually.
While manual reconstruction is a meticulous process that can take years, this system can perform a large-scale reconstruction in a matter of days or even hours, researchers said.
Not only will this program speed up the ability of linguists to rebuild the world’s proto-languages on a large scale, boosting our understanding of ancient civilizations based on their vocabularies, but it can also provide clues to how languages might change years from now.
“Our statistical model can be used to answer scientific questions about languages over time, not only to make inferences about the past, but also to extrapolate how language might change in the future,” said Tom Griffiths, associate professor of psychology, director of UC Berkeley’s Computational Cognitive Science Lab and another co-author of the paper.
The discovery advances UC Berkeley’s mission to make sense of big data and to use new technology to document and maintain endangered languages as critical resources for preserving cultures and knowledge. For example, researchers plan to use the same computational model to reconstruct indigenous North American proto-languages.
Humans’ earliest written records date back less than 6,000 years, long after the advent of many proto-languages. While archeologists can catch direct glimpses of ancient languages in written form, linguists typically use what is known as the “comparative method” to probe the past. This method establishes relationships between languages, identifying sounds that change with regularity over time to determine whether they share a common mother language.
“To understand how language changes — which sounds are more likely to change and what they will become — requires reconstructing and analyzing massive amounts of ancestral word forms, which is where automatic reconstructions play an important role,” said Alexandre Bouchard-Côté, an assistant professor of statistics at the University of British Columbia and lead author of the study, which he started while a graduate student at UC Berkeley.
The UC Berkeley computational model is based on the established linguistic theory that words evolve along the branches of a family tree, much like a genealogical tree, reflecting linguistic relationships that evolve over time, with the root and internal nodes representing proto-languages and the leaves representing modern languages.
Using an algorithm known as a Markov chain Monte Carlo (MCMC) sampler, the program sorted through sets of cognates (words in different languages that share a common sound, history and origin) to calculate the odds of which set is derived from which proto-language. At each step, it stored a hypothesized reconstruction for each cognate and each ancestral language.
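To give a feel for what such a sampler does, here is a minimal sketch in Python. It is not the researchers' program: the four-sound inventory, the single mutation rate and the one-position "cognate set" are all assumptions made up for illustration. It uses a plain Metropolis sampler to estimate which ancestral sound most plausibly produced the sounds observed in four daughter languages.

```python
import random
from collections import Counter

ALPHABET = "ptkd"     # hypothetical toy sound inventory (assumption)
MUTATION_RATE = 0.2   # assumed chance that a sound changes along one branch


def likelihood(ancestor, observed):
    """Probability of the observed daughter sounds given one ancestral sound."""
    p_same = 1.0 - MUTATION_RATE
    p_other = MUTATION_RATE / (len(ALPHABET) - 1)
    prob = 1.0
    for sound in observed:
        prob *= p_same if sound == ancestor else p_other
    return prob


def metropolis_sample(observed, steps=5000, seed=0):
    """Sample ancestral sounds in proportion to their posterior probability.

    A uniform prior over ALPHABET is assumed, so the acceptance ratio
    reduces to a ratio of likelihoods.
    """
    rng = random.Random(seed)
    current = rng.choice(ALPHABET)
    samples = []
    for _ in range(steps):
        proposal = rng.choice(ALPHABET)  # symmetric proposal distribution
        ratio = likelihood(proposal, observed) / likelihood(current, observed)
        if rng.random() < ratio:
            current = proposal           # accept the proposed ancestor
        samples.append(current)          # store the hypothesis at each step
    return Counter(samples)


# One sound position as it appears in four hypothetical daughter languages.
counts = metropolis_sample(["t", "t", "t", "d"])
best_sound, best_count = counts.most_common(1)[0]
print(best_sound, best_count / sum(counts.values()))
```

The sampler spends most of its time on the hypotheses that explain the data best, which is why storing one hypothesis per step yields a usable picture of the likely reconstructions.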
The algorithm for ancestral word form reconstruction is based on a fundamental concept from Bayesian decision theory, the Bayes estimator.
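In general terms, a Bayes estimator picks the answer with the lowest expected loss under the posterior distribution. The notation below is an assumption for illustration, not taken from the paper: w is the true ancestral word form, w' a candidate reconstruction, L a loss function measuring how far apart two word forms are, and w^(1), …, w^(S) samples drawn from the posterior (for instance by a Markov chain Monte Carlo sampler).

```latex
\hat{w}
  = \arg\min_{w'} \; \mathbb{E}_{w \sim p(w \mid \text{data})}\bigl[ L(w', w) \bigr]
  \approx \arg\min_{w'} \; \frac{1}{S} \sum_{s=1}^{S} L\bigl(w', w^{(s)}\bigr)
```

The approximation on the right replaces the exact expectation with an average over samples, which is what makes the estimator practical when the posterior cannot be computed in closed form.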
“Because the sound changes and reconstructions are closely linked, our system uses them to repeatedly improve each other,” Klein said. “It first fixes its predicted sound changes and deduces better reconstructions of the ancient forms. It then fixes the reconstructions and re-analyzes the sound changes. These steps are repeated, and both predictions gradually improve as the underlying structure emerges over time.”
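The back-and-forth Klein describes can be sketched with a toy alternating procedure in Python. Everything here is a simplifying assumption rather than the team's model: a three-sound inventory, one shared table of sound-change probabilities, and cognate sets reduced to a single sound position. The loop first fixes the sound-change probabilities and picks the best ancestral sound for each set, then fixes those reconstructions and re-estimates the probabilities by counting.

```python
ALPHABET = "tdk"  # hypothetical toy sound inventory (assumption)


def initial_probs():
    """Starting guess: a sound most likely stays the same (0.6 vs 0.2)."""
    return {(a, b): (0.6 if a == b else 0.2) for a in ALPHABET for b in ALPHABET}


def reconstruct(cognate_sets, probs):
    """Fix the sound-change probabilities; pick the best ancestor per set."""
    ancestors = []
    for observed in cognate_sets:
        def score(ancestor):
            p = 1.0
            for sound in observed:
                p *= probs[(ancestor, sound)]
            return p
        ancestors.append(max(ALPHABET, key=score))
    return ancestors


def estimate_probs(cognate_sets, ancestors):
    """Fix the reconstructions; re-estimate sound changes by counting."""
    counts = {(a, b): 1.0 for a in ALPHABET for b in ALPHABET}  # add-one smoothing
    for ancestor, observed in zip(ancestors, cognate_sets):
        for sound in observed:
            counts[(ancestor, sound)] += 1.0
    probs = {}
    for a in ALPHABET:
        total = sum(counts[(a, b)] for b in ALPHABET)
        for b in ALPHABET:
            probs[(a, b)] = counts[(a, b)] / total
    return probs


# Each inner list is one sound position observed in three daughter languages.
cognate_sets = [["t", "t", "d"], ["t", "d", "d"], ["k", "k", "k"], ["t", "t", "t"]]

probs = initial_probs()
for _ in range(5):  # repeat the two steps a few times
    ancestors = reconstruct(cognate_sets, probs)
    probs = estimate_probs(cognate_sets, ancestors)

print(ancestors)
```

On data this small the procedure settles after the first round; the point is only the shape of the loop, in which each quantity is improved while the other is held fixed.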
Examples of Protolanguage Words Reconstructed
| English | Fijian | Melanau | Inabaknon | Protolanguage (manual reconstruction) | Protolanguage (automated reconstruction) |
| star | kalokalo | biten | bitu’on | bituqen | bituqen |
| bird | manumanu | manuk | manok | qayam | qayam |
| wind | cagi | parjay | bariyo | bali | beliu |
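One common way to quantify how close an automated reconstruction comes to a manual one is the Levenshtein edit distance: the number of single-character insertions, deletions and substitutions needed to turn one form into the other. Whether the study scored its results exactly this way is not stated here, so treat the Python sketch below as an illustration applied to the three rows of the table.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# (manual, automated) reconstructions copied from the table above.
pairs = {
    "star": ("bituqen", "bituqen"),
    "bird": ("qayam", "qayam"),
    "wind": ("bali", "beliu"),
}

distances = {gloss: edit_distance(manual, automated)
             for gloss, (manual, automated) in pairs.items()}
print(distances)  # {'star': 0, 'bird': 0, 'wind': 2}
```

Two of the three automated forms match the manual reconstructions exactly, and the third differs by one substituted vowel and one added vowel.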
Comments (17)
by NakedApe
I watched a program about India and how there is a religious sect which has chants in a language so ancient that it may date to the very rise of human language. The monks are religious :) about preserving the integrity of this language even though they don’t understand it. Wouldn’t it be awesome to study this language, which may date back 40-50 K years, and gain a much better understanding of the roots of our evolutionary success. Mind boggling!
by willard van de bogart
Could you provide the info on that program? I would like to watch it.
by Chasseurnoir
NakedApe, could you give me the name of this religious sect, or tell me where you found it? I’m a student in linguistics.
Thank you
by aus
I’ve stated before that I think highly advanced computer simulations will serve as “virtual time machines” in the future.
by Dennis R.
I don’t understand the need. Without a written language and no current native speakers, this doesn’t seem to have any useful purpose. Why re-create a language no one speaks? Why not spend more effort on creating simultaneous translation of current languages? We’d be better off being able to communicate with contemporaries who speak different languages now than hypothesizing what long-dead communities might have said.
I applaud the thinking that went into this invention. I don’t see much value in how it’s being applied.
by WLGJR
Hans Moravec talked about “Ancestor Simulation”.
With this invention we can reconstruct our ancestors and their cultural details better.
by John Coghlan
For one thing, we can decode ancient scripts better and understand humanity better. We can find out more about the late prehistoric period. Had we not studied the death of the Polynesian culture on Easter Island, we would not have understood the environmental collapse of that people. As well, we would not have had writers such as Jared Diamond to write books warning us of certain kinds of environmental collapse. Now that IS very relevant to the modern world.
by eldras
I salute this struggle. I’ve been hobbying in the area of historical data reconstruction for a while in Quantum Archaeology on the forums, and formally here:
https://sites.google.com/site/quantumarchaeology/
by Ian
Quantum archaeology is… seriously freaky.
by WLGJR
If we gather enough information (after we *saturate* the whole universe, like Kurzweil predicted) from a universe-wide array of detectors/data-collectors, we may be able to reconstruct the totality of the universe’s history, and perhaps even the previous universe(s) (ones that existed prior to the Big Bang).
But will the future-beings have spare computing powers to run these simulations? Or will they be more concerned with survival (into the Deep Time, through extending the universe’s lifespan and preventing Heat Death/Big Crunch/Big Rip or other Cosmic Eschatons) and spend no energy running such simulations?
by WLGJR
BTW, eldras, what you have written is a masterpiece!
Thanks for sharing, I enjoy reading it.
by FinitDeMorte
Turn this thing loose on the Voynich Manuscript!
by WLGJR
Uh… The Voynich manuscript is in an unknown language, so it is not directly related to what this new algorithm does (it takes in known languages, complete with vocabularies, and reconstructs, *not necessarily accurately*, the ancestral language).
Looked it up on Wikipedia, though; thanks for mentioning it. Very fascinating. I suggest trying Rongo Rongo as well.
I imagine that, if an AI has sufficient knowledge about human culture AND cryptography, it can decipher many (now) undecipherable ancient languages. (But I also think that, to decipher successfully, the original text has to be long and monotonal (in content) enough so that the AI can *make sense* of the text.)
E.g. the Rosetta Stone was sufficiently large and contained almost fully the “midsection” (the simplified Egyptian version). It is thanks to this that the archaeologists could decipher ancient Egyptian (or Kemet-ese, “Kemet” being the ancient Egyptians’ own name for their civilization).
This will probably have some value (though limited compared to its value for human cultures) in deciphering alien transmissions. I remember an excellent Hugo (or was it Nebula?) short novel written about an alien transmission.
Full of ambiguity is what we must expect when we decode an alien transmission for the first time.
by WLGJR
I think it would be great if the researchers used the new application to reconstruct the “Proto-Sapiens” language, the hypothetical ancestral language of all humanity.
by WLGJR
Which in my opinion is (political correctness-wise) a more befitting Lingua Franca for the world than English, Esperanto and other European languages.
But for many practical purposes it would inevitably borrow a large number of words from European languages, so let’s stick to English anyways.
by WLGJR
Will the “Proto-Sapiens” language have any *magic* effect?
E.g. in the SF novel Snow Crash, it is hypothesized that the Sumerian language is the *default* language of the Homo sapiens species.
But Sumerian is not really that *universal*, when compared to the Proto-Sapiens language.
If occurrences similar to those depicted in Snow Crash happen in real life, I guess *Proto-Sapiens* would be a more likely language to be spoken.
Uncovering Proto-Sapiens will pose dangers as well. If there are living languages (spoken natural languages) that, by coincidence, are *very* similar to Proto-Sapiens (Austronesian languages and some sub-Saharan African languages, the latter I suppose more likely because H. sapiens originated, according to paleoanthropologists, in Olduvai Gorge, which is in Africa), the speakers of these languages would be called “primitive” or even “un-evolved”, which of course violates the Golden Rule of PC (Political Correctness).
by Camaxtli
This is an amazing tool. I look forward to the information linguistics researchers, archeologists and anthropologists uncover with it.