Wednesday, July 3, 2013

Talking Trash: What's in a Word?

What's in a word? Several nucleotides, some researchers might say. By applying statistical methods developed by linguists, investigators have found that "junk" parts of the genomes of many organisms may be expressing a language. These regions have traditionally been regarded as 'useless' accumulations of material from millions of years of evolution.
'The feeling is,' says Boston University physicist Eugene Stanley, 'that there's something going on in the non-coding region.'

Junk DNA got its name because the nucleotides there (the fundamental pieces of DNA, combined into so-called base pairs) do not encode instructions for making proteins, the basis for life. In fact, the vast majority of genetic material in organisms from bacteria to mammals consists of non-coding DNA segments, which are interspersed with the coding parts. In humans, about 97 percent of the genome is junk. Over the past 10 years, biologists have begun to suspect that this feature is not entirely trivial.
"It's unlikely that every base pair in non-coding DNA is critical, but it is also foolish to say that all of it is junk," notes Robert Tjian, a biochemist at the University of California at Berkeley.

For instance, studies have found that mutations in certain parts of the non-coding regions lead to cancer. Physicists backed the suspicions a few years ago, when those studying fractals noticed certain patterns in junk DNA. They found that non-coding sequences display what are termed long-range correlations. That is, the position of a nucleotide depends to some extent on the placement of other nucleotides.

Their patterns follow a fractal-like property called 1/f noise, which is inherent in many physical systems that evolve over time, such as electronic circuits, the periodicity of earthquakes and even traffic patterns. In the genome, however, the long-range correlations held only for the non-coding sequences; the coding parts exhibited an uncorrelated pattern. Those signs suggested that junk DNA might contain some kind of organized information. To decipher the message, Stanley and his colleagues Rosario N. Mantegna, Sergey V. Buldyrev and Shlomo Havlin collaborated with Ary L. Goldberger, Chung-Kang Peng and Michael Simons of Harvard Medical School.

They borrowed from the work of linguist George K. Zipf, who, by looking at texts from several languages, ranked the frequency with which words occur. Plotting the rank of words against their frequency in a text produces a distinct relation. The most common word ("the" in English) occurs 10 times more often than the 10th most common word, 100 times more often than the 100th most common, and so forth. The researchers tested the relation on 40 DNA sequences of species ranging from viruses to humans.

They then grouped pairs of nucleotides to create words between three and eight pairs long (it takes three pairs to specify an amino acid). In every case, they found that non-coding regions followed the Zipf relation more closely than did coding regions, suggesting that junk DNA follows the structure of languages.
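As a rough illustration of the rank-frequency test described above, the following Python sketch counts fixed-length "words" in a nucleotide string and fits a straight line to log(frequency) against log(rank). An ideal Zipf relation, in which the word of rank r occurs in proportion to 1/r, gives a fitted exponent near 1. The function name `zipf_exponent`, the overlapping sliding window and the plain least-squares fit are all assumptions made for illustration; the article does not say how the researchers cut the sequences into words or fitted the relation.

```python
from collections import Counter
import math


def zipf_exponent(sequence, word_length):
    """Estimate a Zipf exponent for fixed-length words in a nucleotide string.

    Hypothetical helper for illustration only, not the researchers' procedure.
    """
    # Assumption: words overlap, with the window moving one nucleotide at a time.
    words = [sequence[i:i + word_length]
             for i in range(len(sequence) - word_length + 1)]

    # Word counts sorted from most to least frequent: rank 1, 2, 3, ...
    counts = sorted(Counter(words).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a line")

    # Least-squares fit of log(count) against log(rank).
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is negative for a falling curve; report its magnitude.
    return -sxy / sxx
```

Calling `zipf_exponent(seq, n)` for n from 3 to 8 on a coding and a non-coding stretch would give one crude way to compare how closely each follows the relation, though short sequences produce noisy estimates.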
"We didn't expect the coding DNA to obey Zipf," Stanley notes. "A code is literal: one if by land, two if by sea."

You can't have any mistakes in a code. Language, in contrast, is a statistical, structured system with built-in redundancies. A few mumbled words or scattered typos usually do not render a sentence incomprehensible.

In fact, the workers tested this notion of repetition by applying a second analysis, this time from information theorist Claude E. Shannon, who in the 1950s quantified redundancies in languages. They found that junk DNA contains three to four times the redundancies of coding segments. Because of the statistical nature of the results, the researchers admit their findings are unlikely to help biologists identify functional aspects of junk DNA. Rather, the work may indicate something about efficient information storage.
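One simplified way to picture a redundancy measure of this kind: compute the Shannon entropy of the observed word distribution and compare it with the maximum possible for a four-letter alphabet, which is 2 bits per nucleotide. The Python sketch below does this for fixed-length words. The function name `redundancy`, the overlapping window and the use of a single word length are assumptions for illustration, and the article does not describe the exact calculation the researchers used.

```python
from collections import Counter
import math


def redundancy(sequence, word_length):
    """Return 1 - H / H_max for fixed-length words in a nucleotide string.

    Hypothetical helper for illustration only. H is the Shannon entropy
    (in bits) of the observed word frequencies; H_max assumes four equally
    likely, independent nucleotides, i.e. 2 bits per position.
    """
    # Assumption: words overlap, with the window moving one nucleotide at a time.
    words = [sequence[i:i + word_length]
             for i in range(len(sequence) - word_length + 1)]
    if not words:
        raise ValueError("sequence is shorter than word_length")

    total = len(words)
    entropy = -sum((count / total) * math.log2(count / total)
                   for count in Counter(words).values())

    max_entropy = 2.0 * word_length  # log2(4) bits for each nucleotide
    return 1.0 - entropy / max_entropy
```

A value of 0 means the words look maximally unpredictable, and a value of 1 means a single word repeats throughout. On short sequences the entropy estimate is biased low, so the redundancy comes out too high; comparisons are only meaningful between stretches of similar length.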
"There has to be some sort of hierarchical arrangement of the information to allow one to use it in an efficient fashion and to have some adaptability and flexibility," Goldberger observes.

Another speculation is that junk sequences may be essential to the way DNA has to fold to fit into the nucleus.

Some researchers question whether the group has found anything significant. One of those is Benoit Mandelbrot of Yale University. In the 1950s the mathematician pointed out that Zipf's law is a statistical numbers game that has little to do with recognizable language features, such as semantics. Moreover, he claims the group made several errors.
'Their evidence does not establish Zipf's law even remotely,' he says.
But such criticisms are not stopping the Boston workers from trying to decipher junk DNA's tongue.
'It could be a dead language,' Stanley says, 'but the search will be exciting.'