From 1859, when Charles Darwin proposed the theory of evolution, to the discovery of the structure of DNA in the 1950s and the cracking of the genetic code in the 1960s, the mystery of life and heredity has steadily been laid bare. At its core is the code, built of three-letter words using a four-character alphabet, which helps build millions of proteins and enables living things to do what sets them apart – to reproduce.

The code is a mathematically elegant construction – it is precise, economical and error-protected – an end product more efficient than any variant that we can suggest. It is universal and unchanged, from the simplest, single-celled organism to the greatest of mammals. By what stages could this code have arisen? Masayori Inouye, Risa Takino, Yojiro Ishida and Keiko Inouye from the Rutgers-Robert Wood Johnson Medical School, New Jersey, propose a new look at the question in the journal, Proceedings of the National Academy of Sciences.

Just as the most inspiring concept of an architect cannot be realised unless she prepares a blueprint, an organism, no matter how efficient, cannot have a second generation unless it contains within itself the blueprint of its own construction. Living things are essentially their cells and the set of proteins that the cells produce, which control the way other cells of the organism behave. The cells of all living things hence contain a blueprint, in the form of a long (very long – billions of units long) ticker tape that carries the code for the proteins. The DNA molecule is the tape, and the codes for the proteins are stretches of DNA called genes. The genes, in turn, are built up of three-letter words from an alphabet of four kinds of chemical groups, the letters, called the bases.

Now, the structure of proteins has been optimised to consist of a chain, often a very long chain, of components drawn from a set of just 20 different units, called amino acids. Within the DNA, each group of three letters, formed out of the four letters that are available, is called a codon and specifies one amino acid. The box on this page shows how many three-letter words we can form with a four-letter alphabet, and it works out to be 64 (4 × 4 × 4). If the word had only two letters, there would be only 16 ways (4 × 4) that it could be formed, which is not enough to describe 20 amino acids. We hence need at least three letters in the word. And if 64 is a lot more than 20, well, three codons have special uses, while the remaining 61 provide alternate forms for most of the amino acids, as insurance against errors when the code in the DNA is transcribed!
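The counting argument above can be checked by simply listing the words. The short Python sketch below is only an illustration: the helper name `count_words` is made up for this example, and the four bases are written with the letters U, C, A and G.

```python
from itertools import product

BASES = "UCAG"  # the four-letter alphabet of bases

def count_words(length, alphabet=BASES):
    """Count every word of the given length that the alphabet can spell."""
    return sum(1 for _ in product(alphabet, repeat=length))

AMINO_ACIDS_NEEDED = 20

for length in (1, 2, 3):
    words = count_words(length)
    verdict = "enough" if words >= AMINO_ACIDS_NEEDED else "too few"
    print(f"{length}-letter words: {words} ({verdict} for 20 amino acids)")
```

Three letters is the shortest word length at which the number of words reaches 20, which is the point the paragraph makes.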

That living organisms are able to implement this mathematically elegant system, using just chemical combinations within the organisms’ cells, shows the great power of the process of evolution and raises a question of how it may have come about. One theory is that the first amino acids were born from the elements in the stormy and energetic environment of early Earth. Amino acids that have been created in laboratory simulations, and traces found in meteorites, suggest that there may have been 10 amino acids at the start of life, and these grew into 10 more, stabilising at the efficient number of 20. The work done by the authors of the paper, however, finds that there may have been seven amino acids to start with, and more than one route for their development.

The four letters, or chemical groups, which form the codons are U for uracil, C for cytosine, A for adenine and G for guanine. The picture shows how the 20 amino acids (and three “stop” codons to separate the genes) are encoded by combinations of U, C, A and G. Significantly, two amino acids are encoded by only one codon each, nine by two codons, just one by three codons, five by four codons and three by six codons. The number of redundant forms, however, does not generally correspond to the abundance of the amino acids, the paper says. For example, of the three amino acids coded by six codons (green), arginine and serine are not the most frequently found. It is hence likely that the different forms came about by different processes.
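The tally of how many codons each amino acid gets can be reproduced from the standard genetic code table. The sketch below writes that table in a compact form, with one-letter amino acid symbols listed in U, C, A, G order and “*” marking a stop codon. The variable names are choices made for this illustration and are not taken from the paper.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"

# Standard genetic code, one symbol per codon, in the order
# UUU, UUC, UUA, UUG, UCU, ... GGG ("*" marks a stop codon).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons spell each amino acid (stop codons set aside).
codons_per_amino_acid = Counter(
    aa for aa in CODON_TABLE.values() if aa != "*"
)
stop_codons = sum(1 for aa in CODON_TABLE.values() if aa == "*")

# How many amino acids have 1, 2, 3, 4 or 6 codons.
degeneracy = Counter(codons_per_amino_acid.values())

for n_codons in sorted(degeneracy):
    print(f"{degeneracy[n_codons]} amino acid(s) coded by {n_codons} codon(s)")
print(f"{stop_codons} stop codons")
```

The six-codon group in this table is L, S and R, that is leucine, serine and arginine, the three amino acids the article goes on to compare.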

In the case of leucine and arginine, the codons share bases in such a way that one codon can transform to another with a change of only one base. This, however, is not true in the case of serine. Here, we have four codons that start with “UC” and two more that start with “AG”. It would hence take a change of two bases for a codon in one group to reach a form in the other. Further, the paper notes, single base changes in the first or second place of the “AG” codons lead to six different amino acids that are unrelated to serine. The authors hence suggest that the origin of the two forms which start with “AG” was different from the origin of the forms that start with “UC”.
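The difference between serine and the other two six-codon amino acids can be shown with a few lines of Python. This is a rough sketch, not the authors' method: it treats two codons as neighbours when they differ in exactly one base, and asks whether every codon of an amino acid can be reached from every other through such single steps. The function names are invented for this example.

```python
def base_changes(codon_a, codon_b):
    """Number of positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))

def is_connected(codons):
    """True if single-base changes can link all the codons together."""
    codons = list(codons)
    reached = {codons[0]}
    frontier = [codons[0]]
    while frontier:
        current = frontier.pop()
        for other in codons:
            if other not in reached and base_changes(current, other) == 1:
                reached.add(other)
                frontier.append(other)
    return len(reached) == len(codons)

LEUCINE = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"]
ARGININE = ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"]
SERINE_UC = ["UCU", "UCC", "UCA", "UCG"]
SERINE_AG = ["AGU", "AGC"]

# Smallest number of base changes needed to cross between the serine groups.
serine_gap = min(
    base_changes(uc, ag) for uc in SERINE_UC for ag in SERINE_AG
)

print("leucine connected: ", is_connected(LEUCINE))
print("arginine connected:", is_connected(ARGININE))
print("serine connected:  ", is_connected(SERINE_UC + SERINE_AG))
print("serine gap in base changes:", serine_gap)
```

Leucine and arginine each form a single connected family, while the serine codons fall into two groups that are at least two base changes apart.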

To seek evidence for this suggestion, the authors analyse 4,225 protein-coding genes of E. coli, a common intestinal bacterium. What they find is that although serine has, in theory, two “AG” codons to four “UC” codons, the two kinds do not occur in the ratio of 1:2 but in a ratio as high as 3:4. The “AG” codons are thus used disproportionately often and, within the “AG” codons, it is the “AGC” codon that is used more often. There are also differences in where the two forms of serine occur or are used.
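The kind of counting involved can be sketched as follows. This is only an illustration of the idea and not the authors' analysis: the function `serine_usage` is hypothetical, and the short sequence is a made-up toy, not E. coli data. The only figures taken from the article are the 1:2 ratio expected from the number of codons and the 3:4 ratio reported.

```python
from fractions import Fraction

UC_SERINE = {"UCU", "UCC", "UCA", "UCG"}
AG_SERINE = {"AGU", "AGC"}

def serine_usage(rna):
    """Return (AG-type count, UC-type count) of serine codons, read in frame."""
    usable = len(rna) - len(rna) % 3  # ignore any incomplete trailing codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    ag = sum(codon in AG_SERINE for codon in codons)
    uc = sum(codon in UC_SERINE for codon in codons)
    return ag, uc

# Ratio expected if all six serine codons were used equally often.
expected = Fraction(len(AG_SERINE), len(UC_SERINE))
# Ratio reported in the article for the E. coli genes.
reported = Fraction(3, 4)

print("expected AG:UC ratio:", expected)
print("reported AG:UC ratio:", reported)
print("AG codons over-used by a factor of", reported / expected)

# Made-up toy sequence, for demonstration only.
toy_gene = "AUGAGCUCUAGCGGCUCAAGUUAA"
print("toy gene (AG, UC) counts:", serine_usage(toy_gene))
```

Run over a real set of coding sequences, the same tally would give the observed ratio to set against the expected 1:2.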

This fits in, the paper says, with an idea that further analysis brings out: that “AGC” was evolutionarily one of the most primitive codons for serine, itself having descended from “GGC”, a codon for glycine. The analysis leads to the hypothesis that the codon for the first amino acid began with “GG”, and that from this the first seven amino acids arose. The remaining 13 arose from these seven, but the alternate “AG” form of serine came through an independent route.

More work on the genomes of other bacteria and other life forms, and the roles that the two forms of serine play, could further illuminate the path by which they came to be, the paper says.
