Where did I come from?

Phylogenetics and Evolutionary History

Here is some background information on the process of estimating evolutionary relationships among organisms.

  • What are genomes?
  • Which genes do scientists use for estimating evolutionary relationships?
  • How are gene sequences organised before analysis?
  • How are relationships presented?
  • What methods are used to estimate these relationships?

Genomes are the cellular components which contain genetic material. All plants and animals have a large genome in the nucleus of (most of) their cells. This genome is usually seen in the form of chromosomes. These organisms also have other genomes. Both plants and animals have energy-producing cell structures called mitochondria which contain very small genomes. Plants have additional structures called chloroplasts, where photosynthesis occurs, which also have small genomes.



Gene Regions Used in Studying Species Relationships

Scientists interested in studying species relationships generally choose to study specific gene regions for very practical reasons. Their primary concerns are:

  • Is there sufficient genetic material in the specimen for me to analyze?
  • Is there enough genetic difference among species for me to be able to distinguish among them?
  • Is the genetic difference among species small enough that the sequences can still be meaningfully compared?

They are generally not interested in whether the gene region is involved in some significant metabolic function or cellular process. In fact they might try to avoid using such regions. Why do you think that might be?

In general, they choose gene regions in the mitochondrial genome. There are thousands of copies of a mitochondrial gene per cell, rather than just the two copies of any gene in the nuclear genome, so there is more genetic material to work with. Different gene regions also change at different rates. Fast-evolving regions such as the "D-loop" might be used for studying closely related species, while slowly evolving regions such as cytochrome b (CYTB) and cytochrome oxidase I (COI or COX1) might be used for distantly related species.




Here is the mitochondrial genome for the pig (Sus scrofa). Dark green represents genes which produce a protein, light green represents ribosomal RNA genes, red represents transfer RNA genes, and white represents other structures.

Sequence Alignment

In studying evolutionary relationships from genetic sequences, the first step is to create a sequence alignment. First, all of the sequences are gathered together.




Then gaps ('-') are inserted to bring the sequences into alignment. The hypothesis is that nucleotides in the same column share a common evolutionary origin, so any differences between them are the result of evolution. The gaps represent either an insertion of genetic material in some sequences, or a deletion of material in others. Because we often don't know which happened, these are called indels.
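To make the idea of alignment columns concrete, here is a minimal Python sketch. The three short sequences are invented for illustration and are not real data, and the function name classify_columns is hypothetical. It simply walks down the columns of an alignment and labels each one.

```python
# Hypothetical aligned sequences (made up for illustration).
# '-' marks a gap, i.e. an indel.
alignment = {
    "sheep": "ACGTTA-CGT",
    "goat":  "ACGTTA-CGT",
    "cow":   "ACGCTAGCGT",
}

def classify_columns(alignment):
    """Label each alignment column as 'same', 'different' or 'indel'."""
    labels = []
    # zip(*...) yields one tuple per column, with one nucleotide per sequence.
    for column in zip(*alignment.values()):
        if "-" in column:
            labels.append("indel")       # at least one sequence has a gap here
        elif len(set(column)) == 1:
            labels.append("same")        # every sequence has the same nucleotide
        else:
            labels.append("different")   # the sequences disagree at this position
    return labels

labels = classify_columns(alignment)
for position, label in enumerate(labels, start=1):
    print(position, label)
```

In this made-up example, position 4 is a difference (cow has C where sheep and goat have T) and position 7 is an indel: either cow gained a G or the other two lost one.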

Phylogenetic Trees

Evolutionary relationships can be represented in a phylogenetic tree. Think of it as a family tree of species. Trees can be drawn with the tree expanding upwards, horizontally or down. When the tree is drawn horizontally, closely related species are at the ends of branches (horizontal lines) which start at a node (vertical lines).

The length of a branch indicates either the amount of elapsed time or the amount of evolutionary change between one node and the next. Each node represents a common ancestor of all the organisms to the right of it.

The relatedness of two species is indicated by how far back (leftward) you need to trace their ancestry until you reach a common ancestor. So in the tree shown here, sheep and goat are more closely related to each other than either is to cow. Sheep and goat are also equally related to cow, because both trace back to the same common ancestor with cow. The tree could equally have been drawn with sheep and goat reversed.
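The same reasoning can be sketched in Python. This is only an illustration: the tree is written as nested pairs, it contains just the three species named above, and the function names leaves and common_ancestor are hypothetical.

```python
# A tree as nested pairs: each pair is a node, each string is a species.
# Only sheep, goat and cow are included; this is an illustrative fragment.
tree = (("sheep", "goat"), "cow")

def leaves(tree):
    """Return the set of species names at the tips of a tree."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def common_ancestor(tree, a, b):
    """Return the smallest subtree (node) that contains both a and b."""
    if not {a, b} <= leaves(tree):
        raise ValueError("both species must be in the tree")
    while not isinstance(tree, str):
        for child in tree:
            if {a, b} <= leaves(child):
                tree = child      # both species sit deeper down this branch
                break
        else:
            return tree           # the species split here: this is the node
    return tree

print(common_ancestor(tree, "sheep", "goat"))  # the (sheep, goat) node
print(common_ancestor(tree, "sheep", "cow"))   # the whole tree
print(common_ancestor(tree, "goat", "cow"))    # the whole tree as well
```

Sheep and goat meet at a node deeper in the tree than the node where either meets cow, and both meet cow at the same node, which is what "equally related to cow" means.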




Evolution is a one-way process; it is a time-dependent process that doesn't go backwards. So how do we know what should be the oldest point on the tree? We add an organism to the analysis which we are certain is not closely related to the species of interest. In this example it is the kangaroo, the only marsupial among a group of placental mammals. This organism is called the outgroup. The oldest point on the tree, called the root, is placed between the outgroup and the rest of the tree.

Estimating the Phylogenetic Tree

So where did that phylogenetic tree come from? Evolutionary biologists have, for a very long time, been building evolutionary trees from analyses of physical characteristics of organisms. They have done this primarily by applying logic to the problem: looking at where in the fossil record we find organisms with a particular trait, and then thinking about which traits must have evolved before others. It is much more difficult to do that with genetic sequences, in part because most of our evidence is from modern specimens. So phylogenetic trees based on genetic sequences are estimated computationally, by applying a particular type of reasoning many times over.

The simplest methods for computing the tree are statistical methods called clustering. Clustering involves drawing diagrams with the most similar things linked together, and progressively more different things linked more distantly. To use clustering we need some way of measuring the similarity (or its inverse, the distance) of genetic sequences. The simplest measure of distance between two sequences is the proportion of positions in an alignment at which they differ. So if two sequences are the same at 98 positions and different at 2, out of 100 aligned positions, then we would say that they have a distance of 0.02 (2%).
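The distance calculation described above can be sketched in a few lines of Python. The function name p_distance and the two test sequences are made up for illustration. Skipping positions that contain a gap is an assumption of this sketch, since the text does not say how gaps should be counted.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared alignment positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue              # assumption: positions with a gap are ignored
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no positions without gaps to compare")
    return differences / compared

# Two made-up sequences: identical at 98 positions, different at the last 2.
seq_1 = "A" * 100
seq_2 = "A" * 98 + "CG"
print(p_distance(seq_1, seq_2))   # 0.02
```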

Once we have estimated the genetic distance between all of the pairs of sequences, we can proceed to cluster them. We begin by finding the pair which is most similar, i.e. has the smallest genetic distance, and we join them. From now on we treat that pair as a single unit. Then we find the next closest pair, counting the group that we have just constructed as one of the candidates, and we join them together. We repeat this process many times until all of the sequences have been added to the tree. It might be that some sections of the tree get built separately and are joined to the rest of the tree only at the end of the procedure.
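The joining procedure above can be sketched as follows. The distances are invented for illustration and the function name cluster is hypothetical. The text does not say how to measure the distance from a joined group to everything else, so this sketch assumes the average of the distances between their members (often called average linkage).

```python
from itertools import combinations

# Made-up pairwise genetic distances, for illustration only.
distances = {
    frozenset(("sheep", "goat")): 0.05,
    frozenset(("sheep", "cow")): 0.12,
    frozenset(("goat", "cow")): 0.13,
    frozenset(("sheep", "kangaroo")): 0.30,
    frozenset(("goat", "kangaroo")): 0.31,
    frozenset(("cow", "kangaroo")): 0.29,
}

def cluster(distances):
    """Repeatedly join the closest pair; return the tree as nested pairs."""
    names = sorted({name for pair in distances for name in pair})
    # Maps each current group (a name or a nested pair) to the species in it.
    clusters = {name: [name] for name in names}

    def average_distance(group_a, group_b):
        # Assumption: distance between groups is the mean over all member pairs.
        members_a, members_b = clusters[group_a], clusters[group_b]
        total = sum(distances[frozenset((x, y))]
                    for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find the closest pair of groups and join them into one unit.
        group_a, group_b = min(combinations(clusters, 2),
                               key=lambda pair: average_distance(*pair))
        members = clusters.pop(group_a) + clusters.pop(group_b)
        clusters[(group_a, group_b)] = members
    return next(iter(clusters))

joined_tree = cluster(distances)
print(joined_tree)   # ('kangaroo', ('cow', ('goat', 'sheep')))
```

With these made-up numbers, goat and sheep are joined first, then cow is joined to that pair, and kangaroo is joined last. The order in which the two members of each pair are written carries no meaning.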

Of course, this description of genetic distance and clustering is not the whole story. We need to make statistical corrections for the possibility that evolutionary changes have occurred more than once at a position in the sequence alignment. Also we need to decide where those changes have occurred in the history of the organisms, and so we use other statistical procedures to decide how many changes should be assigned to each branch of the phylogenetic tree.
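As an illustration of the first kind of correction, here is a sketch of the Jukes-Cantor correction. The text does not name a particular correction, so choosing this one is an assumption. It is one of the simplest, and it assumes that every nucleotide is equally likely to change into any other. The function name is hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Correct an observed proportion of differences p for repeated changes.

    Uses the Jukes-Cantor formula d = -(3/4) * ln(1 - (4/3) * p).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be at least 0 and less than 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# The correction is tiny for similar sequences and grows as they diverge.
for observed in (0.02, 0.10, 0.30, 0.50):
    print(observed, round(jukes_cantor_distance(observed), 4))
```

The corrected distance is always at least as large as the observed one, because some changes are hidden when the same position changes more than once.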

Another group of more sophisticated methods searches through the very large number of possible trees to find the best one. That best tree might be the one which requires the fewest evolutionary changes to produce the alignment that we observed (an approach known as maximum parsimony). Or it could be the tree with the greatest probability of giving the observed sequences, given some statistical model of molecular genetic evolution (maximum likelihood).
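To show what "fewest evolutionary changes" means in practice, here is a sketch that counts the minimum number of changes a given tree needs in order to explain an alignment, using Fitch's counting method. The choice of that method, the tiny alignment, and the function names are all assumptions made for illustration. A real analysis would score a very large number of candidate trees, not just two.

```python
# Made-up alignment without gaps, for illustration only.
alignment = {
    "kangaroo": "AAGT",
    "cow":      "ACGT",
    "sheep":    "ACTT",
    "goat":     "ACTT",
}

def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):
        return {states[tree]}, 0           # a tip: its observed nucleotide
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two branches can agree, so no extra change is needed here.
        return shared, left_changes + right_changes
    # The branches disagree: at least one change happened below this node.
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

tree_a = ("kangaroo", ("cow", ("sheep", "goat")))
tree_b = ("kangaroo", ("sheep", ("cow", "goat")))
print(parsimony_score(tree_a, alignment))  # 2
print(parsimony_score(tree_b, alignment))  # 3
```

With this made-up alignment, the tree that groups sheep with goat needs 2 changes and the tree that groups cow with goat needs 3, so the first tree would be preferred under this criterion.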