## 8/06/2006

### Mr Bayes plays silly blighters

(Statistics)

I've just noticed that a fellow blogger is having a bit of trouble getting her head round Bayesian analysis. Actually, it seems she's got the hang of it, but it occurs to me that some background on why Bayesian analysis works might be useful. Being a maths geek and all-round nosey person, I am of course going to get involved with attempting to explain the situation.

Bayesian analysis consists, quite simply, of one method to determine how probabilities behave when our model is accurate plus another method that uses this to refine our model. I'll briefly cover both.

The way things are

The result I'm going to discuss here is called Bayes' Theorem. It's most easily explained using a visual technique called Venn diagrams, but that involves lots of irritating drawing and uploading so I'll skip it for the moment. Instead I'll briefly run you through the maths.

The first thing you need to know is conditional probability. Conditional probability is a way of figuring out how likely it is that one thing happens given that another thing has already happened. The important result here is: the probability that A happens given B happens = the probability that A happens and B happens divided by the probability that B happens. I'll write this as P(A|B) = P(AnB)/P(B). Stare at this for a while and it'll hopefully start to make sense.
P(A|B) = P(AnB)/P(B)

This allows us to produce an interesting little equality - Bayes' Theorem. Since P(A|B)P(B) = P(AnB) = P(BnA) = P(B|A)P(A), we can write P(A|B) = P(B|A)P(A)/P(B). This is important - memorise it.
P(A|B) = P(B|A)P(A)/P(B)

Let's consider an example. Say we're picking black and white balls out of a pot, with replacement. We know there are 10 balls in the pot; what we don't know is the number that are black. Our task is to figure out the number of black balls. I'll need to cover some other topics, but I'd like to point out an interesting phenomenon here: if we guess correctly, the data will match our guess. No amount of data will change our mind. The posterior probability P(# black | data) will equal the prior probability P(# black).
If our distribution is accurate, P(distribution|data) = P(distribution)

The way things should be

I'm going to leave our concrete example for the moment and briefly cover an important numerical approximation technique. The actual proof thereof takes advanced mathematical skills, but the basic idea is simple. It is this: if an equation can be formulated as x = f(x), then setting xn+1 = f(xn) will result in xn converging on a solution of the equation as n increases.
If x=f(x) then setting xn+1 = f(xn) will cause xn to converge to x

Let's take an example. The other day my dad asked me to work out a fairly complicated problem in compound interest for him. It turned out that the answer I was after was a solution to the equation 46.40*x300 - 23446.40*x + 23400 = 0. Remembering the above mathematical technique, I rewrote the equation as x = ((23446.40*x + 23400)/46.40)1/300. I then plugged in an initial value of 1.05 and iterated a couple of times. The algorithm very quickly converged on a correct solution of 1.0233653112365164.

How does this apply to our example? Well, if you look at the equations we derived above, you'll notice something: if our distribution is accurate, we have the equation P(distribution) = P(distribution|data) = P(data|distribution)*P(distribution)/P(data). This is precisely the sort of equation that this algorithm is designed to find solutions for.

Let's get back to the balls. We have eleven possible values for the number of black balls - anything from 0 to 10 - which gives rise to eleven possible binomial distributions that could be giving rise to the data. To start with, let's assume that these distributions are equally likely - P(# black = k) = 1/11 for all possible k. This distribution of distributions will act as our prior probabilities.

Now let's go out and collect some data. We pick 40 balls and find that 21 are black and 19 are white. What are the associated posterior probabilities?

Well, first we need to work out two things: the probability of getting that data given a value k for the number of black balls, and the overall probability of getting that data. The first is fairly standard mathematics, so I won't go into it here - just google on "Binomial Distribution" if you're confused. The answer turns out to be 40!/(21!*19!)*(k/10)21(1 - k/10)19. So, for example, with k=5 this probability is 0.119.

Now we can work out the overall probability of getting that data - note that P(data) = sum over distributions of P(data n distribution) = sum( P(data|distribution)*P(distribution) ). In the specific example, this turns out to be 1/11*(0 + 1.77*10-11 + 3.97*10-6 + 0.00157 + 0.0352 + 0.119 + 0.0792 + 0.00852 + 6.35*10-5 + 1.44*10-9 + 0) = 0.244/11. Cool, huh?

OK, we now have all the parts we need to generate posterior probabilities. Here goes nothing...

P(# black = k | 21 black, 19 white) = P(21 black, 19 white | # black = k)*P(# black = k)/P(21 black, 19 white)
= 40!/(21!*19!)*(k/10)21(1 - k/10)19 * 1/11 / (0.244/11)
= 40!/(21!*19!)*(k/10)21(1 - k/10)19 / 0.244

Which gives us the following posterior probabilities:
P(# black = 0) = 0.0
P(# black = 1) = 7.27105875099e-11
P(# black = 2) = 1.62678304084e-05
P(# black = 3) = 0.0064179907059
P(# black = 4) = 0.144252566798
P(# black = 5) = 0.489542214277
P(# black = 6) = 0.324568275296
P(# black = 7) = 0.0349423938432
P(# black = 8) = 0.000260285286534
P(# black = 9) = 5.8895575883e-09
P(# black = 10) = 0.0

If you sum these up you'll find they add up to 1, as necessary. You'll note that our distribution has already started converging on a value for k of either 5 or 6, which is more or less what we'd expect. With sufficient data, we can expect that one distribution will emerge as the clear winner - adding more data is directly equivalent to doing another pass of the approximation algorithm, and could even be formulated in the same way if we were complete masochists.

Hope that helps everyone else as much as it just helped me.

## 7/23/2006

### Background Page: Genetic Algorithms

Description

Ever since Darwin, we've had a fairly good idea of how the complexity of the world's organisms has appeared. It's the result of a stochastic optimisation process known as natural selection, which turns out to be very good at producing efficient solutions to problems.

This raises an interesting question: can we harness this effect to produce better designs? What problems is this approach good at solving? How do we tailor the approach to be as effective as possible? These questions are encapsulated and answered in the study of genetic algorithms.

This isn't strictly a part of computational biology, but it's a technique that's widely used in this field (if only because CBists have generally come across GAs). See, for example, this paper.

Problem solved

How can we use evolutionary processes to solve optimisation problems?

Books/Resources

"Evolutionary Computing" - Kenneth A. De Jong. Barely started reading.

Posts

None so far.

## 7/22/2006

### Sequence comparison

(Bioinformatics)

OK, so I got word through from the guy in charge of the computer course about my little ethical crisis. He suggested that, instead of posting actual code, I instead merely include pseudocode for the bits covered by the course. Now why the heck didn't I think of that?

On with the show. A major part of bioinformatics involves the comparison of different genetic sequences. Say you have a working gene for vitamin C production, plus something that you think might be a broken copy of same. How do you figure out whether they are actually related?

The basic technique is to figure out how many mutations (insertions, deletions, substitutions) it takes to get from one to the other. There are many many variants on this basic principle - tailor-made modifications designed to account for a variety of changeable factors such as:

1) are certain substitutions more common than others?
2) how likely are insertions and deletions of various sizes? (coughZipf distributioncough)
3) do certain subsections match better than others?

For this post, I'll only cover the most simple case: all three types of mutation are equally likely, and indels (insertions and deletions) are always of length 1.

Cutting it down to size

OK, so we have a couple of strings (referred to as S and T) to align - we'll think of this as finding a way to mutate S so it ends up looking like T. There's no immediately-obvious way of solving this, so let's try a standard approach: see if we can reduce it to a simpler problem.

I'm going to look at a few cases and show how they reduce to various easier problems, with examples of situations where each reductive approach can be ideal. Hope people can follow! I will be using the Python slice notation extensively, so you may want to familiarise yourself with that first.

Matching

OK, so say our two strings are "CAAATAGCAA" and "CCGCGAGCAA". There's something you should notice here: we don't actually need to do anything with the last 5 digits! We can just say "hey, they match", and stop worrying about them. The problem then reduces itself to trying to align "CAAAT" with "CCGCG". Much simpler.

Formally: if S[-1] and T[-1] are identical, they may be replaced by S[:-1] and T[:-1] at no cost.

Substitution

Let's try a different set of strings. Say we have "AAATCCAGCA" and "AAATCCAGCT". Now, if you look at these, you'll see they differ by only the last element. Time for a substitution! If we transform the first string by substituting a T for the A, we'll then just be able to match the rest. Home and dry! This does "cost" us one mutation, but in this case that's a small price to pay.

Formally: S and T may be replaced by S[:-1] and T[:-1] at a cost of 1 mutation

Deletion

Our next two example strings are "GGGTCATCTTA" and "GGGTCATCTT". Now, it should be fairly obvious what you need to do to the first string to get it to match the second: just delete the "A".

Formally: S may be replaced with S[:-1] at a cost of 1 mutation

Insertion

Finally, consider the strings "CCGGAGACCT" and "CCGGAGACCTG". Our instinctive response here is to stick a G on the end of the first string. However, that presents us with a problem: if we're going to start adding stuff to strings, we're going to end up with longer strings. That ruins the whole idea of trying to simplify the problem.

There's a simple workaround - we just need to combine two operations. If we add a "G" and then immediately match the two "G"s away, we end up with the same result as if we'd simply dropped the "G" off the end of the second string. That's a lot more tractable, and incidentally is pleasingly symmetrical with the aforementioned operation of deletion.

Formally: T may be replaced by T[:-1] at a cost of 1 mutation

These are the only options for this model!

OK, so let's combine these four approaches into a single algorithm. What we're going to do is take our pair of strings and apply each of the four abovementioned alterations to them (using matching or substitution as applicable). This will provide us with three pairs of strings, at least one of which will be shorter, which we can then apply the same method to. Eventually both strings will sink to size zero in every single case. We'll then select the sequence of operations that gets us to this point via the fewest expensive mutations.

We can write this more formally. Refer to the "distance" (number of mutations) between S and T as D(S,T). Then we have the following:

D(S,T) = min{
D(S[:-1],T[:-1]) + s(S[-1],T[-1])
D(S[:-1],T) + 1
D(S,T[:-1]) + 1
}

where s(A,B) = 1 if A and B are different letters and 0 if A and B are the same (this represents the cost of either substituting B for A or matching them)

If we implement this as a function that takes two strings and returns the distance between them, we can simply design it so that it recurses through all possible substrings. Puzzle solved!

A problem

We have, however, missed something. What happens if, say, we're comparing "ATGGCTA" with ""? The string T[:-1] will now no longer exist, and the program will crash horribly when it tries to evaluate D(S,T[:-1]). Doh.

Fortunately, there's an easy solution. Since the only way of getting from "ATGGCTA" to "" is by deleting all the letters, we can just say that this has a cost of 7 mutations. A similar approach applies if the first string is empty. We'll need to write this into the code:

if S=="": D(S,T) = len(T)

elif T=="": D(S,T) = len(S)

else:
D(S,T) = min{
D(S[:-1],T[:-1]) + s(S[-1],T[-1])
D(S[:-1],T) + 1
D(S,T[:-1]) + 1
}

A computational note

If you decide to implement this alignment algorithm for yourself - and I strongly recommend you try it - you'll quickly notice a major problem with the standard recursive implementation of the algorithm. In short, it's damn slow for long strings. More precisely, since each distance calculation spawns three other distance calculations, the time required to run the thing will be somewhere on the order of 3 to the power of the length of the strings[1]. This sucks.

Fortunately, there's a much nicer approach, albeit one that requires a modicum more memory (try saying that three time fast...). The problem with the recursion is that it's wasting time by repeatedly calculating the same distance. Why not simply precalculate these values, and refer to them as necessary? One way of implementing this is by creating a grid such that element (i,j) of the grid is equal to D(S[:i],T[:j]). This can be populated by filling in the sides adjacent to D("","") and then working outwards from that corner.

A rather fuzzy mathematical note

It turns out, if I understand the book I'm currently reading correctly, that the problem solved here is part of a wider class of mathematical problems, which also includes things like determining RNA secondary structure. In particular, they're ones that in some rather abstract sense involve the concept of context-free grammars, and it turns out that you can solve them all in a similar fashion.

At present, I have sod-all idea how this works. However, as an incurable maths student, I'm very interested in exploring the theoretical underpinnings involved, and will probably rant on about this in some detail at a later date. Consider yourselves warned :)

The next step

In this post I've provided a fairly solid discussion of the basic alignment algorithm. The next post on this topic will be fairly brief, and will cover a phenomenon known as BLOSUM matrices that are used to fine-tune the alignment of two protein (not DNA) sequences.

[1] At some point it will actually be beneficial to have a detailed discussion of calculations of this sort. They come up in the real world a heck of a lot - for example, in cryptography, mathematicians can spend years of their lives trying to reduce an algorithm's runtime by a comparatively tiny-looking amount. The reason they do this is because that "tiny amount" can often be disproportionately important as the codes get harder to crack.

## 7/19/2006

I've been away for a while.

I have an excuse: my computer hates me.

More precisely, a standard upgrade caused it to crap its pants in at least 6 different ways and counting. X.org died in 4 different ways, 3 of which related to the scourge upon the earth that is NVidia's godawful policy regards open-sourcing their drivers. Xfce (the window manager) died in two ways, one of which involved apparently losing every single tweak I've ever made to its appearance. I'd just got it how I wanted it too.

This sucks.

On the other hand, at least I have a (relatively) functional command line now, rather than having to use my dad's old laptop. Time to get back to work.

I suspect my computer will shortly find a more serious way to die - the unseasonal heatwave is hitting it like a sledgehammer - but let's not contemplate that.

Think happy thoughts...

## 7/04/2006

### Surprisingly non-scary

The first edition of the newly-formed Bio::Blogs[1] blog carnival is up at Public Rambling. It's fairly short compared to many carnivals, but that's to be expected for a first edition.

For anyone who's wondering what I'm talking about: a blog carnival is a synopsis of interesting and relevant entries from different blogs, usually organised round a common theme. In this case, bioinformatics. Very cool.

[1] This is an amusing name if and only if you're a Perl geek

## 7/02/2006

### Personal update

Sadly I didn't get into the MPhil program. This was precisely what I'd expected, so it's not exactly a shock to the system.

I now have one major goal: to make them wish they had taken me. To this end, I'll be posting about more advanced topics, possibly skipping chunks of background material in the process. If I have time, or if anyone has special requests, I'll go back and post about this missing material, but the end goal is to bring myself up to scratch on the various elements of computational biology as quickly as humanly possible.

Assistance in achieving this goal is greatly appreciated. The small number of people who have found this blog have already been incredibly helpful. Advice on any of the following is gratefully received:

1) Suggestions for material to look at, for example:
a) books
b) papers
c) articles
d) programs

2) critical analysis of posts[1], in particular:
b) accuracy

3) Absolutely anything else that you feel would improve either the blog or my understanding

And once I'm up to scratch on all fronts, and my blog is getting hits from millio thousa hundreds of people who are interested in learning more about CB from the educational resource I've produced, maybe someone from the Cambridge CB dept will wander across it, and read for a bit, and absentmindedly think "shame we didn't accept this guy's offer".

And I'll be satisfied.

[1] Please consider this an open invitation to be as bitchy as you please - I won't take it personally :)

## 6/29/2006

Tomorrow I graduate. I also get kicked out of uni accommodation, with the consequence that I won't be around much over the next week or so.

As a final pressie for my readers before I vanish, here is the simulation of speciation by genetic drift that I was talking about: http://coalescent.freewebpage.org/popgen/speciation1.py

This is the longest program I've produced yet, so when I get back online I will probably end up:
1) producing decent documentation
2) possibly converting it to an object-oriented approach to organisms
3) writing several posts discussing its behaviour in various circumstances
4) upgrading the program in a range of interesting ways, such as by adding a graphical interface

Until then, have fun :)

### A cry for help

Could someone help me here? I'm planning to write a program to model drift-induced speciation. However, to do this I need two statistics that I don't have.

Firstly, I need to know how the chance of two organisms mating is likely to behave as a function of the distances between their birth locations. Obviously this is going to vary heavily from species to species - eagles will cover more ground than sloths. I would expect the distribution to be an exponential decay, with the parameter representing the organism's migratory ability. Can anyone confirm or correct?

Secondly, I need to know how the ability of two organisms to mate varies as a function of the accumulated genetic differences between them. On the face of it, this is a daft question - in the real world, some differences will have an impact so much greater than others as to make the distribution virtually meaningless. However, some kind of average would still be handy.

Again, my inclination would be to model this as an exponential curve, with the parameter being something that a Real Scientist would determine through experiment. It's presumably affected by genome compactness, though, so I can set it to pretty much any value I want for experimental purposes.

Can anyone help? I appreciate that this is research I should really be doing for myself, but frankly I don't have the first clue as to where to look.