Bio/statistics handout 10: Binomial and Poisson applications
My
purpose in this handout is to provide some examples of how the binomial
probability function and the Poisson function arise.
a) Point statistics: Suppose that you do an experiment N times and find that a certain event occurs m times out of the N experiments. Can we determine from these data a probability for the event to occur?
If we assume that the N experiments are identical in set up, and that the appearance of the event in any one has no bearing on its appearance in any other, then we are led to propose the following hypothesis: The event occurs in any given experiment with probability q (to be determined), and so the probability that some n ≤ N events occur in N experiments is given by the q-version of the binomial function, thus

P_q(n) = (N!/(n!(N-n)!)) q^n (1-q)^(N-n) .            (10.1)
The question now is: What value should be used for q?
The use of experimental data to estimate a single parameter (q in this case) is an example of what is called point statistics. Now, it is important for you to realize that there are various ways to obtain a ‘reasonable’ value to use for q. Here are some:
•  Since we found m events in N trials, take the value of q that gives m for the mean of the probability function in (10.1). With reference to (9.18) in Handout 9, this choice for q is q = m/N.
•  Take q so that n = m is the integer with the maximal probability. If you recall (9.20) from Handout 9, this entails taking q so that both

P_q(m)/P_q(m+1) > 1 and P_q(m-1)/P_q(m) < 1.            (10.2)

This then implies that m/(N+1) < q < (m+1)/(N+1). Note that q = m/N satisfies these conditions.
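Both recipes can be checked numerically. Here is a short Python sketch with made-up data (m = 30 events in N = 100 trials); the function name is my own:

```python
from math import comb

def binom_pmf(q, N, n):
    # The q-version of the binomial function, as in (10.1)
    return comb(N, n) * q**n * (1 - q)**(N - n)

# Made-up data: the event occurred m = 30 times in N = 100 trials
N, m = 100, 30
q_hat = m / N  # the mean-matching estimate: the mean Nq then equals m

# The same q also satisfies the mode conditions of (10.2)
assert binom_pmf(q_hat, N, m) / binom_pmf(q_hat, N, m + 1) > 1
assert binom_pmf(q_hat, N, m - 1) / binom_pmf(q_hat, N, m) < 1
print(q_hat)  # 0.3
```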
b) P-value and bad choices: A
different approach asks for the bad choices of q rather than the ‘best’
choice. The business of ruling out
various choices for q is more in the spirit of the scientific method. Moreover, giving the unlikely choices for q
is usually much more useful to others than simply giving your favorite
candidate. What follows explains how
statisticians determine the likelihood that a given choice for q is
realistic.
For this purpose, suppose that we have some preferred value for q. There is some general agreement that q is not a reasonable choice in the case that there is small probability, as computed by the q-version of (10.1), of there being m occurrences of the event of interest. To make the notion of ‘small probability’ precise, statisticians have introduced the notion of the ‘P-value’ of a measurement. This is defined with respect to some hypothetical probability function, such as our q-version of (10.1). In our case, the P-value of m is the probability for the subset of numbers n ∈ {0, 1, . . . , N} that are at least as far from the mean as is m. For example, m has P-value p in the case that (10.1) assigns probability p to the set of integers n that obey |n - Nq| ≥ |m - Nq|.
A P-value that is less than 0.05 is deemed ‘significant’ by
statisticians. This is to say that if m
has such a P-value, then q is likely to be incorrect.
In general, the definition of the P-value for a measurement is along the same lines:

Definition: Suppose that a probability function on the set of possible measurements for some experiment is proposed. The P-value of any given measurement is the probability for the subset of measurement values that lie as far from the mean as, or farther than, the given measurement. The P-value is deemed significant if it is smaller than 0.05.

For example, if m in our binomial case is larger than the mean Nq, then a significant P-value requires in particular that

∑_{n≥m} P_q(n) < 0.05 .            (10.3)
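Since the sample space {0, . . . , N} is finite, the two-sided P-value can be computed exactly by summing (10.1) over every n at least as far from the mean Nq as the measurement m. A Python sketch with a made-up coin example (the function names are mine):

```python
from math import comb

def binom_pmf(q, N, n):
    return comb(N, n) * q**n * (1 - q)**(N - n)

def p_value(q, N, m):
    # Sum the probability of every n in {0,...,N} that lies at least
    # as far from the mean Nq as the measurement m does
    mean = N * q
    return sum(binom_pmf(q, N, n) for n in range(N + 1)
               if abs(n - mean) >= abs(m - mean))

# Made-up example: m = 35 heads in N = 100 tosses of a supposedly fair coin
print(p_value(0.5, 100, 35) < 0.05)  # True: q = 1/2 looks like a bad choice
```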
An estimate for the P-value can be had using the Theorem in Section c) of Handout 9. As you might recall, this theorem invokes the standard deviation, σ, as it asserts that the probability of finding a measurement at distance Rσ or more from the mean is less than R^-2. Granted this, a measurement that differs from the mean by 5σ or more has probability less than 1/25 = 0.04, and so has a significant P-value. Such being the case, the 5σ bound is often used in lieu of the 0.05 bound.
To return to our binomial case, to say that m differs from the mean, Nq, by at least 5σ, is to say that

|m - Nq| ≥ 5 (Nq(1-q))^1/2 .            (10.4)
We should consider q to be a ‘bad’ choice in the case that (10.4) holds.
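A minimal Python check of (10.4), again with invented numbers:

```python
from math import sqrt

def bad_choice(q, N, m):
    # True when (10.4) holds: m sits 5 or more standard deviations from Nq
    sigma = sqrt(N * q * (1 - q))
    return abs(m - N * q) >= 5 * sigma

# Invented data: m = 30 events in N = 100 trials
print(bad_choice(0.7, 100, 30))  # True: the gap is 40, while 5*sigma is about 22.9
print(bad_choice(0.3, 100, 30))  # False: m sits exactly at the mean
```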
c) A binomial example using DNA: As you may recall, a strand of a DNA molecule consists of a chain of smaller molecules tied end to end. Each small molecule in the chain is one of four types, these labeled A, T, G and C. Suppose we see that A appears some n times on some length N strand of DNA. Is this unusual?
To make this question precise, we have to decide what is ‘usual’, and this means choosing a probability function for the sample space whose elements consist of all length N strings of letters, where each letter is either A, C, G or T. For example, the assumption that the appearances of any given molecule on the DNA strand are occurring at random suggests that we take the probability of any given letter A, C, G or T appearing at any given position to be 1/4. Thus, the probability that A does not appear at any given location is 3/4, and so the probability that there are n appearances of A in a length N string (if our random model is correct) would be given by the q = 1/4 version of the binomial function in equation (10.1).
This information by itself is not too useful. A more useful way to measure whether n appearances of A is unusual is to ask for the probability in our standard model for more (or less) appearances of A to occur. This is to say that if we think that there are too many A’s for the appearance to be random, then we should consider the probability, as determined by our binomial function, of there being at least this many A’s appearing. Thus, we should be computing the P-value of the measured number, n. In the binomial case with q = 1/4, this means computing

∑_{k∈B} (N!/(k!(N-k)!)) (1/4)^k (3/4)^(N-k)            (10.5)

where the sum is over all integers k from the set, B, of integers in {0, . . . , N} that obey |k - N/4| ≥ |n - N/4|.
As this sum might be difficult in any given case, we can also resort to using the fact that the probability of being R standard deviations from the mean is less than R^-2. In the case at hand, the standard deviation, σ, is (N·(1/4)·(3/4))^1/2 = (1/4)(3N)^1/2, and so the set of integers k that obey |k - N/4| > (R/4)(3N)^1/2 has probability less than R^-2. Taking R = 5, we see that our value for n has P-value less than 0.05 if the measured value of n obeys |n - N/4| ≥ (5/4)(3N)^1/2. In this regard, never forget that the P-value is defined with respect to an underlying theoretical proposal for a particular probability function. Thus, a significant P-value kills the theory.
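The sum in (10.5) is finite, so a computer can do it exactly. Here is a Python sketch for a hypothetical strand with N = 400 sites on which A appears n = 140 times (the random model predicts a mean of N/4 = 100):

```python
from math import comb

def dna_p_value(N, n):
    # P-value of n appearances of A under the random q = 1/4 model:
    # the sum (10.5) over the set B of integers at least as far from N/4 as n
    mean = N / 4
    return sum(comb(N, k) * (1 / 4)**k * (3 / 4)**(N - k)
               for k in range(N + 1)
               if abs(k - mean) >= abs(n - mean))

p = dna_p_value(400, 140)
print(p < 0.05)  # True: 140 A's in 400 sites is fatal for the random model
```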
Note that our result from the preceding paragraph for this DNA example can be framed as follows: The measured fraction, n/N, of occurrences of A has significant P-value in our random model in the case that

|n/N - 1/4| > (5/4)(3/N)^1/2 .            (10.6)
You should note here that as N gets bigger, the right hand side of this last inequality gets smaller. Thus, as N gets bigger, the experiment must find the ratio n/N ever closer to 1/4 so as to forestall the death of our hypothesis about the random occurrences of the constituent molecules on the DNA strand.
d) An example using the Poisson function: All versions of the Poisson probability function are defined on the set {0, 1, 2, …}. As noted in the previous handout, a particular version is determined by a choice of a positive number, t. The Poisson probability for the given value of t is:

P_t(n) = (t^n/n!) e^-t .            (10.7)
Here is a suggested way to think about P_t:

P_t(n) gives the probability of seeing n occurrences of a particular event in any given unit time interval when the occurrences are unrelated and they average t per unit time.            (10.8)
Here is an example that doesn’t come from Biology but is nonetheless dear to my heart: I like to go star gazing, and over the years, I have noted an average of 1 meteor per night. Tonight I go out and see 5 meteors. Is this unexpected given the hypothesis that the appearances of any two meteors are unrelated? To test this hypothesis, I should compute the P-value of n = 5 using the t = 1 version of (10.7). Since the mean of P_t is t, this involves computing

(∑_{m≥5} 1/m!) e^-1 = 1 - (1 + 1 + 1/2! + 1/3! + 1/4!)·e^-1 .            (10.9)

My trusty computer can compute this, and I find that P(5) ≤ 0.004. Thus, my hypothesis of the unrelated and random occurrence of meteors is unlikely to be true.
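The sum in (10.9) is easy to reproduce; a Python sketch:

```python
from math import exp, factorial

def poisson_pmf(t, n):
    # The t-version of the Poisson function, as in (10.7)
    return t**n * exp(-t) / factorial(n)

# P-value of seeing 5 meteors when the nightly average is t = 1: since the
# mean is 1, the integers at least as far from the mean as 5 are n >= 5
p = 1 - sum(poisson_pmf(1, n) for n in range(5))
print(round(p, 4))  # 0.0037
```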
What follows is an example from
biology, this very relevant to the theory behind the ‘genetic clocks’ that
predict the divergence of modern humans from an African ancestor some 100,000
years ago. To start the story, there is
the notion of a ‘point mutation’ of a DNA molecule. This occurs when the molecule is copied for
reproduction when a cell divides; it involves the change of one letter in one
place on the DNA string. Such changes,
cellular typographical errors, occur with very low frequency under
non-stressful conditions. Environmental stresses tend to increase the frequency of such mutations. In any event, under normal circumstances, the average point mutation rate per site on a DNA strand, per generation, has been determined via experiments. Let m denote this rate. The average number of
point mutations per generation on a segment of DNA with N sites on it is thus mN. In T ≥ 1 generations, the average
number of mutations in this N-site strand is thus mNT.
Now, make the following assumptions:

•  The occurrence of any one mutation on the given N-site strand has no bearing on the occurrence of another.
•  Environmental stresses are no different now than in the past.
•  The strand in question can be mutated at will with no effect on the organism’s reproductive success.            (10.10)

Granted these assumptions, the probability of seeing n mutations in T generations on this N-site strand of DNA is given by the t = mNT version of the Poisson probability:

P_mNT(n) = ((mNT)^n/n!) e^-mNT .            (10.11)
The genetic clock idea exploits this formula in the following manner: Suppose that two closely related species diverged from a common ancestor some unknown number of generations in the past. This is the number we want to estimate; call it R. Today, a comparison of the N-site strand of DNA in the two organisms finds that they differ by mutations at n sites. The observed mutations have arisen over the course of T = 2R generations; that is, there are R generations worth of mutations in the one species and R in the other, so 2R in all. We next say that R is a reasonable guess if the t = mN(2R) version of the Poisson function gives n a P-value greater than 0.05. For this purpose, remember that the mean of the t version of the Poisson probability function is t.
We might also just look for the values of R that make n within 5 standard deviations of the mean for the t = mN(2R) version of the Poisson probability. Since the square of the standard deviation of the t version of the Poisson probability function is also t, this is equivalent to the demand that |2mNR - n| ≤ 5(2mNR)^1/2. This last gives the bounds

((4n+25)^1/2 - 5)^2/(8mN) ≤ R ≤ ((4n+25)^1/2 + 5)^2/(8mN) .            (10.12)
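The bounds in (10.12) come from solving |2mNR - n| ≤ 5(2mNR)^1/2, which is a quadratic in (2mNR)^1/2. A Python sketch with invented numbers (n = 100 differing sites and mN = 0.001 mutations per generation):

```python
from math import sqrt

def clock_bounds(n, mN):
    # Solve |2*mN*R - n| <= 5*sqrt(2*mN*R) for R: writing u = sqrt(2*mN*R),
    # the quadratics u**2 - 5*u - n <= 0 and u**2 + 5*u - n >= 0 give:
    lo = (sqrt(4 * n + 25) - 5)**2 / (8 * mN)
    hi = (sqrt(4 * n + 25) + 5)**2 / (8 * mN)
    return lo, hi

lo, hi = clock_bounds(100, 0.001)
print(round(lo), round(hi))  # roughly 30,000 to 82,000 generations
```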
Exercises:
1. Define a probability function, P, on {0, 1, 2, …} by setting P(n) = (1/2)(1/2)^n. What is the P-value of 5?
2. Suppose we lock a monkey in a room with a word processor, come back some hours later and see that the monkey has typed N lower case characters. Suppose this string of N characters contains consecutive characters that read:
professor taubes is a jerk
Is this monkey onto something? Or is this just a chance occurrence? To decide, note that this string has 26 characters. The monkey’s word processor keyboard allows 48 lower case characters including the space bar. Assume that the monkey is typing at random, and give the probability that this string appears in the N characters. Estimate (within a power of 10) an upper bound for N below which this string has significant P-value.*
* This is not quite the correct question to ask since we would be surprised (or maybe not) by any string that had ‘taubes’ in a derogatory fashion. True, it is very unlikely that this particular string arises. Somewhat more likely, some string with ‘taubes’ arises. In particular, such a string has a reasonable chance when N is on the order of 100 billion.