Bio/statistics Handout 1: Basic Notions
What follows is a brief summary of the basics of probability theory. Some exercises appear at the end.
• Sample space: This is just terminology. A sample space is the set of all possible outcomes of whatever ‘experiment’ you are doing.
For example, if you are flipping a coin 3 times, the sample space is

S = {TTT, TTH, THT, HTT, THH, HTH, HHT, HHH}.   (1.1)
If you are considering the possible years of age of a human being, the sample space consists of the non-negative integers up to 150. If you are considering the possible birthdates of a person drawn at random, the sample space consists of the days of the year, thus the integers from 1 to 366. If you are considering the possible birthdates of two people selected at random, the sample space consists of all pairs of the form (j, k) where j and k are integers from 1 to 366.
To reiterate: S is just the collection of all conceivable outcomes.
• Events: An event is a subset of the sample space, thus a subset of possible outcomes for your experiment. For example, if S is the sample space for flipping a coin three times, then {HTH} is an event. The event that a head appears on the first flip is the four-element subset {HTT, HHT, HTH, HHH}.
Thus, an event is simply a certain subset of the possible outcomes.
• Axiomatic definition of probability: A probability function on a sample space is, by definition, an assignment of a non-negative number to every subset of S subject to the following rules:

P(S) = 1 and P(A∪B) = P(A) + P(B) when A∩B = ∅.   (1.2)
Here, the notation is as follows: A subset of S is a collection of its elements. If A and B are subsets of S, then A∪B is the subset of elements that are in A or in B. Meanwhile, A∩B is the subset of elements that are in both A and B. Finally, ∅ is the subset with no elements, deemed the ‘empty set’. Note that A∪B is said to be the union of A and B, while A∩B is said to be the intersection of A and B.
Note that the condition P(S) = 1 says that there is probability 1 of at least something happening. Meanwhile, the condition P(A∪B) = P(A) + P(B) when A and B have no points in common asserts the following: The probability of something happening that is in either A or B is the sum of the probability of something happening from A and the probability of something happening from B.
There is a general rule: If you know what P assigns to each element in S, then you know P on every subset: Just add up the probabilities that are assigned to its elements. This assumes that S is a finite set. We’ll discuss what happens when it isn’t later in the course. Anyway, the preceding illustrates the more intuitive notion of probability that we all have: It says simply that if you know the probability of every outcome, then you can compute the probability of any subset of outcomes by summing up the probabilities of the outcomes that are in the subset.
For example, if S is the set of outcomes for flipping a fair coin three times (as depicted in (1.1)), then each of its elements has P(·) = 1/8, and then we can use the rule in (1.2) to assign probabilities to any given subset of S. For example, the subset given by {HHT, HTH, THH} has probability 3/8, since

P({HHT, HTH, THH}) = P({HHT, HTH}) + P(THH)

by invoking (1.2). Invoking it a second time finds P({HHT, HTH}) = P(HHT) + P(HTH), and so

P({HHT, HTH, THH}) = P(HHT) + P(HTH) + P(THH) = 3/8.
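The rule just described (sum the probabilities assigned to a subset's elements) can be checked with a short Python sketch; the use of `Fraction` for exact arithmetic is just an implementation convenience, not part of the handout:

```python
from itertools import product
from fractions import Fraction

# All 8 outcomes of three fair coin flips, each assigned probability 1/8.
S = [''.join(flips) for flips in product('HT', repeat=3)]
P = {outcome: Fraction(1, 8) for outcome in S}

# Probability of a subset = sum of the probabilities of its elements.
def prob(event):
    return sum(P[outcome] for outcome in event)

print(prob({'HHT', 'HTH', 'THH'}))  # 3/8
```

Note that `prob` applied to the whole sample space returns 1, as rule (1.2) requires.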
Here are some consequences of the definition of probability:

a) P(∅) = 0.
b) P(A∪B) = P(A) + P(B) − P(A∩B).
c) P(A) ≤ P(B) if A ⊂ B.
d) P(B) = P(B∩A) + P(B∩Ac).
e) P(Ac) = 1 − P(A).
(1.3)
In the
preceding, Ac is the set of elements that are not in A. The set Ac
is called the ‘complement’ of A.
I want to stress that all of these conditions are simply
translations into symbols of intuition that we all have about
probabilities. Here are the respective
English versions of (1.3):
a) The probability that no outcomes appear is
zero. This is to say that if S is the list of all possible outcomes, then
at least one outcome must appear.
b) The probability that an outcome is in either A or B is the probability that it is in A plus the probability that it is in B minus the probability that it is in both. The point here is that if A and B have elements in common, then one is overcounting by just summing the two probabilities. If you doubt this, try the case where A = B.
c) The probability of an outcome from A is no greater than that of an outcome from B in the case that all outcomes from A are contained in the set B.
d) The probability of an outcome from the set B is the sum of the probability that the outcome is in the portion of B that is contained in A and the probability that the outcome is in the portion of B that is not contained in A.
e) The probability of an outcome that is not in A is 1 minus the probability that an outcome is in A.
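The five consequences in (1.3) can be verified exhaustively on a small sample space. The three-outcome space and its probabilities below are made-up numbers chosen only so that they sum to 1:

```python
from itertools import combinations
from fractions import Fraction

# Hypothetical 3-outcome sample space with unequal probabilities summing to 1.
p = {'x': Fraction(1, 2), 'y': Fraction(1, 3), 'z': Fraction(1, 6)}
S = frozenset(p)

def P(A):
    return sum(p[s] for s in A)

# Every subset of S (there are 8 of them).
subsets = [frozenset(c) for r in range(4) for c in combinations(S, r)]

assert P(frozenset()) == 0                            # (a)
for A in subsets:
    assert P(S - A) == 1 - P(A)                       # (e), with Ac = S - A
    for B in subsets:
        assert P(A | B) == P(A) + P(B) - P(A & B)     # (b)
        if A <= B:
            assert P(A) <= P(B)                       # (c)
        assert P(B) == P(B & A) + P(B & (S - A))      # (d)
print("all identities (a)-(e) hold")
```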
• Conditional probability: This is the probability that an outcome from A occurs given that you already know that an outcome from B occurs. It is denoted by P(A|B), and it is a probability assignment for S that is typically not the same as the original one, P. The rule for computing this new probability is

P(A|B) ≡ P(A∩B)/P(B).   (1.4)
You can check that this obeys all of the rules for being a probability. In English, this says: The probability of an event occurring from A given that the event is in B is the probability of the event being in both A and B divided by the probability of the event being in B in the first place.
Another way to view this notion is as follows: Since we are told that the event B happened, we can shrink the sample space from the whole of S to just the elements that define the event B. The probability of A given that B happened is then the probability assigned to the part of A in B (thus, P(A∩B)) divided by P(B). In this regard, the division by P(B) is done to make the conditional probability of B given that B happened equal to 1.
Anyway, here is an example: Suppose we want the conditional probability of a head on the last flip granted that there is a head on the first flip. Use B to denote the event that there is a head on the first flip. Then P(B) = 1/2. The conditional probability that there is a head on the final flip given that you know there is one on the first flip is obtained using (1.4). Here, A is the event that there is a head on the final flip, thus the set {TTH, THH, HTH, HHH}. Its intersection with B is A∩B = {HTH, HHH}. This set has probability 1/4, so our conditional probability is (1/4)/(1/2) = 1/2.
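Definition (1.4) can be illustrated on this same example by enumerating the sample space from (1.1); a minimal Python sketch:

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin flips, each outcome with probability 1/8.
S = [''.join(f) for f in product('HT', repeat=3)]
P = {s: Fraction(1, 8) for s in S}

def prob(event):
    return sum(P[s] for s in event)

B = {s for s in S if s[0] == 'H'}   # head on the first flip
A = {s for s in S if s[2] == 'H'}   # head on the last flip

cond = prob(A & B) / prob(B)        # rule (1.4): P(A|B) = P(A∩B)/P(B)
print(cond)  # 1/2
```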
• That’s all there is to probability: You have just seen most of probability theory for sample spaces with a finite number of elements. There are a few new notions that are introduced later, but a good deal of what follows concerns either various consequences of the notions that were just introduced, or else various convenient ways to calculate probabilities that arise in common situations.
• Decomposing a subset to compute probabilities: It is often the case (as we will see) that it is easier to compute conditional probabilities. This can be used to one’s advantage in the following situation: Suppose that S is decomposed into a union of some number, N, of subsets that have no elements in common: S = ∪_{1≤j≤N} Aj, where {Aj}_{1≤j≤N} are subsets of S with Aj∩Aj′ = ∅ when j ≠ j′. Now suppose that A is any given set. Then

P(A) = ∑_{1≤j≤N} P(A|Aj)·P(Aj).   (1.5)
In words, this says the following: The probability of A is the probability that an outcome from A occurs that is in A1, plus the probability that an outcome from A occurs that is in A2, plus …
By the way, do you recognize (1.5) as a linear equation? You might if you denote P(A) by y, each P(Aj) by xj, and each P(A|Aj) by aj, so that this reads

y = a1x1 + a2x2 + · · · + aNxN.

Thus, linear systems arise!
Here is an example: Suppose we have a stretch of DNA of length N, and we want to know the probability of not seeing the base G = Guanine in this stretch. Let A = the event of this happening for a length N stretch and B = the event for a length N−1 stretch. If each of the four bases has equal probability of appearing, then P(A|B) = 3/4. Thus, writing PN for the probability in question, we learn that PN = (3/4)·PN−1, and so we can iterate this taking N = 1, N = 2, etc. to find the general formula PN = (3/4)^N.
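The iteration PN = (3/4)·PN−1 can be sketched directly, under the handout's assumption that the four bases are equally likely and sites are independent:

```python
from fractions import Fraction

# Probability that a length-N stretch of DNA contains no G, assuming the
# four bases are equally likely and sites are independent.
def prob_no_G(N):
    p = Fraction(1)
    for _ in range(N):
        p *= Fraction(3, 4)   # the recursion P_N = (3/4)·P_{N-1}
    return p

print(prob_no_G(1))  # 3/4
print(prob_no_G(3))  # 27/64
```

Iterating the loop reproduces the closed form (3/4)^N.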
Here is more linear algebra: Let {Aj}j=1,2,3,4 denote the events that a given site in DNA has base {A, G, C, T} = {1, 2, 3, 4}. Let {Bk}k=1,…,4 denote the analogous events for the adjacent site to the 5′ end of the DNA. (The ends of a DNA molecule are denoted 3′ and 5′ for reasons that have to do with a tradition of labelling carbon atoms on sugar molecules.) According to the rule in (1.5), we must have

P(Aj) = ∑k P(Aj|Bk)·P(Bk).

So, we have a 4 × 4 matrix M whose entry in row j and column k is P(Aj|Bk). Now write each P(Aj) as yj and each P(Bk) as xk, and this last equation reads yj = ∑k Mjk·xk.
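The matrix equation yj = ∑k Mjk·xk can be sketched numerically. The entries of M and the base probabilities below are hypothetical numbers chosen only so that each column of M sums to 1; they are not measured DNA statistics:

```python
from fractions import Fraction

# Hypothetical conditional probabilities M[j][k] = P(A_j|B_k); column k lists
# the probabilities of the four bases at a site given base k at its 5' neighbor,
# so each column must sum to 1.
M = [[Fraction(a, 10) for a in row] for row in
     [[4, 2, 2, 2],
      [2, 4, 2, 2],
      [2, 2, 4, 2],
      [2, 2, 2, 4]]]

# Hypothetical base probabilities x_k = P(B_k) at the neighboring site.
x = [Fraction(1, 4)] * 4

# y_j = sum_k M[j][k]·x[k], i.e. the rule P(A_j) = sum_k P(A_j|B_k)·P(B_k).
y = [sum(M[j][k] * x[k] for k in range(4)) for j in range(4)]
print(y)
```

Since the y's are probabilities of the four mutually exclusive bases, they must sum to 1, which is a useful sanity check on any choice of M and x.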
• Independent events: An event A is said to be independent of B in the case that

P(A|B) = P(A).

In English: Events A and B are independent when the probability of A given B is the same as that of A with no knowledge about B. Thus, whether the outcome is in B or not has no bearing on whether it is in A.
Here is an equivalent definition: Events A and B are independent when P(A∩B) = P(A)·P(B). This is equivalent because P(A|B) = P(A∩B)/P(B). Note that the equality between P(A∩B) and P(A)·P(B) implies that P(B|A) = P(B). Thus, independence is symmetric. Here is the English version of this equivalent definition: Events A and B are independent in the case that the probability of an outcome being both in A and in B is the product of the probability that it is in A and the probability that it is in B.
For an example, take A to be the event that a head appears on the first coin toss and B the event that it appears on the third. Are these events independent? Well, P(A) is 1/2, as is P(B). Meanwhile, P(A∩B) = 1/4, which is P(A)·P(B). Thus, they are indeed independent.
For a second example, consider A to be the event that a head appears on the first toss and B the event that a tail appears on the first toss. Then A∩B = ∅, so P(A∩B) is zero, but P(A)·P(B) = 1/4. So, these two events are not independent. (Are you surprised?)
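Both examples can be confirmed by enumerating the coin-flip sample space from (1.1); a short sketch:

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin flips, each outcome with probability 1/8.
S = [''.join(f) for f in product('HT', repeat=3)]
P = {s: Fraction(1, 8) for s in S}

def prob(event):
    return sum(P[s] for s in event)

A = {s for s in S if s[0] == 'H'}   # head on the first flip
B = {s for s in S if s[2] == 'H'}   # head on the third flip
C = {s for s in S if s[0] == 'T'}   # tail on the first flip

# First example: A and B are independent.
assert prob(A & B) == prob(A) * prob(B)
# Second example: A and C are disjoint, hence not independent.
assert prob(A & C) == 0 != prob(A) * prob(C)
print("independence checks pass")
```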
Here is food for thought:
Is it reasonable to suppose that the probability of seeing the base G at
a given site is independent of seeing it at the next site in a stretch of
DNA? Check out the DNA code for most
commonly used amino acids and let me know.
• Bayes’ theorem: Suppose that A and B are given subsets of S. If we know the probability of A given B, how can we compute the probability of B given A? This is a typical issue: What does knowledge of the outcomes say about the probable ‘cause’?
Here is a typical example:
You observe a distribution of traits in the human population today and
want to use this information to say something about the distribution of these
traits in an ancestral population. Thus,
you have the probabilities of the ‘outcomes’ and want to discern those of the
‘causes’.
In any event, to reverse ‘cause’ and ‘effect’, use the equalities

P(A|B) = P(A∩B)/P(B) and P(B|A) = P(A∩B)/P(A)

to write

P(B|A) = P(A|B)·P(B)/P(A).   (1.6)

This is the simplest form of ‘Bayes’ theorem’.
For example, you flip a coin three times and find that heads occurs twice. What is the probability that heads appeared on the first flip? Let A = the event that heads appears exactly twice in 3 flips and B = the event that it appears on the first flip. We are asked for P(B|A). Now, P(A∩B) = 1/4, since if we know heads happens on the first flip, then we can get two heads only with {HHT} and {HTH}; these events are disjoint and their probabilities sum to 1/4. Meanwhile, P(B) = 1/2, so P(A|B) = (1/4)/(1/2) = 1/2. On the other hand, P(A) = 3/8 since A = {HHT, HTH, THH}. Thus, (1.6) finds that the probability of interest, P(B|A), is equal to (1/2)·(1/2)/(3/8) = 2/3.
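The whole computation can be reproduced by enumeration; a minimal sketch of rule (1.6):

```python
from itertools import product
from fractions import Fraction

# Sample space for three fair coin flips, each outcome with probability 1/8.
S = [''.join(f) for f in product('HT', repeat=3)]
P = {s: Fraction(1, 8) for s in S}

def prob(event):
    return sum(P[s] for s in event)

A = {s for s in S if s.count('H') == 2}   # exactly two heads in three flips
B = {s for s in S if s[0] == 'H'}         # head on the first flip

# Bayes' theorem (1.6): P(B|A) = P(A|B)·P(B)/P(A)
p_A_given_B = prob(A & B) / prob(B)
p_B_given_A = p_A_given_B * prob(B) / prob(A)
print(p_B_given_A)  # 2/3
```

Note that P(A|B)·P(B) collapses back to P(A∩B), so the answer agrees with computing P(A∩B)/P(A) directly.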
An iterated form of Bayes’ theorem: Suppose next that

S = ∪_{1≤k≤N} Ak

is the union of N pairwise disjoint subsets. Suppose that A is a given subset of S and we know that an outcome from A appears. Given this knowledge, what is the probability that the outcome was from some given Ak? For example, S could be the various diseases and A the set of diseases where the lungs fill with fluid. Take A1 to be pneumonia, A2 to be ebola viral infection, A3 to be West Nile viral infection, etc. An old man dies and the autopsy finds that the cause of death was the filling of the lungs. What is the probability that death was due to West Nile viral infection? We are interested in P(A3|A). We know the death rates of the various diseases, so we know P(A|Ak) for all k. This is the probability of the lungs filling with fluid if you have the disease that corresponds to Ak. Suppose we also know P(Ak) for all k; this is the probability of catching the disease that corresponds to Ak. How can we use the latter to compute P(A3|A)? This is done using the following chain of equalities: First,

P(A3|A) = P(A3∩A)/P(A) = P(A|A3)·P(A3)/P(A).
Second, we have

P(A) = ∑k P(A∩Ak) = ∑k P(A|Ak)·P(Ak).

Together, these last two equalities imply the desired one:

P(A3|A) = P(A|A3)·P(A3)/[∑_{1≤k≤N} P(A|Ak)·P(Ak)].
This is the general form of ‘Bayes’ theorem’:

P(An|A) = P(A|An)·P(An)/[∑_{1≤k≤N} P(A|Ak)·P(Ak)],   (1.7)

and it allows you to figure out the probability of An given that A occurs from the probabilities of the various Ak and the probability that A occurs given that any one of these Ak occurs.
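Formula (1.7) is easy to sketch in code. The priors P(Ak) and likelihoods P(A|Ak) below are hypothetical numbers for three diseases, not real medical statistics:

```python
from fractions import Fraction

# Hypothetical priors P(A_k) for three diseases (they sum to 1) and
# hypothetical likelihoods P(A|A_k) of the lungs filling with fluid
# given each disease.
prior = [Fraction(6, 10), Fraction(3, 10), Fraction(1, 10)]
likelihood = [Fraction(7, 10), Fraction(2, 10), Fraction(5, 10)]

# Denominator of (1.7): P(A) = sum_k P(A|A_k)·P(A_k), i.e. rule (1.5).
p_A = sum(l * p for l, p in zip(likelihood, prior))

# Posterior P(A_n|A) for each n, per (1.7).
posterior = [l * p / p_A for l, p in zip(likelihood, prior)]
print(posterior)
```

Since the Ak are disjoint and exhaust S, the posteriors always sum to 1 regardless of the numbers chosen.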
Exercises:
1. Suppose we have an experiment with three possible outcomes, labeled 1, 2, and 3.
Suppose in addition, that we do the experiment three successive times.
a) Give the sample space for the possible outcomes of the three experiments.
b) Write down the subset of your sample space that corresponds to the event that
outcome 1 occurs in the second experiment.
c) Suppose that we have a theoretical model of the
situation that predicts equal
probability for any of the three outcomes for any one given experiment. Our model also says that the event that outcome k appears in one experiment and
outcome j in another are independent. Use these facts to give the probability of three successive experiments getting as outcome any given triple (i, j, k) with i either 1, 2 or 3, and with j and k likewise constrained.
d) Which is more likely: getting exactly two identical outcomes in the three
experiments, or getting three distinct outcomes in the three experiments?
2. Suppose that 1% of Harvard students have a particular mutation in a certain protein, that 20% of people with this mutation have trouble digesting lactose, and that 5% of Harvard students have trouble digesting lactose. If a Harvard student has trouble digesting lactose, what is the probability that the student has the particular mutation? (Hint: Think Bayes’ theorem.)
3. A certain experiment has N ≥ 2 possible outcomes. Let S1 denote the corresponding sample space. Suppose that k is a positive integer and that p ∈ (0, 1).
a) How many elements are in the sample space for
the possible outcomes of k separate
repeats of the experiment?
b) Suppose that N ≥ 2 and that we have a theoretical model that predicts that one particular outcome, ô ∈ S1, has probability p and all others have probability (1−p)/(N−1). Suppose that we run the experiment twice and that our model predicts that the event of getting any given outcome in the first run is independent from the event of getting any particular outcome in the second. How big must p be before it is more probable to get ô twice as opposed to never?
c) How big must p be before it is more probable to get ô twice in two
consecutive runs as opposed to ô just once in the two experiments?
4. Label the four bases that are used in a DNA molecule as {1, 2, 3, 4}.
a) Granted this labelling, write down the sample space for the possible bases at two
given sites on the molecule.
b) Let {Aj}j=1,2,3,4 denote the events in this 2-site sample space that the first site has the base j, and let {Bk}k=1,…,4 denote the analogous events for the second site. Explain why P(A1|Bk) + P(A2|Bk) + P(A3|Bk) + P(A4|Bk) = 1 for all k.
c) If Ai is independent from each Bk, we saw that P(Ai|Bk) = P(Bk|Ai) for all i and k. Can this last symmetry condition hold if some pair Ai and Bk are not independent? If so, give an example by specifying the associated probability function on the two-site sample space. If not, explain why.