Bio/statistics Handout 3: Random variables
In favorable circumstances, the different outcomes of any given experiment have measurable properties that distinguish them. Of course, if a given outcome has a certain probability, then this is also the case for any associated measurement. The notion of a ‘random variable’ provides a mathematical framework for studying these induced probabilities on the measurements.
·
Definition of a random variable: A random variable is nothing more or less than a function on the sample space.
In this regard, such a function assigns a number to each element in the
sample space. One can view the function
as giving the results of measurements of some property of the elements of the
sample space.
Sometimes, the
notion is extended to consider a function from the sample space to another set.
For example, suppose that S is the set of possible 3-letter codons (sequences of three nucleotides), and f is the function on S that assigns to each codon either one of the 20 amino acids or the 'stop' signal. Thus, f maps a 64 point space to a 21 point space.
Most often, random variables take values in R. For example, let S
denote the 20 possible amino acids that can occupy the 127’th position from the end of a certain enzyme (a type of protein molecule) that helps the cell metabolize the sugar glucose.
Now, let f denote the function on S that measures the rate of glucose
metabolism in growing bacteria with the given enzyme at the given site. Thus, f associates to each element in a 20
element set some number.
·
Probability for a random variable: Suppose
that S is our sample space and P is a probability function on S. If f is a random variable and r a possible
value for f, then the probability that f takes the value r is P(the event f = r). This number is given by

P(f = r) = ∑_{s∈S: f(s)=r} P(s).
(3.1)
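To make (3.1) concrete, here is a minimal Python sketch; the three outcomes, their probabilities, and the values of f below are hypothetical placeholders, not data from the text.

# Sketch of (3.1): P(f = r) is the sum of P(s) over the outcomes s
# in the sample space with f(s) = r.  All data here are hypothetical.

def prob_f_equals(r, P, f):
    """Return P(f = r) = sum of P(s) over all s with f(s) = r."""
    return sum(P[s] for s in P if f[s] == r)

# A toy three-point sample space:
P = {'s1': 0.5, 's2': 0.3, 's3': 0.2}   # the probability function on S
f = {'s1': 1.0, 's2': 1.0, 's3': 0.0}   # a random variable on S

print(prob_f_equals(1.0, P, f))   # 0.8 = P(s1) + P(s2)
print(prob_f_equals(0.0, P, f))   # 0.2 = P(s3)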
A parenthetical
remark:
This last equation can be viewed (at least in a formal sense) as a
matrix equation in the following way:
Introduce a matrix by writing A_{rs} = 1 if f(s) = r and A_{rs} = 0 otherwise. Then, P(f = r) = ∑_{s∈S} A_{rs} P(s) is
a matrix equation. Of course, this is
rather silly unless the set S and the possible values for f are both
finite. Indeed, if S is finite, say with
n elements, number them from 1 to n.
Even if f has real number values, one typically makes its range finite
by rounding off at some decimal place anyway.
This understood, there exists some number, N, of possible values for f. Label
the latter by the integers between 1 and N.
Using these numberings of S and the values of f, the matrix A can be
thought of as a matrix with n columns and N rows.
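The indicator matrix is equally easy to build numerically. The following sketch reuses the hypothetical three-point space from the previous sketch, so n = 3 and N = 2.

# Sketch of the matrix form of (3.1): A has N rows and n columns,
# with A_{rs} = 1 exactly when f(s) = r; then A P lists the P(f = r).
import numpy as np

P = np.array([0.5, 0.3, 0.2])    # P(s) for the n = 3 outcomes (hypothetical)
f = np.array([1.0, 1.0, 0.0])    # f(s) for the same outcomes
values = np.array([0.0, 1.0])    # the N = 2 possible values of f

A = (values[:, None] == f[None, :]).astype(float)   # N x n indicator matrix
print(A)        # [[0. 0. 1.]
                #  [1. 1. 0.]]
print(A @ P)    # [0.2 0.8], i.e. P(f = 0) and P(f = 1)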
For an example of what happens in (3.1), take S to be the
set of 20 possible amino acids at the 127’th position
from the end of the glucose metabolizing enzyme. Let f now denote the function from S to the 11 element set that is obtained by measuring to the nearest 10% the fraction of
glucose used in one hour by the growing bacteria. Number the elements of S from
1 to 20, and suppose that P assigns the k’th amino acid probability 1/10 if k ≤ 5, probability 1/20 if 6 ≤ k ≤ 10, and probability 1/40 if k > 10. Meanwhile, suppose that f(k) = 1 - k/10 if k ≤ 10 and f(k) = 0 if k ≥ 10. This understood, it then follows using (3.1) that P(f = n/10) is equal to

0 for n = 10,
1/10 for 5 ≤ n ≤ 9,
1/20 for 1 ≤ n ≤ 4, and
3/10 for n = 0.
(3.2)
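As a check on (3.2), here is a short Python sketch that tabulates P(f = n/10) directly from the probabilities and the function f just described.

# Check of (3.2): tabulate P(f = n/10) on the 20-point amino acid space.
from fractions import Fraction

def P(k):                     # probability of the k'th amino acid
    if k <= 5:
        return Fraction(1, 10)
    if k <= 10:
        return Fraction(1, 20)
    return Fraction(1, 40)

def f(k):                     # f(k) = 1 - k/10 for k <= 10, else 0
    return Fraction(10 - k, 10) if k <= 10 else Fraction(0)

Pf = {}
for k in range(1, 21):
    Pf[f(k)] = Pf.get(f(k), Fraction(0)) + P(k)

for n in range(11):
    print(n, Pf.get(Fraction(n, 10), Fraction(0)))
# 3/10 for n = 0, 1/20 for 1 <= n <= 4, 1/10 for 5 <= n <= 9, 0 for n = 10
print(sum(Pf.values()))       # 1, as any probability function must give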
·
A probability function on the possible values of f: As it turns out, the assignment r → P(f = r) of a non-negative number to each of the possible
values for f defines a probability function on the set of these possible
values. Let us call this new sample
space S_f, and the new probability function P_f(r). Thus, if r ∈ S_f, then P_f(r) is given by the sum on
the right hand side of (3.1).
To verify that it is a probability function, observe that
it is never negative by the nature of its definition. Also, summing the values of P_f over all elements in S_f gives 1. Indeed, using (3.1), the latter sum can be
seen as the sum of the values of P over all elements of S.
The example in (3.2) illustrates this idea of associating a
new sample space with probability function to a random variable. In the case of (3.2), the new sample space S_f is the 11 element set of fractions of the form n/10 where n ∈ {0, . . . , 10}. The function P_f is that given in (3.2).
You can verify on your own that ∑_{0≤n≤10} P(f = n/10) = 1.
·
Mean and standard deviation for a random
variable: Suppose that f is a random variable on a sample space S, in this case just a
function that assigns a number to each element in S. The mean
of f is the ‘average’ of these assigned numbers, but with the notion of average
defined here using the probability function.
The mean is typically denoted by μ; here is its formula:

μ = ∑_{s∈S} f(s) P(s)
(3.3)
A
related notion is that of the standard
deviation of the random variable f.
This is a measure of the extent to which f differs from its mean. The standard deviation is typically denoted by the Greek letter σ, and it is defined so that its square is the mean of (f - μ)². To be explicit,

σ² = ∑_{s∈S} (f(s) - μ)² P(s).
(3.4)
Thus, the standard deviation is larger when f differs from its mean to a greater extent. The standard deviation is zero only in the case that f is a constant function.
To see how this plays out in an example, consider again the example that surrounds (3.2). The sum for the mean in this case is
1×0 + (9/10)×(1/10) + (8/10)×(1/10) + (7/10)×(1/10) + (6/10)×(1/10) + (5/10)×(1/10) + (4/10)×(1/20) + (3/10)×(1/20) + (2/10)×(1/20) + (1/10)×(1/20) + 0×(3/10),

which equals 2/5. Thus, μ = 2/5. The standard deviation in this example is the number whose square is the sum

(1 - 2/5)²×0 + (9/10 - 2/5)²×(1/10) + (8/10 - 2/5)²×(1/10) + (7/10 - 2/5)²×(1/10) + (6/10 - 2/5)²×(1/10) + (5/10 - 2/5)²×(1/10) + (4/10 - 2/5)²×(1/20) + (3/10 - 2/5)²×(1/20) + (2/10 - 2/5)²×(1/20) + (1/10 - 2/5)²×(1/20) + (0 - 2/5)²×(3/10),

which equals 11/100. Thus, σ = √(11/100) = √11/10 ≈ 0.33.
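The arithmetic is easy to confirm numerically; here is a brief sketch using the distribution P_f from (3.2).

# Check of the worked example: the mean and standard deviation of f.
from fractions import Fraction
from math import sqrt

Pf = {Fraction(0): Fraction(3, 10), Fraction(1): Fraction(0)}
for n in range(1, 5):
    Pf[Fraction(n, 10)] = Fraction(1, 20)
for n in range(5, 10):
    Pf[Fraction(n, 10)] = Fraction(1, 10)

mu = sum(r * p for r, p in Pf.items())
var = sum((r - mu) ** 2 * p for r, p in Pf.items())
print(mu)           # 2/5
print(var)          # 11/100
print(sqrt(var))    # 0.33166..., i.e. ~ 0.33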
Statisticians are partial to using one or two numbers to summarize what might be a complicated story. The mean and standard deviation are very commonly employed for this purpose. To some extent, the mean of a random variable is the best guess for its value. However, the mean says nothing about the expected variation. The mean and standard deviation together give both an idea as to the expected value of the variable and some idea of the spread of the values of the variable about the expected value.
·
Random variables as proxies. Of ultimate
interest are the probabilities for the points in S, but it is
often the case that it is only possible to directly measure the probabilities for
some random variable, a given function on S.
In this case, a good theoretical understanding of the measurements and
the frequencies of occurrences of the various measured
values can be combined so as to make an educated guess for the probabilities of
the elements of S. Here is how this is
typically done: The experiment is done
many times and the frequencies of occurrence of the possible values for f are
then taken as a reasonable approximation for the probability function P_f. This is to say that we set P(f = r) in (3.1) equal to the measured frequency with which the value r was observed for f. This value for P(f = r) is then used with (3.1) to deduce the desired P on S.
To be explicit here, suppose that there is some finite set
of possible values for f, these labeled as {r_1, . . ., r_N}. When k ∈ {1, . . . , N}, let y_k denote the frequency that r_k appears as the value for f. Label the elements in S as {s_1, . . . , s_n}. Now introduce the symbol x_j to denote the unknown but desired P(s_j). Thus, the subscript j on x can be any integer in the set {1, . . . , n}. The goal is then to solve for the collection {x_j}_{1≤j≤n} by writing (3.1) as
the linear equation
y_1 = a_{11} x_1 + · · · + a_{1n} x_n
⋮
y_N = a_{N1} x_1 + · · · + a_{Nn} x_n
(3.5)
where a_{kj} = 1 if f(s_j) = r_k and a_{kj} = 0 otherwise. Note that this whole
strategy is predicated on two things:
First, that the sample space is known.
Second, that there is enough of a theoretical understanding to predict a priori the values for the measurement f on its
elements.
To see something of this in action,
consider again the example from (3.2).
For the sake of argument, suppose that the measured frequencies P(f = n/10) are exactly those given in (3.2). Label the possible values of f using r_1 = 0, r_2 = 1/10, · · · , r_11 = 1. This done, the
relevant version of (3.5) is the following linear equation:
3/10 = x_{10} + · · · + x_{20}
1/20 = x_9
1/20 = x_8
1/20 = x_7
1/20 = x_6
1/10 = x_5
1/10 = x_4
1/10 = x_3
1/10 = x_2
1/10 = x_1
0 = 0
As you can see, this determines x_j = P(s_j) for j ≤ 9, but there are infinitely many ways to assign the remaining probabilities.
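To see the underdetermination concretely, here is a Python sketch that builds this instance of (3.5) and asks numpy for a solution. Note that the even split of the leftover 3/10 among x_10 through x_20 is just the minimum-norm answer that lstsq happens to return; the data themselves do not single it out.

# The system (3.5) for the example: y = A x, with A the 11 x 20
# indicator matrix whose entry a_{kj} is 1 exactly when f(s_j) = r_k.
import numpy as np

fvals = [(10 - k) / 10 if k <= 10 else 0.0 for k in range(1, 21)]   # f(k)
rvals = [n / 10 for n in range(11)]          # r_1 = 0, ..., r_11 = 1

A = np.array([[1.0 if abs(fv - r) < 1e-9 else 0.0 for fv in fvals]
              for r in rvals])
y = np.array([0.3] + [0.05] * 4 + [0.1] * 5 + [0.0])   # frequencies, per (3.2)

print(np.linalg.matrix_rank(A))              # 10, versus 20 unknowns
x = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.round(x, 3))   # x_1..x_9 are forced; x_10..x_20 share 0.3 equally here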
·
A second example: Here is some
background: It is typical that a given
gene along a DNA molecule is read by a cell for its information only if certain
nearby sites along the DNA are bound by certain specific protein
molecules. These nearby sites are called
‘promoter’ regions (there are also ‘repressor’ regions) and the proteins that
are involved are called ‘promoters’.
Note that promoter regions are not genes per se,
rather they are regions of the DNA molecule that attract proteins. The effect of these promoter regions is to
allow for switching behavior: The gene
is ‘turned on’ when the corresponding promoter is
present and the gene is ‘turned off’ when the promoter is absent. For example, when you go for a walk, your leg
muscle cells do work and need to metabolize glucose to supply the energy. Thus, some genes need to be turned on to make
the required proteins that facilitate this metabolism. When you are resting, these proteins are not
needed—furthermore, they clutter up the cells.
Thus, these genes are turned off when you rest. This on/off dichotomy is controlled by the
relative concentrations of promoter proteins.
The nerve impulses to the muscle cell cause a change in the folding of a
few particular proteins on the cell surface.
This change starts a chain reaction that ultimately frees up promoter
proteins which then bind to the promoter regions of the DNA, thus activating
the genes for the glucose metabolizing machinery. The latter then make lots of metabolic
proteins for use while walking.
Anyway, here is my example:
Let S denote the set of positive integers up to some large number N, and
let P(s) denote the probability that a given protein is attached to a given
promoting stretch of DNA for the fraction of time s/N. We measure the values of
a function, f, which is the amount of protein that would be produced by the
cell were the promoter uncovered. Thus,
we measure P(f = r), the frequency of finding level r of the protein. A model from biochemistry might tell us f, and thus we can write P_f(r) = ∑_s a_{rs} P(s).
Note that our task then is to solve for the collection {P(s)},
effectively solving a version of the linear equation in (3.5).
· Correlation matrices and independent random variables: A correlation matrix involves two random variables, say f and g. As I hope is clear from what follows, the matrix is related to the notion that we introduced earlier of independent events in that it measures the extent to which the event that f has a given value is independent of the event that g has a given value.
To see how this works, label the possible values for f as {r_1, . . . , r_N} and label those of g as {ρ_1, . . . , ρ_M}. Here, N need not equal M, and there is no reason for the ρ’s to be the same as the r’s. (Indeed, f can concern apples and g oranges: The r’s might be the weights of apples, rounded to the nearest gram, and the ρ’s might be the acidities of oranges, measured in pH to two decimal places.) Anyway, the correlation matrix is the N × M matrix C with coefficients (C_{kj})_{1≤k≤N,1≤j≤M} where
C_{kj} = P(f = r_k & g = ρ_j) - P(f = r_k) P(g = ρ_j) .
(3.6)
Here, P(f = r_k & g = ρ_j) is the probability of the event that f has value r_k and g has value ρ_j; it is the sum of the values of P on the elements s ∈ S where f(s) = r_k and g(s) = ρ_j. Thus, C_{kj} = 0 if and only if the event that f = r_k is independent from the event that g = ρ_j. If all entries are
zero, the random variables f and g are said to be independent random variables.
This means that the probabilities for the values of f have no relation
to those for g.
To see how this works in our toy model from (3.2), suppose that g measures the number of cell division cycles in six hours for our well fed bacteria. Suppose, in particular, that the values of g range from 0 to 2, and that g(k) = 2 if k ∈ {1, 2}, that g(k) = 1 if 3 ≤ k ≤ 6, and that g(k) = 0 if k ≥ 7. In this case, the probability that g has value ρ ∈ {0, 1, 2} is

9/20 for ρ = 0,
7/20 for ρ = 1, and
1/5 for ρ = 2
(3.7)
Label the values of f so that r_1 = 0, r_2 = 1/10, …, r_10 = 9/10, r_11 = 1. Meanwhile, label the values of g in the order they appear above, ρ_1 = 0, ρ_2 = 1 and ρ_3 = 2. The correlation matrix in this case is an 11×3 matrix. For example, here are the coefficients in the first row:

C_{11} = 33/200,  C_{12} = -21/200,  C_{13} = -3/50.
To explain, note that the event that f = 0 consists of the subset {10, . . . , 20} in the set of integers from 1 to 20. This set is a subset of the event that g is zero since the latter set is {7, . . . , 20}. Thus, P(f = 0 & g = 0) = P(f = 0) = 3/10, and so C_{11} = 3/10 - (3/10)×(9/20) = 33/200. Meanwhile, there are no elements where f is 0 and g is either 1 or 2.
By the way, this example illustrates something of the contents of the correlation matrix: If C_{kj} > 0, then the outcome f = r_k is relatively likely to occur when g = ρ_j. On the other hand, if C_{kj} < 0, then the outcome f = r_k is unlikely to occur when g = ρ_j. Indeed, in the most extreme case, the function f is never r_k when g is ρ_j, and so

C_{kj} = -P(f = r_k) P(g = ρ_j) .
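These numbers can be checked with a few lines of Python, computing each entry of the first row straight from (3.6) and the definitions of P, f, and g above.

# Check of the first row of (3.6) for the example: C_{1j} with r_1 = 0.
from fractions import Fraction

def P(k):                                   # probabilities from the example
    if k <= 5:
        return Fraction(1, 10)
    if k <= 10:
        return Fraction(1, 20)
    return Fraction(1, 40)

f = lambda k: Fraction(10 - k, 10) if k <= 10 else Fraction(0)
g = lambda k: 2 if k <= 2 else (1 if k <= 6 else 0)

S = range(1, 21)
joint = lambda r, rho: sum(P(k) for k in S if f(k) == r and g(k) == rho)
Pf = lambda r: sum(P(k) for k in S if f(k) == r)
Pg = lambda rho: sum(P(k) for k in S if g(k) == rho)

for rho in (0, 1, 2):
    print(rho, joint(Fraction(0), rho) - Pf(Fraction(0)) * Pg(rho))
# prints 33/200, -21/200, -3/50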
As I noted above, statisticians are wont to use a single number to summarize behavior. In the case of correlations, they favor what is known as the correlation coefficient. The latter, c(f,g), is obtained from the correlation matrix and is defined as follows:

c(f,g) = (1/(σ(f)σ(g))) ∑_{k,j} (r_k - μ(f))(ρ_j - μ(g)) C_{kj} .
(3.8)
Here, μ(f) and σ(f) are the respective mean and standard deviation of f, while μ(g) and σ(g) are their counterparts for g.
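Continuing the previous sketch (the definitions of P, f, g, S, joint, Pf, and Pg are assumed to still be in scope), the coefficient c(f, g) for the running example can be computed directly from (3.8).

# Sketch of (3.8) for the running example, reusing P, f, g, S, joint,
# Pf, and Pg from the previous sketch.
from math import sqrt

rs = sorted(set(f(k) for k in S))    # the possible values of f
rhos = (0, 1, 2)                     # the possible values of g

mf = sum(f(k) * P(k) for k in S)     # mu(f); equals 2/5 here
mg = sum(g(k) * P(k) for k in S)     # mu(g)
sf = sqrt(sum((f(k) - mf) ** 2 * P(k) for k in S))   # sigma(f)
sg = sqrt(sum((g(k) - mg) ** 2 * P(k) for k in S))   # sigma(g)

c = sum((r - mf) * (rho - mg) * (joint(r, rho) - Pf(r) * Pg(rho))
        for r in rs for rho in rhos) / (sf * sg)
print(c)    # lies between -1 and 1; positive when f and g rise together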
·
Correlations and proteomics: Let’s return to the story about promoters for genes. In principle, a given protein might serve as a promoting protein for one or more genes, or it might serve as a promoter for some genes and a repressor for others.
Indeed, one way a protein can switch off a gene
is to bind to the DNA in such a way as to cause all or some key part of the
gene coding stretch to be covered.
Anyway, suppose that f measures the level of protein #1 and
g measures that of protein #2. The
correlation matrix for the pair f and g is a measure of the extent to which the
levels of f and g tend to track each other.
If the coefficients of the matrix are positive, then f and g are
typically both high and both low simultaneously. If the coefficients are negative, then the
level of one tends to be high when the level of the other is low.
This said, note that a reasonable
approximation to the correlation matrix can be inferred from experimental
data: One need only measure simultaneous
levels of f and g for a cell, along with the frequencies with which the various pairs of levels are observed.
By the way, the search for correlations in protein levels
is a major preoccupation of cellular biologists these days. They use rectangular ‘chips’ that are covered
with literally thousands of beads in a regular array, each coated with a
different sort of molecule, each sort of molecule designed to bind to a
particular protein, and each fluorescing under ultraviolet light when the
protein is bound. Crudely said, the contents of a cell at some known stage in its life cycle are then washed over the chip and the ultraviolet light is turned on.
The pattern and intensity of the spots that light up signal the presence
and levels in the cell of the various proteins.
Pairs of spots that tend to light up under the same conditions signal
pairs of proteins whose levels in the cell are positively correlated.
Exercises:

1. A number from the three element set {-1, 0, 1} is selected at random; thus each of -1, 0 and 1 has probability 1/3 of appearing. This operation is performed twice and so generates an ordered pair (i_1, i_2), where each of i_1 and i_2 can be any one of -1, 0 or 1. Assume that the two selections are done independently, so that the event that i_2 has any given value is independent from the event that i_1 has any given value.
a) Write
down the sample space.
b) Let f denote the random variable that assigns i_1 + i_2 to any given (i_1, i_2) in the sample space. Write down the probabilities P(f = r) for the various possible values of r.
c) Compute the mean
and standard deviation of f.
d) Let g denote the random variable that assigns |i_1| + |i_2| to any given (i_1, i_2). Write down the probabilities P(g = r) for the various possible values of r.
e) Compute the mean and standard deviation of g.
f) Compute the correlation matrix for the pair
(f, g).
g) Which pairs (r, ρ), with r a possible value for f and ρ a possible value for g, are such that the event f = r is independent from the event g = ρ?
2.
Let S
denote the same sample space that you used in Problem 1, and let P denote some
hypothetical probability function on S.
Label the elements of S by consecutive integers starting from 1, and also label the possible values for f by consecutive integers starting from 1. Let x_j denote P(s_j), where s_j is the j’th element of the sample space. Meanwhile, let y_k denote P(f = r_k), where r_k is the k’th possible value for f. Write down the linear equation that relates {y_k} to {x_j}.
3.
Repeat
Problem 1b through 1e in the case that the probability of selecting either –1
or 1 in any given selection is
and that of selecting
0 is
.
4.
Suppose that N is a positive integer, and N selections are made from the set {-1, 0, 1}. Assume that these are done independently, so that the probability of any one number arising on the k’th selection is independent of any given number arising on any other selection. Suppose, in addition, that the probability of any given number arising on any given selection is 1/3.
a) How
many elements are in the sample space for this problem?
b) What is the probability of any given element?
c) Let f denote the random variable that assigns to any given (i_1, . . . , i_N) their sum, thus: f = i_1 + · · · + i_N. What are P(f = N) and P(f = N - 1)?