Bio/statistics handout 17: Symmetric matrices and data sets

Suppose that you take some large number of measurements of various facets of a system that you are studying. Let's say that there are n facets and you take N >> n measurements under different conditions; thus, you generate a collection of N vectors in Rn which you can label as {v1, v2, …, vN}.
An issue now is whether some of the n facets that you measure are dependent on the rest. For example, suppose that one facet is the temperature of the sample. If all of the other facets of the system are completely determined by the temperature, then all of these N vectors will lie on or very near some curve in the n-dimensional space, a curve parameterized as T → v(T) by the temperature.
On the other hand, if no facets are determined by any collection of the
others, then the N vectors could be spread in a more or less random fashion
through a region of Rn. The point here is that if you are interested
in discovering relations between the various facets, then you would like to
know if the distribution of the N vectors is spread out or concentrated near
some lower dimensional object—a curve or a surface or some such. Any such concentration towards something less
spread out indicates relations between the various facets.
For example, suppose n = 2. If you plot the endpoints of the vectors {vk} in R2 and see the result as very much lying near a particular curve (not necessarily a line), this says that the two facets are not able to vary in an independent fashion. Indeed, if y = f(x) is the equation for the curve, it says that the variation in x determines the variation in y. Now, for this n = 2 case, you can go and plot your N vectors and just look at the picture. However, for n > 2, this is going to be hard to do. How then can you discern relationships when n > 2?
a) An example from biology
Here is a topical example: The technology is such that it is possible to monitor the levels of huge numbers of proteins in a cell as it goes about its business. These proteins are typically interacting with each other, and so it is of crucial import to determine which are influencing which. Let n denote the number of proteins involved. Discerning the levels of these n proteins at some N times during the cell cycle, with various environmental factors altered at various times, gives a very large data set of the sort described above: N vectors in Rn. Of interest is how these vectors distribute themselves—is it random through some region, or are the vectors concentrated near some intrinsically much thinner set such as a curve, surface or whatever in Rn? If the latter is the case, then the structure of this thinner set carries information about the complicated chemical interactions in a cell.
b) A fundamental concern
I don't think that there is a completely foolproof way to discern relationships between a bunch of vectors in Rn. What follows is an n = 2 example to keep in mind. As is typically the case, the vectors {vk} cannot be too big; they are constrained to lie in some region of fixed size in Rn. Let me suppose in this example that no vector from this collection has norm greater than 1, and so all lie in the disk of radius 1 about the origin.
Now, consider the curve in this disk given in the parametric form by

t → (x = c t cos(t), y = c t sin(t)) ,
(17.1)

Here c > 0 is a constant and 0 ≤ t ≤ 1/c is the parameter for the curve. For any given c, this curve is a spiral. The radius, r(t), is equal to c·t and the angle is t as measured from the positive x-axis in the anti-clockwise direction.
As c gets smaller, the spiral gets tighter and tighter; there are more and more turns before the curve hits the boundary of the disk. On the other hand, when c is very large, the spiral hardly turns and stays very close to the positive x-axis. When c = 1/(2π), the spiral makes one complete turn before exiting the disk. When c = 1/(2πm) and m is an integer, the spiral makes m turns before it exits.
Now, here is the problem: Suppose that our vectors are near some very small c version of (17.1). Since this spiral makes a lot of turns in the disk, all points in the disk are pretty close to the spiral, and so even a random collection of points in the disk will find each point close to the spiral. In particular, our collection of vectors {vk} will be close to all small c versions of the spiral no matter what!
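This claim is easy to see numerically. Here is a minimal sketch (assuming numpy; the particular c and the sample sizes are illustrative choices, not part of the handout): it samples a ten-turn spiral and measures how far random disk points are from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tight spiral from (17.1): t -> (c t cos t, c t sin t) for 0 <= t <= 1/c.
c = 1.0 / (20.0 * np.pi)      # ten full turns before the spiral exits the unit disk
t = np.linspace(0.0, 1.0 / c, 10000)
spiral = np.column_stack((c * t * np.cos(t), c * t * np.sin(t)))

# Draw random points uniformly in the unit disk (by rejection from the square).
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]

# Distance from each random point to the nearest sampled point of the spiral.
# Successive turns are only 2*pi*c = 0.1 apart, so no point of the disk is
# farther than roughly half that gap from the spiral.
dists = np.linalg.norm(pts[:, None, :] - spiral[None, :, :], axis=2).min(axis=1)
print(round(dists.max(), 3))
```

Even completely random points all sit within a few hundredths of the spiral, so proximity to a tight spiral carries essentially no information.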
Here is another example: Suppose that the points {vk} are distributed randomly in a thin strip of width r << 1 along the diagonal, y = x. Thus, each vk has coordinates x and y that obey |x − y| ≤ r. So, inside this strip, the points are spread at random. Outside the strip, there are no points at all. If your experimental error is on the order of r, then I would be happy to conclude that the points lie on the line y = x, and thus that the experiment indicates that the y-measurement is completely determined by the x-measurement. On the other hand, if the experimental error is much less than r, then I would not agree that the concentration near the line y = x signifies that y is determined by x. Maybe some part of y is, but there is some part left over that is independent.
c) A method
Suppose we have data {vk}1≤k≤N, each a vector in Rn. Here is a method to analyze whether the data near any given vj is concentrated near some lower dimensional subspace of Rn.
Step 1: You must choose a number, r, that is a reasonable amount greater than your experimental error. In this regard, r should also be significantly less than the maximum distance between any two points in {vk}. Thus, r << maxj,k |vj − vk|. This number r determines the scale on which you will be looking for the clustering of the vectors.
Step 2: Let vj ∈ {vk}. To see if the data is clustering around a lower dimensional subspace near vj, take all points in {vk} that have distance r or less from vj. Let m denote the number of such points. Allow me to relabel these points as {x1, …, xm}. These are the only points from the collection {vk} that will concern us while we look for clustering around the given point vj at the scale determined by r.
Step 3: Let p denote the vector (1/m)(x1 + ··· + xm). This vector should be viewed as the center of the collection {x1, …, xm}. For each index i = 1, …, m, set wi = xi − p. This just shifts the origin in Rn.
Step 4: View each i ∈ {1, …, m} version of wi as a matrix with 1 column and then introduce the transpose, wiT, which is a matrix with 1 row. Note that if u is a vector in Rn viewed as a 1-column matrix, then u can multiply uT to give the square matrix u uT. For example, if u has top entry 1 and all others 0, then u uT has top left entry 1 and all others zero. In general, the entry (u uT)ik is the product of the i'th and k'th entries of u.
Granted the preceding, introduce the matrix

A = (1/m)(w1 w1T + ··· + wm wmT) .
(17.2)

This is a symmetric, n × n matrix, so it has n real eigenvalues which I will henceforth denote by {λ1, …, λn}.
Step 5: As I explain below, none of the eigenvalues has absolute value greater than r2. I also explain below why A has no negative eigenvalues. Granted this, A's eigenvalues can be anywhere from 0 to r2. Those eigenvalues that are much smaller than r2 correspond to 'pinched' directions. If only d ≤ n of the eigenvalues are much smaller than r2, this indicates that the distribution of the vectors from {vk} with distance r or less from vj is concentrated near a subspace of Rn whose dimension is n − d. To quantify this, I am going to say that the vectors near vj cluster around a subspace whose dimension is no greater than d when A has n − d or more eigenvalues that are smaller than (1/(n+2)) r2. In this regard, note that I am not going to expect much accuracy with this prediction unless m is large.

I give some examples below to illustrate what is going on. At the end, I outline my reasons for choosing the factor 1/(n+2) to distinguish between small and reasonably sized eigenvalues.
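Steps 1–5 can be sketched in a few lines of code. Here is a minimal version (assuming numpy; the function name and the use of r2/(n+2) as the small/large cutoff are my choices for the sketch, and any threshold of that general size works the same way):

```python
import numpy as np

def local_dimension(data, j, r):
    """Steps 1-5: eigenvalues of the matrix A of (17.2) for the points
    within distance r of data[j], plus a dimension estimate that counts
    eigenvalues at or above r^2/(n+2) as 'reasonably sized'."""
    n = data.shape[1]
    # Step 2: keep the points within distance r of the chosen point.
    near = data[np.linalg.norm(data - data[j], axis=1) <= r]
    m = len(near)
    # Step 3: subtract the center p = (1/m)(x1 + ... + xm).
    w = near - near.mean(axis=0)
    # Step 4 and (17.2): A = (1/m) sum_i w_i w_i^T, symmetric and n x n.
    A = (w.T @ w) / m
    # Step 5: count the eigenvalues that are not small.
    eigvals = np.linalg.eigvalsh(A)      # ascending order
    d = int(np.sum(eigvals >= r**2 / (n + 2)))
    return eigvals, d
```

For instance, feeding it a couple of hundred points spread evenly along a line through the origin in R3 reports a one-dimensional cluster: one eigenvalue of reasonable size and two small ones.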
d) Some loose ends
To tie up some loose ends, let me show you why all of A's eigenvalues are in the interval between 0 and r2. To see what A looks like, the first thing to realize is that if u is any given vector, then wiT u is a 1×1 matrix, thus a number, and that this number is the dot product between wi and u. This implies that

A u = (1/m)[(w1·u) w1 + ··· + (wm·u) wm] .
(17.3)
Therefore, if I take the dot product of A u with u, I find that

u·(A u) = (1/m)[(w1·u)2 + ··· + (wm·u)2] .
(17.4)

Note that this is a sum of non-negative terms, so u·(A u) ≥ 0 for any vector u.
Now suppose that u is an eigenvector with eigenvalue λ. Then A u = λ u and so u·(A u) = λ |u|2. As this is non-negative, we see that λ ≥ 0.
To see that λ ≤ r2, use the fact that

|wi·u| ≤ |wi| |u|
(17.5)

for any given vector u on each term on the right hand side of (17.4) to conclude that

u·(A u) ≤ (1/m)[|w1|2 |u|2 + ··· + |wm|2 |u|2] .
(17.6)
When I pull out the common factor |u|2 on the right of this last inequality, I find that

u·(A u) ≤ (1/m)(|w1|2 + ··· + |wm|2) |u|2 .
(17.7)

Since each wk has norm no greater than r, the right hand side of (17.7) is no larger than r2 |u|2. Thus,

u·(A u) ≤ r2 |u|2
(17.8)

for any vector u.
Now, if u is an eigenvector with eigenvalue λ, then the left hand side of this last equation is λ |u|2, and so λ ≤ r2 as claimed.
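These bounds are easy to confirm numerically. A minimal sketch (assuming numpy; the sample vectors here are arbitrary, chosen only to have norm at most r):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 5, 400, 0.3

# Vectors w_1, ..., w_m with |w_i| <= r: random directions, random lengths.
w = rng.normal(size=(m, n))
w = w / np.linalg.norm(w, axis=1, keepdims=True) * rng.uniform(0.0, r, size=(m, 1))

# A = (1/m)(w_1 w_1^T + ... + w_m w_m^T), as in (17.2).
A = (w.T @ w) / m
eigvals = np.linalg.eigvalsh(A)
print(eigvals.min(), eigvals.max())
```

Up to rounding error, every eigenvalue lands in the interval [0, r2], as the argument above demands.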
e) Some examples

What follows are a few examples to try to convince you that my attempt at estimating a dimension is not unreasonable.
Example 1: Suppose that all wk sit very close to the origin; for example, suppose that each wk has |wk| < ε r with ε2 ≤ 1/(n+2). To see what this implies, return now to (17.7) to conclude that

u·(A u) ≤ ε2 r2 |u|2 .
(17.9)

This then means A's eigenvalues are no larger than ε2 r2, which is smaller than (1/(n+2)) r2. Thus, we would predict that the vectors from {vk} are clustering around a point near vj.
Example 2: Suppose that m is large and that the vectors {wk} all sit on a single line, and that they are evenly distributed along the line. In particular, let τ denote the unit tangent vector to the line, and suppose that wk = (−1 + 2(k−1)/(m−1)) r τ. Thus, w1 = −r τ and wm = r τ. This being the case,

A = r2 (1/m) ∑1≤k≤m (−1 + 2(k−1)/(m−1))2 τ τT .
(17.10)

As it turns out, this sum can be computed in closed form and the result is

A = ((m+1)/(3(m−1))) r2 τ τT .
(17.11)
The matrix A has the form c τ τT where c is a constant. Now any matrix Â = c τ τT has the following property: If u is any vector in Rn, then

Â u = c (τ·u) τ .
(17.12)

Thus, Â u = 0 if u is orthogonal to τ, and Â τ = c τ. Hence Â has 0 and c as its eigenvalues, where 0 has multiplicity n − 1 and c has multiplicity 1.
In our case, this means that there is one eigenvalue that is on the order of (1/3) r2 and the others are all zero. Thus, we would say that the clustering here is towards a subspace of dimension 1, and this is precisely the case.
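The closed form in (17.11) can be spot-checked numerically. A sketch assuming numpy; I take the even spacing to be wk = (−1 + 2(k−1)/(m−1)) r τ, which puts w1 = −r τ and wm = r τ, and compare the sum against ((m+1)/(3(m−1))) r2 τ τT:

```python
import numpy as np

n, m, r = 3, 50, 0.7
tau = np.array([1.0, 2.0, 2.0]) / 3.0     # a unit vector in R3

# Evenly spaced points on the line: w_k = (-1 + 2(k-1)/(m-1)) r tau, k = 1..m.
s = -1.0 + 2.0 * np.arange(m) / (m - 1)
w = s[:, None] * (r * tau)[None, :]

A = (w.T @ w) / m                         # the matrix of (17.2)
closed = (m + 1) / (3.0 * (m - 1)) * r**2 * np.outer(tau, tau)
print(np.abs(A - closed).max())           # agreement up to rounding error
```

Note that (m+1)/(3(m−1)) tends to 1/3 as m grows, which is where the "on the order of (1/3) r2" eigenvalue comes from.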
Example 3: To generalize the preceding example, suppose that d ≥ 1, that V ⊂ Rn is a d-dimensional subspace, and that the vectors {wk}1≤k≤m all lie in V. In this case, one can immediately deduce that A will have n − d orthonormal eigenvectors that have zero as eigenvalue. Indeed, any vector in the orthogonal complement of V is in the kernel of A and so is an eigenvector with eigenvalue 0. As this orthogonal complement has dimension n − d, A has n − d linearly independent eigenvectors with eigenvalue 0. Thus, we predict here that the clustering is towards a subspace whose dimension is no greater than d.
f) Small versus reasonably sized eigenvalues

I am going to give some indication here for my choice of the factor 1/(n+2) to distinguish the small eigenvalues of A. Note here that you or others might want some other factor. For example, I am probably being conservative with this factor and it might be better to say that if m is very large, then the vectors near vj cluster around a subspace whose dimension is no greater than d when A has n − d or more eigenvalues that are smaller than (1/(d+2))(1 − ad/√m) r2, where ad is a computable factor that depends on d but not on m. There is a closed form expression for this ad, but its form is not of consequence to my discussion.
To explain where this factor 1/(n+2) (or rather 1/(d+2)) is coming from, I have to take you on a digression that starts here with the introduction of the uniform probability function on the ball of radius r in Rd. This is to say that the probability of choosing a point in any given subset of the ball is proportional to the d-dimensional volume of the subset. To each vector u in this ball (thus, to each u with |u| ≤ r), we can assign the square matrix u uT. This can be viewed as a random variable that maps vectors in the ball to d × d matrices. For example, in the case d = 2, one has

u uT = [ x2  xy ]
       [ xy  y2 ] ,
(17.13)

where x and y denote the two entries of u.
For any d, the mean of the random variable u → u uT is the matrix whose entry in the j'th row and k'th column is the average of ujuk over the ball. These averages can be computed and one finds that the mean of u uT is (1/(d+2)) r2 Id, where Id is the d × d identity matrix.

By the way, there is also a sort of standard deviation that can be assigned to the random variable u → u uT. Its square, σ2, is the average value over the ball of the sum ∑1≤j,k≤d CjkCjk, where the collection {Cjk} are the entries of the matrix u uT − (1/(d+2)) r2 Id. This standard deviation can also be computed, and doing so finds σ = [d2(d+3)/((d+2)2(d+4))]1/2 r2.
Keeping all of this in mind, go back to the formula for A in (17.2). Suppose that I consider a set of m vectors, {wk}1≤k≤m, that all lie in a d-dimensional subspace. Suppose further that I sprinkle these vectors at random in the radius r ball inside this subspace. I can then view

Â = (1/m)(w1 w1T + ··· + wm wmT)
(17.14)

as the average of m identical, but unrelated, random variables, these being m maps of the form u → u uT. According to the Central Limit Theorem, when m is large, the probability that Â differs much from (1/(d+2)) r2 I is very small.
If my actual data, the vectors {wk}, are truly clustering around a d-dimensional subspace of Rn and are more or less randomly sprinkled in the radius r ball of this subspace, then I should expect the following: When m is very large, the Central Limit Theorem tells me that the matrix A in (17.2) is very close to the matrix (1/(d+2)) r2 I. Thus, it should have d eigenvalues that are very close to (1/(d+2)) r2 and n − d eigenvalues that are much smaller than this number. In particular, when m is large, then under the hypothesis that the vectors {wk}1≤k≤m are sprinkled at random in a d-dimensional subspace, the matrix A in (17.2) will be very likely to have d eigenvalues that are greater than (1/(n+2)) r2 and n − d eigenvalues that are much less than this number.
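This prediction can be simulated directly. A sketch assuming numpy (the subspace, the seed, and the sizes n, d, m are arbitrary choices for the illustration): it sprinkles points uniformly in the radius-r ball of a 2-dimensional coordinate subspace of R5 and looks at the eigenvalues of A.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r, m = 5, 2, 1.0, 100000

# Sprinkle m points uniformly in the radius-r ball of a d-dimensional
# subspace of Rn (here, the span of the first d coordinates): a random
# direction, then a radius with the density of the uniform ball.
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
u *= r * rng.uniform(0.0, 1.0, size=(m, 1)) ** (1.0 / d)
w = np.zeros((m, n))
w[:, :d] = u

# A of (17.2); the top d eigenvalues should hover near r^2/(d+2) = 0.25,
# while the remaining n - d eigenvalues vanish.
A = (w.T @ w) / m
eigvals = np.linalg.eigvalsh(A)           # ascending order
print(eigvals)
```

For m this large, the d large eigenvalues sit within a fraction of a percent of (1/(d+2)) r2, just as the Central Limit Theorem argument says.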
This application of the Central Limit Theorem explains my preference for the size distinction (1/(n+2)) r2 between small eigenvalues of A and eigenvalues that are of reasonable size.
As I mentioned above, there is an argument for using a cut off of the form (1/(d+2))(1 − ad/√m) r2 to distinguish the small eigenvalues of A. The latter suggestion uses more from the Central Limit Theorem. To outline how this comes about, remember that I told you that there is a standard deviation associated to the random variable u → u uT, this σ = [d2(d+3)/((d+2)2(d+4))]1/2 r2. There is also a d-dimensional version of the Central Limit Theorem that can be used when m is very large, and it says the following: When m is very large, then the probability that the sum of the squares of the entries of Â − (1/(d+2)) r2 I, thus

∑1≤j,k≤d (Â − (1/(d+2)) r2 I)jk(Â − (1/(d+2)) r2 I)jk ,
(17.15)

has square root differing from zero by more than R σ/√m is no greater than

cd R e−R²/2 ,
(17.16)

where cd is a certain d-dependent factor.

This version of the Central Limit Theorem has the following consequence: When m is very large, and under the hypothesis that the vectors {wk}1≤k≤m are sprinkled at random in a d-dimensional subspace, the matrix A in (17.2) will be very likely to have d eigenvalues that are greater than (1/(d+2) − âd/√m) r2, where âd is a constant that is independent of m but depends on d. Here, the term 'very likely' means that the probability is greater than 95%. (Thus, the P-value of this not happening is less than the usual 0.05.) Granted this, write ad = (d+2) âd to get my alternate factor for distinguishing small from reasonable eigenvalues of A.
Exercises:

1. The purpose of this exercise is to compute the average of the matrix in (17.13) over the disk of radius r in the x-y plane. This average is the matrix, U, whose entries are

U11 = (1/(πr2)) ∫∫ x2 dxdy,  U22 = (1/(πr2)) ∫∫ y2 dxdy,  and
U12 = U21 = (1/(πr2)) ∫∫ xy dxdy ,

where each integral is over the disk. To compute U, change to polar coordinates (ρ, θ), where ρ ≥ 0 and θ ∈ [0, 2π], using the formulas x = ρ cos(θ) and y = ρ sin(θ). Show that

a) U11 = (1/(πr2)) ∫∫ ρ2 cos2θ ρ dρ dθ,
b) U22 = (1/(πr2)) ∫∫ ρ2 sin2θ ρ dρ dθ,
c) U12 = U21 = (1/(πr2)) ∫∫ ρ2 cosθ sinθ ρ dρ dθ .

Next, use the formulas cos(2θ) = 2cos2θ − 1 = 1 − 2sin2θ and sin(2θ) = 2 sinθ cosθ to do the angle integrals first and so find that

d) U11 = U22 = (1/r2) ∫0≤ρ≤r ρ3 dρ,
e) U12 = U21 = 0 .

Finally, do the ρ-integral to find that U11 = U22 = (1/4) r2 and U12 = U21 = 0. Note that this means that U is the d = 2 version of (1/(d+2)) r2 Id.
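You can spot-check the answer to this exercise with a quick Monte Carlo average over the disk. A minimal sketch (assuming numpy; the radius and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
r = 2.0

# Sample points uniformly in the disk of radius r by rejection from the square.
pts = rng.uniform(-r, r, size=(400000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= r]
x, y = pts[:, 0], pts[:, 1]

# U = average of the matrix in (17.13) over the disk; the exercise
# computes this average to be (r^2/4) times the identity.
U = np.array([[np.mean(x * x), np.mean(x * y)],
              [np.mean(x * y), np.mean(y * y)]])
print(U)
```

The sampled U sits within Monte Carlo error of (1/4) r2 I2, matching the polar-coordinate computation.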