Bio/statistics handout 17: Symmetric matrices and data sets

Suppose that you take some large number of measurements of various facets of a system that you are studying. Let's say that there are n facets and you take N >> n measurements under different conditions; thus, you generate a collection of N vectors in Rn which you can label as {v1, v2, …, vN}.
An issue now is whether some of the n facets that you measure are dependent on the rest. For example, suppose that one facet is the temperature of the sample. If all of the other facets of the system are completely determined by the temperature, then all of these N vectors will lie on or very near some curve in the n-dimensional space, a curve parameterized as T → v(T) by the temperature.
On the other hand, if no facets are determined by any collection of the
others, then the N vectors could be spread in a more or less random fashion
through a region of Rn. The point here is that if you are interested
in discovering relations between the various facets, then you would like to
know if the distribution of the N vectors is spread out or concentrated near
some lower dimensional object—a curve or a surface or some such. Any such concentration towards something less
spread out indicates relations between the various facets.
For example, suppose n = 2. If you plot the endpoints of the vectors {vk} in R2 and see the result as very much lying near a particular curve (not necessarily a line), this says that the two facets are not able to vary in an independent fashion. Indeed, if y = f(x) is the equation for the curve, it says that the variation in x determines the variation in y. Now, for this n = 2 case, you can go and plot your N vectors and just look at the picture. However, for n > 2, this is going to be hard to do. How then can you discern relationships when n > 2?
a) An example from biology
Here is a topical example: The technology is such that it is possible to monitor the levels of huge numbers of proteins in a cell as it goes about its business. These proteins are typically interacting with each other, and so it is of crucial import to determine which are influencing which. Let n denote the number of proteins involved. Discerning the levels of these n proteins at some N times during the cell cycle, with various environmental factors altered at various times, gives a very large data set of the sort described above: N vectors in Rn. Of interest is how these vectors distribute themselves—is it random through some region, or are the vectors concentrated near some intrinsically much thinner set such as a curve, surface or whatever in Rn? If the latter is the case, then the structure of this thinner set carries information about the complicated chemical interactions in a cell.
b) A fundamental concern
I don't think that there is a completely foolproof way to discern relationships between a bunch of vectors in Rn. What follows is an n = 2 example to keep in mind. As is typically the case, the vectors {vk} cannot be too big; they are constrained to lie in some region of fixed size in Rn. Let me suppose in this example that no vector from this collection has norm greater than 1, and so all lie in the disk of radius 1 about the origin.
Now, consider the curve in this disk given in the parametric form by

t → (x = c t cos(t), y = c t sin(t)) ,
(17.1)

Here c > 0 is a constant and 0 ≤ t ≤ 1/c is the parameter for the curve. For any given c, this curve is a spiral. The radius, r(t), is equal to c·t and the angle is t as measured from the positive x-axis in the anti-clockwise direction.
As c gets smaller, the spiral gets tighter and tighter; there are more and more turns before the curve hits the boundary of the disk. On the other hand, when c is very large, the spiral hardly turns and stays very close to the positive x-axis. When c = 1/(2π), the spiral makes one complete turn before exiting the disk. When c = 1/(2πm) and m is an integer, the spiral makes m turns before it exits.
Now, here is the problem: Suppose that our vectors are near some very small c version of (17.1). Since this spiral makes a lot of turns in the disk, all points in the disk are pretty close to the spiral, and so even a random collection of points in the disk will find each point close to the spiral. In particular, our collection of vectors {vk} will be close to all small c versions of the spiral no matter what!
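This claim is easy to see numerically. Here is a minimal sketch (assuming numpy; the particular c and the sample sizes are illustrative choices, not part of the handout): it samples a ten-turn spiral and measures how far random disk points are from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tight spiral from (17.1): t -> (c t cos t, c t sin t) for 0 <= t <= 1/c.
c = 1.0 / (20.0 * np.pi)      # ten full turns before the spiral exits the unit disk
t = np.linspace(0.0, 1.0 / c, 10000)
spiral = np.column_stack((c * t * np.cos(t), c * t * np.sin(t)))

# Draw random points uniformly in the unit disk (by rejection from the square).
pts = rng.uniform(-1.0, 1.0, size=(200, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= 1.0]

# Distance from each random point to the nearest sampled point of the spiral.
# Successive turns are only 2*pi*c = 0.1 apart, so no point of the disk is
# farther than roughly half that gap from the spiral.
dists = np.linalg.norm(pts[:, None, :] - spiral[None, :, :], axis=2).min(axis=1)
print(round(dists.max(), 3))
```

Even completely random points all sit within a few hundredths of the spiral, so proximity to a tight spiral carries essentially no information.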
Here is another example: Suppose that the points {vk} are distributed randomly in a thin strip of width r << 1 along the diagonal, y = x. Thus, each vk has coordinates x and y that obey |x − y| ≤ r. So, inside this strip, the points are spread at random. Outside the strip, there are no points at all. If your experimental error is on the order of r, then I would be happy to conclude that the points lie on the line y = x, and thus that the experiment indicates that the y-measurement is completely determined by the x-measurement. On the other hand, if the experimental error is much less than r, then I would not agree that the concentration near the line y = x signifies that y is determined by x. Maybe some part of y is, but there is some part left over that is independent.
c) A method
Suppose we have data {vk}1≤k≤N, each a vector in Rn. Here is a method to analyze whether the data near any given vj is concentrated near some lower dimensional subspace of Rn.
Step 1: You must choose a number, r, that is a reasonable amount greater than your experimental error. In this regard, r should also be significantly less than the maximum distance between any two points in {vk}. Thus, r << maxj,k |vj − vk|. This number r determines the scale on which you will be looking for the clustering of the vectors.
Step 2: Let vj ∈ {vk}. To see if the data is clustering around a lower dimensional subspace near vj, take all points in {vk} that have distance r or less from vj. Let m denote the number of such points. Allow me to relabel these points as {x1, …, xm}. These are the only points from the collection {vk} that will concern us while we look for clustering around the given point vj at the scale determined by r.
Step 3: Let p denote the vector (1/m)(x1 + ··· + xm). This vector should be viewed as the center of the collection {x1, …, xm}. For each index i = 1, …, m, set wi = xi − p. This just shifts the origin in Rn.
Step 4: View each i ∈ {1, …, m} version of wi as a matrix with 1 column and then introduce the transpose, wiT, which is a matrix with 1 row. Note that if u is a vector in Rn viewed as a 1-column matrix, then u can multiply uT to give the square matrix u uT. For example, if u has top entry 1 and all others 0, then u uT has top left entry 1 and all others zero. In general, the entry (u uT)ik is the product of the i'th and k'th entries of u.
Granted the preceding, introduce the matrix

A = (1/m)(w1 w1T + ··· + wm wmT) .
(17.2)

This is a symmetric, n × n matrix, so it has n real eigenvalues which I will henceforth denote by {λ1, …, λn}.
Step 5: As I explain below, none of the eigenvalues has absolute value greater than r2. I also explain below why A has no negative eigenvalues. Granted this, A's eigenvalues can be anywhere from 0 to r2. Those eigenvalues that are much smaller than r2 correspond to 'pinched' directions. If only d ≤ n of the eigenvalues are much smaller than r2, this indicates that the distribution of the vectors from {vk} with distance r or less from vj is concentrated near a subspace of Rn whose dimension is n − d. To quantify this, I am going to say that the vectors near vj cluster around a subspace whose dimension is no greater than d when A has n − d or more eigenvalues that are smaller than (1/(n+2)) r2. In this regard, note that I am not going to expect much accuracy with this prediction unless m is large.

I give some examples below to illustrate what is going on. At the end, I outline my reasons for choosing the factor 1/(n+2) to distinguish between small and reasonably sized eigenvalues.
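Steps 1–5 can be sketched in a few lines of code. Here is a minimal version (assuming numpy; the function name and the use of r2/(n+2) as the small/large cutoff are my choices for the sketch, and any threshold of that general size works the same way):

```python
import numpy as np

def local_dimension(data, j, r):
    """Steps 1-5: eigenvalues of the matrix A of (17.2) for the points
    within distance r of data[j], plus a dimension estimate that counts
    eigenvalues at or above r^2/(n+2) as 'reasonably sized'."""
    n = data.shape[1]
    # Step 2: keep the points within distance r of the chosen point.
    near = data[np.linalg.norm(data - data[j], axis=1) <= r]
    m = len(near)
    # Step 3: subtract the center p = (1/m)(x1 + ... + xm).
    w = near - near.mean(axis=0)
    # Step 4 and (17.2): A = (1/m) sum_i w_i w_i^T, symmetric and n x n.
    A = (w.T @ w) / m
    # Step 5: count the eigenvalues that are not small.
    eigvals = np.linalg.eigvalsh(A)      # ascending order
    d = int(np.sum(eigvals >= r**2 / (n + 2)))
    return eigvals, d
```

For instance, feeding it a couple of hundred points spread evenly along a line through the origin in R3 reports a one-dimensional cluster: one eigenvalue of reasonable size and two small ones.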
d) Some loose ends
To tie up some loose ends, let me show you why all of A's eigenvalues are in the interval between 0 and r2. To see what A looks like, the first thing to realize is that if u is any given vector, then wiT u is a 1×1 matrix, thus a number, and that this number is the dot product between wi and u. This implies that

A u = (1/m)[(w1·u) w1 + ··· + (wm·u) wm] .
(17.3)
Therefore, if I take the dot product of A u with u, I find that

u·(A u) = (1/m)[(w1·u)2 + ··· + (wm·u)2] .
(17.4)

Note that this is a sum of non-negative terms, so u·(A u) ≥ 0 for any vector u.
Now suppose that u is an eigenvector with eigenvalue λ. Then A u = λ u and so u·(A u) = λ |u|2. As this is non-negative, we see that λ ≥ 0.
To see that λ ≤ r2, use the fact that

|wi·u| ≤ |wi| |u|
(17.5)

for any given vector u on each term on the right hand side of (17.4) to conclude that

u·(A u) ≤ (1/m)[|w1|2 |u|2 + ··· + |wm|2 |u|2] .
(17.6)
When I pull out the common factor |u|2 on the right of this last inequality, I find that

u·(A u) ≤ (1/m)(|w1|2 + ··· + |wm|2) |u|2 .
(17.7)

Since each wk has norm no greater than r, the right hand side of (17.7) is no larger than r2 |u|2. Thus,

u·(A u) ≤ r2 |u|2
(17.8)

for any vector u.
Now, if u is an eigenvector with eigenvalue λ, then the left hand side of this last equation is λ |u|2, and so λ ≤ r2 as claimed.
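These bounds are easy to confirm numerically. A minimal sketch (assuming numpy; the sample vectors here are arbitrary, chosen only to have norm at most r):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 5, 400, 0.3

# Vectors w_1, ..., w_m with |w_i| <= r: random directions, random lengths.
w = rng.normal(size=(m, n))
w = w / np.linalg.norm(w, axis=1, keepdims=True) * rng.uniform(0.0, r, size=(m, 1))

# A = (1/m)(w_1 w_1^T + ... + w_m w_m^T), as in (17.2).
A = (w.T @ w) / m
eigvals = np.linalg.eigvalsh(A)
print(eigvals.min(), eigvals.max())
```

Up to rounding error, every eigenvalue lands in the interval [0, r2], as the argument above demands.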
e) Some examples

What follows are a few examples to try to convince you that my attempt at estimating a dimension is not unreasonable.
Example 1: Suppose that all wk sit very close to the origin; for example, suppose that each wk has |wk| < ε r with ε2 ≤ 1/(n+2). To see what this implies, return now to (17.7) to conclude that

u·(A u) ≤ ε2 r2 |u|2 .
(17.9)

This then means A's eigenvalues are no larger than ε2 r2, which is smaller than (1/(n+2)) r2. Thus, we would predict that the vectors from {vk} are clustering around a point near vj.
Example 2: Suppose that m is large and that the vectors {wk} all sit on a single line, and that they are evenly distributed along the line. In particular, let τ denote the unit tangent vector to the line, and suppose that wk = (−1 + 2(k−1)/(m−1)) r τ. Thus, w1 = −r τ and wm = r τ. This being the case,

A = r2 (1/m) ∑1≤k≤m (−1 + 2(k−1)/(m−1))2 τ τT .
(17.10)

As it turns out, this sum can be computed in closed form and the result is

A = ((m+1)/(3(m−1))) r2 τ τT .
(17.11)
The matrix A has the form c τ τT where c is a constant. Now any matrix Â = c τ τT has the following property: If u is any vector in Rn, then

Â u = c (τ·u) τ .
(17.12)

Thus, Â u = 0 if u is orthogonal to τ, and Â τ = c τ. Hence Â has 0 and c as its eigenvalues, where 0 has multiplicity n − 1 and c has multiplicity 1.
In our case, this means that there is one eigenvalue that is on the order of (1/3) r2 and the others are all zero. Thus, we would say that the clustering here is towards a subspace of dimension 1, and this is precisely the case.
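The closed form in (17.11) can be spot-checked numerically. A sketch assuming numpy; I take the even spacing to be wk = (−1 + 2(k−1)/(m−1)) r τ, which puts w1 = −r τ and wm = r τ, and compare the sum against ((m+1)/(3(m−1))) r2 τ τT:

```python
import numpy as np

n, m, r = 3, 50, 0.7
tau = np.array([1.0, 2.0, 2.0]) / 3.0     # a unit vector in R3

# Evenly spaced points on the line: w_k = (-1 + 2(k-1)/(m-1)) r tau, k = 1..m.
s = -1.0 + 2.0 * np.arange(m) / (m - 1)
w = s[:, None] * (r * tau)[None, :]

A = (w.T @ w) / m                         # the matrix of (17.2)
closed = (m + 1) / (3.0 * (m - 1)) * r**2 * np.outer(tau, tau)
print(np.abs(A - closed).max())           # agreement up to rounding error
```

Note that (m+1)/(3(m−1)) tends to 1/3 as m grows, which is where the "on the order of (1/3) r2" eigenvalue comes from.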
Example 3: To generalize the preceding example, suppose that d ≥ 1, that V ⊂ Rn is a d-dimensional subspace, and that the vectors {wk}1≤k≤m all lie in V. In this case, one can immediately deduce that A will have n − d orthonormal eigenvectors that have zero as eigenvalue. Indeed, any vector in the orthogonal complement of V is in the kernel of A and so is an eigenvector with eigenvalue 0. As this orthogonal complement has dimension n − d, A has n − d linearly independent eigenvectors with eigenvalue 0. Thus, we predict here that the clustering is towards a subspace whose dimension is no greater than d.
f) Small versus reasonably sized eigenvalues

I am going to give some indication here for my choice of the factor 1/(n+2) to distinguish the small eigenvalues of A. Note here that you or others might want some other factor. For example, I am probably being conservative with this factor and it might be better to say that if m is very large, then the vectors near vj cluster around a subspace whose dimension is no greater than d when A has n − d or more eigenvalues that are smaller than (1/(d+2))(1 − ad/√m) r2, where ad is a computable factor that depends on d but not on m. There is a closed form expression for this ad, but its form is not of consequence to my discussion.
To explain where this factor 1/(n+2) (or rather 1/(d+2)) is coming from, I have to take you on a digression that starts here with the introduction of the uniform probability function on the ball of radius r in Rd. This is to say that the probability of choosing a point in any given subset of the ball is proportional to the d-dimensional volume of the subset. To each vector u in this ball (thus, to each u with |u| ≤ r), we can assign the square matrix u uT. This can be viewed as a random variable that maps vectors in the ball to d × d matrices. For example, in the case d = 2, one has

u uT = [ x2  xy ]
       [ xy  y2 ] ,
(17.13)

where x and y denote the two entries of u.
For any d, the mean of the random variable u → u uT is the matrix whose entry in the j'th row and k'th column is the average of ujuk over the ball. These averages can be computed and one finds that the mean of u uT is (1/(d+2)) r2 Id, where Id is the d × d identity matrix.

By the way, there is also a sort of standard deviation that can be assigned to the random variable u → u uT. Its square, σ2, is the average value over the ball of the sum ∑1≤j,k≤d CjkCjk, where the collection {Cjk} are the entries of the matrix u uT − (1/(d+2)) r2 Id. This standard deviation can also be computed, and doing so finds σ = [d2(d+3)/((d+2)2(d+4))]1/2 r2.
Keeping all of this in mind, go back to the formula for A in (17.2). Suppose that I consider a set of m vectors, {wk}1≤k≤m, that all lie in a d-dimensional subspace. Suppose further that I sprinkle these vectors at random in the radius r ball inside this subspace. I can then view

Â = (1/m)(w1 w1T + ··· + wm wmT)
(17.14)

as the average of m identical, but unrelated, random variables, these being m maps of the form u → u uT. According to the Central Limit Theorem, when m is large, the probability that Â differs much from (1/(d+2)) r2 I is very small.
If my actual data, the vectors {wk}, are truly clustering around a d-dimensional subspace of Rn and are more or less randomly sprinkled in the radius r ball of this subspace, then I should expect the following: When m is very large, the Central Limit Theorem tells me that the matrix A in (17.2) is very close to the matrix (1/(d+2)) r2 I. Thus, it should have d eigenvalues that are very close to (1/(d+2)) r2 and n − d eigenvalues that are much smaller than this number. In particular, when m is large, then under the hypothesis that the vectors {wk}1≤k≤m are sprinkled at random in a d-dimensional subspace, the matrix A in (17.2) will be very likely to have d eigenvalues that are greater than (1/(n+2)) r2 and n − d eigenvalues that are much less than this number.
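This prediction can be simulated directly. A sketch assuming numpy (the subspace, the seed, and the sizes n, d, m are arbitrary choices for the illustration): it sprinkles points uniformly in the radius-r ball of a 2-dimensional coordinate subspace of R5 and looks at the eigenvalues of A.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r, m = 5, 2, 1.0, 100000

# Sprinkle m points uniformly in the radius-r ball of a d-dimensional
# subspace of Rn (here, the span of the first d coordinates): a random
# direction, then a radius with the density of the uniform ball.
u = rng.normal(size=(m, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
u *= r * rng.uniform(0.0, 1.0, size=(m, 1)) ** (1.0 / d)
w = np.zeros((m, n))
w[:, :d] = u

# A of (17.2); the top d eigenvalues should hover near r^2/(d+2) = 0.25,
# while the remaining n - d eigenvalues vanish.
A = (w.T @ w) / m
eigvals = np.linalg.eigvalsh(A)           # ascending order
print(eigvals)
```

For m this large, the d large eigenvalues sit within a fraction of a percent of (1/(d+2)) r2, just as the Central Limit Theorem argument says.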
This application of the Central Limit Theorem explains my preference for the size distinction (1/(n+2)) r2 between small eigenvalues of A and eigenvalues that are of reasonable size.
As I mentioned above, there is an argument for using a cut off of the form (1/(d+2))(1 − ad/√m) r2 to distinguish the small eigenvalues of A. The latter suggestion uses more from the Central Limit Theorem. To outline how this comes about, remember that I told you that there is a standard deviation associated to the random variable u → u uT, this σ = [d2(d+3)/((d+2)2(d+4))]1/2 r2. There is also a d-dimensional version of the Central Limit Theorem that can be used when m is very large, and it says the following: When m is very large, then the probability that the sum of the squares of the entries of Â − (1/(d+2)) r2 I, thus

∑1≤j,k≤d (Â − (1/(d+2)) r2 I)jk(Â − (1/(d+2)) r2 I)jk ,
(17.15)

has square root differing from zero by more than R σ/√m is no greater than

cd R e−R²/2 ,
(17.16)

where cd is a certain d-dependent factor.

This version of the Central Limit Theorem has the following consequence: When m is very large, and under the hypothesis that the vectors {wk}1≤k≤m are sprinkled at random in a d-dimensional subspace, the matrix A in (17.2) will be very likely to have d eigenvalues that are greater than (1/(d+2) − âd/√m) r2, where âd is a constant that is independent of m but depends on d. Here, the term 'very likely' means that the probability is greater than 95%. (Thus, the P-value of this not happening is less than the usual 0.05.) Granted this, write ad = (d+2) âd to get my alternate factor for distinguishing small from reasonable eigenvalues of A.
Exercises:

1. The purpose of this exercise is to compute the average of the matrix in (17.13) over the disk of radius r in the x-y plane. This average is the matrix, U, whose entries are

U11 = (1/(πr2)) ∫∫ x2 dxdy,  U22 = (1/(πr2)) ∫∫ y2 dxdy,  and
U12 = U21 = (1/(πr2)) ∫∫ xy dxdy ,

where each integral is over the disk. To compute U, change to polar coordinates (ρ, θ), where ρ ≥ 0 and θ ∈ [0, 2π], using the formulas x = ρ cos(θ) and y = ρ sin(θ). Show that

a) U11 = (1/(πr2)) ∫∫ ρ2 cos2θ ρ dρ dθ,
b) U22 = (1/(πr2)) ∫∫ ρ2 sin2θ ρ dρ dθ,
c) U12 = U21 = (1/(πr2)) ∫∫ ρ2 cosθ sinθ ρ dρ dθ .

Next, use the formulas cos(2θ) = 2cos2θ − 1 = 1 − 2sin2θ and sin(2θ) = 2 sinθ cosθ to do the angle integrals first and so find that

d) U11 = U22 = (1/r2) ∫0≤ρ≤r ρ3 dρ,
e) U12 = U21 = 0 .

Finally, do the ρ-integral to find that U11 = U22 = (1/4) r2 and U12 = U21 = 0. Note that this means that U is the d = 2 version of (1/(d+2)) r2 Id.
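You can spot-check the answer to this exercise with a quick Monte Carlo average over the disk. A minimal sketch (assuming numpy; the radius and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
r = 2.0

# Sample points uniformly in the disk of radius r by rejection from the square.
pts = rng.uniform(-r, r, size=(400000, 2))
pts = pts[np.linalg.norm(pts, axis=1) <= r]
x, y = pts[:, 0], pts[:, 1]

# U = average of the matrix in (17.13) over the disk; the exercise
# computes this average to be (r^2/4) times the identity.
U = np.array([[np.mean(x * x), np.mean(x * y)],
              [np.mean(x * y), np.mean(y * y)]])
print(U)
```

The sampled U sits within Monte Carlo error of (1/4) r2 I2, matching the polar-coordinate computation.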