Analysis and Clustering of Gene Expression Data

1: Statistical and Numerical Considerations

2: Searching for Causal Relationships

Introduction

The rise of new technologies for assaying gene expression has generated a large quantity of data that is driving the development of new algorithms and techniques for discovering correlations and patterns in datasets that are far too large for any human analyst to comprehend. The concept of ãclusteringä is applicable to many different fields: given a set of records (which may be gene expression levels, protein sequences, or any quantitative phenomena), create groups of records that are in some sense ãsimilarä.

 

In the case of gene expression data (from micro-arrays or other technologies), there are several possible goals for clustering analysis:

 

á        Find genes that are triggered by the same causative mechanism,

á        Find cause/effect relationships among gene expression levels

á        Compare expression levels among different experiments to determine the impact of different experimental conditions,

á        Find which genes are or are not sensitive to different experimental conditions

 

A body of literature concerning the clustering of gene expression levels as assayed by microarray testing has begun to arise. As will be seen, this era of large datasets with records numbering in the millions requires strict attention be paid to statistics and the implications of very large numbers, lest coincidences appear significant and chance correlations seem to signal relationships that exist only as statistical anomalies.

 

A case in point can be found in a paper by Eisen and co-workers, titled ãCluster analysis and display of genome-wide expression patternsä[1]. In fairness to the authors, the paper was not intended as a mathematically rigorous exercise in algorithm-building, but rather an adaptation of old techniques to a new type of data, and focused as much on data visualization as on algorithmics. Nonetheless, an analysis of the work serves as an excellent cautionary tale.

Part 1: Clustering and Similarity

ÊÊÊÊÊÊÊÊÊÊÊ

ÊÊÊÊÊÊÊÊÊÊÊ In their 1998 paper, Eisen and co-workers made an early attempt at gene clustering using techniques developed by Sokal and Michener[2] for data-driven generation of taxonomic trees. In this work, the ãsimilarityä of two gene expression patterns is defined as their correlation coefficient, using a formulation similar to the standard Pearson correlation. This similarity measure was used in a pairwise average-linkage cluster analysis, where the pair of genes with the highest similarity score are designated a cluster, and replaced by a node with an expression profile that is the average of the two input genes. This pseudo-gene replaces the two constituent genes in the list of expression profiles, and the process is repeated until only one gene or node remains. The resulting tree is used as the basis of an ordered list of genes, their proximity within the list indicative of their relative correlation.

 

ÊÊÊÊÊÊÊÊÊÊÊ One advantage of this procedure is that it is ãunsupervisedä, in machine learning parlance: it makes no a priori assumptions about the number or form of any putative clusters, unlike some other clustering procedures like self-organizing maps.[3] Unfortunately, the process also has a serious disadvantage in that it clusters all genes regardless of the absolute strength of the correlations, which at the very least can imply associations where no exist. This is particularly problematic when the laws of large numbers come into play; given that over three million pairwise correlations exist in the dataset of 2467 genes used in the paper, a more careful accounting of the statistical properties of the correlation coefficient is called for. This issue will be looked at more carefully elsewhere in the present study.

 

ÊÊÊÊÊÊÊÊÊÊÊ Another problem lies in the semantics of ãsimilarityä, which in this paper is defined as ãpositively correlated expression profilesä: deviations from the mean expression level of one gene are mirrored by proportional (here meaning ãthe same number of standard deviationsä) deviations in the expression level of the ãsimilarä gene. Biologically, this implies that gene1 and gene2 respond in the same way to the same causal factor. However, what of the case where gene1 and gene2 have significant negative correlation? They are evidently still responding to the same causal factor, but in opposite ways. Such negative correlations must be ignored by the procedure under discussion, because the average of negatively correlated profiles tends to a flat profile and results in almost complete information loss. Nonetheless, ignoring large negative correlations is discarding biologically relevant information. This problem could be overcome by translating negatively correlated profiles to a zero mean, and reversing the sign of all datapoints before computing the average profile, as shown in Figure 1.

 

a)

b)

Ê

c)

Figure 1 a) Genes 1 and 2 have a ö1 correlation, but a mean of zero. In b) Gene2 has been translated to the origin; the correlation of the two genes is still ö1.Ê In c) the sign of each data point in Gene 2 has been inverted, changing its correlation to Gene 1 to +1. The average expression profile now correctly reflects the correlated features of the two input genes.

 

The data analyzed and discussed in the paper comes from various studies of the yeast genome, and is comprised of expression profiles of 2467 genes gathered under various experimental conditions. The paper does not detail the exact procedure, but it appears that the authors simply concatenated all the experimental data together into one large profile for each gene, and generated a matrix of correlation coefficients from these profiles. This procedure can generate misleading results if the means or variances of the profiles being concatenated together are not similar. For example, consider two sets of profiles (Table 1):

 

 

Experiment 1

Experiment 2

Gene 1

1 2 3 4 5

6 7 8 9 10

Gene 2

1 2 3 4 1

6 7 8 9 6

Table 1: Time Series of Two Experiments on Two Genes

The correlation coefficient of the two genes within each experiment is 0.2425; they are not particularly well correlated. However, the correlation coefficient of the two experiments concatenated together (Table 2) is 0.8393, a significant improvement.

 

Gene 1

1 2 3 4 5 6 7 8 9 10

Gene 2

1 2 3 4 1 6 7 8 9 6

Table 2: Two Genes with Experimental Observations Concatenated

If the sequences are normalized to a mean of zero and unit variance before concatenation, then the correlation for the concatenated sequence drops to 0.2425, which intuitively seems a more appropriate figure of merit.

 

ÊÊÊÊÊÊÊÊÊÊÊ Concatenating experiments together also implies that ãsimilarityä means ãsimilar at all times and under all experimental conditionsä, which is fine if that is that is the information sought by the clustering. However, this means disregarding any transient correlations that only exist in a subset of experiments or observations. Segal et al[4] take a Bayesian approach to clustering which takes into account this type of local order, given enough data to make good parameter estimates.

 

ÊÊÊÊÊÊÊÊÊÊÊ The averaging procedure used to produce a ãconsensus profileä of two correlated profiles is not described in detail in the paper, but is potentially a source of trouble as well: simply taking the arithmetic average of two expression profiles will generally produce a profile that is not particularly representative of the two input sequences. The average profile should be generated not by taking the mean of the input data points, but rather the mean of the z-scores of the input points. This will produce an average profile that has the same correlation to each of the input profiles, and reproduces accurately the common features of the input profiles.

 

ÊÊÊÊÊÊÊÊÊÊÊ Finally, a key shortcoming of this method is that it takes no account of the statistical significance of the correlations studied. Given the large numbers of pairwise correlations present in a dataset of this size (2467 profiles taken two at a time yields 3041811 correlations), to not carefully consider the statistical distribution of correlations is to admit a large number of spurious relationships into consideration. To get some idea of the significance of any given correlation, a program was written to generate random datasets with the same statistical properties as the data under study, and tabulate the distribution of correlation values amongst the records that dataset. The mean and variance of the data from each of the ten experiments used was determined, and used to generate mock datasets of the same size and with the same mean and variance. The mock experiments were then concatenated into one large dataset, and the correlation of each row of this large dataset with each other row was measured.Ê The results for the extreme regions (near the boundaries of statistical significance) are tabulated below in Table 3.

 

Correlation

-0.622

-0.621

-0.62

-0.619

0.636

0.637

0.638

0.639

0.64

Expected

ÊSequences

/ 3041811 Trials

0.0439

0.0474

.0495

0.0543

0.0509

0.0470

0.0480

0.0404

0.0461

Table 3: Expected Sequences Per 3,041,811 Trials For Particular Correlation Coefficients, based on 2.3408e+010 trials.

 

Note that these figures were calculated using the experimental data as recorded, with no scaling of means or variances as proposed above. The correlation requirements are not terribly stringent, and it may prove informative to look at the correlations with some of the clusters proposed in the paper. Figure 2B of the Eisen paper lists eleven genes that were clustered; the matrix of correlation coefficients of these eleven is shown below in Table 4.

 

 

STU2

DHS1

BNR1

SPC42

CNM67

CLB4

CDC10

CDC3

CLB3

APC4

CDC16

STU2

1

 

 

 

 

 

 

 

 

 

 

DHS1

0.90294

1

 

 

 

 

 

 

 

 

 

BNR1

0.900762

0.944625

1

 

 

 

 

 

 

 

 

SPC42

0.858939

0.915106

0.874128

1

 

 

 

 

 

 

 

CNM67

0.887488

0.920278

0.879147

0.852228

1

 

 

 

 

 

 

CLB4

0.901142

0.857636

0.838259

0.810104

0.85717

1

 

 

 

 

 

CDC10

0.852406

0.832763

0.843004

0.847756

0.832974

0.884979

1

 

 

 

 

CDC3

0.910766

0.866702

0.882075

0.841237

0.869079

0.84589

0.883561

1

 

 

 

CLB3

0.824061

0.824815

0.838447

0.873035

0.822136

0.77878

0.821932

0.82964

1

 

 

APC4

0.882062

0.860115

0.84

0.806454

0.902605

0.820372

0.805811

0.841021

0.805368

1

 

CDC16

0.758781

0.825366

0.812123

0.861522

0.767708

0.714379

0.79756

0.79158

0.838047

0.767866

1

Table 4: Correlation coefficients from a cluster of eleven genes.

 

ÊÊÊÊÊÊÊÊÊÊÊ Disregarding the concerns expressed in and around Tables 1 and 2 above, these are consistently strong correlations, all well above the point required for statistical significance. However, assume that the expression levels for each gene in each experiment were normalized to a mean of zero and variance of one. This lowers the level of correlation required for significance as shown in Table 5 below.

 

Correlation

-0.529

-0.528

-0.527

-0.525

0.527

0.528

0.529

0.530

Expected

ÊSequences

/ 3041811 Trials

0.0522ÊÊÊ

0.0420ÊÊÊ

0.0449ÊÊÊ

0.0565

0.0594

0.0493

0.0478

0.0420

Table 5: Expected Sequences Per 3,041,811 Trials For Particular Correlation Coefficients, based on 2.0996e+010 trials.

 

Here are the same genes as correlated in Table 4 but after the experiment-by-experiment normalization procedure has been applied (Table 6):

 

 

STU2

DHS1

BNR1

SPC42

CNM67

CLB4

CDC10

CDC3

CLB3

APC4

CDC16

STU2

1

 

 

 

 

 

 

 

 

 

 

DHS1

0.313718

1

 

 

 

 

 

 

 

 

 

BNR1

0.309534

0.51728

1

 

 

 

 

 

 

 

 

SPC42

0.402066

0.66359

0.446931

1

 

 

 

 

 

 

 

CNM67

0.224291

0.420331

0.122209

0.442056

1

 

 

 

 

 

 

CLB4

0.483372

0.162059

-0.03756

0.108855

0.224874

1

 

 

 

 

 

CDC10

0.226701

0.006486

0.100663

0.161288

0.000201

0.355918

1

 

 

 

 

CDC3

0.563692

0.03367

0.183147

0.149937

0.182473

0.281812

0.310724

1

 

 

 

CLB3

0.484789

0.442078

0.341455

0.51614

0.401616

0.278575

0.222834

0.209578

1

 

 

APC4

0.086954

0.229685

0.037831

0.20958

0.497245

0.034537

0.044378

0.061341

0.406528

1

 

CDC16

0.100451

0.332902

0.371209

0.414532

-0.0167

-0.05993

0.002332

0.107163

0.176582

0.134365

1

Table 6: Correlation coefficients from a cluster of eleven genes after their expression profiles have been normalized.

ÊÊÊÊÊÊÊÊÊÊÊ Apart from the distressing lack of significant correlations, the other disturbing conclusion drawn from comparing Table 4 to Table 6 is that the correlations themselves do not correlate well; correlations of similar magnitude in Table 4 may be of very different magnitudes in Table 6. The correlations generated from un-normalized data paint a misleading picture indeed.

Conclusion

ÊÊÊÊÊÊÊÊÊÊÊ The methods put forward in the paper under review, while a step in the right direction, were not applied with sufficient mathematical rigor to generate meaningful conclusions. In particular, the lack of normalization when joining together data from different experiments, and the failure to deal with negative correlations sharply limits the usefulness of clusters generated using the methods outlined.

Part 2: Looking For Cause/Effect Relationships

 

None of the literature on gene clustering examined in the preparation of this paper looked at the possibility of searching for cause-effect relationships among gene expression levels. Generally, the similarity measures employed only compared data points within a given column of a dataset; correlations across a temporal offset were not considered. Of course, this would only make sense applied to time series data generated by the same experiment; comparisons between different experiments, or among data points that do not represent observations of the same quantity at different points in time, would be nonsensical. However, given time series data from such experiments as Spellman and coworkersâ[5], is it possible to find statistically significant relationships between time series that have been offset relative to one another?

 

Two of the records from Spellmanâs ãalpha factor arrest and releaseä (hereafter simply ãalphaä) experiments are shown in Table 7 below.

Ê

Table 7: Two Time Series, r=-0.4407

The expression levels of the two genes are clearly uncorrelated; the correlation coefficient of the two records is ö0.4407. However, if the RNR3 gene is moved ahead by four observations (Table 8) the picture changes dramatically and the correlation coefficient rises to 0.9589, a much more impressive figure.

 

Table 8: Two Time Series Offset By Four Observations, r=0.9589

ÊÊÊÊÊÊÊÊÊÊÊ So in at least this one case, a correlation exists between the offset sequences that did not exist when the sequences were compared in the same time frame. What is the likelihood that this is a chance occurrence? To answer that question, the program for generating random experiments (mentioned earlier in this paper) was revised to generate mock experiments based on the pr