Statistics Tools in GeneSpring
The Center for Bioinformatics, UNC at Chapel Hill
Jianping Jin, Ph.D., Bioinformatics Scientist
Phone: (919)843-6015
E-mail: jjin@email.unc.edu
Fax: (919)966-6821

What Can GeneSpring Do?
• Works with both Affymetrix and two-color data.
• Views data graphically (classification, graph, tree, scatter plot, Venn diagram …)
• Performs statistical analyses.
• Annotates genes (updating from GenBank, LocusLink, UniGene; biochemical pathways).
• ……

What Statistical Analyses Does GS Do?
• Clustering:
  • k-means (non-hierarchical)
  • Self-organizing map
  • Gene trees (hierarchical dendrograms)
• Principal component analysis
• t-test analyses (p-values)
• Like a known gene or average of genes
• Like a pattern drawn with the mouse
• Genes with high confidence
• Genes with relative expression in certain ranges
• Pathway analysis: finding genes that fit in a certain place in a pathway.
• Sequence analysis to automatically find regulatory sequences.
• Automatic functional annotation of sub-trees in dendrograms.
• …

Tree Clustering
1. Standard correlation
2. Smooth correlation
3. Change correlation
4. Upregulated correlation
5. Pearson correlation
6. Spearman correlation
7. Spearman confidence
8. Two-sided Spearman confidence
9. Distance

Notation for the Formulas
• Result: the result of the calculation for genes A and B.
• n: the number of samples being correlated over.
• a: the vector $(a_1, a_2, a_3, \ldots, a_n)$ of expression values for gene A.
• b: the vector $(b_1, b_2, b_3, \ldots, b_n)$ of expression values for gene B.
• $a \cdot b = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$
• $|a| = \sqrt{a \cdot a}$

Standard Correlation
• Equation: $a \cdot b / (|a||b|)$
• Also called “Pearson correlation around zero”.
• Measures the angular separation of the expression vectors for genes A and B.
• Answers the question “do the peaks match up?”

Pearson Correlation
• Equation: $A \cdot B / (|A||B|)$
• Very similar to the standard correlation, except that it measures the angle between the expression vectors for genes A and B around the mean of the expression vectors.
• A = the mean of all elements in vector a minus the value of each element in a.
• Do the same for b to make a vector B.

Spearman Confidence
• With r = the value of the Spearman correlation, SC = 1 - (probability you would get a value of r or higher by chance).
• A measure of similarity, not a correlation.
• The SC value is high when the Spearman correlation is high and the p-value is low.
• Takes into account the number of subexperiments in your experiment set.

Two-sided Spearman Confidence
• A measure of similarity, very similar to the Spearman confidence.
• A two-sided test of whether the Spearman correlation is significantly greater than or less than zero.
• Answers the question “what genes behave similarly/opposite to a specific gene?”
• Probably not good for k-means/tree clustering.
• 1 - (probability you would get a Spearman correlation of |r| or higher, or -|r| or lower, by chance).

Distance
• A measurement of dissimilarity, not a correlation at all.
• Euclidean distance between the expression profiles (values for each point in N-dimensional space) of genes A and B.
• $\text{Distance} = |a - b| / \sqrt{N}$, where N is the number of experiment points.

Special Case Correlations
• Smooth correlation, change correlation and upregulated correlation.
• All three are modified versions of the standard correlation.
• They only make sense when the data are in a sequence, such as “before”/“after”, a time series, or a drug series.

Smooth Correlation
• Make a new vector A from a by interpolating the average of each consecutive pair of elements of a.
• Insert this new value between the old values.
• Do this for each pair of elements that would be connected by a line in the graph screen.
• Do the same to make a vector B from b.

Change Correlation
• The opposite of what the smooth correlation looks for: it considers only the change in expression level between adjacent points.
• Similar to the standard correlation, but uses an arc tangent transformation of the ratio between adjacent pairs of points to create the expression vector. This is less sensitive to outliers than using the ratio directly.
• The value created between two values $a_i$ and $a_{i+1}$ is $\operatorname{atan}(a_{i+1}/a_i) - \pi/4$.

Upregulated Correlation
• Very similar to the change correlation, but it only considers positive changes. All negative values for the arc tangent are set to zero.
• Make a new vector A from a by looking at the change between each pair of elements of a.
• The value created between two values $a_i$ and $a_{i+1}$ is $\max(\operatorname{atan}(a_{i+1}/a_i) - \pi/4,\ 0)$.

Algorithm to Build Gene Tree
1. Determine if there is only one gene or subtree left. If yes, go to step five.
2. Find the two closest genes/subtrees.
3. Merge these two into one subtree.
4. Return to step one.
5. Merge together branches where the distance between sub-branches is less than the separation ratio, subject to considering genes less than the minimum distance apart.

Algorithm to Build Tree
• The minimum distance: how far down the tree discrete branches are depicted. A higher number puts more genes in a group, making the groups less specific.
• The separation ratio: the correlation difference between groups of clustered genes, between 0 and 1. Increasing the separation increases the branchiness of the tree.

Principal Components Analysis
• Not a clustering method.
• Principal components are the most abundant “building blocks”: a set of expression patterns.
• The first principal component is obtained by finding the linear combination of expression patterns that accounts for the most variability in the data, and so on for the following components.

k-Means Clustering
• Divides genes into a user-defined number (k) of equal-sized groups, based on their expression patterns.
• Creates centroids at the average location of each group of genes.
• With each iteration, genes are reassigned to the group with the closest centroid.
• After all of the genes have been reassigned, the locations of the centroids are recalculated.

Self-Organizing Maps
• Similar to k-means clustering.
• Shows the relationship between groups in a 2-D map.
• Best represents the variability of the data, while still maintaining similarity between adjacent nodes, e.g. point 1,2 is one unit away from 1,3.

What Does t-test Mean in GS?
• Replicates: one-sample Student’s t-test.
• Comparisons for two groups: Student’s two-sample t-test.
• Comparisons for multiple groups: one-way analysis of variance (ANOVA).
• Filtering genes: based on a one-sample t-test of the mean expression level across replicates vs. a reference value (Expression Percentage Restriction).

Filter Genes Analysis Tools
• Global Error Model: filters out genes with large standard deviations or error values.
• Raw data filtering: gets rid of genes too close to the background.
• Sample-to-sample comparison: fold comparison among different samples.
• Statistical Group Comparison: filters out genes that do not vary significantly across different groups.
• Data File Restriction: based on other fields (P/S call, +/- pairs).

Statistical Group Comparison
• Finds genes with a statistically significant difference in mean expression levels across the groups.
• For two groups: Student’s two-sample t-test.
• For multiple groups: ANOVA.
• Non-parametric comparison: for each gene, the rank order is used for the analysis. Wilcoxon two-sample test (Mann-Whitney U test) for two groups, the Kruskal-Wallis test for multiple groups.

Data Normalization
• In two-color experiments, normalize vs. the control channel (green) for each gene.
• Normalize each sample to itself or to a positive control. This makes different samples comparable to one another.
• Normalize each gene to itself: removes the differing intensity scales from multiple experiment readings (highly recommended if not using a two-color experiment).

NCI-60 cell lines DrugActivity_AT
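
Sketch: similarity measures in code
The formulas on the Standard Correlation, Pearson Correlation, Distance, Smooth Correlation, Change Correlation and Upregulated Correlation slides can be written out in a few lines of Python. This is only an illustrative sketch built from the slide formulas, not GeneSpring's own implementation. All function names are hypothetical, and the change-based measures assume strictly positive expression values so that the ratio between adjacent points is defined.

```python
import math


def dot(a, b):
    """a.b = a1*b1 + a2*b2 + ... + an*bn"""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """|a| = square root of (a.a)"""
    return math.sqrt(dot(a, a))


def standard_correlation(a, b):
    """a.b / (|a||b|): cosine of the angle between the two vectors."""
    return dot(a, b) / (norm(a) * norm(b))


def pearson_correlation(a, b):
    """Standard correlation of the vectors taken around their means."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    A = [mean_a - x for x in a]
    B = [mean_b - y for y in b]
    return standard_correlation(A, B)


def distance(a, b):
    """|a - b| / sqrt(N), with N the number of experiment points."""
    diff = [x - y for x, y in zip(a, b)]
    return norm(diff) / math.sqrt(len(a))


def smooth_vector(a):
    """Insert the average of each consecutive pair between the old values."""
    out = [a[0]]
    for prev, cur in zip(a, a[1:]):
        out.append((prev + cur) / 2)
        out.append(cur)
    return out


def change_vector(a):
    """atan(a[i+1] / a[i]) - pi/4 for each adjacent pair (assumes a[i] > 0)."""
    return [math.atan(nxt / cur) - math.pi / 4 for cur, nxt in zip(a, a[1:])]


def upregulated_vector(a):
    """Like change_vector, but negative values are set to zero."""
    return [max(v, 0.0) for v in change_vector(a)]


def smooth_correlation(a, b):
    return standard_correlation(smooth_vector(a), smooth_vector(b))


def change_correlation(a, b):
    return standard_correlation(change_vector(a), change_vector(b))


def upregulated_correlation(a, b):
    return standard_correlation(upregulated_vector(a), upregulated_vector(b))
```

Each special-case correlation here is the standard correlation applied to a transformed vector, matching the slides' description of them as modified versions of the standard correlation. A transformed vector that is all zeros (for example, a flat profile under the change transformation) has zero length, so the sketch would raise a division error in that case; how GeneSpring handles it is not stated in the slides.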