This document explores gene expression signatures of nitrogen-use
efficiency in maize.
We analyze data from GSE32361, validate sample metadata, perform
clustering, filter for reliable genes,
apply ANOVA to rank treatment-sensitive genes, visualize patterns via
heatmaps, assess cluster quality, and conduct PCA.
Objective: Download and inspect the structure of the microarray data.
#BiocManager::install("GEOquery") #already loaded it
library("GEOquery") #Otherwise wont work
gse<-getGEO("GSE32361")
# Extract expression and metadata
expression.data <- as.data.frame(exprs(gse[[1]])) #This exprs function will give me the expression from the gse I created. The gse is a list with 1 element - so need to tell it to find find from the first element of the list.
expression.metadata <- pData(gse[[1]])
# Inspect data structure
str(expression.data)
## 'data.frame': 84246 obs. of 90 variables:
## $ GSM801254: num 11.58 12.2 6.28 5.51 8.22 ...
## $ GSM801255: num 11.71 12.34 6.15 5.56 8 ...
## $ GSM801256: num 11.55 12.2 6.25 5.53 8.23 ...
## $ GSM801257: num 11.2 11.95 6.31 5.35 8.39 ...
## $ GSM801258: num 11.73 12.36 6.45 5.25 8.36 ...
## $ GSM801259: num 11.68 12.32 6.1 5.35 7.79 ...
## $ GSM801260: num 12.55 13.01 6.34 5.46 7.81 ...
## $ GSM801261: num 11.93 12.53 5.42 5.67 6.79 ...
## $ GSM801262: num 12.08 12.72 6.43 5.34 7.91 ...
## $ GSM801263: num 12.51 12.95 5.85 5.32 7.87 ...
## $ GSM801264: num 12.5 12.91 6.08 5.2 7.79 ...
## $ GSM801265: num 12.39 12.87 6.07 5.37 7.75 ...
## $ GSM801266: num 12.72 13.2 6.05 5.33 7.44 ...
## $ GSM801267: num 12.95 13.3 5.8 5.52 7.6 ...
## $ GSM801268: num 12.77 13.12 5.78 5.36 7.47 ...
## $ GSM801269: num 12.44 12.92 5.76 5.68 7.48 ...
## $ GSM801270: num 12.61 13.12 5.88 5.8 7.46 ...
## $ GSM801271: num 12.69 13.14 5.91 5.64 7.44 ...
## $ GSM801272: num 11.67 12.36 6.76 5.34 8.01 ...
## $ GSM801273: num 11.74 12.38 5.9 5.41 7.67 ...
## $ GSM801274: num 12.2 12.62 6.62 5.31 7.95 ...
## $ GSM801275: num 11.64 12.32 6.18 5.61 7.82 ...
## $ GSM801276: num 11.71 12.4 6.47 5.48 8.16 ...
## $ GSM801277: num 11.47 12.1 5.96 5.51 7.62 ...
## $ GSM801278: num 11.98 12.44 6.35 5.43 8.38 ...
## $ GSM801279: num 11.89 12.54 6.04 5.36 8.1 ...
## $ GSM801280: num 11.88 12.46 6.42 5.31 8.35 ...
## $ GSM801281: num 12.28 12.9 6.31 5.52 7.58 ...
## $ GSM801282: num 12.2 12.78 5.99 5.38 7.94 ...
## $ GSM801283: num 12.07 12.75 5.7 5.45 7.18 ...
## $ GSM801284: num 12.36 12.79 5.85 5.41 7.6 ...
## $ GSM801285: num 12.49 12.97 5.98 5.4 7.92 ...
## $ GSM801286: num 12.38 12.82 5.8 5.35 7.72 ...
## $ GSM801287: num 12.73 13.22 5.76 5.5 7.4 ...
## $ GSM801288: num 12.77 13.25 5.37 5.57 6.77 ...
## $ GSM801289: num 12.94 13.35 5.76 5.59 7.26 ...
## $ GSM801290: num 12.72 13.12 5.74 5.57 7.69 ...
## $ GSM801291: num 12.75 13.18 5.96 5.54 7.69 ...
## $ GSM801292: num 12.81 13.27 5.91 5.44 7.76 ...
## $ GSM801293: num 12.41 12.85 6.16 5.46 8.1 ...
## $ GSM801294: num 11.63 12.31 6.27 5.58 7.78 ...
## $ GSM801295: num 11.98 12.52 6.38 5.32 8.17 ...
## $ GSM801296: num 11.04 11.82 6.43 5.51 8.2 ...
## $ GSM801297: num 11.49 12.24 6.37 5.42 8.47 ...
## $ GSM801298: num 11.48 12.18 6.5 5.54 8.09 ...
## $ GSM801299: num 11.84 12.38 6.23 5.15 8.26 ...
## $ GSM801300: num 11.6 12.28 6.36 5.46 8.01 ...
## $ GSM801301: num 11.85 12.42 6.42 5.32 8.37 ...
## $ GSM801302: num 12.42 12.97 6.07 5.45 7.95 ...
## $ GSM801303: num 12.1 12.8 6 5.6 7.5 ...
## $ GSM801304: num 12.39 12.95 5.95 5.53 7.49 ...
## $ GSM801305: num 12.56 13.02 5.91 5.39 7.91 ...
## $ GSM801306: num 12.47 12.91 6.02 5.39 7.55 ...
## $ GSM801307: num 12.6 12.96 6.27 5.3 8 ...
## $ GSM801308: num 12.71 13.22 5.77 5.55 7.37 ...
## $ GSM801309: num 12.86 13.29 5.7 5.52 7.45 ...
## $ GSM801310: num 12.69 13.2 5.89 5.55 7.43 ...
## $ GSM801311: num 12.83 13.15 5.94 5.57 7.56 ...
## $ GSM801312: num 12.6 13.06 5.75 5.55 7.39 ...
## $ GSM801313: num 12.36 13.01 5.84 5.59 7.6 ...
## $ GSM801314: num 11.54 12.28 6.24 5.6 7.84 ...
## $ GSM801315: num 12.18 12.68 6.44 5.51 8.15 ...
## $ GSM801316: num 11.99 12.56 6.52 5.35 8.28 ...
## $ GSM801317: num 11.41 12.1 6.13 5.42 8.02 ...
## $ GSM801318: num 11.59 12.27 6.06 5.49 7.99 ...
## $ GSM801319: num 11.31 12.06 6.09 5.4 8.05 ...
## $ GSM801320: num 11.83 12.22 6.23 5.36 8.14 ...
## $ GSM801321: num 11.45 12.11 5.93 5.31 7.82 ...
## $ GSM801322: num 11.88 12.38 6.3 5.41 8.22 ...
## $ GSM801323: num 11.22 11.93 6.31 5.57 8.3 ...
## $ GSM801324: num 10.9 11.55 5.81 5.3 7.93 ...
## $ GSM801325: num 11.13 11.87 6.77 5.31 8.14 ...
## $ GSM801326: num 12.33 12.86 6.29 5.55 7.97 ...
## $ GSM801327: num 12.28 12.77 6.06 5.46 7.81 ...
## $ GSM801328: num 12.35 12.93 5.93 5.41 7.84 ...
## $ GSM801329: num 12.56 13.01 6.07 5.27 7.92 ...
## $ GSM801330: num 12.39 12.86 6.06 5.21 7.82 ...
## $ GSM801331: num 12.52 13.01 6.03 5.34 7.63 ...
## $ GSM801332: num 12.76 13.15 5.74 5.39 7.3 ...
## $ GSM801333: num 12.77 13.24 5.59 5.53 7.19 ...
## $ GSM801334: num 12.77 13.31 5.94 5.7 7.22 ...
## $ GSM801335: num 12.72 13.09 6.01 5.47 7.99 ...
## $ GSM801336: num 12.6 13.07 5.86 5.56 7.62 ...
## $ GSM801337: num 12.56 13.08 5.89 5.59 7.81 ...
## $ GSM801338: num 12.33 12.82 6.24 5.41 7.95 ...
## $ GSM801339: num 12.01 12.55 6.37 5.51 8.02 ...
## $ GSM801340: num 12.15 12.61 6.6 5.38 8.18 ...
## $ GSM801341: num 12 12.62 5.99 5.36 7.94 ...
## $ GSM801342: num 12.29 12.83 6.08 5.41 7.86 ...
## $ GSM801343: num 12.39 12.84 5.9 5.39 8.04 ...
Objective: Extract key metadata columns for downstream analyses.
#need to extract genotypes, growth nutrient, sampling time
#What they are called in metadata - geo_accession, genotype:ch1, growth nutrient:ch1, sampletime:ch1
#Creating a data frame - with 4 columns
library(dplyr)
imp.metadata.columns <- expression.metadata %>% dplyr::select("geo_accession", "genotype:ch1", "growth nutrient:ch1", "sampletime:ch1") %>% as.data.frame()
#For expression data - each column has the geo accession - so need to find the number of columns
# Number of samples
ncol(expression.data) # should be 90
## [1] 90
nrow(expression.metadata) # should be 90
## [1] 90
ncol(expression.data) == nrow(expression.metadata)
## [1] TRUE
# Check sample ID correspondence
colnames(expression.data) == rownames(expression.metadata)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Check uniqueness
unique.ids <- unique(colnames(expression.data))
length(unique.ids)
## [1] 90
Objective: Compute Euclidean distances, generate dendrograms, and color by factors.
#Loading packages
#install.packages("ggdendro") #Already loaded it so commenting it out
library("ggdendro")
# Calculate distance matrix and hierarchical clustering
distance.data <- dist(t(expression.data), method = "euclidean")
clustering <- hclust(distance.data, method = "average")
dendro <- ggdendrogram(clustering, rotate = FALSE, size = 1)
print(dendro)
What I’m doing: Coloring sample labels on the dendrogram by genotype to explore grouping.
# Order metadata to match dendrogram
extract.order <- clustering$order
order.imp.metadata <- imp.metadata.columns[extract.order, ]
# Define color mapping for genotypes
color.vector <- c("blue", "red", "green", "purple")
then.color <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
geno <- order.imp.metadata[i, "genotype:ch1"]
then.color[i] <- switch(
geno,
"Monsanto Z. mays line 1" = color.vector[1],
"Monsanto Z. mays line 2" = color.vector[2],
"Monsanto Z. mays line 3" = color.vector[3],
"Monsanto Z. mays line 4" = color.vector[4],
"black"
)
}
dendro + theme(axis.text.x = element_text(color = then.color))
What I’m doing: Coloring sample labels by growth nutrient condition.
# Define color mapping for nutrient treatments
color.vector.growth <- c("blue", "red", "green")
then.color.growth <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
nut <- order.imp.metadata[i, "growth nutrient:ch1"]
then.color.growth[i] <- switch(
nut,
"20mM NH4NO3" = color.vector.growth[1],
"2mM and then 20mM NH4NO3" = color.vector.growth[2],
"2mM NH4NO3" = color.vector.growth[3],
"black"
)
}
dendro + theme(axis.text.x = element_text(color = then.color.growth))
What I’m doing: Coloring sample labels by sampling time.
# Define color mapping for sample times
color.vector.time <- c("blue", "red", "green")
then.color.time <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
tm <- order.imp.metadata[i, "sampletime:ch1"]
then.color.time[i] <- switch(
tm,
"10AM day1" = color.vector.time[1],
"10AM day2" = color.vector.time[2],
"11PM day1" = color.vector.time[3],
"black"
)
}
dendro + theme(axis.text.x = element_text(color = then.color.time))
Objective: Retain genes with total expression above the median.
# Filter Low-Expression Genes
# Convert to data frame with gene_id column
expr_df <- tibble::rownames_to_column(as.data.frame(expression.data), var = "gene_id")
# Compute sum of expression values across numeric columns only
numeric_cols <- sapply(expr_df, is.numeric)
expr_df$sum.expr <- rowSums(expr_df[, numeric_cols], na.rm = TRUE)
# Calculate median of summed expression
median.sum <- median(expr_df$sum.expr)
# Subset genes above the median
gfiltered <- expr_df[expr_df$sum.expr > median.sum, ]
# Remove the helper column
gfiltered$sum.expr <- NULL
# Restore row names from gene_id and drop that column
rownames(gfiltered) <- gfiltered$gene_id
filtered.genes <- gfiltered[, setdiff(names(gfiltered), "gene_id")]
Objective: Identify genes significantly affected by genotype, nutrient, or time.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# The following function will return the smallest p-value associated with a non-intercept model term
# This is different from the model F-statistic which does incorporate the intercept term
# There are better ways to do this; use this way because it's relatively fast and kinda ok
# Arguments:
# 1) a vector of expression values
# 2) a dataframe with the three factors as three columns
msk3wayAnova <- function(expvalue, expfactors) {
templm<- lm(as.numeric(expvalue) ~ as.factor(expfactors[,1]) +
as.factor(expfactors[,2]) +
as.factor(expfactors[,3]))
return(min(summary(templm)$coef[-1,4]))
}
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# Apply the above function to each row of the expression values.
imp.metadata.columns.nolabels <- imp.metadata.columns[,2:4]
filtered.gene.expression.norowsums.1 <- filtered.genes [,-1]
filtered.gene.expression.norowsums <- filtered.gene.expression.norowsums.1[,-91]
anovaPvalues <- apply(filtered.gene.expression.norowsums, imp.metadata.columns.nolabels, MARGIN = 1, FUN = msk3wayAnova)
## Error in model.frame.default(formula = as.numeric(expvalue) ~ as.factor(expfactors[, : variable lengths differ (found for 'as.factor(expfactors[, 1])')
str(anovaPvalues)
## Error: object 'anovaPvalues' not found
max(anovaPvalues)
## Error: object 'anovaPvalues' not found
min(anovaPvalues)
## Error: object 'anovaPvalues' not found
Objective: Choose genes with the smallest ANOVA p-values.
#Identifying genes with the lowest p-values
#Creating a data frame with the p value
#For some reason I lost my gene names - so adding that again
pval.df <- tibble::rownames_to_column(as.data.frame(anovaPvalues), var = "gene")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'anovaPvalues' not found
top1000 <- pval.df %>%
arrange(anovaPvalues) %>%
slice(1:1000) %>%
pull(gene)
## Error: object 'pval.df' not found
selected.data <- filtered.genes[top1000, ]
## Error: object 'top1000' not found
p.value.data.frame <- cbind(filtered.genes,anovaPvalues)
## Error: object 'anovaPvalues' not found
library(dplyr)
arrange.p.value <- p.value.data.frame %>% arrange(anovaPvalues)
## Error: object 'p.value.data.frame' not found
lowest.p.value <- arrange.p.value[1:1000,] #this data frame has 2 extra columns for p-value and sum. Now that I have sorted it by p-value - I can get rid of those to create the 90 by 1000 data frame
## Error: object 'arrange.p.value' not found
lowest.p.value.final <- lowest.p.value[,1:91]
## Error: object 'lowest.p.value' not found
rownames(lowest.p.value.final) = NULL
## Error: object 'lowest.p.value.final' not found
lowest.p.value.final = column_to_rownames(lowest.p.value.final,var="gene")
## Error: object 'lowest.p.value.final' not found
Objective: Visualize z-scored expression for top genes.
#From class notes getting the z scores
centered.data <- sweep(lowest.p.value.final, 1, rowMeans(lowest.p.value.final), FUN = "-")
## Error: object 'lowest.p.value.final' not found
centered.data.std.dev <-apply(centered.data , 1, sd)
## Error: object 'centered.data' not found
recentered.data <- sweep(centered.data, 1, centered.data.std.dev, FUN = "/" )
## Error: object 'centered.data' not found
recentered.data$genes <- rownames(recentered.data)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'recentered.data' not found
recentered.data.melt<- melt(recentered.data)
## Error: object 'recentered.data' not found
library(MASS)
library(reshape2)
#Online - https://www.r-graph-gallery.com/79-levelplot-with-ggplot2.html
ggplot(recentered.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
## Error: object 'recentered.data.melt' not found
Objective: Cluster genes and reorder heatmap rows accordingly.
#Samething as before
#first calculating distance #Copied code from above
distance.data.new <- dist(recentered.data, method ="euclidean")
## Error: object 'recentered.data' not found
clustering.new <- hclust(distance.data.new , method = "average")
## Error: object 'distance.data.new' not found
clustering.new
## Error: object 'clustering.new' not found
#One of the things in the list is the order - extract the order - make vector
extract.order.new <- clustering.new$order
## Error: object 'clustering.new' not found
ordered.names.data <- rownames(recentered.data)[extract.order.new]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'recentered.data' not found
ggplot(recentered.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + scale_y_discrete(limits= ordered.names.data)
## Error: object 'recentered.data.melt' not found
arrange.genotypes <-imp.metadata.columns %>% arrange(`genotype:ch1`)
arrange.genotypes.geo <- arrange.genotypes[,1]
ggplot(recentered.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.geo )
## Error: object 'recentered.data.melt' not found
arrange.growth <-imp.metadata.columns %>% arrange(`growth nutrient:ch1`)
arrange.genotypes.growth <- arrange.growth[,1]
ggplot(recentered.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.growth)
## Error: object 'recentered.data.melt' not found
arrange.time <-imp.metadata.columns %>% arrange(`sampletime:ch1`)
arrange.genotypes.time <- arrange.time[,1]
ggplot(recentered.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.time)
## Error: object 'recentered.data.melt' not found
Based on the heat map - we plotted genes on the Y axis and GSM on the X axis, so if we look along the X-axis we can see how for the one gene how the expression is changing. Therefore, we can say that gene is varying alot if we see more patterning along the X-axis. On the basis of this, when we group by genotypes, we see alot of change (more patterning), showing that genotypes don’t have a profound effect on the change in the gene expression. This correlates with the dendogram that we saw - where when we order by genotypes, we dont see a really see groupings for similar genotypes. When we look at the heatmap for time - we see that when we group by time, we see patterns that are constant - or large clusters, indicating time does affect gene expression. Therefore sample time based on heatmaps has the most effect on gene expression.
Using the heatmaps in the previous section, we see - two clusters that change according to sample time, and one cluster that changes according to nutrient.
#3 clusters - based on the above
cut.the.tree <- data.frame(group = cutree(clustering.new, k = 3)) #Storing it as a data drame - because otherwise it returns a vector
## Error: object 'clustering.new' not found
#Creating a heatmap with only the genes in the largest group. Order the x-axis by `sampletime`.
#Determine how many genes have 1 or 2 or 3
sum(cut.the.tree == 1) #409
## Error: object 'cut.the.tree' not found
sum(cut.the.tree == 2) #191
## Error: object 'cut.the.tree' not found
sum(cut.the.tree == 3) #400
## Error: object 'cut.the.tree' not found
#The largest group is 1
cut.the.tree<- rownames_to_column(cut.the.tree)
## Error: object 'cut.the.tree' not found
#get names of gnes in group 1
largest.group.gene <- cut.the.tree[cut.the.tree$group == 1, 1] %>% unlist()
## Error: object 'cut.the.tree' not found
#TRYING - DOES NOT WORK
# cut.1 <- cut.the.tree[cut.the.tree$group == 1,]
# genes.cut.1<- cut.1[,1]
# genes.cut.1.new <- as.data.frame(genes.cut.1) #converting to data frame
largest.group.gene.data<- recentered.data[largest.group.gene,]
## Error: object 'recentered.data' not found
# largest.genes.gene <- as.data.frame(largest.group.gene.1)
largest.group.gene.data.melt <- melt(largest.group.gene.data)
## Error: object 'largest.group.gene.data' not found
#TRYING! DOESNT WORK _ DO NOT USE
#largest.group.gene <- melt(largest.group.gene) #I have two columns of the same thing - I will remove it to confuse
#largest.group.gene<- largest.group.gene[,-3]
#So now doing the time thing like before
# arrange.time.new <-imp.metadata.columns %>% filter()
# arrange.genotypes.time <- arrange.time[,1]
# extract.order.new <- clustering.new$order
# ordered.names.data <- rownames(recentered.data)[extract.order.new]
#GG plot code
ggplot(largest.group.gene.data.melt, aes(x = variable, y= genes, fill= value)) +
geom_tile() +
scale_fill_gradient2(
low = "blue", na.value = "white",
high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())+scale_x_discrete(limits= arrange.genotypes.time)
## Error: object 'largest.group.gene.data.melt' not found
library(cluster)
empty.data.frame <-data.frame(ncol = 3)
k.values.new <- c(3,4,5)
for (i in 1:length(k.values.new)) {
empty.data.frame.new <- cutree(clustering.new, k = k.values.new[i])
empty.data.frame <- cbind(empty.data.frame, empty.data.frame.new)
}
## Error: object 'clustering.new' not found
#These columns generated have bad names - renaming it
three.cluster.names <- c("ncol","3 cluster", "4 cluster", "5 cluster")
colnames(empty.data.frame) <- three.cluster.names
## Error in names(x) <- value: 'names' attribute [4] must be the same length as the vector [1]
#Calculate silhoutte
three.cluster <- silhouette(empty.data.frame[,2], dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 2): undefined columns selected
four.cluster <- silhouette(empty.data.frame[,3], dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 3): undefined columns selected
five.cluster <- silhouette(empty.data.frame[,4], dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 4): undefined columns selected
#Make a plot now
plot.three.cluster <- plot(three.cluster, border = NA,main = "3 cluster")
## Error: object 'three.cluster' not found
plot.four.cluster <- plot(four.cluster, border = NA, main = "4 cluster")
## Error: object 'four.cluster' not found
plot.five.cluster <- plot(five.cluster, border = NA, main = "5 cluster")
## Error: object 'five.cluster' not found
Comparing the results to your original guess for the optimal number of groups. The number of optimal of clusters should be 3. This correlates with what we set before. We can infer this with the average silhoutte width - which is the maximum for 3 clusters.
Objective: Perform PCA on the top genes and plot PC1 vs PC2.
#Using my code from HW11
straight.line.pca <- prcomp(t(lowest.p.value.final), center=T, scale=T)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 't': object 'lowest.p.value.final' not found
summary(straight.line.pca)
## Error: object 'straight.line.pca' not found
growth.time.experiment.data <- imp.metadata.columns[, 3:4]
straight.line.pca.time.exp <- as.data.frame(cbind(as.data.frame(straight.line.pca$x), growth.time.experiment.data))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'straight.line.pca' not found
ggplot(straight.line.pca.time.exp, aes(x=PC1, y=PC2, shape=`growth nutrient:ch1` , col=`sampletime:ch1`)) +
geom_point() +
scale_fill_brewer() +
labs(x="PC1 (68.11%)", y="PC2 (20.19%)")
## Error: object 'straight.line.pca.time.exp' not found
There are two clusters along the PC1. These seem to predominantly separated by time, like for cluster one on the PC1 axis - it all seems to be predominantly of 10am day1, while the second cluster seems to be all 11pm day2. Considering PC1 explains the most variance, we see the two clusters that far away from each from each, so time has the most important effect on gene expression, and 11 pm accounts for the most variability in gene expression. #With respect to the PC2 they are clustered primarily as per nutrient, and we can see three clusters. Since PC2 explains the next most important variability after PC1, nutrient has the second most important effect on gene expression. The third cluster along the PC2 for 20mM NH4NO3 contributes to the most variability in gene expression for the nutrient conditions.
These results correlate with earlier analysis and we see that sampling time has the most effect on the variability in gene expression, followed by nutrients and then genotype. For the dendogram we could see for sampling time, the groupings showed a pattern, and the same was explained by the heat map, where for the time when we grouped it together, we could see pattern (and constantly changing patterns).