Introduction

This document explores gene expression signatures of nitrogen-use efficiency in maize.
We analyze data from GSE32361, validate sample metadata, perform clustering, filter for reliable genes,
apply ANOVA to rank treatment-sensitive genes, visualize patterns via heatmaps, assess cluster quality, and conduct PCA.

1. Data Loading and Inspection

Objective: Download and inspect the structure of the microarray data.

#BiocManager::install("GEOquery") #already loaded it
library("GEOquery") #Otherwise wont work
gse<-getGEO("GSE32361")

# Extract expression and metadata
expression.data <- as.data.frame(exprs(gse[[1]])) #This exprs function will give me the expression from the gse I created. The gse is a list with 1 element - so need to tell it to find find from the first element of the list.
expression.metadata <- pData(gse[[1]])

# Inspect data structure
str(expression.data)
## 'data.frame':    84246 obs. of  90 variables:
##  $ GSM801254: num  11.58 12.2 6.28 5.51 8.22 ...
##  $ GSM801255: num  11.71 12.34 6.15 5.56 8 ...
##  $ GSM801256: num  11.55 12.2 6.25 5.53 8.23 ...
##  $ GSM801257: num  11.2 11.95 6.31 5.35 8.39 ...
##  $ GSM801258: num  11.73 12.36 6.45 5.25 8.36 ...
##  $ GSM801259: num  11.68 12.32 6.1 5.35 7.79 ...
##  $ GSM801260: num  12.55 13.01 6.34 5.46 7.81 ...
##  $ GSM801261: num  11.93 12.53 5.42 5.67 6.79 ...
##  $ GSM801262: num  12.08 12.72 6.43 5.34 7.91 ...
##  $ GSM801263: num  12.51 12.95 5.85 5.32 7.87 ...
##  $ GSM801264: num  12.5 12.91 6.08 5.2 7.79 ...
##  $ GSM801265: num  12.39 12.87 6.07 5.37 7.75 ...
##  $ GSM801266: num  12.72 13.2 6.05 5.33 7.44 ...
##  $ GSM801267: num  12.95 13.3 5.8 5.52 7.6 ...
##  $ GSM801268: num  12.77 13.12 5.78 5.36 7.47 ...
##  $ GSM801269: num  12.44 12.92 5.76 5.68 7.48 ...
##  $ GSM801270: num  12.61 13.12 5.88 5.8 7.46 ...
##  $ GSM801271: num  12.69 13.14 5.91 5.64 7.44 ...
##  $ GSM801272: num  11.67 12.36 6.76 5.34 8.01 ...
##  $ GSM801273: num  11.74 12.38 5.9 5.41 7.67 ...
##  $ GSM801274: num  12.2 12.62 6.62 5.31 7.95 ...
##  $ GSM801275: num  11.64 12.32 6.18 5.61 7.82 ...
##  $ GSM801276: num  11.71 12.4 6.47 5.48 8.16 ...
##  $ GSM801277: num  11.47 12.1 5.96 5.51 7.62 ...
##  $ GSM801278: num  11.98 12.44 6.35 5.43 8.38 ...
##  $ GSM801279: num  11.89 12.54 6.04 5.36 8.1 ...
##  $ GSM801280: num  11.88 12.46 6.42 5.31 8.35 ...
##  $ GSM801281: num  12.28 12.9 6.31 5.52 7.58 ...
##  $ GSM801282: num  12.2 12.78 5.99 5.38 7.94 ...
##  $ GSM801283: num  12.07 12.75 5.7 5.45 7.18 ...
##  $ GSM801284: num  12.36 12.79 5.85 5.41 7.6 ...
##  $ GSM801285: num  12.49 12.97 5.98 5.4 7.92 ...
##  $ GSM801286: num  12.38 12.82 5.8 5.35 7.72 ...
##  $ GSM801287: num  12.73 13.22 5.76 5.5 7.4 ...
##  $ GSM801288: num  12.77 13.25 5.37 5.57 6.77 ...
##  $ GSM801289: num  12.94 13.35 5.76 5.59 7.26 ...
##  $ GSM801290: num  12.72 13.12 5.74 5.57 7.69 ...
##  $ GSM801291: num  12.75 13.18 5.96 5.54 7.69 ...
##  $ GSM801292: num  12.81 13.27 5.91 5.44 7.76 ...
##  $ GSM801293: num  12.41 12.85 6.16 5.46 8.1 ...
##  $ GSM801294: num  11.63 12.31 6.27 5.58 7.78 ...
##  $ GSM801295: num  11.98 12.52 6.38 5.32 8.17 ...
##  $ GSM801296: num  11.04 11.82 6.43 5.51 8.2 ...
##  $ GSM801297: num  11.49 12.24 6.37 5.42 8.47 ...
##  $ GSM801298: num  11.48 12.18 6.5 5.54 8.09 ...
##  $ GSM801299: num  11.84 12.38 6.23 5.15 8.26 ...
##  $ GSM801300: num  11.6 12.28 6.36 5.46 8.01 ...
##  $ GSM801301: num  11.85 12.42 6.42 5.32 8.37 ...
##  $ GSM801302: num  12.42 12.97 6.07 5.45 7.95 ...
##  $ GSM801303: num  12.1 12.8 6 5.6 7.5 ...
##  $ GSM801304: num  12.39 12.95 5.95 5.53 7.49 ...
##  $ GSM801305: num  12.56 13.02 5.91 5.39 7.91 ...
##  $ GSM801306: num  12.47 12.91 6.02 5.39 7.55 ...
##  $ GSM801307: num  12.6 12.96 6.27 5.3 8 ...
##  $ GSM801308: num  12.71 13.22 5.77 5.55 7.37 ...
##  $ GSM801309: num  12.86 13.29 5.7 5.52 7.45 ...
##  $ GSM801310: num  12.69 13.2 5.89 5.55 7.43 ...
##  $ GSM801311: num  12.83 13.15 5.94 5.57 7.56 ...
##  $ GSM801312: num  12.6 13.06 5.75 5.55 7.39 ...
##  $ GSM801313: num  12.36 13.01 5.84 5.59 7.6 ...
##  $ GSM801314: num  11.54 12.28 6.24 5.6 7.84 ...
##  $ GSM801315: num  12.18 12.68 6.44 5.51 8.15 ...
##  $ GSM801316: num  11.99 12.56 6.52 5.35 8.28 ...
##  $ GSM801317: num  11.41 12.1 6.13 5.42 8.02 ...
##  $ GSM801318: num  11.59 12.27 6.06 5.49 7.99 ...
##  $ GSM801319: num  11.31 12.06 6.09 5.4 8.05 ...
##  $ GSM801320: num  11.83 12.22 6.23 5.36 8.14 ...
##  $ GSM801321: num  11.45 12.11 5.93 5.31 7.82 ...
##  $ GSM801322: num  11.88 12.38 6.3 5.41 8.22 ...
##  $ GSM801323: num  11.22 11.93 6.31 5.57 8.3 ...
##  $ GSM801324: num  10.9 11.55 5.81 5.3 7.93 ...
##  $ GSM801325: num  11.13 11.87 6.77 5.31 8.14 ...
##  $ GSM801326: num  12.33 12.86 6.29 5.55 7.97 ...
##  $ GSM801327: num  12.28 12.77 6.06 5.46 7.81 ...
##  $ GSM801328: num  12.35 12.93 5.93 5.41 7.84 ...
##  $ GSM801329: num  12.56 13.01 6.07 5.27 7.92 ...
##  $ GSM801330: num  12.39 12.86 6.06 5.21 7.82 ...
##  $ GSM801331: num  12.52 13.01 6.03 5.34 7.63 ...
##  $ GSM801332: num  12.76 13.15 5.74 5.39 7.3 ...
##  $ GSM801333: num  12.77 13.24 5.59 5.53 7.19 ...
##  $ GSM801334: num  12.77 13.31 5.94 5.7 7.22 ...
##  $ GSM801335: num  12.72 13.09 6.01 5.47 7.99 ...
##  $ GSM801336: num  12.6 13.07 5.86 5.56 7.62 ...
##  $ GSM801337: num  12.56 13.08 5.89 5.59 7.81 ...
##  $ GSM801338: num  12.33 12.82 6.24 5.41 7.95 ...
##  $ GSM801339: num  12.01 12.55 6.37 5.51 8.02 ...
##  $ GSM801340: num  12.15 12.61 6.6 5.38 8.18 ...
##  $ GSM801341: num  12 12.62 5.99 5.36 7.94 ...
##  $ GSM801342: num  12.29 12.83 6.08 5.41 7.86 ...
##  $ GSM801343: num  12.39 12.84 5.9 5.39 8.04 ...

2. Metadata Preparation and perform checks

Objective: Extract key metadata columns for downstream analyses.

#need to extract genotypes, growth nutrient, sampling time
#What they are called in metadata - geo_accession, genotype:ch1,  growth nutrient:ch1, sampletime:ch1

#Creating a data frame - with 4 columns 

library(dplyr)
imp.metadata.columns <- expression.metadata %>% dplyr::select("geo_accession", "genotype:ch1", "growth nutrient:ch1", "sampletime:ch1") %>% as.data.frame()
#For expression data - each column has the geo accession - so need to find the number of columns
# Number of samples
ncol(expression.data) # should be 90
## [1] 90
nrow(expression.metadata) # should be 90
## [1] 90
ncol(expression.data) == nrow(expression.metadata)
## [1] TRUE
# Check sample ID correspondence
colnames(expression.data) == rownames(expression.metadata)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Check uniqueness
unique.ids <- unique(colnames(expression.data))
length(unique.ids)
## [1] 90

3. Unsupervised Clustering of Samples

Objective: Compute Euclidean distances, generate dendrograms, and color by factors.

#Loading packages 
#install.packages("ggdendro") #Already loaded it so commenting it out
library("ggdendro")

# Calculate distance matrix and hierarchical clustering
distance.data <- dist(t(expression.data), method = "euclidean")
clustering <- hclust(distance.data, method = "average")

dendro <- ggdendrogram(clustering, rotate = FALSE, size = 1)
print(dendro)

3.1 Color Dendrogram by Genotype

What I’m doing: Coloring sample labels on the dendrogram by genotype to explore grouping.

# Order metadata to match dendrogram
extract.order <- clustering$order
order.imp.metadata <- imp.metadata.columns[extract.order, ]

# Define color mapping for genotypes
color.vector <- c("blue", "red", "green", "purple")
then.color <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
  geno <- order.imp.metadata[i, "genotype:ch1"]
  then.color[i] <- switch(
    geno,
    "Monsanto Z. mays line 1" = color.vector[1],
    "Monsanto Z. mays line 2" = color.vector[2],
    "Monsanto Z. mays line 3" = color.vector[3],
    "Monsanto Z. mays line 4" = color.vector[4],
    "black"
  )
}
dendro + theme(axis.text.x = element_text(color = then.color))

3.2 Color Dendrogram by Growth Nutrient

What I’m doing: Coloring sample labels by growth nutrient condition.

# Define color mapping for nutrient treatments
color.vector.growth <- c("blue", "red", "green")
then.color.growth <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
  nut <- order.imp.metadata[i, "growth nutrient:ch1"]
  then.color.growth[i] <- switch(
    nut,
    "20mM NH4NO3" = color.vector.growth[1],
    "2mM and then 20mM NH4NO3" = color.vector.growth[2],
    "2mM NH4NO3" = color.vector.growth[3],
    "black"
  )
}
dendro + theme(axis.text.x = element_text(color = then.color.growth))

3.3 Color Dendrogram by Sample Time

What I’m doing: Coloring sample labels by sampling time.

# Define color mapping for sample times
color.vector.time <- c("blue", "red", "green")
then.color.time <- character(nrow(order.imp.metadata))
for (i in seq_len(nrow(order.imp.metadata))) {
  tm <- order.imp.metadata[i, "sampletime:ch1"]
  then.color.time[i] <- switch(
    tm,
    "10AM day1" = color.vector.time[1],
    "10AM day2" = color.vector.time[2],
    "11PM day1" = color.vector.time[3],
    "black"
  )
}
dendro + theme(axis.text.x = element_text(color = then.color.time))

4. Filter Low-Expression Genes

Objective: Retain genes with total expression above the median.

# Filter Low-Expression Genes
# Convert to data frame with gene_id column
expr_df <- tibble::rownames_to_column(as.data.frame(expression.data), var = "gene_id")

# Compute sum of expression values across numeric columns only
numeric_cols <- sapply(expr_df, is.numeric)
expr_df$sum.expr <- rowSums(expr_df[, numeric_cols], na.rm = TRUE)

# Calculate median of summed expression
median.sum <- median(expr_df$sum.expr)

# Subset genes above the median
gfiltered <- expr_df[expr_df$sum.expr > median.sum, ]

# Remove the helper column
gfiltered$sum.expr <- NULL

# Restore row names from gene_id and drop that column
rownames(gfiltered) <- gfiltered$gene_id
filtered.genes <- gfiltered[, setdiff(names(gfiltered), "gene_id")]

5. Three-Way ANOVA for Gene Prioritization

Objective: Identify genes significantly affected by genotype, nutrient, or time.

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# The following function will return the smallest p-value associated with a non-intercept model term
# This is different from the model F-statistic which does incorporate the intercept term
# There are better ways to do this; use this way because it's relatively fast and kinda ok
# Arguments:
#   1) a vector of expression values
#   2) a dataframe with the three factors as three columns
msk3wayAnova <- function(expvalue, expfactors) {
  templm<- lm(as.numeric(expvalue) ~ as.factor(expfactors[,1]) +
              as.factor(expfactors[,2]) +
              as.factor(expfactors[,3]))
  return(min(summary(templm)$coef[-1,4]))
}
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# Apply the above function to each row of the expression values.
imp.metadata.columns.nolabels <- imp.metadata.columns[,2:4]
filtered.gene.expression.norowsums.1 <- filtered.genes [,-1]
filtered.gene.expression.norowsums <- filtered.gene.expression.norowsums.1[,-91]


anovaPvalues <- apply(filtered.gene.expression.norowsums, imp.metadata.columns.nolabels, MARGIN = 1, FUN = msk3wayAnova)
## Error in model.frame.default(formula = as.numeric(expvalue) ~ as.factor(expfactors[, : variable lengths differ (found for 'as.factor(expfactors[, 1])')
str(anovaPvalues)
## Error: object 'anovaPvalues' not found
max(anovaPvalues)
## Error: object 'anovaPvalues' not found
min(anovaPvalues)
## Error: object 'anovaPvalues' not found

6. Select Top 1000 Genes

Objective: Choose genes with the smallest ANOVA p-values.

#Identifying genes with the lowest p-values
#Creating a data frame with the p value

#For some reason I lost my gene names - so adding that again
pval.df <- tibble::rownames_to_column(as.data.frame(anovaPvalues), var = "gene")
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'anovaPvalues' not found
top1000 <- pval.df %>%
  arrange(anovaPvalues) %>%
  slice(1:1000) %>%
  pull(gene)
## Error: object 'pval.df' not found
selected.data <- filtered.genes[top1000, ]
## Error: object 'top1000' not found
p.value.data.frame <- cbind(filtered.genes,anovaPvalues)
## Error: object 'anovaPvalues' not found
library(dplyr)


arrange.p.value <- p.value.data.frame %>% arrange(anovaPvalues) 
## Error: object 'p.value.data.frame' not found
lowest.p.value <- arrange.p.value[1:1000,] #this data frame has 2 extra columns for p-value and sum. Now that I have sorted it by p-value - I can get rid of those to create the 90 by 1000 data frame
## Error: object 'arrange.p.value' not found
lowest.p.value.final <- lowest.p.value[,1:91]
## Error: object 'lowest.p.value' not found
rownames(lowest.p.value.final) = NULL
## Error: object 'lowest.p.value.final' not found
lowest.p.value.final = column_to_rownames(lowest.p.value.final,var="gene")
## Error: object 'lowest.p.value.final' not found

7. Heatmap of Top Genes

Objective: Visualize z-scored expression for top genes.

#From class notes getting the z scores
centered.data <- sweep(lowest.p.value.final, 1, rowMeans(lowest.p.value.final), FUN = "-")
## Error: object 'lowest.p.value.final' not found
centered.data.std.dev <-apply(centered.data , 1, sd)
## Error: object 'centered.data' not found
recentered.data <- sweep(centered.data, 1, centered.data.std.dev, FUN = "/" ) 
## Error: object 'centered.data' not found
recentered.data$genes <- rownames(recentered.data)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'recentered.data' not found
recentered.data.melt<- melt(recentered.data)
## Error: object 'recentered.data' not found
library(MASS)
library(reshape2)
#Online - https://www.r-graph-gallery.com/79-levelplot-with-ggplot2.html
  ggplot(recentered.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank()) 
## Error: object 'recentered.data.melt' not found

8. Gene Clustering Heatmap

Objective: Cluster genes and reorder heatmap rows accordingly.

#Samething as before 
#first calculating distance  #Copied code from above

distance.data.new <- dist(recentered.data, method ="euclidean")
## Error: object 'recentered.data' not found
clustering.new <- hclust(distance.data.new , method = "average")
## Error: object 'distance.data.new' not found
clustering.new
## Error: object 'clustering.new' not found
#One of the things in the list is the order - extract the order - make vector 

extract.order.new <- clustering.new$order
## Error: object 'clustering.new' not found
ordered.names.data <- rownames(recentered.data)[extract.order.new]
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'rownames': object 'recentered.data' not found
ggplot(recentered.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank()) + scale_y_discrete(limits= ordered.names.data)
## Error: object 'recentered.data.melt' not found

8.1 Reorder the biological samples in the heatmap based on the genotypes. Print the new heatmap.

arrange.genotypes <-imp.metadata.columns %>% arrange(`genotype:ch1`)
arrange.genotypes.geo <- arrange.genotypes[,1]


ggplot(recentered.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.geo  )
## Error: object 'recentered.data.melt' not found

8.2 Reorder the biological samples in the heatmap based on the growth nutrients. Print the new heatmap.

arrange.growth <-imp.metadata.columns %>% arrange(`growth nutrient:ch1`)
arrange.genotypes.growth <- arrange.growth[,1]


ggplot(recentered.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.growth)
## Error: object 'recentered.data.melt' not found

8.3 Reorder the biological samples in the heatmap based on the sample time. Print the new heatmap.

arrange.time <-imp.metadata.columns %>% arrange(`sampletime:ch1`)
arrange.genotypes.time <- arrange.time[,1]


ggplot(recentered.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank()) + scale_y_discrete(limits= ordered.names.data) + scale_x_discrete(limits= arrange.genotypes.time)
## Error: object 'recentered.data.melt' not found

8.4 Interpreting patterns using the heatmaps

Based on the heat map - we plotted genes on the Y axis and GSM on the X axis, so if we look along the X-axis we can see how for the one gene how the expression is changing. Therefore, we can say that gene is varying alot if we see more patterning along the X-axis. On the basis of this, when we group by genotypes, we see alot of change (more patterning), showing that genotypes don’t have a profound effect on the change in the gene expression. This correlates with the dendogram that we saw - where when we order by genotypes, we dont see a really see groupings for similar genotypes. When we look at the heatmap for time - we see that when we group by time, we see patterns that are constant - or large clusters, indicating time does affect gene expression. Therefore sample time based on heatmaps has the most effect on gene expression.

8.5 Creating discrete cluster

Using the heatmaps in the previous section, we see - two clusters that change according to sample time, and one cluster that changes according to nutrient.

#3 clusters - based on the above

cut.the.tree <- data.frame(group = cutree(clustering.new, k = 3)) #Storing it as a data drame - because otherwise it returns a vector
## Error: object 'clustering.new' not found
#Creating a heatmap with only the genes in the largest group. Order the x-axis by `sampletime`.
#Determine how many genes have 1 or 2 or 3
sum(cut.the.tree == 1) #409
## Error: object 'cut.the.tree' not found
sum(cut.the.tree == 2) #191
## Error: object 'cut.the.tree' not found
sum(cut.the.tree == 3) #400
## Error: object 'cut.the.tree' not found
#The largest group is 1

cut.the.tree<- rownames_to_column(cut.the.tree)
## Error: object 'cut.the.tree' not found
#get names of gnes in group 1

largest.group.gene <- cut.the.tree[cut.the.tree$group == 1, 1] %>% unlist()
## Error: object 'cut.the.tree' not found
#TRYING - DOES NOT WORK
# cut.1 <- cut.the.tree[cut.the.tree$group == 1,] 
# genes.cut.1<- cut.1[,1]
# genes.cut.1.new <- as.data.frame(genes.cut.1) #converting to data frame

largest.group.gene.data<- recentered.data[largest.group.gene,]
## Error: object 'recentered.data' not found
# largest.genes.gene <- as.data.frame(largest.group.gene.1)
largest.group.gene.data.melt <- melt(largest.group.gene.data)
## Error: object 'largest.group.gene.data' not found
#TRYING! DOESNT WORK _ DO NOT USE
#largest.group.gene <- melt(largest.group.gene) #I have two columns of the same thing - I will remove it to confuse
#largest.group.gene<- largest.group.gene[,-3]


#So now doing the time thing like before 
# arrange.time.new <-imp.metadata.columns %>% filter()
# arrange.genotypes.time <- arrange.time[,1]
# extract.order.new <- clustering.new$order
# ordered.names.data <- rownames(recentered.data)[extract.order.new]

#GG plot code
ggplot(largest.group.gene.data.melt, aes(x  = variable, y= genes, fill=  value)) + 
    geom_tile() +
    scale_fill_gradient2(
                        low = "blue", na.value = "white",
                        high = "red") + theme(axis.text.x = element_blank(), axis.ticks.x =  element_blank()) + theme(axis.text.y =  element_blank(),  axis.ticks.y =  element_blank())+scale_x_discrete(limits= arrange.genotypes.time)
## Error: object 'largest.group.gene.data.melt' not found

9 Silhouette Analysis for Gene Clusters

library(cluster)

empty.data.frame <-data.frame(ncol = 3) 
k.values.new <- c(3,4,5)
for (i in 1:length(k.values.new))  {
  empty.data.frame.new <- cutree(clustering.new, k = k.values.new[i])
  empty.data.frame <-  cbind(empty.data.frame, empty.data.frame.new)
 }
## Error: object 'clustering.new' not found
#These columns generated have bad names - renaming it
three.cluster.names <- c("ncol","3 cluster", "4 cluster", "5 cluster")
colnames(empty.data.frame) <- three.cluster.names
## Error in names(x) <- value: 'names' attribute [4] must be the same length as the vector [1]
#Calculate silhoutte
three.cluster <- silhouette(empty.data.frame[,2],  dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 2): undefined columns selected
four.cluster <- silhouette(empty.data.frame[,3],  dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 3): undefined columns selected
five.cluster <- silhouette(empty.data.frame[,4],  dist(recentered.data))
## Error in `[.data.frame`(empty.data.frame, , 4): undefined columns selected
#Make a plot now

plot.three.cluster <- plot(three.cluster, border = NA,main = "3 cluster")
## Error: object 'three.cluster' not found
plot.four.cluster <- plot(four.cluster, border = NA, main = "4 cluster")
## Error: object 'four.cluster' not found
plot.five.cluster <- plot(five.cluster, border = NA, main = "5 cluster")
## Error: object 'five.cluster' not found

Comparing the results to your original guess for the optimal number of groups. The number of optimal of clusters should be 3. This correlates with what we set before. We can infer this with the average silhoutte width - which is the maximum for 3 clusters.

10. Principal Component Analysis

Objective: Perform PCA on the top genes and plot PC1 vs PC2.

#Using my code from HW11
straight.line.pca <- prcomp(t(lowest.p.value.final), center=T, scale=T)
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 't': object 'lowest.p.value.final' not found
summary(straight.line.pca)
## Error: object 'straight.line.pca' not found

10.1 Ploting PC1 and PC2

growth.time.experiment.data <- imp.metadata.columns[, 3:4] 

straight.line.pca.time.exp <- as.data.frame(cbind(as.data.frame(straight.line.pca$x), growth.time.experiment.data))
## Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': object 'straight.line.pca' not found
ggplot(straight.line.pca.time.exp, aes(x=PC1, y=PC2, shape=`growth nutrient:ch1` , col=`sampletime:ch1`)) +
  geom_point() +
  scale_fill_brewer() +
  labs(x="PC1 (68.11%)", y="PC2 (20.19%)")
## Error: object 'straight.line.pca.time.exp' not found

10.2 Interpretion the data using the PCA plot

There are two clusters along the PC1. These seem to predominantly separated by time, like for cluster one on the PC1 axis - it all seems to be predominantly of 10am day1, while the second cluster seems to be all 11pm day2. Considering PC1 explains the most variance, we see the two clusters that far away from each from each, so time has the most important effect on gene expression, and 11 pm accounts for the most variability in gene expression. #With respect to the PC2 they are clustered primarily as per nutrient, and we can see three clusters. Since PC2 explains the next most important variability after PC1, nutrient has the second most important effect on gene expression. The third cluster along the PC2 for 20mM NH4NO3 contributes to the most variability in gene expression for the nutrient conditions.

These results correlate with earlier analysis and we see that sampling time has the most effect on the variability in gene expression, followed by nutrients and then genotype. For the dendogram we could see for sampling time, the groupings showed a pattern, and the same was explained by the heat map, where for the time when we grouped it together, we could see pattern (and constantly changing patterns).