---
output: pdf_document
---
# Protocol to Build Interaction Network from Co-Expression Data

Set knitr global properties
```{r setup, include=F}
knitr::opts_chunk$set(echo=T, message=F, warning=F)
```

## load common functions defined in Utils.R
```{r load Utils.R}
source("Utils.R")
```

## Package list needed
igraph: to ultimately build a co-expression graph  
reshape: function cast()  
```{r package list}
CRAN.packages <- c("igraph", "reshape")
```

## Package install and library loading
```{r libraries}
install.packages.if.necessary(CRAN.packages)
```

## Set data directory, create it if necessary
```{r set data dir}
data.dir <- "networks"
if (!file.exists(data.dir)) {
  dir.create(data.dir)
}
```

## Build Co-Expression Network from RNA-seq expression data  
Site: http://www.proteinatlas.org/about/download  
Work on RNA data: "rna_tissue.csv.zip" and "rna_celline.csv.zip" expression data file from Human Protein Atlas in the "networks" directory  
These data correspond to RNA-seq expression levels in 45 cell lines and 32 tissues.  
It contains Ensembl genes, corresponding FPKM values (Fragments Per Kilobase Of Exon Per Million Fragments Mapped) and HGNC canonical names 

### Fetch data from the internet, build data.frame, work on interesting columns (Gene.name, Sample, Value), 
```{r fetch data}
tissue.url <- "http://www.proteinatlas.org/download/rna_tissue.csv.zip"
data.tissue <- get.data.from.zip(download.if.necessary(tissue.url, data.dir))
data.tissue <- data.tissue[, c(2,3,4)]

celline.url <- "http://www.proteinatlas.org/download/rna_celline.csv.zip"
data.celline <- get.data.from.zip(download.if.necessary(celline.url, data.dir))
data.celline <- data.celline[, c(2,3,4)]
```

### Merge both data sets
```{r merging data}
#data <- rbind(data.tissue, data.celline) # too heavy, hence use tissue-only as a workaround
data <- data.tissue
```

### Find the mean of Value by Sample for each Gene.name.  
Output a matrix with Gene.name as first column and, in every other column compute the mean Value for each given Sample  
Then transform "Gene.name"" column into row names
```{r co-expr matrix}
coexpr <- cast(data, Gene.name ~ Sample, mean, value="Value")
rownames(coexpr) <- coexpr[, 1]
coexpr[, 1] <- NULL
```

### Compute Spearman correlation (correlation between ranks of values)  
Need to take the transposed matrix (rows <-> columns)  
Values rounded to the 2nd decimal  
Be aware it is a 3GB matrix
```{r threshold}
matrix.coexpr <- as.matrix(coexpr)
CR <- round(cor(t(matrix.coexpr), method="spearman"), 2)
```

### Select correlation threshold and build a network  
The correlation threshold is set to 0.7. All pairs of genes that have a correlation above this threshold (positive or negative) are considered as linked in the Co-Expression network.  
Select indexes over the threshold  
Retrieve names matching the index  
Construct a network from indexes  
Transform matrix into data.frame  
And change column names
```{r build graph as matrix}
inds <- which(abs(CR[1:nrow(CR), ]) >= 0.9, arr.ind=T)
rnames <- rownames(CR)[inds[, 1]]
cnames <- colnames(CR)[inds[, 2]]
net <- apply(cbind(rnames, cnames), 1, function(x) unname(unlist(x)))
net <- as.data.frame(t(net))
colnames(net) <- c("Gene1", "Gene2")
```

### Graph conversion, simplification and export to file
```{r build graph as igraph}
net <- build.network(as.matrix(net))
ecount(net)
gorder(net)
write.graph(net, file.path(data.dir, "Co-Expression.gr"), format="ncol", weights=NULL)
```