Title: | Clustering of Omics Data of Multiple Types with a Multilayer Network Representation |
---|---|
Description: | Omics data come in different forms: gene expression, methylation, copy number, protein measurements and more. 'NCutYX' allows clustering of variables, of samples, and both variables and samples (biclustering), while incorporating the dependencies across multiple types of Omics data. (SJ Teran Hidalgo et al (2017), <doi:10.1186/s12864-017-3990-1>). |
Authors: | Sebastian J. Teran Hidalgo, Shuangge Ma, Ruofan Bie |
Maintainer: | Sebastian J. Teran Hidalgo <[email protected]> |
License: | GPL-3 |
Version: | 0.1.1 |
Built: | 2024-11-06 04:11:38 UTC |
Source: | https://github.com/seborinos/ncutyx |
This function will output K clusters of the columns of Y using the help of X.
ancut(Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, nlambdas = 100, sampling = "equal", ncv = 5, dist = "correlation", sigma = 0.1)
ancut(Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, nlambdas = 100, sampling = "equal", ncv = 5, dist = "correlation", sigma = 0.1)
Y |
is a n x p matrix of p variables and n observations. The columns of Y will be clustered into K groups. |
X |
is a n x q matrix of q variables and n observations. |
K |
is the number of clusters. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. |
alpha |
is the coefficient of the elastic net penalty. |
nlambdas |
is the number of tuning parameters in the elastic net. |
sampling |
if 'equal' then the sampling probabilities is the same during the simulated annealing algorithm, if 'size' the probabilites are proportional the the sizes of the clusters in the current iterations. |
ncv |
is the number of cross-validations in the elastic net. |
dist |
is the type of distance metric for the construction of the similarity matrix. Options are 'gaussian', 'euclidean' and 'correlation', the latter being the default. |
sigma |
is the parameter for the gaussian kernel distance which is ignored if 'gaussian' is not chosen as distance measure. |
The algorithm minimizes a modified version of NCut through simulated annealing.
The modified NCut uses in the numerator the similarity matrix of the original data Y
and the denominator uses the similarity matrix of the prediction of Y
using X
.
The clusters correspond to partitions that minimize this objective function.
The external information of X
is incorporated by using elastic net to predict Y
.
A list with the final value of the objective function, the clusters and the lambda penalty chosen through cross-validation.
A list with the following components:
a vector of length N
which contains the loss
at each iteration of the simulated annealing algorithm.
a matrix representing the clustering result of dimension p
times
K
, where p
is the number of columns of Y
.
is the optimal lambda chosen through cross-validation for the elastic net for
predicting Y
with Y
.
Sebastian Jose Teran Hidalgo and Shuangge Ma. Maintainer: Sebastian Jose Teran Hidalgo. [email protected].
Hidalgo, Sebastian J. Teran, Mengyun Wu, and Shuangge Ma. Assisted clustering of gene expression data using ANCut. BMC genomics 18.1 (2017): 623.
#This sets up the initial parameters for the simulation. library(MASS)#for mvrnorm library(fields) n=30 #Sample size B=50 #Number of iterations in the simulated annealing algorithm. L=10000 #Temperature coefficient. p=50 #Number of columns of Y. q=p #Number of columns of X. h1=0.15 h2=0.25 S=matrix(0.2,q,q) S[1:(q/2),(q/2+1):q]=0 S[(q/2+1):q,1:(q/2)]=0 S=S-diag(diag(S))+diag(q) mu=rep(0,q) W0=matrix(1,p,p) W0[1:(p/2),1:(p/2)]=0 W0[(p/2+1):p,(p/2+1):p]=0 Denum=sum(W0) B2=matrix(0,q,p) for (i in 1:(p/2)){ B2[1:(q/2),i]=runif(q/2,h1,h2) in1=sample.int(q/2,6) B2[-in1,i]=0 } for (i in (p/2+1):p){ B2[(q/2+1):q,i]=runif(q/2,h1,h2) in2=sample(seq(q/2+1,q),6) B2[-in2,i]=0 } X=mvrnorm(n, mu, S) Z=X%*%B2 Y=Z+matrix(rnorm(n*p,0,1),n,p) #Our method Res=ancut(Y=Y,X=X,B=B,L=L,alpha=0,ncv=3) Cx=Res[[2]] f11=matrix(Cx[,1],p,1) f12=matrix(Cx[,2],p,1) errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum #This is the true error of the clustering solution. errorL par(mfrow=c(1,2)) #Below is a plot of the simulated annealing path. plot(Res[[1]],type='l',ylab='') #Cluster found by ANCut image.plot(Cx)
#This sets up the initial parameters for the simulation. library(MASS)#for mvrnorm library(fields) n=30 #Sample size B=50 #Number of iterations in the simulated annealing algorithm. L=10000 #Temperature coefficient. p=50 #Number of columns of Y. q=p #Number of columns of X. h1=0.15 h2=0.25 S=matrix(0.2,q,q) S[1:(q/2),(q/2+1):q]=0 S[(q/2+1):q,1:(q/2)]=0 S=S-diag(diag(S))+diag(q) mu=rep(0,q) W0=matrix(1,p,p) W0[1:(p/2),1:(p/2)]=0 W0[(p/2+1):p,(p/2+1):p]=0 Denum=sum(W0) B2=matrix(0,q,p) for (i in 1:(p/2)){ B2[1:(q/2),i]=runif(q/2,h1,h2) in1=sample.int(q/2,6) B2[-in1,i]=0 } for (i in (p/2+1):p){ B2[(q/2+1):q,i]=runif(q/2,h1,h2) in2=sample(seq(q/2+1,q),6) B2[-in2,i]=0 } X=mvrnorm(n, mu, S) Z=X%*%B2 Y=Z+matrix(rnorm(n*p,0,1),n,p) #Our method Res=ancut(Y=Y,X=X,B=B,L=L,alpha=0,ncv=3) Cx=Res[[2]] f11=matrix(Cx[,1],p,1) f12=matrix(Cx[,2],p,1) errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum #This is the true error of the clustering solution. errorL par(mfrow=c(1,2)) #Below is a plot of the simulated annealing path. plot(Res[[1]],type='l',ylab='') #Cluster found by ANCut image.plot(Cx)
Builds similarity matrices for the rows of X and the rows of an assisted dataset Z. Clusters them into K groups while conducting feature selection based on the AWNCut method.
awncut(X, Z, K, lambda, Tau, B = 500, L = 1000)
awncut(X, Z, K, lambda, Tau, B = 500, L = 1000)
X |
is an n x p1 matrix of n observations and p1 variables. |
Z |
is an n x p2 matrix of n observations and p2 variables. Z is the assistant dataset. |
K |
is the number of clusters. |
lambda |
is a vector of tuning parameter lambda in the objective function. |
Tau |
is a vector of tuning parameters tau to be used in the objective function. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. |
The algorithm maximizes a sum of the weighed NCut measure for X and assisted dataset Z, with the addition of a correlation measure between the two datasets. Feature selection is implemented by using the average correlation of each feature as a criterion.
A list with the following components:
the value of tuning parameter lambda for the result
the value of tuning parameter tau for the result
a matrix of the clustering result
a vector of the feature selection result
the value of the objective function
Ruofan Bie. Maintainer: Sebastian Jose Teran Hidalgo [email protected].
Li, Yang; Bie, Ruofan; Teran Hidalgo, Sebastian; Qin, Yinchen; Wu, Mengyun; Ma, Shuangge. Assisted gene expression-based clustering with AWNCut. (Submitted.)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
This Function Outputs the Selection of Tuning Parameters for the AWNCut Method.
awncut.selection(X, Z, K, lambda, Tau, B = 500, L = 1000)
awncut.selection(X, Z, K, lambda, Tau, B = 500, L = 1000)
X |
is an n x p1 matrix of n observations and p1 variables. |
Z |
is an n x p2 matrix of n observations and p2 variables. Z is the assistant dataset. |
K |
is the number of clusters. |
lambda |
is a vector of tuning parameter lambda in the objective function. |
Tau |
is a vector of tuning parameter tau in the objective function. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. #' @return A list with the following components:
|
Li, Yang; Bie, Ruofan; Teran Hidalgo, Sebastian; Qin, Yinchen; Wu, Mengyun; Ma, Shuangge. Assisted gene expression-based clustering with AWNCut. (Submitted.)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
A dataset containing copy number aberrations measurements from TCGA.
brca.data.cna
brca.data.cna
A data frame with 873 patients and 515 gene names.
A dataset containing gene expression measurements from TCGA.
brca.data.ge
brca.data.ge
A data frame with 873 patients and 334 gene names.
A dataset containing protein measurements from TCGA.
brca.data.rppa
brca.data.rppa
A data frame with 873 patients and 164 gene names.
A dataset containing copy number aberrations from TCGA.
cesc.data.cna
cesc.data.cna
A data frame with 164 patients and 488 gene names.
A dataset containing gene expression measurements from TCGA.
cesc.data.ge
cesc.data.ge
A data frame with 164 patients and 325 gene names.
A dataset containing protein measurements from TCGA.
cesc.data.rppa
cesc.data.rppa
A data frame with 164 patients and 144 gene names.
This Function Calculates the True Error Rate of a Clustering Result, Assuming that There are Three Clusters.
ErrorRate(X, n1, n2)
ErrorRate(X, n1, n2)
X |
is a clustering result in matrix format. |
n1 |
is the size of the first cluster. |
n2 |
is the size of the second cluster. |
err is the true error rate of a clustering result.
Li, Yang; Bie, Ruofan; Teran Hidalgo, Sebastian; Qin, Yinchen; Wu, Mengyun; Ma, Shuangge. Assisted gene expression-based clustering with AWNCut. (Submitted.)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
set.seed(123456) #This sets up the initial parameters for the simulation. lambda <- seq(2,6,1) #Tuning parameter lambda Tau <- seq(0.2,0.8,0.2) #Tuning parameter tau n=30; n1=10; n2=10; n3=n-n1-n2 #Sample size p1=10; p2=10; r1=8; r2=8; #Number of variables and noises in each dataset K=3; #Number of clusters mu=1; #Mean of the marginal distribution u1=0.5; #Range of enties in the coefficient matrix library(mvtnorm) epsilon <- matrix(rnorm(n*(p1-r1),0,1), n, (p1-r1)) # Generation of random error Sigma1 <- matrix(rep(0.8,(p1-r1)^2),(p1-r1),(p1-r1)) # Generation of the covariance matrix diag(Sigma1) <- 1 # Generation of the original distribution of the three clusters T1 <- matrix(rmvnorm(n1,mean=rep(-mu,(p1-r1)),sigma=Sigma1),n1,(p1-r1)) T2 <- matrix(rmvnorm(n2,mean=rep(0,(p1-r1)),sigma=Sigma1),n2,(p1-r1)) T3 <- matrix(rmvnorm(n3,mean=rep(mu,(p1-r1)),sigma=Sigma1),n3,(p1-r1)) X1 <- sign(T1)*(exp(abs(T1))) #Generation of signals in X X2 <- sign(T2)*(exp(abs(T2))) X3 <- sign(T3)*(exp(abs(T3))) ep1 <- (matrix(rnorm(n*r1,0,1),n,r1)) #Generation of noises in X X <- rbind(X1,X2,X3) beta1 <- matrix(runif((p1-r1)*(p2-r2),-u1,u1),(p1-r1),(p2-r2)) #Generation of the coefficient matrix Z <- X%*%beta1+epsilon #Generation of signals in Z ep2 <- (matrix(rnorm(n*r2,0.5,1),n,r2)) #Generation of noises in Z X <- cbind(X,ep1) Z <- cbind(Z,ep2) #our method Tune1 <- awncut.selection(X, Z, K, lambda, Tau, B = 20, L = 1000) awncut.result <- awncut(X, Z, 3, Tune1$lam, Tune1$tau, B = 20, L = 1000) ErrorRate(awncut.result[[1]]$Cs, n1, n2)
It clusters the columns of Z,Y and X into K clusters and the samples into R clusters by representing each data type as one network layer. It represents the Z layer depending on Y, and the Y layer depending on X.
mlbncut(Z, Y, X, K = 2, R = 2, B = 30, N = 500, q0 = 0.25, scale = TRUE, dist = "gaussian", sigmas = 1, sigmac = 1)
mlbncut(Z, Y, X, K = 2, R = 2, B = 30, N = 500, q0 = 0.25, scale = TRUE, dist = "gaussian", sigmas = 1, sigmac = 1)
Z |
is a n x q matrix of q variables and n observations. |
Y |
is a n x p matrix of p variables and n observations. |
X |
is a n x r matrix of r variables and n observations. |
K |
is the number of column clusters. |
R |
is the number of row clusters. |
B |
is the number of iterations. |
N |
is the number of samples per iterations. |
q0 |
is the quantiles in the cross entropy method. |
scale |
equals TRUE if data Y is to be scaled with mean 0 and variance 1. |
dist |
is the type of distance measure use in the similarity matrix. Options are 'gaussian' and 'correlation', with 'gaussian' being the default. |
sigmas |
is the tuning parameter of the Gaussian kernel of the samples. |
sigmac |
is the tuning parameter of the Gaussian kernel of the variables. |
This function will output K clusters of columns of Z, Y and X and R clusters of the samples.
The algorithm minimizes the NCut through the cross entropy method. The clusters correspond to partitions that minimize this objective function.
A list with the final value of the objective function and the clusters.
Sebastian J. Teran Hidalgo and Shuangge Ma. Multilayer Biclustering of Omics Data using MLBNCut. (Work in progress.)
#This sets up the initial parameters for the simulation. library(NCutYX) library(MASS) library(fields) n <- 50 p <- 50 h <- 0.15 rho <- 0.15 mu <- 1 W0 <- matrix(1,p,p) W0[1:(p/5),1:(p/5)] <- 0 W0[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- 0 W0[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- 0 W0[(4*p/5+1):p,(4*p/5+1):p]=0 W0=cbind(W0,W0,W0) W0=rbind(W0,W0,W0) W1 <- matrix(1,n,n) W1[1:(n/2),1:(n/2)] <- 0 W1[(n/2+1):n,(n/2+1):n] <- 0 X <- matrix(0,n,p) Y <- matrix(0,n,p) Z <- matrix(0,n,p) Sigma=matrix(0,p,p) Sigma[1:(p/5),1:(p/5)] <- rho Sigma[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- rho Sigma[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- rho Sigma[(4*p/5+1):p,(4*p/5+1):p] <- rho Sigma <- Sigma - diag(diag(Sigma)) Sigma <- Sigma + diag(p) X[1:(n/2),] <- mvrnorm(n/2,rep(mu,p),Sigma) X[(n/2+1):n,] <- mvrnorm(n/2,rep(-mu,p),Sigma) B11 <- matrix(0,p,p) B12 <- matrix(0,p,p) B21 <- matrix(0,p,p) B22 <- matrix(0,p,p) B11[1:(p/5),1:(p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B11[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.5) B11[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B11[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,h/2,h)*rbinom((1*p/5)^2,1,0.5) B12[1:(p/5),1:(p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B12[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,-h,-h/2)*rbinom((2*p/5)^2,1,0.5) B12[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B12[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,-h,-h/2)*rbinom((1*p/5)^2,1,0.5) B21[1:(p/5),1:(p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B21[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.5) B21[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B21[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,h/2,h)*rbinom((1*p/5)^2,1,0.5) B22[1:(p/5),1:(p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B22[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,-h,-h/2)*rbinom((2*p/5)^2,1,0.5) B22[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B22[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,-h,-h/2)*rbinom((1*p/5)^2,1,0.5) Y[1:(n/2),] <- X[1:(n/2),]%*%B11+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Y[(n/2+1):n,] <- X[(n/2+1):n,]%*%B12+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Z[1:(n/2),] <- Y[1:(n/2),]%*%B21+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Z[(n/2+1):n,] <- Y[(n/2+1):n,]%*%B22+matrix(rnorm((n/2)*p,0,0.25),n/2,p) trial <- mlbncut(Z, Y, X, K=4, R=2, B=10, N=50, dist='correlation', q0=0.15, scale=TRUE, sigmas=0.05, sigmac=1) plot(trial[[1]],type='l') image.plot(trial[[2]]) image.plot(trial[[3]]) errorK <- sum((trial[[3]][,1]%*%t(trial[[3]][,1]) + trial[[3]][,2]%*%t(trial[[3]][,2]) + trial[[3]][,3]%*%t(trial[[3]][,3]) + trial[[3]][,4]%*%t(trial[[3]][,4]))*W0)/(3*p)^2 + sum((trial[[2]][,1]%*%t(trial[[2]][,1]) + trial[[2]][,2]%*%t(trial[[2]][,2]))*W1)/(n)^2 errorK
#This sets up the initial parameters for the simulation. library(NCutYX) library(MASS) library(fields) n <- 50 p <- 50 h <- 0.15 rho <- 0.15 mu <- 1 W0 <- matrix(1,p,p) W0[1:(p/5),1:(p/5)] <- 0 W0[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- 0 W0[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- 0 W0[(4*p/5+1):p,(4*p/5+1):p]=0 W0=cbind(W0,W0,W0) W0=rbind(W0,W0,W0) W1 <- matrix(1,n,n) W1[1:(n/2),1:(n/2)] <- 0 W1[(n/2+1):n,(n/2+1):n] <- 0 X <- matrix(0,n,p) Y <- matrix(0,n,p) Z <- matrix(0,n,p) Sigma=matrix(0,p,p) Sigma[1:(p/5),1:(p/5)] <- rho Sigma[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- rho Sigma[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- rho Sigma[(4*p/5+1):p,(4*p/5+1):p] <- rho Sigma <- Sigma - diag(diag(Sigma)) Sigma <- Sigma + diag(p) X[1:(n/2),] <- mvrnorm(n/2,rep(mu,p),Sigma) X[(n/2+1):n,] <- mvrnorm(n/2,rep(-mu,p),Sigma) B11 <- matrix(0,p,p) B12 <- matrix(0,p,p) B21 <- matrix(0,p,p) B22 <- matrix(0,p,p) B11[1:(p/5),1:(p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B11[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.5) B11[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B11[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,h/2,h)*rbinom((1*p/5)^2,1,0.5) B12[1:(p/5),1:(p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B12[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,-h,-h/2)*rbinom((2*p/5)^2,1,0.5) B12[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B12[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,-h,-h/2)*rbinom((1*p/5)^2,1,0.5) B21[1:(p/5),1:(p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B21[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.5) B21[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.5) B21[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,h/2,h)*rbinom((1*p/5)^2,1,0.5) B22[1:(p/5),1:(p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B22[(p/5+1):(3*p/5),(p/5+1):(3*p/5)] <- runif((2*p/5)^2,-h,-h/2)*rbinom((2*p/5)^2,1,0.5) B22[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)] <- runif((p/5)^2,-h,-h/2)*rbinom((p/5)^2,1,0.5) B22[(4*p/5+1):p,(4*p/5+1):p] <- runif((1*p/5)^2,-h,-h/2)*rbinom((1*p/5)^2,1,0.5) Y[1:(n/2),] <- X[1:(n/2),]%*%B11+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Y[(n/2+1):n,] <- X[(n/2+1):n,]%*%B12+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Z[1:(n/2),] <- Y[1:(n/2),]%*%B21+matrix(rnorm((n/2)*p,0,0.25),n/2,p) Z[(n/2+1):n,] <- Y[(n/2+1):n,]%*%B22+matrix(rnorm((n/2)*p,0,0.25),n/2,p) trial <- mlbncut(Z, Y, X, K=4, R=2, B=10, N=50, dist='correlation', q0=0.15, scale=TRUE, sigmas=0.05, sigmac=1) plot(trial[[1]],type='l') image.plot(trial[[2]]) image.plot(trial[[3]]) errorK <- sum((trial[[3]][,1]%*%t(trial[[3]][,1]) + trial[[3]][,2]%*%t(trial[[3]][,2]) + trial[[3]][,3]%*%t(trial[[3]][,3]) + trial[[3]][,4]%*%t(trial[[3]][,4]))*W0)/(3*p)^2 + sum((trial[[2]][,1]%*%t(trial[[2]][,1]) + trial[[2]][,2]%*%t(trial[[2]][,2]))*W1)/(n)^2 errorK
It clusters the columns of Z,Y and X into K clusters by representing each data type as one network layer. It represents the Z layer depending on Y, and the Y layer depending on X. Elastic net can be used before the clustering procedure by using the predictions of Z and Y instead of the actual values to improve the cluster results. This function will output K clusters of columns of Z, Y and X.
muncut(Z, Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, ncv = 3, nlambdas = 100, scale = FALSE, model = FALSE, gamma = 0.5, sampling = "equal", dist = "gaussian", sigma = 0.1)
muncut(Z, Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, ncv = 3, nlambdas = 100, scale = FALSE, model = FALSE, gamma = 0.5, sampling = "equal", dist = "gaussian", sigma = 0.1)
Z |
is a n x q matrix of q variables and n observations. |
Y |
is a n x p matrix of p variables and n observations. |
X |
is a n x r matrix of r variables and n observations. |
K |
is the number of column clusters. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. |
alpha |
is the tuning parameter in the elastic net penalty, only used when model=T. |
ncv |
is the number of cross-validations used to choose the tuning parameter lambda in the elastic net penalty, only used when model=T. |
nlambdas |
number of tuning parameters lambda used during cross-validation, only when model=T. |
scale |
when TRUE the Z, Y and X are scaled with mean 0 and standard deviation equal 1. |
model |
when TRUE the the relationship between Z and Y, and between Y and X are modeled with the elastic net. The predictions of Z, and Y from the models are used in the clustering algorithm. |
gamma |
is the tuning parameter of the clustering penalty. Larger values give more importance to within layer effects and less to across layer effects. |
sampling |
if 'equal' then the sampling distribution is discrete uniform over the number of clusters, if 'size' the probabilities are inversely proportional to the size of each cluster. |
dist |
is the type of distance measure use in the similarity matrix. Options are 'gaussian' and 'correlation', with 'gaussian' being the default. |
sigma |
is the bandwidth parameter when the dist metric chosen is gaussian. |
The algorithm minimizes a modified version of NCut through simulated annealing. The clusters correspond to partitions that minimize this objective function. The external information of X is incorporated by using ridge regression to predict Y.
Sebastian J. Teran Hidalgo and Shuangge Ma. Clustering Multilayer Omics Data using MuNCut. (Revise and resubmit.)
library(NCutYX) library(MASS) library(fields) #for image.plot #parameters# set.seed(777) n=50 p=50 h=0.5 rho=0.5 W0=matrix(1,p,p) W0[1:(p/5),1:(p/5)]=0 W0[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=0 W0[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=0 W0[(4*p/5+1):p,(4*p/5+1):p]=0 W0=cbind(W0,W0,W0) W0=rbind(W0,W0,W0) Y=matrix(0,n,p) Z=matrix(0,n,p) Sigma=matrix(rho,p,p) Sigma[1:(p/5),1:(p/5)]=2*rho Sigma[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=2*rho Sigma[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=2*rho Sigma=Sigma-diag(diag(Sigma)) Sigma=Sigma+diag(p) X=mvrnorm(n,rep(0,p),Sigma) B1=matrix(0,p,p) B2=matrix(0,p,p) B1[1:(p/5),1:(p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B1[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.2) B1[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B2[1:(p/5),1:(p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B2[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.2) B2[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) Y=X%*%B1+matrix(rnorm(n*p,0,0.5),n,p) Y2=X%*%B1 Z=Y%*%B2+matrix(rnorm(n*p,0,0.5),n,p) Z2=Y%*%B2 #Computing our method clust <- muncut(Z, Y, X, K = 4, B = 10000, L = 500, sampling = 'size', alpha = 0.5, ncv = 3, nlambdas = 20, sigma = 10, scale = TRUE, model = FALSE, gamma = 0.1) A <- clust[[2]][,1]%*%t(clust[[2]][,1])+ clust[[2]][,2]%*%t(clust[[2]][,2])+ clust[[2]][,3]%*%t(clust[[2]][,3])+ clust[[2]][,4]%*%t(clust[[2]][,4]) errorK=sum(A*W0)/(3*p)^2 errorK plot(clust[[1]],type='l') image.plot(A)
library(NCutYX) library(MASS) library(fields) #for image.plot #parameters# set.seed(777) n=50 p=50 h=0.5 rho=0.5 W0=matrix(1,p,p) W0[1:(p/5),1:(p/5)]=0 W0[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=0 W0[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=0 W0[(4*p/5+1):p,(4*p/5+1):p]=0 W0=cbind(W0,W0,W0) W0=rbind(W0,W0,W0) Y=matrix(0,n,p) Z=matrix(0,n,p) Sigma=matrix(rho,p,p) Sigma[1:(p/5),1:(p/5)]=2*rho Sigma[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=2*rho Sigma[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=2*rho Sigma=Sigma-diag(diag(Sigma)) Sigma=Sigma+diag(p) X=mvrnorm(n,rep(0,p),Sigma) B1=matrix(0,p,p) B2=matrix(0,p,p) B1[1:(p/5),1:(p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B1[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.2) B1[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B2[1:(p/5),1:(p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) B2[(p/5+1):(3*p/5),(p/5+1):(3*p/5)]=runif((2*p/5)^2,h/2,h)*rbinom((2*p/5)^2,1,0.2) B2[(3*p/5+1):(4*p/5),(3*p/5+1):(4*p/5)]=runif((p/5)^2,h/2,h)*rbinom((p/5)^2,1,0.2) Y=X%*%B1+matrix(rnorm(n*p,0,0.5),n,p) Y2=X%*%B1 Z=Y%*%B2+matrix(rnorm(n*p,0,0.5),n,p) Z2=Y%*%B2 #Computing our method clust <- muncut(Z, Y, X, K = 4, B = 10000, L = 500, sampling = 'size', alpha = 0.5, ncv = 3, nlambdas = 20, sigma = 10, scale = TRUE, model = FALSE, gamma = 0.1) A <- clust[[2]][,1]%*%t(clust[[2]][,1])+ clust[[2]][,2]%*%t(clust[[2]][,2])+ clust[[2]][,3]%*%t(clust[[2]][,3])+ clust[[2]][,4]%*%t(clust[[2]][,4]) errorK=sum(A*W0)/(3*p)^2 errorK plot(clust[[1]],type='l') image.plot(A)
Builds a similarity matrix for the columns of Y and clusters them into K groups based on the NCut graph measure. Correlation, Euclidean and Gaussian distances can be used to construct the similarity matrix.
ncut(Y, K = 2, B = 30, N = 500, dist = "correlation", scale = TRUE, q = 0.1, sigma = 1)
ncut(Y, K = 2, B = 30, N = 500, dist = "correlation", scale = TRUE, q = 0.1, sigma = 1)
Y |
is a n x p matrix of p variables and n observations. The p columns of Y will be clustered into K groups using NCut. |
K |
is the number of clusters. |
B |
is the number of iterations. |
N |
is the number of samples per iterations. |
dist |
is the type of distance metric for the construction of the similarity matrix. Options are 'gaussian', 'euclidean' and 'correlation', the latter being the default. |
scale |
equals TRUE if data Y is to be scaled with mean 0 and variance 1. |
q |
is the quantile used for the top results at each iterations. |
sigma |
is the bandwidth parameter when the dist metric chosen is 'gaussian' (default=0.1). |
The algorithm minimizes the NCut through the cross entropy method. The edges of the graph correspond to the entries of a similarity matrix constructed based on a correlation, euclidean or gaussian distance metric. The clusters correspond to partitions that minimize this NCut objective function.
A list with the following components:
a vector of length N
which contains the quantiles
q
at each iteration of the optimization algorithm.
a matrix representing the clustering result of dimension p
times
K
, where p
is the number of columns of Y
.
the NCut measure for the cluster result.
Sebastian Jose Teran Hidalgo. Maintainer: Sebastian Jose Teran Hidalgo [email protected].
Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17.4 (2007): 395-416.
Kroese, D. P., Rubinstein, R. Y., Cohen, I., Porotsky, S., & Taimre, T. (2013). "Cross-entropy method." In Encyclopedia of Operations Research and Management Science (pp. 326-333). Springer US.
# This sets up the initial parameters for the simulation. library(MASS) n=100 # Sample size B=30 # Number of iterations in the simulated annealing algorithm. p=50 # Number of columns of Y. S=matrix(0.2,p,p) S[1:(p/2),(p/2+1):p]=0 S[(p/2+1):p,1:(p/2)]=0 S=S-diag(diag(S))+diag(p) mu=rep(0,p) W0=matrix(1,p,p) W0[1:(p/2),1:(p/2)]=0 W0[(p/2+1):p,(p/2+1):p]=0 Denum=sum(W0) Y=mvrnorm(n, mu, S) # NCut Res=ncut(Y, K=2, B=30, N=1000, dist='correlation', scale=TRUE, q=0.2, sigma=0.1) Cx=Res[[2]] f11=matrix(Cx[,1],p,1) f12=matrix(Cx[,2],p,1) errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum # This is the true error of the clustering solution. errorL
# This sets up the initial parameters for the simulation. library(MASS) n=100 # Sample size B=30 # Number of iterations in the simulated annealing algorithm. p=50 # Number of columns of Y. S=matrix(0.2,p,p) S[1:(p/2),(p/2+1):p]=0 S[(p/2+1):p,1:(p/2)]=0 S=S-diag(diag(S))+diag(p) mu=rep(0,p) W0=matrix(1,p,p) W0[1:(p/2),1:(p/2)]=0 W0[(p/2+1):p,(p/2+1):p]=0 Denum=sum(W0) Y=mvrnorm(n, mu, S) # NCut Res=ncut(Y, K=2, B=30, N=1000, dist='correlation', scale=TRUE, q=0.2, sigma=0.1) Cx=Res[[2]] f11=matrix(Cx[,1],p,1) f12=matrix(Cx[,2],p,1) errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum # This is the true error of the clustering solution. errorL
This function will output K channels of variables.
pwncut(X, K = 2, B = 3000, L = 1000, scale = TRUE, lambda = 1, epsilon = 0, nstarts = 3, start = "default", dist = "gaussian", sigma = 0.1, beta = 1)
pwncut(X, K = 2, B = 3000, L = 1000, scale = TRUE, lambda = 1, epsilon = 0, nstarts = 3, start = "default", dist = "gaussian", sigma = 0.1, beta = 1)
X |
is a n x p matrix of p variables and n observations. |
K |
is the number of clusters. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. |
scale |
equals TRUE if data X is to be scaled with mean 0 and variance 1. |
lambda |
the tuning parameter of the penalty. Larger values shrink the weighted cluster membership closer together (default = 1). |
epsilon |
values in the similarity matrix less than epsilon are set to 0 (default = 0). |
nstarts |
the number of starting values also corresponding how many times simulated annealing is run. Larger values provide better results but takes longer. |
start |
if it equals 'default' then the starting value for all weights is 1/K. If 'random' then weights are sampled from a uniform distribution and then scaled to sum 1 per variable. |
dist |
specifies the distance metric used for constructing the similarity matrix. Options are 'gaussian', 'correlation' and 'euclidean' (default = 'gaussian'). |
sigma |
is the bandwidth parameter when the dist metric chosen is 'gaussian' (default = 0.1). |
beta |
when dist='correlation', beta is the exponent applied to each entry of the similarity matrix. |
The algorithm minimizes a modified version of NCut through simulated annealing. The clusters correspond to partitions that minimize this objective function.
Sebastian J. Teran Hidalgo, Mengyun Wu and Shuangge Ma. Penalized and weighted clustering of gene expression data using PWNCut. (Submitted.)
# This sets up the initial parameters for the simulation. n <- 100 # Sample size p <- 100 # Number of columns of Y. K <- 3 C0 <- matrix(0,p,K) C0[1:25,1] <- matrix(1,25,1) C0[26:75,1:3] <- matrix(1/3,50,3) C0[76:100,3] <- matrix(1,25,1) A0 <- C0[ ,1]%*%t(C0[ ,1]) + C0[ ,2]%*%t(C0[ ,2]) + C0[ ,3]%*%t(C0[ ,3]) A0 <- A0 - diag(diag(A0)) + diag(p) Z1 <- rnorm(n,0,2) Z2 <- rnorm(n,0,2) Z3 <- rnorm(n,0,2) Y <- matrix(0,n,p) Y[ ,1:25] <- matrix(rnorm(n*25, 0, 2), n, 25) + matrix(Z1, n, 25, byrow=FALSE) Y[ ,26:75] <- matrix(rnorm(n*50, 0, 2), n, 50) + matrix(Z1, n, 50, byrow=FALSE) + matrix(Z2, n, 50, byrow=FALSE) + matrix(Z3, n, 50, byrow=FALSE) Y[ ,76:100] <- matrix(rnorm(n*25, 0, 2), n, 25) + matrix(Z3, n, 25, byrow=FALSE) trial <- pwncut(Y, K = 3, B = 10000, L = 1000, lambda = 1.5, start = 'default', scale = TRUE, nstarts = 1, epsilon = 0, dist = 'correlation', sigma = 10) A1 <- trial[[2]][ ,1]%*%t(trial[[2]][ ,1]) + trial[[2]][ ,2]%*%t(trial[[2]][ ,2]) + trial[[2]][ ,3]%*%t(trial[[2]][ ,3]) A1 <- A1 - diag(diag(A1)) + diag(p) plot(trial[[1]], type='l') errorL <- sum(abs(A0-A1))/p^2 errorL
# This sets up the initial parameters for the simulation. n <- 100 # Sample size p <- 100 # Number of columns of Y. K <- 3 C0 <- matrix(0,p,K) C0[1:25,1] <- matrix(1,25,1) C0[26:75,1:3] <- matrix(1/3,50,3) C0[76:100,3] <- matrix(1,25,1) A0 <- C0[ ,1]%*%t(C0[ ,1]) + C0[ ,2]%*%t(C0[ ,2]) + C0[ ,3]%*%t(C0[ ,3]) A0 <- A0 - diag(diag(A0)) + diag(p) Z1 <- rnorm(n,0,2) Z2 <- rnorm(n,0,2) Z3 <- rnorm(n,0,2) Y <- matrix(0,n,p) Y[ ,1:25] <- matrix(rnorm(n*25, 0, 2), n, 25) + matrix(Z1, n, 25, byrow=FALSE) Y[ ,26:75] <- matrix(rnorm(n*50, 0, 2), n, 50) + matrix(Z1, n, 50, byrow=FALSE) + matrix(Z2, n, 50, byrow=FALSE) + matrix(Z3, n, 50, byrow=FALSE) Y[ ,76:100] <- matrix(rnorm(n*25, 0, 2), n, 25) + matrix(Z3, n, 25, byrow=FALSE) trial <- pwncut(Y, K = 3, B = 10000, L = 1000, lambda = 1.5, start = 'default', scale = TRUE, nstarts = 1, epsilon = 0, dist = 'correlation', sigma = 10) A1 <- trial[[2]][ ,1]%*%t(trial[[2]][ ,1]) + trial[[2]][ ,2]%*%t(trial[[2]][ ,2]) + trial[[2]][ ,3]%*%t(trial[[2]][ ,3]) A1 <- A1 - diag(diag(A1)) + diag(p) plot(trial[[1]], type='l') errorL <- sum(abs(A0-A1))/p^2 errorL