MBG wiki: Cluster analysis of trajectories

Read the page Calculation of frame-to-frame rmsds and then come back.

So, now you have your crossDCD.matrix. Lets make a dendrogram based on these rmsds (which will be treated as distances).

We will use the statistical package R [1]. For the example shown below I will assume that your matrix is 460x460 (do a wc crossDCD.matrix to determine the actual dimension (if your matrix is square)) :

# R

R : Copyright 2004, The R Foundation for Statistical Computing
Version 2.0.1  (2004-11-15), ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

> A <- matrix(scan("crossDCD.matrix", n = 460*460), 460, 460, byrow = TRUE)
Read 211600 items
> hc <- hclust( dist(A), method="complete" )
> ls ()
[1] "A"  "hc"
> hc

Call:
hclust(d = dist(A), method = "complete")

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 460 

> plclust(hc)
> q()
Save workspace image? [y/n/c]: n

giving something like :

If you want to save, say a postscript, of your image, while in R and before you type any of the commands type:

> postscript()

This way, you open the device to print or save postscripts. When you are done with your work, type:

> dev.off()

to close the device and finish the postscript. Of cource, there are numerous options you can play with. If you insist on changing your postscript's properties ask N.M.G for R's manual.

From this dendrogram you could possibly make an educated guess as to how many distinctive clusters you have and then you can use a more advanced clustering method to define them (note however that there are proper statistical methods to help you decide the number of clusters). Using fuzzy clustering and assuming 3 clusters it will be something like :

# R

R : Copyright 2004, The R Foundation for Statistical Computing
Version 2.0.1  (2004-11-15), ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for a HTML browser interface to help.
Type 'q()' to quit R.

> A <- matrix(scan("crossDCD.matrix", n = 460*460), 460, 460, byrow = TRUE)
Read 211600 items
> library(cluster)
> f <- fanny(A,3)
> summary(f)
iterations  objective 
    33.000   1035.766 
Membership coefficients:
             [,1]      [,2]      [,3]
  [1,] 0.53150274 0.2677229 0.2007743
  [2,] 0.53087205 0.2679982 0.2011298
  [3,] 0.53662002 0.2649609 0.1984191

.....................................

[459,] 0.11053637 0.3214276 0.5680360
[460,] 0.11896391 0.3286576 0.5523785
Coefficients:
dunn_coeff normalized 
 0.4334923  0.1502384 
Closest hard clustering:
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
[297] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[334] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[371] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[408] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 2 3 3 3 3 3
[445] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Silhouette plot information:
    cluster neighbor     sil_width
76        1        2  5.967160e-01
122       1        2  5.956911e-01
121       1        2  5.956660e-01
79        1        2  5.918693e-01
119       1        2  5.910336e-01
65        1        2  5.907392e-01
113       1        2  5.887318e-01
120       1        2  5.881414e-01

..................................

435       3        2  7.393940e-02
440       3        2  5.069895e-02
288       3        2 -1.797243e-05
Average silhouette width per cluster:
[1] 0.4781967 0.4395254 0.4025648
Average silhouette width of total data set:
[1] 0.4414223

105570 dissimilarities, summarized :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2123  8.5210 14.6000 14.5900 20.5500 30.2600 
Metric :  euclidean 
Number of objects : 460

Available components:
[1] "membership" "coeff"      "clustering" "objective"  "diss"      
[6] "call"       "silinfo"    "data"      
> plot(f)
Hit <Return> to see next plot: 
Hit <Return> to see next plot: 
> q()
Save workspace image? [y/n/c]: n

giving :