MBG wiki: Editing revision 16 of Eigenvectors and eigenvalues, sufficient sampling issues

Editing old revision 16. Saving this page will replace the latest revision with this text.

Doing a principal component analysis (PCA) of a trajectory is easy. And is even easier to prepare nice figures showing collective motions of the solute (see [[Calculation of eigenvalues and eigenvectors]] and [[More on eigenvectors and eigenvalues]]). And if you are not careful, it is just as easy for all these nice figures to be fantasy pure and simple.

Being careful is this : you must be able to demonstrate that you have sufficiently sampled the molecule's configurational space during your simulation. The truth is that you never are (recall the time scales of the various types of macromolecular motion). Still, if (in a hypothetical case) your molecule has two major and relatively stable conformers, you may have sufficient sampling of the given conformers (but you will never know whether there are other conformations that you've never seen in your dynamics). In this case you may be able to demonstrate sufficient sampling for these conformers (clusters).

Demonstrating convergence of sampling is still an active research area (Google for "converge of sampling in protein simulations" and you'll find an interesting paper by B. Hess). Here I will only show two types of graphs that you can prepare with the output from /carma/ (file carma.fluctuations.dat) and how these should look like in you have sufficient sampling (note that these indicators are necessary but not sufficient conditions).

The first type of graphs you would want to have a look at is a histogram of the fluctuations along the first few eigenvectors. This you do with /xmgr/ (read block data with columns 1-2, or 1-3, or 1-4, ..., and then produce a histogram from the 'Transformations' pull-down menu). These histograms should look like Gaussians centered on zero. Compare the following :

[[image: Sufficient sampling, 1, 2, 3 eigenvector fluctuations histogram]]

[[image: Insufficient sampling, 1, 2, 3 eigenvector fluctuations histogram]]

The second type of graph is produced by ploting the fluctuations of pairs of eigenvectors. The example that follows shows graphs for the combinations 1-2, 1-3 and 2-3 of eigenvectors. In the case of sufficient sampling, these should look as 2D Gaussians centered on (0,0). Compare these :

[[image: Sufficient sampling, 1,2 1,3 2,3 eigenvector pairs]]

[[image: Insufficient sampling, 1,2 1,3 2,3 eigenvector pairs]]

A better way to display these graphs would be to draw the distribution of density of points. This footnote tells you how{{Use an editor to create a file containing the fluctuations along the two eigenvectors you are interested in. Then compile this program :
<source>
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define            nearint(a)      ((a) < 0.0 ? (int)(a-0.50) : (int)(a+0.50))

int main(argc,argv)
int     argc;
char    *argv[];
{
        FILE    *in;
        FILE    *out;
        float   x, y;
        float   max;
        float   points;
        int     grid;
        int     data[1000][1000];
        int     i, k;

if ( argc != 5 )
                {
                        printf("Usage: mkdensity <input file> <output file> <edge length> <number of grid points>\n");
                        exit(1);
                                        }

grid = atoi( argv[4] );
        max = atof( argv[3] );
        if ( grid > 1000 )
                {
                        printf("Too large grid. Max is 1000.\n");
                        exit(1);
                                        }
        for ( i=0 ; i < 1000 ; i++ )
        for ( k=0 ; k < 1000 ; k++ )
                data[i][k] = 0;

in  = fopen(argv[1], "r");
        out = fopen(argv[2], "w");

if ( in == NULL )
                {
                        printf("Can not open input file\n");
                        exit(1);
                                        }

if ( out == NULL )
                {
                        printf("Can not open output file\n");
                        exit(1);
                                        }

rewind( in );

while( fscanf( in, "%f %f", &x, &y) == 2 )
                data[ nearint(grid*x/max + 0.50*grid)][ nearint(grid*y/max + 0.50*grid)]++;

for ( i=0 ; i < grid ; i++ )
                {
                        for ( k=0 ; k < grid ; k++ )
                                fprintf(out, " % 5d", data[k][i] );

fprintf(out, "\n");
                }

exit(0);
}
</source>
The program will read a file containing pairs of coordinates of points and will output an ASCII file containing density distribution matrix. Then use a /gnuplot/ script to do the plotting :
<source>
# gcc mkdensity.c -lm
# ./a.out mydata test.dat 20 100
# cat > script
set pm3d
set view 0,0
set size square 3.0,3.0
unset key
unset xtics
unset ytics
unset zeroaxis
set output 'coloured.png'
set terminal png giant crop
splot "test.dat" matrix with pm3d palette
<CTRL-D>
# gnuplot < script
# display coloured.png
</source>

giving something like

[[image: sufficient sampling density distribution graph]]

}}.

You could even try to examine a 3D distribution (you'll need a trick{{If you have a file with three columns containing the fluctuations along the eigenvectors you want to plot, this tiny C program should prepare a PDB file for you (use input and output redirection to use it) :
<source>
#include <stdio.h>
#include <math.h>

main()
{
        float   x, y, z;        
        int     i;

i = 1;
        while (scanf("%f %f %f", &x, &y, &z ) == 3 )
        {
        printf("ATOM% 7d  CA  GLY% 6d    % 8.3f% 8.3f% 8.3f  1.00% 6.2f\n", i, i, x, y, z, 20.0);
        i++;
        }
}
</source>}} to convert fluctuations along three eigenvectors to x, y and z coordinates in a pseudo-pdb file). Compare these:

[[image: Sufficient sampling, 1,2,3 eigenvectors 3D]]

[[image: Insufficient sampling, 1,2,3 eigenvectors 3D]]

Summary:

This change is a minor edit.

Username:

Replace this text with a file.