MBG wiki: Eigenvectors and eigenvalues, sufficient sampling issues

Doing a principal component analysis (PCA) of a trajectory is easy. And is even easier to prepare nice figures showing collective motions of the solute (see Calculation of eigenvalues and eigenvectors and More on eigenvectors and eigenvalues). And if you are not careful, it is just as easy for all these nice figures to be fantasy pure and simple.

Being careful is this : you must be able to demonstrate that you have sufficiently sampled the molecule's configurational space during your simulation. The truth is that you never are (recall the time scales of the various types of macromolecular motion). Still, if (in a hypothetical case) your molecule has two major and relatively stable conformers, you may have sufficient sampling of the given conformers (but you will never know whether there are other conformations that you've never seen in your dynamics). In this case you may be able to demonstrate sufficient sampling for these conformers (clusters).

Demonstrating convergence of sampling is still an active research area (Google for "converge of sampling in protein simulations" and you'll find an interesting paper by B. Hess). Here I will only show two types of graphs that you can prepare with the output from carma (file carma.fluctuations.dat) and how these should look like in you have sufficient sampling (note that these indicators are necessary but not sufficient conditions).

The first type of graphs you would want to have a look at is a histogram of the fluctuations along the first few eigenvectors. This you do with xmgr (read block data with columns 1-2, or 1-3, or 1-4, …, and then produce a histogram from the 'Transformations' pull-down menu). These histograms should look like Gaussians centered on zero. Compare the following :

The second type of graph is produced by ploting the fluctuations of pairs of eigenvectors. The example that follows shows graphs for the combinations 1-2, 1-3 and 2-3 of eigenvectors. In the case of sufficient sampling, these should look as 2D Gaussians centered on (0,0). Compare these :

A better way to display these graphs would be to draw the distribution of density of points. This footnote tells you how1.

You could even try to examine a 3D distribution (you'll need a trick2 to convert fluctuations along three eigenvectors to x, y and z coordinates in a pseudo-pdb file). Compare these:

Footnotes:

1. Use an editor to create a file containing the fluctuations along the two eigenvectors you are interested in. Then compile this program :

#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define            nearint(a)      ((a) < 0.0 ? (int)(a-0.50) : (int)(a+0.50))

int main(argc,argv)
int     argc;
char    *argv[];
{
        FILE    *in;
        FILE    *out;
        float   x, y;
        float   max;
        float   points;
        int     grid;
        int     data[1000][1000];
        int     i, k;


        if ( argc != 5 )
                {
                        printf("Usage: mkdensity <input file> <output file> <edge length> <number of grid points>\n");
                        exit(1);
                                        }

        grid = atoi( argv[4] );
        max = atof( argv[3] );
        if ( grid > 1000 )
                {
                        printf("Too large grid. Max is 1000.\n");
                        exit(1);
                                        }
        for ( i=0 ; i < 1000 ; i++ )
        for ( k=0 ; k < 1000 ; k++ )
                data[i][k] = 0;

        in  = fopen(argv[1], "r");
        out = fopen(argv[2], "w");

        if ( in == NULL )
                {
                        printf("Can not open input file\n");
                        exit(1);
                                        }

        if ( out == NULL )
                {
                        printf("Can not open output file\n");
                        exit(1);
                                        }

        rewind( in );

        while( fscanf( in, "%f %f", &x, &y) == 2 )
                data[ nearint(grid*x/max + 0.50*grid)][ nearint(grid*y/max + 0.50*grid)]++;

        for ( i=0 ; i < grid ; i++ )
                {
                        for ( k=0 ; k < grid ; k++ )
                                fprintf(out, " % 5d", data[k][i] );

                        fprintf(out, "\n");
                }

        exit(0);
}

The program will read a file containing pairs of coordinates of points and will output an ASCII file containing density distribution matrix. Then use a gnuplot script to do the plotting :

giving something like

2. If you have a file with three columns containing the fluctuations along the eigenvectors you want to plot, this tiny C program should prepare a PDB file for you (use input and output redirection to use it) :