Actions

NWChem

From Modelado Foundation

Revision as of 17:45, February 5, 2013 by imported>Adam
NWChem
{{{image}}}
{{{imagecaption}}}
Developer(s) Pacific Northwestern National Laboratory
Stable Release 6.1/Latest Release Date here
Operating Systems Linux, Unix, etc.
Type Computational Chemistry
License Educational Community License (ECL) 2.0
Website http://www.nwchem-sw.org

NWChem provides many methods for computing the properties of molecular and periodic systems using standard quantum mechanical descriptions of the electronic wavefunction or density. Its classical molecular dynamics capabilities provide for the simulation of macromolecules and solutions, including the computation of free energies using a variety of force fields. These approaches may be combined to perform mixed quantum-mechanics and molecular-mechanics simulations.

Applications

There are many modules of NWChem. In the DynAX XStack program, PNNL will extract several of these modules and produce simple, readable, serial kernels. The DynAX team will produce parallel and optimized versions of these kernels and report on those optimizations.

The first of these kernels is the Self Consistent Field (SCF).

SCF Kernel

Click here to view a Powerpoint Presentation on the SCF kernel

Description

Download

Code will be available here shortly. It is licensed under the BSD license.

Optimizations

Additional Work Skipping

The reference implementation has two tests to skip unnecessary work, one at the i,j level and one at the i,j,k,l level (innermost loop). An additional test can be added to the i,j,k level which will allow skipping entire iterations of the inner most loop.

Reference twoel function:

 double twoel(double schwmax) {
   int i, j, k, l;
   
   for (i = 0; i < nbfn; i++) {
   for (j = 0; j < nbfn; j++) {
       if ((g_schwarz[i][j] * schwmax        ) < tol2e) {icut1 += nbfn * nbfn; continue;}
       double  KLTest = tol2e / g_schwarz[i][j];
   
   for (k = 0; k < nbfn; k++) {
   for (l = 0; l < nbfn; l++) {
   
       if (g_schwarz[k][l] < KLTest) {icut2 ++; continue;}
   
       icut3 ++;
       double gg = g(i, j, k, l);
       g_fock[i][j] += (       gg * g_dens[k][l]);
       g_fock[i][k] -= (0.50 * gg * g_dens[j][l]);
   
     } } } }
   
   return (0.50 * contract_matrices(g_fock, g_dens));
 }

The inner-most test, (g_schwarz[k][l] < KLTest), can be modified to work in the k loop under certain conditions. If the maximum value of all entries for a given row of g_schwarz is known, that can be compared to KLTest in the k loop and we can skip the l loop entirely in cases where it would not perform any updates.

This requires one additional line to be inserted between the k and l loops:

   for (k = 0; k < nbfn; k++) {
     if (g_schwarz_max_row_value[k] < KLTest) {icut4 ++; continue;}
   for (l = 0; l < nbfn; l++) {

To enable this additional test, the array g_schwarz_max_row_value should be populated with the maximum value for each row in makesz while it is initializing the g_schwarz matrix.

The additional memory requirements for this work skipping test are minimal as the array only has one dimension of size nbfn.

Symmetry of g()

The g function has several symmetries which can be exploited to reduce the number of times it must be run.

g(i, j, k, l) returns the same value in the following cases:

  • The i and j parameters are swapped.
  • The k and l parameters are swapped.
  • The i,j and k,l pairs are swapped.

Combining all possibly symmetries leads to an overall 8-way symmetry where the following calls all return the same value:

  • g(i, j, k, l)
  • g(i, j, l, k)
  • g(j, i, k, l)
  • g(j, i, l, k)
  • g(k, l, i, j)
  • g(k, l, j, i)
  • g(l, k, i, j)
  • g(l, k, j, i)

By calculating gg = g(i, j, k, l) once, we can then simply perform multiple updates to the fock matrix.

  • update_fock(gg, i, j, k, l)
  • update_fock(gg, i, j, l, k)
  • update_fock(gg, j, i, k, l)
  • update_fock(gg, j, i, l, k)
  • update_fock(gg, k, l, i, j)
  • update_fock(gg, k, l, j, i)
  • update_fock(gg, l, k, i, j)
  • update_fock(gg, l, k, j, i)

With update() being a macro or inlined function similar to:

update(gg, a, b, c, d) {
  g_fock[a][b] += (gg * g_dens[c][d]);
  g_fock[a][c] -= (0.50 * gg * g_dens[b][d]);
}


When taking advantage of the symmetry, there are special cases which need to be considered. Most obvious is that the iteration range of the loops must be modified so that g() is only called once for each equivalent ordering of its arguments then all possible updates to the fock matrix are made with the result. Secondly, when arguments which can be swapped are equal, there are fewer updates to be made to the fock matrix.

As an example, consider the case where i, j, k, and l are all equal to 0. The reference implementation can only process this once since the loop indices can only all be zero once. If all symmetric updates are performed in this case, the result would be that the exact same update is performed eight times since the arguments to update(0, 0, 0, 0) are unchanged by the swap operations.

To ensure proper operation, there are six separate variants of the twoel loops, one for each possible symmetry restriction:

  • No restrictions
  • i = j
  • k = l
  • i,j pair is identical to k,l pair
  • i = j and k = l
  • i, j, k, and l are all equal

With the special cases taken in account, the loop structure for the fully symmetric case is:

for (i = 0; i < nbfn; i++) {
  for (j = i + 1; j < nbfn; j++) {
    for (k = i; k < nbfn; k++) {
      
      if (k == i) l_start = 1 + j;
      else l_start = 1 + k;
      
      for (l = l_start; l < nbfn; l++) {

        gg = g(i,j,k,l);

        //perform all eight updates
        update_fock(gg, i, j, k, l);
        update_fock(gg, i, j, l, k);
        update_fock(gg, j, i, k, l);
        update_fock(gg, j, i, l, k);
        update_fock(gg, k, l, i, j);
        update_fock(gg, k, l, j, i);
        update_fock(gg, l, k, i, j);
        update_fock(gg, l, k, j, i);
      }
    }
  }
}

The loop structure for the other five cases can be trivially generated by removing the loops which are no longer independent. The variables should then simply be replaced with the independent index. Finally any duplicate identical updates should be removed.


As an example, the i = j case is:

for (i = 0; i < nbfn; i++) {
  // removed j loop
  for (k = i; k < nbfn; k++) {
    for (l = 1 + k; l < nbfn; l++) {
      
      gg = g(i,i,k,l); // Note that first two arguments are both i
      
      //perform four updates
      update_fock(gg, i, i, k, l);
      update_fock(gg, i, i, l, k);
      // duplicate: update_fock(gg, i, i, k, l);
      // duplicate: update_fock(gg, i, i, l, k);
      update_fock(gg, k, l, i, i);
      // duplicate: update_fock(gg, k, l, i, i);
      update_fock(gg, l, k, i, i);
      // duplicate: update_fock(gg, l, k, i, i);
      }
    }
  }
}

Note that the special calculation of the starting index for the l loop is only required in the fully symmetric case. In the i = j case, the initial value is:

if (k == i) l_start = 1 + i;
else l_start = 1 + k;

However, we can substitute i with k when k == i:

if (k == i) l_start = 1 + k;
else l_start = 1 + k;

Which removes the need for the conditional assignment since the initial value is 1 + k in both cases.

For all other symmetry restrictions, the l loop is removed entirely.

Using a precomputed lookup table in g()

Many of the individual calculations in g() only rely on the i,j or k,l argument pairs. Such values can be precomputed during initialization and stored in a matrix for lookup in g(). Because of the symmetry between the i,j and k,l pairs, one lookup table can be used for both. The additional memory requirements for using a lookup table are on the order of N^2, which is the same order as required for the fock, schwarz and other matricies.

Additionally, by using lookup tables, an e^x calculation can be removed from g() and replaced with a multiplication in the following line:

double exijkl = exprjh(-facij * rab2 - fackl * rcd2);

This is possible because that code has the following properties:

  • exprjh(x) is simply a shortcut around exp(x) which returns 0 for x values < -37
  • facij and rab2 only depend on i and j arguments
  • fackl and rcd2 only depend on the k and l arguments
  • the result can be re-expressed as exp(-facij * rab2) * exp(-fackl * rcd2);
  • Now, both calls to exp() only depend on one of the argument pairs, which allows all possible values to be stored in a lookup table.

In total, there are six values which can be precomputed and stored in a lookup table. To improve data locality and cache utilization, the values can be stored in an array of structs. The struct definition and description of each value from the original g() is:

struct {
  double ex;      // exp(-facij * rab2)
  double expnt;   // expnt[i] + expnt[j]
  double x;       // x[i] * expnt[i] + x[j] * expnt[j]) / expntIJ
  double y;       // y[i] * expnt[i] + y[j] * expnt[j]) / expntIJ
  double z;       // z[i] * expnt[i] + z[j] * expnt[j]) / expntIJ
  double rnorm;   // rnorm[i] * rnorm[j]
} g_lookup_table;

With a lookup table, the g function becomes:

static inline double g(int i, int j, int k, int l) {
  double f0val;
  g_lookup_table* lookup_ij = &g_lookup[i * nbfn + j];
  g_lookup_table* lookup_kl = &g_lookup[j * nbfn + k];
  
  double expntIJ = lookup_ij->expnt;
  double expntKL = lookup_kl->expnt;
  
  double exijkl = lookup_ij->ex * lookup_kl->ex;
  double denom  = expntIJ * expntKL * sqrt(expntIJ + expntKL);
  double fac    = (expntIJ) * (expntKL) / (expntIJ + expntKL);
  
  double xp   = lookup_ij->x;
  double yp   = lookup_ij->y;
  double zp   = lookup_ij->z;
  double xq   = lookup_kl->x;
  double yq   = lookup_kl->y;
  double zq   = lookup_kl->z;
  double rpq2 = (xp - xq) * (xp - xq) + (yp - yq) * (yp - yq) + (zp - zq) * (zp - zq); 
  
  f0val = f0(fac * rpq2);
  return (two_pi_pow_2_5 / denom) * exijkl * f0val * lookup_ij->rnorm * lookup_kl->rnorm;
}

In addition to reducing the complexity of calculations in g(), using a lookup table also reduces the number of memory operation from 20 (x, y, z, expnt and rnorm for each i,j,k,l index) in the reference code to 12 (2 * 6 struct entries).