Actions

NWChem

From Modelado Foundation

Revision as of 02:58, March 7, 2013 by imported>Rishi (→‎SCF Module)
NWChem
{{{image}}}
{{{imagecaption}}}
Developer(s) Pacific Northwestern National Laboratory
Stable Release 6.1/Latest Release Date here
Operating Systems Linux, Unix, etc.
Type Computational Chemistry
License Educational Community License (ECL) 2.0
Website http://www.nwchem-sw.org

NWChem provides many methods for computing the properties of molecular and periodic systems using standard quantum mechanical descriptions of the electronic wavefunction or density. Its classical molecular dynamics capabilities provide for the simulation of macromolecules and solutions, including the computation of free energies using a variety of force fields. These approaches may be combined to perform mixed quantum-mechanics and molecular-mechanics simulations.

Applications

There are many modules of NWChem. In the DynAX XStack program, PNNL will extract two of these modules and produce simple, readable, serial modules. The DynAX team will produce parallel and optimized versions of these modeules and report on those optimizations.

The first of these modules is the Self Consistent Field (SCF).

SCF Module

Click here to view a Powerpoint Presentation on the SCF Module

Click here to view a Word document describing the SCF module

Description

Download

You can download the code here. It is licensed under the BSD license.

Optimizations

Presentation on the SCF module optimizations (PDF format)

Additional Work Skipping

The reference implementation has two tests to skip unnecessary work, one at the i,j level and one at the i,j,k,l level (innermost loop). An additional test can be added to the i,j,k level which will allow skipping entire iterations of the inner most loop.

Reference twoel function:

 double twoel(double schwmax) {
   int i, j, k, l;
   
   for (i = 0; i < nbfn; i++) {
   for (j = 0; j < nbfn; j++) {
       if ((g_schwarz[i][j] * schwmax        ) < tol2e) {icut1 += nbfn * nbfn; continue;}
       double  KLTest = tol2e / g_schwarz[i][j];
   
   for (k = 0; k < nbfn; k++) {
   for (l = 0; l < nbfn; l++) {
   
       if (g_schwarz[k][l] < KLTest) {icut2 ++; continue;}
   
       icut3 ++;
       double gg = g(i, j, k, l);
       g_fock[i][j] += (       gg * g_dens[k][l]);
       g_fock[i][k] -= (0.50 * gg * g_dens[j][l]);
   
     } } } }
   
   return (0.50 * contract_matrices(g_fock, g_dens));
 }

The inner-most test, (g_schwarz[k][l] < KLTest), can be modified to work in the k loop under certain conditions. If the maximum value of all entries for a given row of g_schwarz is known, that can be compared to KLTest in the k loop and we can skip the l loop entirely in cases where it would not perform any updates.

This requires one additional line to be inserted between the k and l loops:

   for (k = 0; k < nbfn; k++) {
     if (g_schwarz_max_row_value[k] < KLTest) {icut4 ++; continue;}
   for (l = 0; l < nbfn; l++) {

To enable this additional test, the array g_schwarz_max_row_value should be populated with the maximum value for each row in makesz while it is initializing the g_schwarz matrix.

The additional memory requirements for this work skipping test are minimal as the array only has one dimension of size nbfn.

Symmetry of g()

The g function has several symmetries which can be exploited to reduce the number of times it must be run.

g(i, j, k, l) returns the same value in the following cases:

  • The i and j parameters are swapped.
  • The k and l parameters are swapped.
  • The i,j and k,l pairs are swapped.

Combining all possibly symmetries leads to an overall 8-way symmetry where the following calls all return the same value:

  • g(i, j, k, l)
  • g(i, j, l, k)
  • g(j, i, k, l)
  • g(j, i, l, k)
  • g(k, l, i, j)
  • g(k, l, j, i)
  • g(l, k, i, j)
  • g(l, k, j, i)

By calculating gg = g(i, j, k, l) once, we can then simply perform multiple updates to the fock matrix.

  • update_fock(gg, i, j, k, l)
  • update_fock(gg, i, j, l, k)
  • update_fock(gg, j, i, k, l)
  • update_fock(gg, j, i, l, k)
  • update_fock(gg, k, l, i, j)
  • update_fock(gg, k, l, j, i)
  • update_fock(gg, l, k, i, j)
  • update_fock(gg, l, k, j, i)

With update() being a macro or inlined function similar to:

update(gg, a, b, c, d) {
  g_fock[a][b] += (gg * g_dens[c][d]);
  g_fock[a][c] -= (0.50 * gg * g_dens[b][d]);
}


When taking advantage of the symmetry, there are special cases which need to be considered. Most obvious is that the iteration range of the loops must be modified so that g() is only called once for each equivalent ordering of its arguments then all possible updates to the fock matrix are made with the result. Secondly, when arguments which can be swapped are equal, there are fewer updates to be made to the fock matrix.

As an example, consider the case where i, j, k, and l are all equal to 0. The reference implementation can only process this once since the loop indices can only all be zero once. If all symmetric updates are performed in this case, the result would be that the exact same update is performed eight times since the arguments to update(0, 0, 0, 0) are unchanged by the swap operations.

To ensure proper operation, there are six separate variants of the twoel loops, one for each possible symmetry restriction:

  • No restrictions
  • i = j
  • k = l
  • i,j pair is identical to k,l pair
  • i = j and k = l
  • i, j, k, and l are all equal

With the special cases taken in account, the loop structure for the fully symmetric case is:

for (i = 0; i < nbfn; i++) {
  for (j = i + 1; j < nbfn; j++) {
    for (k = i; k < nbfn; k++) {
      
      if (k == i) l_start = 1 + j;
      else l_start = 1 + k;
      
      for (l = l_start; l < nbfn; l++) {

        gg = g(i,j,k,l);

        //perform all eight updates
        update_fock(gg, i, j, k, l);
        update_fock(gg, i, j, l, k);
        update_fock(gg, j, i, k, l);
        update_fock(gg, j, i, l, k);
        update_fock(gg, k, l, i, j);
        update_fock(gg, k, l, j, i);
        update_fock(gg, l, k, i, j);
        update_fock(gg, l, k, j, i);
      }
    }
  }
}

The loop structure for the other five cases can be trivially generated by removing the loops which are no longer independent. The variables should then simply be replaced with the independent index. Finally any duplicate identical updates should be removed.


As an example, the i = j case is:

for (i = 0; i < nbfn; i++) {
  // removed j loop
  for (k = i; k < nbfn; k++) {
    for (l = 1 + k; l < nbfn; l++) {
      
      gg = g(i,i,k,l); // Note that first two arguments are both i
      
      //perform four updates
      update_fock(gg, i, i, k, l);
      update_fock(gg, i, i, l, k);
      // duplicate: update_fock(gg, i, i, k, l);
      // duplicate: update_fock(gg, i, i, l, k);
      update_fock(gg, k, l, i, i);
      // duplicate: update_fock(gg, k, l, i, i);
      update_fock(gg, l, k, i, i);
      // duplicate: update_fock(gg, l, k, i, i);
      }
    }
  }
}

Note that the special calculation of the starting index for the l loop is only required in the fully symmetric case. In the i = j case, the initial value is:

if (k == i) l_start = 1 + i;
else l_start = 1 + k;

However, we can substitute i with k when k == i:

if (k == i) l_start = 1 + k;
else l_start = 1 + k;

Which removes the need for the conditional assignment since the initial value is 1 + k in both cases.

For all other symmetry restrictions, the l loop is removed entirely.

Using a precomputed lookup table in g()

Many of the individual calculations in g() only rely on the i,j or k,l argument pairs. Such values can be precomputed during initialization and stored in a matrix for lookup in g(). Because of the symmetry between the i,j and k,l pairs, one lookup table can be used for both. The additional memory requirements for using a lookup table are on the order of N^2, which is the same order as required for the fock, schwarz and other matricies.

Additionally, by using lookup tables, an e^x calculation in the following line can be removed from g() and replaced with a multiplication:

double exijkl = exprjh(-facij * rab2 - fackl * rcd2);

This is possible because that code has the following properties:

  • exprjh(x) is simply a shortcut around exp(x) which returns 0 for x values < -37
  • facij and rab2 only depend on i and j arguments
  • fackl and rcd2 only depend on the k and l arguments
  • the result can be re-expressed as exp(-facij * rab2) * exp(-fackl * rcd2);
  • Now, both calls to exp() only depend on one of the argument pairs, which allows all possible values to be stored in a lookup table.

In total, there are six values which can be precomputed and stored in a lookup table. To improve data locality and cache utilization, the values can be stored in an array of structs. The struct definition and description of each value from the original g() is:

struct {
  double ex;      // exp(-facij * rab2)
  double expnt;   // expnt[i] + expnt[j]
  double x;       // x[i] * expnt[i] + x[j] * expnt[j]) / expnt
  double y;       // y[i] * expnt[i] + y[j] * expnt[j]) / expnt
  double z;       // z[i] * expnt[i] + z[j] * expnt[j]) / expnt
  double rnorm;   // rnorm[i] * rnorm[j]
} g_lookup_table;

With a lookup table, the g function becomes:

static inline double g(int i, int j, int k, int l) {
  double f0val;
  g_lookup_table* lookup_ij = &g_lookup[i * nbfn + j];
  g_lookup_table* lookup_kl = &g_lookup[j * nbfn + k];
  
  double expntIJ = lookup_ij->expnt;
  double expntKL = lookup_kl->expnt;
  
  double exijkl = lookup_ij->ex * lookup_kl->ex;
  double denom  = expntIJ * expntKL * sqrt(expntIJ + expntKL);
  double fac    = (expntIJ) * (expntKL) / (expntIJ + expntKL);
  
  double xp   = lookup_ij->x;
  double yp   = lookup_ij->y;
  double zp   = lookup_ij->z;
  double xq   = lookup_kl->x;
  double yq   = lookup_kl->y;
  double zq   = lookup_kl->z;
  double rpq2 = (xp - xq) * (xp - xq) + (yp - yq) * (yp - yq) + (zp - zq) * (zp - zq); 
  
  f0val = f0(fac * rpq2);
  return (two_pi_pow_2_5 / denom) * exijkl * f0val * lookup_ij->rnorm * lookup_kl->rnorm;
}

In addition to reducing the complexity of calculations in g(), using a lookup table also reduces the number of memory operation from 20 (x, y, z, expnt and rnorm for each i,j,k,l index) in the reference code to 12 (2 * 6 struct entries).

Optimizations for large problem sizes

Matrix routines

The matrix multiplication and eigenvalue solver used in diagon() operates fast for small problem sizes, but as the problem size increases, diagon() can consume a larger percentage of the overall time. In fact, if the input problem is sparse enough, diagon() can scale worse than twoel() and consume the majority of the computation time.

This situation can be easily improved by replacing the Dgemm() in integ.c with a call to a BLAS dgemm and by replacing rs() and rsg() in diagonalize.c with dsyge() and dsygv() from LAPACK.

All of the replacement functions are direct equivalents which only require minor changes to use (mostly additional arguments).

Vector operations

If the compiler is not able to automatically vectorize operations well, there may also be a benefit to using level 1 BLAS functions.

contract_matricies() can be replaced with ddot()

damp() can be implemented by two BLAS routines:

void damp(double fac) {
  // g_dens = fac * g_dens
  cblas_dscal(nbfn*nbfn, fac, g_dens, 1);
  
  // g_dens = ((1.0 – fac) * g_work) + g_dens
  cblas_daxpy(nbfn*nbfn, 1.0 - fac, g_work, 1, g_dens, 1);
}

Which produce the same calculation as the original damp():

g_dens = (fac * g_dens) + ((1.0 – fac) * g_work)

Using BLAS routines in damp() has additional overhead as the vectors must be iterated over twice, so any benefit is limited to the case where the BLAS library is better optimized than what the compiler can generate from the original code.

oneel()

The calculation of h() in oneel() is unnecessary as the result never changes from one iteration to the next. Instead of recalculating h() to initialize the fock matrix each iteration, it can be calculated and stored once during initialization, then copied back into the fock matrix at the beginning of each iteration. The performance increase of this is small relative to the total execution time, but may still result noticeable differences for large problems.