From Modelado Foundation

Developer(s) Pacific Northwestern National Laboratory
Stable Release 6.1/Latest Release Date here
Operating Systems Linux, Unix, etc.
Type Computational Chemistry
License Educational Community License (ECL) 2.0

NWChem provides many methods for computing the properties of molecular and periodic systems using standard quantum mechanical descriptions of the electronic wavefunction or density. Its classical molecular dynamics capabilities provide for the simulation of macromolecules and solutions, including the computation of free energies using a variety of force fields. These approaches may be combined to perform mixed quantum-mechanics and molecular-mechanics simulations.


There are many modules of NWChem. In the DynAX XStack program, PNNL will extract two of these modules and produce simple, readable, serial modules. The DynAX team will produce parallel and optimized versions of these modeules and report on those optimizations.

The first of these modules is the Self Consistent Field (SCF).

SCF Module

Click here to view a Powerpoint Presentation on the SCF Module

Click here to view a Word document describing the SCF module



You can download the code here. It is licensed under the BSD license.


Presentation on the SCF module optimizations (PDF format)

Additional Work Skipping

The reference implementation has two tests to skip unnecessary work, one at the i,j level and one at the i,j,k,l level (innermost loop). An additional test can be added to the i,j,k level which will allow skipping entire iterations of the inner most loop.

Reference twoel function:

 double twoel(double schwmax) {
   int i, j, k, l;
   for (i = 0; i < nbfn; i++) {
   for (j = 0; j < nbfn; j++) {
       if ((g_schwarz[i][j] * schwmax        ) < tol2e) {icut1 += nbfn * nbfn; continue;}
       double  KLTest = tol2e / g_schwarz[i][j];
   for (k = 0; k < nbfn; k++) {
   for (l = 0; l < nbfn; l++) {
       if (g_schwarz[k][l] < KLTest) {icut2 ++; continue;}
       icut3 ++;
       double gg = g(i, j, k, l);
       g_fock[i][j] += (       gg * g_dens[k][l]);
       g_fock[i][k] -= (0.50 * gg * g_dens[j][l]);
     } } } }
   return (0.50 * contract_matrices(g_fock, g_dens));

The inner-most test, (g_schwarz[k][l] < KLTest), can be modified to work in the k loop under certain conditions. If the maximum value of all entries for a given row of g_schwarz is known, that can be compared to KLTest in the k loop and we can skip the l loop entirely in cases where it would not perform any updates.

This requires one additional line to be inserted between the k and l loops:

   for (k = 0; k < nbfn; k++) {
     if (g_schwarz_max_row_value[k] < KLTest) {icut4 ++; continue;}
   for (l = 0; l < nbfn; l++) {

To enable this additional test, the array g_schwarz_max_row_value should be populated with the maximum value for each row in makesz while it is initializing the g_schwarz matrix.

The additional memory requirements for this work skipping test are minimal as the array only has one dimension of size nbfn.

Symmetry of g()

The g function has several symmetries which can be exploited to reduce the number of times it must be run.

g(i, j, k, l) returns the same value in the following cases:

  • The i and j parameters are swapped.
  • The k and l parameters are swapped.
  • The i,j and k,l pairs are swapped.

Combining all possibly symmetries leads to an overall 8-way symmetry where the following calls all return the same value:

  • g(i, j, k, l)
  • g(i, j, l, k)
  • g(j, i, k, l)
  • g(j, i, l, k)
  • g(k, l, i, j)
  • g(k, l, j, i)
  • g(l, k, i, j)
  • g(l, k, j, i)

By calculating gg = g(i, j, k, l) once, we can then simply perform multiple updates to the fock matrix.

  • update_fock(gg, i, j, k, l)
  • update_fock(gg, i, j, l, k)
  • update_fock(gg, j, i, k, l)
  • update_fock(gg, j, i, l, k)
  • update_fock(gg, k, l, i, j)
  • update_fock(gg, k, l, j, i)
  • update_fock(gg, l, k, i, j)
  • update_fock(gg, l, k, j, i)

With update() being a macro or inlined function similar to:

update(gg, a, b, c, d) {
  g_fock[a][b] += (gg * g_dens[c][d]);
  g_fock[a][c] -= (0.50 * gg * g_dens[b][d]);

When taking advantage of the symmetry, there are special cases which need to be considered. Most obvious is that the iteration range of the loops must be modified so that g() is only called once for each equivalent ordering of its arguments then all possible updates to the fock matrix are made with the result. Secondly, when arguments which can be swapped are equal, there are fewer updates to be made to the fock matrix.

As an example, consider the case where i, j, k, and l are all equal to 0. The reference implementation can only process this once since the loop indices can only all be zero once. If all symmetric updates are performed in this case, the result would be that the exact same update is performed eight times since the arguments to update(0, 0, 0, 0) are unchanged by the swap operations.

To ensure proper operation, there are six separate variants of the twoel loops, one for each possible symmetry restriction:

  • No restrictions
  • i = j
  • k = l
  • i,j pair is identical to k,l pair
  • i = j and k = l
  • i, j, k, and l are all equal

With the special cases taken in account, the loop structure for the fully symmetric case is:

for (i = 0; i < nbfn; i++) {
  for (j = i + 1; j < nbfn; j++) {
    for (k = i; k < nbfn; k++) {
      if (k == i) l_start = 1 + j;
      else l_start = 1 + k;
      for (l = l_start; l < nbfn; l++) {

        gg = g(i,j,k,l);

        //perform all eight updates
        update_fock(gg, i, j, k, l);
        update_fock(gg, i, j, l, k);
        update_fock(gg, j, i, k, l);
        update_fock(gg, j, i, l, k);
        update_fock(gg, k, l, i, j);
        update_fock(gg, k, l, j, i);
        update_fock(gg, l, k, i, j);
        update_fock(gg, l, k, j, i);

The loop structure for the other five cases can be trivially generated by removing the loops which are no longer independent. The variables should then simply be replaced with the independent index. Finally any duplicate identical updates should be removed.

As an example, the i = j case is:

for (i = 0; i < nbfn; i++) {
  // removed j loop
  for (k = i; k < nbfn; k++) {
    for (l = 1 + k; l < nbfn; l++) {
      gg = g(i,i,k,l); // Note that first two arguments are both i
      //perform four updates
      update_fock(gg, i, i, k, l);
      update_fock(gg, i, i, l, k);
      // duplicate: update_fock(gg, i, i, k, l);
      // duplicate: update_fock(gg, i, i, l, k);
      update_fock(gg, k, l, i, i);
      // duplicate: update_fock(gg, k, l, i, i);
      update_fock(gg, l, k, i, i);
      // duplicate: update_fock(gg, l, k, i, i);

Note that the special calculation of the starting index for the l loop is only required in the fully symmetric case. In the i = j case, the initial value is:

if (k == i) l_start = 1 + i;
else l_start = 1 + k;

However, we can substitute i with k when k == i:

if (k == i) l_start = 1 + k;
else l_start = 1 + k;

Which removes the need for the conditional assignment since the initial value is 1 + k in both cases.

For all other symmetry restrictions, the l loop is removed entirely.

Using a precomputed lookup table in g()

Many of the individual calculations in g() only rely on the i,j or k,l argument pairs. Such values can be precomputed during initialization and stored in a matrix for lookup in g(). Because of the symmetry between the i,j and k,l pairs, one lookup table can be used for both. The additional memory requirements for using a lookup table are on the order of N^2, which is the same order as required for the fock, schwarz and other matricies.

Additionally, by using lookup tables, an e^x calculation in the following line can be removed from g() and replaced with a multiplication:

double exijkl = exprjh(-facij * rab2 - fackl * rcd2);

This is possible because that code has the following properties:

  • exprjh(x) is simply a shortcut around exp(x) which returns 0 for x values < -37
  • facij and rab2 only depend on i and j arguments
  • fackl and rcd2 only depend on the k and l arguments
  • the result can be re-expressed as exp(-facij * rab2) * exp(-fackl * rcd2);
  • Now, both calls to exp() only depend on one of the argument pairs, which allows all possible values to be stored in a lookup table.

In total, there are six values which can be precomputed and stored in a lookup table. To improve data locality and cache utilization, the values can be stored in an array of structs. The struct definition and description of each value from the original g() is:

struct {
  double ex;      // exp(-facij * rab2)
  double expnt;   // expnt[i] + expnt[j]
  double x;       // x[i] * expnt[i] + x[j] * expnt[j]) / expnt
  double y;       // y[i] * expnt[i] + y[j] * expnt[j]) / expnt
  double z;       // z[i] * expnt[i] + z[j] * expnt[j]) / expnt
  double rnorm;   // rnorm[i] * rnorm[j]
} g_lookup_table;

With a lookup table, the g function becomes:

static inline double g(int i, int j, int k, int l) {
  double f0val;
  g_lookup_table* lookup_ij = &g_lookup[i * nbfn + j];
  g_lookup_table* lookup_kl = &g_lookup[j * nbfn + k];
  double expntIJ = lookup_ij->expnt;
  double expntKL = lookup_kl->expnt;
  double exijkl = lookup_ij->ex * lookup_kl->ex;
  double denom  = expntIJ * expntKL * sqrt(expntIJ + expntKL);
  double fac    = (expntIJ) * (expntKL) / (expntIJ + expntKL);
  double xp   = lookup_ij->x;
  double yp   = lookup_ij->y;
  double zp   = lookup_ij->z;
  double xq   = lookup_kl->x;
  double yq   = lookup_kl->y;
  double zq   = lookup_kl->z;
  double rpq2 = (xp - xq) * (xp - xq) + (yp - yq) * (yp - yq) + (zp - zq) * (zp - zq); 
  f0val = f0(fac * rpq2);
  return (two_pi_pow_2_5 / denom) * exijkl * f0val * lookup_ij->rnorm * lookup_kl->rnorm;

In addition to reducing the complexity of calculations in g(), using a lookup table also reduces the number of memory operation from 20 (x, y, z, expnt and rnorm for each i,j,k,l index) in the reference code to 12 (2 * 6 struct entries).

Symmetric matrices

Both the input, g_dens, and output, g_fock, matrices used in twoel() are symmetric matrices. This means that half of the memory accesses or updates to each are redundant and can be eliminated from the inner loops of twoel(). The required changes to the updates vary somewhat between the Coulomb and exchange contribution calculations. The Coulomb update contribution is performed by the following function:

void update_coulomb(double g_value, int a, int b, int c, int d) {
  g_fock[a][b] += g_value * g_dens[c][d];

The symmetry in the Coulomb contributions is a subset of the symmetry in the g() function, and is therefore trivial to implement in the loops that have already been decomposed. Consider the updates in the full 8-way g() symmetry loop which has the restrictions i != j, k != l and i,j != k,l:

update_coulomb(g_value, i, j, k, l);
update_coulomb(g_value, i, j, l, k);
update_coulomb(g_value, j, i, k, l);
update_coulomb(g_value, j, i, l, k);
update_coulomb(g_value, k, l, i, j);
update_coulomb(g_value, k, l, j, i);
update_coulomb(g_value, l, k, i, j);
update_coulomb(g_value, l, k, j, i);

The "a" and "b" variables are used as indexes into the g_fock output matrix, when they are swapped, the same value is accumulated into the corresponding symmetric entry. All updates which swap "a" and "b" can be removed as the loop has already been decomposed to insure that j > i and k > l, therefore the swapped values only update the lower triangle which is unnecessary. This leaves us with the following updates:

update_coulomb(g_value, i, j, k, l);
update_coulomb(g_value, i, j, l, k);
update_coulomb(g_value, k, l, i, j);
update_coulomb(g_value, k, l, j, i);

Next we can use the symmetry of g_dens to remove the updates which swap the "c" and "d" arguments. Because this is using symmetry from the input matrix, we can't simply delete these updates. Instead, we can easily combine the symmetric updates into a single update. Recall that the Coulomb update is:

g_fock[a][b] += g_value * g_dens[c][d];

Instead of calling update_coulomb twice with swapped c and d arguments, the swap can be internalized to the function:

g_fock[a][b] += g_value * g_dens[c][d] + g_value * g_dens[d][c];

Because g_dens is symmetric, this can be substituted with:

g_fock[a][b] += g_value * g_dens[c][d] + g_value * g_dens[c][d];

Which is reduced to:

g_fock[a][b] += 2.0 * g_value * g_dens[c][d];

We can therefore create a new update function:

void update_coulomb2(double g_value, int a, int b, int c, int d) {
  g_fock[a][b] += 2.0 * g_value * g_dens[c][d];

Using the new update_coulomb2() function leads to the final coulomb updates:

update_coulomb2(g_value, i, j, k, l);
update_coulomb2(g_value, k, l, i, j);

The update_exchange() calls are more complicated to modify as their symmetry does not match the symmetry in g() as well. The exchange contributions are symmetric when "a" and "c" are swapped, or when "b" and "d" are swapped. This can only be easily used to get a two-fold reduction in updates which corresponds to the i,j != k,l loop restriction already in place. We did not fully take advantage of this symmetry to limit all updates to one triangle of the matrix. Instead, we eliminated all redundant updates but still allowed writing to both triangles. This required using a additional temporary matrix which was then post-processed to regenerate the correct symmetric values. The updates to the temporary matrix are performed in the following function:

void update_exchange(double g_value, int a, int b, int c, int d) {
  t_fock[a][c] -= 0.50 * g_value * g_dens[b][d];

When not taking advantage of any symmetry, the following updates must be performed:

update_exchange(g_value, i, j, k, l);
update_exchange(g_value, i, j, l, k);
update_exchange(g_value, j, i, k, l);
update_exchange(g_value, j, i, l, k);
update_exchange(g_value, k, l, i, j);
update_exchange(g_value, k, l, j, i);
update_exchange(g_value, l, k, i, j);
update_exchange(g_value, l, k, j, i);

Using the i,j = k,l symmetry of g(), we can eliminate half of the updates:

update_exchange(g_value, i, j, k, l);
update_exchange(g_value, i, j, l, k);
update_exchange(g_value, j, i, k, l);
update_exchange(g_value, j, i, l, k);

After completing the N^4 loop, the temporary matrix is accumulated into the fock matrix as follows:

for (i = 0; i < nfbn; i++)
  for (j = 0; j < nfbn; j++)
    g_fock[i][j] += t_fock[i][j] + t_fock[j][i];

Although we did not implement it, it should be possible to further decompose the loops to ensure only one triangle of the matrix is updated.

Optimizations for large problem sizes

Matrix routines

The matrix multiplication and eigenvalue solver used in diagon() operates fast for small problem sizes, but as the problem size increases, diagon() can consume a larger percentage of the overall time. In fact, if the input problem is sparse enough, diagon() can scale worse than twoel() and consume the majority of the computation time.

This situation can be easily improved by replacing the Dgemm() in integ.c with a call to a BLAS dgemm and by replacing rs() and rsg() in diagonalize.c with dsyge() and dsygv() from LAPACK.

All of the replacement functions are direct equivalents which only require minor changes to use (mostly additional arguments).

Vector operations

If the compiler is not able to automatically vectorize operations well, there may also be a benefit to using level 1 BLAS functions.

contract_matricies() can be replaced with ddot()

damp() can be implemented by two BLAS routines:

void damp(double fac) {
  // g_dens = fac * g_dens
  cblas_dscal(nbfn*nbfn, fac, g_dens, 1);
  // g_dens = ((1.0 – fac) * g_work) + g_dens
  cblas_daxpy(nbfn*nbfn, 1.0 - fac, g_work, 1, g_dens, 1);

Which produce the same calculation as the original damp():

g_dens = (fac * g_dens) + ((1.0 – fac) * g_work)

Using BLAS routines in damp() has additional overhead as the vectors must be iterated over twice, so any benefit is limited to the case where the BLAS library is better optimized than what the compiler can generate from the original code.


The calculation of h() in oneel() is unnecessary as the result never changes from one iteration to the next. Instead of recalculating h() to initialize the fock matrix each iteration, it can be calculated and stored once during initialization, then copied back into the fock matrix at the beginning of each iteration. The performance increase of this is small relative to the total execution time, but may still result noticeable differences for large problems.


This section is not complete yet. There is a good explanation in the DynAX Q2 report here, but it is not yet incorporated into the wiki.

Shared Memory


SWARM Single Node

Distributed Memory

MPI + OpenMP

SWARM Multiple Nodes