XT3/Jaguar porting issues

This is a page to document porting issues, solutions, and resources in running on the Jaguar XT3 at ORNL.
TORIC code - jwright

These are issues that Ed D'Azevedo, Vickie Lynch, and I (John Wright) encountered.

- The compiler was upgraded from pgi4.1.4 to pgi4.1.5 during the Jaguar upgrade. The new version fixed a bug that appeared in one routine in TORIC. If the compiler upgrade introduces problems of its own, consider using the earlier version.

- The buffer for unexpected MPI messages was exhausted. The specific error message was 'aborting job:
internal ABORT - process 0: Other MPI error, error stack:
MPIDI_PortalsU_Request_PUPE(605): exhausted unexpected receive queue buffering increase via env. var. MPICH_UNEX_BUFFER_SIZE'. This was solved by reducing the number of diagnostic messages and warnings from the code and by increasing the environment variable 'MPICH_UNEX_BUFFER_SIZE' from its default value of 60000000 to 200000000 (60M to 200M) [1]

- Just to be safe, you can set striping off with 'lfs setstripe <directory> 0 -1 1' [4]

- Not closing Fortran units explicitly before the end of the program can cause problems. Solution: close units explicitly, and add 'mpi_barrier' before 'mpi_finalize'.

- The code crashed during a long sequence of allocate statements. Linking in the gnu malloc library ('-lgmalloc') solved the problem.

- Gathers ('ALLGATHERV') overloaded communication and may not scale well on the XT3. We replaced them with 'ALLREDUCE', which was doable for our algorithm.
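One common way to replace a gather with a sum reduction is for each process to place its own slice into a zero-filled buffer of the full length and then sum the buffers across processes. Whether TORIC did it this way is not stated here, so the following is only a sketch of the idea in C. The reduction is emulated serially so that it runs without MPI, and the function names pack_contribution and sum_into are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Build this rank's contribution: a full-length buffer that is zero
 * everywhere except for the rank's own slice at offset displ.
 * (Hypothetical helper, not taken from TORIC.) */
void pack_contribution(const double *local, size_t count, size_t displ,
                       double *padded, size_t total)
{
    size_t i;
    for (i = 0; i < total; i++)
        padded[i] = 0.0;
    for (i = 0; i < count; i++)
        padded[displ + i] = local[i];
}

/* Serial stand-in for the sum reduction. In an MPI code this step
 * would be a single collective call such as
 *   MPI_Allreduce(MPI_IN_PLACE, padded, total, MPI_DOUBLE, MPI_SUM, comm);
 * Here the buffers are simply added one rank at a time. */
void sum_into(double *result, const double *padded, size_t total)
{
    size_t i;
    for (i = 0; i < total; i++)
        result[i] += padded[i];
}
```

Because every slice is added to zeros, the summed buffer holds the same values a gather would have produced. The cost is that every process sends the full-length vector, so this only pays off when the gathered array is small enough, which is presumably what "doable for our algorithm" refers to.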

All of our problems resulted in catastrophic failures that were relatively easy to locate. We did not experience any "almost right" problems in which the answers changed subtly.
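For the MPICH_UNEX_BUFFER_SIZE item in the list above, it can help to record in the job log which value a run actually saw. The sketch below is one way to do that from C. The helper names parse_buffer_size and report_unex_buffer are hypothetical, and the parser assumes the plain integer form used above (for example 200000000); it does not try to interpret suffixes such as "200M".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a buffer size given as a plain positive integer.
 * Returns the size in bytes, or -1 if the string is missing or is not
 * a plain positive integer. (Hypothetical helper.) */
long long parse_buffer_size(const char *s)
{
    char *end;
    long long v;

    if (s == NULL || *s == '\0')
        return -1;
    v = strtoll(s, &end, 10);
    if (*end != '\0' || v <= 0)
        return -1;
    return v;
}

/* Report the current setting on stderr.
 * Returns 1 if the variable is set to at least `wanted` bytes, else 0. */
int report_unex_buffer(long long wanted)
{
    long long v = parse_buffer_size(getenv("MPICH_UNEX_BUFFER_SIZE"));

    if (v < 0) {
        fprintf(stderr, "MPICH_UNEX_BUFFER_SIZE is unset or not a plain integer\n");
        return 0;
    }
    fprintf(stderr, "MPICH_UNEX_BUFFER_SIZE=%lld\n", v);
    return v >= wanted;
}
```

In an MPI program this would typically be called once, on one process, near the start of the run, for example as report_unex_buffer(200000000LL).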


.. [1] "XT3 Architecture and Software":

.. [2] "Optimization for the Cray XT3":

.. [3] "XT3 FY06 User's Meeting":

.. [4] "Jaguar FAQ":

NIMROD code - Vickie Lynch

The PGI compiler has a pointer bug that affects NIMROD (segmentation fault in deallocate when using nested Fortran pointers). This bug was reported to Cray and fixed in Fortran compiler version 6.2.5, but only for code compiled with optimization flags (-O2 and -fast). We recently reported that the fix does not work for a debug run with no optimization.

module swap pgi/6.1.4 pgi/6.2.5

Use optimization when you compile.

M3D code (Linda Sugiyama's version) - Vickie Lynch

For this version of M3D you need to unload the IOBUF module. The module is not required; it may give performance gains when doing IO to stdout. The bug in IOBUF has been reported to Cray as a compiler bug. To unload:
module unload IOBUF/1.0.4

The included C test code works with IOBUF/1.0.2 on Jaguar. With IOBUF/1.0.3 and IOBUF/1.0.4 (default), only the first number on each row is read by fscanf. This causes the fusion code, M3D, not to read its input correctly.

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <memory.h>
#define buffersize 192

int main(int argc, char *argv[])
{
  char linebuffer[buffersize];
  FILE *fptr;
  int ii;
  double x;

  /* Open the input file; stop if it is missing. */
  if ((fptr = fopen("omp.out", "r")) == NULL) {
    fprintf(stderr, "%s: omp-style input file not found.\n", "omp.out");
    return 1;
  }

  /* Skip the header line, then read and print the first ten values. */
  fgets(linebuffer, buffersize, fptr);
  for (ii=0; ii<10; ii++) {
    fscanf(fptr, "%le", &x);
    printf("x = %lf\n", x);
  }

  fclose(fptr);
  return 0;
}

The input file begins as follows (output of "head omp.out"):
x 24000 1 0.000
0.39710E+00 0.39717E+00 0.39724E+00 0.39731E+00 0.39737E+00 0.39730E+00
0.39723E+00 0.39716E+00 0.39708E+00 0.39701E+00
0.39693E+00 0.39684E+00 0.39673E+00 0.39650E+00 0.39626E+00 0.39603E+00
0.39579E+00 0.39555E+00 0.39531E+00 0.39506E+00
0.39477E+00 0.39438E+00 0.39398E+00 0.39358E+00 0.39316E+00 0.39274E+00
0.39231E+00 0.39189E+00 0.39135E+00 0.39077E+00
0.39019E+00 0.38961E+00 0.38901E+00 0.38841E+00 0.38781E+00 0.38710E+00
0.38635E+00 0.38558E+00 0.38480E+00 0.38398E+00
0.38316E+00 0.38232E+00 0.38136E+00 0.38036E+00 0.37935E+00 0.37834E+00
0.37731E+00 0.37627E+00 0.37517E+00 0.37395E+00
0.37270E+00 0.37143E+00 0.37015E+00 0.36884E+00 0.36750E+00 0.36603E+00
0.36452E+00 0.36299E+00 0.36142E+00 0.35983E+00
0.35819E+00 0.35644E+00 0.35465E+00 0.35280E+00 0.35090E+00 0.34897E+00
0.34695E+00 0.34486E+00 0.34273E+00 0.34057E+00
0.33836E+00 0.33607E+00 0.33364E+00 0.33116E+00 0.32863E+00 0.32606E+00
0.32338E+00 0.32056E+00 0.31768E+00 0.31474E+00
0.31175E+00 0.30870E+00 0.30553E+00 0.30224E+00 0.29885E+00 0.29539E+00
0.29184E+00 0.28808E+00 0.28424E+00 0.28033E+00
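The test program above does not check the return value of fscanf, so a short read only shows up as wrong printed values. A variant that counts how many values fscanf actually converts makes the symptom easier to spot. This is a sketch, not part of the original bug report: the function name count_values is hypothetical, and it assumes, as in the file shown above, one header line shorter than the buffer followed by whitespace-separated numbers.

```c
#include <assert.h>
#include <stdio.h>

#define BUFFERSIZE 192

/* Count the floating-point values that fscanf can read after the
 * header line. Returns -1 if the file cannot be opened or has no
 * header line. (Hypothetical helper.) */
long count_values(const char *path)
{
    char header[BUFFERSIZE];
    double x;
    long n = 0;
    FILE *fptr = fopen(path, "r");

    if (fptr == NULL)
        return -1;
    if (fgets(header, BUFFERSIZE, fptr) == NULL) {
        fclose(fptr);
        return -1;
    }
    /* fscanf returns 1 for each value it converts; stop at the first
     * failure or at end of file. */
    while (fscanf(fptr, "%le", &x) == 1)
        n++;

    fclose(fptr);
    return n;
}
```

Comparing the result of count_values("omp.out") with the number of values known to be in the file, with and without the IOBUF module loaded, would show whether values are being dropped.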

M3D code using FFTW - Vickie Lynch

I have a file, dphifftwc.c, that solves the problem of M3D blowing up when using FFTW on Jaguar (available by e-mail). There is some problem with the C pointer, ptr, illustrated in the following test code, which calculates a derivative in FFT space for the M3D code. The loop that uses the ptr pointer was the original loop that caused fftw/3.1 to give incorrect answers in M3D. If "cc testfftw.c --target=catamount -fastsse ${FFTW3_LIB}" is used with the fftw module loaded ("module load fftw/3.1"), the two loops give different answers. With -fastsse, the answers using the ptr pointer are two correct, two incorrect, two correct, and so on, as seen in the following output. This has been reported in a Jaguar system ticket as a possible compiler bug.
fftw with ptr=1.2 1.2
fftw with ptr=1.3 1.3
fftw with ptr=-0.12 0.12
fftw with ptr=-0.13 0.13
fftw with ptr=1.6 1.6
fftw with ptr=1.7 1.7
fftw with ptr=-0.16 0.16
fftw with ptr=-0.17 0.17
fftw with ptr=2 2
fftw with ptr=2.1 2.1
fftw with ptr=-0.2 0.2
fftw with ptr=-0.21 0.21
Without the -fastsse compiler option the answers with the ptr pointer are correct:
fftw with ptr=1.2 1.2
fftw with ptr=1.3 1.3
fftw with ptr=1.4 1.4
fftw with ptr=1.5 1.5
fftw with ptr=1.6 1.6
fftw with ptr=1.7 1.7
fftw with ptr=1.8 1.8
fftw with ptr=1.9 1.9
fftw with ptr=2 2
fftw with ptr=2.1 2.1
fftw with ptr=2.2 2.2
fftw with ptr=2.3 2.3

#include <stdio.h>
#include <string.h>
/* double precision, 1D fftw */
#include <fftw3.h>

/* FFTW can replace both power-of-2 FFT and non-power-of-2 discrete DFT */
/* but dealias still uses regular FFT (yet) */
int main()
{
  int totplanes = 10;
  int lverts = 12;
  int maxmode=totplanes/2 + 1;
  int plane, vert;
  double factor, re, im, *ptr;
  static fftw_complex *cplxbuf;

  /* Allocate main data arrays */
  if ((cplxbuf = (fftw_complex *)
       fftw_malloc(lverts * maxmode * sizeof(fftw_complex))) == NULL) {
    fprintf(stderr, "fftw_malloc failed\n");
    return 1;
  }

  /* Zero plane 0 and fill the remaining planes with test data */
  memset((void *)cplxbuf, 0, lverts*sizeof(fftw_complex));
  for (plane=1; plane<maxmode; plane++) {
    for (vert=plane*lverts; vert<(plane+1)*lverts; vert++) {
      cplxbuf[vert][0] = vert;
      cplxbuf[vert][1] = -vert;
    }
  }

  /* Derivative loop using array indexing */
  for (plane=1; plane<maxmode; plane++) {
    factor = (plane)/(double)totplanes;

    for (vert=plane*lverts; vert<(plane+1)*lverts; vert++) {
      re = cplxbuf[vert][0];
      im = cplxbuf[vert][1];
      cplxbuf[vert][0] = -factor*im;
      cplxbuf[vert][1] = factor*re;
      printf("fftw no ptr=%g\t%g\n", cplxbuf[vert][0],cplxbuf[vert][1]);
    }
  }

  /* Reset the test data */
  memset((void *)cplxbuf, 0, lverts*sizeof(fftw_complex));
  for (plane=1; plane<maxmode; plane++) {
    for (vert=plane*lverts; vert<(plane+1)*lverts; vert++) {
      cplxbuf[vert][0] = vert;
      cplxbuf[vert][1] = -vert;
    }
  }

  /* Same derivative loop using the ptr pointer (the original M3D form) */
  ptr = (double *) cplxbuf[lverts];
  for (plane=1; plane<maxmode; plane++) {
    factor = (plane)/(double)totplanes;

    for (vert=0; vert<lverts; vert++) {
      re = *ptr;
      *ptr = -factor*ptr[1];
      ptr[1] = factor*re;
      ptr += 2;
    }
    for (vert=plane*lverts; vert<(plane+1)*lverts; vert++) {
      printf("fftw with ptr=%g\t%g\n", cplxbuf[vert][0],cplxbuf[vert][1]);
    }
  }

  fftw_free(cplxbuf);
  return 0;
}
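Since the test code above never calls an FFTW transform, the comparison between the two loops can also be automated without linking FFTW. The sketch below assumes the problem lies in how the pointer loop is compiled rather than in the library itself, which the original report does not establish. The type cplx mirrors the default layout of fftw_complex (an array of two doubles), and the function name count_loop_mismatches is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef double cplx[2];   /* same layout as the default fftw_complex */

/* Run the indexed loop on one buffer and the pointer loop on another,
 * then count the vertices where the two results differ.
 * Returns -1 if memory cannot be allocated. (Hypothetical helper.) */
int count_loop_mismatches(int totplanes, int lverts)
{
    int maxmode = totplanes / 2 + 1;
    int n = lverts * maxmode;
    int plane, vert, mismatches = 0;
    double factor, re, im, *ptr;
    cplx *a = malloc(n * sizeof(cplx));
    cplx *b = malloc(n * sizeof(cplx));

    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1;
    }

    /* Identical test data in both buffers */
    for (vert = 0; vert < n; vert++) {
        a[vert][0] = b[vert][0] = vert;
        a[vert][1] = b[vert][1] = -vert;
    }

    /* Indexed form of the derivative loop */
    for (plane = 1; plane < maxmode; plane++) {
        factor = plane / (double)totplanes;
        for (vert = plane * lverts; vert < (plane + 1) * lverts; vert++) {
            re = a[vert][0];
            im = a[vert][1];
            a[vert][0] = -factor * im;
            a[vert][1] = factor * re;
        }
    }

    /* Pointer form of the same loop, starting at plane 1 */
    ptr = (double *) b[lverts];
    for (plane = 1; plane < maxmode; plane++) {
        factor = plane / (double)totplanes;
        for (vert = 0; vert < lverts; vert++) {
            re = *ptr;
            *ptr = -factor * ptr[1];
            ptr[1] = factor * re;
            ptr += 2;
        }
    }

    /* Both loops do the same arithmetic, so exact equality is expected */
    for (vert = 0; vert < n; vert++)
        if (a[vert][0] != b[vert][0] || a[vert][1] != b[vert][1])
            mismatches++;

    free(a);
    free(b);
    return mismatches;
}
```

Calling count_loop_mismatches(10, 12) uses the same sizes as the test code. A nonzero result when compiled with a given set of flags would indicate that the two loops disagree under those flags.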

Last modified 2010-10-28 13:39