benchmark.doc

*******************************************************************************
10/2004
vxpair benchmark on kiwi cluster
Run to t=.1   

              cpu time per timestep
cpu         1024x512        2048x1024   4096x2048

1          .02509             .10445     too big
2          .02276             .09262      .38686
4          .01744             .06993      .27939

 
*******************************************************************************

3/25/03
LAMPI (CVS 3-24-03), QSC
256^3  
                                    w/o Reliability           HP-R12
1x1x8       0.625                      0.628                 .604
1x1x16:     0.36340    
1x1x32:     0.18887 
1x1x64:     0.08524 


512^3     
1x1x64      0.86919 0.87976
1x1x128     0.43432 
1x1x256     0.207

                                                16            32     64
CNLS cluster                                  .759          .366   .190   


PINKISH:
             mpirun --gm --nper 2       lampi  2 per node
256^3
1x1x8         2.02639, 1.98
1x1x16        .766                        1.03
1x1x32        .460                        .475 
1x1x64        .206                        .199
1x1x128       .11476, .11304

512^3
1x1x128       1.24


**********************************************************************************
hi-res timings on QA, AFTER CPU upgrade:
using MPI_64bit_R5 unless otherwise noted

iso12 run, 512^3 runs with delt=.0003, t=1=1 eddy turnover time
           3333 timesteps per eddy turnover time
           assume 1024^3 needs 6666 timesteps per eddy turnover time
           = 5 days per eddy turnover time on 512 cpus.  

           1536^3 needs 9999 timesteps.  2min per timestep, 14days.
            

           2048^3 needs 2*6666 timesteps.  at 2.70m per timestep
           (2048 cpus, 2 rails), that's 25 days for 1 eddy turnover time.

           R_lambda=750  kmax*eta=1
           2*6666*4.1 = 38 days per eddy turnover time
           R_lambda=1100  kmax*eta=1
           4096^3     = 2432 days per eddy turnover time.

           Umax = max( |u| + |v| + |w| ) 
           CFL=1.5 = Umax delt/delx 


1024^3  144Gb            24Gb per variable                             1b gridpoints
1536^3  486Gb
2048^3  1152Gb           192Gb per variable    
3072^3  3240Gb           needs at least 1620 cpus.  6x1x512?  
                          try 3x1x512  


                  cpu time per timestep
                   (running 5 timesteps) 
512^3                                                       SHI_YI
    1x1x64              .813       (with forcing)             7.6s
    1x1x128             .393       (with forcing)
    1x1x256             .195       (with forcing)
    2x1x256             .158       (with focfing)
    4x1x256             .116       (with forcing)

    4x2x128   
    2x2x256   (doesn't work)


1024^3 
     1x1x128  3.76               (with forcing)
     1x1x256  1.89               (with forcing)
     1x1x512  1.04               (with forcing)
     2x1x512  0.684
     4x1x412  0.539

1536
     1x1x384    
     1x1x512
     1x1x768    2.05
     2x1x768    1.50
     3x1x768    1.46

2048^3     
      1x1x512     failed to load. try from scratch1? 
      1x1x1024    (with focing: 12.8min!  3x slower than old timing.  try again?  IN
      1x1x1024     4.18, 4.18, 3.63(2x rail)
      2x1x1024     3.04    with 2x rail: estmate: 2.70
      4x1x1024


on QB
2048^3
      1x1x1024     7.22  (1 rail)
                   3.45 (2 rails)  a little better than 3.63.  
                   3.61 
      2x1x1024


1024^3                   
      1x1x512      1.00  (1 rail)  as expected


**********************************************************************************

512^3 benchmarks
18Gb memory
134M gridpoints

                        cpus    cpu per timestep   
                                (averaging over 5 timesteps)
O3K                       64        1.40
                         128        .697 
                         256        .332     1419steps, diags & output:.5124
        2x1x256          512        .394
500^3                    125     .55
500^3   2x1x250          500     .209


old bluemountain numbers:
                  500^3 on 125 cpus:         .860
                  512^3 on  64 cpus:         1.93    
                  512^3 on  32 cpus:        5.0    

QA:
                   64    .831
                  128    .422
                  256    .221
 (2x1x256)        512    .169
 (4x1x128)        1024   .954    ouch!


1536 benchmarks
03K              will not load.


**********************************************************************************
hi-res timings on QSC and QA, before CPU upgrade:

iso12 run, 512^3 runs with delt=.0003, t=1=1 eddy turnover time
           3333 timesteps per eddy turnover time
           assume 1024^3 needs 6666 timesteps per eddy turnover time
           = 5 days per eddy turnover time on 512 cpus.  

           2048^3 needs 2*6666 timesteps = 40 days per eddy turnover time
           on 1024 cpus.  


2048^3  1152Gb           192Gb per variable    
3072^3  3240Gb           needs at least 1620 cpus.  6x1x512?  
                          try 3x1x512  


1024^3 timeings on ASCI Q
                   5 timesteps

dns:   1x1x512    6.20min 

dnsghost:
1x1x512           3.17min 
4x1x256           3.08


REDO:

2048^3     

dns:    1x1x512     45.84 (5 timesteps)  9.17m per timestep
        1x1x1024         5 timesteps     4.35m per timestep
        2x1x1024
        4x1x1024


dnsghost 8x8x8      6.50 (1 timestep)
dnsghost 2x2x128    insufficient virtual memory
dnsghost 8x4x16     same
dnsghost 4x4x32     same

         1x1x1024   size per cpu:  (2052*2052*6)*8*3 *6 = 3.4Gb per cpu.  
         2x1x512    wont load                             2.2Gb per cpu
dhsghost 4x1x256    5 timesteps:   2.09m per timestep     1.7Gb per cpu
dhsghost 2x2x256    5 timesteps:   2.28m per timestep


1536^3 

dnsghost 1x1x512   wont load
dnsghost 8x8x8     2.70m  1 timestep
dnsghost 2x1x256   1.73m
dnsghost 2 2 128   2.00m
dnsghost 4 2 64    2.16
dnsghost 4 4 32    2.27
dnsghost 8 4 16    2.32


(4 + 1536/n1)(4 + 1536/n2)(4 + 1536/n3)


**********************************************************************************
hi-res timings on QSC, before CPU upgrade:


t=-1 (one timestep)
2048^3        11.05m      32K timesteps: 245 days.    
t=-5 (5 timesteps)
2048^3        46.83m  = 9.37m per timestep.  


t=-43  (.05)                         Nirv    BlueMtn       10eddy
one box timings:  256 on 32 cpus             > 15min       
                  256 on 64 cpus     9.11m    8.32        
                  256 on 128 cpus    7.65m    4.11m       
                  250 on 125 cpus    7.65m    3.98m       

                  400 on 5x1x25              27.03m
                  400 on 1x100               32.09m


t=-5  (5 time step)                     Nirv Bmtn   Q       8K timesteps
                  512^3 on 32 cpus:                 12.43    13.8d
                  512^3 on 64 cpus:                  5.07    5.6
                  512^3 on 128:                      2.54    2.8
                  512^3 on 256:                      1.22    1.4
                  500^3 on 500:                      0.94    1.0
                                                    
                  500^3 on 125 cpus:         4.30             4.7d
                  512^3 on  64 cpus:         9.66             10.7
                  512^3 on  32 cpus:        25.0              27

                                                            16K timesteps
                 1000^3 on 500 cpus:                 5.64    13d
                 1024^3 on 256 cpus:                  ?


t=-20 (20 time steps)                   Nirv Bmtn  Q        4K timesteps
                  256^3 on 32 cpus:                5.72m    19h
                  256^3 on 16 cpus:               10.52m    
                  256^3 on  8 cpus:               18.97m  


                  256^3 on 128 cpus:     6.89  3.97(how many timesteps???)
                                            check bluemountain output files


**********************************************************************************
LOW RES benchmarks, 6/2002
cvs tag:  'bench602'

./bench.sh 96 0 20 
121M
20 timesteps

                      1cpu    2cpu    3cpu     4cpu     6      8     
milkyway              .16     .080    .054
sulaco/lam            .21     .17               .17  
sulaco/lampi          .22    .17                .60


shankara.old  (2x2)   .28     .19               .093  .064   .050
shankara 4ring        .27     .14               .059  .062   .047
shankara-LAM(ethernet).27                                    .200
sixhitch              .29

pinkish               .25                                    .050 
pinkish-lampi                                                .037

-xW vecorization. cant use on Athlon, doesn't help that much:
sandboxx with -xW      .202
sandboxx w/o  -xW      .207   

./bench.sh 160 0 20
544M

milkyway              1.22    .42
autrey                1.40
shankara.old          1.76                      .62          .29
   running 4 cpus, 1 cpu per node:              .51  


./bench.sh 192 2  20
980M


./bench.sh 240 4  20  
1.9G
                      1cpu    2cpu    3cpu     4cpu     6      8       16     24   
shankara (16 cpus = 2x1x8)                     4.53           1.27     .737  .347


./bench.sh 256 8  20  (can run on 16 and 24 cpus?)
2.3G                 1cpu    2cpu       8       16     24    32    64                                                                      
milkyway              6.91    3.33
shankara.old                           1.61
shankara 4ring                         1.53  
shankara-LAM                           3.90
shankara                                 ?    .750      ?

CNLS cluster                                  .759          .366   .190   

QSC-lampi:                            .625
pinkish                                2.03                 .471          .113


./bench.sh 288 8  20
3.2G

./bench.sh 512  64  40
18.4    cpus:         32    64  
CNLS cluster               2.13


**********************************************************************************


**********************************************************************************
LOW RES benchmarks.  around 11/2001?
Spectral Model 1 CPU timings.
1 CPU  96^3  120M  
*****************************************************************
./bench.sh 96 0 5
run 5 timesteps.  CFL=1.5/.25, no structure functions

darkstar (linux 1Ghz)
PGF90, FFT99, auto       2.90  2.90    
PGF90, FFT99, params     2.86  2.87


PGF90,FFT99, auto            1.16  1.15
IFC,FFT99, auto              1.07  1.07
IFC,FFT99, params            1.05  1.07
LF90,FFT99, auto             1.28  1.27

autrey (linux 1.? Ghz ) 
PGF90,STK2+PGCC, auto        1.41  1.40

PGF90,STK2+GCC, auto         1.53  1.53
IFC,STK2+GCC, auto           1.45  1.45
LF90,STK2+GCC, auto          2.01  2.01


Nirvana SGIFFT                2.92  2.92  2.93  2.94
Nirvana FFT99                 2.97  2.97  2.97
Nirvana STK                

Truchas (Compaq ES40)
FFT99,params                 1.03   1.03
CPQ,params                   1.00   1.00

STK2,auto                    1.07   1.07
FFT99,auto                   1.03   1.03
CPQ,auto                     0.99   0.99


*****************************************************************