Performance Options

There are a number of options for the Metal C compiler that can affect the performance of the generated code. Some of these options are:

Unroll - Controls loop unrolling.
HOT - Performs high-order loop analysis and transformations (HOT).
Optimize - Specifies whether to optimize code during compilation.
Architecture - Specifies the machine architecture for which the executable program instructions are to be generated.
Tune - Tunes instruction selection, scheduling, and other implementation-dependent performance enhancements for a specific implementation of a hardware architecture.
Inline - Attempts to inline functions instead of generating calls to those functions, for improved performance.

These are all documented in the z/OS XL C/C++ User's Guide. The effect that each option has on performance can generally be determined only empirically.

As an example of empirical analysis of selected options, I used the initialization routine from an implementation of the RC4 encryption algorithm (named after its designer, Ron Rivest of RSA Security). The RC4 state buffer initialization code in C is shown below.

 
/*
rc4.c

Copyright (c) 1996-2000 Whistle Communications, Inc.
All rights reserved.
$FreeBSD: src/sys/crypto/rc4/rc4.c,v 1.2.2.1 2000/04/18 04:48:31 archie Exp $
*/
struct rc4_state
{
    unsigned char  perm[256];
    unsigned char  index1;
    unsigned char  index2;
};

static void swap_bytes(unsigned char *a, unsigned char *b)
{
    unsigned char  temp;
    temp = *a;
    *a = *b;
    *b = temp;
}

/*
 Initialize an RC4 state buffer using the supplied key,
 which can have arbitrary length.
 */
void rc4_init(struct rc4_state *state, unsigned char *key, int keylen)
{                                                                     
    unsigned char j;                                                  
    int i;                                                            
                                                                      
    /* Initialize state with identity permutation */                  
    for (i = 0; i < 256; i++)                                         
    {                                                                 
        state->perm[i] = (unsigned char) i;                           
    }                                                                 
    state->index1 = 0;                                                
    state->index2 = 0;                                                
                                                                      
    keylen = 24;   /* overrides the caller-supplied key length with a constant */
    /* Randomize the permutation using key data */                    
    for (j = i = 0; i < 256; i++)                                     
    {                                                                 
        j += state->perm[i] + key[i % keylen];                        
        swap_bytes(&state->perm[i], &state->perm[j]);                 
    }                                                                 
}

I ran the initialization routine in a loop 10 million times to obtain an average CPU time per iteration. The results are shown in the table below. Interestingly, the HOT option alone reduced CPU time by about two-thirds (66.70%). Combining HOT with Unroll(Yes) yielded a 73.10% reduction in CPU time. The reductions in CPU time aren't always as dramatic as these, but they are worth determining by experimentation.
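The measurement approach described above can be sketched as follows. This is a minimal illustration, not the harness used to produce the results in this article: it assumes a hosted C environment in which the standard clock() function is available, which may not be the case for a Metal C program, and the names average_cpu_seconds and sample_workload are hypothetical. The stand-in workload would be replaced by a wrapper that calls rc4_init.

```c
#include <time.h>

/* Hypothetical stand-in for the routine under test. In the experiment
   described in the text, this would be a wrapper that calls rc4_init(). */
static volatile unsigned long sink;

static void sample_workload(void)
{
    unsigned long sum = 0;
    int i;

    for (i = 0; i < 256; i++)
    {
        sum += (unsigned long) i;
    }
    sink = sum;   /* volatile store gives each call an observable effect */
}

/* Run fn() 'iterations' times and return the average CPU seconds per call.
   clock() reports processor time used by the program, not elapsed time. */
static double average_cpu_seconds(void (*fn)(void), long iterations)
{
    clock_t start, end;
    long n;

    start = clock();
    for (n = 0; n < iterations; n++)
    {
        fn();
    }
    end = clock();

    return ((double) (end - start) / CLOCKS_PER_SEC) / (double) iterations;
}
```

Calling average_cpu_seconds(sample_workload, 10000000L) and multiplying the result by 1.0e6 gives microseconds per iteration, the unit used in the results table.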

Optimization Options and Their Performance Effect
Test No.	Unroll	HOT	Key (Fixed / Variable)	Total CPU (s)	CPU s / Iteration	μs / Iteration	% Decrease
1	No	No	V	528.41	0.000052841	52.84	(baseline)
2	No	Yes	V	175.97	0.000017597	17.60	66.70%
3	Yes	No	V	443.70	0.000044370	44.37	16.03%
4	Yes	Yes	V	142.14	0.000014214	14.21	73.10%
5	Yes	Yes	F	136.18	0.000013618	13.62	74.23%
6	No	No	F	222.70	0.000022270	22.27	57.85%
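The derived columns in the table follow from the total CPU time, the 10 million iterations, and the Test 1 total as the baseline. The sketch below reproduces that arithmetic for Test 4; the function names are hypothetical and are used only for illustration.

```c
#include <stdio.h>

/* Average CPU seconds per iteration from a total over 'iterations' runs. */
static double seconds_per_iteration(double total_cpu_seconds, double iterations)
{
    return total_cpu_seconds / iterations;
}

/* Percentage reduction of 'measured' relative to 'baseline'. */
static double percent_decrease(double baseline, double measured)
{
    return 100.0 * (baseline - measured) / baseline;
}

/* Print the derived columns for Test 4 against the Test 1 baseline. */
static void print_test4_summary(void)
{
    const double iterations = 1.0e7;   /* 10 million, as stated in the text */
    const double baseline   = 528.41;  /* Test 1 total CPU seconds */
    const double test4      = 142.14;  /* Test 4 total CPU seconds */

    printf("us / iteration: %.2f\n",
           seconds_per_iteration(test4, iterations) * 1.0e6);
    printf("%% decrease:     %.2f\n", percent_decrease(baseline, test4));
}
```

For Test 4 this gives 14.21 μs per iteration and a 73.10% decrease, matching the table row.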

References

z/OS XL C/C++ User's Guide
z/OS Metal C Programming Guide and Reference
z/OS XL C/C++ Language Reference