Performance Options
There are a number of options for the Metal C compiler that can affect the performance of the generated code. Some of these options are:
- Unroll - Controls loop unrolling.
- HOT - Performs high-order loop analysis and transformations.
- Optimize - Specifies whether to optimize code during compilation.
- Architecture - Specifies the machine architecture for which the executable program instructions are to be generated.
- Tune - Tunes instruction selection, scheduling, and other implementation-dependent performance enhancements for a specific implementation of a hardware architecture.
- Inline - Attempts to inline functions instead of generating calls to those functions, for improved performance.
These are all documented in the z/OS XL C/C++ User's Guide. The effect each option has on performance can generally be determined only empirically.
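Unroll and Inline are easiest to picture at the source level. The sketch below applies roughly those two transformations by hand to a version of the RC4 initialization routine discussed next. `rc4_init_rolled` is the routine as written. `rc4_init_transformed` unrolls the identity-permutation loop by four and replaces each `swap_bytes()` call with the exchange itself. Both function names are hypothetical. Both functions also take a `const` key and omit the `keylen = 24;` line from the listing below. The compiler performs such transformations on its internal representation and chooses its own unroll factor, so this illustrates the idea rather than the code any particular option setting generates.

```c
/* Same layout as rc4_state in the listing below, redeclared so this
   sketch compiles on its own. */
struct rc4_state
{
    unsigned char perm[256];
    unsigned char index1;
    unsigned char index2;
};

static void swap_bytes(unsigned char *a, unsigned char *b)
{
    unsigned char temp = *a;
    *a = *b;
    *b = temp;
}

/* Before: straightforward loops, one swap_bytes() call per byte. */
void rc4_init_rolled(struct rc4_state *state,
                     const unsigned char *key, int keylen)
{
    unsigned char j = 0;
    int i;
    for (i = 0; i < 256; i++)
        state->perm[i] = (unsigned char) i;
    state->index1 = 0;
    state->index2 = 0;
    for (i = 0; i < 256; i++)
    {
        j += state->perm[i] + key[i % keylen];
        swap_bytes(&state->perm[i], &state->perm[j]);
    }
}

/* After: the same work, hand-transformed (hypothetical illustration).
   - Identity loop unrolled by 4: the loop test and increment run 64
     times instead of 256. 256 is a multiple of 4, so no remainder
     loop is needed.
   - swap_bytes() inlined: the call is replaced by the exchange itself. */
void rc4_init_transformed(struct rc4_state *state,
                          const unsigned char *key, int keylen)
{
    unsigned char j = 0;
    unsigned char temp;
    int i;
    for (i = 0; i < 256; i += 4)
    {
        state->perm[i]     = (unsigned char) i;
        state->perm[i + 1] = (unsigned char) (i + 1);
        state->perm[i + 2] = (unsigned char) (i + 2);
        state->perm[i + 3] = (unsigned char) (i + 3);
    }
    state->index1 = 0;
    state->index2 = 0;
    for (i = 0; i < 256; i++)
    {
        j += state->perm[i] + key[i % keylen];
        temp = state->perm[i];
        state->perm[i] = state->perm[j];
        state->perm[j] = temp;
    }
}
```

Both versions leave `perm` in the same state for any key. The transformed version simply executes fewer loop-control instructions and no calls, which is the kind of saving these options look for.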
As an example of evaluating selected options empirically, I used the initialization routine from an implementation of the RC4 encryption algorithm (named after Ron Rivest of RSA Security). The RC4 state buffer initialization code in C is shown below.
```c
/*
rc4.c
Copyright (c) 1996-2000 Whistle Communications, Inc.
All rights reserved.
$FreeBSD: src/sys/crypto/rc4/rc4.c,v 1.2.2.1 2000/04/18 04:48:31 archie Exp $
*/
struct rc4_state
{
    unsigned char perm[256];
    unsigned char index1;
    unsigned char index2;
};

static void swap_bytes(unsigned char *a, unsigned char *b)
{
    unsigned char temp;

    temp = *a;
    *a = *b;
    *b = temp;
}

/*
Initialize an RC4 state buffer using the supplied key,
which can have arbitrary length.
*/
void rc4_init(struct rc4_state *state, unsigned char *key, int keylen)
{
    unsigned char j;
    int i;

    /* Initialize state with identity permutation */
    for (i = 0; i < 256; i++)
    {
        state->perm[i] = (unsigned char) i;
    }
    state->index1 = 0;
    state->index2 = 0;

    /* Overrides the caller's keylen with a constant 24-byte key length
       (apparently the "fixed key" case in the results table); delete
       this line to use the caller's keylen instead. */
    keylen = 24;

    /* Randomize the permutation using key data */
    for (j = i = 0; i < 256; i++)
    {
        j += state->perm[i] + key[i % keylen];
        swap_bytes(&state->perm[i], &state->perm[j]);
    }
}
```
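To reproduce a measurement like the one reported below, the routine can be called in a loop and the elapsed CPU time divided by the number of calls. The following is a minimal sketch of such a harness. It is not the harness used for the results, which is not shown. It assumes a hosted C environment that provides the standard `clock()` function, so a Metal C program would likely need a different CPU-time source. The names `time_rc4_init` and `ITERATIONS` are hypothetical. The copy of `rc4_init` here uses the caller's `keylen` (the variable-key case) and a `const` key pointer.

```c
#include <time.h>

/* Number of calls per measurement; the tests below used 10 million. */
#define ITERATIONS 10000000L

struct rc4_state
{
    unsigned char perm[256];
    unsigned char index1;
    unsigned char index2;
};

static void swap_bytes(unsigned char *a, unsigned char *b)
{
    unsigned char temp = *a;
    *a = *b;
    *b = temp;
}

/* Variable-key-length version of the routine under test. */
void rc4_init(struct rc4_state *state, const unsigned char *key, int keylen)
{
    unsigned char j = 0;
    int i;
    for (i = 0; i < 256; i++)
        state->perm[i] = (unsigned char) i;
    state->index1 = 0;
    state->index2 = 0;
    for (i = 0; i < 256; i++)
    {
        j += state->perm[i] + key[i % keylen];
        swap_bytes(&state->perm[i], &state->perm[j]);
    }
}

/* Calls rc4_init() `iterations` times and returns the average CPU time
   per call in seconds, as reported by clock(). Returns -1.0 if clock()
   indicates that processor time is unavailable. */
double time_rc4_init(long iterations, const unsigned char *key, int keylen)
{
    struct rc4_state state;
    volatile unsigned char sink = 0; /* discourages the optimizer from
                                        discarding the calls as dead code */
    clock_t start, end;
    long n;

    start = clock();
    for (n = 0; n < iterations; n++)
    {
        rc4_init(&state, key, keylen);
        sink ^= state.perm[0];
    }
    end = clock();
    if (start == (clock_t) -1 || end == (clock_t) -1)
        return -1.0;
    return (double) (end - start) / CLOCKS_PER_SEC / (double) iterations;
}
```

Multiplying `time_rc4_init(ITERATIONS, key, 24)` by 1e6 gives a figure comparable to the μs/iteration column. Rebuilding the same source once per option combination and comparing the results mirrors the comparison in the table.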
I ran the initialization routine in a loop 10 million times to obtain the average CPU time per iteration. The results are shown in the table below. Each percentage decrease is relative to test 1 (Unroll and HOT both off, variable key). The HOT option alone reduced CPU time by about two-thirds (66.70%), and combining HOT with Unroll(Yes) yielded a 73% reduction. With the key fixed (test 6), CPU time fell by 57.85% even with neither option. The reductions in CPU time aren't always as dramatic as these, but they are worth determining by experimentation.
| Test No. | Unroll | HOT | Key (F = Fixed, V = Variable) | Total CPU (s) | CPU (s) / Iteration | CPU (μs) / Iteration | % Decrease vs. Test 1 |
|---|---|---|---|---|---|---|---|
| 1 | No | No | V | 528.41 | 0.000052841 | 52.84 | (baseline) |
| 2 | No | Yes | V | 175.97 | 0.000017597 | 17.60 | 66.70% |
| 3 | Yes | No | V | 443.70 | 0.000044370 | 44.37 | 16.03% |
| 4 | Yes | Yes | V | 142.14 | 0.000014214 | 14.21 | 73.10% |
| 5 | Yes | Yes | F | 136.18 | 0.000013618 | 13.62 | 74.23% |
| 6 | No | No | F | 222.70 | 0.000022270 | 22.27 | 57.85% |
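The derived columns follow from the total CPU time, the 10 million iterations, and test 1 as the baseline $t_1$. For test 4, for example:

$$
t_4 = \frac{T_{\text{total}}}{10^{7}} = \frac{142.14\ \text{s}}{10^{7}} = 1.4214 \times 10^{-5}\ \text{s} \approx 14.21\ \mu\text{s}
$$

$$
\text{decrease}_4 = \frac{t_1 - t_4}{t_1} \times 100\% = \frac{52.841 - 14.214}{52.841} \times 100\% \approx 73.10\%
$$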
References
- z/OS XL C/C++ User's Guide
- z/OS Metal C Programming Guide and Reference
- z/OS XL C/C++ Language Reference
All references are copyright © IBM Corporation.