Capabilities and Limitations of Optimizing Compilers

Limitations Due to Safety

Invoking gcc with option -O1 or higher (e.g., -O2 or -O3) causes it to apply more extensive optimizations. These can further improve program performance, but they may expand the program size and make the program more difficult to debug with standard debugging tools.
Compilers must apply only safe optimizations, ones that cannot change the program's behavior for any input, and therefore must forgo many optimizations that would be valid only under additional assumptions.

Memory Aliasing

example:

void add1(long *a, long *b) {
    *a += *b;
    *a += *b;
}

void add2(long *a, long *b) {
    *a += 2 * *b;
}

If *a and *b point to the same memory location (i.e., the pointers alias), the two functions produce different results, so the compiler cannot optimize add1 into add2. For example, if both point to a value of 10, add1 yields 40 (10 + 10 = 20, then 20 + 20 = 40), while add2 yields 10 + 2 * 10 = 30.
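A minimal sketch demonstrating the divergence when both arguments alias the same variable (the starting value 10 is an arbitrary choice for illustration):

#include <stdio.h>

void add1(long *a, long *b) { *a += *b; *a += *b; }
void add2(long *a, long *b) { *a += 2 * *b; }

int main(void) {
    long x = 10, y = 10;
    add1(&x, &x);   /* x: 10 -> 20 -> 40 */
    add2(&y, &y);   /* y: 10 -> 30 */
    printf("add1: %ld, add2: %ld\n", x, y);   /* prints add1: 40, add2: 30 */
    return 0;
}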

Function Calls

example:

long f();

long func1() {
    return f() + f() + f() + f();
}

long func2() {
    return 4 * f();
}

It seems natural to optimize func1 into func2, since func2 calls f only once instead of four times. However, if f modifies global state, the two versions can return different values. For example:

long ans = 0;

long f() {
    return ans++;
}

With this definition, func1 returns 0 + 1 + 2 + 3 = 6, whereas func2 returns 4 * 0 = 0. Because a call might have side effects like this, gcc treats function calls conservatively: by default it leaves them in place rather than attempting such transformations.

Expressing Program Performance

We introduce the metric cycles per element, abbreviated CPE, to express program performance in a way that can guide us in improving the code. The CPE is the coefficient of n when the running time (in clock cycles) is modeled as a linear function of the number of elements processed.
example:

/* Compute prefix sum of vector a */
void psum1(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n; i++)
        p[i] = p[i - 1] + a[i];
}

void psum2(float a[], float p[], long n)
{
    long i;
    p[0] = a[0];
    for (i = 1; i < n - 1; i += 2) {
        float mid_val = p[i - 1] + a[i];
        p[i] = mid_val;
        p[i + 1] = mid_val + a[i + 1];
    }
    /* For even n, finish remaining element */
    if (i < n)
        p[i] = p[i - 1] + a[i];
}

Using a least squares fit, we find that the run times (in clock cycles) for psum1 and psum2 can be approximated by the equations 368 + 9.0n and 368 + 6.0n, respectively. The constant 368 is the overhead of the timing code and of starting the procedure; the coefficients 9.0 and 6.0 are the CPEs of the two versions, so psum2, which computes two elements per iteration, is the faster one.
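As a rough idea of how such numbers could be gathered, here is a minimal sketch (assuming an x86-64 machine with gcc or clang, and reusing psum1 from above) that estimates cycles with the rdtsc time-stamp counter. A real measurement would need warm-up runs, repeated trials, and a least squares fit over several values of n, none of which is shown here.

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc (gcc/clang, x86 only) */

void psum1(float a[], float p[], long n);   /* as defined above */

int main(void) {
    enum { N = 1000000 };
    static float a[N], p[N];
    for (long i = 0; i < N; i++)
        a[i] = 1.0f;

    unsigned long long start = __rdtsc();
    psum1(a, p, N);
    unsigned long long cycles = __rdtsc() - start;

    /* Crude estimate: total cycles divided by element count */
    printf("total cycles: %llu, CPE estimate: %.2f\n",
           cycles, (double)cycles / N);
    return 0;
}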

Some Ways to Optimize a Program

#define IDENT 1   /* identity element for a product; for a sum, use IDENT 0 and OP + */
#define OP *
/* Implementation with maximum use of data abstraction */
void combine1(vec_ptr v, data_t *dest)
{
    long i;

    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
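The combine functions rely on the book's vector abstract data type; the following is a minimal sketch of the assumed declarations (data_t could equally be int, long, or double):

typedef float data_t;

/* Vector: length plus pointer to the underlying array */
typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

long vec_length(vec_ptr v);                                /* returns v->len */
int get_vec_element(vec_ptr v, long index, data_t *dest);  /* bounds-checked read */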

Eliminating Loop Inefficiencies

In combine1, the call to vec_length in the test condition must be evaluated on every iteration of the loop. On the other hand, the length of the vector does not change as the loop proceeds. We can therefore compute the vector length once and use that value in the test condition.
Optimized code:

/* Move call to vec_length out of loop */
void combine2(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}

This optimization is an instance of a general class of optimizations known as code motion: identifying a computation that is performed multiple times but always yields the same result, and moving it so that it is computed only once, for example before the loop.
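Another classic case, sketched here in the style of the book's lower1/lower2 example rather than taken from the code above: calling strlen in a loop's test condition turns lower-casing a string into a quadratic-time operation, because strlen itself scans the whole string on every iteration.

#include <string.h>

/* O(n^2): strlen is re-evaluated on every iteration */
void lower1(char *s) {
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* O(n): the call to strlen is moved out of the loop */
void lower2(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}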

Reducing Procedure Calls

As we have seen, procedure calls can incur overhead and also block most forms of program optimization.
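One way to remove the call inside the loop here is to access the vector's storage directly, bypassing get_vec_element and its per-element bounds check. A sketch in the style of the book's combine3, using the get_vec_start helper that also appears in combine4 below:

data_t *get_vec_start(vec_ptr v);   /* returns pointer to v's underlying array */

/* Direct access to vector data */
void combine3(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}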

Eliminating Unneeded Memory References

Rather than reading and writing *dest in memory on every iteration, the program can accumulate the result in a local variable, which the compiler can keep in a register, and store it to *dest once at the end. (A compiler cannot safely make this transformation itself: if dest pointed into the vector's own data, the original and transformed code would behave differently, another instance of memory aliasing.)
example:

/* Accumulate result in local variable */
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

Understanding Modern Processors

As we seek to push the optimizations further, we must understand the microarchitecture of the processor, that is, how it actually executes instructions, so that we can exploit features such as instruction-level parallelism.
TODO