Wednesday, May 7, 2008

Counter Intuitive Performance Tuning

Big O – Theory and Practice

In my consulting career, I keep coming across the same problem over and over again: systems suffering from unacceptable performance. Over the years, and with a lot of maturity, I have learned that the people who design these systems are not necessarily bad programmers. I have seen some really sharp people make inexplicable mistakes, especially when they believe they are doing the right thing. In some cases this confidence is the result of vendor sales pitches along the lines of "buy our widget and it does all the work for you"; in others, insufficient theoretical and practical understanding is the main driver. In some interesting cases, a good theoretical underpinning coupled with a lack of exposure to large-scale systems is the culprit.

Enough chitchat; let's look at a simple problem. We are given two lists of 32-bit integers and asked to create a third list containing the intersection of the two, i.e. the integers that exist in both lists. It appears to be a fairly straightforward problem: loop over the first array, look for each element in the second array in a nested for loop, and populate a third list whenever a match for the outer element is found in the inner loop. Even though this approach is valid and correct, it takes n1 * n2 iterations to complete. This type of run-time characteristic is referred to as O(n²) and is usually not acceptable in production systems unless there is no other solution.
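For reference, here is a minimal sketch of that naive nested-loop version (my illustration, not code from the original discussion; the helper name intersect_naive is made up):

#include <iostream>
#include <vector>

// Naive O(n1 * n2) intersection: for every element of a,
// scan all of b looking for a match.
std::vector<int> intersect_naive(const std::vector<int>& a,
                                 const std::vector<int>& b)
{
    std::vector<int> result;
    for (int x : a)
        for (int y : b)
            if (x == y)
            {
                result.push_back(x);
                break; // found a match, move on to the next element of a
            }
    return result;
}

int main()
{
    std::vector<int> L1 = {1,2,3,4,5,6,7,8,9,10};
    std::vector<int> L2 = {2,4,6,8,10,12,14,16,18,20};
    for (int v : intersect_naive(L1, L2))
        std::cout << v << " "; // prints 2 4 6 8 10
    std::cout << std::endl;
    return 0;
}

But is there another solution to our problem? What if our input arrays were handed to us pre-sorted?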

L1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
L2: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20

Does this give us any new opportunities for improvement? We could traverse the first list and perform a binary search on the second for each element, adding successfully matched elements to the third list. Clearly this is a significant performance improvement over the naïve implementation, with a complexity of O(n log(n)), since binary search is a log(n) operation. So far so good; are there ways to improve this approach further? Let's consider the case where our arrays are unequal in length:

L1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
L2: 2, 4, 6, 8, 10

We could select L1 to traverse in the outer loop and perform the binary search on L2, simply because of the order in which we received our inputs. Alternatively, we can traverse L2 and perform the binary search over L1. Theoretically there is no difference; in real life, however, 5*log(10) is always better than 10*log(5). The idea here is that we should sample our inputs and evaluate their statistical and numerical distributions as they are fed to the algorithms we wish to tune. In the above case, a simple flip of a pointer can save us millions of iterations on each call, depending on the difference in size between our input lists.
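A minimal sketch of that idea (again my illustration; the name intersect_search is made up): traverse whichever list is smaller and binary search the larger one.

#include <algorithm>
#include <iostream>
#include <vector>

// Traverse the smaller sorted list, binary search the larger one:
// roughly min(n1,n2) * log(max(n1,n2)) comparisons.
std::vector<int> intersect_search(const std::vector<int>& a,
                                  const std::vector<int>& b)
{
    const std::vector<int>& outer = (a.size() <= b.size()) ? a : b;
    const std::vector<int>& inner = (a.size() <= b.size()) ? b : a;
    std::vector<int> result;
    for (int x : outer)
        if (std::binary_search(inner.begin(), inner.end(), x))
            result.push_back(x);
    return result;
}

int main()
{
    std::vector<int> L1 = {1,2,3,4,5,6,7,8,9,10};
    std::vector<int> L2 = {2,4,6,8,10};
    for (int v : intersect_search(L1, L2)) // traverses L2, searches L1
        std::cout << v << " "; // prints 2 4 6 8 10
    std::cout << std::endl;
    return 0;
}

The "flip of a pointer" is the outer/inner reference pair: which list plays which role is decided by size, not by argument order.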

Performance Tuning Gone Wild

One of the most important mantras of successful performance tuning professionals (and gods of computer science, including Don Knuth) is that correctness comes before performance. We can live for a long time with a system that performs correctly even if it is a bit slow, but what good is a fast system if it does not function correctly? In this section we will look at some tuning strategies and consider their impact on the overall solution dispassionately. Let's say we are not satisfied with the O(n log(n)) solution for our intersection problem. Can we rely on the sorted nature of our inputs to improve performance and bring our solution's complexity down to O(n)? Let's take a look:

[Trace from the original post: i steps over L1 (1 through 10), j steps over L2 (2 through 20), and L3 accumulates the matches 2, 4, 6, 8, 10 as the window slides forward.]

#include <iostream>

int main()
{
    int i, j, k, n = 10, m = 10;
    int L1[] = {1,2,3,4,5,6,7,8,9,10};
    int L2[] = {2,4,6,8,10,12,14,16,18,20};
    int L3[] = {0,0,0,0,0,0,0,0,0,0};
    for (j = 0, i = 0, k = 0; i < n && j < m; )
    {
        if (L1[i] == L2[j])
        {
            L3[k++] = L1[i];
            ++j;
            ++i;
            continue;
        }
        // slide the window: advance whichever index points at the smaller value
        while (i < n && j < m && L2[j] > L1[i])
            ++i;
        while (i < n && j < m && L2[j] < L1[i])
            ++j;
    }
    for (i = 0; i < k; ++i)
        std::cout << L3[i] << " ";
    std::cout << std::endl;
    return 0;
}

Notice in the trace how we avoided unnecessary iterations by using a sliding window approach. The point of this exercise, unfortunately, wasn't performance tuning (if that wasn't already clear from the heading). The goal was to show that the theory and practice of programming work together to ensure code quality, with intuition, common sense, and logic a very close second. The question you should ask yourself is whether O(n) is always better than O(n log(n)). Are there ever circumstances where O(n log(n)) beats O(n)? Theory says no, but what does your instinct say? What if the lists are significantly different in size? What if L1 still has 10 elements but L2 has 1,000,000? For instant enlightenment, consider the value of 10 * log(1000000), roughly 200 iterations with a base-2 logarithm, and compare it with the 1,000,000 iterations of our highly tuned O(n) solution. If there is one thing you should take from this discussion, it is that there is no single silver-bullet solution to all your performance problems, and that your users will always come up with unexpected inputs to crush every one of your performance assumptions in short order.

To Sort or Not to Sort – That is not the question

To keep things simple in the previous section, we assumed our inputs were pre-sorted. That assumption led us to some obvious optimization opportunities, but what about the case where we can't count on our inputs being pre-sorted? Should we spend time sorting them, or simply fall back to the naive implementation? In theory the naive approach is always O(n²), while sorting is O(n log(n)) depending on the sorting algorithm chosen. Sorting the inputs, then, almost always appears to be a worthwhile pursuit, but are there further insights we can exploit to improve our odds? Funny you should ask. If we think about it, do we really need to sort both lists? Since we always traverse the smaller array linearly, we only need to sort the one we perform the binary search on, so the smaller input never has to be sorted at all. That is good, but let us recall why we switched to the binary search approach in the first place: it was empirically better for sorted inputs with significantly different sizes. What happens if the lists are roughly the same size? Would we be justified in sorting both inputs to take advantage of our O(n) solution? The answer is clearly "no", not if we have to spend O(n log(n)) in prep.
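Here is a quick sketch of the sort-one-side idea (my illustration; the name intersect_unsorted is made up, and the inputs are copied by value so the caller's lists stay untouched):

#include <algorithm>
#include <iostream>
#include <vector>

// Neither input is pre-sorted. Sort only the larger list, then walk
// the smaller one linearly and binary search the sorted list.
std::vector<int> intersect_unsorted(std::vector<int> small, std::vector<int> big)
{
    if (small.size() > big.size())
        small.swap(big);               // make sure "big" really is the larger one
    std::sort(big.begin(), big.end()); // the only O(n log(n)) prep we pay for
    std::vector<int> result;
    for (int x : small)
        if (std::binary_search(big.begin(), big.end(), x))
            result.push_back(x);
    return result;
}

int main()
{
    std::vector<int> L1 = {10, 3, 7, 1, 9, 5, 8, 2, 6, 4};
    std::vector<int> L2 = {8, 2, 10, 6, 4};
    for (int v : intersect_unsorted(L1, L2))
        std::cout << v << " "; // prints 8 2 10 6 4 (order follows the smaller list)
    std::cout << std::endl;
    return 0;
}

Still, is there a better way to calculate the intersection of two unsorted input arrays of roughly the same size? To answer this question, we need to get a feel for our input data and establish its numerical characteristics. The key questions are: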

What are the lower and upper bounds for these inputs? [unsigned 32-bit integers: 0 – 4,294,967,295]

How small or large can our input lists possibly be? [up to 4,294,967,296 unique integers]

What kind of machine are we running this task on? [32-bit machine – enough free memory]

Looking at our questions and their answers, we recall that our input vectors consist exclusively of 32-bit integers. Here is a sample input with roughly 4 billion elements in each list:

L1: 43, 9757, 671378, ..., 978740, 11111112, 4294967296
L2: 4294967296, 6963295, 2254967194, ..., 346, 43, 41

To be fair, on most modern computers performing this operation is hardly a gargantuan undertaking. Consider the sort and binary search approach for creating the intersection vector. The totals are: sort [n * log(n)] + search [n * log(n)] = 95,265,423,098 + 95,265,423,098, or approximately 190 billion iterations on average. That takes no more than a few seconds on most modern hardware. As part of a long-running job, or in a system that needs this calculation only occasionally, such performance is clearly acceptable. But what if calculating this intersection is part of a critical application in scientific research, financial transaction processing, or a similar performance-oriented domain, and sits in the inner loop of a critical piece of processing? Suddenly a few seconds of response time is no longer acceptable and becomes a potential liability.

Not This! Not Me!

Depending on system requirements, input constraints, and the characteristics of the data distribution for a given problem, we are sometimes able to improve performance at the cost of space, or vice versa. This reciprocal relationship is commonly referred to as the time-space tradeoff and comes in handy as a balancing act in times of need. In the following sections we will take a closer look at our intersection problem with the intent of devising a better solution and evaluating its tradeoffs.

Before we delve further into the intersection problem, let's take a moment to explore the relationship between time and space using embedded systems, where both come at a premium. In real-time embedded systems performance has always been critical, but strict power requirements and significant cost constraints have usually left embedded processors with fairly limited storage and processing capabilities (things have changed considerably in recent years, though some of the same limitations still apply). For our example, we will use a hypothetical 8-bit processor whose SUB and NOT op-codes require 12 CPU cycles to complete, versus a 2-cycle memory lookup op-code. We are asked to write a C routine that performs the 8-bit NOT operation without using the built-in NOT operator (since the compiler blindly translates the C NOT operator into the inefficient NOT op-code). Here is our approach:

unsigned char not_here(unsigned char not_me)
{
    /* 256-entry lookup table where not_this[i] == (unsigned char)~i;
       the middle entries are elided here as in the original */
    static const unsigned char not_this[256] = { 255, 254, 253, ..., 0 };
    return not_this[not_me]; /* one 2-cycle lookup instead of a 12-cycle NOT */
}
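Writing out all 256 literals is tedious; a variant I'd sketch (not from the original post) builds the same table once at startup:

#include <iostream>

// One flat table, one bit pattern per possible 8-bit input.
static unsigned char not_this[256];

static void init_not_table()
{
    for (int i = 0; i < 256; ++i)
        not_this[i] = static_cast<unsigned char>(255 - i); // same bits as ~i for 8 bits
}

unsigned char not_here(unsigned char not_me)
{
    return not_this[not_me];
}

int main()
{
    init_not_table();
    std::cout << static_cast<int>(not_here(0x0F)) << std::endl; // prints 240
    return 0;
}

On the hypothetical processor above the initialization loop itself uses the expensive op-codes, of course, but it runs once rather than on every call.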

Using Data as Index

Let's apply the principles from our embedded systems example to the performance problems in our intersection algorithm and use some of the input constraints to our advantage. Regardless of the size of the input arrays or the duplication of integers within them, our target array is never going to hold more than 4 billion integers, since we are only required to produce the list of 32-bit integers that exist in both inputs. With this knowledge, we start by allocating an empty array of 4 billion entries initialized to 0. We traverse the first input array and use each element we encounter as an index at which to store a "1" in the empty array. We repeat the same process with the second input array and its own flag array, then populate a third array with the result of an AND operation between the two generated arrays. We can either use this resultant array directly as the output of the intersection algorithm, or go one step further and populate L3 with the actual values that intersect L1 and L2. Here is a sample implementation, assuming there are only 40 integers in the world, ranging from 0 to 39:

[Table from the original post: columns L1, L1b, L2, L2b, L3b, and L3 show the raw input values, the two flag arrays built from them, the AND of the flag arrays, and the resulting intersection 0, 2, 4, 6, 8, 10, 12, 14, 16, 18.]

#include <iostream>

int main()
{
    /* initializers elided with "..." as in the original */
    int L1[40] = {19,4,7,3,18,13,8,...,14,17,16};
    int L2[40] = {10,38,12,24,18,4,8,...,30,36,34,0};
    static int L1b[40], L2b[40], L3[40];
    int i, j;
    // pass 1: use each input value as an index into its flag array
    for (i = 0; i < 40; ++i)
    {
        L1b[L1[i]] = 1;
        L2b[L2[i]] = 1;
    }
    // pass 2: a value is in the intersection when both flags are set
    for (i = 0, j = 0; i < 40; ++i)
        if (L1b[i] == 1 && L2b[i] == 1)
            L3[j++] = i;
    for (i = 0; i < j; ++i)
        std::cout << L3[i] << " ";
    std::cout << std::endl;
    return 0;
}

Itty Bitty Bits and Bytes

On the surface it looks like we won our battle with the intersection problem, computing it in merely three passes and achieving a straight O(n) solution. Of the three passes, the first and second have "n" equal to the sizes of L1 and L2 respectively, while the third is a fixed cost of roughly 4 billion iterations for 32-bit integers. In the worst case we are looking at about 12 billion iterations, which compares very favorably to the roughly 190 billion average iterations of the previous O(n log(n)) solution.

There is, however, a minor problem with our current solution. Remember when we mentioned time and space tradeoffs? This is a classic example of one. We have reduced our time considerably but, in the process, thrown our space efficiency out the window. In its current form the algorithm requires over 32 billion bytes of memory or storage to calculate the intersection efficiently. To the keen observer it is probably clear where this is going: we don't need an entire 32-bit integer to store the flag values in L1b or L2b. In fact we can get away with just 1 bit per flag, saving 31 bits per entry and reducing the storage needed to roughly 1 billion bytes, which is not as tall an order as 32 gigs. Note that this scratch memory does not necessarily need to be main memory; it can easily live on secondary storage such as flash memory or hard disks to keep memory costs down while still delivering significantly better performance.
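A minimal sketch of the bit-packed variant (my illustration), shown at the 40-integer scale of the earlier example; std::bitset does the 1-bit-per-value bookkeeping and the AND for us:

#include <bitset>
#include <iostream>

int main()
{
    const int L1[] = {19, 4, 7, 3, 18, 13, 8};
    const int L2[] = {10, 38, 12, 24, 18, 4, 8};

    // one bit per possible value instead of one 32-bit int per value
    std::bitset<40> L1b, L2b;
    for (int v : L1) L1b.set(v);
    for (int v : L2) L2b.set(v);

    std::bitset<40> L3b = L1b & L2b; // the intersection as a bit mask
    for (int i = 0; i < 40; ++i)
        if (L3b.test(i))
            std::cout << i << " "; // prints 4 8 18
    std::cout << std::endl;
    return 0;
}

At full 32-bit scale the same idea needs two 2^32-bit sets of 512 MB each, which is where the 1-billion-byte figure above comes from.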

Out of Time and Out of Cache

If we look closely, the intersection solution we came up with is essentially the same lookup technique we used to replace the NOT op-code, applied under a finite cap on input size and range and with a limited amount of memory to go with it. In concluding this section, we leave the reader to ponder the implications our intersection solution has for cache locality on modern CPUs, and whether there are potential workarounds.

Next Episode – Data Structures and the Shocking Truth about Linked Lists

Just as our brains manipulate our bodies while relying on them for structural support and motor functions, algorithms rely on and manipulate data structures to achieve their objectives. Think of algorithms as the soul of your solutions and data structures as their skeleton and body. In the next episode, we will discuss the time and space trade-offs associated with different data structures and the fundamental role they play in the overall performance picture, with a special focus on that orphan child otherwise known as the linked list.