
AVL Array



Overview

Page contents

Introduction
The complexity gap
Filling the complexity gap

Introduction

The simplest way to organize a collection of data in memory is storing the elements in contiguous memory. If they all have the same size, it's easy to reach the n'th element: just multiply n by the size of an element and add the result to the memory address where the sequence begins. The resulting address is where the n'th element resides. This is called random access. Any desired element can be reached in a short, constant amount of time.
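
In code, a random access is nothing more than this (an illustrative sketch):

  #include <cstddef>

  // Random access into a contiguous sequence: compute the address and
  // read. One multiplication and one addition, whatever the value of n.
  double nth_element_of(const double* base, std::size_t n)
  {
      return *(base + n);  // address = base + n * sizeof(double), i.e. base[n]
  }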

The drawback of this layout is that inserting or removing an element at any place in the sequence requires moving all elements that lie after the given position. On average, the number of elements that need to be moved is N/2, where N is the total number of elements in the sequence. Thus, the time required for an insert/erase operation grows in direct proportion to the size of the sequence. This is called linear time.
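
The shifting this implies looks roughly like this (again, just an illustrative sketch):

  #include <algorithm>
  #include <cstddef>

  // Insert value at position pos of a contiguous buffer that holds n
  // elements (and has room for one more): every element from pos
  // onwards must first be shifted one slot to the right.
  void insert_at(double* data, std::size_t n, std::size_t pos, double value)
  {
      std::copy_backward(data + pos, data + n, data + n + 1);  // n - pos moves
      data[pos] = value;
  }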

What matters here is the growth of the time (or memory) required with respect to the growth of the problem's size. The classification of this growth is called algorithmic complexity. Note that a few CPU cycles more or less are irrelevant. Let's suppose that moving an element is 10 times faster than calculating the address of the n'th element. In that case, insert/erase operations would be faster than random access in sequences of fewer than 20 elements. From that size on, the time required for insert/erase grows and grows, while the time required for random access remains the same.
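
Where does the figure of 20 come from? If one address calculation takes time t, moving an element takes t/10, so an average insertion costs (N/2)·(t/10) = N·t/20, which stays below the fixed random access cost t exactly while N < 20.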

No matter what constant or proportional advantage the linear operation has, as the size grows, a moment will come when it starts to be slower than the constant time operation. And from there on, it only gets slower and slower. For this reason, linear complexity is characterized as O(N) and is said to belong to a higher complexity order than constant complexity (O(1)). For more details, see Big O notation in Wikipedia.

Another usual way to organize data in memory solves the aforementioned problem: it consists of letting the elements spread out, but keeping them linked by storing, together with each one, the addresses of the next and previous elements in the sequence. Inserting and erasing are then O(1), requiring only that a few links be changed. The drawback here is that, once the contiguous layout is lost, accessing the n'th element requires jumping to the next element n times, starting from the first one. Thus, random access is now O(N).
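
A minimal sketch of such a node and of a constant-time insertion (illustrative names, not the library's):

  struct list_node
  {
      double     value;
      list_node* next;
      list_node* prev;
  };

  // Insert n right after position pos: four link updates, no matter
  // how long the sequence is.
  void insert_after(list_node* pos, list_node* n)
  {
      n->prev = pos;
      n->next = pos->next;
      if (pos->next != 0)
          pos->next->prev = n;
      pos->next = n;
  }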

The complexity gap

The C++ Standard Template Library provides generic sequence containers for the two exposed ways of storing data:

  vector and deque: (mostly) contiguous storage, with O(1) random access but O(N) insertion/removal
  list: linked storage, with O(1) insertion/removal but O(N) random access

In most real-life cases, one of these fits the requirements (either fast random access or fast insertion/removal). However, in some rare cases, both kinds of operation are used intensely. The complexity of an algorithm depends on the operations in which it spends more and more time as the size of the problem grows. Therefore, an algorithm that performs N insert/erase operations and N random accesses in a sequence of size N will have O(N·N) complexity. If vector is chosen, the insert/erase operations will be the bottleneck. If list is chosen, the bottleneck will be the random access operations.
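
For instance (an illustrative sketch, not code from the library), a Josephus-style elimination performs one random access and one erase per round:

  #include <cstddef>
  #include <vector>

  // Starting at position 0, repeatedly skip k-1 survivors and remove
  // the next one, until a single element is left. With vector, the N
  // erases make the loop O(N·N); with list, the positional accesses
  // would cost O(N) each instead, with the same overall result.
  int last_survivor(std::vector<int> seq, std::size_t k)
  {
      std::size_t pos = 0;
      while (seq.size() > 1)
      {
          pos = (pos + k - 1) % seq.size();  // random access position
          seq.erase(seq.begin() + pos);      // linear time in a vector
      }
      return seq[0];
  }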

In those cases, the problem with both vector (or deque) and list is that they are specialized for one kind of operation and perform poorly in the other.

Filling the complexity gap

A non-specialized sequence container that performed better than O(N) in both kinds of operation would be a great advance, even if it performed worse than vector on random access and worse than list on insert/erase. The sequence container avl_array, proposed here, has precisely these properties: it provides O(log N) random access and O(log N) insertion/removal. By using it, the aforementioned algorithm would have its time complexity lowered to O(N·log N).
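
Since avl_array is meant as a drop-in sequence container, the previous sketch only needs its container type changed. The header and namespace names used here, and the exact iterator interface, are assumptions; check the actual sources:

  #include <cstddef>
  #include "avl_array.hpp"  // header and namespace names assumed

  // The same elimination loop with avl_array: the positional access
  // and the erase both cost O(log N), so the loop as a whole drops
  // from O(N·N) to O(N·log N).
  int last_survivor(mkr::avl_array<int> seq, std::size_t k)
  {
      std::size_t pos = 0;
      while (seq.size() > 1)
      {
          pos = (pos + k - 1) % seq.size();
          seq.erase(seq.begin() + pos);  // O(log N) jump + O(log N) erase
      }
      return seq[0];
  }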

As in list, in avl_array every element is linked to its two sequence neighbors:

next---> ················· next--->
 A        B     ······      F        G
    <---prev ················· <---prev

Additionally, every element is linked to three more elements. These links are employed to build a tree structure:

      D
    /   \
  B       F
 / \     / \
A   C   E   G
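
Putting both pictures together, every node would carry five links plus its payload (an illustrative sketch; the actual field names in the sources may differ):

  struct node
  {
      double value;   // the element itself
      node*  next;    // sequence links, as in list
      node*  prev;
      node*  parent;  // tree links: one to the node above...
      node*  left;    // ...and two to the nodes below
      node*  right;
      // plus one integer per node, introduced next
  };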

The topmost element is called the root of the tree. Trees are self-similar structures: every node is the root of a subtree that contains the nodes placed under it. In every node, a variable stores the rank of the node within its subtree (starting at 0). This kind of tree is called a rank tree or an order statistic tree:

         D,3
      /       \
   B,1         F,1
  /   \       /   \
A,0   C,0   E,0   G,0

This rank variable is employed for reaching the n'th element by descending from the root, taking at every step the left branch (if n < rank) or the right branch (if n > rank), and stopping at the node whose rank equals n. Before taking a right branch, n must be adjusted by subtracting the current rank plus one from it. The time complexity of this random access operation is O(log N), provided that the tree is balanced. Maintaining balance and ranks makes insert/erase operations O(log N) too.
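
A sketch of this descent, assuming the node structure above extended with a std::size_t rank field:

  #include <cstddef>

  // Return the n'th node (0-based) of the subtree rooted at t, or 0 if
  // n is out of range. Each step descends one level, so a balanced
  // tree is traversed in O(log N) steps.
  node* select(node* t, std::size_t n)
  {
      while (t != 0)
      {
          if (n == t->rank)
              return t;            // this very node is the n'th one
          if (n < t->rank)
              t = t->left;         // the n'th element lies on the left
          else
          {
              n -= t->rank + 1;    // skip the left subtree and t itself
              t = t->right;
          }
      }
      return 0;
  }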

That said, it must be clarified that the current implementation of avl_array doesn't store the rank of every node, but the count of nodes in its subtree (including both branches and the node itself):

         D,7
      /       \
   B,3         F,3
  /   \       /   \
A,1   C,1   E,1   G,1

The rank of a node is found in the count field of the root of its left subtree (if there is no left subtree, the rank is 0). The total size of the sequence is the root's count. On insertion/removal, the count fields in the path from the modified node up to the root must be updated: every node in this path recomputes its count by adding, unconditionally, the counts of its left and right branches plus one. Only the nodes in the way up to the root need to be updated.
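
In terms of the same sketched node structure, now carrying a std::size_t count field instead of a rank:

  #include <cstddef>

  // Rank of a node within its own subtree: the size of its left branch.
  std::size_t rank_of(const node* t)
  {
      return t->left != 0 ? t->left->count : 0;
  }

  // After an insertion or removal below t, refresh the counts on the
  // way up to the root: O(log N) nodes in a balanced tree.
  void update_counts(node* t)
  {
      for (; t != 0; t = t->parent)
          t->count = 1 + (t->left  != 0 ? t->left->count  : 0)
                       + (t->right != 0 ? t->right->count : 0);
  }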

The underlying tree structure is never exposed to the user. It is only employed for providing logarithmic time random access, and it doesn't represent any meaningful hierarchical relationship among the elements.


Revised 19 May, 2010

Copyright © Universidad de Alcalá 2006. See accompanying license.