SourceForge.net Logo AVL Array
(and Shiftable Files!)

Get the latest
version via bzr


Presentation


Page contents
Why don't ordinary files and arrays support insertion/removal operations?
Shiftable Files
AVL Array
Keeping track of Shiftable Files' contents
More information


Why don't ordinary files and arrays support insertion/removal operations?
  1. Because these operations would take too much time?

    No. Using augmented trees, these operations can be performed in logarithmic time. This is very similar to the performance of databases, which are all around.

  2. Because application programmers don't really need them?

    Well... no. Many applications manage sequences that require insertion/removal operations: text files, sound, video... These applications tackle the problem with ad-hoc solutions, reimplementing complex data structures and algorithms every time.

  3. Because there's no easy way to keep track of the moving contents?

    Correction: there wasn't. Now there is one easy way: AVL Array.


Shiftable Files

AVL Array was written first, but we'll better start with the simplified, permanent-storage version: Shiftable Files.

Shiftable Files is a library that offers the typical file primitives (read, write, and random access) plus fast insert/remove primitives. The next program is a usage example:

#include <shiftable_files.hpp>

int main ()
{
  shiftable_files::shiftable_file shf;

  shf.open ("hello.txt");   // Default: om_create_or_wipe_contents
  shf.write ("world!", 6);  // Overwrite (in this case, append)
  shf.seek_set (0);         // Jump to the beginning
  shf.insert ("Hello ", 6); // Insert
  shf.close ();             // Default: compact data and leave the
                            //          file as an ordinary file
  return 0;
}           // Output saved in new file "hello.txt":  Hello world!

Shiftable Files is based on memory-mapped files. The internal structure of a shiftable file is a tree of payload data blocks. When some bytes are deleted, the space left is simply marked as free. When new bytes are inserted, they fill the free space, or they are stored in newly allocated blocks at the end of the file. When there is too much free, unused space, some blocks from the end are moved, and the file is truncated.

On close operations, the payload data are recompacted and the metadata erased. Thus, the resulting file is left like an ordinary file, with no tree structure, no gaps, and no sequence jumps. Any program can open the resulting file.

Optionally, the close method can leave the shiftable file "as is", and simply close it with all the tree structure etc. This is faster, but then the resulting file should only be opened by programs using Shiftable Files.


AVL Array

AVL Array is a tool for programmers. It is a generic sequence container with logarithmic time complexity for both random access and insertion/removal (not only in the ends, but anywhere in the sequence). It is implemented as a C++ STL-like template.

Syntactically, AVL Array can be used as any STL sequence container. The program below populates an AVL Array with some numbers, and then inserts another number in the middle:

#include <iostream>
#include <avl_array.hpp>

using namespace std;
using namespace mkr;

int main ()
{
    avl_array<int> nums;              // Empty AVL_Array
    int i;

    for (i=0; i<10; i++)              // N times...
        nums.push_back (i);           // push_back: O(log N)

    nums.insert (nums.begin()+5, 99); // Iterator sum: O(log N)
                                      // Insertion: O(log N)

    for (i=0; i<11; i++)              // N times...
        cout << "  " << nums[i];      // Random access: O(log N)

    cout << endl;      // Program output:
                       //   0  1  2  3  4  99  5  6  7  8  9
    return 0;
}                // IMPORTANT: Each loop takes O(N log N) time
                 //            (this is bad)

Though perfectly legal, the code above is suboptimal. It can run faster if special care is taken in doing things The Right WayTM. General random access takes O(log N) time, but advancing to the next element takes only O(1) time. In addition, an AVL Array with N elements can be constructed from scratch in O(N) time. The program below does the same thing as the previous program, but faster:

#include <iostream>
#include <avl_array.hpp>

using namespace std;
using namespace mkr;

int main ()
{
    avl_array<int> nums(10);          // AVL_Array with N=10 elements: O(N)
    avl_array<int>::iterator pos;
    int i;

    for (  pos=nums.begin(), i=0;     // begin: O(1)
           pos!=nums.end();           // end and equality comparison: O(1)
           ++pos, i++             )   // Iterator increment: O(1)

        *pos = i;

    nums.insert (nums.begin()+5, 99); // Iterator sum: O(log N)
                                      // Insertion: O(log N)
    for (  pos=nums.begin();
           pos!=nums.end();
           ++pos             )        // (same as previous loop)

        cout << "  " << *pos;

    cout << endl;      // Program output:
                       //   0  1  2  3  4  99  5  6  7  8  9
    return 0;
}                // Each loop takes O(N) time
                 // (this is good)

For more details, see the full documentation of AVL Array.


Keeping track of Shiftable Files' contents

The million dollar question is:

How can I efficiently keep track of the different fragments of a file, if their position changes every time a byte is inserted or removed?

This issue is, most probably, the reason why today's files don't support insertion/removal operations. Though, the answer is quite simple:

Using an AVL Array!

AVL Array has a feature called Non-Proportional Sequence View (NPSV). This featue allows the user/programmer to define an alternative view of the sequence. In this view, every element has its own width. The position of an element is defined as the sum of the previous elements' widths. Here is an example:

Non-proportional Sequence View

The ordinary sequence view is the one we use in normal random access. In this view, every element occupies exactly one unit of space. That is, the ordinary width is always 1. It just can't be changed.

The NPSV width, in contrast, can be changed. In the example, element 'a' has an NPSV width of 3 units, and therefore element 'b' stands at NPSV position 3.

Note that element 'c' has an NPSV width of 0. As a result, it shares the NPSV position 4 with element 'd'.

Of course, the next operations take O(log N) time:

The solution should be clear at this point: Just store in an AVL Array the sequence of fragments of your file, and set each element's NPSV width to the size in bytes of the corresponding file fragment. Every time you insert or remove some bytes in the file, remember to update the NPSV width of the affected fragment.

For example, let's define some types and variables:

typedef avl_array
        <
            string,            // Element type
            allocator<string>, // (stuff required for STL-compatibility)
            true,              // Do use NPSV
            unsigned           // NPSV type
        >
        index;                       // Index of fragments of a shiftable file

typedef index::iterator bookmark;    // Reference to an element of the index

index I;
bookmark b, c;
unsigned pos, relpos;

If we append the fragment "Appendix B" (of 1000 bytes) to the shiftable file, we must reflect it in the index:

b = I.insert (I.end(), "Appendix B");
I.npsv_set_width (b, 1000);

Later, if we insert 50 bytes in this fragment of the file, we must update the index this way:

I.npsv_set_width (b, I.npsv_width(b)+50);

At any time, we can retrieve the absolute position of the fragment in the file just looking at the index:

pos = I.npsv_pos_of (b);

And we can also find the fragment that contains a given position:

c = I.npsv_at_pos (350);

Note that the fragment corresponding to c might not start exactly at position 350. It might start, for instance at position 300, and end at position 400. If we want to know the relative position of the byte 350 inside the fragment:

relpos = 350 - I.npsv_pos_of (c);


More information



Revised 21 May, 2010

Copyright © Universidad de Alcalá 2006-2010. See accompanying license.