Algorithms 101: Basic Data Structures Pt. 2

12 Aug 2016

This article will cover another bunch of must-know data structures, such as heaps, priority queues, associative arrays and hashes. Source code of examples used in this article is available in git repo.

Heaps

Heap is a tree-like data structure satisfying the following property: the key of parent node is ordered with respect of the key of child node. According to the order, heaps can be divided into max heaps (parent key is greater than the chilld's one) and min heaps (parent key is lesser than the child's one). According to the structure of an underlying tree heaps can be binary, d-ary, binomial, fibonacci and many others. The following interface is used to represent heaps:

package com.dancingrobot84.algorithms.datastructures;

public interface Heap<T extends Comparable<? super T>> {
    /** Get top item without removing it from heap **/
    public T peek();

    /** Get top item and remove it from heap **/
    public T pop();

    /** Return number of items in heap **/
    public int size();

    /** Insert new item into heap **/
    public void insert(T item);
}

Heaps are usually used to implement priority queues, which are used in some graph algorithms (Prim, Dijkstra). Heapsort sorting algorithm is another example of heap's usage.

AFAIK there is no particular implementation of heap as data structure in Java and C++ languages. However, there is std::make_heap function in C++ standard library which rearranges elements in given container in a way that they form a heap. As for Java, there is PriorityQueue class which uses heap under the hood.

Binary min heap implementation

Idea: this implementation does not have explicit tree structure. Instead, array is used to store nodes. Index of given node is used to identify its parent and children (getParent, getLeftChild and getRightChild methods). The image below gives an example of how to store binary heap of 7 items in an array.

Binary tree stored in array

The most important part of heap implementation is methods for maintaining heap property. In case of binary heap there are two of such methods: bubbleUp and bubbleDown. bubbleUp compares given element and its parent and if they violate heap property, exhanges them and moves up the tree. bubbleDown does the same thing, only for the children of the given element and moves down the tree. The first method is used to implement insert: we put new item at the end of an array and run bubbleUp on it in order to put it in its place. The second method is used to implement popMinimum: we exchange the top item with the last one and run bubbleDown on the top item in order to put it in its place.

Now the last thing to do is how to heapify given array efficiently. Naive approach is to create an empty heap and perform n insertions, which gives us O(nlog(n)) time and O(n) space. However, it is possible to do it in O(n) time and O(1) space using bubbleDown method: just iterate over the array in reverse order and run bubbleDown on each item.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Heap;
import com.dancingrobot84.algorithms.util.ArraysExt;

public class BinaryHeap<T extends Comparable<? super T>> implements Heap<T> {

    public static <T extends Comparable<? super T>> Heap<T> emptyWithCapacity(int capacity) {
        return new BinaryHeap<T>(capacity);
    }

    public static <T extends Comparable<? super T>> Heap<T> fromArray(T[] array) {
        return new BinaryHeap<T>(array);
    }

    @Override
    public T peek() {
        assert size() > 0 : "Heap is empty";
        return array[0];
    }

    @Override
    public T pop() {
        T min = peek();
        ArraysExt.swap(array, 0, lastPos);
        lastPos -= 1;
        bubbleDown(0);
        return min;
    }

    @Override
    public int size() {
        return lastPos + 1;
    }

    @Override
    public void insert(T item) {
        assert size() < capacity() : "Heap is full";
        lastPos += 1;
        array[lastPos] = item;
        bubbleUp(lastPos);
    }

    public int capacity() {
        return array.length;
    }

    private final T[] array;
    private int lastPos;

    @SuppressWarnings("unchecked")
    private BinaryHeap(int capacity) {
        array = (T[])new Comparable[Math.max(capacity, 0)];
        lastPos = -1;
    }

    private BinaryHeap(T[] originalArray) {
        array = originalArray;
        lastPos = array.length - 1;
        for (int i = lastPos; i >= 0; i--) {
            bubbleDown(i);
        }
    }

    private void bubbleDown(int pos) {
        if (pos < 0 || pos > lastPos) {
            return;
        }

        int minPos = pos;

        int leftPos = getLeftChild(pos);
        if (leftPos <= lastPos && array[minPos].compareTo(array[leftPos]) == 1) {
            minPos = leftPos;
        }

        int rightPos = getRightChild(pos);
        if (rightPos <= lastPos && array[minPos].compareTo(array[rightPos]) == 1) {
            minPos = rightPos;
        }

        if (minPos != pos) {
            ArraysExt.swap(array, pos, minPos);
            bubbleDown(minPos);
        }
    }

    private void bubbleUp(int pos) {
        int parentPos = getParent(pos);
        if (pos < 0 || pos > lastPos || parentPos < 0 || parentPos > lastPos) {
            return;
        }

        if (array[parentPos].compareTo(array[pos]) == 1) {
            ArraysExt.swap(array, parentPos, pos);
            bubbleUp(parentPos);
        }
    }

    private int getLeftChild(int parentPos) {
        return 2 * parentPos + 1;
    }

    private int getRightChild(int parentPos) {
        return 2 * parentPos + 2;
    }

    private int getParent(int childPos) {
        childPos = childPos % 2 == 0 ? childPos - 2 : childPos - 1;
        return childPos / 2;
    }
}

isEmpty and getMinimum have O(1) time complexity. Because both bubbleUp and bibbleDown take O(nlog(n)) time, insert and popMinimum methods have the same time complexity. As for Heap(int[]) ctor: though it is just n runs of bubbleDown which sholud give us O(nlog(n)) time, actually, it works faster than that, because most of heapification takes place in lower levels. Complete proof of its time complexity can be found on Wikipedia.

Priority Queues

Priority queue is abstract data structure similar to stack or queue, but the order of items retrieving in it depends on items' priorities. Usually item with the highest priority is served first.

Priority queues applicable in many popular algorithms: Dijkstra, Prim, Huffman encoding, etc. There are implementations of priority queues in C++ and Java: std::priority_queue and PriorityQueue classes respectively.

Implementation of priority queue using binary min heap

No additional comments as implementation is quite straightforward.

package com.dancingrobot84.algorithms.datastructures;

public class PriorityQueue<T extends Comparable<? super T>> {
    private final Heap<T> heap;

    public PriorityQueue(int capacity) {
        heap = BinaryHeap.<T>emptyWithCapacity(capacity);
    }

    public void insert(T item) {
        heap.insert(item);
    }

    public T pop() {
        return heap.pop();
    }
}

Associative arrays

Associative array or map is an abstract data structure. It stores key-value pairs in such a way, that each possible key appears only once. To describe associative array the following interface is used:

package com.dancingrobot84.algorithms.datastructures;

public interface Map<K extends Comparable<? super K>, V> {
    /** Return value associated with a key or null otherwise **/
    public V get(K key);

    /** Insert value into the map under specified key or rewrite if such key already exists **/
    public void put(K key, V value);
}

Associative arrays are applicable in great variaties of computer tasks: removing duplicates, counting items, etc. There are even specials databases called "key-value storage" which store key-value pairs either in memeory or on disk (Redis, Memcached, etc).

Hash Tables

Hash table is one of data structures implementing associative array. It uses array to store key-value pairs and special hash function to compute index of particular pair by its key. Usually, for implementation of map it performs better than binary search tree (both put and get work in O(1) amortized comparing to O(log(n)) in BST), but it is highly dependent on choosing the right hash function, load factor and collision resolution method.

Hash tables are used to implement maps, sets, database indexes and caches. They are available in almost any modern programming language. Some languages provide them as syntactic construct (Python, Ruby), others have specific classes in standard libraries (std::unordered_map in C++ and HashMap in Java).

Implementation of hash table

Idea: hash table is always implemented as underlying array containing entries. Depending on collision resolution method, entries could be either key-value pairs or their chains. There are several methods for collision resolution; the most popular ones are open adressing and chaining.

Open addressing method implements the following strategy: when collision occurs during insertion, try to find a different place in array for particular key by sequential probing of other places until the free one is found. When searching for a value, pairs are traversed in exactly the same way. There are several ways of choosing which places to look at when searching for the free one; the easiest method, called "linear probing", just looks at the place following the current one.

Chaining method solves the collision problem differently. It stores buckets of pairs in underlying array; when collision occurs, new pair is dropped in associated bucket. When searching for a value, bucket is searched. Usually, buckets are implemented as linked lists of pairs. The implementation below uses this method.

Another important part of hash table implementation is choosing the right hash function. The main requirement of hash function is uniform distribution of its values. The uniformity is crucial; otherwise the number of collisions rises causing bad performance of hash table. Though, for particular implementation this condition can be loosen to uniformity at specific sizes. The implementation below relies on hashCode function of Java's Object for simplicity and performs no additional hashing. It may affect performance as programmers writing classes are not obliged to guarantee uniformity of hashCode.

As the number of items in hash table grows, the chance of collision rises too; that's why it is necessary to resize underlying array and re-insert all key-value pairs sometimes. To determine the time for performing this action, load factor indicator is used. It is calculated as the number of key-value pairs inserted divided by the size of underlying array. As for the implementation below, load factor of 0.75 indicates that resizing is needed; the number was chosen manually, though in real world you may want to perform some experiments to choose it more precisely.

package com.dancingrobot84.algorithms.datastructures.impl;

import com.dancingrobot84.algorithms.datastructures.Map;

public class HashMap<K extends Comparable<? super K>, V> implements Map<K, V> {

    public static final int DEFAULT_CAPACITY = 2;

    public static <K extends Comparable<? super K>, V> Map<K, V> empty() {
        return new HashMap<K, V>(DEFAULT_CAPACITY);
    }

    public static <K extends Comparable<? super K>, V> Map<K, V> emptyWithCapacity(int capacity) {
        return new HashMap<K, V>(capacity);
    }

    @Override
    public V get(K key) {
        checkKey(key);
        Entry<K, V> entry = entries[findIndex(key)];
        for (; entry != null; entry = entry.next) {
            if (key.compareTo(entry.key) == 0) {
                return entry.value;
            }
        }
        return null;
    }

    @Override
    public void put(K key, V value) {
        checkKey(key);
        int index = findIndex(key);
        Entry<K, V> entry = entries[index];
        for (; entry != null; entry = entry.next) {
            if (key.compareTo(entry.key) == 0) {
                entry.value = value;
                return;
            }
            if (entry.next == null) {
                break;
            }
        }

        Entry<K, V> newEntry = new Entry<K, V>(key, value);
        if (entry == null) {
            entries[index] = newEntry;
        } else {
            entry.next = newEntry;
        }
        size++;

        if (isFull()) {
            resize(entries.length * 2);
        }
    }

    private int size;
    private Entry<K, V>[] entries;

    @SuppressWarnings("unchecked")
    private HashMap(int capacity) {
        size = 0;
        entries = (Entry<K, V>[])new Entry[Math.max(capacity, DEFAULT_CAPACITY)];
    }

    private static class Entry<K, V> {
        public K key;
        public V value;
        public Entry<K, V> next;

        public Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private void checkKey(K key) {
        assert key != null : "Nullary keys are forbidden";
    }

    private int findIndex(K key) {
        return key.hashCode() % entries.length;
    }

    private boolean isFull() {
        return size >= 0.75 * entries.length;
    }

    @SuppressWarnings("unchecked")
    private void resize(int newCapacity) {
        Entry<K, V>[] oldEntries = entries;
        entries = (Entry<K, V>[])new Entry[newCapacity];
        size = 0;

        for (int i = 0; i < oldEntries.length; i++) {
            for (Entry<K, V> entry = oldEntries[i]; entry != null; entry = entry.next) {
                put(entry.key, entry.value);
            }
        }
    }
}

Cormen, Leiserson, Rivest, Stein. Introduction to Algorithms