Lecture 14: Abstraction functions and flexibility

Let’s talk about the requirements we laid out for Graph in HW1, focusing only on graphs and nodes.

A Graph is a collection of Nodes with the following functions (more or less). We include their complexity requirements.

template <typename V> class Graph {
   /** Return number of nodes. O(1) time. */
   size_type size() const;
   /** Return the node with index i. O(1) time.
       @pre 0 <= i < size() */
   Node node(size_type i);
   /** Add a node. O(1) amortized time.
       @param[in] position the node's position
       @param[in] value the node's value
       @return result (the new node)
       @post new size() == old size() + 1
       @post result.index() == old size() */
   Node add_node(Point position, node_value_type value);
   /** Remove a node. Time polynomial in size().
       Invalidates @a n, but not any other node.
       Decrements the indexes of nodes above @a n. */
   void remove_node(Node n);

   class Node {
      /** Return the node's position. O(1) time. */
      Point position();
      /** Return the node's value. O(1) time. */
      node_value_type& value();
      /** Return the node's index. O(1) time. */
      size_type index();
   };
};

Example representation

How to implement this specification? A natural way is to start from the complexity requirements and use data structures with that complexity. For instance, take node(i). This returns a node in O(1) time, so it seems to imply a vector. (Vectors and hash tables are the basic data structures with O(1) access time.)

class Graph { ...
private:
   struct nodeinfo {
      Point position_;
      node_value_type value_;
   };
   std::vector<nodeinfo> nodes_; // index is node index
};

The Node object is a proxy for the position and value information stored in the graph under the node’s index.

class Graph { ...
   class Node { ...
      Point position() {
         return graph_->nodes_[index_].position_;
      }
      size_type index() {
         return index_;
      }
   private:
      graph_type *graph_;
      size_type index_;
      Node(graph_type *graph, size_type index)
         : graph_(graph), index_(index) {
      }
   };

   Node node(size_type i) {
      assert(0 <= i && i < nodes_.size());
      return Node(this, i);
   }
};

But this will cause a problem with removing nodes. Removing the node with index i must shift all nodes with greater indexes, to keep the indexes contiguous. Consider:

Graph<int> g;
auto n0 = g.add_node(Point(0,0,0), 0);   // n0.index_ == 0
auto n1 = g.add_node(Point(1,0,0), 1);   // n1.index_ == 1
auto n2 = g.add_node(Point(2,0,0), 2);   // n2.index_ == 2
// g.nodes_ == [<(0,0,0),0>, <(1,0,0),1>, <(2,0,0),2>]

g.remove_node(n0);    // Shifts values around in g.nodes_, but
                      // does not update n1.index_ and n2.index_!
// g.nodes_ == [<(1,0,0),1>, <(2,0,0),2>]

assert(n1.position() == Point(1,0,0));   // WILL FAIL!
            // n1.index_ == 1, but now g.nodes_[1] points to the
            // node with position (2,0,0)!
auto nx = g.node(0);  // Expect the node with position (1,0,0)
assert(nx == n1);   // WILL FAIL! They have different index_
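
The failure above can be reproduced in a few lines. This is a minimal sketch, not the HW1 Graph: IndexGraph and stale_position_after_remove() are hypothetical names, and each node stores a single int in place of a Point and a value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the index-proxy Graph: each node stores only an
// int "position" (a hypothetical simplification of Point and value).
struct IndexGraph {
    struct Node {
        IndexGraph* graph_;
        std::size_t index_;
        int position() const { return graph_->nodes_[index_]; }
    };

    std::vector<int> nodes_;  // index is node index

    Node add_node(int position) {
        nodes_.push_back(position);
        return Node{this, nodes_.size() - 1};
    }

    void remove_node(Node n) {
        // Shifts later elements down; outstanding Node handles are not updated.
        nodes_.erase(nodes_.begin() + n.index_);
    }
};

// Returns the position that the stale handle n1 reports after n0 is removed.
inline int stale_position_after_remove() {
    IndexGraph g;
    auto n0 = g.add_node(0);
    auto n1 = g.add_node(10);
    g.add_node(20);
    g.remove_node(n0);
    return n1.position();  // n1.index_ is still 1, which now holds 20
}
```

The handle n1 was created for position 10, but after the removal it reads the data of the node that used to be at index 2.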

We need to associate a more permanent identifier with each node—something that doesn’t change as nodes are removed. We called this second node index a “unique identifier” or “uid.” Here’s how we did it:

class Graph { ...
private:
   struct nodeinfo {
      Point position_;
      node_value_type value_;
      size_type index_;
   };
   std::vector<nodeinfo> nodes_;    // index is uid
   std::vector<node_id_type> i2u_;  // index is index, value is uid
};

The Node object is still a proxy, but by UID, not index. The primary change is i2u_, but nodeinfo changes as well: we need an O(1) map from UID to index to implement the Node::index() function; struct nodeinfo is a natural place to store that map.

class Graph { ...
   class Node { ...
      Point position() {
         return graph_->nodes_[uid_].position_;
      }
      size_type index() {
         return graph_->nodes_[uid_].index_;
      }
   private:
      graph_type *graph_;
      node_id_type uid_;
      Node(graph_type *graph, node_id_type uid)
         : graph_(graph), uid_(uid) {
      }
   };

   Node node(size_type i) {
      assert(0 <= i && i < i2u_.size());
      return Node(this, i2u_[i]);
   }
};

With suitable changes to add_node and remove_node to keep i2u_ up to date, this works great. A key change is that remove_node does not remove old nodes from the nodes_ array. If it did, then the uid-to-node mapping would change, invalidating nodes exactly as before! We spend space to get better complexity in a classic tradeoff.

Graph<int> g;
auto n0 = g.add_node(Point(0,0,0), 0);   // n0.uid_ == 0
auto n1 = g.add_node(Point(1,0,0), 1);   // n1.uid_ == 1
auto n2 = g.add_node(Point(2,0,0), 2);   // n2.uid_ == 2
// g.nodes_ == [<(0,0,0),0>, <(1,0,0),1>, <(2,0,0),2>]
// g.i2u_ == [0, 1, 2]

g.remove_node(n0);    // Shifts values around in g.i2u_!
// g.nodes_ == [<UNUSED>, <(1,0,0),1>, <(2,0,0),2>]
// g.i2u_ == [1, 2]

assert(n1.position() == Point(1,0,0));   // SUCCESS!
auto nx = g.node(0);  // Expect the node with position (1,0,0)
assert(nx == n1);   // SUCCESS!
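
The "suitable changes" to add_node and remove_node are not spelled out above, so here is one way they might look. It is a sketch under simplifying assumptions: UidGraph and slots() are hypothetical names, a single int stands in for position and value, and edges are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the uid-based representation. Positions and values are collapsed
// into one int (a hypothetical simplification), and edges are omitted.
class UidGraph {
  public:
    using size_type = std::size_t;

    class Node {
      public:
        int position() const { return graph_->nodes_[uid_].position_; }
        size_type index() const { return graph_->nodes_[uid_].index_; }
        bool operator==(const Node& o) const {
            return graph_ == o.graph_ && uid_ == o.uid_;
        }
      private:
        friend class UidGraph;
        Node(const UidGraph* graph, size_type uid) : graph_(graph), uid_(uid) {}
        const UidGraph* graph_;
        size_type uid_;
    };

    size_type size() const { return i2u_.size(); }

    Node node(size_type i) const {
        assert(i < i2u_.size());
        return Node(this, i2u_[i]);
    }

    Node add_node(int position) {
        size_type uid = nodes_.size();
        nodes_.push_back(nodeinfo{position, i2u_.size()});
        i2u_.push_back(uid);
        return Node(this, uid);
    }

    void remove_node(Node n) {
        size_type i = n.index();
        i2u_.erase(i2u_.begin() + i);
        // Nodes above n moved down one slot; refresh their stored indexes.
        for (; i < i2u_.size(); ++i)
            nodes_[i2u_[i]].index_ = i;
        // nodes_[n.uid_] is deliberately left in place (now unused).
    }

    // Number of slots in nodes_, including unused ones.
    size_type slots() const { return nodes_.size(); }

  private:
    struct nodeinfo {
        int position_;
        size_type index_;
    };
    std::vector<nodeinfo> nodes_;  // index is uid
    std::vector<size_type> i2u_;   // index is index, value is uid
};
```

With this sketch, removing n0 from a three-node graph leaves size() at 2 while slots() stays at 3, and n1 and n2 keep working with indexes 0 and 1.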

Of course, now the nodes_ array can grow without bound. This is a huge bummer, but one we can fix. Before doing so, we’ll take a tour of specifications, abstraction functions, and representation invariants. These properties will help us as we analyze and improve our data structure.

Specifications and abstract data types

The specifications at the top of the post refer over and over to a couple of concepts:

... the node's index ...
... the node's position ...
... the node's value ...
Invalidates ...

These together form an abstract concept of a graph. The user of the Graph class shouldn’t need to understand its implementation, but only its interface; and the interface is defined in abstract terms.

We win when interfaces are specific enough that it is possible to reason about their correctness. And for that, we need a specific graph abstraction.

Here’s one:

A graph G is a tuple ⟨N, E⟩.
N is a sequence of nodes [n_0, n_1, …, n_(m–1)] and E is a set of edges.
Each node is a pair of position and value ⟨p, val⟩ where p is a point in 3D space and val is an object of value type.
Each edge represents an unordered pair of nodes: If e ∈ E, then e = {n_i, n_j} where n_i and n_j are elements of N.

If we wanted, we could now write out our specifications more precisely in terms of abstract objects. For example:

/** Add a node. O(1) amortized time.
    @param[in] position the node's position
    @param[in] value the node's value
    @return result (the new node)
    @post new size() == old size() + 1
    @post result.index() == old size()

    In abstract terms, new G = <new N, new E>,
    where new N = old N ++ [<@a position, @a value>]
    and new E = old E. */
Node add_node(Point position, node_value_type value);

(Here, ++ on sequences concatenates the sequences together.) But the informal specifications are good enough in practice, as long as we can reliably extract a formal specification if and when we need one.

Abstraction functions

An abstraction function AF maps an internal representation of a class to the corresponding abstract concept. Abstraction functions let us bridge between the more abstract specifications provided by the comments and what actually happens in the code. Abstraction functions go from representation objects to abstract objects, because often many representation objects could stand for the same abstract object. For one example, we don’t generally care exactly where a Graph object is located in memory; it “means” the same thing regardless of its address.

An object’s representation consists of its data members. For Graph, this is the nodes_ and i2u_ arrays. The abstraction function, then, looks like this:

AF(*this) = G = ⟨N, E⟩, where:
N = [n_0, n_1, …, n_(m–1)], m = i2u_.size(), and n_i = ⟨nodes_[i2u_[i]].position_, nodes_[i2u_[i]].value_⟩ for all i in [0,m).

(We’re not considering edges, so forget about E for now.) The key thing to note is that the particular values of i2u_ do not occur in the abstract concept (the output of the abstraction function). Neither do the values of nodes_[x].index_. This is important, and common. Good data structures often include “helper members” that don’t match directly to parts of the corresponding abstract concept. We use those members to make the data structure better—either faster or, as here, less likely to cause problems for users. (It would be very difficult to use a Graph whose Node objects all got invalidated by every remove_node operation!) Thus, many graph representations with different node uids correspond to the same abstract graph.
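
To make the many-to-one point concrete, the node part of AF can be written as an ordinary function. This is only a sketch: abstraction(), AbstractNodes, and same_abstract_value_demo() are hypothetical names, and ints stand in for Point and node_value_type.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins: position and value are both ints here.
struct nodeinfo {
    int position_;
    int value_;
    std::size_t index_;
};

using AbstractNodes = std::vector<std::pair<int, int>>;  // the sequence N

// AF restricted to nodes: walk i2u in index order and read each node's data.
// Assumes the representation is valid (every i2u[i] is in range for nodes).
inline AbstractNodes abstraction(const std::vector<nodeinfo>& nodes,
                                 const std::vector<std::size_t>& i2u) {
    AbstractNodes N;
    for (std::size_t uid : i2u)
        N.emplace_back(nodes[uid].position_, nodes[uid].value_);
    return N;
}

// Two different representations (different uids, one unused slot) that
// stand for the same abstract node sequence [<1,10>, <2,20>].
inline bool same_abstract_value_demo() {
    std::vector<nodeinfo> a = {{1, 10, 0}, {2, 20, 1}};
    std::vector<std::size_t> a_i2u = {0, 1};
    std::vector<nodeinfo> b = {{0, 0, 99}, {2, 20, 1}, {1, 10, 0}};
    std::vector<std::size_t> b_i2u = {2, 1};
    return abstraction(a, a_i2u) == abstraction(b, b_i2u);
}
```

Neither the uids nor the index_ fields show up in the result, and the unused slot in the second representation (with its meaningless index_ of 99) makes no difference to it.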

Representation invariants

A representation invariant defines whether a class representation is valid. We use representation invariants to help prove that data structure operations are correct: every public data structure operation can assume that the data structure is valid on input, and must provide a postcondition that the data structure is valid on output. (There’s an exception for operations that destroy data structures, whose specifications say that they invalidate their input. remove_node is an example: it invalidates the node passed to it.)

Representation invariants are functions that take representation objects and return Boolean values (true for valid, false for invalid).

For Graph, the representation invariant needs to check that the nodes_ and i2u_ arrays are synchronized. RI(*this) is true if and only if:

For every i in [0,i2u_.size()), nodes_[i2u_[i]].index_ == i.

The key thing to note here is that values not listed in the abstract concept appear in the representation invariant. This is again important, and common. We add helper members to improve the data structure; but they have to be correct to help! And here, the basic correctness requirement on nodes is that the index_ member is right.

Several other useful consistency requirements are actually already expressed by this invariant:

For each i with 0 ≤ i < i2u_.size(), 0 ≤ i2u_[i] < nodes_.size(). (This is implied since otherwise the element access nodes_[i2u_[i]] would fail.)
The uids in i2u_ are distinct: if 0 ≤ i < j < i2u_.size(), then i2u_[i] ≠ i2u_[j]. (This is implied since, for any uid u, nodes_[u].index_ can take only one value, so u cannot appear at two different indexes.)

It’s usually good to express the invariant as compactly as possible, since that makes it easier to understand and prove.

Our representation invariant doesn’t mention position_ or value_ because there are no internal consistency requirements on those fields. The abstraction function and representation invariant serve different purposes and can be quite independent.
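
Since the invariant is just a Boolean function of the representation, it can be written as one. The following is a sketch with a hypothetical name, rep_ok(), and a nodeinfo stripped down to the one field the invariant mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct nodeinfo {
    std::size_t index_;  // position_ and value_ omitted: RI ignores them
};

// RI for the nodes_/i2u_ pair. The range test is written out because a
// checker, unlike ordinary operations, cannot assume its input is valid.
inline bool rep_ok(const std::vector<nodeinfo>& nodes,
                   const std::vector<std::size_t>& i2u) {
    for (std::size_t i = 0; i < i2u.size(); ++i) {
        if (i2u[i] >= nodes.size()) return false;     // uid out of range
        if (nodes[i2u[i]].index_ != i) return false;  // index_ out of sync
    }
    return true;
}
```

A duplicated uid in i2u is rejected without a separate check, because the same nodeinfo cannot store two different indexes. The function takes time linear in i2u.size(), so it suits assertions in debug builds better than checks on every operation.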

Abstraction functions always work on valid representations, so if RI(x) is false it’s OK for AF(x) to break or return weird garbage.

Node abstraction function and representation invariant

The Node subobject has its own abstraction function and representation invariant. The abstract concept of a node is a subconcept of that of a graph.

AF(n) = n_i, where i = n.graph_->nodes_[n.uid_].index_ and n_i is the i’th node in AF(*n.graph_).
RI(n) is true if and only if 0 ≤ n.uid_ < n.graph_->nodes_.size().

Do you think this is complete, though? Think about it for a minute.

It’s not complete, because removed nodes are invalid, but their uids are still in range by design! We can improve the representation invariant to catch removed nodes this way:

RI(n) is true if and only if n.graph_->nodes_[n.uid_].index_ = i, where n.graph_->i2u_[i] = n.uid_.

If i2u_ and nodes_[].index_ don’t match, the node has been deleted. Again we can elide some implied requirements, such as that n.uid_ and i are in range for their respective arrays. This is very cool: we can add an O(1)-time valid() function to Node that verifies a node is valid, and then use that function in assertions!

class Node { ...
private:
   bool valid() {
      return uid_ >= 0 && uid_ < graph_->nodes_.size()
           && graph_->nodes_[uid_].index_ < graph_->i2u_.size()
           && graph_->i2u_[graph_->nodes_[uid_].index_] == uid_;
   }
public:
   Point position() {
      assert(valid());
      return graph_->nodes_[uid_].position_;
   }
   ...
};

Note how valid() actually contains the implied requirements from the representation invariant, not just the main requirement. This is important. The purpose of valid() is to detect invalid nodes, so unlike most other operations, it doesn’t assume its input is totally valid. The carefully written out checks avoid crashing when a node is invalid and (say) has an index_ that’s out of range for i2u_.

Saving space

Now let’s return to our space concern: if we call “n = add_node(); remove_node(n)” repeatedly, our graph data structure will grow more and more <UNUSED> elements. The total size of the graph is proportional to the total number of add_node calls, not the graph’s size or even its maximum size. To do better, we must reuse space from unused elements. And to do that, we must keep track of which elements are unused. We need a free list.

A lot of you had good ideas on how to represent the free list. Add a stack of free element indexes, or a vector, or even a double-ended queue (!). These work and are even good ideas (because they are simpler code). But you can do it by adding four bytes to the graph representation. How would you do this? Think about it.

What operations must the free list support? Not very many, if we think systematically.

remove_node() will add a node to the free list.
add_node() should check the free list for an element that could be reused. If there is one, it should reuse that element and advance the free list to the next free element.

Sounds like push_front() and pop_front(). Several container structures support these operations in O(1) time. We turn to singly linked lists. A singly linked list uses two types of data: (1) a head pointer to the first list element, and (2) per-element next pointers that link the list together. The end of the list is indicated by a distinguished sentinel value that can never equal a valid pointer (such as NULL).

Adding a head pointer to the first free element would take 4 extra bytes. But where can we find space for next pointers? Simple: reuse the nodes_[].index_ values! List links don’t need to be true C pointers; integers work just as well.

class Graph { ...
private:
   struct nodeinfo { ...
      node_id_type index_; // or next free nodeinfo
   };
   std::vector<nodeinfo> nodes_;
   std::vector<node_id_type> i2u_;
   node_id_type free_; // initialized to (node_id_type) -1

public:
   void remove_node(Node n) {
      ... free adjacencies, etc. ...
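      // remove node from i2u_, then re-sync the indexes of the shifted nodes
      size_type i = n.index();
      i2u_.erase(i2u_.begin() + i);
      for (; i < i2u_.size(); ++i)
         nodes_[i2u_[i]].index_ = i;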
      // mark node as free
      nodes_[n.uid_].index_ = free_;
      free_ = n.uid_;
   }

   Node add_node(Point position, node_value_type value) {
      node_id_type uid;
      if (free_ != (node_id_type) -1) { // we have a free slot
         uid = free_;
         free_ = nodes_[free_].index_;
      } else { // no free slot, add a new slot to the back
         uid = nodes_.size();
         nodes_.push_back(nodeinfo());
      }
      // rest is unchanged
      nodes_[uid].position_ = position;
      nodes_[uid].value_ = value;
      nodes_[uid].index_ = i2u_.size(); // overwrites any free-list link
      i2u_.push_back(uid);
      return Node(this, uid);
   }
};

But wait a minute—the representation invariant RI puts requirements on the index_ member; are we allowed to reuse it?!

Yes, and when you see why, you’ll understand a lot about abstraction functions and representations. The graph representation invariant is, again:

For every i in [0,i2u_.size()), nodes_[i2u_[i]].index_ == i.

But free nodes’ uids are not listed in i2u_. (They aren’t valid nodes, after all.) The representation invariant only discusses uids found in i2u_, so it does not constrain the values of free nodes. We can put anything we want in nodes_[i].index_, as long as i is a free uid.

It would be useful, however, to extend our representation invariant to check the free list. A correct graph will ensure that free items and used items are disjoint, and that free items and used items together cover all items.

(Old invariant) For every i in [0,i2u_.size()), nodes_[i2u_[i]].index_ == i.
Let F equal the set of uids listed on the free list, starting from free_; and let U equal the set of uids in the i2u_ array. Then F and U are disjoint, and F ∪ U equals the range [0,nodes_.size()).
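
The extended invariant can also be checked mechanically. This is a sketch under assumptions: rep_ok() and NO_FREE are hypothetical names, and nodeinfo again keeps only the index_ field, which doubles as the free-list link.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using node_id_type = std::size_t;
const node_id_type NO_FREE = static_cast<node_id_type>(-1);  // list sentinel

struct nodeinfo {
    node_id_type index_;  // node index, or next free uid when on the free list
};

// Checks the old invariant plus: F and U are disjoint and together cover
// [0, nodes.size()). Runs in O(nodes.size()) time.
inline bool rep_ok(const std::vector<nodeinfo>& nodes,
                   const std::vector<node_id_type>& i2u,
                   node_id_type free_head) {
    std::vector<bool> seen(nodes.size(), false);
    std::size_t count = 0;
    // U: uids listed in i2u, each of which must point back at its index.
    for (std::size_t i = 0; i < i2u.size(); ++i) {
        node_id_type u = i2u[i];
        if (u >= nodes.size() || seen[u] || nodes[u].index_ != i) return false;
        seen[u] = true;
        ++count;
    }
    // F: uids reachable from free_head. A uid seen twice means overlap with
    // U or a cycle in the free list.
    for (node_id_type u = free_head; u != NO_FREE; u = nodes[u].index_) {
        if (u >= nodes.size() || seen[u]) return false;
        seen[u] = true;
        ++count;
    }
    return count == nodes.size();  // every slot is either used or free
}
```

The final comparison catches leaked slots: a uid that is neither in i2u nor on the free list leaves count short of nodes.size().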

Now, if we want, we can prove our code maintains this invariant for every operation. It’s easy for most operations—Node::position() doesn’t change i2u_ or index_, for example, so the postcondition “RI(*graph_)” follows directly from the precondition. For others (add_node()) it’s hard, but possible. The invariant doesn’t hold at every point during the operation, but assuming it holds at the beginning, we can prove it holds at the end.

Validity

Unfortunately, this space-saving change changes the meaning of our representation invariant on nodes.

A node becomes invalid as soon as it’s removed from the graph. This validity transition is instantaneous and doesn’t require any code—it just happens, at the semantic level. For instance:

auto n1 = g.add_node(...);
auto n2 = n1;
auto n3 = g.add_node(...);
n3 = n1;
g.remove_node(n3); // INSTANTLY n1, n2, and n3 become invalid

The previous node representation invariant allowed us to check node validity. After g.remove_node(n3), all of n1.valid(), n2.valid(), and n3.valid() would return false. And since node uids were never reused, the nodes would remain checkably invalid forever.

But now we reuse node uids, which can make an old uid appear valid again!

Graph<...> g;                // new graph
auto n1 = g.add_node(...);   // n1.uid_ == 0
g.remove_node(n1);           // free n1.uid_
assert(!n1.valid());         // checkably invalid
auto n2 = g.add_node(...);   // reuse uid 0!
assert(n1.valid());          // n1 appears valid again!
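
Here is the same scenario as a self-contained sketch. FreeListGraph, NO_SLOT, and slots() are hypothetical names, and positions, values, and edges are dropped, since valid() only looks at nodes_[].index_ and i2u_.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t NO_SLOT = static_cast<std::size_t>(-1);  // free-list sentinel

// Free-list graph reduced to what valid() needs; positions, values, and edges
// are omitted (a hypothetical simplification).
class FreeListGraph {
  public:
    using id_type = std::size_t;

    struct Node {
        const FreeListGraph* graph_;
        id_type uid_;
        bool valid() const {
            const FreeListGraph& g = *graph_;
            return uid_ < g.nodes_.size()
                && g.nodes_[uid_].index_ < g.i2u_.size()
                && g.i2u_[g.nodes_[uid_].index_] == uid_;
        }
    };

    Node add_node() {
        id_type uid;
        if (free_ != NO_SLOT) {           // reuse a free slot
            uid = free_;
            free_ = nodes_[uid].index_;
        } else {                          // append a new slot
            uid = nodes_.size();
            nodes_.push_back(nodeinfo{0});
        }
        nodes_[uid].index_ = i2u_.size();
        i2u_.push_back(uid);
        return Node{this, uid};
    }

    void remove_node(Node n) {
        id_type i = nodes_[n.uid_].index_;
        i2u_.erase(i2u_.begin() + i);
        for (; i < i2u_.size(); ++i)      // re-sync shifted nodes
            nodes_[i2u_[i]].index_ = i;
        nodes_[n.uid_].index_ = free_;    // push the slot on the free list
        free_ = n.uid_;
    }

    // Number of slots in nodes_, used or free.
    std::size_t slots() const { return nodes_.size(); }

  private:
    struct nodeinfo { id_type index_; };  // index, or next free uid
    std::vector<nodeinfo> nodes_;
    std::vector<id_type> i2u_;
    id_type free_ = NO_SLOT;
};
```

Adding a node, removing it, and adding another leaves slots() at 1, which is the space saving we wanted. The cost is that the first handle, which valid() rejected right after the removal, passes valid() again once its uid is reused.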

Now, is this bad? That depends.

We program in C and C++ because we are interested in performance. We give up some safety for that performance: we can turn pointers into integers, write to random memory, access memory after freeing it, all sorts of awful stuff. This makes representation invariants inherently incomplete. Every C/C++ representation invariant assumes, as a precondition, that the representation in question wasn’t destroyed by random memory writes. Given that assumption, it’s not too far-fetched to expect programmers to avoid other kinds of problems, such as touching invalid nodes. Also, in some cases, preconditions and representation invariants are unacceptably expensive to check. Imagine a full precondition checker for binary search: it would have to check that the input sequence was sorted, which takes O(n) time and violates binary search’s O(log n) complexity requirement!

Nevertheless, invariant checking is often cheap. And when it is, you should definitely load your program with relevant assertions. They might catch real bugs! You can turn them off, if you must, after you prove your code correct.

Is it possible, then, to change the Graph representation so that we can detect all invalid nodes, including node copies, in O(1) time? Think about it.

Yes, we can, as long as we spend some space. We need to reuse uids to save space, but to detect use of invalid nodes, we can simply add another identifier that is never reused. This type of identifier is often called a generation number. Add an “unsigned gen_” to struct nodeinfo, and an “unsigned gen_” to Node. On every Node operation, check that the generations match. Done!

Or almost. Next time we’ll implement the generation version more carefully and write its invariants.
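
In the meantime, here is a rough sketch of the generation idea on its own, separate from the graph. It is one possible shape, not the version promised above: SlotTable, Handle, and NO_SLOT are hypothetical names, and it assumes the generation is bumped when a slot is freed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t NO_SLOT = static_cast<std::size_t>(-1);  // free-list sentinel

// Generation-checked handles over a slot array with an index-linked free
// list. Graph details (positions, i2u_, edges) are omitted.
class SlotTable {
  public:
    struct Handle {
        std::size_t uid_;
        unsigned gen_;
    };

    Handle add() {
        std::size_t uid;
        if (free_ != NO_SLOT) {       // reuse a free slot, keeping its gen_
            uid = free_;
            free_ = slots_[uid].next_;
        } else {                      // grow the array
            uid = slots_.size();
            slots_.push_back(Slot{0u, NO_SLOT});
        }
        return Handle{uid, slots_[uid].gen_};
    }

    void remove(Handle h) {
        assert(valid(h));
        ++slots_[h.uid_].gen_;        // outstanding handles to this slot go stale
        slots_[h.uid_].next_ = free_;
        free_ = h.uid_;
    }

    // O(1): a handle is valid only if its generation matches the slot's.
    bool valid(Handle h) const {
        return h.uid_ < slots_.size() && slots_[h.uid_].gen_ == h.gen_;
    }

  private:
    struct Slot {
        unsigned gen_;
        std::size_t next_;            // next free uid while the slot is free
    };
    std::vector<Slot> slots_;
    std::size_t free_ = NO_SLOT;
};
```

A handle and all of its copies stop validating as soon as the slot is removed, and they stay invalid after the uid is reused. One caveat: an unsigned counter eventually wraps around, so a generation is only "never reused" until a single slot has been recycled 2^32 times (for a 32-bit unsigned).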

Posted on February 24, 2012