We now move onto a more complicated—but immensely powerful—data structure, the graph. This isn’t the same kind of graph that you may have learned in school, that of a function such as y = x^2 + 2, but instead is a kind of data structure that looks like this

Graph
Graph

What on Earth is this telling us? Firstly let’s start with some terminology; the circles with letters in are called nodes, or vertices, and the lines joining them are called arcs, or edges. The placement of these nodes and the shape of the edges connecting them is entirely irrelevant to the information in the graph, and we could move them around however we like provided we keep the edges intact. The presence of an arrow on an edge dictates its direction, and the absence of an arrow implies that it’s bidirectional. Finally, the numbers next to the edges are called weights, and can represent anything, but commonly represent how unfavourable crossing that edge is.

A lot of problems can be simplified down to graphs such as this. As a simple example, let’s pretend that each vertex is a city, say Aberdeen, Birmingham, Cambridge, Dundee and Exeter, with the number representing how difficult it is to get from one city to another. If we can only travel one way down directed edges, what is the easiest path to take to get from Exeter to Dundee? In other words, if we begin at node E and move along the edges to node D, summing their weights as we go, what is the smallest sum we can obtain?

Ignoring the directed edges that would be the path ECD, but we can’t travel from Cambridge to Dundee, we can only go from Dundee to Cambridge. We could however go along EBD, giving a total weight of 12, which seems quite good, but if instead we detoured to Aberdeen from Birmingham, that is the path EBAD, even though we travel through more cities we only have a total weight of 11, which is therefore a more efficient route! In fact it is the most efficient route for this graph. Whilst this was a very simple—and unrealistic—example, it shows that graphs can make some problems much easier to solve. This is an example of how the correct representation of data can make for far better algorithms. Imagine trying to represent this with an array or a linked list! You’d have to have linked lists of linked lists, or arrays of arrays…

How then would we go about implementing a graph? STL doesn’t have this functionality (annoyingly), so we’ll have to make our own. Let’s start by writing out our list of requirements. Ignoring efficiency, what do we want to be able to do?

  • Add vertices anywhere in the graph, which have data associated with them (the letters in the example)
  • Join two vertices together with a directed edge
  • Join two vertices together with an undirected edge
  • Remove vertices from anywhere in the graph
  • Remove edges from anywhere in the graph
  • Add weights to vertices
  • Traverse through every vertex in the graph
  • Find a short path from one vertex to another

OK, those are our requirements, and they seem pretty reasonable to me! Now it’s clear already that we can’t use a data structure that we’ve already covered to represent this, we need something new. Let’s consider efficiency first. Ideally we want O(1) efficiency on all of these points, but that isn’t really realistic, instead I think that it’s reasonable to ask for O(1) efficiency for the first six and perhaps O(n) efficiency for the last two, if we can manage that. Maybe we can’t, but we’ll try.

Let’s change notation a bit first, graphs are special enough to have their own notation. Let V be the set of all vertices in the graph, and let E be the set of all edges. A set is just an unordered collection of stuff, in this case the vertices and edges of the graph. (We can implement sets as data structures, but we’ll save that for later.) We’ll say that |V| is the number of vertices, and |E| is the number of edges. Now we can refine our efficiency requirements. If you think about it, there’s no way that we’ll be able to get vertex removal to be O(1). If the the vertex has five edges, then we’ll have to update the edge links for five other vertices. Perhaps it’s safer to say |E| then. Maybe we’ll exceed that, but that’s really an upper bound of time efficiency. A estimate on the second point is then probably O(|V|), and perhaps we could get O(|E| * |V|) for the last point. Enough of theoretical complexities though, how do we implement a graph?

One way immediately jumps to mind; store an array of vertices, and for each vertex store an array of pointers which point to the other vertices that the vertex is joined to. So for our example vertex D will store pointers to A, B, and C, but A will only store a pointer to D. This way we can get directed and undirected edges which we can easily add and remove. We can’t add new vertices if we use arrays though, so we could use a list or vector instead. Since we want random element access, we should use a vector. What about edge weights though? We have no way of storing those. Instead we could create a separate Edge class which stored a weight as well as two pointers, and then we could create a vector of edges and a vector of vertices. Then we would have two vectors of length |E| and |V| giving a total space complexity of O(|V| + |E|). That seems to work pretty well; a quite efficient use of space, and it allows for quick edge and vertex insertion and removal.

A bit of thinking reveals an alternative way. We could use a two-dimensional square grid, of side length |V|, where each position in the array has a number corresponding to the weight of the edge between the two vertices. So for our graph we would get a grid like so:

Adjacency Matrix
Adjacency Matrix

Referencing each number using a simple (i, j) coordinate system where i is the row, j is the column, and (1, 1) is the top left, we can say that a directed edge is drawn from i to j if a positive number is present at point (i, j). We can see that the grid—henceforth referred to as a matrix—is roughly symmetrical about the main diagonal (diagonal going from the top left to the bottom right), the exceptions being where the edge was directed and not undirected. This isn’t really that important for what we’re doing but it’s interesting to note.

At first glance this looks to be a much worse implementation. We can create the matrix using a vector of vectors of ints, that’s fine, but it isn’t very easy to add a new vertex. Each vector in the main vector has to increase in size by one, and then the main vector has to increase in size, adding a new vector of the increased size to itself. Sounds quite complicated! What’s more, this system cannot store vertex data. Then there’s the problem of memory usage; this matrix has O(|V|^2) space complexity. That’s the worse we’ve seen yet! But let’s take a step back for a moment. The space complexity of the previous method—henceforth called an edge list—is O(|V| + |E|).

But say our graph has 20 vertices in it. Then the matrix representation would take up 400k + 20c bytes, where k is the size of each edge weight and c is the size of each piece of data. To keep things simple we’ll store integers and have integer weights, and so the matrix uses up (ignoring the sizes of vectors etc.) 1680 bytes. This amount is constant, regardless of the number of edges in the graph. Using 64-bit pointers, single integer edge weights and integer data values, the edge list representation gives a space complexity of 80 + 12|E|. If we have more than 133 edges in the graph the matrix implementation has a smaller memory footprint! Sure, 133 edges is a lot for just 20 vertices… is it? Well how many edges would you need to join each up to 7 others? Since undirected edges are just two directed edges, that’s 7 edges per vertex, or 140 edges in total! Each vertex is connected to approximately 35% of the remaining vertices, yet the matrix form is more efficient! If we were to join every vertex to every other one, we would have almost 400 edges. For a graph with a lot of nodes and a high proportion of edges, it is clear that the matrix form is the way to go. Such a graph is called a dense graph, and one with few edges is called sparse.

Here’s some example code for each implementation, which I suggest you have a look as we’ll be using them for the rest of the tutorial (and other tutorials in this series too). Of course feel free to write your own instead, and please do, nothing beats writing code yourself! (If the example code doesn’t compile, read the comments; non-C++11 users need to change lines 72 and 73.)

Here’s some example usage code for my implementation that creates our example graph:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
#include <iostream>
#define USE_MATRIX 0 // Set to 1 if using matrix implementation

#if USE_MATRIX == 1
#include "graph-matrix.hpp"
#else
#include "graph.hpp"
#endif

int main(void)
{
    // Create a new graph
    Graph<char> graph;

    // Fill the graph with nodes A-E
    for(char ch = 'A'; ch <= 'E'; ++ch)
    {
        graph.add_node(ch);
    }

    // Add the edges
    // A-D, weight 2
    graph.add_edge('A'-'A', 'D'-'A', false, 2.0);
    // B->A, weight 2
    graph.add_edge('B'-'A', 'A'-'A', true, 2.0);
    // B-D, weight 5
    graph.add_edge('B'-'A', 'D'-'A', false, 5.0);
    // B-E, weight 7
    graph.add_edge('B'-'A', 'E'-'A', false, 7.0);
    // C-E, weight 3
    graph.add_edge('C'-'A', 'E'-'A', false, 3.0);
    // D->C, weight 6
    graph.add_edge('D'-'A', 'C'-'A', true, 6.0);

    #if USE_MATRIX == 0
    // Iterate through the graph, printing the edges along with their weights
    for(int i = 0; i < graph.edges.size(); ++i)
    {
        std::cout << graph.edges[i].nodes[0]->data << "---"
            << graph.edges[i].weight << "---" << graph.edges[i].nodes[1]->data << std::endl;
    }
    #else
    // Use this loop instead if using graph-matrix.hpp
    for(int y = 0; y < graph.nodes.size(); ++y)
    {
        for(int x = 0; x < graph.nodes.size(); ++x)
        {
            if(graph.edges[x][y] != 0.0)
            {
                std::cout << graph.nodes[x] << "---" << graph.edges[x][y] << graph.nodes[y] << std::endl;
            }
        }
    }
    #endif

    return 0;
}

The output of the program should look like

A---2---D
D---2---A
B---2---A
B---5---D
D---5---B
B---7---E
E---7---B
C---3---E
E---3---C
D---6---C

Both of the implementations cover the first six of our requirements, those that were to do with manipulating and managing the data. But we’ve yet to consider the last two points, which are really algorithms more than features of the data structure. Since they’re essential to understanding why a graph is actually useful, instead of just an unnecassarily neat way to represent some data, they will be considered later on.