
#### Lecture 30: Breadth-first search and Depth-first search on graphs

Finding paths from one node to another

Finding a route from one location to another is a fundamental question in everyday life, and it shows up in many forms, from explicit questions (“Can you give me directions to get to the library from here?”, or “What prerequisites must I take in order to qualify for this class?”), to more abstract problems (e.g., tracing the spread of a meme going viral). Often the goal is not merely to find a route, but to find the shortest route from one place to another.

##### 30.1 Primer: graphs

We can model all of these problems as asking questions about graphs, which are simply abstract representations of connections between things. We say a graph is a collection of vertices (or nodes) and edges between vertices:
• A social graph tracks relationships (the edges) between people (the vertices)

• A road network tracks roads (the edges) between places (the vertices)

• An airline network tracks flights (the edges) between cities (the vertices)

• A curriculum dependency graph tracks prerequisites (the edges) between courses (the vertices)

Graphs can be directed or undirected: in directed graphs the edges point from one node to another, whereas in undirected graphs the edges connect in both directions. All the examples above are directed graphs:
• While friendships are usually mutual, one person might have a crush on another that isn’t reciprocated.

• One-way roads allow travel in one direction only, by definition!

• A direct flight from one city to another does not necessarily imply there is also a direct return flight.

• When a graph claims to be a dependency graph but contains a cycle (a mutual dependency), problems typically arise. The algorithms we discuss today will work fine in the presence of cycles, but other algorithms might not.

Dependency graphs imply a “comes-before” relationship: you must take Fundies 2 before taking Algorithms, for example. It would be very problematic to claim that Algorithms depends on Fundies 2...and Fundies 2 likewise depends on Algorithms!

Do Now!

Which is more general: directed graphs or undirected graphs?

We can always represent an undirected graph as a directed graph, by replacing each edge in the undirected graph with a pair of directed edges pointing in opposite directions.

Graphs may also be weighted, meaning that each edge has a cost or value associated with it:
• Social graphs are unweighted: either a friendship exists or it does not.

• Road networks are weighted: recording the distance between places.

• Airline networks are weighted: they might record the price of a ticket, or the distance of the flight, or some other cost.

• A curriculum dependency graph is unweighted: each course is either a prerequisite or it is not.

Do Now!

Which is more general: weighted graphs or unweighted graphs?

We can always represent an unweighted graph as a weighted graph, by making all the edge weights the same (e.g., 1).

##### 30.1.1 Representing graphs

How might we choose to represent graphs? One possibility is to record a graph as an ArrayList<ArrayList<weight>>, where weight is whatever kind of weights we need (for our purposes, usually Integers), and the entry graph.get(i).get(j) gives the weight of connecting vertex $$i$$ to vertex $$j$$. Of course, not all vertices are connected to all other vertices, so we would need an “invalid” value to mark those cases: we might choose to use null for this. This representation is called the adjacency matrix representation, and it’s pretty convenient when almost every node is connected to almost every other node. It does have the drawback that we have to manually check for null all the time, and it also has the drawback that we can’t store any additional information about vertices, since we’re representing vertices merely as an index in the ArrayLists.
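
As a concrete illustration, here is a tiny adjacency matrix built with ArrayLists; the three vertices and their weights are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;

class AdjMatrixExample {
    public static void main(String[] args) {
        // A 3-vertex graph: graph.get(i).get(j) is the weight of the
        // edge from vertex i to vertex j, or null if there is no edge
        ArrayList<ArrayList<Integer>> graph = new ArrayList<>();
        graph.add(new ArrayList<>(Arrays.<Integer>asList(null, 5, null)));    // edges out of vertex 0
        graph.add(new ArrayList<>(Arrays.<Integer>asList(null, null, 2)));    // edges out of vertex 1
        graph.add(new ArrayList<>(Arrays.<Integer>asList(null, null, null))); // no edges out of vertex 2
        // We must remember to check for null before using a weight
        assert graph.get(0).get(1) == 5;    // the edge 0 -> 1 exists, with weight 5
        assert graph.get(0).get(2) == null; // there is no direct edge 0 -> 2
    }
}
```

Note that the table has an entry for every pair of vertices, whether or not an edge exists between them.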

An alternate representation might be to have explicit Vertex and Edge classes, as follows:
```java
class Vertex {
  // ... any data about vertices, such as people's names,
  // or places' GPS coordinates ...
  IList<Edge> outEdges; // edges from this node
}

class Edge {
  Vertex from;
  Vertex to;
  int weight;
}

class Graph {
  IList<Vertex> allVertices;
}
```
This representation is known as the adjacency list representation, and it avoids some of the problems of the adjacency matrix representation, while introducing some of its own.
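
To see the representation in action, the runnable sketch below swaps the course’s IList for an ArrayList and adds illustrative constructors (the name field, the city names, and the weight are made up for the example):

```java
import java.util.ArrayList;

class Vertex {
    String name; // vertices can now carry extra data, such as a name
    ArrayList<Edge> outEdges = new ArrayList<>(); // edges from this node
    Vertex(String name) { this.name = name; }
}

class Edge {
    Vertex from;
    Vertex to;
    int weight;
    Edge(Vertex from, Vertex to, int weight) {
        this.from = from;
        this.to = to;
        this.weight = weight;
        from.outEdges.add(this); // register this edge with its source vertex
    }
}

class Graph {
    ArrayList<Vertex> allVertices = new ArrayList<>();
}

class AdjListExample {
    public static void main(String[] args) {
        Graph g = new Graph();
        Vertex boston = new Vertex("Boston");
        Vertex nyc = new Vertex("NYC");
        g.allVertices.add(boston);
        g.allVertices.add(nyc);
        new Edge(boston, nyc, 215); // a one-way connection, with a weight
        assert boston.outEdges.size() == 1;
        assert nyc.outEdges.isEmpty(); // directed: no edge back
    }
}
```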

Do Now!

Design a method for Graph to collect all the edges in the graph.

Do Now!

In the adjacency-matrix representation, we could find all the vertices reachable from a given vertex by looking in a row of the matrix, and we could find all the vertices that reach a given vertex by looking in a column of the matrix. (Why?) In the adjacency-list representation, we can easily find all the vertices reachable from a given vertex: it’s simply the outEdges field. Design a method inEdges on Graph that computes the list of vertices that reach the given vertex.

##### 30.1.2 Aside: Describing the performance of graph algorithms

When we talk about graph algorithms, often their performance depends on the number of vertices and the number of edges in the graph. As a result, we often define their performance as functions of $$V$$, the number of vertices, and $$E$$, the number of edges.

Do Now!

Using big-$$O$$ notation, how many edges can there be in a graph, as a function of the number of vertices? Using big-$$\Omega$$, how few edges could there be in a graph? How few edges might there be if every vertex is connected?

We can also use these parameters to compare the two representations above, in terms of memory usage. For example, the memory requirements for the adjacency-matrix representation are $$O(V^2)$$, because there is an entry in the matrix for every pair of vertices, regardless of how many edges there are. In contrast, the adjacency-list representation uses $$O(V + E)$$ memory, because we allocate one Edge object for each edge, one ConsList object for each edge, and one Vertex object for each vertex. When $$E$$ is much smaller than $$V^2$$ (i.e., when the graph is sparse), this representation is more memory-efficient.

##### 30.2 Diving deep: Finding any path between two vertices via depth-first search

Finding a path from one vertex to another requires, at a bare minimum, the ability to recognize the destination vertex and report success.

Do Now!

How can we tell when two vertices of the graph are “the same”? Which notion of equality is most appropriate here?

The first thing to notice about our adjacency-list representation above is that we do not include a notion of names for vertices. (In the adjacency-matrix representation, by contrast, every node had a unique index.) So the only reliable mechanism we have to distinguish two Vertex objects is intensional equality. In particular, we are careful not to override equals with a customized equality check, so that we are guaranteed the default intensional equality. This works, and is convenient, but it means we need to be careful never to allocate new Vertex objects if we don’t mean to!

But how are we to actually find a path from one Vertex to another? Let’s try a simpler question, of merely determining whether there is such a path or not. When we considered ancestry trees, we considered a simple problem boolean hasAncestorNamed(String name), which checked the parents, grandparents and further for any Person with the given name. Perhaps we could implement something similar here? Let’s assume that we’ve implemented iterators for our ILists, as we did in Lecture 25: Iterator and Iterable, so that we can conveniently write a for-each loop to iterate over edges:
```java
// In Vertex
boolean hasPathTo(Vertex dest) {
  for (Edge e : this.outEdges) {
    if (e.to == dest               // can get there in just one step
        || e.to.hasPathTo(dest)) { // can get there on a path through e.to
      return true;
    }
  }
  return false;
}
```

Do Now!

Why can we not just simplify the if statement to return (e.to == dest) || e.to.hasPathTo(dest)?

Do Now!

Will this code always work? Give an explanation of why, or an example graph that breaks it.

The nice thing about ancestry trees is that, well, they were trees: it was impossible for any Person to be their own ancestor. Actually, tree is a bit of a misnomer, since in some families (such as various historical European aristocracies) distant relatives who share a common ancestor eventually have a child together. In these situations, the family tree is not tree-shaped, but is rather a slightly more general notion: a directed acyclic graph.

But we’re no longer dealing with ancestry trees: we have arbitrary graphs to deal with, and these graphs may include cycles. Consider the directed graph with edges A->B, B->C, B->D, and C->A, plus an isolated vertex E. Vertex A cannot reach vertex E, but if we run the code above starting at A, we’ll endlessly loop around the cycle A->B->C->A.

When we encountered this problem in Fundies 1, we introduced an accumulator parameter that stored all the nodes we’d already seen, so that we could abort the recursion before it looped indefinitely.

Do Now!

Implement this strategy for Vertex.

This approach will work, but let’s consider a slightly different way of organizing the problem. Let’s try to design a method on Graph instead, using loops instead of recursion, and let’s consider a different way of representing the accumulator. We’ll start with the following signature:
```java
// In Graph
boolean hasPathBetween(Vertex from, Vertex to) {
  ...
}
```
We’ll need to keep track of a list of alreadySeen vertices. And we’ll also need to keep track of a worklist: all the nodes that we have not yet finished processing. We’ll need to add and remove items at various places in these lists. Fortunately, we have implemented a datatype that allows us to do this: a Deque!
```java
// In Graph
boolean hasPathBetween(Vertex from, Vertex to) {
  Deque<Vertex> alreadySeen = new Deque<Vertex>();
  Deque<Vertex> worklist = new Deque<Vertex>();
  ...
}
```
Look again at how the original code above tried to work: it iterated over all the outgoing edges from a given vertex and recursively called hasPathTo to see if they could reach the destination. By virtue of the nature of function calls, when we return from a recursive call, we still have all the remaining outgoing edges to process. Effectively, they make up an implicit worklist. Now that we are managing the worklist ourselves, we’ll have to make that become explicit.

To start the algorithm, we know we must process from, so we should add that vertex to our worklist. Then our algorithm is very simple: as long as we haven’t found our destination yet, and as long as there are more places to try, process the next place. This maps very neatly to the following code skeleton:
```java
// In Graph
boolean hasPathBetween(Vertex from, Vertex to) {
  Deque<Vertex> alreadySeen = new Deque<Vertex>();
  Deque<Vertex> worklist = new Deque<Vertex>();
  // Initialize the worklist with the from vertex
  ...add from into worklist...
  // As long as the worklist isn't empty...
  while (!worklist.isEmpty()) {
    Vertex next = ...get (and remove) the next item off the worklist...;
    if (next.equals(to)) {
      return true; // Success!
    } else if (alreadySeen.contains(next)) {
      // do nothing: we've already seen this one
    } else {
      ...next is a vertex we haven't seen yet: process it...
    }
  }
  // We haven't found the to vertex, and there are no more to try
  return false;
}
```
There are some gaps to fill in this skeleton. The primary decision we have to make is where to add and remove items in the worklist. Think about the order in which nodes were visited in the recursive approach above: as we visit each neighbor of a vertex, we process that neighbor completely before backing up and trying again with the next neighbor. Let’s say we decide to always remove and process the first node of the worklist. To match the recursive approach’s behavior, we should therefore make certain to add new nodes for processing at the front of the Deque, so that we’ll process them completely (and remove them) before moving on to their neighbors. This leads to the following complete code:
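Filling in the skeleton along these lines gives the following sketch. It uses java.util.ArrayDeque in place of our Deque (addFirst and removeFirst play the roles of addAtHead and removeFromHead), and simplifies Vertex to store its neighbors directly rather than through Edge objects:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;

// Simplified Vertex: neighbors stored directly, rather than via Edge objects
class Vertex {
    ArrayList<Vertex> neighbors = new ArrayList<>();
}

class Graph {
    // Is there a path from `from` to `to`?
    boolean hasPathBetween(Vertex from, Vertex to) {
        Deque<Vertex> alreadySeen = new ArrayDeque<>();
        Deque<Vertex> worklist = new ArrayDeque<>();
        // Initialize the worklist with the from vertex
        worklist.addFirst(from);
        while (!worklist.isEmpty()) {
            Vertex next = worklist.removeFirst();
            if (next == to) { // == is intensional equality, as discussed above
                return true;  // Success!
            } else if (alreadySeen.contains(next)) {
                // do nothing: we've already seen this one
            } else {
                // next is a vertex we haven't seen yet: add its neighbors
                // to the FRONT of the worklist, so they get processed
                // completely before anything already waiting
                for (Vertex n : next.neighbors) {
                    worklist.addFirst(n);
                }
                alreadySeen.addFirst(next);
            }
        }
        // We haven't found the to vertex, and there are no more to try
        return false;
    }
}

class DfsExample {
    public static void main(String[] args) {
        Vertex a = new Vertex(), b = new Vertex(), c = new Vertex(),
               d = new Vertex(), e = new Vertex();
        a.neighbors.add(b);
        b.neighbors.add(c);
        b.neighbors.add(d);
        c.neighbors.add(a); // the cycle A -> B -> C -> A
        Graph g = new Graph();
        assert g.hasPathBetween(b, a);  // found via B -> C -> A
        assert !g.hasPathBetween(a, e); // terminates despite the cycle
    }
}
```

Comparing vertices with == works here because we rely on the default intensional equality for Vertex objects and never override equals.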
Let’s trace the behavior of this algorithm, trying to find a path between B and E.
1. The worklist starts off as [], and alreadySeen is also [].

2. We add B to the worklist.

3. We start the while loop. Since the worklist isn’t empty, we remove the first item from it and assign it to next.

worklist: [], alreadySeen: [], next: B

4. Since next isn’t node E, and since alreadySeen does not contain it, we get to the else case, and add each of B’s neighbors (namely C and D) to the front of the worklist.

worklist: [D, C], alreadySeen: [B]

Do Now!

Why are C and D “backwards” in the worklist?

5. We return to the start of the while loop. Since the worklist isn’t empty, we remove the first item from it and assign it to next.

worklist: [C], alreadySeen: [B], next: D

6. Since next isn’t node E, and since alreadySeen does not contain it, we get to the else case, and add each of D’s neighbors (of which there are none) to the front of the worklist.

worklist: [C], alreadySeen: [D, B]

7. We return to the start of the while loop. Since the worklist isn’t empty, we remove the first item from it and assign it to next.

worklist: [], alreadySeen: [D, B], next: C

8. Since next isn’t node E, and since alreadySeen does not contain it, we get to the else case, and add each of C’s neighbors (namely A) to the front of the worklist.

worklist: [A], alreadySeen: [C, D, B]

9. We return to the start of the while loop. Since the worklist isn’t empty, we remove the first item from it and assign it to next.

worklist: [], alreadySeen: [C, D, B], next: A

10. Since next isn’t node E, and since alreadySeen does not contain it, we get to the else case, and add each of A’s neighbors (namely B) to the front of the worklist.

worklist: [B], alreadySeen: [A, C, D, B]

11. We return to the start of the while loop. Since the worklist isn’t empty, we remove the first item from it and assign it to next.

worklist: [], alreadySeen: [A, C, D, B], next: B

12. Since next isn’t node E, but since alreadySeen now does contain it, we do nothing, and silently discard node B again.

worklist: [], alreadySeen: [A, C, D, B]

13. We return to the start of the while loop. Now the worklist is empty, so we exit the loop, and return false.

Do Now!

Trace through the evaluation of this method, trying to find a path between A and C. Which nodes, if any, are left over in the worklist?

This algorithm descends deeply into the graph, following edges outward from each successive node as far as it can, and backtracking only when it reaches a dead end or a vertex it has already seen. Accordingly, it is known as depth-first search (or DFS).

Do Now!

What is its runtime?

The best-case scenario is that the starting and ending nodes are one and the same, and the algorithm finishes instantly, in $$\Omega(1)$$ time and space. This case isn’t very interesting; it’s exceedingly unlikely that this occurs. So what is the worst-case behavior? Following the idiom that “it’s always the last place you look”, the worst-case behavior means that we have to examine every single edge and node before finding the one we want. Accordingly, the algorithm seems to take $$O(V + E)$$ time to run. But our analysis was too glib: we didn’t quite count the cost of the loop body correctly. In particular, alreadySeen.contains(next) could itself take $$O(V)$$ time to run, since it’s a simple Deque whose cost of searching is linear in its length. So the actual runtime here is $$O(V^2 + E)$$, which is simply $$O(V^2)$$.

Exercise

This observation makes for a very simple improvement to this algorithm: instead of using a Deque, what other data structure could we use to improve this bottleneck? And how might we need to change our Vertex class to enable more efficient look-ups?

How much storage does the algorithm need? (Note that this is separate from how much storage the graph itself needs; we discussed that earlier and it depends on the graph representation.) The alreadySeen accumulator can contain at most every node in the graph, exactly once, so it uses $$O(V)$$ space. The worklist is a little trickier to analyze. In the worst case, it could actually contain $$O(V^2)$$ nodes, because some nodes appear multiple times.

Exercise

Work out a graph for which this might happen. Hint: it has lots of edges.

Accordingly, our algorithm needs a total of $$O(V^2)$$ space to work, in the worst case.

##### 30.3 Broadening horizons: Finding any path between two vertices via breadth-first search

In the algorithm above, we chose to process a vertex completely before backtracking and moving to one of its neighbors, and we implemented that by always adding new items to the worklist at the front, which is where we also removed items. What if we changed that choice, and added items to the back? Practically speaking, this changes exactly two lines of code (both calls to addAtHead become calls to addAtTail). But what effect does this change have on the algorithm?

By adding new nodes to the back of the worklist, we effectively make them “wait their turn”, so that previously-added nodes get processed first. This means that we will process the starting node first, then all of its neighbors, then all of their neighbors (that haven’t already been processed), then all nodes 3 edges away from the start (that haven’t already been processed), then all nodes 4 edges away, etc. Our search broadens outward from the start, rather than diving deeply into a single path. Accordingly, this is known as a breadth-first search (or BFS).

Because our Deque allows us to add items to the front or the back of the list in constant time, this change does not affect the runtime or memory usage of our algorithm at all.

##### 30.4 Removing repetition

The implementations of BFS and DFS are practically identical: they differ only in the strategy used to add new items to the worklist. How can we eliminate this duplication? What operations do these two algorithms need from worklists? Merely the ability to add an item, remove the first item, and check if the worklist is empty. We can define an interface that does just this and nothing more:
```java
// Represents a mutable collection of items
interface ICollection<T> {
  // Is this collection empty?
  boolean isEmpty();

  // EFFECT: adds the item to the collection
  void add(T item);

  // Returns the first item of the collection
  // EFFECT: removes that first item
  T remove();
}
```
Now, we can define two implementations of this interface, that wrap a Deque<T> and implement these two strategies by delegating to that Deque.

##### 30.4.1 Stacks: Last-in, First-out

One possible implementation is the one needed by depth-first search:
```java
class Stack<T> implements ICollection<T> {
  Deque<T> contents;

  Stack() {
    this.contents = new Deque<T>();
  }

  public boolean isEmpty() {
    return this.contents.isEmpty();
  }

  public T remove() {
    return this.contents.removeFromHead();
  }

  public void add(T item) {
    this.contents.addAtHead(item);
  }
}
```
This behavior is known as a stack: adding new items keeps piling them on top of existing ones, and the first item to be processed is the one most recently added. (This is occasionally described as “last-in, first-out” or LIFO behavior.)

##### 30.4.2 Queues: First-in, First-out

The alternative implementation, needed by breadth-first search, adds items to the back:
```java
class Queue<T> implements ICollection<T> {
  Deque<T> contents;

  Queue() {
    this.contents = new Deque<T>();
  }

  public boolean isEmpty() {
    return this.contents.isEmpty();
  }

  public T remove() {
    return this.contents.removeFromHead();
  }

  public void add(T item) {
    this.contents.addAtTail(item); // NOTE: Different from Stack!
  }
}
```
This behavior is known as a queue: newly added items get in line behind existing ones, and the first item to enter the queue is the first item to be processed. (This is occasionally described as “first-in, first-out” or FIFO behavior.)

##### 30.4.3 Rewriting BFS and DFS

We can now factor out the commonalities of breadth- and depth-first search, by supplying them with a worklist object, ready to use:
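For instance, the unified search might look like the following sketch. To make it self-contained and runnable, it backs Stack and Queue with java.util.ArrayDeque, simplifies Vertex to hold its neighbors directly, and uses the illustrative wrapper names hasPathDFS and hasPathBFS:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;

// The minimal worklist interface from above
interface ICollection<T> {
    boolean isEmpty();
    void add(T item);
    T remove();
}

class Stack<T> implements ICollection<T> {
    ArrayDeque<T> contents = new ArrayDeque<>();
    public boolean isEmpty() { return this.contents.isEmpty(); }
    public T remove() { return this.contents.removeFirst(); }
    public void add(T item) { this.contents.addFirst(item); }
}

class Queue<T> implements ICollection<T> {
    ArrayDeque<T> contents = new ArrayDeque<>();
    public boolean isEmpty() { return this.contents.isEmpty(); }
    public T remove() { return this.contents.removeFirst(); }
    public void add(T item) { this.contents.addLast(item); } // different from Stack!
}

// Simplified Vertex: neighbors stored directly
class Vertex {
    ArrayList<Vertex> neighbors = new ArrayList<>();
}

class Graph {
    // One search method; the worklist's add strategy decides DFS vs. BFS
    boolean hasPathBetween(Vertex from, Vertex to, ICollection<Vertex> worklist) {
        ArrayList<Vertex> alreadySeen = new ArrayList<>();
        worklist.add(from);
        while (!worklist.isEmpty()) {
            Vertex next = worklist.remove();
            if (next == to) {
                return true;
            } else if (!alreadySeen.contains(next)) {
                for (Vertex n : next.neighbors) {
                    worklist.add(n);
                }
                alreadySeen.add(next);
            }
        }
        return false;
    }

    boolean hasPathDFS(Vertex from, Vertex to) {
        return this.hasPathBetween(from, to, new Stack<Vertex>());
    }

    boolean hasPathBFS(Vertex from, Vertex to) {
        return this.hasPathBetween(from, to, new Queue<Vertex>());
    }
}

class SearchExample {
    public static void main(String[] args) {
        Vertex a = new Vertex(), b = new Vertex();
        a.neighbors.add(b);
        Graph g = new Graph();
        assert g.hasPathDFS(a, b) && g.hasPathBFS(a, b);
    }
}
```

Both searches visit the same set of vertices; only the order of visiting changes with the choice of worklist.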