What do you do if you need to visualize a large graph with hundreds of thousands or millions of vertices? Force-directed graph layouts are a good, general-purpose way to see the structure of a graph, but the algorithms can take several hours to run on such large graphs.

I propose a new way to speed up force-directed graph layout algorithms called Random Vertex Sampling. This is a simple approximation algorithm that runs in linear time while still producing good graph layouts! It also has lower memory requirements than other approximation algorithms like Barnes-Hut or Fast Multipole. I’m presenting a paper about Random Vertex Sampling this year at EuroVis, and I released an open source D3.js plugin.

Random Vertex Sampling

Force-directed graph layout algorithms work by modeling the graph’s vertices as charged particles that repel each other and the graph’s edges as springs that try to maintain an ideal distance between connected vertices. The algorithms run an iterative physics simulation to find a good set of vertex positions that minimizes these forces.

Comparison of the brute-force repulsive force algorithm and the Random Vertex Sampling algorithm. LEFT: The brute-force algorithm computes repulsive forces between each pair of vertices. RIGHT: Random Vertex Sampling uses a sliding window to select a subset of vertices to update. For each vertex in the update set, it randomly samples a smaller set of vertices to use for computing repulsive forces. Each vertex also has a small, constant-sized number of “neighbor” vertices that they use to compute repulsive forces. The algorithm is iterative, so after one iteration, the window slides forward and computes forces on a different subset of vertices.

In force-directed graph layouts, repulsive force calculations between the vertices are the main performance bottleneck. The brute-force algorithm computes repulsive forces between each pair of vertices, and therefore runs in O(n^{2}) time at each iteration (n is the number of vertices in the graph).

Random Vertex Sampling reduces this runtime to O(n). It works by using a sliding window of length n^{¾} to select a subset of vertices to update their velocity. For each vertex in the update set, it randomly selects n^{¼} vertices to use for computing repulsive forces. Each vertex also has a constant-sized list of “neighbor” vertices used for computing repulsive forces on it. Because we choose the exponents carefully, and because the “neighbor” vertex lists are constant in size, the overall algorithm runs in linear time with O(n^{¾}) auxiliary space requirements. Before the next iteration, the window slides forward n^{¾} positions in the vertex list to update a different set of vertices during the next iteration. Then the whole process repeats itself.

In contrast, consider tree-based approximation algorithms like Barnes-Hut or Fast Multipole. They build a spatial tree to approximate forces by aggregating distant vertices. This enables a O(n log n) runtime, which is pretty good, but still not as fast as Random Vertex Sampling. But, the spatial tree also comes with the expense of requiring O(n log n) auxiliary memory to store the tree, which is much more than Random Vertex Sampling.

Faster and Just as Good!

In my experiments, I found that force-directed algorithms using Random Vertex Sampling run about 3 times faster on average than algorithms using Barnes-Hut. That’s a big improvement! Even better, it converges to a good layout faster, too.

I also found that Random Vertex Sampling produces layouts that are about the same quality as Barnes-Hut. Sometimes the layouts look a little more chaotic, though. We can improve this by first computing a layout using Random Vertex Sampling, and then running a few iterations of Barnes-Hut to smooth out the layout. That runs almost as fast as using Random Vertex Sampling by itself.

The detailed experiments use statistical analysis on five different graph layout quality metrics using a wide variety of graph types (social networks, transportation networks, geometric graphs, etc.). If you’re interested, see my research paper for all the details.

More Information

Try it out for yourself! It’s available in d3-force-sampled, a plugin for D3’s force-directed graph layout package v4 and above.

I am pleased to announce d3-force-reuse, a new D3.js plugin for computing faster force-directed graph layouts. This new library reduces graph layout computation time by 10% to 90% depending on the graph. The best part is that it is able to accomplish this without decreasing layout quality!

How It works

This library is based on my new research on speeding up the Barnes-Hut approximation. It does this by reusing approximations instead of computing new ones at each iteration of the layout algorithm. The standard Barnes-Hut approximation computes a quadtree at the beginning of each iteration. It then uses this quadtree to approximate repulsive forces for distant nodes by aggregating groups of nodes and treating them as one large node. This reduces the force calculation from quadratic running time down to O(n log n), which greatly improves runtime. If you’re interested in learning more about Barnes-Hut, Jeff Heer has a wonderful in-depth explanation.

My new research experiments with reusing the quadtree from one iteration to the next. This reduces the overall runtime by creating fewer trees, which saves several O(n log n) runtime operations. The d3-force-reuse library does this by only computing a new quadtree once every 13 iterations. In between, it reuses the most recently created quadtree. This is what I call the “uniform schedule”, which is in contrast to the “standard schedule” that Barnes-Hut normally uses. One alternative is a logarithmic schedule, which creates a new quadtree more often earlier in the simulation and less often later. I also tested a dynamic schedule that dynamically looks at node movement to decide when to create a new quadtree.

Fast, Good Quality Graph Layouts

My experiments show that creating a new quadtree once every 13-14 iterations using the uniform schedule decreases the runtime by 10% to 90%. This is a bigger improvement than the logarithmic schedule, but sometimes the dynamic schedule is a little faster.

Furthermore, the uniform schedule’s graph layout quality is at least as good as the standard Barnes-Hut algorithm, and sometimes a little better (on average)! We can measure the graph layout quality with Greadability.js by measuring the number of edge crossings, edge crossing angle, and the angle of incident edges. The uniform schedule always performs at least as well as the standard Barnes-Hut algorithm, and sometimes it even performs a little better! The logarithmic schedule sometimes performs a little worse, and the dynamic schedule performs the worst of them, but not by a lot.

Graph datasets are everywhere, and whether you call them graphs or networks, visualizing them is a challenging problem. In some sense, graph visualizations take an n-dimensional dataset and visualize it in two dimensions. (Or three dimensions, but that’s a debate for another day.) It might seem like we are losing a lot of information, but by carefully choosing the visualization we can actually enhance (or hinder!) our ability to find trends, patterns, clusters, or outliers in the data.

Let’s talk about six ways you might visualize graphs. We will use the Les Miserables dataset for each example so we can see how different visualizations help us answer different questions about the novel.

Force-Directed Layout

An example graph using a force-directed layout. Parts of the graph are hairballs, but we can see overall structure and trace which nodes are connected to the central characters. Interactive example.

Force-directed layouts are a type of node-link diagram, where the graph is visualized as nodes with links connecting them. The force-directed layout is one of the most common ways to visualize a graph. The basic algorithm works like a physics simulation where nodes push each other away, but links between nodes pull them together.

One advantage of force-directed layouts is that they produce similar layouts as humans^{1}. Studies show that node-link diagrams are better suited for sparse graphs with tens of nodes^{2,3,}^{4} and for the task of tracing paths between nodes^{1,2,3}. However, large or dense graphs produce “hairball” node-link diagrams that are difficult to understand.

Algorithms for force-directed layouts are actually a family of algorithms, and they have many cousins that produce similar visualizations, such as stress majorization and multilevel algorithms.

Group in a Box

Group-in-a-box using a treemap arrangement. This shows which communities are densely connected, and which are the main nodes connecting the communities. Interactive example.

What if you want to see connections within and between groups of nodes? The group-in-a-box layout has your back! This example uses the classic group-in-a-box layout. Each group of nodes is placed in a region of a treemap, and the nodes within a group are laid out using a force-directed layout. This allows us to see intra- and inter-group relationships.

Node grouping can be quite flexible. Groups can be communities or clusters of tightly connected nodes. Groups can also be formed by nodes that share the same attribute, such as country of origin.

Radial

A radial layout of character interactions, with characters grouped by the volume, chapter, and book they first appear in. Several characters primarily only interact with other characters from their chapter. Interactive example.

Radial layouts position nodes in a circle, typically grouping nodes so you can see connection patterns within and between groups. In this regard it is similar to group-in-a-box layouts. Nodes can also be laid out using a linear node order, which can be especially useful if there is a cyclic nature to the node order.

A radial layout can work well for smaller graphs, but use caution for large graphs. When there are hundreds or thousands of nodes there can be a lot of wasted space and it can be difficult to fit the entire visualization on the screen.

Tracing links can sometimes be difficult, especially if there are lots of links. This example uses edge bundling, which can reduce clutter and create a pleasing aesthetic. But it can also make it harder to trace individual links from one node to the other.

Arc

An arc diagram showing characters ordered by the first chapter they appear in. Jean Valjean is introduced early and interacts with characters through the book. Interactive example.

Arc diagrams show the graph’s nodes in a linear order, like the order that characters were first mentioned in a book. This makes them a useful tool for examining sequential patterns in graphs. Arc diagrams have been used to explore references within books and periodic structure in music.

Be mindful of some design considerations. The size of the visualization can quickly become too large if there are more than a few hundred nodes. It can also be difficult to identify clusters of tightly connected nodes because the layout emphasizes order and not clusters.

Semantic Substrate

A semantic substrate showing the relationship between the number of interactions and the number of appearances of each character. Jean Valjean is the highest ranked character in both regards. Interactive example.

If we take an arc diagram and add a second axis, we get a semantic substrate. It’s like a scatterplot, but with links connecting nodes.

Semantic substrates can simultaneously show node connections as well as trends, patterns, clusters, or outliers in the nodes’ attributes. As with arc diagrams, this comes at the expense of being able to easily see some types of node connection patterns like motifs or clusters of connected nodes.

Matrix Diagram

A matrix diagram of the Les Miserables dataset. Each community has its own color and is easily identifiable. Interactive example.

Matrix diagrams show the graph as a grid. Each node has a row and a column, and cells are filled in if there is a link connecting two nodes. It’s a good visualization choice for many analysis tasks, such as counting the number of nodes and finding common neighbors between nodes, especially for dense graphs with lots of links^{2,3,4}.

One benefit of matrix diagrams is that there is never any occlusion, whereas node-link diagrams can have nodes and links drawn on top of each other, making the visualization harder to understand. Matrix diagrams can work especially well if the rows and columns are ordered to reveal the structure of the graph. In this example, the visualization shows clusters of connected nodes grouped together in squares. But if there are many nodes, it can be hard to make the labels legible and still fit the entire matrix on the screen.

Caveats

These six visualizations are common, powerful ways to visualize graphs, but there are lots of other techniques designed to answer other types of questions. For example, other visualizations show changes in a graph over time, and some visualizations work better with large graphs.

Some of these visualization recommendations are based on perceptual research from the scientific community. One thing to keep in mind is that a lot of existing research focuses on visualizing small graphs (i.e. graphs with 100 nodes or fewer). There are a lot of open questions for large graphs, for example we do not have empirical evidence about what kinds of tasks are easier to perform on node-link diagrams versus matrix diagrams of large graphs. We need more research in this area to provide guidance to visualization practitioners as well as to illuminate the shortcomings of visualizations and guide future graph visualization research.

Further reading

Frank van Ham and Bernice Rogowitz. Perceptual Organization in User-Generated Graph Layouts. IEEE Transactions on Visualization and Computer Graphics, 14(6):1333–1339, 2008.

Rene Keller, Claudia M. Eckert, and P. John Clarkson. Matrices or node-link diagrams: which visual representation is better for visualising connectivity models? Information Visualization, 5(1):62-76, 2006.

Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola. A Comparison of the Readability of Graphs Using Node-Link and Matrix-Based Representations. IEEE Symposium on Information Visualization, pages 17-24, 2004.

Mohammad Ghoniem, Jean-Daniel Fekete, and Philippe Castagliola. On the readability of graphs using node-link and matrix-based representations: a controlled experiment and statistical analysis. Information Visualization, 4(2):114–135, 2005.