By Corey L. Lanum
In this article, excerpted from Visualizing Graph Data, we’ll introduce the concept of a graph and its history and uses.
In December 2001, the Enron Corporation filed for what was at the time the largest-ever corporate bankruptcy. Its stock had fallen from a high of $90 per share the previous year to $0.61, decimating its employees’ pensions and shareholders’ investments in it. The FBI’s investigation into this collapse became the largest white-collar criminal investigation in history as they seized over 3000 boxes of documents and 4 terabytes of data. Among the information seized were about 600,000 e-mails between key executives at the organization. Although the FBI took pains to read every e-mail individually, the investigators recognized that they were unlikely to find a smoking gun – people committing complex financial fraud don’t often disclose their actions in a written form. And in 2001, e-mails were only starting to become the primary means of internal communications; lots of information was still exchanged via phone calls.
In addition to looking at the text of individual e-mails, the FBI also wanted to uncover patterns in the communications, perhaps to better understand who the decision makers were within Enron and who had access to much of the company’s internal information. To do this, they modeled the e-mails in Enron as a graph.
A graph is a model of data that consists of nodes, which are discrete data elements, and edges, which are relationships between nodes. The graph model brings to the forefront relationships that may be hidden in tabular views of the same data and illustrates what is most important. By making those relationships between the data elements a core part of the data structure, it helps you identify patterns in the data that wouldn’t otherwise be apparent.
What is a graph?
Figure 1: A simple property graph with two nodes and an edge. Stella (the first node) bought a 2008 Volkswagen Jetta (the second node) in September 2007 and sold it in October 2013. Modeling this as a graph highlights that Stella had a relationship with this car (the edge).
A graph is a set of interconnected data elements, expressed as a series of nodes and edges.
In the common definition of a graph, each edge has exactly two endpoints. An edge, also called a link, can take one of two forms:
- directed – the relationship has a direction. In the example above, Stella owns the car, but it doesn’t make sense to say the car owns Stella.
- undirected – the two items are linked without the concept of direction; the relationship inherently goes both ways. If Stella is linked to Roger because they committed a crime together, saying that Stella was arrested with Roger means the same thing as saying that Roger was arrested with Stella.
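The two forms can be sketched in a few lines of Python. This is only an illustration, assuming a directed edge is stored as an ordered (source, target) pair and an undirected edge as an unordered pair; the helper names `has_directed_edge` and `has_undirected_edge` are made up for this example.

```python
# Directed edge: an ordered (source, target) pair, so order matters.
owns = ("Stella", "2008 Volkswagen Jetta")

# Undirected edge: an unordered pair, so a frozenset is a natural fit.
arrested_with = frozenset({"Stella", "Roger"})

directed_edges = {owns}
undirected_edges = {arrested_with}


def has_directed_edge(edges, source, target):
    """Return True only if an edge runs from source to target."""
    return (source, target) in edges


def has_undirected_edge(edges, a, b):
    """Return True if a and b are linked, regardless of order."""
    return frozenset({a, b}) in edges


# Stella owns the car, but the car does not own Stella.
print(has_directed_edge(directed_edges, "Stella", "2008 Volkswagen Jetta"))  # True
print(has_directed_edge(directed_edges, "2008 Volkswagen Jetta", "Stella"))  # False

# "Arrested with" reads the same in both directions.
print(has_undirected_edge(undirected_edges, "Stella", "Roger"))  # True
print(has_undirected_edge(undirected_edges, "Roger", "Stella"))  # True
```

Storing the undirected edge as a `frozenset` means the lookup does not need to check both orderings.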
Below, we see an example of a directed link with properties.
Figure 2: A property graph of a single e-mail between Enron executives. The two nodes are the sender and recipient of the e-mail, and the edge is the e-mail.
Both nodes and edges can have properties, which are key-value pairs describing either the data element itself or the relationship. Figure 1 is a simple property graph showing that Stella bought a 2008 Volkswagen Jetta in September 2007 and sold it in October 2013. Modeling this as a graph highlights that Stella had a relationship with this car, albeit temporarily.
An e-mail is a relationship, too, between the sender and the recipient. The properties of the nodes are things like e-mail address, name, and title, and the properties of the relationship are the date/time it was sent, its subject line, and the text of the e-mail.
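One way to picture such a property graph in code is shown below. It is a minimal sketch, assuming nodes and edges are plain Python dataclasses that each carry a dictionary of key-value properties. The class names `Node` and `Edge`, the property keys, and every value are placeholders chosen for illustration; none of them come from the Enron data.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Node:
    """A discrete data element with key-value properties."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relationship between two nodes, with its own properties."""
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)


# Placeholder values only; not taken from any real e-mail.
sender = Node("n1", {
    "name": "Sender Name",
    "email": "sender@example.com",
    "title": "Example Title",
})
recipient = Node("n2", {
    "name": "Recipient Name",
    "email": "recipient@example.com",
    "title": "Example Title",
})

# The e-mail itself is the edge; its properties describe the relationship.
email = Edge(sender, recipient, {
    "sent_at": datetime(2001, 1, 1, 9, 30),
    "subject": "Example subject",
    "body": "Example text.",
})

print(f"{email.source.properties['name']} -> "
      f"{email.target.properties['name']}: {email.properties['subject']}")
```

Keeping properties in a dictionary mirrors the key-value description above; a real application might prefer typed fields instead.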
To prove conspiracy, the FBI was interested in all the e-mails sent among the Enron executives, not just a single one, so let’s add some more nodes to represent a larger number of e-mails sent during a specified period of time.
Figure 3: A graph of some of the Enron executives’ e-mail communications. You can easily see that Timothy Belden is a hub of communication in this segment of Enron, sending and receiving e-mail from many other executives.
This is called a directed graph because it matters whether Kevin Presto sent an e-mail to Timothy Belden or received one. The arrowheads on the edges show that directionality: Kevin Presto sent an e-mail to Timothy Belden, but Timothy Belden did not reply, indicating they may not have been close associates, or they may have spoken offline. As we start to add more data to our graph, you can see the value of graphs – patterns become apparent. In the example above, you can easily see that Timothy Belden is a hub of communication in this segment of Enron, sending and receiving e-mail from many other executives.
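The same hub can be found without a picture by counting how many e-mails each person sent and received. The sketch below assumes the graph is stored as a list of (sender, recipient) pairs. Only the Kevin Presto to Timothy Belden edge is described in the text; the remaining edges and the names "Executive A", "Executive B", and "Executive C" are invented for illustration and are not real Enron data.

```python
from collections import Counter

# Illustrative (sender, recipient) pairs; not the real Enron e-mail set.
emails = [
    ("Kevin Presto", "Timothy Belden"),
    ("Timothy Belden", "Executive A"),
    ("Executive A", "Timothy Belden"),
    ("Timothy Belden", "Executive B"),
    ("Executive C", "Timothy Belden"),
    ("Executive B", "Executive C"),
]

# Out-degree: e-mails sent. In-degree: e-mails received.
sent = Counter(sender for sender, _ in emails)
received = Counter(recipient for _, recipient in emails)


def total_degree(person):
    """Number of e-mails a person sent plus the number they received."""
    return sent[person] + received[person]


people = set(sent) | set(received)
hub = max(people, key=total_degree)
print(hub, sent[hub], received[hub])  # Timothy Belden 2 3

# Direction matters: Presto wrote to Belden, but there is no reply edge.
print(("Kevin Presto", "Timothy Belden") in emails)  # True
print(("Timothy Belden", "Kevin Presto") in emails)  # False
```

Counting edges per node is the simplest measure of a hub; it says nothing about why a person is central, only that many messages pass through them.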
Building graph data structures is only half the solution to pattern recognition. The other half is visualization: with interactive node-link diagrams and the variety of tools available today, you can create your own dynamic, interactive visualizations of graph data.