From Graph Databases in Action by Dave Bechberger

In this article, we’ll review what makes a problem a good graph use case. We’ll start by examining a few general categories of problems and discussing why they might make for good graph use case.  Finally, we’ll analyze a general framework that we can use to help us decide if our problem is a good graph use case.

 


Take 37% off Graph Databases in Action. Just enter fccbechberger into the discount code box at checkout at manning.com.


Give a man a fish, and you will feed him for a day.  Teach a man to fish, and you feed him for a lifetime.

-Old Proverb

From social network analysis, recommendation engines, dependency analysis, fraud detection, master data management, to search problems or perform research on the Internet, and you’ll quickly encounter a listing of good use cases for graph databases.  The problem with such a list is that unless our problem is one of those specific problems, it’s hard to know if it’s a good fit for a graph database.  Instead of focusing on specific use cases, we decided to focus this section on the process of determining if our problem is a viable candidate to use a graph database. We believe that this approach offers you a better tool to decide if there’s a graph problem, rather than merely a rote listing of use cases.

Explore the Questions

While reading through the vast array of information on graph databases available on the Internet, you’ll likely to come across the statement that “Everything is a graph problem.”  The real world is easily described in graph terms, but saying that everything is solved by one type of database is a drastic oversimplification. That a problem can be represented as a graph doesn’t necessarily mean that a graph database is the best technology to choose to solve that problem.

What problems are we trying to solve?

This response is the most critical piece of information we need when deciding on any database technology. Knowing and understanding the problem we’re solving provides crucial details about what questions we’re going to ask, which types of data we need to store, and how we need to retrieve it. These details are critical factors in choosing the right database.

Let’s examine the types of problems we might be tasked with when building an application and discuss what makes them a good or bad candidate for using a graph database.

Selection/Search

  • Give me everyone who works at X?
  • Who in my system has a first name like “John”?
  • Locate all stores within X miles?

These types of problems are generally classified as Search or Selection problems.  Although you can answer these sorts of problems with a graph database, these types of problems aren’t utilizing the advantages of graph structure, such as rich relationships, recursive queries, or path algorithms.  On the other hand, it’s strongly-advised to use a relational database, such as PostgreSQL[1], or a search server such as Apache Lucene[2], Apache Solr[3], or Elasticsearch[4] with these Search or Selection type problems.  These databases are more mature, in the case of RDBMS, or more optimized, in the case of search servers, to answer these sorts of questions. These problems don’t use the relationships in our data: and in my experience, it’s unlikely that taking on the additional complexities of working with graph databases is worthwhile.

Verdict: Use an RDBMS or search server

 

Related or Recursive Data

  • What’s the easiest way for me to be introduced to an executive at X?
  • How do “John” and “Paula” know each other?
  • How’s company X related to company Y?

Questions that explore the relationships between entities provide a strong use case for a graph database.  A major strength of graph databases is that relationships add meaning and provide topological value to your data.  Graph query languages allow you to use the power of these relationships with straightforward query syntax.  Although not impossible in RDBMS type systems, these sorts of “friends of friends” queries require complex and difficult to maintain recursive Common Table Expressions (CTE) or complex joins across many different tables.  These problems are easier and quicker to solve using graph databases.

Verdict: Use a graph database!

 

Aggregation

  • How many companies are in my system?
  • What are my average sales for each day over the past month?
  • What’s the number of transactions processed by my system each day?

Questions that relate to aggregation of data constitute an excellent use case for an RDBMS database. RDBMS databases are optimized to perform complex aggregation queries quickly and with a minimal amount of overhead. These same sorts of queries could be performed in graph databases, but the nature of graph traversals requires that much more of the data is touched, which causes higher query latency and resource utilization.

Verdict: Use an RDBMS!

 

Pattern Matching

  • Who in my system has a similar profile to me?
  • Does this transaction look like other known fraudulent transactions?
  • Is the user “J. Smith” the same as “Johan S.?”

Problems based on matching patterns represent another strong use case for graph databases. Pattern matching based on how entities are related is a prime example of how we can use the power of graph databases. Typical use cases for this sort of query involve things like recommendation engines, fraud detection, or intrusion detection. Pattern matching use cases are such common graph use cases that graph query languages have specific, built in features to precisely handle these sorts of queries.

Verdict: Use a graph database!

 

Centrality, Clustering, and Influence

  • Who’s the most influential person I’m connected with on LinkedIn?
  • What equipment in my network will have the most substantial impact if it breaks?
  • What parts tend to fail at the same time?

Problems around Centrality, Clustering, and Influence are another typical use case for graph databases. Examples for these types of problems require finding the most influential person in a Twitter network, identifying critical pieces of infrastructure, or locating groups of entities within your data. Calculating the answers to these sorts of problems requires looking at entities, their relationships, and the incident relationships and adjacent entities. As with pattern matching use cases, these types of problems often have specific, built in graph query languages features.

Verdict: Use a graph database

 


I’m still confused… Is this a graph problem?

The types of problems discussed this far provide a significant first step in deciding if your problem is a good candidate for using a graph, but what if your problem doesn’t neatly fit into one of these pre-defined types. In this section, we’ll use the “friends of friends” problem to apply a decision framework to help us decide if we have a good problem for a graph.


Figure 1  Friends of Friends Problem Graph


First, take a moment and study the decision tree below. Then we’ll examine each of the questions and analyze why they matter.


Figure 2  Decision Tree for Deciding if We have a Graph Use Case


Do we care about the relationships between entities as much or more than the entities themselves?

This response to this question is a critical clue to decide if our problem is a good candidate for using a graph database. This question speaks to the heart of one of the most powerful features of graph databases: are relationships are as meaningful as the entities.  If our answer to this question is “Yes,” then our problem will likely use these meaningful relationships, which makes it an excellent candidate.  If our answer is “No,” then it’s less likely that we’ll use these meaningful relationships.  If there’s no need to use the relationships between data, then it’s unlikely, but not impossible, that our problem will benefit from using a graph database.

In the case of our “friends of friends” problem, the answer to this question is “yes,” which makes it likely a good candidate for using a graph database.  Why “yes”?  When we think about how we need to answer the question, “Of the people that Bob follows do any of them follow people who follow Bob?” we recognize that the relationships are the most interesting.  In this example, we could replace “Alice” with “Janet,” and the answer to the question remains the same.

If I were to model this in an RDBMS, would I be writing queries with multiple (5+) joins or recursive Common Table Expression (CTE) to retrieve my data?

If you’ve ever had the displeasure of writing a recursive CTE in SQL, then I feel your pain.  By reviewing our “friends of friends” example, let’s say that we wanted to answer the question, “What are the connections to get from Bob to Ted?”  While navigating these connections is trivial to answer in our toy graph above, if you think about connections in the context of a larger dataset such as Twitter or LinkedIn, you can quickly see that the number of potential steps is unknown. Attempting to perform this sort of query in a relational database either causes your database to consume all its resources or, more likely, time out. Luckily, one of the significant strengths of graph databases is their ability to recurse over this sort of unbounded hierarchical data.  If your question has this need, then it may be time to investigate applying a graph to help solve your problem.

Is the structure of my data continuously evolving?

Another merit of a graph database compared to a relational one is that rapidly evolving data schema tends to be much simpler in a graph. Graph databases don’t require default values or complex data migrations to change the schema associated with your graph.  If your problem requires taking in data with different data schemas, such as dependency management, then a graph database may be worth investigating. The data schema varies, but this isn’t a strong enough reason to choose a graph database; instead, pair it with one or more of the previous questions discussed, and you can make a strong case for solving these sorts of problems using a graph.

Is my domain a natural fit for a graph?

If you’re doing something such as routing, dependency management, social network analysis, or cluster analysis, then your problem revolves around highly interconnected data, and your domain may be well suited for using a graph. A word of caution: if your questions aren’t relying on the relationships in graph for answers, then you should consider other options, even if your domain is a graph.

In fact, our initial work with graph databases back in 2014[5] revealed how the client’s data fit naturally in a graph. We should know, because we tried it in three different graph databases.  The model had a built-in inheritance functionality, multi-hop traversals, and a natural requirement for dependency analysis.Their two primary constructs were even called components and relationships. The fact that it should’ve been built in a graph database instead of a relational database seemed obvious to us.

In the end, the right answer for that particular customer wasn’t to use one of the three graph engines we evaluated, but their relational engine was better (or rather, in a way congruent with their primary access patterns). We then spent the better part of the next eighteen months adding a read-optimized relational projection. As a result, we witnessed one hundred times performance improvement for some of their most demanding queries.

At first, we were all shocked that the graph databases didn’t provide the necessary performance improvement because the data modeling was naturally graph. Then we looked more closely at the five queries used to evaluate the performance of each engine. Aside from the inheritance modeling, none of the queries required a graph-style access pattern, and we were able to address the inheritance use cases through an aggressive denormalization approach. In fact, the access patterns were best suited for relational databases; hence, the outstanding performance improvement when the data was modeled to take advantage of the RDBMS query optimizers’ strengths.

In general, if you can answer yes to one or more of these questions, then it’s likely that you may have a graph problem.

If you want to learn more about the book, check it out on liveBook here and see this slide deck.

 

[1] https://www.postgresql.org

[2] http://lucene.apache.org

[3] http://lucene.apache.org/solr

[4] https://www.elastic.co

[5] The analysis was re-done with a public data set as the “Graph Database Shootout 2.0” talk presented at GraphDay Seattle in July, 2016 (https://www.experoinc.com/post/graph-db-shootout-20-slides-notes-from-dataday-seattle)