Do we need graphs?
The recent artificial intelligence (AI) revolution is the tip of the iceberg of a megatrend that has been impacting the computing industry for decades now. Over time, computing performance has increased exponentially against power consumed and cost; information storage costs have also decreased exponentially. To put this into perspective, while a terabyte of data can be stored in a disk costing around 100 US dollars in 2024, it would have taken more than a million dollars in the early 1990s!
Using computers and their derivative products, such as software, web applications, games, and multimedia content, has become deeply tied to our normal lifestyle. This dependence led to the need for understanding the behavior of all the interacting entities: humans, computer hardware, software such as web applications, and even organizations as a whole. The end goal was to find ways to make interactions more efficient, which could lead to unprecedented business opportunities.
Initially, given the constraints of the time, the information that was collected was less organized and the recorded truth provided a very high-level overview of systems, and only about a handful of variables within the system (for data scientists, think of data at aggregated levels, and with a small number of dimensions). At some point, someone realized computing power and data storage were cheap enough that you could record facts more granularly: not only could individual scenarios be recorded separately and more frequently, but other variables could also be recorded every time a snapshot was taken. The data revolution had begun, and stakeholders realized that by capturing and reviewing enough data about these interacting entities, a holistic picture of their behavior in the ecosystem could be drawn. The 2010s were spent commoditizing data and its products, to the point that even Series A-funded startups have adopted a data solution: be it warehousing, Elasticsearch, or recommendation engines.
Taking a step back, let’s understand what data means. A data point is essentially just a factual statement. Very little discussion exists on how the fact is represented in the data. The de facto representation of data is tabular, and this has generally worked well for the capabilities built around the current data ecosystem. The focus on data science research and engineering has revolved around existing database architectures, which is why the tabular form of representation is the most widespread. However, the tabular form need isn’t the only form of representation. The purpose of this chapter is to build a case around the graph form of representation and why graph representations can be the best option for several practical scenarios.
Graph data is represented using nodes (also called objects, vertices, or nouns) and edges (also called relationships, links, or verbs). Certain real-life scenarios necessitate emphasizing the relationships between the objects rather than just treating each object as an independent entity. Graph data structures provide us with a natural way to express these scenarios, as opposed to something such as the tabular format. Using the simple construct of treating entities as nodes and relationships as edges between two nodes, graph representations can effectively model information from a wide range of domains: from network topology to biological systems and supply chains to molecular structure.
A case study
To make this point clearer, let’s consider a common question that arises in social networks. For a user (say John), we want to ascertain whether another user (say Mary) is a second-degree connection (to John). A second-degree connection simply means Mary and John have a common connection, but Mary isn’t directly connected (is friends with) to John. The social media platform commonly tracks this piece of information between a pair of users and determines whether they should be recommended by the platform to connect. We’ll tackle this problem from two perspectives: first using the tabular representation, and then using the graph representation.
Tabular representation
First, we need to understand what the schema of the tables in the database would be. In a typical social media platform database, there would be several tables – one for users (capturing demographic information such as age, location, date of joining, and so on), one for posts (containing details about a post made, such as the user who made the post, the contents of the post, the date of making the post, the visibility level, and so on), and many more. The table of concern for us would be something called the connections table. It should capture information about which users are connected directly (that is, they have a first-degree connection). The schema should go somewhat like this:
connections( conn_id: UUID, user_id_1: UUID FOREIGN KEY, user_id_2: UUID FOREIGN KEY, date_of_conn: TIMESTAMP, status_of_conn: TEXT )
Table 1.1 shows a table that contains a few data points:
conn_id |
user_id_1 |
user_id_2 |
date_of_conn |
status_of_conn |
conn_uuid_0 |
john_uuid |
alex_uuid |
2022-10-30 |
active |
conn_uuid_1 |
alex_uuid |
greg_uuid |
2023-03-12 |
active |
conn_uuid_2 |
greg_uuid |
mary_uuid |
2023-04-11 |
active |
conn_uuid_3 |
mary_uuid |
alex_uuid |
2023-06-09 |
active |
Table 1.1 – Example data stored in tabular format
To determine whether John and Mary have a second-degree connection, a SQL query similar to the following can be executed:
Figure 1.1 – A SQL query being performed over tables that were introduced previously to retrieve second-degree connections
The crux of this query contains a recursive self-join operation, where each recursion level contains the connections of a certain degree. The initial filter of user_id_1 =
'john_uuid'
OR user_id_2 = 'john_uuid'
ensures that we only concern ourselves with users who are on some level and connected to John. Finally, by filtering by degree = 2
, we can get the list of all users who have a second-degree connection to John.
How efficient is this approach? The worst-case time complexity can be evaluated asymptotically and expressed in Big-O notation. Let be the count of users present on the social media platform and be the count of all first-degree connections (or the number of entries in the connections table). Join algorithms have evolved, and current join operations are very efficient, but if we look at the naive approach, where two tables have lengths and , a join operation would have an time complexity. Applying this logic to the preceding recursion, we’ll see that the time complexity of the entire query is or .
Graph representation
Now, let’s look at the graph representation for the same problem. Let each node of the graph represent a user and each edge connecting two nodes represent a first-degree connection between the users that the connected nodes represent. An illustration using Table 1.1 would look like this:
Figure 1.2 – Representing data from Table 1.1 in a graph
How do we find the answer to whether John and Mary have a second-degree connection here? We can employ an intuitive algorithm over this graph:
- Start from a source: Begin at a chosen starting point, often called the source or initial node of the graph. This is your current position for exploration. For our use case, the initial node would be that of John.
- Explore neighbors level by level: Visit all the neighbors of the current node before moving on to their neighbors. Imagine exploring the graph in layers, moving outward one level at a time. This ensures that you discover all nodes at a certain distance before moving farther away.
- Mark visited nodes: As you visit each node, mark it as visited to avoid revisiting the same node. Use a queue to keep track of the order in which you encounter nodes. While marking the nodes, you can also keep track of how many jumps from the initial node were made to reach this node. Continue this process until you’ve visited all reachable nodes from the starting point.
In simpler terms, this algorithm explores the graph by gradually moving away from the starting point, checking neighboring nodes level by level, and keeping track of visited nodes to avoid duplication. It’s like ripples spreading out from a pebble dropped in a pond, exploring nearby areas before moving to more distant ones. This algorithm is called breadth-first search (BFS), and it’s one of the most popular graph algorithms:
Figure 1.3 – Running BFS on the graph
By using this algorithm, if Mary’s node is marked as visited and has a jump count of 2, then we can safely say that John and Mary have a second-degree connection.
What’s the time complexity of BFS? As mentioned previously, the number of users is assumed to be , and the number of first-degree connections is . Effectively, BFS touches all vertices and edges of the graph at most once, so the time complexity is simply . In a practical scenario, the number of connections far outweighs the number of users on the platform, so the time complexity can be approximated to .
Is better than ? Definitely. We can see that just by changing the perspective of approaching the problem, we achieve a much more efficient solution. To further test your understanding, think of a scenario where the problem was kept the same, except you now have to check whether John and Mary were third-degree connections instead of second-degree connections. How would the time complexities of both approaches be affected?
Graphs are useful in practice. But before we talk about how certain properties of graphs and algorithms are used to solve graph problems, we need to define a common language that we can use to refer to graphs and their properties. The following section will cover how graphs are commonly defined in mathematics, and how a simple representation can cover all the different types of data that graphs can represent.