|
By Graph-Powered Machine Learning Alessandro Negro This article discusses creating a bigraph for a user-item dataset. |
Take 37% off Graph-Powered Machine Learning by entering fccnegro into the discount box at checkout at manning.com.
In a content-based approach to recommendation, a lot of information is available for both items and users which is useful to create profiles. We used a graph model to represent these profiles, connecting each item to its features and each user to features of interest. Even the nearest neighbor network was built using only this information. The collaborative filtering approach, on the other hand, relies on data related to the different kinds of interactions between users and items. Such information is generally referred to as a user–item dataset.
Figure 1: Bipartite graph creation in the recommendation process
Table 1 contains an example: the items here are books that different users have bought.
Table 1. Example of user–item dataset for the e-commerce scenario
|
Fluent Python |
Machine Learning: A Probabilistic Perspective |
Graph Analysis and Visualization |
Bayesian Reasoning |
Fraud Analytics |
Deep Learning |
User A |
1 |
1 |
1 |
1 |
1 |
1 |
User B |
0 |
1 |
0 |
0 |
0 |
1 |
User C |
1 |
0 |
0 |
0 |
0 |
0 |
User D |
0 |
1 |
0 |
1 |
0 |
1 |
This table contains data on only one type of interaction (purchases). In different scenarios, including the e-commerce site considered as our example, several types of interactions are available (views, clicks, or ratings) and can be used in the recommendation process. Multimodal recommender systems [da Costa and Manzato, 2014] combine multiple interaction types to provide recommendations. One of the best approaches is to create an isolated recommender system for each interaction type and then combine them in a hybrid recommender. Because the focus here is on data modeling we focus on a single type, but the extension to more is straightforward.
The user–item dataset (which in real life is much bigger than the sample shown here) represents the input of the recommendation process in the collaborative filter. This initial data is easy to obtain. To give you a few examples:
- Online merchants keep records of which customers bought which products, and sometimes whether they liked them.
- Supermarket chains usually keep purchase records for their regular customers using rewards cards.
People’s preferences for things, such as for certain products sold by a retailer, can be represented as graphs in recommendation networks [Newman, 2010] and used in recommender systems. The fundamental and most common representation of a recommendation network is a bipartite graph or bigraph.
Figure 2: A generic bipartite graph
In generic terms, a bipartite graph is a network whose nodes can be divided into two disjoint sets U and V such that each link connects a U-node (i.e. a node from the U set) to a V-node (i.e. a node from the V set). To make it simpler, if we color the U-nodes green and the V-nodes purple, then each link must connect nodes of different colors—Figure 2.
Let’s apply this simple concept to our scenario and more in general to recommendation systems that use a collaborative filtering approach. In recommendation networks, one vertex type represents products (or items in general) and the other type represents users. Edges connect users to the items they interact with (buy or like for instance). It’s also possible to represent strengths or weights on the edges to indicate, for example, how often a person has bought an item or how much they like it [Newman, 2010].
Using this model, it’s possible to convert the user-item dataset in a bipartite graph. For example, from our simple dataset for the e-commerce site of Table 1 it’s possible to create the following graph of Figure 3.
Figure 3: A bipartite graph representing Table 1
Although a bipartite graph can represent a recommendation network completely, it’s often convenient and useful to work with direct connections between vertices of only one type. From a bipartite graph, it’s possible to infer connections between nodes of the same type, creating a one-mode projection. Two projections can be generated for each bipartite graph. The first connects the U-nodes and the second connects V-nodes. Figure 4 shows the two possible projections computed from the bipartite graph in Figure 3.
Figure 4: The two possible projections of the graph in Figure 3
The projections of the bipartite graph show the relationships between users that shares the same purchases – even one – or relationships between books which have been purchased by the same users – even only one. This one-mode projection is often useful and widely employed, but its construction hides a lot of information from the original bipartite graph and hence it’s, in a sense, less powerful in terms of representation. For instance, in the case of items and users, it doesn’t show how many users purchase the two items connected. In order to solve this problem, it’s possible to add a property on the relationships whose value. We can capture this information making such projection weighted. The projection as it’s described is a co-occurrence network whose source is the bipartite graph. For instance, again in the case of a user–item matrix, two items are connected in the V-mode projection if they co-occur in at least one user’s preferences.
The projections—weighted and unweighted—of a bipartite graph represent data in a way that allows us to perform some analyses more easily than the original format. They’re therefore used quite a lot in recommendation systems [Grujić, 2008]. Often, graph clustering analysis is performed on these networks to reveal groups of items which are generally purchased together or, in the other projection, to identify customer segments based on their preferences for some specific set of items. Powerful visualization techniques can reveal patterns which are difficult to discover or identify in the original format.
To recap, the advantages of a graph model—and specifically a bipartite graph representation—of the user–item dataset are:
- It represents the data in a compact and intuitive way. The user–item dataset is sparse by nature. Generally, it has a lot of users and a lot of items, and users interact with a tiny portion of the items available. A matrix representing such interactions contains a huge number of zeros and few useful values. The graph contains only relationships that represent meaningful information.
- Projections derived from the bipartite graph are information rich and allow different types of analyses, both graphical (once visualized) and via algorithms (an example is graph clustering). Analyses such as customer segmentation and item clustering can represent a valuable support to the classical recommendation algorithms discussed in this article.
- The representation can be extended by modeling multiple bipartite graphs that share the same set of vertices but using different edges. This helps to represent data used as input for a multimodal recommendation engine. These engines use multiple types of interactions to provide better recommendations.
- As discussed in the next section, once the nearest neighbor network is computed it can be stored, sharing the user and item nodes of the bipartite graph created. This simplifies the recommendation phase by providing access to a single data structure that contains both user preferences and the nearest neighbors network.
Now that you understand the graph model describing the user–item dataset. let’s get practical and create a real database. The following listing, which is quite similar to the equivalent used for the content-based scenario, imports data from the Retail Rocket[1] (retailrocket.io) dataset, converting the user–item matrix into a bipartite graph. Retail Rocket helps web shoppers make better shopping decisions by providing personalized real-time recommendations through multiple channels with over one hundred million unique monthly users and one thousand retail partners over the world.
Listing 1. Code for importing user–item dataset
def import_user_item(self, file): with open(file, 'r+') as in_file: reader = csv.reader(in_file, delimiter=',') next(reader, None) with self._driver.session() as session: session.run("CREATE CONSTRAINT ON (u:User) ASSERT u.userId IS UNIQUE") #A session.run("CREATE CONSTRAINT ON (u:Item) ASSERT u.itemId IS UNIQUE") #A tx = session.begin_transaction() i = 0; j = 0; query = """ #B MERGE (item:Item {itemId: {itemId}}) MERGE (user:User {userId: {userId}}) CREATE (user)-[:PURCHASES{ timestamp: {timestamp}}]->(item) """ for row in reader: try: if row: timestamp = strip(row[0]) user_id = strip(row[1]) event_type = strip(row[2]) item_id = strip(row[3]) if event_type == "transaction": #C tx.run(query, {"itemId":item_id, "userId": user_id, "timestamp": timestamp}) i += 1 j += 1 if i == 1000: tx.commit() print(j, "lines processed") i = 0 tx = session.begin_transaction() except Exception as e: print(e, row, reader.line_num) tx.commit() print(j, "lines processed")
#A Create the constraints to avoid duplicates–each user and item must be unique
#B The query uses MERGE to create users or items when they don’t exist already; the CREATE for the BUYS relationship stores one relationship for each purchase
#C In the dataset there are multiple event types (view, add to cart, and so on); we consider here only completed transactions, or actual purchases
At the end of the import, which should take a few seconds, it’s possible to run the following query and visualize a portion of the graph.
Query 1. Simple query to visualize a portion of the graph created
MATCH p=()-[r:PURCHASES]->() RETURN p LIMIT 25
The graph created using the bipartite model represents the entry point of the next phase as described in the mental model of the process in Figure 1.
Exercises
With the newly created database, run queries to:
- Find the “best sellers” (the items with the highest number of purchases).
- Find the “best buyers” (the users with the highest number of purchases).
- Find recurrent buys (when an item is purchased more than one time by the same user).
That’s all for this article. If you want to learn more about the book, check it out on our browser-based liveBook reader here. You can also find the relevant source code here: https://github.com/alenegro81/gpml
[1] The data is available through Kaggle at the following address: https://www.kaggle.com/retailrocket/ecommerce-dataset.