Description: https://images.manning.com/360/480/resize/book/7/8028539-e341-4d17-97af-1d569805e6c5/Riscutia-Azure-MEAP-HI.png

From Data Engineering on Azure by Vlad Riscutia

This article talks about building identity keyrings.


Take 40% off Data Engineering on Azure by entering fccriscutia2 into the discount code box at checkout at manning.com.


The code examples in this article use Azure Data Explorer. If you want to run them, create an Azure Data Explorer cluster and database: https://azure.microsoft.com/en-us/services/data-explorer/. Once this is set up, you can use the web UI to connect to your cluster and database: https://dataexplorer.azure.com/. Azure Data Explorer uses KQL, the Kusto Query Language. If you are not familiar with it, you can check out the quick reference here: https://docs.microsoft.com/en-us/azure/data-explorer/kql-quick-reference.

Identity keyrings

In a large enough enterprise, getting a broad view of how the systems are used isn’t an easy task. What usually happens is that different parts of the business generate and master their own identities.

For example, the Website team masters cookie IDs to identify users who aren’t signed in, and profile IDs for signed in users. The Payments team uses customer IDs to identify customers and subscription IDs to keep track of which subscriptions customers are paying for. The Customer Success team has a support customer ID to identify customers in their support tools, and email addresses of customers. An identity keyring pulls together all these identities across the different systems and allows us to quickly find all connections. Figure 1 shows the various identities used by the different teams and how a keyring groups them together.


Figure 1. The Website, Payments, and Customer Success teams all master their own identities. A keyring groups these identities together.


The larger the business, the more identities we have and the harder it is to get the broader view of how users interact with it. Having a keyring allows us to correlate activity across the various systems: we can see, for example, how the time it takes to resolve a customer issue in the Customer Success team impacts user retention, tracked by the Payments team, or we can see how an A/B test run on the Website impacts the subscriptions a user signs up for.

Building an identity keyring

Various systems have some connections between identities. For example, the Website team might have a table which matches cookie IDs to profiles once users log in, and the user profile includes an email address. The Payments team keeps a mapping of customers to subscriptions, and also have the profile ID for a given customer ID. The Customer Success team connects each of their own IDs to an email address. Figure 2 shows these connections and how bringing them together allows us to group all identities in the system.


Figure 2. Various systems maintain various connections. Bringing them together allows us to group all identities in the system.


Let’s create the Azure Data Explorer tables containing these identities, as shown in listing 1.

Listing 1. Tables containing IDs and connections

 
 .set Profiles <|
 datatable (ProfileId: int, Email: string, CookieId: guid) [
     10002, 'emma@hotmail.com', '657d31b9-0614-4df7-8be6-d576738a9661',
     10003, 'oliver@hotmail.com', '0864c60d-cc36-4384-81a3-e4c1eee14fe7'
 ]
  
 .set Customers <|
 datatable (CustomerId: int, ProfileId: int) [
     1001, 10002,
     1005, 10003
 ]
  
 .set Subscriptions <|
 datatable (CustomerId: int, SubscriptionId: guid) [
     1001, 'fd10b613-8378-4d37-b8e7-bb665999d122',
     1005, '55979377-ed34-4911-badf-05e07755334c'
 ]
  
 .set SupportCustomers <|
 datatable (SupportCustomerId: int, Email: string) [
     21, 'emma@hotmail.com',
     22, 'oliver@hotmail.com'
 ]
 

These tables come from different systems but end up ingested into our data platform. Once we have this raw data, we can build a keyring by grouping all related IDs. The schema of our keyring table consists of a GroupId, uniquely identifying a group of related identities, a KeyType, which specifies which identity we capture in each row, and a KeyValue, which is the value of the identity.

We’ll start by ingesting the Profiles data. Listing 2 shows the first batch of ingestions.

Listing 2. Ingesting Profiles into the keyring

 
 .create table Keyring(GroupId: guid, KeyType: string, KeyValue: string)
  
 .append Keyring <| Profiles    #A
 | project GroupId=new_guid(), KeyType='ProfileId', KeyValue=tostring(ProfileId)
  
 .append Keyring <| Profiles
 | join (Keyring | where KeyType == 'ProfileId'
     | project GroupId, ProfileId=toint(KeyValue)) on ProfileId    #B
 | project GroupId, KeyType='Email', Email    #C
  
 .append Keyring <| Profiles
 | join (Keyring | where KeyType == 'ProfileId'
     | project GroupId, ProfileId=toint(KeyValue)) on ProfileId
 | project GroupId, KeyType='CookieId', tostring(CookieId)    #D
 

#A .append is similar to .set, but as .set creates a new table, .append expects an existing table to ingest into.

#B We join Profiles with the ProfileId rows in the Keyring to get the GroupId.

#C Add Email to Keyring.

#D Same as the previous command, but adding CookieId to keyring.

We first generate new GUIDs and we set the key type to be 'ProfileId' and the key value the ProfileId from the Profiles table. Next, we join the Profiles table with the Keyring table on ProfileId, which gives us the GroupId, and we add the emails. Third, we join the Profiles table with the Keyring table on ProfileId and add the CookieId values.

At this point, we “unrolled” the Profiles table into the keyring schema. Let’s add the Customers and Subscriptions IDs to it. Listing 3 shows this ingestion.

Listing 3. Ingesting Customers and Subscriptions IDs

 
 .append Keyring <| Customers
 | join (Keyring | where KeyType == 'ProfileId'
     | project GroupId, ProfileId=toint(KeyValue)) on ProfileId
 | project GroupId, KeyType='CustomerId', tostring(CustomerId)
  
 .append Keyring <| Subscriptions
 | join (Keyring | where KeyType == 'CustomerId'
     | project GroupId, CustomerId=toint(KeyValue)) on CustomerId
 | project GroupId, KeyType='SubscriptionId', tostring(SubscriptionId)
 

This is similar with what we did before, except when we bring in SubscriptionId we have to join on CustomerId instead of ProfileId. This is not a problem: we can join on any identity that we already have in the keyring to find the GroupId and extend the group with other identities.

Finally, let’s also add the customer support IDs, joining on email. Listing 4 shows this step.

Listing 4. Ingesting Support Customer IDs

 
 .append Keyring <| SupportCustomers
 | join (Keyring | where KeyType == 'Email'    #A
     | project GroupId, Email = KeyValue) on Email    #A
 | project GroupId, KeyType='SupportCustomerId', tostring(SupportCustomerId)
 

#A We can join with any key already available in the Keyring. In this case, we use Email instead of ProfileId.

Understanding keyrings

We now pulled together the IDs from all these different tables into one and grouped them together. If we query the keyring table, we see something like the following table:

GroupId

KeyType

KeyValue

f03c9e90-5d97-4a11-82aa-480f74325a2c

CookieId

657d31b9-0614-4df7-8be6-d576738a9661

62159798-2447-41e3-b0ef-f1a239d55978

CookieId

0864c60d-cc36-4384-81a3-e4c1eee14fe7

f03c9e90-5d97-4a11-82aa-480f74325a2c

ProfileId

10002

62159798-2447-41e3-b0ef-f1a239d55978

ProfileId

10003

f03c9e90-5d97-4a11-82aa-480f74325a2c

Email

emma@hotmail.com

62159798-2447-41e3-b0ef-f1a239d55978

Email

oliver@hotmail.com

f03c9e90-5d97-4a11-82aa-480f74325a2c

SupportCustomerId

21

62159798-2447-41e3-b0ef-f1a239d55978

SupportCustomerId

22

f03c9e90-5d97-4a11-82aa-480f74325a2c

CustomerId

1001

62159798-2447-41e3-b0ef-f1a239d55978

CustomerId

1005

f03c9e90-5d97-4a11-82aa-480f74325a2c

SubscriptionId

fd10b613-8378-4d37-b8e7-bb665999d122

62159798-2447-41e3-b0ef-f1a239d55978

SubscriptionId

55979377-ed34-4911-badf-05e07755334c

Now given any ID in the system, we can easily retrieve all connected IDs. For example, listing 5 shows how, given a SupportCustomerId (21), we can retrieve all related keys.

Listing 5. Retrieving all IDs related to a SupportCustomerId

 
Keyring
 | where KeyType == 'SupportCustomerId' and KeyValue == tostring(21)    #A
 | project GroupId    #A
 | join kind=inner Keyring on GroupId    #B
 

#A Get the GroupId of the SupportCustomerId with value “21”.

#B Join with Keyring on GroupId to retrieve all identities in that group.

A keyring enables us to correlate different datasets and get a holistic view of how our systems are used. We use a schema in which we can plug as many types of IDs as needed, with a KeyType column which gives us the type of ID and KeyValue column which stores the ID value.

The steps to build a keyring are:

  • Generate a group ID and start by ingesting one identity (ProfileId in our example).
  • For each new identity type, join with the keyring on a known connection to get the GroupId, then add the new identities to their respective groups.

See sidebar for a graph-oriented view on building a keyring.

Identity keyring as a graph

Another way to think about an identity keyring is as a graph problem. Each identity in the system represents a node in the graph, and each known connection represents an edge. For example, ProfileId and Email are nodes and because they’re connected (in the Profiles table), we have edges between pairs of these nodes.

Building a keyring means identifying all groups of connected identities. In graph terms, this means identifying all connected components of the graph and assigning a GroupId to each connected component. As a reminder, a connected component in a graph is a subgraph in which there’s a path between any pair of nodes, and there are no other connections to the super-graph.

An alternative way to build a keyring is using a graph database: we load all nodes and edges then we traverse to find connected components.

Keyrings give us a unified view over all identities in our systems. Another useful view is that of all events that occur across our systems.

If you want to learn more, check out the book on Manning’s liveBook platform here.