|
From Data Engineering on Azure by Vlad Riscutia This article talks about building identity keyrings. |
Take 40% off Data Engineering on Azure by entering fccriscutia2 into the discount code box at checkout at manning.com.
The code examples in this article use Azure Data Explorer. If you want to run them, create an Azure Data Explorer cluster and database: https://azure.microsoft.com/en-us/services/data-explorer/. Once this is set up, you can use the web UI to connect to your cluster and database: https://dataexplorer.azure.com/. Azure Data Explorer uses KQL, the Kusto Query Language. If you are not familiar with it, you can check out the quick reference here: https://docs.microsoft.com/en-us/azure/data-explorer/kql-quick-reference.
Identity keyrings
In a large enough enterprise, getting a broad view of how the systems are used isn’t an easy task. What usually happens is that different parts of the business generate and master their own identities.
For example, the Website team masters cookie IDs to identify users who aren’t signed in, and profile IDs for signed in users. The Payments team uses customer IDs to identify customers and subscription IDs to keep track of which subscriptions customers are paying for. The Customer Success team has a support customer ID to identify customers in their support tools, and email addresses of customers. An identity keyring pulls together all these identities across the different systems and allows us to quickly find all connections. Figure 1 shows the various identities used by the different teams and how a keyring groups them together.
Figure 1. The Website, Payments, and Customer Success teams all master their own identities. A keyring groups these identities together.
The larger the business, the more identities we have and the harder it is to get the broader view of how users interact with it. Having a keyring allows us to correlate activity across the various systems: we can see, for example, how the time it takes to resolve a customer issue in the Customer Success team impacts user retention, tracked by the Payments team, or we can see how an A/B test run on the Website impacts the subscriptions a user signs up for.
Building an identity keyring
Various systems have some connections between identities. For example, the Website team might have a table which matches cookie IDs to profiles once users log in, and the user profile includes an email address. The Payments team keeps a mapping of customers to subscriptions, and also have the profile ID for a given customer ID. The Customer Success team connects each of their own IDs to an email address. Figure 2 shows these connections and how bringing them together allows us to group all identities in the system.
Figure 2. Various systems maintain various connections. Bringing them together allows us to group all identities in the system.
Let’s create the Azure Data Explorer tables containing these identities, as shown in listing 1.
Listing 1. Tables containing IDs and connections
.set Profiles <| datatable (ProfileId: int, Email: string, CookieId: guid) [ 10002, 'emma@hotmail.com', '657d31b9-0614-4df7-8be6-d576738a9661', 10003, 'oliver@hotmail.com', '0864c60d-cc36-4384-81a3-e4c1eee14fe7' ] .set Customers <| datatable (CustomerId: int, ProfileId: int) [ 1001, 10002, 1005, 10003 ] .set Subscriptions <| datatable (CustomerId: int, SubscriptionId: guid) [ 1001, 'fd10b613-8378-4d37-b8e7-bb665999d122', 1005, '55979377-ed34-4911-badf-05e07755334c' ] .set SupportCustomers <| datatable (SupportCustomerId: int, Email: string) [ 21, 'emma@hotmail.com', 22, 'oliver@hotmail.com' ]
These tables come from different systems but end up ingested into our data platform. Once we have this raw data, we can build a keyring by grouping all related IDs. The schema of our keyring table consists of a GroupId
, uniquely identifying a group of related identities, a KeyType
, which specifies which identity we capture in each row, and a KeyValue
, which is the value of the identity.
We’ll start by ingesting the Profiles
data. Listing 2 shows the first batch of ingestions.
Listing 2. Ingesting Profiles into the keyring
.create table Keyring(GroupId: guid, KeyType: string, KeyValue: string) .append Keyring <| Profiles #A | project GroupId=new_guid(), KeyType='ProfileId', KeyValue=tostring(ProfileId) .append Keyring <| Profiles | join (Keyring | where KeyType == 'ProfileId' | project GroupId, ProfileId=toint(KeyValue)) on ProfileId #B | project GroupId, KeyType='Email', Email #C .append Keyring <| Profiles | join (Keyring | where KeyType == 'ProfileId' | project GroupId, ProfileId=toint(KeyValue)) on ProfileId | project GroupId, KeyType='CookieId', tostring(CookieId) #D
#A .append is similar to .set, but as .set creates a new table, .append expects an existing table to ingest into.
#B We join Profiles with the ProfileId rows in the Keyring to get the GroupId.
#C Add Email to Keyring.
#D Same as the previous command, but adding CookieId to keyring.
We first generate new GUIDs and we set the key type to be 'ProfileId'
and the key value the ProfileId
from the Profiles
table. Next, we join the Profiles
table with the Keyring
table on ProfileId
, which gives us the GroupId
, and we add the emails. Third, we join the Profiles
table with the Keyring table on ProfileId
and add the CookieId
values.
At this point, we “unrolled” the Profiles
table into the keyring schema. Let’s add the Customers
and Subscriptions
IDs to it. Listing 3 shows this ingestion.
Listing 3. Ingesting Customers and Subscriptions IDs
.append Keyring <| Customers | join (Keyring | where KeyType == 'ProfileId' | project GroupId, ProfileId=toint(KeyValue)) on ProfileId | project GroupId, KeyType='CustomerId', tostring(CustomerId) .append Keyring <| Subscriptions | join (Keyring | where KeyType == 'CustomerId' | project GroupId, CustomerId=toint(KeyValue)) on CustomerId | project GroupId, KeyType='SubscriptionId', tostring(SubscriptionId)
This is similar with what we did before, except when we bring in SubscriptionId
we have to join on CustomerId
instead of ProfileId
. This is not a problem: we can join on any identity that we already have in the keyring to find the GroupId
and extend the group with other identities.
Finally, let’s also add the customer support IDs, joining on email. Listing 4 shows this step.
Listing 4. Ingesting Support Customer IDs
.append Keyring <| SupportCustomers
| join (Keyring | where KeyType == 'Email' #A
| project GroupId, Email = KeyValue) on Email #A
| project GroupId, KeyType='SupportCustomerId', tostring(SupportCustomerId)
#A We can join with any key already available in the Keyring. In this case, we use Email instead of ProfileId.
Understanding keyrings
We now pulled together the IDs from all these different tables into one and grouped them together. If we query the keyring table, we see something like the following table:
GroupId |
KeyType |
KeyValue |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
CookieId |
657d31b9-0614-4df7-8be6-d576738a9661 |
62159798-2447-41e3-b0ef-f1a239d55978 |
CookieId |
0864c60d-cc36-4384-81a3-e4c1eee14fe7 |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
ProfileId |
10002 |
62159798-2447-41e3-b0ef-f1a239d55978 |
ProfileId |
10003 |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
|
emma@hotmail.com |
62159798-2447-41e3-b0ef-f1a239d55978 |
|
oliver@hotmail.com |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
SupportCustomerId |
21 |
62159798-2447-41e3-b0ef-f1a239d55978 |
SupportCustomerId |
22 |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
CustomerId |
1001 |
62159798-2447-41e3-b0ef-f1a239d55978 |
CustomerId |
1005 |
f03c9e90-5d97-4a11-82aa-480f74325a2c |
SubscriptionId |
fd10b613-8378-4d37-b8e7-bb665999d122 |
62159798-2447-41e3-b0ef-f1a239d55978 |
SubscriptionId |
55979377-ed34-4911-badf-05e07755334c |
Now given any ID in the system, we can easily retrieve all connected IDs. For example, listing 5 shows how, given a SupportCustomerId
(21
), we can retrieve all related keys.
Listing 5. Retrieving all IDs related to a SupportCustomerId
Keyring | where KeyType == 'SupportCustomerId' and KeyValue == tostring(21) #A | project GroupId #A | join kind=inner Keyring on GroupId #B
#A Get the GroupId of the SupportCustomerId with value “21”.
#B Join with Keyring on GroupId to retrieve all identities in that group.
A keyring enables us to correlate different datasets and get a holistic view of how our systems are used. We use a schema in which we can plug as many types of IDs as needed, with a KeyType
column which gives us the type of ID and KeyValue
column which stores the ID value.
The steps to build a keyring are:
- Generate a group ID and start by ingesting one identity (
ProfileId
in our example). - For each new identity type, join with the keyring on a known connection to get the
GroupId
, then add the new identities to their respective groups.
See sidebar for a graph-oriented view on building a keyring.
Keyrings give us a unified view over all identities in our systems. Another useful view is that of all events that occur across our systems.
If you want to learn more, check out the book on Manning’s liveBook platform here.