From Data Privacy by Nishant Bhajaria
The term “Data Inventory” is ill-defined and this article aims to create a definition which is intuitive and actionable.
Data Inventory Definition
This process of adding tags derived from your Data Classification to your data systems is called “Data Inventory.” As you start building your Data Inventory, you’re indexing the contents of your data stores and making individual components expeditiously searchable. Data Inventory is like building the backend of a search engine for your data, much like a team of smart engineers built the backend of tools like Google.
Why do you need Data Inventory?
The definition covers an intuitive need for Data Inventory but it’s key that executives and aspiring executives understand specifically the risk mitigation and business enablement that Data Inventory makes possible.
In the lead up to the GDPR, the International Association of Privacy Professionals (IAPP) provided an enumerated plan for companies to get a head start on compliance. This was to be a checklist for companies to know where to start and what structures and processes to create as they prepare for a post-GDPR world, where privacy is to become front and center like never before. This list remains fairly applicable even as its individual components have become more complex to implement and with more variations based on a company’s use of data.
I have listed the plan below, with my own insights added.
- Conduct data inventory and mapping. This assumes that the starting point of a sound data protection program is the ability to classify, catalogue and discover data such that privacy risk is comprehensible at the time of data collection and access. This article provides a deep dive into Data Governance based on this time-tested guidance from industry experts.
- Establish a lawful basis for data processing and cross-border transfers. This is something your legal team would advise on, but the way you process data and where you transfer it to may take on additional complexities when it comes to geographic boundaries. Making that assessment requires exactly the sort of insight and discoverability that Data Classification and Data Inventory make possible.
- Build and maintain a system to govern the data protection process, including establishing leadership (where appropriate, a data protection officer), setting policies and training personnel.
- Perform data protection impact assessments, along with data protection by design and by default. This typically refers to the privacy risk assessments and privacy reviews that your teams conduct on products and features.
- Prepare and implement data retention and record keeping policies and systems to meet information transparency and communications obligations. These obligations could form a part of your audits, for which prudent book-keeping is a prerequisite. Otherwise, your audit processes could become cumbersome and expensive.
- Configure systems and put in place processes to accommodate data subjects’ rights, including access, rectification, erasure, portability, objection to automated processing and revocation of consent. As mentioned before, honoring data subjects’ rights, typically exercised through data subject access requests (DSARs), is a key commitment for many companies thanks to laws like the GDPR and the CCPA. Having a Data Inventory is key to meeting these commitments at scale and with accuracy.
- Prepare for security breach response and notification. Your legal team and/or outside counsel should weigh in, but several jurisdictions in the United States and elsewhere have breach notification laws. These laws create expectations that companies that suffer from a data breach need to notify the impacted entities with specific pieces of information and within specific timeframes.
- Have a sound vendor management protocol. This step is critical because vendors who may get access to your systems and your data could make decisions with attendant privacy implications. Assessing the ability of your vendors to follow your data protection guidelines and their past record is critical. As we saw previously, companies may claim that data privacy issues occurred at third parties, but your stakeholders in the privacy community may hold you responsible nonetheless.
- Establish systems and channels for communicating with your data protection authority. It’s possible that you need to provide to regulatory authorities granular details around data, your decisions around handling it and time-stamped records. Data inventory enables and accelerates such a disclosure process, and that could help build a strong trust relationship as well.
To those executives who seek comfort from the fact that the only companies in the news for privacy are the big tech giants, I have this to say: These high-visibility companies faced a moment of truth AFTER rapid growth; at least they had the money to build privacy teams and lawyers to represent them in court. What if regulators or activist citizens come after a start-up pre-IPO and VCs fail to even get a basic return on their investment?
Additionally, the smaller your size and the more limited your resources, the harder it is to adapt to a sudden regulatory change. I know of several small companies that found their roadmaps severely impacted, and if you think privacy is expensive, the opportunity cost of not having privacy controls is almost certainly higher. As a somewhat imperfect analogue, consider this: Bill Gates recently said that the antitrust investigation around Microsoft in the late 1990s affected the company’s ability to effectively comprehend the threat posed by Google’s SaaS model and Apple’s mobile computing model, resulting in a lost decade for Microsoft. Why would you knowingly subject your company to such uncertainty, especially when doing the right thing with privacy helps your business build trust with your customers and helps with growth?
Data Inventory is a key part of your data protection program. Having established what Data Inventory is and the forcing functions that render it key, let’s look at the foundational building blocks of a Data Inventory. The next section looks at Data Inventory Tags.
Data Inventory: Machine-Readable Tags
“Tagging” or “Labelling” is something we all do routinely in our lives to help locate important materials like our tax returns or medical records. This concept and process are key when it comes to Data Governance.
What are Data Inventory tags?
Data Inventory is the process of applying the Data Classification onto your physical data stores. As we’ve already seen, the classification process is fairly cross-functional, and forces teams to come up with labels that describe the nature of the data and the privacy risk attached to it. Additional steps are required to ensure that your Data Inventory is functional and serves its purpose, i.e. indexing data and making it searchable and easier to protect.
The first step in this process, and one that many companies tend to overlook to their eventual detriment, is to come up with tags or labels. These tags are the machine-readable incarnation of the Data Classification.
Data Inventory may well be the first time a company has a common definition around the data previously collected by several teams across the company. The task of finalizing these tags can often be confusing for many teams that may have gotten inured to their own naming conventions.
To simplify this process, I want to provide some criteria for usable data tags that help your Data Inventory process and outcomes:
- These data labels and data value tags should be such that they are easily consumed by enforcement points like data loss prevention gateways or information rights management for actionable intelligence.
- The tags should be compatible with and supportive of external regulatory requirements (e.g. GDPR, CCPA). Sometimes you need to apply controls germane to specific legislation, and being able to tag your data appropriately is helpful. As an analogue, in Gmail, you can tag a specific email with the labels “family vacation December 2019” and “Mom.” In this case, a search for either term surfaces that email.
- They should be applicable to all data in these states: data at rest, data in transit, and data in use. When it comes to data, you need to protect it regardless of its state, and the tags that enable you to locate it should yield similar outcomes regardless of whether the data is being transported between data centers or whether the data lives in a data warehouse.
- Tag definitions should be canonical, unambiguous, and machine-readable. They can be used either individually (e.g. for individual database columns or API parameters), or as a group, represented as comma-separated values, where applicable (e.g. for an entire dataset or API).
The above enumeration isn’t exhaustive, but should offer you a great place to start. It’s vital that your team take seriously the exercise to come up with tag names. The process to apply these tags can be extremely expensive. This is one area where months and years of re-tagging save you weeks of planning.
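To make the last criterion concrete, here is a minimal sketch in Python. It assumes a hypothetical dataset whose column names are invented for illustration; only the tag values follow the format used in this article. It shows tags applied to individual columns and then rolled up into a comma-separated group for the whole dataset.

```python
# Hypothetical column-level tags for an illustrative dataset.
# The column names are assumptions; the tag values follow this article's format.
column_tags = {
    "passport_number": "user:government-id-passport:L",
    "card_number": "user:sensitive-payment:L",
    "billing_card": "user:sensitive-payment:L",
}


def dataset_tag_group(tags_by_column):
    """Roll column tags up into one de-duplicated, sorted, comma-separated group."""
    return ",".join(sorted(set(tags_by_column.values())))


def split_tag_group(group):
    """Turn a comma-separated tag group back into a list of individual tags."""
    return [tag for tag in group.split(",") if tag]


group = dataset_tag_group(column_tags)
print(group)
```

Sorting and de-duplicating the group is a design choice in this sketch, not a requirement from the article; it simply keeps the dataset-level value stable no matter which order the columns were tagged in.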
Data Inventory tags – A specific example
Now that we have a conceptual understanding of Data Inventory tags, looking at specific patterns and examples helps leaders form their own tagging strategies. For non-technical executive leaders, this exercise provides an educational view around the granularity and variety of data, as well as insight into why the tagging exercise is mission-critical. TABLE 1 below has an example of how you can create different kinds of tags for the same data.
TABLE 1: DATA INVENTORY TAGS
Tag Value format: (business|user):[a-z]+(-[a-z]+):L|F
Alternate Value format (e.g. GCP Label) †: (business|user)_[a-z]+(-[a-z]+):L/F
Tier | Business/User | Description | Retention Category / Retention Period / Operational Access | Tag Value
Tier 1 | User | Government identifiers | Active Account Data / Lifetime / N/A * | user:government-id:L
Tier 1 | User | Government identifiers | Active Account Data / Lifetime / N/A * | user:government-id-driverlicense:L
Tier 1 | User | Government identifiers | Active Account Data / Lifetime / N/A * | user:government-id-passport:L
Tier 1 | User | Protected health data | Discretionary Retention / Defined per collection / N/A | user:protected-health:L
Tier 1 | User | Sensitive payment data | ** Active Account Data / Lifetime / N/A | user:sensitive-payment:L
Tier 1 | User | Sensitive demographic data | Discretionary Retention / Defined per collection / N/A | user:sensitive-demographic:L
Tier 1 | Business | Intellectual Property and other Business Secrets | Discretionary Retention / Defined per collection / N/A | business:recipe:F
It helps to understand how the data in TABLE 1 above operates.
First, let us look at the syntax for the data tags. Most engineers understand this but because this article aims for a broader audience, the following explanation may be helpful:
The format for a specific tag is along the lines of (business|user):[a-z]+(-[a-z]+):L|F. This format is known as a “regular expression,” which means it provides a template for what the end result is allowed to be. The format helps achieve three goals:
- A clearly identifiable signal to distinguish between business and user data
- A descriptive name that tells consumers of that data what is contained in that record
- A retention signal that helps drive decisions around how long the data can be retained in relation to privacy
Let’s step through some of the data and understand how the tagging process works to end up with a database that resembles TABLE 1. In order to do this, let’s break down the regular expression into smaller parts and understand how we’d use it to come up with a tag to store a customer’s payment data, for example, their credit card.
- (business|user) means that any data component being tagged could be either business data or user data. Because we need to create a tag for user data (i.e. a user’s payment data), we start our tag with “user”.
- The “:” is a delimiter between different parts of the tag, and we’ll leave it as is. Thus far, we have “user:”
- Our regular expression now has [a-z]+(-[a-z]+): which means we can append a series of letters, followed by a hyphen and then more letters, followed by another colon (“:”).
- Most companies categorize anything relating to payment data as sensitive, and we should append that word to our in-progress tag, leaving us with “user:sensitive”
- Several aspects of user data can be sensitive, ranging from gender and race to payment information. To make our tagging more granular, we can add more detail to our tag, leaving us with “user:sensitive-payment:”
- We have now identified that any data we attach our in-progress tag to is sensitive payment data that belongs to a user
- The regular expression ends with either an “L”, to retain the data for as long as the user is an active account holder, or an “F”, for cases such as business data where it makes sense to hold on to the data. In this case, because we’re dealing with user data, our tag ends up being “user:sensitive-payment:L”
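The steps above can be sketched in a few lines of Python. One caveat: read literally, the pattern as printed would reject tags with no hyphenated part (business:recipe:F) or with two of them (user:government-id-driverlicense:L). This sketch therefore assumes a slightly tightened form that allows zero or more hyphenated parts and groups the trailing L or F. The helper names build_tag and is_valid_tag are hypothetical.

```python
import re

# Assumption: a tightened variant of the article's pattern, not the pattern as printed.
# It allows zero or more "-part" groups and groups the final L/F choice.
TAG_PATTERN = re.compile(r"(business|user):[a-z]+(-[a-z]+)*:(L|F)")


def is_valid_tag(tag):
    """Return True if the whole string is a well-formed inventory tag."""
    return TAG_PATTERN.fullmatch(tag) is not None


def build_tag(owner, name_parts, retention):
    """Assemble a tag step by step, e.g. 'user' + ['sensitive', 'payment'] + 'L'."""
    tag = f"{owner}:{'-'.join(name_parts)}:{retention}"
    if not is_valid_tag(tag):
        raise ValueError(f"not a valid inventory tag: {tag!r}")
    return tag


print(build_tag("user", ["sensitive", "payment"], "L"))
```

Validating every tag at the moment it is created is one way to keep tag definitions canonical, so that typos or team-specific naming conventions are caught before they reach a data store.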
Now that we understand how the tags are created in line with the regular expression format, we can examine how these tags are used to map to data that you need to store and protect.
Let’s look at different tags for a company that owns several restaurants and wants to build its Data Inventory.
Because our business owns restaurants, there’s a significant number of employees who work as cooks, delivery persons, and other staff. It’s likely that you’d support a vast number of different ways whereby people could prove their eligibility to work. Some of them might have a driver’s license, yet others may opt for a state ID.
Your use cases may involve:
- Wanting to update the database with employment verification records of new employees and support all forms of ID and circumstances
- Searching for employees based on specific ID criteria. Example: all employees who are on a two-day probation after their first day because they haven’t provided a government ID yet
In TABLE 2, the first tag format (user:government-id:L) allows you to provide a binary value (True/False) to be able to discern whether the user has provided an ID.
TABLE 2: BASIC DATA INVENTORY TAGS
Tag Value format: (business|user):[a-z]+(-[a-z]+):L|F
Alternate Value format (e.g. GCP Label) †: (business|user)_[a-z]+(-[a-z]+):L/F
Tier | Business/User | Description | Tag Value | Tag example
Tier 1 | User | Govt identifiers | user:government-id:L | John Smith:True:L
Tier 1 | User | Govt identifiers | user:government-id:L | Jane Doe:True:L
Tier 1 | User | Govt identifiers | user:government-id:L | Abe Linc:False:L
Additionally, having “L” at the end of the tag indicates that the data in that record belongs to a user, and should be retained for the lifetime of the account. This means that you retain the user’s employment verification record for as long as they work for your company. In reality these decisions are a lot more complex, because you may have to retain data for even longer for tax purposes, but I’m simplifying the use case to explain Data Inventory.
In TABLE 2, you can specifically identify which employees have provided a valid government ID (John Smith and Jane Doe) and which ones have not (Abe Linc). After the first three days of employment, for example, you could run a query that searches for employees with “False” in their tags and identify the employees who have yet to furnish an ID.
The key takeaway: even if you don’t have data stored in a Structured Data format, you can use tagging to make the data searchable and identifiable.
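As a rough illustration of the query described above, the following Python sketch treats the tag examples from TABLE 2 as plain strings and filters for the employees whose ID flag is “False”. The “name:flag:retention” layout is read off the table, and the function names are hypothetical.

```python
# Tag examples from TABLE 2, stored as "<name>:<has_id>:<retention>" strings.
tag_examples = [
    "John Smith:True:L",
    "Jane Doe:True:L",
    "Abe Linc:False:L",
]


def parse_tag_example(raw):
    """Split one tag example into (name, has_id, retention), tolerating stray spaces."""
    name, has_id, retention = (part.strip() for part in raw.split(":"))
    return name, has_id == "True", retention


def employees_missing_id(examples):
    """Return the names of employees whose government-id flag is False."""
    parsed = (parse_tag_example(raw) for raw in examples)
    return [name for name, has_id, _ in parsed if not has_id]


print(employees_missing_id(tag_examples))  # ['Abe Linc']
```

In a real system the same filter would more likely be a query against a metadata store, but the idea is the same: the tag, not the underlying storage format, is what makes the record findable.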
You may want to identify employees who are on a work permit, and therefore need to submit their passports and an additional piece of documentation to prove their eligibility to work in the United States. TABLE 3 explains how Data Inventory can help.
In TABLE 3, the second (user:government-id-driverlicense:L) and third (user:government-id-passport:L) tag formats allow for different kinds of IDs. Instead of a binary value like we saw in TABLE 2, we use regular expressions to describe the value of the tag.
TABLE 3: BASIC DATA INVENTORY TAGS
Tag Value format: (business|user):[a-z]+(-[a-z]+):L|F
Alternate Value format (e.g. GCP Label) †: (business|user)_[a-z]+(-[a-z]+):L/F
Tier | Business/User | Description | Tag Value | Tag example
Tier 1 | User | Govt identifiers | user:government-id-passport:L | Jerry Seinfeld: ^\d{10}:L
Tier 1 | User | Govt identifiers | user:government-id-driverlicense:L | Jerry McGuire: ^\d{9}:L
Tier 1 | User | Govt identifiers | user:government-id-passport:L | Jerry Tom: ^\d{10}:L
In TABLE 3, the first and third users match a request to identify employees with valid passports (on the assumption that passports have ten digits), while Jerry McGuire matches a user who still needs to supply a passport (because he provided a driver’s license, which has nine digits).
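A hedged Python sketch of that lookup follows. It reuses the article’s simplifying assumption that passports have ten digits and driver’s licenses nine. The stored ID values are made up for illustration, and the function name is hypothetical.

```python
import re

# Value patterns based on the article's simplifying assumption:
# passports have ten digits, driver's licenses have nine.
VALUE_PATTERNS = {
    "user:government-id-passport:L": re.compile(r"\d{10}"),
    "user:government-id-driverlicense:L": re.compile(r"\d{9}"),
}

# Hypothetical records: (employee, tag, stored ID value). The ID values are made up.
records = [
    ("Jerry Seinfeld", "user:government-id-passport:L", "1234567890"),
    ("Jerry McGuire", "user:government-id-driverlicense:L", "123456789"),
    ("Jerry Tom", "user:government-id-passport:L", "0987654321"),
]


def employees_with_valid(tag, rows):
    """Return employees whose record carries `tag` and whose value fits its pattern."""
    pattern = VALUE_PATTERNS[tag]
    return [
        name
        for name, row_tag, value in rows
        if row_tag == tag and pattern.fullmatch(value)
    ]


have_passport = employees_with_valid("user:government-id-passport:L", records)
need_passport = [name for name, _, _ in records if name not in have_passport]
print(have_passport)  # ['Jerry Seinfeld', 'Jerry Tom']
print(need_passport)  # ['Jerry McGuire']
```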
In this way, you can use Data Inventory to:
- Come up with tags that make your data searchable and map the data to privacy sensitivity
- Extend the tags to meet diverse business use cases
The above example is a simplified exercise in data inventory and real-world scenarios are more complex and more diverse. The key takeaway is that you’re far better off having to search for, process and delete data using the above inventory rather than searching for sensitive data in JSON blobs or other data formats. In that scenario, you may miss sensitive data or end up spending significant chunks of resources in the discovery process.
Data Inventory ties your privacy-centric understanding of your data (i.e. your Data Classification) to the data itself. This means that if you transfer your data from an on-premises environment to the Cloud, or from MongoDB to Cassandra, the data carries with it the identities and risk values you attached. This significantly helps manage the risk in a decentralized and bottom-up data-driven company.
Now that we have the tags ready to apply to the data, we can take a look at the next section, which describes how you can create a baseline (or specifically, starting point) for your Data Inventory before using automation.
Data Inventory: Creating a Baseline
For any organization, getting a handle on its data requires a mix of human effort and automation. The process of applying the tags involves a combination of both, partly because of the size of the data and partly because of its complexity.
Before you do that, you need a process to discover your data. This is critical because most companies start the Data Inventory process after a significant amount of data has already been collected. Although this represents a significant upfront expense, it also allows companies to build a baseline of their existing data. What we look at here is some initial legwork to collect whatever information is readily available but scattered across different teams, or held in the minds of engineers without being documented. All of this information is often referred to as “tribal knowledge.” Turning tribal knowledge into communal understanding is what we mean by “baseline.”
By way of a baseline for Data Inventory, engineers, data scientists and others can come up with models and approximations of what data they collected and where it lives. Although these initial results may turn out to be incomplete or incorrect or both, this process can be useful in capturing known use cases and building ML models for additional discovery.
It helps to have a template/checklist to capture this information. I always recommend doing this pre-inventory by inspecting your data storage along two dimensions:
- Data inventory by storage systems
- Data Inventory by data owner
In preparing your teams to inventory their data by storage system, you want to hand them a template that helps them record what they find in their first manual inventory of the systems they can account for.
For each storage system (e.g. Hive, Vertica, Kafka, SQL database, S3 bucket, etc.), data should be inventoried using the following attributes:
- Total size (storage volume)
- Structured/unstructured data by %
- Data classification tier (as we discussed earlier). NOTE: If your storage unit has data with multiple classifications, you should apply the highest-risk tier
- Whether the unit contains personal data
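One way to hand teams such a template is as a small record type. The Python sketch below is only an illustration: the class name, field names and sample values are hypothetical, and it assumes that Tier 1 is the highest-risk tier, so that the lowest tier number found in a system is the one applied to it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StorageSystemRecord:
    """One row of a manual, per-storage-system baseline inventory (hypothetical shape)."""

    name: str                     # e.g. "Hive", "Kafka", "S3 bucket"
    total_size_gb: float          # total storage volume
    structured_pct: float         # share of structured data, 0 to 100
    tiers_found: List[int]        # every classification tier seen in the system
    contains_personal_data: bool

    @property
    def unstructured_pct(self) -> float:
        return 100.0 - self.structured_pct

    @property
    def classification_tier(self) -> int:
        # Assumption: Tier 1 is the highest-risk tier, so the smallest number wins.
        return min(self.tiers_found)


# Sample values are invented for illustration.
orders_db = StorageSystemRecord(
    name="SQL database (orders)",
    total_size_gb=250.0,
    structured_pct=90.0,
    tiers_found=[3, 1, 2],
    contains_personal_data=True,
)
print(orders_db.classification_tier, orders_db.unstructured_pct)
```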
It’s insufficient to inventory your data by storage system. Often, storage systems are owned by multiple stakeholders. You may also find that some storage systems aren’t owned by anyone, yet multiple engineers use them to store data.
In order to get an accurate view of your systems, you want to inventory your data by data owner as well. The attribute checklist for that could look like the following:
- Total Size (storage volume)
- Unit count (# of services/users/accounts or datasets)
- Structured vs. Unstructured
- Data classification tier (as we discussed earlier). NOTE: If your storage unit has data with multiple classifications, you should apply the highest-risk tier
- Whether the unit contains personal data
Once these initial baselines are complete, you get a sense of which business unit owns what percentage of privacy-sensitive data, and the systems said data lives in. This mapping is critical, and I’ve on occasion discovered data and systems that went undetected by automation; sometimes the one reclusive engineer knows of an S3 bucket where a table that maps home addresses to food deliveries lives.
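To show what such a mapping might look like once the baselines are collected, here is a minimal Python sketch. The business units, storage systems and sizes are invented for illustration, and the function name is hypothetical. It computes each owner’s share of the personal-data volume and lists the systems that hold it.

```python
from collections import defaultdict

# Hypothetical baseline rows: (owner, storage system, size in GB, contains personal data).
baseline = [
    ("Delivery", "S3 bucket", 400.0, True),
    ("Delivery", "Kafka", 100.0, False),
    ("Payments", "SQL database", 600.0, True),
    ("Marketing", "Hive", 900.0, False),
]


def personal_data_share(rows):
    """Return {owner: (percent of all personal-data volume, systems holding it)}."""
    size_by_owner = defaultdict(float)
    systems_by_owner = defaultdict(set)
    for owner, system, size_gb, has_personal_data in rows:
        if has_personal_data:
            size_by_owner[owner] += size_gb
            systems_by_owner[owner].add(system)

    total = sum(size_by_owner.values())
    if total == 0:
        return {}
    return {
        owner: (round(100 * size / total, 1), sorted(systems_by_owner[owner]))
        for owner, size in size_by_owner.items()
    }


shares = personal_data_share(baseline)
print(shares)
```

A summary like this only reflects what teams reported in the manual baseline, so it is a starting point to be checked against later automated discovery rather than a definitive picture.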