From MLOps Engineering at Scale by Carl Osipov


Take 40% off Cloud Native Machine Learning by entering fccosipov into the discount code box at checkout at manning.com.


Getting started with data quality

The goal of this article is to teach you the data quality criteria you should use across any machine learning project, regardless of the dataset. This means that this article deals primarily with concepts rather than code.

Let’s face it: data cleanup is an important topic in machine learning, but not the most exciting one. So to make the data quality principles more concrete, easier to remember, and hopefully more interesting, this article relies heavily on real-world case studies and data cleanup examples you can apply to your next machine learning project.

From “garbage-in garbage-out” to data quality

“Garbage in, garbage out” is a well-known cliché in the information technology industry. In the context of this book it means that if the input into your machine learning system is garbage, then the machine learning algorithms will train on garbage, and the outputs of machine learning will be garbage as well. The cliché points to the importance of data quality for a machine learning project but it does not prove that “garbage-in garbage-out” is critical or relevant to real-world data analysis and machine learning.

In 2010, as the global economy was still recovering from the financial meltdown that took place a few years prior, two Harvard economists, Carmen M. Reinhart and Kenneth S. Rogoff, published a research paper arguing for policies that could help countries to get their economies growing again. In the paper, the economists presented an analysis demonstrating that countries that incur debt of over 90% of their gross domestic product (GDP) face economic declines. In part based on the analysis from the economists, some European Union (EU) countries adopted harsh austerity measures, slashing salaries and eliminating thousands of jobs. As it turned out, the data used by the analysis was wrong.

The politicians who based their policies on the Reinhart-Rogoff results fell victim to the classic “garbage-in garbage-out” problem. The Reinhart-Rogoff fiasco is just one instance of many where an analysis of poor quality data led to billions of dollars’ worth of negative consequences. Some sources, such as the highly respected Harvard Business Review magazine, published a notable (though likely exaggerated) claim that the total cost of bad data to the U.S. economy should be measured in trillions of dollars.

The issue of data quality is important, but to you as a machine learning practitioner, it may not be immediately obvious that you are working with a poor quality dataset. How do you know if your data is garbage or if it is of sufficient quality to perform machine learning?

Before starting with data quality

Before you can begin to work on data quality you need more than just a structured dataset. You need to know the answers to the kinds of questions that you have already answered about the DC taxi data:

  • Can the dataset be queried as one or more tables of rows and columns? In other words, are you querying data stored using a structured dataset format (e.g. CSV)? Since VACUUM, the set of data quality principles described later in this article, is defined for structured datasets, it does not apply to unstructured formats used for natural language text, images, audio, and video.
  • What questions do you need to answer based on which columns? The DC taxi dataset-based machine learning example in this book is built around the question “What is the value of the fare amount column, given that you know the start time of a DC taxi trip, as well as the latitude and longitude coordinates for the pickup and drop-off locations of the trip?” Knowing the questions you wish to ask about the data also helps you understand the essential data in your dataset, in other words, the data in scope for training your machine learning system to answer the questions. In addition to the essential data, your dataset may also contain reference data that is useful for ensuring the quality (specifically the accuracy) of your essential data but does not need to be cleaned up with the same degree of rigor as your essential data. For example, the values in the mileage column of the DC taxi dataset are not essential to answering the question but are useful as a reference to compare against the values of the fare amount column and to ensure that the fare amount values have the right degree of data quality.
  • What is the schema for the essential data? Before you can begin cleaning up the dataset, you need to create a data schema in the catalog that ensures the data values are specified using appropriate data types and constraints. The data schema specifies the data type for every column of a dataset. Yet, while the data type specifications are necessary for the schema to help ensure data quality, they are not sufficient. For every data type, you should also be able to specify whether it is “nullable.” Here, data type nullability is equivalent to DDL (data definition language) nullability and indicates whether a value is allowed to be missing. You should also specify any constraints that further limit the range of possible values: in the case of string types, these can include regular expressions, while in the case of integer types, they can be specified using interval ranges. The upcoming section on valid data illustrates the constraints using practical examples.

This schema in the catalog is similar to a data definition language (DDL) schema (part of the SQL standard) which describes data types such as integers, floats, timestamps, and more. Keep in mind that the discovered schema may or may not be the right schema to use.

So what does it mean to have the “right schema”? More precisely, what does it mean for a schema to consist of data types that are appropriate for the dataset’s values? Just as with DDL schemas, the choice of the appropriate data types is a tradeoff. On one hand, the schema should use data types that are sufficiently general to preserve the data values without loss of information. On the other hand, the data types should support (without type casting) expected operations on the data values while making efficient use of storage space. For example, the latitude and longitude coordinates from the DC taxi dataset should be specified in the schema as floating point values (DOUBLE data type) instead of Unicode strings so the coordinate values can be used for distance calculations.
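
As an illustration, here is a minimal, hypothetical Python sketch of how such schema choices (data types, nullability, and constraints) could be written down for a few DC taxi columns; the column names, bounds, and formats shown are simplified assumptions for this article rather than the actual schema used with the dataset.

    # A minimal, hypothetical schema sketch for a few DC taxi columns.
    # Each column gets a data type, a nullability flag, and an optional
    # constraint that further limits the range of valid values.
    DC_TAXI_SCHEMA = {
        "fareamount": {
            "type": float,              # DOUBLE rather than STRING: supports arithmetic
            "nullable": False,
            "constraint": ("interval", 0.0, 250.0),   # assumed upper bound for a DC fare
        },
        "origin_latitude": {
            "type": float,
            "nullable": False,
            "constraint": ("interval", 38.0, 39.5),   # rough bounding box around DC
        },
        "origin_longitude": {
            "type": float,
            "nullable": False,
            "constraint": ("interval", -78.0, -76.0),
        },
        "start_timestamp": {
            "type": str,                # TIMESTAMP stored as an ISO-8601 string
            "nullable": False,
            "constraint": ("regex", r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
        },
    }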

Normative principles for data quality

This section is about the principles behind the Valid, Accurate, Consistent, Uniform, and Unified Model (VACUUM) for structured data quality, along with cautionary tales that serve as case studies. The principles are normative, meaning that they define what quality data ought to be like instead of prescribing the specific steps or code for the implementation of data quality processing. The value of the principles is in providing a comprehensive and rigorous definition of what it means for data to comply with VACUUM and thus be sufficiently “clean” and ready for machine learning.

Think of the VACUUM principles as a checklist of guidelines, criteria, or metrics of data quality that you should explore as part of your machine learning project. Keep in mind that doctors and pilots (as well as many other professionals) use checklists, but having a checklist will not make you a pilot or a doctor. If you are planning to develop professional expertise in data quality, you will need to develop your data cleaning skills. Once you have the right experience in place, a checklist can help jog your memory and ensure that you do not miss important data quality aspects.

VALID

On January 31st, 2020, the United Kingdom left the EU. So, should an EU data warehouse store the string value United Kingdom as a valid value in a column with names of EU member countries?

You could argue that beginning February 1st, 2020, United Kingdom should stop being a valid data value in any column mentioning EU member states. However, this approach is counterproductive: excluding United Kingdom from the set of valid values means that any historical data associated with the column, in other words, any records dated prior to February 1st, 2020, are associated with a value that is not valid. So, if a value in a dataset was valid at any point of the dataset’s existence, it should remain valid.

This definition does not specify whether a combination of multiple valid values across multiple columns is valid: this issue will be addressed in the upcoming section on accuracy.

More precisely, a data value in a column is valid if it:

  • matches the column data type specified by the schema. SQL-based data type definitions in a schema may include:
    • INTEGER, for example in a column storing an elevator floor number
    • DOUBLE, for example, a percentage of users who click on a “Subscribe” button on a website
    • TIMESTAMP, for example, the time when an order was placed on a website
    • BOOLEAN, for example, whether a taxi trip ended at an airport
    • STRING, for example, the text of comments left in a comment box on a survey
  • satisfies the constraints, if any, specified for the column; these can include one or more of the following:
    • nullability: this constraint applies to any data type and specifies whether a value in a data column is allowed to have a NULL value. For example, a TIMESTAMP data value storing the date of birth in a driver’s license database must be non-nullable (in other words, it should not be permitted to have a NULL value), while a user’s Twitter handle on a customer profile web page can be specified as nullable to handle the cases when the handle is unknown or not specified. Nullable data types can also include INTEGERs (for example, a rating of a taxi ride by a passenger on a scale of 1-5, with a NULL value representing no rating), and other data types.
    • enumeration: this constraint applies to any data type and specifies a validation set, a dictionary, or an enumeration of valid values for a data type. In the case of STRING values, the enumerations may include the names of US states, or major airports in the New York City area, such as {LGA, JFK, EWR}. The enumeration constraint for a schema may specify an INTEGER data type for a country code column in a dataset of phone numbers using an enumeration of valid country phone codes. Recall from the example in the beginning of this section that the enumeration must include all values that have ever been valid for the column. So, in any dataset older than February 1st, 2020, in a data column that stores European Union country names, United Kingdom is a valid value, regardless of the fact that the UK brexited from the EU on January 31st, 2020.
    • range: this constraint is data type specific and can be one of the following types:
      • interval constraint, used for numeric or date/time data types. As an example of valid integer data values, consider a dataset with an activity log for a single elevator in a skyscraper. One of the data columns in the dataset stores the floor number where the elevator stopped. Since not all floors in this hypothetical skyscraper are accessible by the elevator and the numbering system skips the 13th floor due to superstitions, the constraints on possible values include intervals from -3 to -1 for the parking garage, 1 to 12, and 14 to 42. Typical notation for these intervals is [-3, -1] or (0, 12] or [14, 42], where the square brackets [] indicate that the value is included in the interval while the parentheses () indicate that the interval does not include the value neighboring the parenthesis. The or keyword in this case represents the set union operation (in other words, a logical or).
      • A similar approach is used when working with DOUBLE and other floating point data types. For example, a probability value can be specified with an interval range constraint from 0.0 to 1.0, written [0.0, 1.0].
      • Intervals are also common with TIMESTAMP data types for date/time ranges, where they are used to describe periods such as workdays, weekends, or holidays, for example, the dates [2020-01-01 00:00:00, 2021-01-01 00:00:00].
      • regular expression constraint is used in the case of the STRING data type to specify the space of valid values. For example, in a database that stores Twitter handles of social media influencers, a regular expression can specify that any value that matches /^@[a-zA-Z0-9_]{1,15}$/ is valid. Keep in mind that regular expressions also apply to many data columns that appear numeric; for instance, IP addresses consist primarily of numbers but are commonly stored as strings.
    • rule constraint applies to any data type and specifies the conditions used to decide whether a value is valid. For example, if you have ever used a “Save my payment for later” feature on a website to permit a PCI-DSS compliant vendor to store your credit card number, you should know that a rule constraint for a credit card number is based on the Luhn algorithm, which computes a checksum that ensures a credit card number is well formed. The sketch after this list illustrates several of these constraint checks in code.
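
To make the constraint types concrete, the following is a minimal Python sketch of per-value validity checks corresponding to the enumeration, interval, nullability plus regular expression, and rule constraints described above; the specific column rules (airport codes, elevator floors, Twitter handles, credit card numbers) are simplified examples drawn from this section, not a general-purpose implementation.

    import re

    def is_valid_airport(value):
        # enumeration constraint: major New York City area airports
        return value in {"LGA", "JFK", "EWR"}

    def is_valid_floor(value):
        # interval constraint: [-3, -1] or (0, 12] or [14, 42]
        return (-3 <= value <= -1) or (0 < value <= 12) or (14 <= value <= 42)

    def is_valid_twitter_handle(value, nullable=True):
        # nullability plus regular expression constraints
        if value is None:
            return nullable
        return re.fullmatch(r"@[a-zA-Z0-9_]{1,15}", value) is not None

    def is_valid_credit_card(value):
        # rule constraint: the Luhn checksum for credit card numbers
        digits = [int(d) for d in str(value)][::-1]
        total = 0
        for i, d in enumerate(digits):
            if i % 2 == 1:
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    assert is_valid_floor(7) and not is_valid_floor(13)
    assert is_valid_twitter_handle("@dc_taxi_rider")
    assert is_valid_credit_card("4111111111111111")   # a well-known test card number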

At this point, you have seen the criteria and examples illustrating what it means for a single value in a dataset to be valid or invalid. However, it is straightforward to come up with an example of a record consisting entirely of valid values but with an obvious data quality issue. Here’s a made-up record from a hypothetical dataset listing countries on a continent.

Continent        Country
South America    United States

Both South America and United States are valid values for their respective columns, since the valid principle ensures that data quality problems are addressed on a per-column basis, with validation checks designed to catch data quality issues within a single value. To address the data quality issue in this example you will need to learn about the accuracy principle.

ACCURATE

When you learned about valid data, you looked at an example of a dataset of records about member states of the EU. As part of the example, you saw that United Kingdom is a valid value for the EU country column. Suppose you are working with data records consisting of two columns: the first with the date of the record and the second with the name of the member state:

Date of Record   Member State
2020-01-31       United Kingdom
2020-02-01       United Kingdom

While all of the values in the example are valid, it is impossible to assert that the second row in the dataset is garbage without using a reference data source that is external to the data record. The reference should be able to process the values from the entire record and indicate whether or not (or to what extent) the record is inaccurate.

More precisely, a data record is accurate if all the data values that are part of the record are valid and the combination of the values in the record is reconciled to a reference data source. By way of an example, consider a database of college alumni, with the date of alumni college enrollment and the date of their college graduation. Checking the database for accuracy requires references to external sources of truth, for example an admissions database and a transcript database. In financial records, inaccuracy can be due to a mismatch between a credit card number and a PIN code. Sometimes accuracy problems arise when joining multiple tables incorrectly, for example, in a data record stating that the movie Titanic was produced by Guy Ritchie in 1920.
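
As a sketch of what reconciliation against a reference data source can look like in code, the following hypothetical Python example checks EU membership records against a made-up reference table of membership periods; the reference values and the structure of the table are assumptions for illustration only.

    from datetime import date

    # Hypothetical reference data source: membership periods per country.
    # An end date of None means the country is still a member.
    EU_MEMBERSHIP = {
        "United Kingdom": (date(1973, 1, 1), date(2020, 1, 31)),
        "Germany": (date(1958, 1, 1), None),
    }

    def is_accurate(record_date, member_state):
        # a record is accurate if it reconciles with the reference source
        if member_state not in EU_MEMBERSHIP:
            return False
        start, end = EU_MEMBERSHIP[member_state]
        return start <= record_date and (end is None or record_date <= end)

    assert is_accurate(date(2020, 1, 31), "United Kingdom")       # valid and accurate
    assert not is_accurate(date(2020, 2, 1), "United Kingdom")    # valid but inaccurate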

Accuracy assurance for values such as domain names is a particularly difficult category because the reference data sources, the domain registration and DNS databases, change over time. For example, if you create an email mailing list and check the domain name part of each email using a regular expression, the data in the list may be valid but not accurate in the sense that some of the emails do not map to existing domain names. So you may attempt to send an email to an address from the mailing list to confirm that the domain name and the email resolve to an accurate address. Even prior to sending the email, you may attempt to perform a DNS lookup to verify the accuracy of the domain name. You can read more about DNS-related error messages and how to fix them on the blog pages of WP Buffs.
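
A minimal Python sketch of the difference between a valid and an accurate email domain might look like the following; it uses a simplified regular expression for validity and treats a successful DNS lookup of the domain as a rough proxy for accuracy (a more thorough check would query the domain’s MX records).

    import re
    import socket

    def extract_valid_domain(email):
        # validity: the domain part matches a simplified regular expression
        match = re.fullmatch(r"[^@\s]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})", email)
        return match.group(1) if match else None

    def domain_is_accurate(domain):
        # accuracy (approximate): the domain actually resolves in DNS
        try:
            socket.getaddrinfo(domain, None)
            return True
        except socket.gaierror:
            return False

    domain = extract_valid_domain("rider@nosuchdomain-for-illustration.example")
    print(domain is not None)            # True: the value is valid
    print(domain_is_accurate(domain))    # likely False: not a real domain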

In the example of the UK leaving the EU, improving the quality of the data in the dataset means that there must exist a reference data source with the master record of the timestamps for the start and end dates of an EU state membership. However, for many organizations the challenge with reference data sources isn’t that there are too few of them, but rather that there are too many. The next section on consistency will illustrate this problem with more examples.

CONSISTENT

In January of 2005, the owner of a small house in Valparaiso, a town of about 30 thousand residents in Porter County, Indiana, received a notice that the annual tax assessment value of his house was set at $400 million. The notice, which also included a demand for a tax payment of $8 million, came as a surprise to the owner of the modest house since just the year prior the tax payment amounted to $1,500. Although the issue with data accuracy was soon resolved, the story did not end there.

The data systems of Valparaiso did not follow the data quality consistency principle, so the original data accuracy problem propagated into the budgeting system for the town. The small town’s budget was drawn up with the assumption of an $8 million tax payment, so the town had to claw back $3.1 million from schools, libraries, and other budget-funded units. That year, Porter County ended up with many unhappy students and parents as the schools had to cover a $200,000 budget shortfall.

Consistency issues arise when different and conflicting validation and accuracy implementations are used across different data silos: databases, data stores, or data systems. While each individual silo can be valid and accurate according to a silo-specific set of definitions, ensuring consistency means ensuring a common set of standards for valid and accurate data before integrating the data from systems across silos that span different technology and organizational boundaries.

For example, was the UK a member of the EU at 11:30pm on January 31st, 2020? If you are not mindful of data quality, the answer may depend on your dataset. In a UK dataset, you can expect a valid and accurate record showing that the UK was not an EU member country at 11:30pm on January 31st, 2020. Yet, in an EU dataset, an identical combination of the date, time, and country name values is an accurate record for an EU member country. As you have probably guessed, the inconsistency is due to different assumptions about storing the date and time values across different datasets. The UK dataset in this example uses Greenwich Mean Time (GMT) while the EU dataset uses Central European Time (CET).
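
A minimal Python sketch of this inconsistency (assuming Python 3.9 or later for the zoneinfo module) shows how the same local date and time maps to two different moments, and how normalizing to UTC before integrating datasets removes the ambiguity:

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # The same local date/time value, interpreted under two different assumptions
    uk_record = datetime(2020, 1, 31, 23, 30, tzinfo=ZoneInfo("Europe/London"))
    eu_record = datetime(2020, 1, 31, 23, 30, tzinfo=ZoneInfo("Europe/Paris"))

    # Normalizing to UTC shows the records describe different moments in time
    print(uk_record.astimezone(ZoneInfo("UTC")))   # 2020-01-31 23:30:00+00:00
    print(eu_record.astimezone(ZoneInfo("UTC")))   # 2020-01-31 22:30:00+00:00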

Even when joining tables within a single dataset or a single data silo, it is important to ensure consistency of validation and accuracy rules. Typical issues arise when phone numbers and emails are used as unique identifiers for a user: since phone numbers and emails may change owners, joining tables based on these values can lead to problems. Other issues come from different approaches to storing the same identifier: some systems store phone numbers with a country code while others do not, some systems use the 5+4 ZIP code format while others use only the five-digit ZIP code, and different systems may assign different primary keys to the same individual, sometimes requiring a new shared key before their tables can be joined.
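
A minimal sketch (with made-up helper functions and assumed identifier formats) of normalizing such identifiers to a common representation before joining tables:

    def normalize_phone(phone, default_country_code="1"):
        # keep digits only and ensure a country code is present (assumed US format)
        digits = "".join(ch for ch in phone if ch.isdigit())
        if len(digits) == 10:                   # no country code provided
            digits = default_country_code + digits
        return "+" + digits

    def normalize_zip(zip_code):
        # reduce 5+4 ZIP codes to the five-digit form before comparison
        return zip_code.split("-")[0].strip()

    # Records from two silos that should join on the same individual
    assert normalize_phone("(202) 555-0175") == normalize_phone("+1 202 555 0175")
    assert normalize_zip("20001-4301") == normalize_zip("20001")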

UNIFORM

The Mars Climate Orbiter, a $125 million robotic space probe, was launched by NASA to Mars in 1998. Less than 12 months later, during an orbit change maneuver, it skidded off the atmosphere of Mars and was gone. The designers of the orbiter integrated two systems where one was using U.S. customary (imperial) units of measurement and the other was based on SI (metric) units. Since the measurement values used by the orbiter were not uniform across multiple data records, NASA wasted $125 million of its budget.

Another example of a lack of uniformity within a table can come from surveys. Suppose one taxi company prompts passengers to rate each ride on a scale of 0-4 stars, while another taxi company uses a 1-5 star scale and records 0 when a passenger does not respond. If average satisfaction scores per driver are computed daily and inserted into a shared dataset, the scores are not uniform across the rows, and drivers from one company are consistently penalized relative to the other. It is critical that every data value inserted into a column of a dataset is produced in a uniform way. Problems with uniformity typically appear when inconsistent tables are joined together or when rows based on inconsistent measurement scales are inserted into a dataset.
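
As an illustration, here is a minimal sketch, under the assumed scale definitions above, of converting the ratings from the two hypothetical taxi companies to a common 0.0 to 1.0 scale before the averages are inserted into a shared dataset:

    def rescale_company_a(stars):
        # company A: ratings on a 0-4 star scale
        return stars / 4.0

    def rescale_company_b(stars):
        # company B: 0 means "no response", ratings on a 1-5 star scale
        if stars == 0:
            return None                 # treat a non-response as a missing value
        return (stars - 1) / 4.0

    # The same level of satisfaction now maps to the same uniform value
    assert rescale_company_a(4) == rescale_company_b(5) == 1.0
    assert rescale_company_a(0) == rescale_company_b(1) == 0.0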

UNIFIED

In a 1912 book, the influential logician and philosopher Bertrand Russell described a fable to illustrate the problem with inductive reasoning, a fundamental principle behind machine learning. Paraphrasing Russell, here’s the story in a nutshell:

“On December 1st, a turkey was born in the United States. It was no ordinary turkey but rather the most intelligent turkey that ever lived. The genius turkey soon figured out the patterns of the night sky, the role of the sun in casting shadows, and realized that it was fed at 7am every day. Reasoning that the food was critical to its wellbeing, the turkey proceeded to ask itself whether it would be worthwhile to plot an escape, risking hunger and death, or whether it would be safer to remain a well-fed captive. Having re-invented statistics, the rational turkey gathered data, developing increasing confidence that it would be fed every day at 7am regardless of the position of the sun, the moon, and the stars, the temperature, precipitation, and other factors. But on the morning of Thanksgiving Day, its head was chopped off.”

The story (which is meant to be facetious rather than tragic) is here to help you remember that the machine learning models you create, no matter how complex, are little more than digital versions of Russell’s turkey. Their success is based solely on their capacity to take advantage of the data available to them. In contrast, as a machine learning practitioner, you can make your machine learning project more successful through curiosity and deductive reasoning: by discovering facts and datasets that are novel but relevant to the project’s use case, unifying the discovered information with the dataset on hand, and expanding the scope of the training data available to train models. You can also help minimize the risks to the machine learning project’s success by discovering and addressing potential sources of non-obvious systemic bias in the training dataset, and by unifying, or bringing into alignment, the cultural values of the project’s operating environment and the contents of the dataset.

While you can rely on machine learning models to perform effective inductive reasoning on an existing dataset, it is your responsibility to ensure that the dataset follows the unified principle, meaning that it:

  1. is a single place for the data relevant to your project’s machine learning use case(s).
  2. aligns the criteria your use case applies for unbiased, data-driven decision making with the content of the data you are using for machine learning model training.
  3. depends on a common data quality process for the data used for machine learning model training and for the data used with a trained machine learning model.

The unified principle is a part of VACUUM to remind you that data quality is a journey and not a destination.

That’s all for now.

To practice applying these data quality principles in a Jupyter notebook, you may continue on to the following example: https://github.com/osipov/smlbook/blob/master/ch3.ipynb

If you want to learn more about the book, check it out on our liveBook platform here.