By Tryggvi Björgvinsson
Everywhere we go we are surrounded by data, but it’s not always simple to interpret. Not only that, it’s also easy to misinterpret data entirely. This article, adapted from chapter 1, introduces the idea of “data usability” and explains why it matters in today’s data-rich world.
Should it be possible to convict you of a crime because you don’t have enough knowledge? Are you a criminal for making a wrong prediction? Those questions arose in a regional court case in Italy, where science itself was put on trial following a big earthquake that destroyed the city of L’Aquila and killed 309 people in April 2009. The tragedy of those deaths was compounded in 2012 when six scientists and a government official, all members of the National Commission for the Forecast and Prevention of Major Risks, were accused of giving falsely reassuring statements prior to the quake.
It took the judge only four hours to reach a verdict. All seven individuals were found guilty of multiple manslaughter for inadequately warning residents with inaccurate, incomplete, and contradictory information. The convicted were left confused: they didn’t know exactly what they’d been convicted for, but they knew they’d been sentenced to six years in prison and barred from ever holding public office again. Scientists around the world were appalled, afraid the ruling would set a dangerous precedent. Never again would knowledge relating to uncertainty be shared with the public; scientific progress would grind to a halt in more fields than seismology. Thankfully, the conviction was overturned two years later, and all of the convicted were cleared in an appeal that was confirmed by the Italian Supreme Court in November 2015.
Being able to make correct predictions is the result of a seemingly endless number of small insights and improvements, made over the course of many years, each one putting us in a better position than before. But all of these incremental improvements allow us to make better predictions, not to speak in certainties.
Base future improvements on past insights
Like all of us, scientists can be wrong or make mistakes. Another interesting example of scientists being inaccurate dates from January 1973. What makes this example different is that the mistake was used as a basis for improvements instead of as grounds for a court case. This event took place in Iceland and, much like with L’Aquila, it began with earthquake sensors detecting increased seismic activity. Earthquake sensors are expensive, and they were even more expensive back in 1973; you couldn’t put as many of them as you wanted wherever you wanted. This was a problem because the number of sensors matters, and the cost limited how many were put into use. In the 1973 event, that limitation risked more than 5,000 human lives.
It started when increased seismic activity was detected by two earthquake sensors in south Iceland. The sensors record earthquake waves that travel through the Earth’s crust from the point of origin to the sensor. By analyzing the time it takes the waves to reach different sensors, you can estimate the point of origin (figure 1). The analysis of the data from the two sensors in 1973 indicated two possible points of origin: an active earthquake area called Veiðivötn, or a volcano in an archipelago off the southern coast of Iceland called Vestmannaeyjar.
Figure 1. Two possible points of origin with two sensors
Veiðivötn is a more active seismic area than Vestmannaeyjar, and the scientists predicted that Veiðivötn was the point of origin. Their focus was therefore on monitoring the Veiðivötn area closely, but as you can probably guess, they were monitoring the wrong area. The volcano on the inhabited island in the Vestmannaeyjar archipelago erupted.
The eruption threatened the more than 5,200 inhabitants of the small 5.2-square-mile island where the volcano erupted. Even though the majority of inhabitants had gone to sleep when it started erupting around midnight, they were all able to leave their homes and reach safety on the mainland.
The rescue efforts were successful, and they marked the first time in history that lava flow was diverted by human intervention. To save the prosperous harbor of Vestmannaeyjar, seawater was hosed onto the flowing lava, redirecting the current away from the harbor.
In contrast to the L’Aquila case, the scientists weren’t put on trial. Instead, they analyzed their predictions and realized they could improve them. The problem wasn’t that the data was wrong but that it was incomplete. They needed more earthquake sensors (figure 2) to improve the quality of their predictions.
Figure 2. An extra sensor identifies the correct point of origin
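The geometry behind figures 1 and 2 can be sketched in a few lines of Python. This is a simplified 2-D model with invented coordinates and distances, not how seismologists actually locate earthquakes, but it shows why two sensors leave an ambiguity that a third resolves:

```python
import math

def circle_intersections(p1, r1, p2, r2):
    """Two sensors, each knowing only its distance to the quake, define two
    circles. Their intersections are the two candidate points of origin."""
    (x1, y1), (x2, y2) = p1, p2
    d = math.hypot(x2 - x1, y2 - y1)
    a = (d**2 + r1**2 - r2**2) / (2 * d)   # distance from p1 to the chord's midpoint
    h = math.sqrt(r1**2 - a**2)            # half-length of the chord
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    return ((mx + h * (y2 - y1) / d, my - h * (x2 - x1) / d),
            (mx - h * (y2 - y1) / d, my + h * (x2 - x1) / d))

def pick_origin(candidates, sensor3, r3):
    """A third sensor resolves the ambiguity: the true origin is the candidate
    whose distance to it best matches the third distance estimate."""
    return min(candidates,
               key=lambda c: abs(math.hypot(c[0] - sensor3[0],
                                            c[1] - sensor3[1]) - r3))

# Two sensors on a baseline; the true origin is at (5, 4).
candidates = circle_intersections((0, 0), math.sqrt(41), (10, 0), math.sqrt(41))
# candidates are (5, -4) and (5, 4): two possible points of origin, as in figure 1.
origin = pick_origin(candidates, sensor3=(5, 10), r3=6.0)  # the extra sensor of figure 2
```

With only the first two sensors, both candidates are equally plausible; the scientists in 1973 were effectively forced to bet on one of them, and they picked the historically more active one.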
Consider these two different approaches to improvement. The approach of the L’Aquila trial was to punish a lack of quality, an encouragement equivalent to that given by Homer Simpson when he told his kids, “You tried your best and you failed miserably. The lesson is, never try.”
The approach taken after the Vestmannaeyjar eruption is far more likely to be fruitful: analyze the mistake. You look at what you’ve done and what data you have, and from that you try to figure out how to improve in the future (in this case, by adding more sensors). This approach increases quality by encouraging improvements instead of pushing people away from them.
Why am I telling you this? What does this have to do with data usability? Well, it’s because the key ingredient of data usability is quality. The secret behind the art of data usability is to improve data quality; I’d even go so far as to say that the two terms can be used interchangeably. The difference is that the term data usability reminds us of what we’re trying to achieve (useful data) and data quality tells us how we’re going to achieve it (by improving quality). With that in mind, let’s give data quality (the underpinning of data usability) some further attention.
Creating the perfect world
I want you to hold two images in your mind. First, imagine a community; it can be any community you like: your online forum, a town, some squatters; pick your favorite community. Keep that community in your mind. Now, think about ice cream. As with the community, you can think about any ice cream: a particular brand, flavor, or texture you like.
Next, let’s think about how quality applies to these two separate things: community and ice cream. Ask yourself these two questions:
What is a quality community?
What is a quality ice cream?
You’re probably not going to answer those questions in the same way.
A quality community could be a community where everyone treats people with respect. It could also be a place where each community member contributes to the overall wellbeing or a group of people living off the land in a sustainable way.
Can these values be applied to ice cream? Would a quality ice cream treat people with respect? Is a quality ice cream all about contributing to the overall wellbeing of people? Would a quality ice cream live off the land in a sustainable way?
Probably not (I’d say that ice cream contributes to the overall wellbeing of people, but let’s not get sidetracked).
We probably regard quality ice cream as being made from natural ingredients, rich in flavor, or as ice cream that doesn’t melt instantly. Maybe all of these things, but in some prioritized order.
The point is that our definition of quality varies, depending on the product, situation, and who we are. That doesn’t mean we define quality differently. We have different expectations, and quality is all about expectations. The definition of a quality community depends on the expectations of a community member. Quality is therefore a highly personal or circumstantial metric.
What exactly is quality?
If you manage a dataset, you’re responsible for making sure that it’s of high quality and that it delivers what people expect it to deliver. You must make sure that you can provide quality data in various situations, for various projects and events, and to various people. You always strive for more quality in the data or the information you deliver. Datasets can be used, either directly or as information, by different groups of people.
A bookstore might collect sales figures for books into a single dataset used by different people. For example, consider the following:
To stock up, the inventory manager needs the dataset to see what books are selling.
The business manager needs the dataset to see what types of books are selling to better focus the business strategy.
Buyers might want to see a list of top-selling books because they don’t want to miss out on good books.
Advertisers may want to see a more condensed list of top-selling books to put into ads and draw people to the book store.
Each data consumer must look at the dataset in context. The inventory manager will have to mix sales figures data with inventory data. It’s of no use to stock up on books when you have plenty in your inventory. The business manager needs to analyze the market, competitors, prices, and more alongside the sales figures. Buyers might want to see top-selling lists based on book categories; it doesn’t help someone interested in science fiction to see a list of top-selling books that contains only thrillers and recipe books. Advertisers may want to mix their shortlist with the international top-selling lists, book awards, or book reviews. All those demands revolve around the same dataset.
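To make the idea of one dataset serving many consumers concrete, here is a minimal Python sketch. The titles, categories, and sales figures are invented for illustration; the point is only that each consumer derives a different view from the same shared data:

```python
from collections import Counter

# One shared sales dataset, viewed differently by each consumer.
sales = [
    {"title": "Dune",         "category": "sci-fi",   "copies_sold": 120},
    {"title": "Gone Girl",    "category": "thriller", "copies_sold": 95},
    {"title": "Neuromancer",  "category": "sci-fi",   "copies_sold": 80},
    {"title": "Simple Pasta", "category": "cooking",  "copies_sold": 60},
]

# Business manager: sales broken down by category, to guide strategy.
per_category = Counter()
for book in sales:
    per_category[book["category"]] += book["copies_sold"]

# Buyer interested in science fiction: top sellers within that category only.
sci_fi_top = sorted((b for b in sales if b["category"] == "sci-fi"),
                    key=lambda b: b["copies_sold"], reverse=True)

# Advertiser: a condensed overall shortlist for an ad.
shortlist = [b["title"]
             for b in sorted(sales, key=lambda b: b["copies_sold"], reverse=True)[:2]]
```

None of these views changes the underlying dataset; quality problems in it would ripple into every one of them.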
It’s a complex world and your work as a data quality manager is to create the perfect world for your data consumers. You may now be asking yourself again: what exactly is data quality? Remember: you’re not only thinking about quality of data, you’re also improving quality of information and knowledge. Quality is always about the same thing, irrespective of whether you’re going for data, information, knowledge, community, or ice cream quality. It’s all about what the user wants.
It might be useful to think of quality as having two dimensions:
Required quality: how well something fulfills the users’ requirements.
Attractive quality: how well something exceeds the users’ expectations.
What I’ll be referring to as quality includes both those dimensions: how well the data, information, or knowledge fulfills or exceeds the requirements and expectations of the data users. You won’t be able to say what the perfect ice cream is (the best quality) or what would make your community great, but you’ll know it when you experience it, and quality is about perfecting that experience. People have different expectations of different products and services, and they won’t always tell you exactly what those expectations are. This opens up another question: what do people want?
Imitating Santa Claus
Mapping out the requirements and expectations of people can be tricky. Only one person I know of can single-handedly figure out all the requirements and expectations of the target group: Santa Claus. When you’re mapping out the different needs and wants, you must try your hardest to become Santa, but there’s an additional complexity. It’s not only about figuring out what people want; it’s also about figuring out what level of it they want.
If a child wants a toy car, Santa can be sure that the kid wants the whole car (not just a tire), but with quality, our expectations fall on a spectrum. It helps to see this in terms of expectations of timeliness. When you get married, you and your spouse must be registered at some government agency. Hopefully you won’t have to do it yourself, because nothing brings you down from the happy cloud faster than waiting in a queue at a government agency. If you get married on a Saturday, the papers won’t be turned in until Monday, and the agency probably won’t finish processing them until a few days later. In that time, you won’t be registered as married. But does it matter if it takes five days? You want the agency to register you as marrying the correct person, but you also want this particular data point, your marriage, to be registered in a relatively timely manner. You’d be dissatisfied if you weren’t registered as married until five years later. A few days don’t matter, but a few years do.
Put that into perspective with a stock trader, who’s in a different league than government registries and wouldn’t be happy with a lag of a few days. Stock trading depends on shaving off a few milliseconds per transaction, not days. For both stock traders and newlyweds, timeliness is a requirement of the data process. It’s an attribute that describes quality. For stock traders, it is a crucial quality attribute, but it’s further down the priority list for the happy honeymooners.
Timeliness is one of many quality attributes. One scientific paper on data quality lists 180 different data quality attributes: ability to download, ability to upload, historical compatibility, novelty, up-to-date, well-documented, and many more. In software engineering, attributes or qualities like these are often referred to as “ilities” because many of the words end with “ility” (compressibility, reliability, volatility, and so on). This is the world you face: there are many different quality attributes, and data users in different contexts require various attributes and may put a different emphasis on each of them. Working with quality is about identifying the applicable quality attributes and the satisfactory level for each attribute. The satisfactory level is the required quality aspect, the minimum level data users would be happy with in each context, and the attractive quality is how we exceed those expectations. But how do we do that?
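The idea that the same attribute, timeliness, has a different satisfactory level per context could be sketched like this. The context names and thresholds are invented for illustration; in practice, the real levels come from talking to your data users:

```python
from datetime import timedelta

# Hypothetical satisfactory timeliness levels per context. The thresholds
# are illustrative assumptions, not real regulatory or trading requirements.
REQUIRED_TIMELINESS = {
    "marriage_registry": timedelta(days=7),
    "stock_trade": timedelta(milliseconds=5),
}

def meets_required_quality(context, lag):
    """Is the lag between an event and its registration acceptable
    in this context?"""
    return lag <= REQUIRED_TIMELINESS[context]
```

A five-day lag satisfies the registry but would be hopeless for a trade: `meets_required_quality("marriage_registry", timedelta(days=5))` is `True`, while `meets_required_quality("stock_trade", timedelta(days=5))` is `False`.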
We certainly don’t do it by memorizing a shopping cart full of quality attributes and picking them like food from a restaurant menu. It’s nearly impossible to make a list of what everyone wants (unless you’re Santa Claus). The trick is knowing your users; talking to them, listening to what they want, and understanding what they use the data for and in what context. Knowing that helps identify which attributes to focus on. Slowly but steadily you’ll gain experience and be able to identify the attributes people look for in your data. Remember that, even with experience, you never stop talking and listening to your data users. People and situations change and you constantly need to update your own understanding of who your users are, what they want, and for what situations.