Dealing with uncertainty

Olga Soloveva
7 min read · Dec 23, 2020


How to show uncertainty in graphs so that people can use it when making decisions.

What happened

During the lockdown, I started to cook. I love that cooking at home is like solving a puzzle: you need a strategy for combining the ingredients piled up in the fridge so that you can get a variety of meals out of them over several days. When it comes to grocery shopping, I feel the expected anxiety of going to the store in person. I take some comfort in getting there during less busy hours, which can be googled. Around 8:00 PM, the store where I shop is supposed to be less crowded. Usually it works this way, and when I get there at 8:00 PM, there are one or two people at the checkout. But one day I went in at the usual time and was surprised to see four people. Intrigued, I checked the store’s online info to see whether the popular times had changed. It turned out they hadn’t. I just needed to read the schedule like this: on average, there are two people, and maybe two more or two fewer. That is why I started to wonder whether there are ways to show this kind of uncertainty in data visually.

Popular times at my store

Scientific uncertainty

Ambiguity in the display of data is widespread. For instance, a weather app forecasts snow, you put on rubber boots, and there is no snow. Or you wait for a bus that, according to the schedule, should have arrived, but it is late. The gap between weather forecasts or bus schedules and reality is incredibly frustrating if you read bar charts or other graphs as an undeniable image of “what is.” In practice, you can rarely be 100% sure, because data is constructed, not simply given as a natural reflection of existing facts.

In science, the universal approach is to learn something about the world by looking at a small portion of it. First, researchers determine how things are now, then model the future or create different scenarios from this starting point. Even though research is based on observations, it is a simplification of the real world. Because models simplify the world, they are forced to make assumptions, some of which may be inaccurate or incorrect. However, researchers usually calculate how likely it is that their findings represent a real effect rather than chance; research is not done to “prove” things but to find the most likely explanation of what we see.

It is often unnecessary to reduce uncertainty further: it doesn’t stop us from learning something useful and essential. The question is not whether we can know everything, but whether we know enough to make a decision. For example, researchers can calculate how frequently a flood is likely to occur in a specific location using historical data. A once-in-ten-years estimate will help decide what needs to be done to withstand this flooding frequency. Suppose a severe flood is a once-in-a-hundred-years event. In that case, a community might only put resources into reinforcing the most important buildings, such as a school or a hospital.
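To make the “once in a hundred years” figure concrete, here is a quick back-of-the-envelope sketch in Python. The 30-year planning horizon is my own assumption for illustration: even a rare event becomes fairly likely if you wait long enough.

```python
# A quick back-of-the-envelope check of the flood example.
# Assumption (mine, for illustration): a 30-year planning horizon.

annual_probability = 1 / 100   # a once-in-a-hundred-years event
horizon_years = 30

# Chance of seeing no such flood in any single year, compounded
# over the whole horizon, then inverted.
p_none = (1 - annual_probability) ** horizon_years
p_at_least_one = 1 - p_none

print(f"Chance of at least one flood in {horizon_years} years: "
      f"{p_at_least_one:.0%}")   # roughly 26%
```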

The confidence interval is a standard way in statistics to express how reliable researchers think their observations are. Imagine researchers want to find out whether the avocados in a garden are big enough. To do this, they randomly select a sample of avocados and calculate the range in which the average weight is likely to be. Then imagine that we weigh every avocado in the garden and confirm that the actual average weight falls within the calculated range. Though the calculated result was not exact, it is good that it includes the right answer. To be useful to others, researchers must express how likely the sample measurements are to be correct. A 95% confidence interval means that if we repeated the sampling process many times, in 95% of cases the calculated interval would contain the true average weight.

Mean values and confidence intervals for avocados’ weight
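For the curious, here is a minimal sketch of the avocado calculation in Python. The weights are invented, and scipy’s Student-t interval stands in for whatever method the researchers would actually use:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of avocado weights in grams (made-up numbers
# standing in for the randomly selected avocados).
weights = np.array([182, 195, 168, 210, 176, 189, 201, 173, 185, 192])

mean = weights.mean()
sem = stats.sem(weights)  # standard error of the mean

# 95% confidence interval for the mean, using Student's t distribution
# (appropriate for a small sample with unknown population variance).
low, high = stats.t.interval(0.95, df=len(weights) - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.1f} g, 95% CI: {low:.1f} g to {high:.1f} g")
```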

What to do

The most straightforward approach to displaying a confidence interval is to use a point to show the sample mean and lines to show the range around it [1]. Although this kind of graph is standard in statistics, it has shortcomings. Error bars alone cannot always reflect the data well. One might want to know whether the data points are distributed evenly or clustered. If clustered, how many clusters are there, and where? Are there any unusual data points? Besides, approaches developed for scientific publications require basic knowledge of statistics, and many people don’t know what a confidence interval is.
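For reference, a minimal matplotlib sketch of that standard point-plus-error-bar display; the groups and all the numbers are invented:

```python
import matplotlib.pyplot as plt

# Hypothetical group means and half-widths of their confidence
# intervals (all numbers invented for illustration).
groups = ["A", "B", "C"]
means = [4.1, 3.6, 4.4]
ci_half_widths = [0.3, 0.5, 0.2]

fig, ax = plt.subplots()
# A point marks each sample mean; a vertical error bar marks the
# confidence interval around it.
ax.errorbar(groups, means, yerr=ci_half_widths,
            fmt="o", capsize=4, color="black")
ax.set_ylabel("Mean value")
ax.set_title("Means with confidence intervals")
plt.show()
```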

The idea is to make graphs with uncertainty align with how people think about the data. There is a theory, initially described by Pinker and expanded by Padilla, about the cognitive fit between a visualization and people’s mental schemas [2]. The theory suggests that different visual techniques naturally evoke specific associations. The trick is to find the right metaphor, because it is not clear exactly how each type of data and mental schema fit together. For uncertainty in data, we can make an educated guess: Christoph Kinkeldey, Alan M. MacEachren, and Jochen Schiewe reviewed several empirical evaluations on this topic and noticed that blurring was rated the most intuitive of all methods [3].

When I thought about applying these ideas, I had a particular type of user in mind. At work, I am designing a tool for taxi fleet operators, the people who check on taxi drivers and help them get through the day. Our research showed that operators are well versed in numbers, but not in complex graphs, and even less so in statistics. However, they need charts to monitor events, because it is easier to notice deviations in a chart than in a data table. When the values are within acceptable limits, everything is fine; when not, you need to do something. Charts can also help them explore correlations and decide on next steps.

Imagine having to decide which driver works best. To do so, you look at the selected drivers’ ratings. Three of them have a rating of four. Without a display of the data’s uncertainty, you won’t be able to tell the difference. But if you see the average number and a measure of certainty around it, for instance a gradient bar that fades toward the edges of the interval, you can tell how precise the number is. You can notice that Amir and Volodya have the same rating, but Amir has fewer votes and therefore a less credible rating. One caveat: at least some people will likely read bare gradient bars as the minimum and maximum of the data. It is safer to combine the confidence interval with the data points themselves to create a more expressive graph. That way, it also becomes noticeable that Katya has mostly neutral ratings, while Volodya has excellent and terrible ones at the same time.

Mean driver ratings and associated confidence intervals
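Here is one way such a graph could be sketched in matplotlib: a gradient bar whose opacity fades toward the edges of the interval, with the raw votes overlaid on top. The drivers’ votes are invented, two standard errors stand in for a proper confidence interval, and the array-valued alpha in scatter needs matplotlib 3.4 or later:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented 1-to-5 votes behind each driver's displayed average.
ratings = {
    "Amir":    [5, 4, 3],                        # few votes, wide interval
    "Katya":   [4, 4, 4, 3, 4, 4, 4, 4],         # mostly neutral votes
    "Volodya": [5, 5, 1, 5, 5, 1, 5, 5, 1, 5],   # mixed extremes
}

rng = np.random.default_rng(0)
fig, ax = plt.subplots()
for y, (name, votes) in enumerate(ratings.items()):
    votes = np.asarray(votes, dtype=float)
    mean = votes.mean()
    sem = votes.std(ddof=1) / np.sqrt(len(votes))  # standard error

    # Gradient bar: opacity fades with distance from the mean, so the
    # interval reads as "sure in the middle, less sure at the edges".
    xs = np.linspace(mean - 2 * sem, mean + 2 * sem, 60)
    alphas = (1 - np.abs(xs - mean) / (2 * sem)) * 0.5
    ax.scatter(xs, np.full_like(xs, y), marker="s", s=120,
               color="tab:blue", alpha=alphas, edgecolors="none")

    # Overlay the raw votes (jittered vertically) so clusters and
    # outliers stay visible, plus a tick for the mean itself.
    ax.plot(votes, y + rng.uniform(-0.15, 0.15, len(votes)),
            "k.", alpha=0.6)
    ax.plot(mean, y, "k|", markersize=18)

ax.set_yticks(range(len(ratings)))
ax.set_yticklabels(list(ratings))
ax.set_xlim(0.5, 5.5)
ax.set_xlabel("Rating")
plt.show()
```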

Then imagine that you need to check whether your taxi fleet is getting better overall. To find out, you open the history of all drivers’ ratings at once. At first, it seems that things are getting better, but at second glance it becomes clear that the difference is not so obvious. Besides, it’s strange that so many bad ratings appeared at the end of the period. When there is a lot of data, the symbols can overlap, which makes the chart hard to analyze. If you make the symbols semi-transparent, you will be able to distinguish partially overlapping dots, but this will not help with completely overlapping ones. In that case, you can move some of the overlapping symbols slightly to the side so that they remain visible.

Rating history with confidence intervals
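And a minimal sketch of the two tricks mentioned above, semi-transparency plus moving overlapping symbols slightly to the side (jittering), on entirely simulated rating data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated rating history: a batch of 1-to-5 ratings from the whole
# fleet for each of 12 weeks, drifting slightly upward over time.
weeks = np.arange(1, 13)
history = [np.clip(np.round(rng.normal(3.8 + 0.03 * w, 0.9, 120)), 1, 5)
           for w in weeks]

fig, ax = plt.subplots()
for w, batch in zip(weeks, history):
    # Horizontal jitter separates completely overlapping symbols;
    # transparency lets partially overlapping ones show through.
    x = w + rng.uniform(-0.25, 0.25, len(batch))
    ax.plot(x, batch, "o", color="tab:blue", alpha=0.15, markersize=4)
    # The weekly mean on top as a fully opaque reference point.
    ax.plot(w, batch.mean(), "o", color="black")

ax.set_xticks(weeks)
ax.set_xlabel("Week")
ax.set_ylabel("Rating")
plt.show()
```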

Conclusion

People use visualizations to make a variety of decisions, from everyday transit choices to healthcare communication. Uncertainty is inherent in most data and can enter at the collection, modeling, and analysis stages, although people usually do not assume this. Displaying uncertainty in graphs is a smart way to encourage your audience to consider it when making decisions. The difficulty is that there is still no established solution for showing it, although the topic is actively discussed [4, 5]. A clear understanding of what researchers mean by scientific uncertainty, and of where it can be measured and where it can’t, would help everyone figure out how to respond to uncertainty in data.

References

  1. A chapter about uncertainty in the book “Fundamentals of Data Visualization” by Claus O. Wilke
  2. An article “Decision making with visualizations: a cognitive framework across disciplines” by Lace M. Padilla
  3. A survey “How to Assess Visual Communication of Uncertainty? A Systematic Review of Geospatial Uncertainty Visualisation User Studies” by Christoph Kinkeldey, Alan M. MacEachren, and Jochen Schiewe
  4. A thesis for the Degree of Master of Fine Arts in Information Design and Visualization by Zheng Yan Yu
  5. MU Collective, a research lab working on uncertainty visualization
