Data Thinking
Data thinking means viewing the world in terms of data. It’s an analytical technique for describing the past and present, and for predicting the future. It has four stages, each framed as a question:
- What question(s) am I trying to answer?
- What data do I need to answer those questions?
- Where will I find this data?
- How do I model the problem using the data?
A data thinker thinks in these terms for every problem, no matter how small or large. Let’s take an example relating to retirement. The question is simple. When can I afford to retire?
What data do you need? You need to know the current value of your pension(s), the likely future value of your pension(s), the value of the state pension (and when you get it), other sources of income (once you retire), what savings and investments you have, and the value of any assets (or debts). You’ll have to contact a number of people (such as your pension provider) and research various things (such as the value of the state pension) before you can proceed.
Say you’re 60 years old with two pensions, one from your current job (where you’ve worked for 15 years) and one from your previous job (where you worked for 20 years). Combined, they give you a private pension of £1,800 a month. Each extra year you work after age 60 will increase that pension by £50 a month. You discover that the state pension is £1,000 a month, payable when you’re 66. You have no other sources of income but you’ve saved £30,000. Your only asset is your home, which is worth £250,000. You still have a mortgage (£600 a month), which will not be paid off for another 10 years.
Now you have the data you need, you can build your model, which will allow you to play around with different variables and answer your question. You might decide to retire at 63 because the model tells you that your pension at this age will be enough if you sell your home, down-size to a small apartment and pay off your mortgage. Maybe you don’t want to sell your home and you’ll have to wait until you’re eligible for the state pension. The model will tell you.
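A model like this doesn’t need to be fancy. Here is a rough sketch in Python, using the figures from the example. The names (Situation, monthly_income, monthly_surplus) are made up for illustration, and the living costs of £1,500 a month are an assumption, since the example doesn’t give a figure. The sketch also assumes that downsizing clears the mortgage and leaves no new housing costs, and it ignores tax, inflation and interest on savings.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Figures from the example above, in pounds per month unless noted."""
    current_age: int = 60
    private_pension_at_60: float = 1800.0  # combined private pensions
    pension_gain_per_year: float = 50.0    # added for each extra year worked
    state_pension: float = 1000.0
    state_pension_age: int = 66
    mortgage_payment: float = 600.0
    mortgage_end_age: int = 70             # 10 more years from age 60
    living_costs: float = 1500.0           # ASSUMPTION: not given in the example


def monthly_income(s: Situation, retire_age: int, age: int) -> float:
    """Pension income at `age` for someone who retired at `retire_age`."""
    extra_years = retire_age - s.current_age
    private = s.private_pension_at_60 + s.pension_gain_per_year * extra_years
    state = s.state_pension if age >= s.state_pension_age else 0.0
    return private + state


def monthly_surplus(s: Situation, retire_age: int, age: int,
                    mortgage_cleared: bool = False) -> float:
    """Income minus outgoings; negative means you are drawing on savings."""
    still_paying = not mortgage_cleared and age < s.mortgage_end_age
    mortgage = s.mortgage_payment if still_paying else 0.0
    return monthly_income(s, retire_age, age) - s.living_costs - mortgage


you = Situation()
for retire_age in range(60, 67):
    # Surplus in the first month of retirement, with and without the mortgage.
    keep = monthly_surplus(you, retire_age, retire_age)
    sell = monthly_surplus(you, retire_age, retire_age, mortgage_cleared=True)
    print(f"retire at {retire_age}: keep home £{keep:+.0f}/month, "
          f"downsize £{sell:+.0f}/month")
```

Change living_costs, or any other figure, and run it again. That is all “playing around with variables” means.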
In general, the more relevant data you give your model, the more accurate it will be. For example, the dataset could be expanded to include your monthly living costs. You could make a rough guess at the future value of your home or you could use additional data to make a better-informed prediction based on historical trends in house prices where you live. In data-speak this is called “adding dimensions”. Too few dimensions and the model is too crude to capture what matters, so it produces unrealistic results (“under-fitting” in the jargon). Too many dimensions and the model becomes complex and time-consuming to build, and it can start fitting the quirks of your particular data rather than the underlying pattern (“over-fitting”), which makes its predictions worse, not better. You won’t be surprised to learn that the task of getting rid of dimensions is called “dimensionality reduction”.
Once the model is created, with as many dimensions as you think necessary, you can play around with variables such as the future value of your home — or when you’re likely to die. How much money will you need if you live to 70? Or 80? Or 100?
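The lifespan question can be sketched in a few lines of Python. This is only an illustration: the function name lifetime_totals is made up, the living costs of £1,500 a month are an assumption (the example doesn’t give a figure), and the sums ignore inflation, tax and investment growth. The pension and mortgage figures come from the retirement example.

```python
def lifetime_totals(retire_age, death_age, living_costs=1500.0):
    """Return (total outgoings, total pension income) in pounds,
    from retirement up to, but not including, `death_age`.

    `living_costs` is an ASSUMED monthly figure, not one from the example.
    """
    # Private pension: £1,800 a month at 60, plus £50 per extra year worked.
    private = 1800.0 + 50.0 * (retire_age - 60)
    total_cost = 0.0
    total_income = 0.0
    for age in range(retire_age, death_age):
        mortgage = 600.0 if age < 70 else 0.0   # mortgage ends at 70
        state = 1000.0 if age >= 66 else 0.0    # state pension starts at 66
        total_cost += 12 * (living_costs + mortgage)
        total_income += 12 * (private + state)
    return total_cost, total_income


for death_age in (70, 80, 100):
    cost, income = lifetime_totals(retire_age=63, death_age=death_age)
    print(f"live to {death_age}: need £{cost:,.0f}, pensions pay £{income:,.0f}")
```

Try a different retirement age or a higher living_costs figure and see how the totals move.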
All models are wrong. Some models are useful.
Data modelling, no matter how simple or sophisticated, reduces the messy, complex world to clean, simple numbers. “All models are wrong. Some models are useful.” The saying is usually attributed to the statistician George Box. What he meant was that you can’t perfectly represent the real world using numbers. But it is possible to build models that help you make decisions. Any model of your retirement is better than no model. The simplest spreadsheet is better than blindly retiring into poverty or working until you drop.
What question am I trying to answer? What data do I need? Where do I find it? What does it tell me? A data thinker thinks in these terms. Want to buy a new home? What’s the biggest mortgage you can afford? Happy with your business? How could it make more money? Want to improve public health in your city? What’s the most cost-effective intervention? The prospective house buyer doesn’t make a wild guess about how much she can afford to pay in a mortgage. The City Council doesn’t throw money at any public health initiative. These different problems require models of varying complexity but they all have one thing in common: data thinking.
Data thinking is analytical, not prescriptive. It doesn’t tell you what to do. Your retirement model will tell you a lot about what you can and can’t afford, but you still have to decide when to retire. As I write, the Scottish Government is considering women-only carriages on trains. The concern is that women don’t feel safe when travelling on trains in Scotland, especially at night and at weekends. My first reaction was: what does the data tell us? Are women more likely to be the victims of crime on trains? Do women make more complaints? Do women feel less safe than men? The answers to these questions should help us decide what to do. Suppose women do feel less safe on trains, do complain more and are more likely to be victims than men. That strengthens the argument for women-only carriages. But suppose men are more likely to be victims. Or suppose women aren’t more likely to be victims but feel more threatened. What then? It’s important that we have this data but the data shouldn’t decide for us. That’s a decision for politicians. Data thinking is not data dictatorship.
Lies, damn lies, and data
There’s an old saying that there are “lies, damn lies and statistics”. It’s true that you can lie with numbers. Your retirement model can deceive you if you exaggerate the value of your home or you conveniently forget about debts. In this case, you’re only deceiving yourself (and your future self will pay the price). But sometimes people use data to deceive others. They choose data and build models to get the answers they want. They start with the answer and work back to the data. This deception is deliberate. Data only lies when you want it to. Properly selected and properly modelled data doesn’t lie, which is why it’s vital that datasets are made public and data models are shared. Be very, very suspicious when they’re not.
Data thinking isn’t just for geeks. Everyone needs it. Journalists should use data thinking every time they write a story. Is this murder unusual? What’s the trend in homicide in this city? Is there a pattern to crimes like this? Teachers need data thinking every time they start a new class. What are the pass rates in this subject? How likely is it that this class will pass? What should I be aiming for? Data thinking isn’t about rules and tools. It doesn’t matter if you use a small dataset and a spreadsheet or Big Data and Machine Learning. What matters is how you think. The journalist shouldn’t assume it’s “just another murder”. The teacher shouldn’t think his class “will be fine”. They should think data.