All posts by Adam

Wet Bulb Temperatures, Part 2

29th December 2023 Adam 1 Comment

I wrote recently about wet bulb temperatures (WBTs), and why we should be worried if they get too high. In that post, I mentioned that I had excluded data from before 1990, as I was concerned about the data quality. For the rest of that post, I assumed that the quality control of the HadISD dataset that the Met Office does would be adequate and that the episodes of extreme WBTs that I found in the dataset were real.

I’ve been thinking about that some more, and I think I probably need to be a bit more careful about data quality. Although the Met Office’s quality control procedures are very thorough, the observations that I’ve been looking at are by definition outliers. So even if 99.99% of the observations in the dataset are genuine (and I don’t know if that’s the right figure), the extreme observations I was interested in are far more likely to be in that remaining 0.01% than some randomly selected observation.

I’ve had some discussions about this with Dr Kate Willett from the Met Office, who has been supremely helpful and has given me some great ideas about how to look in more detail at the data quality. I’m very grateful to Dr Willett for her support, and much of the deep dive into data quality in the rest of this blog post uses her ideas. Any errors in my interpretation of the data below are mine.

So, for today I’d like to look at some of the extreme WBT episodes I found and look in more detail about whether they appear to be genuine or some artefact of faulty instruments or similar.

I gave a list of extreme WBT episodes in my last post, and I’m going to use a slightly different one here, though several of the episodes are common to both lists. Today I’m looking at all episodes where the WBT was recorded as being at least 35°C for at least 4 h, and I’m including all data at any time in the dataset (last time I excluded any observations before 1990). This gives us the following list of 26 episodes:

Weather station	Date of observation	Hours WBT ≥ 35	Max WBT
723870-03160, DESERT ROCK AIRPORT, Nye County, Nevada, United States (36.621, -116.028)	08 May 1979	6	38.8
	13 May 1980	4	39
	24 May 1980	5	36.7
	25 May 1980	12	38.6
	29 May 1980	5	39.4
	19 Apr 1981	6	39
	20 Apr 1981	11	39.8
	21 Apr 1981	4	38.8
	20 May 1981	9	39.8
	09 May 1982	4	39.1
944490-99999, LAVERTON AERO, Shire Of Laverton, Western Australia, 6440, Australia (-28.617, 122.417)	22 Jan 1995	6	36.2
952050-99999, DERBY AERO, Shire Of Derby-West Kimberley, Western Australia, Australia (-17.367, 123.667)	03 Feb 2001	4	36.8
404160-99999, KING ABDULAZIZ AB, Dammam Governorate, Eastern Province, Saudi Arabia (26.265, 50.152)	08 Jul 2003	5	36.5
417150-99999, SHAHBAZ AB, Jacobabad District, Larkana Division, Sindh, 79000, Pakistan (28.284, 68.45)	06 Jun 2005	6	37.4
954820-99999, BIRDSVILLE, Diamantina Shire, Queensland, Australia (-25.898, 139.348)	30 Dec 2006	4	36.6
952050-99999, DERBY AERO, Shire Of Derby-West Kimberley, Western Australia, Australia (-17.367, 123.667)	16 Apr 2009	4	36.5
941310-99999, TINDAL, Town of Katherine, Northern Territory, 0850, Australia (-14.521, 132.378)	04 Nov 2011	4	36.5
	06 Nov 2011	5	36.9
942170-99999, ARGYLE AERODROME, Shire Of Wyndham-East Kimberley, Western Australia, Australia (-16.633, 128.45)	13 Jan 2013	6	36.9
946590-99999, WOOMERA, Pastoral Unincorporated Area, South Australia, Australia (-31.144, 136.817)	11 Mar 2013	6	36.6
943120-99999, PORT HEDLAND INTL, Town Of Port Hedland, Western Australia, Australia (-20.378, 118.626)	24 Mar 2013	4	36.9
952050-99999, DERBY AERO, Shire Of Derby-West Kimberley, Western Australia, Australia (-17.367, 123.667)	29 Sep 2013	4	35.8
760400-99999, EJIDO NUEVO LEON BC., Municipio de Mexicali, Baja California, Mexico (32.4, -115.183)	23 Jul 2018	6	38.4
948170-99999, COONAWARRA, Wattle Range Council, South Australia, Australia (-37.3, 140.817)	10 Jan 2021	4	36.5
760400-99999, EJIDO NUEVO LEON BC., Municipio de Mexicali, Baja California, Mexico (32.4, -115.183)	21 Jul 2021	6	37.5
	16 Aug 2021	6	37.2

(The “Weather station” details are the ID number of the weather station and its name as recorded in the HadISD dataset, followed by the address of its region, and finally the latitude and longitude coordinates.)

For each of those episodes, I have drawn a panel of 6 graphs which I hope is going to give us a clue about whether the data look reliable. I have plotted the dry bulb temperature (what we would normally call just “temperature”) on the left and the humidity on the right. WBT is calculated from dry bulb temperature and humidity, so if either of those variables looks wrong, then the calculated WBT is likely to be wrong as well.

The top graph shows the evolution over a 4 year period centred on the extreme episode. This lets us see if we’re following something approximating to normal seasonal variation for the location or if there has been some kind of spike. It’s a bit much to plot individual hourly values over that timescale, so I have calculated summary statistics for each week: the median, 90th centile, and 98th centile. If there’s a sudden increase in the gap between the median and higher centiles, that suggests that we may have weird outliers.

The middle graph shows a 6 day period centred on the maximum of WBT, this time plotting individual values, and shows nearby stations for comparison. I have included up to 5 other stations within 200 miles of the index station. There may be fewer than 5 other stations if there weren’t 5 stations within 200 miles. This lets us see whether the values are following a reasonable diurnal variation and whether they are obvious outliers compared with nearby stations.

The bottom graph compares the values with other datasets. I have used Visual Crossing and ERA5. Visual Crossing is a commercial weather service, and ERA5 is a publicly available dataset from the European Union’s Copernicus programme (part of the EU space programme). Often the Visual Crossing data are identical to the index station (I’ve added a tiny amount of random noise to the graphs here just so that you can still see both data series without one sitting on top of the other and hiding it), as Visual Crossing and the HadISD data that I used as my primary source both come from the same underlying dataset (the NOAA Integrated Surface Database). But Visual Crossing use different QC procedures to the ones the Met Office use, so where they do differ, that does suggest that something is up. The Visual Crossing data extend for only 3 days rather than the whole 6 days, as I’m a cheapskate and only subscribe to their free service and downloading much more data than that would have exceeded my download limits. ERA5 is a reanalysis dataset using data from various sources, and so is a more independent source of data.

The x axis for the 2 lower graphs is titled “Local time”, but I should point out that this isn’t necessarily exactly local time. Rather than go to the trouble of trying to look up local time zones for each location and date, I simply assumed that the world was divided into 24 equal sized time zones and that daylight savings time didn’t exist, for ease of calculation, and then just calculated the local time from the UTC time and the station longitude. So this may be an hour or two out from the actual local time, but should be close enough that we can tell if the diurnal variation seems reasonable. So if you’re keen enough to look up specific observations and find the times don’t quite match, that’s why.

As in my last post on this subject, I do need to emphasise that I am a medical statistician and not a climate scientist, and maybe I’ve overlooked something important or erred in my interpretation of the data. So please don’t assume that everything I write below is absolutely bullet-proof.

So, bearing that caveat in mind, let’s take a look at some graphs.

Here is the graph for the first episode:

There are some really obvious problems here. The temperature data simply don’t look remotely plausible. I’d previously discussed this observation station with Dr Willett, and she thought maybe someone had mixed up Fahrenheit and Celsius in the temperature observations. The data do seem consistent with that: if you assume that some of those implausibly high values are actually figures in Fahrenheit, then they match up quite well with the other datasets.

Here are the graphs for the next 2 episodes, at the same station. They are very similar. I won’t bore you with the graphs for the remaining episodes at that station, but they are also similar. It is clear that the episodes from this station are not trustworthy.

The next episode, episode 11, comes from Western Australia. The temperature looks perfectly plausible, but something looks wrong with the humidity, with an obvious spike around the time of the episode and a large deviation from the ERA5 dataset as well as the 2 nearest stations. This also doesn’t seem to be a genuine high WBT episode.

Episode 12, also from Western Australia, is similar, in that the temperature looks plausible but the humidity looks unreasonably high and seems more likely to be some kind of instrument malfunction than real extreme humidity.

Episode 13, from Saudi Arabia, is probably genuine. The temperature looks perfectly reasonable as judged by the seasonal average, the diurnal variation, and some nearby weather stations, though it is a little higher than the ERA5 dataset. The humidity doesn’t have the obvious red flags of the episodes 11 and 12, though it does seem to go up a bit within 24 h of the episode peak compared with a couple of days on each side. However, this web page from NOAA mentions that the highest dew point ever recorded was on 8 July 2003 at Dharhan, Saudia Arabia, which is more or less exactly the location of the weather station, and does suggest that some extreme weather was happening about that time.

Episode 14, from Pakistan, looks like it could be genuine, though again, it’s hard to be sure. The temperature certainly seems plausible, but the humidity, while broadly in line with nearby stations and not showing any obvious signs of a spike, is higher than the the ERA5 dataset. I haven’t been able to find a media report specifically about this location and date, but this article says that more than 500 people died from heat in May and June in Pakistan, India, and Bangladesh, which does seem consistent with this episode being genuine.

Episodes 15 to 22 are all from Australia. I don’t think any of them is genuine. The temperature looks plausible in all cases, but the humidity looks wrong: out of whack with the nearby stations and very different from the ERA5 data.

Episode 23 is from Mexico. Again, the temperature looks plausible, but the humidity is very different to both the ERA5 and the Visual Crossing data, as well as being considerably higher than several nearby weather stations. I don’t think this is genuine.

Episode 24 is from Australia again, and the humidity looks wrong again. I don’t think this is genuine either.

Episodes 25 and 26 are from the same weather station in Mexico as episode 23, and the humidity again looks implausible.

So of our 26 episodes, it looks like almost all of them are not real and are probably the result of faulty instrumentation or faulty data entry into a database. Episode 13 from Saudi Arabia in 2003 looks very likely to be real, and episode 14 from Pakistan in 2005 may very well be real, though it’s hard to be sure.

So in fact prolonged spells of WBT above 35°C do seem to be very rare, for now at least. In my next blogpost I shall look more at data quality and see if I can come up with a statistical algorithm for distinguishing the real episodes from the others, as I don’t think it’s going to be feasible to look at these graphs one at a time for the less extreme, but still worrying episodes of high WBT, for example episodes of WBT above 32°C, as these are much more common. Once I have been able to distinguish more reliably between the episodes that are real and those that aren’t, then I’ll be able to do a better job of looking at whether the frequency of dangerously high WBT episodes is increasing.

Public health

Wet Bulb Temperatures, Part 1

21st September 2023 Adam 1 Comment

I’ve gotta be honest, I hadn’t heard of the concept of a “web bulb temperature” until earlier this year. I’ve heard about them quite a bit in recent months, and I’m sure we’ll all be hearing about them a lot more often from now on.

If the concept is also new to you, let me explain. It’s pretty much what it says. You take a traditional thermometer with a bulb of liquid at the bottom, and you wrap the bulb in a wet cloth. The water in the cloth evaporates, and the evaporative cooling means that the thermometer will read a lower temperature than it would have done if it were dry. The amount of cooling depends on the humidity of the air: in dry conditions, the evaporative cooling is efficient and the wet bulb temperature will be considerably lower than the dry bulb temperature, but in humid conditions, the difference is less marked. At the extreme, in 100% relative humidity, the water cannot evaporate at all and the wet bulb temperature (WBT) will be the same as the dry bulb temperature.

Why does this matter? And why am I writing about physics on a blog that’s mostly about medical statistics? Well, it has important implications for human health. People die in heat waves. And it turns out that WBT is important in understanding just how dangerous a heat wave is.

If you are exposed to 45°C heat for any length of time, that sounds pretty scary, but it need not necessarily be too dangerous as long as the air is not too humid, you are reasonably healthy, and you can stay well hydrated. We have evolved a cunning mechanism to stay cool in hot temperatures, namely sweating. This works in exactly the same way as the evaporative cooling in a wet bulb thermometer, and cools us down. Even if the air temperature is hotter than our body temperature, then in a dry atmosphere we can still cool down by sweating. But if we’re exposed to that kind of heat in humid conditions, then we’re in trouble.

So WBT is a way of measuring the combined effects of temperature and humidity. There is no magic cutoff for WBT below which everyone is fine and above which everyone dies. Moderately high WBTs that are mostly survivable for young healthy people may still kill large numbers of frail elderly people. However, the threshold of a WBT of 35°C is often mentioned as a limit of human survivability, and some research suggests that lower WBTs may be dangerous even to young healthy people. Certainly for people who aren’t young and healthy, high-ish WBTs below 35°C are likely to cause considerable excess mortality. The duration of high temperatures is also important. An extreme WBT may be survivable for 30 min, but not if you’re exposed to it for several hours.

The idea for writing this blogpost came from a conversation on social media. Fellow Mastodon user Kenneth Freeman asked the question of whether there is any way to know when WBTs hit 35°C somewhere on the planet. After a little bit of thought, I realised that this was a really important question to answer, so I thought I’d have a go.

If you are impatient to know what the answer is, feel free to skip to the results below. But if you are interested in all the geeky stuff about how I found out, then read on.

There are many weather-related websites that will let you search for the weather conditions at a given location at a given time. However, although they are presumably linked up to giant datasets of weather conditions at different locations and times, and so could in theory be easily queried to find the locations and times for specific weather conditions, I couldn’t find any website that would let you search that way round. But Kenneth tooted a link to an relevant scientific paper which had also examined the question of how often WBTs of 35°C are observed. They had used data from the HadISD dataset: a large dataset of observations from weather stations around the world, going back in some cases to the 1930s and updated monthly with the latest data, which is freely available online thanks to the UK Met Office. This seemed to be a good way to find out the answer to Kenneth’s question.

A good way, though not a trivially easy way. There are about 23GB of data in the dataset, which all needs to be sifted through. The dataset is stored in netCDF files, which apparently is a format used widely in climate science, but was new to me. However, there is of course a Python package to read netCDF files, so after a brief learning curve I was able to read the data and look for high values of WBT using Python code.

As an aside, this project turned out to be a brilliant example of the power of open data. One of the things that 20+ years’ professional experience as a statistician has taught me is that you should always make sure you understand exactly what you are looking at in your dataset before you try to do anything with the data. So as part of that process, I re-calculated the WBTs using the temperature and humidity data in the dataset to make sure I’d understood what the WBT variable in the dataset was and that it was what I was expecting it to be.

The WBTs that I calculated were not the same as the WBTs provided in the dataset.

I assumed at this stage that I must have got something wrong. I’m a medical statistician, not a climate scientist, and I figured that the climate scientists at the Met Office would know how to calculate WBT better than I do. So I emailed the Met Office to ask them what I was doing wrong, and it turned out that in fact my calculations were correct and they had an error in their code that they’d used to calculate WBT. I have to say that I am very impressed with the speed with which they corrected their data. The day after I emailed them, an updated version of the dataset appeared on their website, along with an explanation that they’d had to correct it.

It’s ever so easy to make mistakes when analysing data. We all do it. By making their data freely available online, the Met Office had vastly increased the chances that someone might come along and spot the mistake so it could be corrected. How long would that error have remained in the dataset if the data had not been freely available, one wonders? Top marks to the Met Office for making the data available and for promptly and transparently correcting the error once I’d alerted them to it.

Although the dataset goes back to the 1930s in some cases, most of the weather stations included in the dataset don’t go back that far. Many of the stations came on stream in the 1970s and 1980s. By 1990, 90% of the current number of weather stations were already contributing data. So I am starting my analysis at 1990, to avoid the problem of trying to make comparisons across time periods with vastly different amounts of data collection.

In addition to that, the data quality before 1990 looks like it may not always be that great. I did have a look back at earlier data to see what was there, and I found one weather station in the US with some very prolonged spells of WBT > 35°C around 1980. This seemed like such an outlier I contacted the Met Office about it (it has to be said that the team behind the HadISD dataset at the Met Office are wonderfully helpful people), and after taking a look at it they thought that the data really didn’t seem reliable, and that the most likely explanation was that the temperature data had been recorded in Fahrenheit and whoever originally uploaded the data had forgotten to convert it to Celsius.

Data quality is something that needs a lot of careful work, and the Met Office do have some extensive quality control procedures, which are described in a published paper. Of course the occasional spurious value can slip through even the most rigorous quality control, but having excluded the pre-1990 data, I am assuming that the Met Office’s quality control procedures are good enough and that the data are reliable. Certainly nothing else I’ve seen later than those values from the US station around 1980 leapt out at me as being obviously weird.

This analysis is somewhat quick and dirty at this stage. You will note that I’ve called this post “Wet Bulb Temperatures, Part 1”, as I plan to come back another day and tell you about some more sophisticated analyses. I don’t have any maps to show you today, and I’d love to do that in the reasonably near future. I have also used a very rough and ready method of calculating the duration of spells of high WBT: I have assumed that the spell lasted from the first time a weather station recorded a WBT above the threshold and ended at the last time the WBT was above the threshold at the same station, having been consistently above the threshold since. For long spells and stations that record data every hour, that’s probably not a bad approximation. However, consider a hypothetical example of a station reporting data only every 3 hours, with the WBT at 34.9 at 1200, 36 at 1500, and 34.9 at 1800. I would have counted this as a zero duration above 35, as there was only a single observation above 35. However, in practice, the WBT was probably above 35 for most of the time between 1200 and 1800. So my analysis here is rather conservative: the actual number of spells of long duration is probably higher than I’m reporting here. It wouldn’t be that hard to come up with more accurate estimates by interpolation between the observations, but I haven’t done that yet. I’ll do that another time and let you know what that looks like.

I have also not taken account of where stations are. If 2 stations that are close to each other both report episodes of high WBT at the same time, I have counted that as 2 separate episodes, although it might be more accurate to count it as 1 big episode. However, I guess it’s still a measure of how serious the extreme temperatures are. Also, what I’m focusing mostly on here is episodes where the WBT was at least 35°C for 6 h, as those as the episodes which are most likely to pose a serious threat to human survivability, and I have checked and none of those episodes were recorded at more than one station on a single day. I can’t promise that’s true of some of the less extreme episode durations. I hope to do some kind of analysis of the geographical extent of extreme WBT episodes some other time.

And my final caveat is that I am a medical statistician, not a climate scientist. I am aware that I’m going rather outside my field of expertise by analysing climate data, and I can’t exclude the possibility that I have overlooked something really important that would be obvious to any climate scientist. If you think I have erred in that way, do let me know via the comments below.

Anyway, with those caveats in mind, let’s look at the results.

Results

I’ll start with a list of extreme WBT episodes. These were any time the WBT was at least 35°C for 6 h or more (there were in fact no episodes longer than 6 h), or was at least 32°C for more than 12 h.

Weather station	Date of observation	Hours WBT ≥ 35	Hours WBT ≥ 32	Max WBT
944490-99999, LAVERTON AERO, Shire Of Laverton, Western Australia, 6440, Australia (-28.617, 122.417)	22 Jan 1995	6	6	36.2
412460-99999, SOHAR MAJIS, Al Batinah North Governorate, Oman (24.467, 56.65)	5 Jul 1998		15	32.9
412460-99999, SOHAR MAJIS, Al Batinah North Governorate, Oman (24.467, 56.65)	12 Jul 1998		18	34.2
412420-99999, DIBA, Musandam Governorate, Oman (25.617, 56.25)	8 Jul 2001		15	33.4
412420-99999, DIBA, Musandam Governorate, Oman (25.617, 56.25)	4 Jul 2002		18	33.2
412580-99999, MINA SULTAN QABOOS, Muscat, Muscat Governorate, Oman (23.633, 58.567)	28 Jun 2004		15	32.7
412420-99999, DIBA, Musandam Governorate, Oman (25.617, 56.25)	12 Jul 2004		15	33.5
417150-99999, SHAHBAZ AB, Shikārpur District, Sindh, Pakistan (28.284, 68.45)	6 Jun 2005	6	12	37.4
942170-99999, ARGYLE AERODROME, Shire Of Wyndham-East Kimberley, Western Australia, Australia (-16.633, 128.45)	13 Jan 2013	6	12	36.9
946590-99999, WOOMERA, Pastoral Unincorporated Area, South Australia, Australia (-31.144, 136.817)	11 Mar 2013	6	9	36.6
760400-99999, EJIDO NUEVO LEON BC., Municipio de Mexicali, Baja California, Mexico (32.4, -115.183)	23 Jul 2018	6	12	38.4
412170-99999, ABU DHABI INTL, Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates (24.433, 54.651)	19 Aug 2020		15	33.1
760400-99999, EJIDO NUEVO LEON BC., Municipio de Mexicali, Baja California, Mexico (32.4, -115.183)	21 Jul 2021	6	6	37.5
760400-99999, EJIDO NUEVO LEON BC., Municipio de Mexicali, Baja California, Mexico (32.4, -115.183)	16 Aug 2021	6	6	37.2
412670-99999, QALHAT, Ash Sharqiyah South Governorate, Oman (22.667, 59.4)	9 Jul 2022	3	33	36.2
412400-99999, KHASAB PORT, Musandam Governorate, Oman (26.217, 56.25)	24 Aug 2023		30	33.1

What we can see is that there have been only a handful of episodes of a WBT of ≥ 35°C that lasted for 6h, and they are skewed towards the more recent years of the dataset. What is particularly striking is that the paper by Raymond et al I linked to previously, published in 2020 and using data up to 2017, which looked at the emergence of high WBTs, is already badly out of date. Of the 7 episodes of WBT ≥ 35°C that I have found, 3 of them have occurred after 2017. There are also 2 episodes of WBT ≥ 32°C that lasted for over 24 hours, which must be very hard to deal with, which have both occurred in the last couple of years.

Although I haven’t done any statistical analysis to see how likely it is that the clustering of extreme heat episodes in the later period could be due to chance, it certainly looks at first glance that the frequency of such events is increasing.

Let’s look at some graphs as well. Here are the number of episodes of WBTs above thresholds from 32 to 35°C of varying durations over time.

Again, I have not done any statistical analysis of this, but it does seem that there is an upward trend in many of those graphs with the frequency of high WBT events increasing over time.

If this trend of increasingly extreme WBT events continues, this could have serious implications for human health in the affected areas. It is no exaggeration to say that thousands, maybe millions, could die in such extreme weather events.

So this would be a really bad time to relax our efforts to reduce greenhouse gas emissions. But surely no-one would be stupid enough to do that, would they?

Update 29 December 2023: please see the next post, Wet Bulb Temperatures, Part 2, which explains why much of the data above may not be all it seems.

Public health

Coronavirus: when will we be back to normal?

9th January 2021 Adam 9 Comments

Well, 2020 was quite a year. I’m sure it’s one that most of us are glad is over.

Here in the UK, we have been badly hit by the covid-19 pandemic, indeed we have one of the worst death rates in the world. It didn’t have to be this way: as an island nation with a well developed health system, we could have handled the pandemic far better. Unfortunately, we have a government of incompetent idiots who have simply not been up to the job of dealing with it.

As I write this in early January 2021, covid-19 cases are at high levels and rising rapidly, following a reasonable approximation to exponential growth since the beginning of December with a doubling time of just over a fortnight. This is, frankly, terrifying, given that hospitals are already stretched to their limits.

But there is a ray of hope, in the shape of vaccines. We now have 3 vaccines approved for use in the UK, and over a million people have already been vaccinated. It has been an extraordinary achievement to get not one, but 3 vaccines invented, tested in large clinical trials, and approved in such a short space of time. The scientists, clinicians, clinical research professionals, statisticians, regulators, and last but by no means least clinical trial volunteers should be incredibly proud of what they have achieved.

It will take many months or possibly even years to vaccinate the whole UK population. But sensibly, vaccination is being prioritised for those most at risk, mainly starting with older age groups. The government have promised that they will have vaccinated the 15 million people at highest risk by the middle of February, including everyone over 70 as well as health and social care workers and those who are clinically extremely vulnerable.

They will break that promise of course, just like they break all their promises.

But hopefully at some time in the next few months, even if not as early as mid-February, all those high risk people will have been vaccinated. What does that mean for getting our lives and the economy back to normal?

Vaccinating that number of people will certainly not give us any meaningful herd immunity, but given that most deaths from covid-19 occur in the elderly, we would expect that vaccinating all the over 70s will dramatically cut the death rate.

At that stage, there may be a temptation on the part of politicians to open up the economy again, taking the view that perhaps it doesn’t matter if covid-19 is still circulating widely if few people are dying from it.

I think this would be a mistake. First, just because most deaths from covid-19 occur in the elderly, it does not mean that younger people don’t die from it at all. Very approximately 10% of covid-19 deaths are in people under the age of 60, and if the virus is spreading rampantly through the population and millions of people are infected, then the absolute numbers of younger people who die will not be negligible.

But there is a further reason to be cautious: long covid.

There is still much that we don’t know about long covid, but what we do know is that a small proportion of patients continue to have significant symptoms weeks or even months after the acute infection. It has been estimated that about 1 in 10 patients still have symptoms after 12 weeks.

If millions of people are being infected, then that suggests that hundreds of thousands of people may suffer from long covid.

What we don’t yet know is how long the symptoms of long covid last. Maybe most people will be back to normal within a year, or maybe the symptoms are generally permanent. We simply do not yet have enough long term data to know, given that the disease only first appeared just over a year ago.

Some of the symptoms of long covid are very worrying. Quite apart from potentially permanent lung and heart damage, one study found that cognitive performance could be reduced in a manner equivalent to 10 years of ageing.

If the symptoms of long covid do turn out to be permanent, then having hundreds of thousands of people affected by them would be nothing short of a public health catastrophe.

So while there will be a temptation to get back to normal life once deaths from covid are much reduced following vaccination of those at higher risk, I think that temptation needs to be resisted for a while longer until enough of the population have been vaccinated to give significant herd immunity.

At any rate, much as I miss my local pubs, I will not be going back to them until after I’ve had my vaccine.

Epidemiology, Healthcare, Politics, Public health

Covid-19 deaths

22nd March 2020 Adam 9 Comments

I wrote last week about how the number of cases of coronavirus were following a textbook exponential growth pattern. I didn’t look at the number of deaths from coronavirus at the time, as there were too few cases in the UK for a meaningful analysis. Sadly, that is no longer true, so I’m going to take a look at that today.

However, first, let’s have a little update on the number of cases. There is a glimmer of good news here, in that the number of cases has been rising more slowly than we might have predicted based on the figures I looked at last week. Here is the growth in cases with the predicted line based on last week’s numbers.

As you can see, cases in the last week have consistently been lower than predicted based on the trend up to last weekend. However, I’m afraid this is only a tiny glimmer of good news. It’s not clear whether this represents a real slowing in the number of cases or merely reflects the fact that not everyone showing symptoms is being tested any more. It may just be that fewer cases are being detected.

So what of the number of deaths? I’m afraid this does not look good. This is also showing a classic exponential growth pattern so far:

The last couple of days’ figures are below the fitted line, so there is a tiny shred of evidence that the rate may be slowing down here too, but I don’t think we can read too much into just 2 days’ figures. Hopefully it will become clearer over the coming days.

One thing which is noteworthy is that the rate of increase of deaths is faster than the rate of increase of total cases. While the number of cases is doubling, on average, every 2.8 days, the number of deaths is doubling, on average, every 1.9 days. Since it’s unlikely that the death rate from the disease is increasing over time, this does suggest that the number of cases is being recorded less completely as time goes by.

So what happens if the number of deaths continues growing at the current rate? I’m afraid it doesn’t look pretty:

(note that I’ve plotted this on a log scale).

At that rate of increase, we would reach 10,000 deaths by 1 April and 100,000 deaths by 7 April.

I really hope that the current restrictions being put in place take effect quickly so that the rate of increase slows down soon. If not, then this virus really is going to have horrific effects on the UK population (and of course on other countries, but I’ve only looked at UK figures here).

In the meantime, please keep away from other people as much as you can and keep washing those hands.

Epidemiology, Healthcare, Politics, Public health

Covid-19 and exponential growth

15th March 2020 Adam 9 Comments

One thing about the Covid-19 outbreak that has been particularly noticeable to me as a medical statistician is that the number of confirmed cases reported in the UK has been following a classic exponential growth pattern. For those who are not familiar with what exponential growth is, I’ll start with a short explanation before I move on to what this means for how the epidemic is likely to develop in the UK. If you already understand what exponential growth is, then feel free to skip to the section “Implications for the UK Covid-19 epidemic”.

A quick introduction to exponential growth

If we think of something, such as the number of cases of Covid-19 infection, as growing at a constant rate, then we might think that we would have a similar number of new cases each day. That would be a linear growth pattern. Let’s assume that we have 50 new cases each day, then after 60 days we’ll have 3000 cases. A graph of that would look like this:

That’s not what we’re seeing with Covid-19 cases. Rather than following a linear growth pattern, we’re seeing an exponential growth pattern. With exponential growth, rather than adding a constant number of new cases each day, the number of cases increases by a constant percentage amount each day. Equivalently, the number of cases multiplies by a constant factor in a constant time interval.

Let’s say that the number of cases doubles every 3 days. On day zero we have just one case, on day 3 we have 2 cases, and day 6 we have 4 cases, on day 9 we have 8 cases, and so on. This makes sense for an infectious disease epidemic. If you imagine that each person who is infected can infect (for example) 2 new people, then you would get a pattern very similar to this. When only one person is infected, that’s just 2 new people who get infected, but if 100 people have the disease, then 200 people will get infected in the same time.

On the face of it, the example above sounds like it’s growing much less quickly than my first example where we have 50 new cases each day. But if you are doubling the number of cases each time, then you start to get to scarily large numbers quite quickly. If we carry on for 60 days, then although the number of cases isn’t increasing much at first, it eventually starts to increase at an alarming rate, and by the end of 60 days we have over a million cases. This is what it looks like if you plot the graph:

It’s actually quite hard to see what’s happening at the beginning of that curve, so to make it easier to see, let’s use the trick of plotting the number of cases on a logarithmic scale. What that means is that a constant interval on the vertical axis (generally known as the y axis) represents not a constant difference, but a constant ratio. Here, the ticks on the y axis represent an increase in cases by a factor of 10.

Note that when you plot exponential growth on a logarithmic scale, you get a straight line. That’s because we’re increasing the number of cases by a constant ratio in each unit time, and a constant ratio corresponds to a constant distance on the y axis.

Implications for the UK Covid-19 epidemic

OK, so that’s what exponential growth looks like. What can we see about the number of confirmed Covid-19 cases in the UK? Public Health England makes the data available for download here. The data have not yet been updated with today’s count of cases as I write this, so I added in today’s number (1372) based on a tweet by the Department of Health and Social Care.

If you plot the number of cases by date, it looks like this:

That’s pretty reminiscent of our exponential growth curve above, isn’t it?

It’s worth noting that the numbers I’ve shown are almost certainly an underestimate of the true number of cases. First, it seems likely that some people who are infected will have only very mild (or even no) symptoms, and will not bother to contact the health services to get tested. You might say that it doesn’t matter if the numbers don’t include people who aren’t actually ill, and to some extent it doesn’t, but remember that they may still be able to infect others. Also, there is a delay from infection to appearing in the statistics. So the official number of confirmed cases includes people only after they have caught the disease, gone through the incubation period, developed symptoms that were bothersome enough to seek medical help, got tested, and have the test results come back. This represents people who were infected probably at least a week ago. Given that the number of cases are growing so rapidly, the number of people actually infected today will be considerably higher than today’s statistics for confirmed cases.

Now, before I get into analysis, I need to decide where to start the analysis. I’m going to start from 29 February, as that was when the first case of community transmission was reported, so by then the disease was circulating within the UK community. Before then it had mainly been driven by people arriving in the UK from places abroad where they caught the disease, so the pattern was probably a bit different then.

If we start the graph at 29 February, it looks like this:

Now, what happens if we fit an exponential growth curve to it? It looks like this:

(Technical note for stats geeks: the way we actually do that is with a linear regression analysis of the logarithm of the number of cases on time, calculate the predicted values of the logarithm from that regression analysis, and then back-transform to get the number of cases.)

As you can see, it’s a pretty good fit to an exponential curve. In fact it’s really very good indeed. The R-squared value from the regression analysis is 0.99. R-squared is a measure of how well the data fit the modelled relationship on a scale of 0 to 1, so 0.99 is a damn near perfect fit.

We can also plot it on a logarithmic scale, when it should look like a straight line:

And indeed it does.

There are some interesting statistics we can calculate from the above analysis. The average rate of growth is about a 30% increase in the number of cases each day. That means that the number of cases doubles about every 2.6 days, and increases tenfold in about 8.6 days.

So what happens if the number of cases keeps growing at the same rate? Let’s extrapolate that line for another 6 weeks:

This looks pretty scary. If it continues at the same rate of exponential growth, we’ll get to 10,000 cases by 23 March (which is only just over a week away), to 100,000 cases by the end of March, to a million cases by 9 April, and to 10 million cases by 18 April. By 24 April the entire population of the UK (about 66 million) will be infected.

Now, obviously it’s not going to continue growing at the same rate for all that time. If nothing else, it will stop growing when it runs out of people to infect. And even if the entire population have not been infected, the rate of new infections will surely slow down once enough people have been infected, as it becomes increasingly unlikely that anyone with the disease who might be able to pass it on will encounter someone who hasn’t yet had it (I’m assuming here that people who have already had the disease will be immune to further infections, which seems likely, although we don’t yet know that for sure).

However, that effect won’t kick in until at least several million people have been infected, a situation which we will reach by the middle of April if other factors don’t cause the rate to slow down first.

Several million people being infected is a pretty scary prospect. Even if the fatality rate is “only” about 1%, then 1% of several million is several tens of thousands of deaths.

So will the rate slow down before we get to that stage?

I genuinely don’t know. I’m not an expert in infectious disease epidemiology. I can see that the data are following a textbook exponential growth pattern so far, but I don’t know how long it will continue.

Governments in many countries are introducing drastic measures to attempt to reduce the spread of the disease.

The UK government is not.

It is not clear to me why the UK government is taking a more relaxed approach. They say that they are being guided by the science, but since they have not published the details of their scientific modelling and reasoning, it is not possible for the rest of us to judge whether their interpretation of the science is more reasonable than that of many other European countries.

Maybe the rate of infection will start to slow down now that there is so much awareness of the disease and of precautions such as hand-washing, and that even in the absence of government advice, many large gatherings are being cancelled.

Or maybe it won’t. We will know more over the coming weeks.

One final thought. The government’s latest advice is for people with mild forms of the disease not to seek medical help. This means that the rate of increase of the disease may well appear to slow down as measured by the official statistics, as many people with mild disease will no longer be tested and so not be counted. It will be hard to know whether the rate of infection is really slowing down.

Correlation does not equal causation, Dodgy statistics, Public health

More nonsense about vaping

18th March 2018 Adam 5 Comments

A paper was published in PLoS One a few days ago by Soneji et al that made the bold claim that “e-cigarette use currently represents more population-level harm than benefit”.

That claim, for reasons we’ll come to shortly, is not remotely supported by the evidence. But this leaves me with rather mixed feelings. On the one hand, I am disappointed that such a massively flawed paper can make it through peer review. It is a useful reminder that just because a paper is published in a peer reviewed journal does not mean that it is necessarily even approximately believable.

But on the other hand, the paper was largely ignored by the British media. I find that rather encouraging. We have seen flawed studies about e-cigarettes cheerfully picked up by the media before (here’s one example, but there are plenty of others), who don’t seem too bothered about whether the research is any good or not, just that it makes a good story. Perhaps the media are starting to learn that parroting press releases, when those press releases are a load of nonsense, is not such a great idea after all.

Sure, the paper made it into two of our most dreadful and unreliable newspapers, but as far as I can tell, the story was not picked up at all by the BBC or any of the broadsheet newspapers. And that’s a good thing.

So what was wrong with the paper then?

It’s important to understand that the paper did not collect any new data. There was no survey or clinical trial or review of health records or anything like that. It was purely a mathematical modelling study based on previously published data.

Soneji et al attempted to model the benefits and harms of e-cigarettes at the population level by considering what proportion of smokers are helped to quit by e-cigarettes, thus experiencing a health benefit, and what proportion of never-smokers are encouraged to start smoking by e-cigarettes, thus experiencing harm.

Of course a mathematical model is only as good as the assumptions that go into it. The big problem with this model is that there is no evidence that e-cigarettes encourage anyone to start smoking.

Now, there have been studies that show that young people who use e-cigarettes are more likely to start smoking that young people who don’t use e-cigarettes. Soneji et al used a meta-analysis of those studies to obtain the necessary estimates of just how much more likely that was.

But there is a big problem here. The assumption in Soneji et al’s modelling paper is that the observed association between e-cigarette use and subsequent smoking initiation is causal. In other words, they assume that those people who use e-cigarettes and then go on to start smoking have started smoking because they used e-cigarettes.

A moment’s thought shows that there are other perfectly plausible explanations rather than a causal relationship. Surely it is more likely that there is confounding by personality type here. The sort of person who uses e-cigarettes is probably the type of person who is more likely to start smoking. If e-cigarettes were not available, those people who first used e-cigarettes and then subsequently started smoking would probably have started smoking anyway.

But this is to some extent guesswork. While Soneji et al can most definitely not prove that the association between e-cigarette use and subsequent smoking is causal, no-one can prove it isn’t causal from those association studies, even if another explanation is more plausible.

We can, however, look at other data to help understand what is going on. Given that e-cigarettes are now far more available than they were a few years ago, if e-cigarettes were really causing people who wouldn’t otherwise have smoked to start smoking, then you would expect to see population-level rates of smoking start to increase.

In fact, according to data from the Office for National Statistics, the opposite is happening. According to the ONS data, “Since 2010, smoking has become less common across all age groups in the UK, with the most pronounced decrease observed among those aged 18 to 24 years”.

Now, of course we can’t say that that decrease in smoking prevalence is because of e-cigarettes, but it does seem to argue strongly against the hypothesis that e-cigarettes are encouraging young people to start smoking on a grand scale.

And if you believe Soneji et al’s claims, people would be starting smoking on a grand scale. Prof Peter Hajek, quoted by the Science Media Centre, has calculated what Soneji et al’s claims would mean if they were true in the UK:

“This new ‘finding’ is based on the bizarre assumption that for every one smoker who uses e-cigs to quit, 80 non-smokers will try e-cigs and take up smoking. It flies in the face of available evidence but it is also mathematically impossible. In the UK alone, 1.5 million smokers have quit smoking with the help of e-cigarettes. The ‘modelling’ in this paper assumes that we also have 120 million young people who became smokers.”

I think we can all see that having 120 million young people who are smokers among the UK population doesn’t make a whole lot of sense. Why could the peer-reviewers of the paper not see that?

Politics

Lessons must be learned. It must never happen again.

14th October 2017 Adam Leave a comment

Now that multiple accusations of rape and other serious sexual offences have been made against Harvey Weinstein, everyone agrees that what happened is terrible, that lessons must be learned, and that it must never happen again.

A few weeks ago, when Grenfell Tower burned down in London, with the loss of dozens of lives, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When it turned out that British journalists had been hacking phones on a grand scale, including the phone of a dead schoolgirl, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When it became clear that Jimmy Savile had been a prolific sexual abuser, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

When the banking system collapsed in 2008, causing immense damage to the wider economy, everyone agreed that it was terrible, that lessons must be learned, and that it must never happen again.

It seems to me that the lesson from all these things, and more, is clear. When people are in a position of power, sometimes they will abuse that power. And because they are in a position of power, they will probably get away with it.

This will happen again. People in a position of power are the ones who make the rules, and it doesn’t seem likely that they will change the rules to make it easier to hold powerful people to account.

I suppose it could happen, in a democracy such as the UK, if voters insist that their politicians prioritise holding the powerful to account. Sadly, I can’t see that happening. Most people prioritise other things when they go to the ballot box.

So unless that changes, all these things, and similar, will happen again.

Dodgy reporting, Dodgy statistics, Public health

Do 41% of middle aged adults really walk for less than 10 minutes each month?

24th August 2017 Adam 4 Comments

I was a little surprised when I heard the news on the radio this morning and heard that a new study had been published allegedly showing that millions of middle aged adults are so inactive that they don’t even walk for 10 minutes each month. The story has been widely covered in the media, for example here, here, and here.

The specific claim is that 41% of adults aged 40 to 60 in England, or about 6 million people, do not walk for 10 minutes in one go at a brisk pace at least once a month, based on a survey by Public Health England (PHE). I tracked down the source of this claim to this report on the PHE website.

I found that hard to believe. Walking for just 10 minutes a month is a pretty low bar. Can it really be true that 41% of middle aged adults don’t even manage that much?

Well, if it is, which I seriously doubt, then the statistic is at best highly misleading. The same survey tells us that less than 20% of the same sample of adults were physically inactive, where physical activity is defined as “participating in less than 30 minutes of moderate intensity physical activity per week”. Here is the table from the report about physical activity:

So we have about 6 million people doing less than 10 minutes of walking per month, but only 3 million people doing less than 30 minutes of moderate intensity physical activity per week. So somehow, there must be 3 million people who are doing at least 30 minutes of physical activity per week while simultaneously walking for less than 10 minutes per month.

I suppose that’s possible. Maybe those people cycle a lot, or perhaps drive to the gym and have a good old workout and then drive home again. But it seems unlikely.

And even if it’s true, the headline figure that 41% of middle aged adults are doing so little exercise that they don’t even manage 10 minutes of walking a month is grossly misleading. Because in fact over 80% of middle aged adults are exercising for at least 30 minutes per week.

I notice that the report on the PHE website doesn’t link to the precise questions asked in the survey. I am always sceptical of any survey results that aren’t accompanied by a detailed description of the survey methods, including specifying the precise questions asked, and this example only serves to remind me of the importance of maintaining that scepticism.

The news coverage focuses on the “41% walk for less than 10 minutes per month” figure and not on the far less alarming figure that less than 20% exercise for less than 30 minutes per week. The 41% figure is also presented first on the PHE website, and I’m guessing, given the similarity of stories in the media, that that was the figure they emphasised in their press release.

I find it disappointing that a body like PHE is prioritising newsworthiness over honest science.

Politics

Brexit voting and education

11th February 2017 Adam 47 Comments

This post was inspired by an article on the BBC website by Martin Rosenbaum, which presented data on a localised breakdown of EU referendum voting figures, and a subsequent discussion of those results in a Facebook group. In that discussion, I observed that the negative correlation between the percentage of graduates in an electoral ward and the leave vote in that ward was remarkable, and much higher than any correlation you normally see in the social sciences. My friend Barry observed that age was also correlated with voting leave, and that it was likely that age would be correlated with the percentage of graduates, and questioned whether the percentage of graduates was really an independent predictor, or whether a high percentage of graduates was more a marker for a young population.

The BBC article, fascinating though it is, didn’t really present its findings in enough detail to be able to answer that question. Happily, Rosenbaum made his raw data on voting results available, and data on age and education are readily downloadable from the Nomis website, so I was able to run the analysis myself to investigate.

To start with, I ran the same analyses as described in Rosenbaum’s article, and I’m happy to say I got the same results. Here is the correlation between voting leave and the percentage of graduates, together with a best-fit regression line:

For age, I found that adding a quadratic term improved the regression model, so the relationship between age and voting leave is curved, and increases with age at first, but tails off at older age groups:

Rosenbaum also looked at the relationship with ethnicity, so I did too. Here I plot the percent voting leave against the % of people in each ward identifying as white. Again, I found the model was improved by a quadratic term, showing that the relationship is non linear. This fits with what Rosenbaum said in his article, namely that although populations with more white people were mostly more likely to vote leave, that relationship breaks down in populations with particularly high numbers of ethnic minorities:

It’s interesting to note that the minimum for the % voting leave is a little over 40% white population. I suspect that the important thing here is not so much what the proportion of white people is, but how diverse a population is. Once the proportion of white people becomes very low, then maybe the population is just as lacking in diversity as populations where the proportion of white people is very high.

Anyway, the question I was interested in at the start was whether the percentage of graduates was an independent predictor of voting, even after taking account of age.

The short answer is yes, it is.

Let’s start by looking at it graphically. If we start with our regression model looking at the relationship between voting and age, we can calculate a residual for each data point, which is the difference between the data point in question and the line of best fit. We can then plot those residuals against the percentage of graduates. What we are now plotting is the voting patterns adjusted for age. So if we see a relationship with the percent of graduates, then we know that it’s still an independent predictor after adjusting for age.

This is what we get if we do that:

As you can see, it’s still a very strong relationship, so we can conclude that the percentage of graduates is a good predictor of voting, even after taking account of age.

What if we take account of both age and ethnicity? Here’s what we get if we do the same analysis but with the residuals from an analysis of both age and ethnicity:

Again, the relationship still seems very strong, so the percentage of graduates really does seem to be a robust independent predictor of voting.

For the more statistically minded, another way of looking at this is to look at the regression coefficient for the percentage of graduates alone, or after adjusting for age and ethnicity (in all cases with the % voting leave as the dependent variable). Here is what we get:

Model	Regression cofficient	t	P value
Education alone	-0.97	-45.9	< 0.001
Education and age	-0.90	-52.5	< 0.001
Education and ethnicity	-0.91	-55.0	< 0.001
Education, age, and ethnicity	-0.89	-53.9	< 0.001

So although the regression coefficient does get slightly smaller after adjusting for age and ethnicity, it doesn’t get much smaller, and remains highly statistically significant.

What if we turn this on its head and ask whether age is still an important predictor after adjusting for education?

Here is a graph of the residuals from the analysis of voting and education, plotted against age:

There is still a clear relationship, though perhaps not quite as strong as before. And what if we look at the residuals adjusted for both education and ethnicity, plotted against age?

The relationship seems to be flattening out, so maybe age isn’t such a strong independent predictor once we take account of education and ethnicity (it turns out that areas with a higher proportion of white people also tend to be older).

For the statistically minded, here are what the regression coefficients look like (for ease of interpretation, I’m not using a quadratic term for age here and only looking at the linear relationship with age).

Model	Regression cofficient	t	P value
Age alone	1.66	17.2	< 0.001
Age and education	1.28	25.3	< 0.001
Age and ethnicity	0.71	5.95	< 0.001
Age, education, and ethnicity	0.82	13.5	< 0.001

Here the adjusted regression coefficient is considerably smaller than the unadjusted one, showing that the initially strong looking relationship with age isn’t quite as strong as it seems once we take account of education and ethnicity.

So after all this I think it is safe to conclude that education is a remarkably strong predictor of voting outcome in the EU referendum, and that that relationship is not much affected by age or ethnicity. On the other hand, the relationship between age and voting outcome, while still certainly strong and statistically significant, is not quite as strong as it first appears before education and ethnicity are taken into account.

One important caveat with all these analyses of course is that they are based on aggregate data for electoral wards rather than individual data, so they may be subject to the ecological fallacy. We know that wards with a high percentage of graduates are more likely to have voted remain, but we don’t know whether individuals with degrees are more likely to have voted remain. It seems reasonably likely that that would also be true, but we can’t conclude it with certainty from the data here.

Another caveat is that data were not available from all electoral wards, and the analysis above is based on a subset of 1070 wards in England only (there are 8750 wards in England and Wales). However, the average percent voting leave in the sample analysed here was 52%, so it seems that it is probably broadly representative of the national picture.

All of this of course raises the question of why wards with a higher proportion of graduates were less likely to vote leave, but that’s probably a question for another day, unless you want to have a go at answering it in the comments.

Update 12 February 2017:

Since I posted this yesterday, I have done some further analysis, this time looking at the effect of socioeconomic classification. This classifies people according to the socioeconomic status (SES) of the job they do, ranging from 1 (higher managerial and professional occupations) to 8 (long term unemployed).

I thought it would be interesting to see the extent to which education was a marker for socioeconomic status. Perhaps it’s not really having a degree level education that predicts voting remain, but it’s being in a higher socioeconomic group?

To get a single number I could use for socioeconomic status, I calculated the percentage of people in each ward in categories 1 and 2 (the highest status categories). (I also repeated the analysis calculating the average status for each ward, and the conclusions were essentially the same, so I’m not presenting those results here.)

The relationship between socioeconomic status and voting leave looks like this:

This shouldn’t come as a surprise. Wards with more people in higher SES groups were less likely to vote leave. That fits with what you would expect from the education data: wards with more people with higher SES are probably also those with more graduates.

However, if we look at the multivariable analyses, this is where it starts to get interesting.

Let’s look at the residuals from the analysis of education plotted against SES. This shows the relationship between voting leave and SES after adjusting for education.

You’ll note that the slope of the best-fit regression line is now going the other way: it now slopes upwards instead of downwards. This tells us that, for wards with identical proportions of graduates, the ones with higher SES are now more likely to vote leave.

So what we are seeing here is most definitely a correlation between education and voting behaviour. Other things (ie education) being equal, wards with a higher proportion of people in high SES categories were more likely to vote leave.

For the statistically minded, here is what the regression coefficients look like. Here are the regression coefficients for the effect of socioeconomic status on voting leave:

Model	Regression cofficient	t	P value
SES alone	-0.58	-20.6	< 0.001
SES and education	0.81	26.5	< 0.001
SES, education, and ethnicity	0.49	12.4	< 0.001
SES, education, age, and ethnicity	0.31	6.5	< 0.001

Note how the sign of the regression coefficient reverses in the adjusted analyses, consistent with the slope in the graph changing from downward sloping to upward sloping.

And what happens to the regression coefficients for education once we adjust for SES?

Model	Regression cofficient	t	P value
Education alone	-0.97	-45.9	< 0.001
Education and SES	-1.75	-51.9	< 0.001
Education, SES, age, and ethnicity	-1.20	-23.4	< 0.001

Here the relationship between education and voting remain becomes even stronger after adjusting for SES. This shows us that it really is education that is correlated with voting behaviour, and it’s not simply a marker for higher SES. In fact once you adjust for education, higher SES predicts a greater likelihood of voting leave.

To be honest, I’m not sure these results are what I expected to see. I think it’s worth reiterating the caveat above about the ecological fallacy. We do not know whether individuals of higher socioeconomic status are more likely to vote leave after adjusting for education. All we can say is that electoral wards with a higher proportion of people of high SES are more likely to vote leave after adjusting for the proportion of people in that ward with degree level education.

But with those caveats in mind, it certainly seems as if it is a more educated population first and foremost which predicts a higher remain vote, and not a population of higher socioeconomic status.

Education

Do you believe in dinosaurs?

14th January 2017 Adam 3 Comments

On of my earliest memories is from when I was at primary school. I must have been about 5 years old at the time, and I had just heard about dinosaurs. I can’t remember how I heard about them. Perhaps my parents had given me a book about them. That’s probably the sort of thing that parents do for 5-year-olds, right?

Anyway, I was fascinated by the whole idea (as I expect most kids of that age are), and at school I asked my teacher “Do you believe in dinosaurs?”

The teacher was smart enough to spot that I was asking a question with some rather poor assumptions behind it, and helpfully and patiently explained to me why it’s not really a question of belief. Dinosaurs, she explained, were an established fact, as seen from abundant evidence from the fossil record. Belief didn’t come into it: dinosaurs existed.

I understood what my teacher explained to me, and learned an important lesson that day. Some things are not about belief: they are about facts. In fact looking back on this with the benefit of 40-odd years of hindsight, I think perhaps that lesson was the single most important thing I ever learned at school (yes, even more important than that thing about ox-bow lakes). It’s a shame I can’t remember the name of the teacher, because I’d really like to thank her.

But it’s even more of a shame that so many people who don’t believe in global warming or who do believe in homeopathy or similar didn’t have such a good primary school teacher as I had.

The Stats Guy