Let’s clarify something at the outset. When thinking about test accuracy, it’s very easy to confuse two questions.
Place yourself in the position of someone having a screening swab test for Covid-19.
To keep things simple, the logic that follows is based on asymptomatic screening.
(So let’s assume you DON’T have any symptoms.)
The first question is:
What is the probability that I will get a positive or negative test result?
The vast majority of patients will home in on this. This is perfectly reasonable.
If the test were completely accurate, that’s all anyone would need to think about.
If you get screened at random for a condition, then the probability of getting a positive result from a perfect test is the same as the proportion of the population who have the condition.
Let’s call that the prevalence rate. In practice, the prevalence rate is tough to accurately pin down, but more on this later.
However, in the majority of cases tests aren’t perfectly accurate, so we need to ask a second question:
What is the probability that the test result I do get, whether positive or negative, is correct?
It’s easy to get these two questions mixed up. The point of this course is to clarify all the issues to do with this second question.
Researchers need to establish how good a test is at correctly labelling people as positive who do have the virus.
They call this the sensitivity rate. Expressed as a percentage, the higher the better.
Some more terminology is useful here. We’ll use some example numbers which help illustrate the issues going forward.
Let’s assume the test is 90% sensitive.
Researchers have established that when testing a group of people who they know do have the virus, 90% of those tested received a (correct) positive test result.
Of course, a completely sensitive test would identify 100% of this group as being positive.
But that leaves 10% of the results as incorrect. 10% of people received a false negative and were told they don’t have the virus when in fact they did.
So we also need to know something else. How good is the test at correctly labelling people as negative who do not have the virus?
Researchers call this specificity. Expressed as a percentage, the higher the number, the better.
Let’s say a new test is only 80% specific.
Researchers find that when they test people who they already know do not have the virus, 80% of those tested receive a (correct) negative test result.
But 20% of test subjects are told they are positive – they are false positives.
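To make the two definitions concrete, here is a minimal Python sketch. It assumes two hypothetical validation groups of 100 people each (the group size is my assumption, chosen only to make the percentages obvious), and the function names are illustrative.

```python
def sensitivity(true_positives, false_negatives):
    """Share of people who DO have the condition and correctly test positive."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Share of people who do NOT have the condition and correctly test negative."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical validation groups of 100 people each (an assumption for illustration).
sens = sensitivity(true_positives=90, false_negatives=10)
spec = specificity(true_negatives=80, false_positives=20)

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
# prints: sensitivity = 90%, specificity = 80%
```

Note that each number comes from a different group: sensitivity only looks at people known to have the virus, and specificity only at people known not to have it.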
Before we move on, let’s restate: the accuracy of a screening test is determined by its sensitivity and its specificity.
Test manufacturers, regulators, researchers and clinicians all want to know how sensitive and specific a test is and how reliable these numbers are.
In principle, testing sensitivity should not be that difficult.
You run the test on people known to have the condition or known to have had it (e.g. swab tests versus serology antibody tests), and see how often you get a positive result.
You want to avoid the type of error where you incorrectly classify a healthy test subject as having the condition. Still, with due care about this, and done enough times, a reliable sensitivity number should be possible.
And over time, as the test is used and reassessed, the sensitivity so measured becomes accepted or challenged.
Sensitivity might improve over time. For example, swab test sensitivity is likely to rise over the next year (2021).
The same process applies when determining specificity, though, in this case, the challenge is to source subjects who definitely don’t have the condition.
The risk here – again, using an example of Covid-19 – is that you incorrectly classify someone as negative who is pre-symptomatic. Hopefully, follow-up procedures will eliminate this particular error.
Of course, there are many other issues to consider.
But repeated enough times, a figure close to the truth should emerge.
When the two manufacturers of SARS-CoV-2 antibody tests released their products in late May 2020, there was a lot of criticism from clinicians and researchers about the ‘shaky’ nature of the claimed sensitivities and specificities.
At the time of writing this, this is still a hot subject!
It may not be intuitive, but for a ‘do you have a virus?’ type of test, sensitivity and specificity should be entirely separate (unless testers make mistakes like using the same swab on different subjects, or other basic errors).
It gets more confusing when, for example, doctors need to discover the level of a hormone in the bloodstream and see if the result can predict the presence of a condition. In this situation, it’s likely that sensitivity and specificity become inter-linked: as you try to improve one, you might well worsen the other.
Even when repeating a test on the same person, errors will produce false negatives or false positives (depending on whether they do or don’t have the condition).
For example, a single patient – who does have Covid-19 – swabbed by ten different operators (an unpleasant prospect) might not get ten positive results. A single operator swabbing the patient ten times might miss virus particles. And not quite swabbing virus particles might just be an inherent weakness of the test, even when performed flawlessly by the operator.
So it’s a coincidence if tests have similar levels of sensitivity and specificity.
They are different because of the many different types of errors in the measuring process.
The ideal combination would be a test that had 100% sensitivity and 100% specificity, but there are few, if any, tests so accurate.
A 100% accurate test, in both these ways, would have perfect predictive power. But this never happens in the real world.
A test might be quite sensitive and have worse specificity, or vice versa.
Sadly, we can’t just use the information above to interpret test results. We need to know something else too.
We need to estimate what proportion of the overall population has the condition – the prevalence rate.
It’s always useful to look at limiting or boundary cases.
If we knew that 100% of the population have the condition, we don’t need the test in the first place. The test wouldn’t add anything. As an aside, when test manufacturers ‘test their test’ for sensitivity, they choose a sample group with a prevalence of 100%.
Of course, this raises the question of how we know that 100% of the population have the condition. But let’s just imagine this, as a kind of thought experiment.
And if we knew the rate was zero in the population, we also don’t need the test, for similar reasons. And when test manufacturers ‘test their test’ for specificity, they choose a sample group with a prevalence of 0%.
But for anything in between, we need a handle on the prevalence, ideally an accurate one.
Because tests hardly ever have perfect sensitivity and specificity, we must adjust the apparent or notional test accuracy by the prevalence of the condition.
Why is that?
Well, imagine you get a positive result from a swab test. You should be interested in one thing.
How likely is a true positive compared to a false positive?
This is what matters. You wouldn’t care about the accuracy of negative test results. Not at that moment, anyway.
So if I told you that in your case the probability of a false positive was much higher than the probability of a true positive, naturally, you would cheer up a bit.
But if I told you the opposite?
And if you get a negative result? You are interested in how likely a true negative is compared to a false negative.
Remember we concluded above that if the prevalence in the community is 100% or 0% we don’t need tests?
Let’s explain why the prevalence rate matters if it’s somewhere in between.
Assume that the proportion of the population who are asymptomatically carrying the virus is only 5%.
By the way, I’ve used 5% as an example to make the figures easier. Who knows the real number in the general population? 0.1%?
You know from the prevalence rate (we assumed 5% at the end of the previous lesson) that there are only 5 chances in a hundred you have the condition.
Our assumptions are sensitivity = 90%, specificity = 80%, prevalence = 5%.
From the above, we can work out that the likelihood of a true positive (where you do have the virus) is (happily) less than the probability of a false positive.
Let’s assume 1000 people do the swab test, 50 of whom actually have the condition (because the prevalence is 5%).
The test, if one has the virus, generates true positives 90% of the time.
So 45 people would be correctly labelled as positive.
But 950 people (the number of people who don’t have the virus) would, 20% of the time, be incorrectly labelled as positive.
Remember, the specificity was only 80%, and 20% = 100%-80%.
So 950 * 20% is 190.
So 190 people would be incorrectly labelled as positive.
Of the 1,000 the test is performed on, the total number who would get a positive result is thus 235.
But only 45 are real – 190 are false.
So 45 true positives divided by 235 total positives is 19.15%.
To recap, using a prevalence of 5%, a sensitivity of 90% and a specificity of 80%, if you get a positive swab result, it is only 19% likely to be correct – if you are asymptomatic and tested at random.
That’s way less than the headline sensitivity of 90% (in this made-up example), which is, understandably, what first catches the eye of those who test positive.
90% sensitivity makes asymptomatic lay-people think they have a 90% chance of having the virus if they get a positive test result.
Actually, it’s under 20%.
To help visualise this, the chart below shows all the positive results, both true and false, and their relative proportions.
To recap – if you get a positive test result you are interested in one thing. How likely is that positive to be true?
This is the same as saying, ‘take the number of expected true positives and divide by the expected number of (true positives + false positives)’.
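The arithmetic for a positive result can be written as a short Python sketch. It uses the made-up figures from this lesson (sensitivity 90%, specificity 80%, prevalence 5%, 1,000 people tested); the function name and the `tested` parameter are my own labels, not standard terminology from any library.

```python
def positive_result_breakdown(sensitivity, specificity, prevalence, tested=1000):
    """Expected true positives, false positives, and the share of positives that are true."""
    infected = tested * prevalence
    not_infected = tested - infected
    true_pos = infected * sensitivity
    false_pos = not_infected * (1 - specificity)
    chance_correct = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, chance_correct


true_pos, false_pos, chance_correct = positive_result_breakdown(0.90, 0.80, 0.05)

print(f"true positives: {true_pos:.0f}, false positives: {false_pos:.0f}")
# prints: true positives: 45, false positives: 190
print(f"chance a positive result is correct: {chance_correct:.2%}")
# prints: chance a positive result is correct: 19.15%
```

Clinicians usually call this last figure the positive predictive value.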
Let’s crunch the numbers if you get a negative result, using the same assumptions as before.
The number of true negatives is 950 x 80%. That’s 760. The number of false negatives is 50 x 10%. That’s 5. Total negatives is 760 + 5 = 765.
True negative results vastly outnumber false negative results.
760 over 765 means it is exceptionally likely that your negative result is accurate. 99.35% likely, to be precise.
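The same calculation for a negative result, as a minimal Python sketch under the same made-up assumptions (sensitivity 90%, specificity 80%, prevalence 5%, 1,000 people tested). The function name is illustrative.

```python
def negative_result_breakdown(sensitivity, specificity, prevalence, tested=1000):
    """Expected true negatives, false negatives, and the share of negatives that are true."""
    infected = tested * prevalence
    not_infected = tested - infected
    true_neg = not_infected * specificity
    false_neg = infected * (1 - sensitivity)
    chance_correct = true_neg / (true_neg + false_neg)
    return true_neg, false_neg, chance_correct


true_neg, false_neg, chance_correct = negative_result_breakdown(0.90, 0.80, 0.05)

print(f"true negatives: {true_neg:.0f}, false negatives: {false_neg:.0f}")
# prints: true negatives: 760, false negatives: 5
print(f"chance a negative result is correct: {chance_correct:.2%}")
# prints: chance a negative result is correct: 99.35%
```

This figure is usually called the negative predictive value.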
To help visualise this, examine the chart below.
You should now see why, if the prevalence rate just happens to be 50%, we don’t need to factor the prevalence rate into the test results.
It’s because the ‘weights’ applied to true results are the same as the weights applied to false results. The prevalence effect thus washes out.
In this unlikely situation, we would only be interested in the difference between sensitivity and specificity.
And if sensitivity and specificity just happened to be the same, then the probability of a positive or a negative result being true would be the same as the sensitivity and the specificity.
But only in this highly unlikely situation.
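A quick way to check the 50% case is to run the numbers. The sketch below is illustrative only: the 85% figure for a test with equal sensitivity and specificity is a hypothetical value I picked, and the second call reuses the 90%/80% example from earlier.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (chance a positive is true, chance a negative is true)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Hypothetical test with equal sensitivity and specificity, at 50% prevalence.
pos_equal, neg_equal = predictive_values(0.85, 0.85, 0.50)
print(f"equal 85%/85%: positive correct {pos_equal:.1%}, negative correct {neg_equal:.1%}")

# The running 90%/80% example, also at 50% prevalence.
pos_running, neg_running = predictive_values(0.90, 0.80, 0.50)
print(f"90%/80%: positive correct {pos_running:.1%}, negative correct {neg_running:.1%}")
```

With equal sensitivity and specificity, both results come out at 85%, matching the test’s headline figures. With 90% and 80%, the answers (about 82% and 89%) depend only on how sensitivity and specificity compare, because the 50/50 weights cancel.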
In real world situations, the prevalence effect matters a lot.
It’s interesting to play around with these numbers and see how it changes things. Let’s do that in the next lesson.
But before that, if you are interested in reading more about the importance of prevalence rates when thinking about test results (the failure to take this into account being called the base rate fallacy) then you can read more detail here.
This chart shows the effect of varying prevalence.
For the first time, we’ve used real-world numbers for sensitivity and specificity which are, to the best of my knowledge at the time of writing, fairly accurate.
But there’s a lot of work going on with better and faster testing for current infection. It’s very likely that the sensitivity will rise from circa 70% at the moment to above 90% by mid 2021.
Effect of 1% Prevalence
This shows a low level of current infection in the general community, and how it allows the nearly perfect (but not quite perfect) specificity score to produce enough false positives that they outnumber true positives by 6 to 4.
Effect of 35% Prevalence
This might be typical of a bad outbreak in a hospital or care home. Here the false negatives from the poor sensitivity contribute to a worsening of the true negative probability to approximately 86%. A repeat test might be required to provide reassurance to a concerned healthcare worker that they really are negative for Covid-19.
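The two scenarios above can be roughly reproduced in Python. This is only a sketch: the 70% sensitivity comes from the text, but the 99% specificity is my assumption, since the text only describes specificity as ‘nearly perfect’. The chart’s actual inputs may differ.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (chance a positive is true, chance a negative is true)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


SENSITIVITY = 0.70  # 'circa 70%', as stated in the text
SPECIFICITY = 0.99  # ASSUMPTION: the text only says 'nearly perfect'

results = {}
for prevalence in (0.01, 0.35):
    results[prevalence] = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    pos_correct, neg_correct = results[prevalence]
    print(f"prevalence {prevalence:.0%}: "
          f"positive correct {pos_correct:.1%}, negative correct {neg_correct:.1%}")
```

Under these assumptions, a positive result at 1% prevalence is correct only about 41% of the time (so false positives outnumber true positives by roughly 6 to 4), and a negative result at 35% prevalence is correct about 86% of the time.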
Sometimes doctors have a good handle on prevalence.
In a chronic condition like rheumatoid arthritis, estimates of the prevalence rate across the population can be reasonably accurate and are established over time.
Of course, patients aren’t tested at random for these kinds of conditions. Tests tend to be ordered when there is a concern or indication warranting further investigation.
But, assuming we had the resources, in principle it would be useful to know the prevalence in the population of people currently infected with a novel virus.
But if the accuracy of the test is poor, particularly its sensitivity, then random screening might be pointless at low levels of prevalence.
Somewhat paradoxically, it is only at much higher levels of prevalence that we would get meaningful information from a swab-test screening program for Covid-19.
But at much higher levels the presence of high infection levels would be a lot more obvious!
All of the preceding analysis assumes the testing of asymptomatic people.
Examples include a breast cancer screening program, or people being allocated at random by PHE for a Covid-19 swab test.
Knowing the claimed sensitivity and specificity of the test, we adjust the probability that a positive or negative result is correct by the ‘best-guess’ prevalence rate appropriate to the sampled population.
For example, suppose we are swabbing arriving airline passengers from an ‘outbreak’ country to screen for Covid-19. In that case, we might well use different prevalence levels than when swabbing arriving inhabitants of a remote Scottish island.
But what if the person being tested does have symptoms? How should you assess the accuracy of an imperfect test result in this situation?
Here’s where Thomas Bayes comes in. Interpreting trial (RCT) or test results in the context of a priori information is a controversial issue in medical research, but it somehow seems reasonable.
Using the sensitivity, specificity and prevalence calculations we looked at earlier is a starting point, but the symptomatology now makes this seem insufficient.
We have more information that we need to take into account.
This changes things. For any imperfect test of a symptomatic patient, any negative result becomes less likely to be true, and any positive result becomes more likely to be true.
Of course, these types of weighted judgements are ones that we all make every day.
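One way to picture this is to treat the symptoms as raising the probability of infection before the test is even taken, and then apply Bayes’ theorem. The Python sketch below uses the made-up 90%/80% test from earlier; the 40% pre-test probability for a symptomatic patient is purely hypothetical, chosen only to show the direction of the effect.

```python
def probability_infected(pre_test_probability, sensitivity, specificity, positive):
    """Bayes' theorem: probability of infection once the test result is known."""
    if positive:
        likelihood_if_infected = sensitivity
        likelihood_if_healthy = 1 - specificity
    else:
        likelihood_if_infected = 1 - sensitivity
        likelihood_if_healthy = specificity
    infected_and_result = pre_test_probability * likelihood_if_infected
    healthy_and_result = (1 - pre_test_probability) * likelihood_if_healthy
    return infected_and_result / (infected_and_result + healthy_and_result)


SENSITIVITY, SPECIFICITY = 0.90, 0.80

summary = {}
# 5% is the screening prevalence used earlier; 40% is a HYPOTHETICAL symptomatic figure.
for label, pre_test in (("asymptomatic", 0.05), ("symptomatic", 0.40)):
    positive_is_true = probability_infected(pre_test, SENSITIVITY, SPECIFICITY, positive=True)
    negative_is_true = 1 - probability_infected(pre_test, SENSITIVITY, SPECIFICITY, positive=False)
    summary[label] = (positive_is_true, negative_is_true)
    print(f"{label}: positive true {positive_is_true:.1%}, negative true {negative_is_true:.1%}")
```

With these numbers, a positive result goes from about 19% likely to be true to 75%, while a negative result drops from about 99% to about 92%.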
For example, the term ‘false positive rate’ is sometimes incorrectly used to describe the proportion of total positive results (true + false) that are false.
This is the sense in which Matt Hancock used the term in mid-September 2020 (check out this BBC ‘More or Less’ podcast, about 10 minutes in). He and the interviewer were talking at cross purposes.
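The two meanings give very different numbers, which a short Python sketch makes plain. It reuses the counts from the earlier made-up example of 1,000 people (45 true positives, 190 false positives, 760 true negatives, 5 false negatives); the variable names are my own.

```python
# Counts from the earlier made-up example (90% sensitive, 80% specific, 5% prevalence).
true_pos, false_pos, true_neg, false_neg = 45, 190, 760, 5

# The false positive rate proper: of the people WITHOUT the virus, how many test positive?
false_positive_rate = false_pos / (false_pos + true_neg)

# The quantity it gets confused with: of all POSITIVE results, how many are false?
false_share_of_positives = false_pos / (false_pos + true_pos)

print(f"false positive rate: {false_positive_rate:.1%}")
# prints: false positive rate: 20.0%
print(f"share of positive results that are false: {false_share_of_positives:.1%}")
# prints: share of positive results that are false: 80.9%
```

The first figure is fixed by the test’s specificity alone. The second depends on prevalence as well.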
The objective of this short course was to improve your interpretation of test accuracy and significance. I hope you feel it has.
But here’s a final summary to consolidate the previous work.
A very sensitive test won’t produce many false negatives, expressed as a percentage, but the effect of less than perfect sensitivity is worse the higher the prevalence.
The reason? Higher prevalence produces more false negatives (unless sensitivity is perfect).
A very specific test won’t produce many false positives, expressed as a percentage, but the effect of less than perfect specificity is worse the lower the prevalence.
The reason? Lower prevalence produces more false positives (unless specificity is perfect).
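Both effects can be seen by counting the expected errors per 1,000 people tested at a few prevalence levels. This Python sketch reuses the made-up 90%/80% test; the prevalence levels are just the ones that appeared earlier in the course, and the function name is illustrative.

```python
def expected_errors(sensitivity, specificity, prevalence, tested=1000):
    """Expected (false negatives, false positives) among the people tested."""
    infected = tested * prevalence
    false_negatives = infected * (1 - sensitivity)
    false_positives = (tested - infected) * (1 - specificity)
    return false_negatives, false_positives


errors = {}
for prevalence in (0.01, 0.05, 0.35):
    errors[prevalence] = expected_errors(0.90, 0.80, prevalence)
    false_negatives, false_positives = errors[prevalence]
    print(f"prevalence {prevalence:.0%}: "
          f"{false_negatives:.0f} false negatives, {false_positives:.0f} false positives")
```

As prevalence rises from 1% to 35%, expected false negatives climb from 1 to 35 per 1,000, while expected false positives fall from 198 to 130.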
And it all depends what matters.
If you are trying to ensure that healthcare workers don’t infect patients with the coronavirus, then test sensitivity is the key measure.
If the main concern is to make sure that, for example, a false-positive mammogram does not result in unnecessary and worrying treatment escalation, then the screeners are more interested in the specificity of the test.