Prioritizing Risk: A Conversation on Vulnerability Scoring
In September 2013, I wrote an article examining vulnerability scoring in vulnerability management solutions. In it, I argued that an unbounded vulnerability scoring system (that is, a scoring system without any delineated limits) can be valuable at the right level of a business’s process, and I concluded that rankings, categories, and more sophisticated metrics can all benefit from this type of system.
Two years have passed since that article was published, but the points I raised about how organizations can prioritize vulnerability risk are still pertinent today. Recently, Michael Roytman of Kenna, an IT security firm that specializes in a new category of security assessment solutions, engaged me in a conversation on vulnerability scoring. Our discussion is presented below.
Michael Roytman: I completely agree with the spirit of your 2013 article. Yes, we often do prioritize too much, and part of the problem is in fact due to the way in which we define criticality and how historical scoring has obfuscated what is truly important here. However, I think that it is quite possible to address these issues with a bounded score.
Let me explain why bounded scores matter. In general, bounding a metric is what allows us to make claims about “how much worse” one state is than another. Without a limit, there’s no way to tell whether a score of 12,000 is actually 2x worse than a score of 6,000. The reason is that we’re really measuring the risk a vulnerability (or set of vulnerabilities) poses to an organization; in doing so, we inherently assume that we can measure the probability of the vulnerability being popped. And as we both know, probability spaces are bounded by definition in order to be able to make claims about events that could occur outside of a set of knowable events.
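To make that “2x worse” point concrete, here’s a minimal Python sketch. The mapping from raw score to breach probability is entirely made up for illustration; the only point is that once you care about probability, which saturates, a doubled raw score does not mean doubled risk:

```python
import math

def breach_probability(raw_score: float, k: float = 5000.0) -> float:
    """Hypothetical saturating map from an unbounded raw score to a
    probability in [0, 1). Illustrative only; not a real risk model."""
    return 1.0 - math.exp(-raw_score / k)

low, high = 6_000, 12_000
print(f"raw score ratio:   {high / low:.2f}x")                                           # 2.00x
print(f"probability ratio: {breach_probability(high) / breach_probability(low):.2f}x")   # ~1.30x
```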
Tim Erlin: Michael, you present some solid reasoning there. A bounded score does make it possible to distinguish between two vulnerabilities. In light of your analysis, I will adjust my language away from “unbounded” to something like “dynamic bounding.” We might be getting a little esoteric here, so let’s toss out some definitions as a setup:
- Bounded Score: a scoring system with a fixed lower and upper bound.
- Unbounded Score: a scoring system with either no fixed upper or no fixed lower bound.
- Dynamically Bounded Score: a scoring system with upper or lower bounds that change based on some criteria.
Now back to your remarks. A bounded score may allow for “claims about events that could occur outside of a set of knowable events,” but the issue is that it doesn’t adjust for changes in probability, at least not if it’s static. A threat that has some probability x today may actually have probability y tomorrow.
If you make the assumption that you can know and adjust for a set of criteria that changes the probability of an exploit, then you should incorporate those into the scoring system, thereby allowing you to adjust the bounds on that system. The score is bounded at any given point in time, mind you, but those bounds will shift based on changes in probability.
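Here is a rough sketch, in Python, of what I mean by dynamic bounding. The criteria and weights are invented for illustration; the point is only that the bounds are a function of the current threat context, so the same raw score reads differently as that context changes:

```python
from dataclasses import dataclass

@dataclass
class ThreatContext:
    """Hypothetical criteria that shift exploit probability over time."""
    exploit_public: bool       # has a working exploit been published?
    actively_exploited: bool   # has exploitation been seen in the wild?
    exposed_assets: int        # how many instances are exposed?

def score_bounds(ctx: ThreatContext) -> tuple[float, float]:
    """Bounds are recomputed from today's context (weights are made up)."""
    upper = 100.0
    if ctx.exploit_public:
        upper *= 1.5
    if ctx.actively_exploited:
        upper *= 2.0
    upper *= max(1.0, ctx.exposed_assets / 100.0)
    return 0.0, upper

def normalized(raw: float, ctx: ThreatContext) -> float:
    """Read a raw score against the bounds in effect right now."""
    lo, hi = score_bounds(ctx)
    return min(max(raw, lo), hi) / hi

today = ThreatContext(exploit_public=False, actively_exploited=False, exposed_assets=50)
tomorrow = ThreatContext(exploit_public=True, actively_exploited=True, exposed_assets=50)

print(score_bounds(today), normalized(120.0, today))        # (0.0, 100.0) 1.0
print(score_bounds(tomorrow), normalized(120.0, tomorrow))  # (0.0, 300.0) 0.4
```

A raw score of 120 saturates today’s bounds but sits at 40 percent of tomorrow’s; the bounds moved because the probability-shifting criteria did.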
MR: I do not see the greater value in the system you have just described. Bounded scores might have their drawbacks when it comes to probability, but fixed limits are nonetheless instrumental in communicating a rank ordering of vulnerabilities (that is, the order in which we should remediate them), because the goal of any of our discussions is really twofold:
- To make a list of actions for folks to take in order of importance.
- To be able to report on the current, past, and future states of the system.
Your article does an excellent job pointing out that these two goals are important. However, without a rank ordering, we suffer from the same problem: we know that a lot of vulnerabilities are important, but either we can’t decide which is most important (and hence which we should act on first) or we can’t report the state of the system because unbounded metrics in the aggregate obfuscate the end-state.
I will try to explain this line of reasoning further by critiquing, line by line, the only section of your article with which I truly disagree:
“The most common use case for vulnerability scoring is selecting which vulnerabilities to focus on remediating. In this case, we’ve already seen the limitations of a ranking system where you ultimately end up with the problem of clustering of high ranks, and an inability to act.”
The cluster of high ranks happens because the CVSS score distribution was designed without looking at the data we currently have available to us. That is, it is a crafted distribution made to fit normality to a 2002 dataset (CVSSv2), but describing the data with more parameters (velocity of exploitation, types of exploits available, IDS alerts on the vulnerabilities, etc.) would allow us both to create a scoring system with more granularity (more ranks, more feasible scores) and to separate out the artificial clustering at the high end that we see in CVSS. Keep in mind, CVSS has only something like 16 possible “ranks” or scores, and they all tend toward the top because a lack of information in this schema keeps the score high. (No exploit confirmed? Keep the score the same. Exploit proof of concept? Lower the score. Exploit confirmed? Keep the score high.)
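You can see that behavior in the CVSS v2 temporal equation itself: the “Not Defined” value of each temporal metric carries the same multiplier as the worst case, so a total absence of information never lowers the score. A quick sketch using the multipliers from the CVSS v2 specification:

```python
# CVSS v2 temporal score = BaseScore * Exploitability * RemediationLevel * ReportConfidence
EXPLOITABILITY = {
    "Unproven": 0.85, "Proof-of-Concept": 0.90, "Functional": 0.95,
    "High": 1.00, "Not Defined": 1.00,            # no information == worst case
}
REMEDIATION_LEVEL = {
    "Official Fix": 0.87, "Temporary Fix": 0.90, "Workaround": 0.95,
    "Unavailable": 1.00, "Not Defined": 1.00,     # no information == worst case
}
REPORT_CONFIDENCE = {
    "Unconfirmed": 0.90, "Uncorroborated": 0.95,
    "Confirmed": 1.00, "Not Defined": 1.00,       # no information == worst case
}

def temporal_score(base: float, e: str, rl: str, rc: str) -> float:
    return round(base * EXPLOITABILITY[e] * REMEDIATION_LEVEL[rl] * REPORT_CONFIDENCE[rc], 1)

print(temporal_score(9.3, "Not Defined", "Not Defined", "Not Defined"))      # 9.3 -- nothing known at all
print(temporal_score(9.3, "High", "Unavailable", "Confirmed"))               # 9.3 -- confirmed, weaponized
print(temporal_score(9.3, "Proof-of-Concept", "Official Fix", "Confirmed"))  # 7.3 -- PoC only, fix available
```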
“A ranking, however, may be useful at small scale. If you have hundreds of devices, and no requirement for reporting on progress, then you can use vulnerability rankings. In most cases, however, an unbounded score provides the best mechanism for detailed prioritization.”
Again, this is a question of proper model selection. We can create schema that work at high scale (think FICO scores) and that are able to differentiate between or compare two system states (people) within bounds. We just need to incorporate more data into the models to get that granularity.
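As a sketch of what I mean, and nothing more than a sketch: fold several observable signals into a logistic model and you get a bounded score with far more distinct values than CVSS produces. The features and weights below are invented purely for illustration:

```python
import math

def bounded_risk_score(exploit_published: bool,
                       exploitation_velocity: float,   # observed attacks per day
                       ids_alerts_last_30d: int,
                       breach_dataset_matches: int) -> float:
    """Illustrative logistic model: the weights are made up, but the shape
    shows how extra data yields a granular score bounded to [0, 100]."""
    z = (-4.0
         + 2.5 * exploit_published
         + 0.8 * math.log1p(exploitation_velocity)
         + 0.05 * ids_alerts_last_30d
         + 0.3 * breach_dataset_matches)
    return 100.0 / (1.0 + math.exp(-z))

print(round(bounded_risk_score(False, 0.0, 0, 0), 1))   # ~1.8: nothing observed, low score
print(round(bounded_risk_score(True, 2.0, 10, 1), 1))   # ~54.5: some activity, mid-range
print(round(bounded_risk_score(True, 10.0, 40, 3), 1))  # ~96.5: heavy activity, near the ceiling
```

Unlike the handful of scores CVSS can produce, a model like this yields a near-continuous range, which is exactly the granularity that separates out the artificial cluster at the top.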
“Another benefit of the unbounded score is the ability to aggregate to the host level or any other arbitrary grouping. For example, you may not actually need to know which one vulnerability across your entire environment needs to be fixed today.”
Here I disagree. I think the goal is to find the one worst vulnerability across the environment, fix it, and move on to the next one. If a team isn’t doing that, they’re not remediating the vulnerabilities that put them most at risk. Losing this ability as a result of an unbounded score makes for an incomplete metric, and one can still find the worst vulnerability on a host if the model is crafted in such a way that only a specific, demanding set of data criteria yields the top end of the score range.
TE: Now I get to disagree with you. While this ‘prioritize the riskiest vulnerability, fix it, and move on to the next one’ strategy might constitute an ideal process, it’s simply not reality. If we want real risk reduction inside an actual organization, it’s more effective to adjust the process to the organization than to attempt to push the organization toward an ideal process. Some organizations work well with punch lists of vulnerabilities, but others do better with departmental goals, walls of shame, or remediation committees. That’s part of why flexibility in scoring models is important. Even within the same organization, there are use cases for multiple models. A vulnerability analyst might care about the technical details of exploitability, while an executive might just want to know about progress against objectives. Here’s one take on how different strategies might be employed: https://www.tripwire.com/state-of-security/vulnerability-management/six-strategies-for-reducing-vulnerability-risk/
MR: This would appear to be a difference in organizational philosophy. Let me move on to the next section of your article with which I disagree:
“In fact, a more practical use case in a large environment is to ask which vulnerability your Windows admins should focus on this week. In this case, the aggregate scores of vulnerability instances across a filtered group is most useful (i.e. the highest total score for all instances of a Windows vulnerability in your environment).”
In the alternate schema I propose, we can aggregate based on technology, business process, or host; it’s just a question of finding that worst vulnerability according to a set of criteria, or of changing the filters to be more specific.
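Both views fall out of the same data. Here’s a minimal sketch with a handful of made-up instances, contrasting the aggregate-total-per-filter view you describe with the worst-single-vulnerability-per-filter view I’m arguing for:

```python
from collections import defaultdict

# Made-up vulnerability instances: (filter/group, vulnerability, instance score)
instances = [
    ("windows",  "CVE-A", 9.1), ("windows", "CVE-A", 9.1), ("windows", "CVE-B", 6.5),
    ("network",  "CVE-C", 8.2), ("network", "CVE-D", 7.9),
    ("database", "CVE-E", 9.8),
]

totals = defaultdict(float)  # aggregate total per filter (the article's use case)
worst = {}                   # single worst vulnerability per filter (my use case)

for group, vuln, score in instances:
    totals[group] += score
    if score > worst.get(group, ("", 0.0))[1]:
        worst[group] = (vuln, score)

print({g: round(t, 1) for g, t in totals.items()})
# {'windows': 24.7, 'network': 16.1, 'database': 9.8}
print(worst)
# {'windows': ('CVE-A', 9.1), 'network': ('CVE-C', 8.2), 'database': ('CVE-E', 9.8)}
```

Once the model is granular enough, either roll-up is just a different filter over the same scores.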
Now, let’s not let my comments obfuscate what’s important. We are both looking for ways to make the scoring of vulnerabilities more useful, more granular, and better at predicting intrusions. I just want to make clear that the problem at the heart of CVSS is less about the type of score it is and more about which data is included and how good the model is. In fact, we might do well to use an unbounded score for instantaneous assessments and my model for measurements over time, but either one is better than the status quo.
TE: You’re essentially correct here, and I don’t want to quibble with a magnanimous acknowledgement of shared purpose, because I agree that we’re after the same thing. But I do want to focus not on the accuracy of the scoring but on the effectiveness of the scoring system for getting actual work done. This touches on the difference between precision and accuracy, and perhaps more importantly on the diminishing returns of increasing precision. Think of the classic target diagram used to illustrate the difference between precision and accuracy.
You can use those terms to describe how vulnerability scanning results can go wrong. The case I’d like to call out is the one you might label “Generally Right”: accurate, but not especially precise. Depending on how an organization behaves, there may or may not be increased value in increased precision as long as accuracy remains. For example, if my organization operates more effectively around groups of assets than around individual vulnerabilities, there’s a point at which precision in scoring loses value because the scores are aggregated; I still need to push on my Windows admins more than my network infrastructure team.
A better example might be an organization that behaves irrationally, focusing on an industry standard measurement over a more accurate assessment of risk. In this case, a system like CVSS might have much more power within the organization than a highly accurate but unique vulnerability score.