23 January 2014

SMotW #89: number of infosec events

Security Metric of the Week #89: number of information security events, incidents and disasters


This week, for a change, we're borrowing an analytical technique from the field of quality assurance called "N why's" where N is roughly 5 or more.

Problem statement: for some uncertain reason, someone has proposed that ACME might count and report the number of information security events, incidents and disasters.
  1. Why would ACME want to count their information security events, incidents and disasters?
  2. 'To know how many there have been' is the facile answer, but why would anyone want to know that?
  3. Well, of course they represent failures of the information risk management process. Some are control failures, others arise from unanticipated risks materializing, implying failures in the risk assessment/risk analysis processes. Why did the controls or risk management process fail?
  4. Root cause analysis usually reveals many reasons, even though a specific causative factor may be identified as the main culprit. Why didn't the related controls and processes compensate for the failure?
  5. We're starting to get somewhere interesting by this point. Some of the specific issues that led to a given situation will be unique, but often there are common factors, things that crop up repeatedly. Why do the same factors recur so often?
  6. The same things keep coming up because we are not solving or fixing them permanently. Why don't we fix them?
  7. Because they are too hard, or because we're not trying hard enough! In other words, counting infosec events, incidents and disasters would help ACME address its long-standing issues in that space.
There's nothing special about that particular sequence of why's, nor about the questions themselves (asking 'Who?', 'When?', 'How?' and 'What for?' can be just as illuminating); it's just the main track my mind followed on one occasion. For instance, at point 5, I might equally have asked myself "Why are some factors unique?". At point 3, I might have thought that counting infosec incidents would give us a gauge for the size or scale of ACME's infosec issues, prompting the question "Why does the size or scale of the infosec issues matter?". N why's is a creative technique for exploring the problem space, digging beneath the superficial level.
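For anyone who likes to see the drill-down laid out explicitly, here is a minimal sketch that records a why-chain like the one above as plain data and prints it in order. The wording merely paraphrases the sequence above; in practice the questions and answers come from people in a root-cause-analysis session, not from code.

```python
# A minimal sketch, assuming we simply record each answer in a list.
# The wording paraphrases the sequence above; in practice the questions
# and answers come from a facilitated root-cause discussion, not code.
why_chain = [
    "Why count infosec events, incidents and disasters? To know how many there have been.",
    "Why know that? Because they represent failures of the information risk management process.",
    "Why did the controls or risk management fail? Root cause analysis usually finds several reasons.",
    "Why didn't related controls and processes compensate? Certain common factors crop up repeatedly.",
    "Why do the same factors recur? Because they are not being fixed permanently.",
    "Why aren't they fixed? Too hard, or not trying hard enough.",
]

for depth, step in enumerate(why_chain, start=1):
    print(f"{'  ' * (depth - 1)}{depth}. {step}")
```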

The Toyota Production System uses techniques like this to get to the bottom of issues in the factory. The idea is to stabilize and control the process to such an extent that virtually nothing disturbs the smooth flow of the production line or the quality of the final products. It may be easy for someone to spot an issue with a car and correct it on the spot, but it's better if the causes of the issue are identified and corrected so it does not recur, and better still if it never becomes an issue at all. Systematically applying this mode of thinking to information security goes way beyond what most organizations do at present. When a virus infection occurs, our first priority is to contain and eradicate the virus: how often do we even try figuring out how the virus got in, let alone truly exploring and addressing the seemingly never-ending raft of causative and related factors that led to the breach? Mostly, we don't have the luxury of time to dig deeper because we are already dealing with other incidents.

Looking objectively at the specific metric as originally proposed, ACME managers gave it a PRAGMATIC score of 49%, effectively rejecting it from their shortlist ... but this one definitely has potential. Can PRAGMATIC be used to improve the metric? Obviously, increasing the individual PRAGMATIC ratings will increase the overall PRAGMATIC score since it is simply the mean rating. So, let's look at those ratings (flick to page 223 in the book).
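To make the arithmetic explicit, here's a minimal sketch of how the overall score falls out of the nine criterion ratings. Apart from the zero Actionability rating and the 49% result, the figures are invented purely for illustration and are not the ratings tabulated in the book.

```python
# A minimal sketch of the PRAGMATIC scoring arithmetic: the overall score
# is the arithmetic mean of nine criterion ratings on a 0-100% scale.
# Only the zero Actionability rating and the 49% result come from the post;
# the other names and figures are placeholders.
ratings = {
    "Actionability": 0,    # the stand-out weakness discussed below
    "criterion 2": 45,     # remaining names and values are hypothetical
    "criterion 3": 50,
    "criterion 4": 55,
    "criterion 5": 55,
    "criterion 6": 60,
    "criterion 7": 60,
    "criterion 8": 60,
    "criterion 9": 56,
}

overall = sum(ratings.values()) / len(ratings)
print(f"Overall PRAGMATIC score: {overall:.0f}%")  # -> 49% with these figures
```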

In this case, the zero rating for Actionability stands out a mile. Management evidently felt totally powerless, frustrated and unable to deal with the pure incident count. The number in isolation was almost meaningless to them, and even plotting the metric over time (as shown on the example graph above) would not help much. Can we improve the metric to make their job easier?

As indicated at point 7 above, this metric could help by pointing out how many information security events, incidents and disasters link back to systemic failures that need to be addressed. Admittedly, the bare incident count itself would not give management the information needed to get to that level of analysis, but it's not hard to adapt and extend the metric along those lines, for instance categorizing incidents by size/scale and nature/type, as well as by the primary and perhaps secondary causative factors, or the things that might have prevented them from occurring.
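To illustrate what such an extension might look like in data terms, here is a hedged sketch that tags each incident record with size, type and causative-factor fields and tallies them. The field names and example values are assumptions made up for the illustration, not ACME's actual classification scheme.

```python
# A hedged sketch, assuming each incident record carries illustrative
# size, type and causative-factor fields (the field names and values
# are invented for the example, not a real classification scheme).
from collections import Counter

incidents = [
    {"size": "minor", "type": "malware",    "cause": "unpatched workstation"},
    {"size": "major", "type": "user error", "cause": "missing awareness training"},
    {"size": "minor", "type": "malware",    "cause": "unpatched workstation"},
]

for field in ("size", "type", "cause"):
    tally = Counter(record[field] for record in incidents)
    print(f"Incidents by {field}: {tally.most_common()}")
```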

A pragmatic approach would be to start by assigning incidents to fairly crude or general categories - in fact this is almost universally done by the Help Desk-type functions that normally receive and log incident reports, so the additional information is probably already available from the Help Desk ticketing system. Management noting a preponderance of, say, malware incidents, or an adverse trend in the rate of incidents stemming from user errors, would be the trigger to find out what's going wrong in those areas. Over time, the metric could become more sophisticated, with more detailed categorization and analysis.
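Building on that, the trend angle is just as simple to sketch: tally incidents per category per month using whatever crude categories the Help Desk already assigns, then eyeball (or chart) the monthly counts. The ticket structure below is again assumed purely for illustration.

```python
# A rough sketch of the trend view, assuming tickets exported from a
# (hypothetical) Help Desk system with a closure month and a crude category.
from collections import defaultdict

tickets = [
    {"month": "2014-01", "category": "malware"},
    {"month": "2014-01", "category": "user error"},
    {"month": "2014-02", "category": "user error"},
    {"month": "2014-02", "category": "user error"},
]

counts = defaultdict(lambda: defaultdict(int))
for ticket in tickets:
    counts[ticket["category"]][ticket["month"]] += 1

for category, months in sorted(counts.items()):
    series = [months[m] for m in sorted(months)]
    print(category, series)  # e.g. user error [1, 2] suggests an adverse trend
```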
