27 February 2013

SMotW #46: IT capacity and performance metric

Security Metric of the Week #46: measuring IT capacity and performance


The capacity and performance of IT services, functions, systems, networks, applications, processes, people etc. are generally measured using a raft of distinct metrics addressing separate pieces of the puzzle.  Collectively, these indicate how 'close to the red line' IT is running.  

Conceivably, the individual metrics could be combined mechanistically to generate a single summary metric or indicator giving an overall big-picture view of IT capacity and performance ... but in practice a dashboard-type display is more likely, with multiple gauges showing the important metrics in one view, allowing the viewer to see which aspects of IT capacity and performance are or are not causing concern, and perhaps drill down into specific gauges for more detail. 
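
To make the dashboard idea a little more concrete, here is a minimal sketch in Python.  The gauge names, readings and 'red line' thresholds are purely hypothetical; a real dashboard would of course pull live figures from the monitoring tools rather than a hard-coded table.

GAUGES = {                        # gauge name: (current utilisation %, 'red line' %)
    "CPU - core ERP cluster":   (78, 85),
    "SAN storage":              (91, 90),
    "WAN bandwidth":            (64, 80),
    "Service desk call load":   (88, 95),
}

def dashboard(gauges):
    """Report each gauge's headroom against its red line, with a simple concern flag."""
    for name, (current, red_line) in gauges.items():
        headroom = red_line - current
        status = "CONCERN" if headroom <= 0 else ("watch" if headroom < 10 else "ok")
        print(f"{name:25} {current:3d}% (red line {red_line}%)  headroom {headroom:+3d}  {status}")

dashboard(GAUGES)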

Glossing over the question of precisely what is shown on IT's capacity and performance dashboard, let's see how ACME Enterprises scored the metric using the PRAGMATIC approach:

P     R     A     G     M     A     T     I     C     Score
92    92    82    77    96    62    84    64    29     75%
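
Judging by the tables in this series, the overall score appears to be the simple unweighted mean of the nine criterion ratings, rounded to the nearest whole percent - easily checked with a few lines of Python:

CRITERIA = ("P", "R", "A", "G", "M", "A", "T", "I", "C")

def pragmatic_score(ratings):
    """Overall score = mean of the nine 0-100 criterion ratings, to the nearest percent."""
    assert len(ratings) == len(CRITERIA)
    return round(sum(ratings) / len(ratings))

# ACME's ratings for the IT capacity and performance metric:
print(pragmatic_score([92, 92, 82, 77, 96, 62, 84, 64, 29]))   # prints 75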

ACME's managers have taken the view that the metric's Accuracy and Independence are both of some concern since (in their context) the IT department reports its own capacity and performance figures to the rest of the organization, and clearly has an interest in painting as rosy a view as possible.  [This situation is common where IT services are specified in a contract or Service Level Agreement, especially if the numbers affect IT's recharge fees and budgets.]  At the same time, everyone knows this, so IT's natural bias is counteracted to some extent by the cynicism of managers outside of IT, with the consequence that the metric is not as accurate, trustworthy and valuable as it might be if it were measured and reported dispassionately by some independent function.

The metric's Cost-effectiveness merits just 29% in the managers' opinion.  The cost of gathering the base data (across numerous IT capacity and performance parameters, remember), analysing it, massaging it (!), presenting it, viewing, considering, challenging and ultimately using it, amounts to a lot of time and effort for a complex metric that has no more than a ring of truth to it.  Overall, the managers evidently feel this metric generates far more heat than light.

Notice that the PRAGMATIC analysis has focused management's attention on various concerns with the design of the metric, and hints at a number of ways in which the design might be altered to improve its score, such as making the measurement process more independent and objective.  While of course it would have been possible to identify and address these concerns without explicitly using the PRAGMATIC approach, in practice people tend not to consider such things, at least not in sufficient depth to reach the breakthrough moment where genuine solutions emerge. 

One such breakthrough proposal on ACME's table is to discard the self-measurement-and-reporting approach entirely, replacing it with a new metric in which IT's business customers rate IT's capacity and performance.  The IT department is likely to feel threatened by this revolution, but think about it: if IT's customers identify issues and concerns from their perspective, IT has a clear mandate to address them, and can legitimately use the business requirements as a basis for its resourcing requests.  IT could still use the original capacity and performance dashboard for internal IT management purposes, without the need to massage or justify the figures to the rest of ACME.  This change of approach would substantially increase the PRAGMATIC score for the metric and, more importantly, would enhance the relationship between IT and the business.  Result!

21 February 2013

SMotW #45: extent of security testing

Security Metric of the Week #45: extent to which information security is incorporated in software QA

Well-managed IT development projects incorporate information security at all applicable stages of the systems lifecycle, from initial outline specification and business case, through design, development, testing and release, on through operational use, management and maintenance of the system, right through to its retirement/replacement at the end of its life.  It would be possible to measure that in order to generate some sort of security index for all systems, using the index to drive up security integration and quality, but doing so would be a tall order for most organizations.  Perhaps we should talk about that another time.

This week's example security metric is far simpler with a much tighter scope, measuring information security activities only during the "software Quality Assurance" (testing) phases of a development.  

The "extent to which information security is incorporated" is rather vague wording but we assume the metric would be specified more explicitly by the organization.  For instance, someone could examine all the development methods being used to identify all the QA stages (plural), then the steps or activities where information security could or should be involved.  The next step would be to survey all the currently-active development projects, checking available evidence to confirm whether information security is or is not involved as it could or should be.

The individual checks may involve crude decisions at each point ("Is information security involved, or not?", a binary option) or more sophisticated assessments of the nature of the involvement, for instance using a predefined Likert scale (e.g. "Not involved at all (score 1), slightly involved (2), involved (3), highly involved (4) or fully involved (5)").  Furthermore, the importance of information security's involvement at each step could be assessed in a similar way ("Is information security involvement necessary or optional at this stage?" or "How important is the involvement at this stage: (1) not important at all; (2) quite important; (3) important; (4) very important; (5) absolutely critical?").  Finally, the individual projects will vary in the relevance or necessity of information security involvement, depending on the associated information security risks relating to factors such as compliance obligations, technical complexity, business security demands, privacy aspects etc., and again these could be measured.
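
As a rough illustration of the more sophisticated version, the following Python sketch scores one hypothetical project: an assessor records involvement (1-5) and importance (1-5) for each QA stage, and the project's index is the importance-weighted involvement expressed as a percentage of the maximum possible.  The stage names, ratings and weighting scheme are all assumptions for illustration, not a prescribed scheme.

project_assessment = {            # QA stage: (involvement 1-5, importance 1-5)
    "test planning":         (2, 4),
    "test case design":      (3, 5),
    "security testing":      (4, 5),
    "user acceptance tests": (1, 3),
    "release sign-off":      (3, 4),
}

def involvement_index(assessment):
    """Importance-weighted security involvement, as a percentage of the maximum possible."""
    weighted = sum(involvement * importance for involvement, importance in assessment.values())
    maximum  = sum(5 * importance for _, importance in assessment.values())
    return round(100 * weighted / maximum)

print(f"Information security involvement index: {involvement_index(project_assessment)}%")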

Notice that more sophisticated, information-rich versions of this metric come at a higher cost in terms of the need to specify the metric's Likert scales or other scoring parameters, to measure the projects against them, to analyze and report them, and of course to interpret and use them.  It takes time and effort to do all that.  Notice also that there is subjectivity throughout - things have to be interpreted in the specific context of each development project.  Even the definition of what constitutes a "development project" for this metric is subject to interpretation (e.g. does it only involve projects run in-house by IT, or does it include those run by IT outsourcers and cloud suppliers, and by business people on their desktops, tablets, smartphones and maybe BYOD equipment?).

Using the PRAGMATIC method, ACME Enterprises Inc scored this metric thus:

P     R     A     G     M     A     T     I     C     Score
85    80    67    62    70    50    35    35    50     59%


The assessors were evidently quite impressed with the metric's Relevance, Predictability and Meaningfulness for information security, but concerned about its subjectivity.  The Cost-effectiveness rating is neutral on the basis that the metric could be quite simple, quick and cheap at first, but may evolve into something more sophisticated later if it turns out that the additional information would be valued (implying that ACME management are actively and consciously managing a suite of security metrics).

15 February 2013

One louder

Here's a little lesson on metrics, courtesy of rock-gods Spinal Tap: 

This is the top to a, you know, what we use on stage but it's very, very special because, if you can see ... Yeh ... the numbers all go to eleven.  Look, right across the board, eleven, eleven, eleven and ...  Oh, I see.  And most amps go up to ten?  Exactly.  Does that mean it's louder?  Is it any louder?  Well, it's one louder, isn't it?  It's not ten.  You see, most blokes, you know, will be playing at ten.  You're on ten here, all the way up, all the way up, all the way up, you're on ten on your guitar.  Where can you go from there?  Where?  I don't know. Nowhere.  Exactly.  What we do is, if we need that extra push over the cliff, you know what we do?  Put it up to eleven.  Eleven.  Exactly.  One louder.  Why don't you just make ten louder and make ten be the top number and make that a little louder? ... These go to eleven.
 
This Is Spinal Tap - interview with a band member

13 February 2013

SMotW #44: system change correlation

Security Metric of the Week #44: Correlation between system/configuration logs and authorized change requests

In theory, changes to controlled IT systems (other than data changes made by legitimate, authorized users through their applications) should only be made under the authority of and in accordance with approved change requests.  In practice, other changes typically occur for various reasons such as ad hoc system administration (usually involving relatively "minor" changes that may not require separate authorization) and changes made for nefarious purposes (such as hacks and malware).  Furthermore, authorized changes aren't always made (e.g. they are delayed, overtaken by events, or neglected).  This metric involves someone somehow linking actual with authorized changes.  
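
By way of illustration only, here is a Python sketch of that matching exercise, using made-up change requests, log-derived changes and an assumed two-day authorization window; a real implementation would key on change ticket numbers, host names and timestamps drawn from the change management system and the logs themselves.

from datetime import date, timedelta

approved_changes = [              # (change request ID, system, approved date)
    ("CR-1012", "erp-db01",  date(2013, 2, 4)),
    ("CR-1015", "web-prx02", date(2013, 2, 6)),
]
observed_changes = [              # (system, date seen in the logs, summary)
    ("erp-db01",   date(2013, 2, 4), "schema patch applied"),
    ("web-prx02",  date(2013, 2, 9), "proxy configuration edited"),   # outside the window
    ("file-srv03", date(2013, 2, 7), "local admin account added"),    # no change request at all
]

WINDOW = timedelta(days=2)        # how long after approval a change may legitimately appear

def correlate(approved, observed):
    unauthorized, unimplemented = [], list(approved)
    for system, seen, summary in observed:
        match = next((cr for cr in approved
                      if cr[1] == system and cr[2] <= seen <= cr[2] + WINDOW), None)
        if match is None:
            unauthorized.append((system, seen, summary))
        elif match in unimplemented:
            unimplemented.remove(match)
    return unauthorized, unimplemented

unauthorized, unimplemented = correlate(approved_changes, observed_changes)
print("Observed changes with no matching CR:", unauthorized)
print("Approved CRs with no observed change:", unimplemented)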

The metric's PRAGMATIC ratings and overall score are quite good apart from the final three criteria: 

P     R     A     G     M     A     T     I     C     Score
87    80    90    80    80    80    60    50    47     73%

The person measuring this is probably going to be a system administrator who has a direct interest in the metric, affecting the Independence rating.  The metric is unlikely to identify a rogue sysadmin, unless they are so inept as to leave obvious traces and incriminate themselves!  The metric could be independently measured or cross-checked by someone else (such as an IT auditor) to confirm the values, especially if there is some reason to doubt the integrity of the measurer or the validity and Accuracy of the measurements.  However, cross-checking inevitably impacts the Cost-effectiveness rating and further increases the Time delay before the measure is available.

Aside from that issue, the metric is bound to be quite Costly, given the painstaking manual analysis that would be needed to correlate technical log entries with change requests.  A given change could generate a multitude of log entries, possibly on several systems.  Furthermore, log entries accumulate constantly in the course of normal operations, hence the measurer would need to sift out those that are associated with authorized changes from those that aren't.  

To be of much use, the metric would also need to distinguish trivial from important changes, requiring still more analysis.

Oh by the way, mathematicians reading this may expect the metric to be represented as a correlation coefficient between -1 and +1, but that is not necessarily so.  While there may be numbers behind the scenes, a crude red/amber/green rating of a bunch of servers may be entirely sufficient for a management report and fit for purpose if, for instance, it enables management to spot obvious issues with particular sysadmins, departments or business units, or with certain categories/types of change.   In our jaundiced view, information security metrics are far more valuable as decision-support tools for practitioners and managers than as theoretical exercises in mathematical precision.  It's handy if the two objectives coincide, but not  always necessary!
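
For instance, the translation from 'numbers behind the scenes' to a traffic-light report might be as simple as this little sketch, where the thresholds and per-server match rates are pure assumptions:

def rag(match_rate, green_at=95, amber_at=80):
    """Map a 0-100% authorized-change match rate onto a traffic-light rating."""
    if match_rate >= green_at:
        return "GREEN"
    return "AMBER" if match_rate >= amber_at else "RED"

for server, rate in {"erp-db01": 98, "web-prx02": 86, "file-srv03": 61}.items():
    print(f"{server:12} {rate:3d}%  {rag(rate)}")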

11 February 2013

PRAGMATIC policy metrics

PRAGMATIC information security policy metrics

First, to set the context for this piece, let me be explicit about four important presumptions:
  1. "Policy" means a clear statement of management intent or direction or control - a written set of high-level requirements or constraints over what employees should and should not do under certain circumstances, considered and laid out by management, and formally mandated on everyone in the organization.
  2. Management is more than merely 'concerned' to achieve compliance with the corporate policies: they have implemented a suite of compliance-related processes and activities with the goal of achieving a high level of - though not necessarily total - compliance (e.g. there is a formalized way of identifying and handling policy exceptions and, where appropriate, granting exemptions).
  3. Employees are aware of their obligations under various policies.  They have ready access to the policies, and they are actively encouraged to read them and comply.  The policies are written straightforwardly enough to be readily understood, and there are suitable support mechanisms for anyone who needs help to understand and implement the policies.  Policies are generally considered sensible, appropriate and necessary - not draconian or frivolous.  There are change-management activities supporting their introduction, as well as subsequent reviews and updates.
  4. Enforcement is treated as a last resort, but management is willing to apply suitable penalties firmly and fairly if that is what it takes to achieve compliance.  More than that, policies are actually enforced, contravenors are penalized, and everyone understands that there probably will be adverse consequences for them (at least) if they break the rules.
If those presumptions hold true, I would argue that there is value in using metrics to support the policies.  Compliance-related metrics are obvious candidates but there will usually be others that relate to the subjects, purposes or objectives of the policies.  

For example, suppose there is a corporate policy on cryptography, covering its use for encryption, authentication and integrity purposes.  Suitable metrics in each of these areas should support the need for the policy in the first place, such as by identifying what proportion of systems are still using deprecated algorithms or weak keys.  Once the policy is approved, tracking and reporting the same metrics ought to show how effective the policy is proving in practice at dealing with the issues.
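
As a rough sketch of one such supporting metric, the following Python fragment counts the proportion of systems in a hypothetical crypto inventory still relying on deprecated algorithms or short keys; the algorithm list, key-length floor and inventory entries are illustrative assumptions, not recommendations.

DEPRECATED_ALGORITHMS = {"DES", "RC4", "MD5", "SHA-1"}
MINIMUM_RSA_BITS = 2048

inventory = [                     # (system, algorithm, key length in bits)
    ("vpn-gw01",   "AES-256",  256),
    ("legacy-app", "DES",       56),
    ("web-portal", "RSA",     1024),
    ("mail-relay", "RSA",     2048),
]

def weak(algorithm, key_bits):
    return (algorithm in DEPRECATED_ALGORITHMS
            or (algorithm == "RSA" and key_bits < MINIMUM_RSA_BITS))

weak_count = sum(1 for _, algorithm, key_bits in inventory if weak(algorithm, key_bits))
print(f"{weak_count} of {len(inventory)} systems "
      f"({round(100 * weak_count / len(inventory))}%) use deprecated algorithms or weak keys")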

That in turn suggests the possibility of proactively stating and using metrics in policies, perhaps in a discrete metrics section specifying compliance and subject-matter metrics in the same way that policies usually state 'responsibilities' and 'compliance'.  Metrics may also be used in the preamble, introduction or background (within the policies themselves and/or in the emails, awareness and training materials that accompany their implementation), helping to explain and justify the need for the policies, putting meat on the bones rather than the usual vague assertions about risks.

Naturally, I'm talking about PRAGMATIC metrics, implying that thought should be applied to the specific choice of metrics, which in turn implies that someone is thinking quite deeply about the specific requirements and purposes of the policies - no bad thing in its own right.  

If those presumptions do not hold true, policy metrics are probably not at the very top of your to-do list, but bear them in mind.  Now may be a good time to seed the idea with your more enlightened colleagues.

08 February 2013

DOGMATIC metrics

DOGMATIC information security metrics

Whereas most of us in the profession see business advantages in having reliable, accurate, truthful data about information security, metrics are occasionally used for less beneficial and less ethical purposes.  There are situations in which information is deliberately used to mislead the recipient, for example where the reporting party wishes to conceal or divert attention from information security issues in their remit.

We have seen this most obviously in the context of regular performance reporting by service providers to their customers against SLAs (Service Level Agreements) and contractual requirements.  IT outsourcers or IT departments typically report "uptime", a metric that sounds straightforward enough at face value but turns out to be something of a minefield for the unwary.

Imagine, for instance, that I, an IT Operations manager for an IT outsourcer, report to you, the relationship manager for my client, that we have achieved our SLA targets of 98% uptime for the last month.  Sounds great, right?  Evidently you have set targets and we have met them.  Fantastic.  Imagine also that I don't just tick a box but provide a fancy performance management report complete with glossy cover, technicolor graphs and numerous appendices replete with lengthy tables showing reams of supporting data about the services.  Furthermore, I have been reporting like this for years, since the start of the contract in fact.  

Buried away in those graphs and tables spread throughout the report are some fascinating facts about the services.  If anyone has the patience and dedication to pore over the numbers, they might discover that the services were in fact unavailable to users several times last month:
  • 7 times for a few minutes each due to server hardware issues (total ½ hour);
  • Once for 1 hour to diagnose the above-noted issues, and once more for 2 hours to replace a faulty power supply (total 3 hours);
  • 31 times for between 1 and 4 hours each for backups (total 50 hours);
  • Once for nearly 2 days for a test of the disaster recovery arrangements (total 40 hours);
  • An unknown number of times due to performance and capacity constraints causing short-term temporary unavailability (total unknown).
The total downtime (more than 93½ hours) was far more than the 2% evidently allowed under the SLA (roughly 15 hours per month), so how come I reported that we achieved our targets?  Here are five possible reasons (a toy calculation after this list shows how the numbers can be made to support my claim):
  1. Backups and disaster recovery testing are classed as 'allowable downtime' and are not classed as part of the defined services covered by the SLA;
  2. The short-term performance and capacity issues were below the level of detection (not recorded) and therefore it is not possible to determine a meaningful downtime figure;
  3. The individual events resulting from hardware glitches were short enough not to qualify as downtime, which is defined in the SLA as something vaguely similar to "identified periods of non-provision of defined services to the customer, other than those permitted in this Agreement under sections 3 and 4, lasting for at least five (5) minutes on each and every occasion";
  4. Several of the downtime episodes occurred out-of-hours, specifically not within the "core hours" defined in the SLA;
  5. I lied!
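
To see how reasons 1 to 4 play out in the numbers, here is a toy Python reconstruction of the arithmetic, using the downtime figures quoted above and assumed exclusion rules; the same downtime log yields roughly 87% availability as experienced by the users yet comfortably over 98% as reported under the SLA.

MONTH_HOURS = 30 * 24             # roughly 720 hours in the reporting month

downtime_log = {                  # category: (hours lost, excludable under the SLA?)
    "hardware glitches":    (0.5,  True),    # each event below the five-minute floor
    "diagnosis / PSU swap": (3.0,  False),
    "backups":              (50.0, True),    # classed as 'allowable downtime'
    "DR test":              (40.0, True),    # classed as 'allowable downtime'
}

def uptime_percent(log, honour_exclusions):
    lost = sum(hours for hours, excludable in log.values()
               if not (honour_exclusions and excludable))
    return 100 * (MONTH_HOURS - lost) / MONTH_HOURS

print(f"Uptime as experienced by the users: {uptime_percent(downtime_log, False):.1f}%")  # ~87.0%
print(f"Uptime as reported under the SLA  : {uptime_percent(downtime_log, True):.1f}%")   # ~99.6%
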
Possibly as a result of complaints from your colleagues and management concerning the service, you may take me to task over my report and we will probably discuss reasons 1-4 in a fraught meeting (strangely enough, both of us know there is a fifth reason, but we never actually discuss that possibility!).  I will quote the SLA's legalese at you and produce reams of statistics, literally.  You will make it crystal clear that your colleagues are close to revolting over the repeated interruptions to their important business activities, and will do your level best to back me into a corner where I concede that Something Will Be Done.  After thrashing around behind the bike sheds for a while, we will eventually reach a fragile truce, if not mutual understanding and agreement.

Such is life.  

We both know that "uptime" is a poor metric.  Neither of us honestly believes that the 98% target, as narrowly and explicitly specified by our lawyers in the SLA, is reasonable, and we both know that the service is falling short of the customer's expectations, not least because those expectations have almost certainly changed since the SLA was initially drawn up.  However, this is a commercial relationship with a sole supplier, in a situation that imposes an infeasibly high cost on you to find and transfer to an alternative supplier.  I have commitments to my stakeholders to turn a profit on the deal, and you vaguely remember that we were selected on the basis of the low cost of our proposal.  Uptime is not, in fact, the real issue here, but merely a symptom and, in this case, a convenient excuse for you and me to thrash out our differences every so often and report back to our respective bosses that we are On Top Of It.

Uptime has been used in this way for decades, pre-dating the upsurge in IT outsourcing.  It has never been a particularly PRAGMATIC metric.  It is almost universally despised and distrusted by those on both sides of the report.  And yet there it remains, laughing at us from the page.

06 February 2013

Think, decide, act

"Users must not make the mistake of thinking that this number-heavy approach is somehow going to make decisions for them – the method is just a heuristic tool to help people think about the issues, decide on solutions and act on their decisions."
Well said Dave!  That statement came at the end of a piece advising businesses to develop matrices showing the knowledge and skills of employees in order to identify single points of failure and gaps, for business continuity purposes.  

I'm not entirely convinced that Dave's suggested approach is materially better than management and/or HR simply scratching their heads and working out who the organization would miss the most if they fell under a proverbial bus.   On the other hand, 'completing a self-assessment questionnaire/skills matrix by the end of next month' might be a convenient lever to ensure that some analysis is in fact done rather than being continually back-burnered.  
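
For what it's worth, the matrix approach boils down to something like this Python sketch - a hypothetical mapping of people to skills, inverted to find the skills held by only one person (the single points of failure the article is worried about):

from collections import defaultdict

skills_matrix = {                 # employee: set of skills they can cover
    "Alice":  {"firewall administration", "PKI", "incident response"},
    "Bob":    {"firewall administration", "backup operations"},
    "Carlos": {"incident response", "payroll system"},
}

def single_points_of_failure(matrix):
    """Invert the matrix and return the skills held by exactly one person."""
    holders = defaultdict(set)
    for person, skills in matrix.items():
        for skill in skills:
            holders[skill].add(person)
    return {skill: people.pop() for skill, people in holders.items() if len(people) == 1}

for skill, person in sorted(single_points_of_failure(skills_matrix).items()):
    print(f"Only {person} can cover: {skill}")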

There are two more general metrics points here:  

  1. Metrics are simply a type of information.  They achieve nothing unless the people who receive the information act appropriately on it, preferably doing things better than they would have done without the metrics.
  2. Measuring something inevitably focuses attention on it, and sometimes that alone may be sufficient for those doing the measuring to spot and deal with issues directly, and/or lead directly to changes in whatever is being measured as a result of being observed.  Therefore in some situations, the measurement process - the very act of measuring - may be valuable in its own right.  Measuring sometimes trumps metrics!

05 February 2013

SMotW #43: Value at Risk (VaR)

Security Metric of the Week #43: Value at Risk (VaR)

VaR is one of several metrics used to measure the financial aspects of information security.  

VaR is normally used in investment management, for insurance purposes, and to determine the appropriate levels of contingency cash reserves needed by banks etc., but it can be applied to measure other kinds of risk.

In the financial world, VaR is the calculated value of a portfolio of financial assets (e.g. stocks and shares) at which there is a stated probability of loss within a defined period, assuming normal trading.  For example, a 5% daily VaR of $1m means the value of the portfolio is predicted to fall by more than $1m on one day out of twenty, on average.
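
As a rough illustration of that worked example, here is a Python sketch of a 5% one-day VaR estimated by historical simulation over a made-up series of daily portfolio profit/loss figures (in $m); real VaR calculations use much longer histories and/or parametric or Monte Carlo models.

daily_pnl = [ 0.4, -0.2,  1.1, -1.3,  0.6, -0.8,  0.2, -1.6,  0.9, -0.1,
              0.3, -0.5,  0.7, -1.1,  0.5, -0.9,  0.1, -0.3,  0.8, -1.4]

def historical_var(pnl, tail=0.05):
    """Smallest observed loss exceeded on no more than `tail` of the days in the sample."""
    losses = sorted((-p for p in pnl), reverse=True)   # biggest losses first
    days_allowed_to_exceed = int(tail * len(losses))   # 1 day out of 20 at the 5% level
    return losses[days_allowed_to_exceed]

print(f"5% one-day VaR: ${historical_var(daily_pnl):.1f}m")   # the loss exceeded on ~1 day in 20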

Management of ACME Enterprises Inc calculates the PRAGMATIC score for VaR at just 38%:

P     R     A     G     M     A     T     I     C     Score
70    65    20    30    35    40    30    30    22     38%


Although VaR appears to be quite Predictive and Relevant to information security, the remaining AGMATIC criteria reflect management's misgivings about this metric:

  • Actionability is low since there is not much that information security people can do to influence the value of information assets, aside from making it more expensive for adversaries to compromise them;
  • Genuineness suffers from the ambiguities and assumptions involved in calculating VaR;
  • Meaningfulness: the variety of definitions and interpretations of VaR implies confusion about its meaning unless we take the trouble to explain it properly in this context;
  • Accuracy may be acceptable for commonplace security incidents that occur with predictable frequencies and outcomes, but not for rarer and often more extreme events;
  • Timeliness is limited because of the practical difficulties of re-calculating and updating the metric as assets and risks change;
  • Independence: the people best placed to determine the metric are the information asset owners, in conjunction with information security/risk management professionals.  They all have a vested interest in assuring management that information assets are not unduly at risk, hence their VaR calculations may be biased;
  • Cost-effectiveness suffers because of the effort required to calculate and update VaR relative to the utility of the metric.

Your opinions on the criteria and scoring may well differ, and that's fine - a good sign in fact.  If you had been involved in the PRAGMATIC ratings discussion, you would have had the chance to influence the outcome.  We are simply reporting the discussions that took place, hinting at the thinking processes and rationale behind the assigned ratings.  The analysis and discussion are a vital part of the PRAGMATIC process and, if anything, are more important than the final score.  To understand the scoring fully, you would need to appreciate contextual factors such as the business of ACME Enterprises Inc., the nature of its information assets and information security risks, and the backgrounds, motivations and current interests of the managers involved.