Favourite quotes

"All models are wrong, some are useful" George Box

"Prediction is very difficult, especially if it's about the future" Niels Bohr

Favourite formula

exp(pi*i)+1=0

This is our favourite equation of all time. It is simple and elegant. It combines two irrational numbers, pi and e. An irrational number cannot be expressed as a ratio of two integers, so its decimal expansion never terminates or repeats. The third number, i, is the square root of -1. The formula takes two numbers whose digits we can never write out in full [see the first 100,000 digits of pi (here) or the first 10,000 digits of e (here)] and which are very different in concept. They are combined with an "imaginary" number, one that lies nowhere on the real number line, and produce a very real result. Perhaps more importantly, this equation often comes up in real-world mathematics. Maths can be fun.
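The identity can be checked numerically. The short Python sketch below uses only the standard library. Because floating-point arithmetic holds pi to a finite number of digits, the result should be a tiny residual rather than an exact zero; the tolerance of 1e-12 is our own arbitrary choice.

```python
import cmath
import math

# Evaluate exp(pi*i) + 1 using complex arithmetic.
residual = cmath.exp(1j * math.pi) + 1

# Floating point stores pi to roughly 16 significant digits,
# so expect a very small non-zero residual rather than exactly 0.
print("size of residual:", abs(residual))
print("zero within tolerance:", abs(residual) < 1e-12)
```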

Favourite abuse of statistics

Likert Scale

You will all have seen statistics based on a Likert scale. These scales are commonly used to measure people's attitudes in a survey. We have seen presentations and analyses where the scores have been averaged. This is wrong on several counts. A typical five-point scale is:

  1. Very satisfied;
  2. Somewhat satisfied;
  3. Neither satisfied nor dissatisfied;
  4. Somewhat dissatisfied; and
  5. Very dissatisfied.

We have never seen an Olympic medal summary claim that the average medal was 1.5 (first=1, second=2, third=3). Yet we have seen similar claims made of a Likert scale. The number attached to each response only records its position in the order: the data is ordinal. The responses could equally well be labelled with letters (a, b, c, ...) instead of 1, 2, 3, .... Adding ordinal data is meaningless. For example, with the coding above, summing would mean that the organisation values five people answering "Very satisfied" (1 each) the same as one person answering "Very dissatisfied" (5), since both sum to 5. Because the data is ordinal, any statistic that relies on adding the scores is meaningless. A valid use of the data might be to report that in a previous period 20% of surveyed people were "Very or Somewhat Dissatisfied" and that the figure has since increased to 30%. Susan Jamieson has provided an interesting paper on this subject, "Likert scales: how to (ab)use them", Medical Education 2004;38:1217-1218 (here).
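A small Python sketch makes the point concrete. The two sets of responses are invented for illustration and are coded 1 to 5 in the order of the scale listed earlier. They have the same mean code yet describe very different situations, which a frequency table or the share of dissatisfied respondents shows immediately.

```python
from collections import Counter

LABELS = {
    1: "Very satisfied",
    2: "Somewhat satisfied",
    3: "Neither satisfied nor dissatisfied",
    4: "Somewhat dissatisfied",
    5: "Very dissatisfied",
}

# Hypothetical survey responses (not real data).
polarised = [1] * 5 + [5] * 5   # half delighted, half very unhappy
neutral = [3] * 10              # everyone indifferent


def mean(codes):
    """Arithmetic mean of the ordinal codes (the misleading statistic)."""
    return sum(codes) / len(codes)


def share_dissatisfied(codes):
    """Fraction answering 'Somewhat' or 'Very dissatisfied' (codes 4 and 5)."""
    return sum(1 for c in codes if c >= 4) / len(codes)


def frequency_table(codes):
    """Number of responses in each category, in scale order."""
    counts = Counter(codes)
    return {LABELS[c]: counts.get(c, 0) for c in sorted(LABELS)}


for name, codes in [("polarised", polarised), ("neutral", neutral)]:
    print(name, "| mean code:", mean(codes),
          "| share dissatisfied:", share_dissatisfied(codes))
    print(frequency_table(codes))
```

Reporting the counts or proportions per category keeps the ordinal nature of the data intact, whereas the mean treats the gaps between categories as equal.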

Statistical confidence

Many times we have seen results presented as in the following fictitious scenario. The fraud level in the last quarter was $20 million ± $5 million, stated at a 95% confidence level. For the previous quarter the estimate was $22 million ± $5 million at the same confidence level. From a statistical viewpoint the analyst has indicated 95% confidence that the fraud lay between $15 million and $25 million in the last quarter and between $17 million and $27 million in the previous one. These ranges reflect only statistical (sampling) variation in the data.

Let's unpick this variation. Strictly, the analyst is saying that if the sampling were repeated many times, about 95% of the intervals constructed this way would contain the true figure. The two ranges overlap substantially. The observed improvement might be genuine, but it is also possible that it is due to stochastic fluctuations in the data. The small print is that this change is too uncertain to break out the champagne, and that next period the picture may reverse. Confidence limits are important.
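One way to make the comparison explicit is to put an interval around the change itself. The Python sketch below does this for the fictitious figures, under two assumptions of ours that the scenario does not state: the two quarterly estimates are independent, and each ± value is a normal-approximation 95% half-width (1.96 standard errors).

```python
import math

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval


def difference_interval(est_a, half_width_a, est_b, half_width_b):
    """95% interval for (est_b - est_a), assuming independent estimates."""
    # Recover each standard error from its reported 95% half-width.
    se_a = half_width_a / Z_95
    se_b = half_width_b / Z_95
    # Standard errors of independent estimates combine in quadrature.
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    diff = est_b - est_a
    return diff - Z_95 * se_diff, diff + Z_95 * se_diff


# Scenario figures in $ million: previous quarter 22 ± 5, last quarter 20 ± 5.
low, high = difference_interval(22, 5, 20, 5)
print(f"Change in fraud level: {low:.1f} to {high:.1f} ($ million)")
print("Interval includes zero:", low < 0 < high)
```

If the interval for the change includes zero, the data are consistent with no change at all, which is the statistical version of keeping the champagne on ice.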

Spreadsheets: a tool with limitations

Spreadsheets make code review difficult or impossible. The code is scattered across hundreds of cells and many tabs. It may even be spread across multiple spreadsheets. If the underlying platform makes peer review difficult, how can the result be expected to be right? Publicly available information on the wisdom of using spreadsheets follows:

Spreadsheets have their own difficulties; however, they have brought programming to the masses and are not going to be replaced in the near future. It is important that management be aware of the inherent problems with spreadsheets, especially large ones. For business-critical decisions, a better answer may be to use dedicated software.

Oracle, a large software provider, has noted some of the "boo boos" that have occurred using spreadsheets here. Taking a balanced viewpoint, we note that software errors are not purely the domain of spreadsheets.

Use and abuse of regression

The following is provided as a minimal checklist for anyone using or receiving regression information.

A regression model relates a dependent (output) variable to one or more independent (input) variables. These variables should be chosen because there is a principled reason to expect that the inputs cause the outputs.

There are many types of regression; they tend to share the following diagnostics:

Our recommendation is that R², F-test and t-test results should accompany any regression, along with a discussion of why the dependent and independent variables were chosen. The statistics should also be accompanied by a simple explanation of their significance.
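As an illustration of what such a report might contain, the Python sketch below fits a simple (one-input) least-squares regression and prints R², the t statistic for the slope and the F statistic. The data are invented, the function name is our own, and p-values are omitted because they need a statistics library; in practice a dedicated package would normally produce all of these.

```python
import math

# Hypothetical data: x is the input (independent), y the output (dependent).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]


def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with basic diagnostics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Split the total variation into explained and residual parts.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_model = ss_total - ss_resid

    r_squared = ss_model / ss_total
    resid_var = ss_resid / (n - 2)                 # n - 2 residual degrees of freedom
    t_slope = slope / math.sqrt(resid_var / sxx)   # slope divided by its standard error
    f_stat = ss_model / resid_var                  # one model degree of freedom
    return {"slope": slope, "intercept": intercept,
            "r_squared": r_squared, "t_slope": t_slope, "f_stat": f_stat}


result = simple_regression(x, y)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With a single input the F statistic equals the square of the slope's t statistic, so the two tests say the same thing; they diverge only when there are several inputs.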