Saturday 3 September 2016

Some thoughts on the Statcheck project



Yesterday, a piece in Retraction Watch covered a new study, in which results of automated statistics checks on 50,000 psychology papers are to be made public on the PubPeer website.
I had advance warning, because a study of mine had been included in what was presumably a dry run, and this led to me receiving an email on 26th August as follows:
Assuming someone had a critical comment on this paper, I duly clicked on the link, and had a moment of double-take when I read the comment.
Now, this seemed like overkill to me, and I posted a rather grumpy tweet about it. There was a bit of to and fro on Twitter with Chris Hartgerink, one of the researchers on the Statcheck project, and with the folks at PubPeer, where I explained why I was grumpy and they defended their approach; as far as I was concerned it was not a big deal, and if nobody else found this odd, I was prepared to let it go.
But then a couple of journalists got interested, and I sent them some more detailed thoughts.
I was quoted in the Retraction Watch piece, but I thought it worth reporting my response in full here, because the quotes could be interpreted as indicating I disapprove of the Statcheck project and am defensive about errors in my work. Neither of those is true. I think the project is an interesting piece of work; my concern is solely with the way in which feedback to authors is being implemented. So here is the email I sent to journalists in full:
I am in general a strong supporter of the reproducibility movement and I agree it could be useful to document the extent to which the existing psychology literature contains statistical errors.
However, I think there are 2 problems with how this is being done in the PubPeer study.
1. The tone of the PubPeer comments will, I suspect, alienate many people. As I argued on Twitter, I found it irritating to get an email saying a paper of mine had been discussed on PubPeer, only to find that this referred to a comment stating that zero errors had been found in the statistics of that paper.
I don't think we need to be told that - by all means report somewhere a list of the papers that were checked and found to be error-free, but you don't need to personally contact all the authors and clog up PubPeer with comments of this kind.
My main concern was that during an exceptionally busy period, this was just another distraction from other things. Chris Hartgerink replied that I was free to ignore the email, but that would be extremely rash because a comment on PubPeer usually means that someone has a criticism of your paper.
As someone who works on language, I also found the pragmatics of the communication non-optimal. If you write and tell someone that you've found zero errors in their paper, the implication is that this is surprising, because you don't go around stating the obvious*. And indeed, the final part of the comment basically said that your work may well have errors in it and even though they hadn't found them, we couldn't trust it.
Now at the same time as having that reaction, I appreciate this was a computer-generated message, written by non-native English speakers, that I should not take it personally, and no slur on my work was intended. And I would like to know if errors were found in my stats, and it is entirely possible that there are some, since none of us is perfect. So I don't want to over-react, but I think that if I, as someone basically sympathetic to this agenda, was irritated by the style of the communication, then the odds are this will stoke real hostility for those who are already dubious about what has been termed 'bullying' and so on by people interested in reproducibility.
2. I'll be interested to see how this pans out for people where errors are found.
My personal view is that the focus should be on errors that do change the conclusions of the paper.
I think at least a sample of these should be hand-checked so we have some idea of the error rate - I'm not sure if this has been done, but the PubPeer comment certainly gave no indication of that - it just basically said there's probably an error in your stats but we can't guarantee that there is, putting the onus on the author to then check it out.
If it's known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90% I'd be really unhappy about the current process as it would be leading to lots of people putting time into checking their papers on the basis of an insufficiently sensitive diagnostic. It would make the authors of the comments look frankly lazy in stirring up doubts about someone's work and then leaving them to check it out.
In epidemiology the terms sensitivity and specificity are used to refer to the accuracy of a diagnostic test. Minimally if the sensitivity and specificity of the automated stats check is known, then those figures should be provided with the automated message.
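To make the sensitivity and specificity point concrete, here is a minimal sketch in Python. The function name and the counts are hypothetical, made up purely for illustration; they are not figures from the Statcheck project. It assumes a hand-checked sample in which each automated flag has been classified as right or wrong.

```python
def diagnostic_accuracy(true_pos, false_pos, true_neg, false_neg):
    """Return (sensitivity, specificity, positive predictive value)."""
    # Sensitivity: proportion of genuine errors that the check flags
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: proportion of correct results that the check leaves alone
    specificity = true_neg / (true_neg + false_pos)
    # Positive predictive value: proportion of flags that are genuine errors
    ppv = true_pos / (true_pos + false_pos)
    return sensitivity, specificity, ppv


# Hypothetical counts from an imagined hand-checked sample (illustration only)
sens, spec, ppv = diagnostic_accuracy(true_pos=90, false_pos=50,
                                      true_neg=850, false_neg=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ppv={ppv:.2f}")
```

With counts like these, a check can look accurate on both measures and still leave a sizeable share of flagged authors chasing an error that is not there, which is why the figures matter to the person receiving the message.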

The above was written before Dalmeet drew my attention to the second paper, in which errors had been found. Here’s how I responded to that:

I hadn't seen the 2nd paper - presumably because I was not the corresponding author on that one. It's immediately apparent that the problem is that F ratios have been reported with one degree of freedom, when there should be two. In fact, it's not clear how the automated program could assign any p-value in this situation.

I'll communicate with the first author, Thalia Eley, about this, as it does need fixing for the scientific record, but, given the sample size (on which the second, missing, degree of freedom is based), the reported p-values would appear to be accurate.
I have added a comment to this effect on the PubPeer site.


* I was thinking here of Gricean maxims, especially the maxim of relation.

Thursday 1 September 2016

Why I still use Excel



The Microsoft application, Excel, was in the news for all the wrong reasons last week.  A paper in Genome Biology documented how numerous scientific papers had errors in their data because they had used default settings in Excel, which had unhelpfully converted gene names to dates or floating point numbers. It was hard to spot as it didn't do it to all gene names, but, for instance, the gene Septin 2, with acronym SEPT2 would be turned into 2006/09/02.  This is not new: this paper in 2004 documented the problem, but it seems many people weren't aware of it, and it is now estimated that the literature on genetics is riddled with errors as a consequence. 
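One way to guard against this is to screen a list of gene symbols before it goes anywhere near a spreadsheet. The following is a rough sketch in Python, not a description of how Excel itself decides what to convert: the month prefixes and the two patterns are my own assumptions and are certainly not exhaustive, and `risky_symbols` is a hypothetical helper name.

```python
import re

# Assumed month-like prefixes that a spreadsheet may read as dates (not exhaustive)
MONTH_PREFIXES = ("JAN", "FEB", "MAR", "MARCH", "APR", "MAY", "JUN", "JUL",
                  "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")

# Month-like prefix followed by one or two digits, e.g. SEPT2
DATE_LIKE = re.compile(r"^(?:%s)\d{1,2}$" % "|".join(MONTH_PREFIXES),
                       re.IGNORECASE)

# Digits, then E, then digits: shaped like scientific notation, e.g. 2310009E13
FLOAT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def risky_symbols(symbols):
    """Return the symbols that look like dates or floating point numbers."""
    return [s for s in symbols if DATE_LIKE.match(s) or FLOAT_LIKE.match(s)]


flagged = risky_symbols(["SEPT2", "TP53", "MARCH1", "BRCA1", "2310009E13"])
print(flagged)
```

Anything this flags is a candidate for being entered explicitly as text.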
This isn't the only way Excel can mess up your data. If you want to enter a date, you need to be very careful to ensure you have the correct setting. If you are in the UK and you enter a date like 23/4/16, then it will be correctly entered as 23rd April, regardless of the setting. But if you enter 12/4/16, it will be treated as 4th December if you are on US settings and as 12th April if you are on UK settings.
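The ambiguity is easy to demonstrate outside Excel. This small Python sketch reads the same string under a day-first and a month-first convention; the labels "UK" and "US" and the function name `interpretations` are just illustrative choices, and the code shows how Python's own date parser behaves, not how Excel does.

```python
from datetime import datetime

# Day-first and month-first readings of a d/m/yy style string
FORMATS = (("UK", "%d/%m/%y"), ("US", "%m/%d/%y"))


def interpretations(raw):
    """Return the ISO date each convention gives, or None if it cannot parse."""
    results = {}
    for label, fmt in FORMATS:
        try:
            results[label] = datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            # e.g. there is no month 23 under the month-first reading
            results[label] = None
    return results


print(interpretations("12/4/16"))
print(interpretations("23/4/16"))
```

Storing dates in an unambiguous year-month-day form sidesteps the problem altogether.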
Then there is the dreaded autocomplete function. This can really screw things up by assuming that if you start typing text into a cell, you want it the same as a previous entry in that column that begins with the same sequence of letters. Can be a boon and a time-saver in some circumstances, but a way to introduce major errors in others.
I've also experienced odd bugs in Excel's autofill function, which makes it easy to copy a formula across columns or rows. It's possible for a file to become corrupted so that the cells referenced in the formula are wrong. Such errors are also often introduced by users, but I've experienced corrupted files containing formulae, which is pretty scary.
The response to this by many people is to say serious scientists shouldn't use Excel.  It's just too risky having software that can actively introduce errors into your data entry or computations. But people, including me, persist in using it, and we have to consider why.
So what are the advantages of keeping going with Excel?
Well, first, it usually comes for free with Microsoft computers, so it is widely available free of charge*. This means most people will have some familiarity with it – though few bother to learn how to use it properly.
Second, you can scan a whole dataset easily: it's very direct scrolling through rows or columns. You can use Freeze Panes to keep column and row headers static, and you can hide columns or rows that you don't want getting in the way.
Third, you can format a worksheet to facilitate data entry. A lot of people dismiss colour coding of columns as prettification, but it can help ensure you keep the right data in the right place. Data validation is easily added and can ensure that only valid values are entered.
Fourth, you can add textual comments – either as a row in their own right, or using the Comment function.
Fifth, you can very easily plot data. Better still, you can do so dynamically, as it is easy to create a plot and then change the data range it refers to.
Sixth, you can use lookup functions. In my line of work we need to convert raw scores to standard scores based on normative data. This is typically done using tables of numbers in a manual, which makes it very easy to introduce human error. I have found it is worth investing time to get the large table of numbers entered as a separate worksheet, so we can then automate the lookup functions.
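For anyone who wants the same kind of lookup outside Excel, here is a minimal Python sketch of the idea. The norm table is entirely made up for illustration (real values would come from the test manual), and `standard_score` is a hypothetical helper. It behaves roughly like an approximate-match lookup: find the highest band whose minimum raw score does not exceed the score obtained.

```python
from bisect import bisect_right

# Made-up norms for illustration: (minimum raw score, standard score),
# sorted by minimum raw score. Real values would come from the test manual.
NORMS = [(0, 70), (10, 85), (20, 100), (30, 115), (40, 130)]


def standard_score(raw, norms=NORMS):
    """Convert a raw score to a standard score using banded norms."""
    lower_bounds = [low for low, _ in norms]
    # Index of the last band whose lower bound is <= raw
    idx = bisect_right(lower_bounds, raw) - 1
    if idx < 0:
        raise ValueError(f"raw score {raw} is below the lowest band")
    return norms[idx][1]


print(standard_score(25))
```

Raising an error for out-of-range scores, rather than returning a default, means a mistyped raw score gets noticed instead of being silently converted.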
Many of my datasets are slowly generated over a period of years: we gather large amounts of data on individuals, record responses on paper, and then enter the data as it comes in. The people doing the data entry are mostly research assistants who are relatively inexperienced. So having a very transparent method of data entry, which can include clear instructions on the worksheet, and data validation, is important. I'm not sure there are other options of software that would suit my needs.
But I'm concerned about errors and need strategies to avoid them. So here are the working rules I have developed so far.
1. Before you begin, turn off any fancy Excel defaults you don't need. And if entering gene names, ensure they are entered as text.
2. Double data entry is crucial: have the data re-entered from scratch when the whole dataset is in, and cross-check the data files. This costs money but is important for data quality. There are always errors.
3. Once you have the key data entered and checked, export it to a simple, robust format such as tab-separated text. It can then be read and re-used by people working with other packages.
4. The main analysis should be done using software that works from a script, so that the whole analysis can be reproduced. Excel is therefore not suitable. I increasingly use R, though SPSS is another option, provided you keep a syntax file.
5. I still like to cross-check analyses using Excel – even if it is just to do a quick plot to ensure that the pattern of results is consistent with an analysis done in R.  
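Rules 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each entry pass has already been read into a list of rows keyed by a participant `id`, and the field names, values and helper names (`find_mismatches`, `to_tsv`) are all invented for the example.

```python
import csv
import io


def find_mismatches(first, second):
    """Compare two data-entry passes; return (id, field, first, second) tuples."""
    second_by_id = {row["id"]: row for row in second}
    mismatches = []
    for row in first:
        other = second_by_id.get(row["id"])
        if other is None:
            # Participant entered in the first pass but not the second
            mismatches.append((row["id"], None, "present", "missing"))
            continue
        for field, value in row.items():
            if other.get(field) != value:
                mismatches.append((row["id"], field, value, other.get(field)))
    # Participants entered in the second pass but not the first
    first_ids = {row["id"] for row in first}
    for row in second:
        if row["id"] not in first_ids:
            mismatches.append((row["id"], None, "missing", "present"))
    return mismatches


def to_tsv(rows):
    """Serialise checked rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example data: the two passes disagree on one score
entry_a = [{"id": "P01", "age": "7", "score": "23"},
           {"id": "P02", "age": "8", "score": "31"}]
entry_b = [{"id": "P01", "age": "7", "score": "28"},
           {"id": "P02", "age": "8", "score": "31"}]

print(find_mismatches(entry_a, entry_b))
print(to_tsv(entry_a))
```

Each mismatch then needs to be resolved against the paper record before the data are exported.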
Now, I am not an expert data scientist – far from it. I'm just someone who has been analysing data for many years and learned a few things along the way. Like most people, I tend to stick with what I know, as there are costs in mastering new skills, but I will change if I can see benefits. I've become convinced that R is the way to go for data analysis, but I do think Excel still has its uses, as a complement to other methods for storing, checking and analysing data. But, given the recent crisis in genetics, I'd be interested to hear what others think about optimal, affordable approaches to data entry and data analysis – with or without Excel.

*P.S.  I have been corrected on Twitter by people who have told me it is NOT free; the price for Microsoft products may be bundled in with the cost of the machine, but someone somewhere is paying for it!

Update: 2nd September 2016
There was a surprising amount of support for this post on Twitter, mixed in with anticipated criticism from those who just told me Excel is rubbish. What's interesting is that very few of the latter group could suggest a usable alternative for data entry (and some had clearly not read my post and thought I was advocating using Excel for data analysis). And no, I don't regard Access as a usable alternative: been there, tried that, and it just induced a lot of swearing.
There was, however, one suggestion that looks very promising and which I will chase up:
@stephenelane suggested I look at REDCap.
Website here: https://projectredcap.org/

Meanwhile, here's a very useful link, which came in via @tjmahr on Twitter, on setting up Excel worksheets to avoid later problems:
http://kbroman.org/dataorg/

Update 4th October 2016
Just to say we have trialled REDCap, and I love it. Very friendly interface. Extremely limited for any data manipulation/computation, but that doesn't matter, as you can readily import/export information into other applications for processing. It's free, but your institution needs to be signed up for it: Oxford is not yet fully functional with it, but we were able to use it via a colleague's server for a pilot.