Comments on: 50 shades of gray: A research story
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story/
Comments on MetaFilter post "50 shades of gray: A research story"
Mon, 29 Jul 2013 11:08:52 -0800

50 shades of gray: A research story
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story
<a href="http://pps.sagepub.com/content/7/6/615.full">Psychologists recount a valuable lesson about the fragility of statistical validity and the state of publishing.</a> "Two of the present authors, Matt Motyl and Brian A. Nosek, share interests in political ideology. We were inspired by the fast growing literature on embodiment that demonstrates surprising links between body and mind to investigate embodiment of political extremism. Participants from the political left, right, and center (N = 1,979) completed a perceptual judgment task in which words were presented in different shades of gray. Participants had to click along a gradient representing grays from near black to near white to select a shade that matched the shade of the word. We calculated accuracy: How close to the actual shade did participants get? The results were stunning. Moderates perceived the shades of gray more accurately than extremists on the left and right (p = .01). Our conclusion: Political extremists perceive the world in black and white figuratively and literally. Our design and follow-up analyses ruled out obvious alternative explanations such as time spent on task and a tendency to select extreme responses. Enthused about the result, we identified Psychological Science as our fallback journal after we toured the Science, Nature, and PNAS rejection mills. The ultimate publication, Motyl and Nosek (2012), served as one of Motyl's signature publications as he finished graduate school and entered the job market.
The story is all true, except for the last sentence; we did not publish the finding." <br /><br />"Before writing and submitting, we paused. Two recent articles have highlighted the possibility that research practices spuriously inflate the presence of positive results in the published literature (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011). Surely ours was not a case to worry about. We had hypothesized it; the effect was reliable. But we had been discussing reproducibility, and we had declared to our lab mates the importance of replication for increasing certainty of research results. We also had an unusual laboratory situation. For studies that could be run through a Web browser, data collection was very easy (Nosek et al., 2007). We could not justify skipping replication on the grounds of feasibility or resource constraints. Finally, the procedure had been created by someone else for another purpose, and we had not laid out our analysis strategy in advance. We could have made analysis decisions that increased the likelihood of obtaining results aligned with our hypothesis. These reasons made it difficult to avoid doing a replication. We conducted a direct replication while we prepared the manuscript. We ran 1,300 participants, giving us .995 power to detect an effect of the original effect size at α = .05. The effect vanished (p = .59)."
via <a href="http://andrewgelman.com/2013/07/28/50-shades-of-gray-a-research-story/">Andrew Gelman</a>. See also Frederick Guy's <a href="http://frederickguy.com/2013/07/29/spurious-significance-junk-science/">commentary</a>.
posted by MisantropicPainforest at Mon, 29 Jul 2013 10:56:03 -0800 (tags: statistics, psychology, academia, publishing)
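The replication power analysis quoted above (".995 power to detect an effect of the original effect size at α = .05" with 1,300 participants) is straightforward to reproduce in outline. Here is a minimal sketch using statsmodels, assuming a simple two-group t-test design; the paper's actual model and effect-size estimate aren't given in the excerpt, so the Cohen's d below is a placeholder, not their number.

```python
# Sketch of a replication power analysis, assuming a two-group t-test.
# The effect size d is illustrative, NOT the paper's actual estimate.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.20  # assumed Cohen's d (placeholder)

# Power with 1,300 participants split into two groups of 650, alpha = .05:
power = analysis.power(effect_size=d, nobs1=650, alpha=0.05, ratio=1.0)
print(f"power at d={d}, n=650 per group: {power:.3f}")

# Per-group sample size needed to reach the quoted .995 power for this d:
n_per_group = analysis.solve_power(effect_size=d, power=0.995, alpha=0.05)
print(f"n per group for .995 power: {n_per_group:.0f}")
```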
By: yoink
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5112971
Very interesting. I wonder if it would be possible in these kinds of scientific realms to get an agreement to publish <em>regardless of the results of the experiment</em>? You would circulate a paper in a form that described only the structure of the experiment, the way the results would be analyzed, the questions the experiment would address, the reasons for addressing such a question, etc., but with no hint at all as to what the results actually were. In a way the answer to that is "obviously not," and I can understand all the practical reasons why this wouldn't work, but it's interesting, at least, to think about what the world of publishing in experimental science might look like if that were the practice. Because in theory, at least, any experiment actually worth conducting should be written up whether the results come out positive or null. If there were more emphasis on "is this a question worth asking and have you asked it well" and less on "did you find something superficially startling in your results," there'd be less of a bias toward finding and publishing these kinds of statistically marginal effects.
posted at Mon, 29 Jul 2013 11:08:52 -0800
By: Potomac Avenue
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5112977
Yoink, the article addressed that possibility, in a way:
<small>
<em>Crowd sourcing replication efforts
Individual scientists and laboratories may be interested in conducting replications but not have sufficient resources available for them. It may be easier to conduct replications by crowd sourcing them with multiple contributors. For example, in 2011, the Open Science Collaboration began investigating the reproducibility of psychological science by identifying a target sample of studies from published articles from 2008 in three prominent journals: the Journal of Personality and Social Psychology, the Journal of Experimental Psychology: Learning, Memory, and Cognition, and Psychological Science (Carpenter, 2012; Yong, 2012). Individuals and teams selected a study from the eligible sample and followed a standardized protocol. In the aggregate, the results were intended to facilitate understanding of the reproducibility rate and factors that predict reproducibility. Further, as an open project, many collaborators could join and make small contributions that accumulate into a large-scale investigation. The same concept can be incorporated into replications of singular findings. Some important findings are difficult to replicate because of resource constraints. Feasibility could be enhanced by spreading the data collection effort across multiple laboratories.</em></small>
posted at Mon, 29 Jul 2013 11:11:39 -0800
By: thelonius
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5112984
I feel a lot better about having majored in philosophy now.
posted at Mon, 29 Jul 2013 11:14:15 -0800
By: eviemath
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5112985
I was thinking something similar: if high-impact-factor journals began including sections for negative results and for replications of earlier studies (or, even better, just including them, not in a segregated way), it could help change the incentives that keep research focused on novel results. Likewise, there needs to be a shift in the criteria for PhD granting, hiring, tenure, and promotion decisions to give some value to time spent working to replicate the results of others. For example, at my university, we need to have something to show for ourselves both in teaching and in research (and service, to a lesser degree); we can't get tenure with only one strength. So within the research category, experimentalists should need to show that they have been productive both in original research and in the testing and replication work that is necessary for science to function well.
posted at Mon, 29 Jul 2013 11:18:15 -0800
By: yoink
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113090
Potomac Avenue: that suggestion doesn't quite address the problem, though, which is that there is currently an active incentive for researchers to play the "statistical outlier on high buzz-factor research" lottery in order to get publication in a prestige journal. Sure, there are all kinds of ways to make replication experiments more affordable and more common, but if they're essentially anonymous "fact checking" operations they won't really feature meaningfully in promotion and tenure decisions. In fact, in a weird way they might almost reduce whatever scruples a research team might have about rushing into print with a small-sample, high-wow-factor result ("Yeah, sure, this might be BS, but the replication crowd will eventually check it out, and by then I'll have already been hired/promoted/put it on my resume"). There needs to be some way to rebalance incentives at the front end and not just try to catch mistakes at the back end.
posted at Mon, 29 Jul 2013 11:59:00 -0800
By: Pyrogenesis
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113150
This is pretty cool. The experiment's relation to the actual work being done under the rubric of "embodiment" is tenuous at best, but that's beside the point anyway. I do love papers that take as the object of their experiment the experimental situation itself. It was, in fact, one of the rather overlooked facets of the "science wars" of the 90s: one of the major claims from the "relativist" science studies people was that science in the making looks rather different from settled science. Ever since then I've been fascinated by studying science itself by scientific means.
posted at Mon, 29 Jul 2013 12:24:59 -0800
By: Omnomnom
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113242
So...why did the effect vanish?
posted at Mon, 29 Jul 2013 13:06:50 -0800
By: yoink
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113244
<em>So...why did the effect vanish?</em>
Because it was a statistical fluke.
posted at Mon, 29 Jul 2013 13:08:12 -0800
By: srboisvert
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113310
The problem with the 'replication crowd' is that this can also be done in bad faith. It is very easy to fail to replicate research. You can do it just by being lazy and sloppy. A failed replication can also be a statistical fluke.
Further, research materials are oftentimes very time-consuming and labor-intensive to produce, and then some replicating lab demands them before your research program has finished using them. Do you share them out right away?
That said, there is a pretty active whisper network in social psychology about findings that only work in certain labs or research "families."
posted at Mon, 29 Jul 2013 13:37:38 -0800
By: no regrets, coyote
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113373
<i>The problem with the 'replication crowd' is that this can also be done in bad faith. It is very easy to fail to replicate research. You can do it just by being lazy and sloppy. A failed replication can also be a statistical fluke.</i>
But that's kind of how science works. Any study can be lazy or sloppy or an outlier. Taking a failed replication of an experiment as the full story is as silly as taking the original study as the full story. Individual studies only have meaning in the context of the greater body of research on a topic.
<i>I wonder if it would be possible in these kinds of scientific realms to get an agreement to publish regardless of the results of the experiment?</i>
I totally agree with this. In my previous field (experimental particle physics) we were very good about this: negative results from a well-designed experiment were considered just as interesting as positive results. Now I work in a medical field, where there is much more pressure to make a novel claim, and I see papers that are pretty clearly the results of statistical fishing expeditions, Bonferroni be damned.
posted at Mon, 29 Jul 2013 14:06:57 -0800
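For context on the Bonferroni reference: when a fishing expedition runs many tests, the chance of at least one false positive grows quickly, and the Bonferroni correction simply divides α by the number of tests. A quick back-of-the-envelope illustration:

```python
# Family-wise error rate (FWER) for m independent tests at per-test alpha,
# and the Bonferroni-corrected threshold that holds it near alpha overall.
alpha, m = 0.05, 20
fwer = 1 - (1 - alpha) ** m   # P(at least one false positive) ~= 0.64
bonferroni_alpha = alpha / m  # corrected per-test threshold = 0.0025
print(f"FWER over {m} uncorrected tests: {fwer:.2f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
```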
By: Omnomnom
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113469
I'm reminded of a couple of posts here about scientists and university professors faking their results. And I'm really glad these guys didn't. The temptation must be great.
posted at Mon, 29 Jul 2013 14:52:25 -0800
By: Halogenhat
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113782
A researcher's wildest dream come true: results that are actually shocking.
posted at Mon, 29 Jul 2013 18:38:27 -0800
By: chortly
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113847
It seems crazy to me that a p-value threshold -- 0.05 -- which was selected in the early days of statistical science when almost no one was doing it, is still the standard threshold for "significant" results. Sure, replication is important, but simply reducing the significance threshold to reflect the increased number of researchers -- say, by a factor of 10 or 100 -- would go a long way to insulate the social and medical sciences from false positives. The 5-sigma of physics might be a bit much, but even 3 would presumably rule out the vast majority of junk.
posted at Mon, 29 Jul 2013 19:40:49 -0800
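For scale, the sigma thresholds chortly mentions translate to p-values as follows under a normal approximation; note that the physics 5-sigma convention is usually quoted one-sided:

```python
# Convert k-sigma thresholds to p-values under a normal approximation.
from scipy.stats import norm

for k in (2, 3, 5):
    print(f"{k}-sigma: two-sided p = {2 * norm.sf(k):.1e}, "
          f"one-sided p = {norm.sf(k):.1e}")
# 2-sigma is roughly the conventional .05; 3-sigma is ~2.7e-3 two-sided;
# 5-sigma is ~5.7e-7 two-sided (~2.9e-7 one-sided, the physics convention).
```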
By: Canageek
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5113970
The one problem I see is that if participants are using a web browser, they will be influenced by the brightness, contrast, and calibration of their monitor, and by their surroundings. It sounds like they did the replication online, where they would have no control over these conditions, which could lead to a false negative. I mean, when was the last time YOU calibrated your monitor's colour? I haven't; I don't have the (expensive) equipment for it. My brother has, as he is a professional photographer, but only once, when he could borrow the gear from a friend of his.
posted at Mon, 29 Jul 2013 22:07:43 -0800
By: Blasdelb
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5114011
"<em>I wonder if it would be possible in these kinds of scientific realms to get an agreement to publish regardless of the results of the experiment?</em>"
This could only really work for a small fraction of scientific research. Contrary to popular perception, the lion's share of what we do is not coming up with solid answers to obvious questions so much as coming up with clever questions that should produce useful answers if framed properly. Indeed, for the vast majority of us who have the potential of getting negative results, those negative results simply mean we've asked a stupid question with an answer that isn't at all interesting, which is what happened here. Negative results like "Does Vitamin NewFadThing cure X? Nope, not even close" should get published, because that is still a useful answer even if it isn't an interesting one, but results like this have no business crowding out useful and interesting research in the literature.
Also, so long as grant/hiring/tenure committees are beholden to people who are only capable of judging research by numbers crunched from a CV, rewarding stupid questions that produce useless, uninteresting results the same way as smart questions will remain a bad idea that could only be toxic to the scientific community. That said, these guys are awesome, and while the strength of character they clearly have should be a baseline expectation, their enthusiasm about it is inspiring.
posted at Tue, 30 Jul 2013 00:02:05 -0800
By: Cannon Fodder
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5114037
<em>Sure, replication is important, but simply reducing the significance threshold to reflect the increased number of researchers -- say, by a factor of 10 or 100 -- would go a long way to insulate the social and medical sciences from false positives. The 5-sigma of physics might be a bit much, but even 3 would presumably rule out the vast majority of junk.
</em>
The problem with this is that it reduces the power of the experiment (the ability to detect a difference if one is actually present). Most experiments are already rather underpowered. This matters a lot because, in drug trials for instance, we're not just testing for efficacy but for toxicity as well. The best way to boost power is, of course, to increase the number of participants.
I do agree that p = 0.05 is a problem, because it means that in lots of situations where there is no real difference, a result will nonetheless be recorded as significant.
Of course, p-values are a bit of a red herring. In practice the null hypothesis is almost never exactly true: your drug will almost always be at least marginally different from placebo. What we really care about is the effect size.
posted at Tue, 30 Jul 2013 01:33:28 -0800
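The trade-off described here is easy to make concrete: at a fixed sample size, tightening α costs power, and holding power fixed instead drives up the required number of participants. A small sketch with an illustrative effect size (the d here is a placeholder, not from any particular trial):

```python
# Tightening alpha at fixed n reduces power; holding power at 80% instead
# inflates the required per-group sample size. Effect size is illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
d = 0.30  # assumed Cohen's d (placeholder)

for alpha in (0.05, 0.005, 0.0005):
    power = analysis.power(effect_size=d, nobs1=100, alpha=alpha)
    n_req = analysis.solve_power(effect_size=d, power=0.80, alpha=alpha)
    print(f"alpha={alpha:<7} power at n=100/group: {power:.2f}  "
          f"n/group for 80% power: {n_req:.0f}")
```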
By: metaBugs
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5114156
Much as I like the idea of journals agreeing to publish before seeing the results, I really can't see for-profit journals going for it. They need to stay full of exciting data on sexy topics, otherwise no one will want to pay the subscriptions. I could imagine the various titles having a sort of "overflow" journal for pre-accepted papers whose results turned out too dull for the flagship, but this would be costly to run (all the same admin, editing, proofing, and hosting costs) and rarely cited.
Part of the problem is addressed by things like the <a href="http://www.jnrbm.com/">Journal of Negative Results</a>, but I'd worry about trying to impress a hiring committee with that on my CV. I like the approach taken by a few of the free/open journals, which claim to screen only on the basis of scientific merit. PLOS One, for example, <a href="http://www.plosone.org/static/publication#reporting">explicitly states</a> that it welcomes papers presenting negative data. It also has a <a href="http://www.plos.org/plos-one-launches-reproducibility-initiative/">reproducibility initiative</a>, encouraging labs to submit their protocols etc. for validation by (blinded?) collaborators, although I don't know how well this is going.
But it's true that this doesn't solve the problem that we tend to reward researchers for being lucky, at least to a degree. In Iain M. Banks's <i>Matter</i> (very minor spoiler warning), one of the characters tells the story of the "hundredth idiot": <blockquote>One hundred idiots make idiotic plans and carry them out. All but one justly fail. The hundredth idiot, whose plan succeeded through pure luck, is immediately convinced he's a genius</blockquote>
I'm not asserting that scientists are necessarily idiots, but I think we can all acknowledge that there's an element of this in research. We're always striking out into the unknown, and choosing research paths based on incomplete information, or educated guesses, or the equipment and reagents available to us. Most of those explorations yield uninspiring results, but every now and again, one of those roughly chosen paths yields gold. And it's the person on the lucky path whose career gets made.
It's not all luck, by any means: fortune favours the prepared mind, and all that. But it's a significant factor, and I agree that working out how to recognise and reward intelligent, industrious work that hasn't (yet) been lucky should be a big priority. I just don't know how to do it.
<b>Pyrogenesis</b> - <em>I do love papers that take as the object of their experiment the experimental situation itself. ... Ever since then I've been fascinated by studying science itself by scientific means.</em>
That sounds really interesting. Do you have any recommendations for a good place to start reading?
posted at Tue, 30 Jul 2013 06:06:06 -0800
By: hydrobatidae
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5114533
The other thing about adjusting the alpha value is that for a lot of sciences, increasing the power through increasing the sample size is next to impossible.
I'm in ecology, and our sample sizes are constrained by a lot of things that medicine and physics aren't constrained by: funding (strangely enough, no one is giving us a million dollars to run much of anything) and time (field seasons are only so long, and if you can only afford 4 people to help, you can only get so much data; then next year is another subset of data). Occasionally I'll argue that the alpha value in ecology should be relaxed to 0.1, because that's much more realistic for a system over which you have very little control.
An example from my own work: I looked at diet in an Arctic bird. You had to net the birds off their nests, which required rappelling down while anchored in very unstable soil. We were pretty happy because we got 12 birds one summer (that's all we could reach). We could have caught the partners too, for 24 birds, but then you're affected by pseudoreplication. This project was repeated for a couple of years and we actually got results at 0.08 (a "trend"). Our publishing was limited by this "non-significance," even though for the power level we were working with, we were getting some really good indications of what was going on. If we had (randomly) gotten better p-values, we could have got into some really fancy journals.
I now appreciate that my old supervisor inserts "randomly selected an alpha value = 0.05" into his papers.
posted at Tue, 30 Jul 2013 08:33:31 -0800
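hydrobatidae's point about small field samples is easy to check with a quick simulation: even a genuinely large effect is far from a sure detection at n = 12 per group, and relaxing α to .10 buys back a meaningful amount of power. A rough Monte Carlo sketch with illustrative numbers, not data from the bird study:

```python
# Monte Carlo: power of a two-sample t-test with n=12 per group when a
# large true effect (Cohen's d = 0.8) exists. Numbers are illustrative.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
d, n, sims = 0.8, 12, 20_000
pvals = np.array([
    ttest_ind(rng.normal(d, 1.0, n), rng.normal(0.0, 1.0, n)).pvalue
    for _ in range(sims)
])
print(f"power at alpha = .05: {(pvals < 0.05).mean():.2f}")  # roughly 0.45
print(f"power at alpha = .10: {(pvals < 0.10).mean():.2f}")  # roughly 0.60
```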
By: Mental Wimp
http://www.metafilter.com/130438/50-shades-of-gray-A-research-story#5115356
<em>via Andrew Gelman. </em>
Who, by the by, rocks. And, note to y'all, 2013 is the <a href="http://www.huffingtonpost.com/marie-davidian/2013-the-international-ye_b_2670704.html">International Year of Statistics</a>. Statistics, bitches!
posted at Tue, 30 Jul 2013 13:09:27 -0800
"Yes. Something that interested us yesterday when we saw it." "Where is she?" His lodgings were situated at the lower end of the town. The accommodation consisted[Pg 64] of a small bedroom, which he shared with a fellow clerk, and a place at table with the other inmates of the house. The street was very dirty, and Mrs. Flack's house alone presented some sign of decency and respectability. It was a two-storied red brick cottage. There was no front garden, and you entered directly into a living room through a door, upon which a brass plate was fixed that bore the following announcement:¡ª The woman by her side was slowly recovering herself. A minute later and she was her cold calm self again. As a rule, ornament should never be carried further than graceful proportions; the arrangement of framing should follow as nearly as possible the lines of strain. Extraneous decoration, such as detached filagree work of iron, or painting in colours, is [159] so repulsive to the taste of the true engineer and mechanic that it is unnecessary to speak against it. Dear Daddy, Schopenhauer for tomorrow. The professor doesn't seem to realize Down the middle of the Ganges a white bundle is being borne, and on it a crow pecking the body of a child wrapped in its winding-sheet. 53 The attention of the public was now again drawn to those unnatural feuds which disturbed the Royal Family. The exhibition of domestic discord and hatred in the House of Hanover had, from its first ascension of the throne, been most odious and revolting. The quarrels of the king and his son, like those of the first two Georges, had begun in Hanover, and had been imported along with them only to assume greater malignancy in foreign and richer soil. The Prince of Wales, whilst still in Germany, had formed a strong attachment to the Princess Royal of Prussia. George forbade the connection. The prince was instantly summoned to England, where he duly arrived in 1728. "But they've been arrested without due process of law. They've been arrested in violation of the Constitution and laws of the State of Indiana, which provide¡ª" "I know of Marvor and will take you to him. It is not far to where he stays." Reuben did not go to the Fair that autumn¡ªthere being no reason why he should and several why he shouldn't. He went instead to see Richard, who was down for a week's rest after a tiring case. Reuben thought a dignified aloofness the best attitude to maintain towards his son¡ªthere was no need for them to be on bad terms, but he did not want anyone to imagine that he approved of Richard or thought his success worth while. Richard, for his part, felt kindly disposed towards his father, and a little sorry for him in his isolation. He invited him to dinner once or twice, and, realising his picturesqueness, was not ashamed to show him to his friends. Stephen Holgrave ascended the marble steps, and proceeded on till he stood at the baron's feet. He then unclasped the belt of his waist, and having his head uncovered, knelt down, and holding up both his hands. De Boteler took them within his own, and the yeoman said in a loud, distinct voice¡ª HoME²¨¶àÒ°´²Ï·ÊÓÆµ ѸÀ×ÏÂÔØ ѸÀ×ÏÂÔØ
ENTER NUMBET 0016www.ltjrhy.org.cn jssjnj.com.cn jsjoxx.com.cn www.hyboao.com.cn gbngqw.com.cn www.mscpw.com.cn www.njfi.com.cn www.qzpock.com.cn wztnre.com.cn xfde.com.cn