A Post-Publication Peer-Review (3PR) of *Time, Money, and Morality*

#### Gino, F., & Mogilner, C. (online, 2013). Time, Money, and Morality. *Psychological Science*. DOI: 10.1177/0956797613506438

File under:

**HIBAR**: **H**ad **I** **B**een **A** **R**eviewer…

**3PR**: **P**ost-**P**ublication **P**eer-**R**eview (or: **3**mpirical **P**lausibility **R**esuscitation)

Performed by *Fred Hasselman*

Contact me if you have any questions

#### Introduction

The *Time, Money, and Morality* article has been HIBAR-ed on Twitter and the Blogosphere (e.g., by Rolf Zwaan and Greg Francis) and the discussion seems to revolve around the validity of the inferences based on p-values close to 0.05 (e.g., they raise suspicions of p-hacking).

In short, the article reports 4 experiments testing 2 core postulates:

- Postulate 1: Priming `Money` activates self-interest and increases unethical behaviour
- Postulate 2: Priming `Time` activates self-reflection and decreases unethical behaviour

Unethical behaviour is operationalised as taking the opportunity to cheat on a task.

Priming methods vary across experiments, so do the tasks that allow for an opportunity to cheat.

Experiment 1 tests the two postulates. Experiments 2-4 assess the role of self-reflection in cheating behaviour, which is operationalised differently in each experiment.

#### Hold on to your P-curves for a moment… Back to the basics!

In this **P**ost-**P**ublication **P**eer-**R**eview (3PR) I demonstrate that there is indeed some cause for concern about the way these results are presented and interpreted. Was it p-hacking? … I don’t know and maybe I don’t even care. To me this is an example of **sloppy science**: p-hacked or not, these results were *allowed* to be published by expert peers. It is more relevant to discuss the broken system of quality control that should have picked up on at least some of the following issues:

- Important information is missing:
  - in general (e.g., number of subjects per condition, sample size determination)
  - selectively across experiments (e.g., participants per cell, reporting of effect sizes)
- The analyses used on frequency data are inappropriate
- Invalid or biased inferences and oddities:
  - No adjustments for multiple comparisons
  - “Marginal significance” shifts ad hoc between `0.1 > p > 0.05`
  - Obvious intervening/mediator variable is omitted: Accuracy of performance
  - No explanation of (conflicting) results across experiments (e.g., variation in amount of cheating)
  - No explanation for failing of random assignment to design levels (**none** of the experiments have equal N samples)

The article under scrutiny is by no means exceptional with respect to such issues. Moreover, the way frequency / proportion data are analysed in psychological science is generally awkward and most of the time completely wrong.

I will 3PR the data based on the information in the article and comment on the results:

I. Analysis of Proportion / frequency data

II. Analysis of Extent of Cheating data

III. HAPPE-ing: **H**ypothesising **A**fter **P**ost **P**ublication **E**valuation

The R code used to generate the results (and this page) is available in this Markdown file, and this post explains how to post to a WordPress blog.

## I. Analysis of proportion / frequency data

Some concerns can be raised about the significant differences between various conditions in proportion `Cheating` reported in the 4 experiments.

First and foremost, no corrections for multiple comparisons are conducted. Should one do so, just 2 significant proportion differences remain: `Money` vs. `Time` in Experiments 1 & 4. In Experiment 3, the sample difference `No Mirror: Money - Time` was marginally significant in the 2nd significant digit (original: `p = 0.015`, adjusted α = 0.013, Bonferroni).

Second, no continuity correction is applied, even though these proportions are calculated from discrete numbers (participants). If a continuity correction is applied, 2-3 significant differences remain, depending on the α-level chosen:

Exp. | Contrast | Published | Continuity corrected | Bonferroni adjusted |
---|---|---|---|---|
1 | Money-Time | <.001 | 4 × 10^{-4} | < 0.0167 |
1 | Money-Ctrl | <.05 | 0.0894 | > 0.0167 |
1 | Time-Ctrl | <.05 | 0.0836 | > 0.0167 |
2 | Int: Money-Time | <.01 | 0.1493 | ~ 0.0125 |
2 | Per: Money-Time | >.05 | 1 | > 0.0125 |
2 | Money: Int-Per | <.03 | 0.0856 | > 0.0125 |
2 | Time: Int-Per | >.05 | 1 | > 0.0125 |
3 | Mir: Money-Time | >.05 | 0.7996 | > 0.0125 |
3 | NoM: Money-Time | <.003 | 0.0293 | ~ 0.0125 |
3 | Money: Mir-NoM | >.05 | 0.0537 | > 0.0125 |
3 | Time: Mir-NoM | >.05 | 1 | > 0.0125 |
4 | Money-Time | <.001 | 10^{-4} | < 0.0167 |
4 | Money-Ctrl | <.05 | 0.0522 | > 0.0167 |
4 | Time-Ctrl | <.05 | 0.0752 | > 0.0167 |
Number sig. results | | 9 | 3 | Original: 4, Continuity: 2 |
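As an illustration of what the continuity correction does, here is a minimal Python sketch (not the R code behind the table) for the Experiment 3 `NoM: Money-Time` row. It assumes the cell counts reconstructed further down in this post (20 of 30 cheaters under `Money`, 11 of 31 under `Time`) and uses a plain Pearson chi-square on the 2×2 table with and without Yates' correction; the function name `chi2_p_2x2` is made up for this example.

```python
import math

def chi2_p_2x2(a, b, c, d, yates=False):
    """Pearson chi-square p-value (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction: shrink |ad - bc| by n/2, not below 0
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))

# Assumed counts for Experiment 3, No Mirror: (cheaters, non-cheaters)
money = (20, 10)
time_ = (11, 20)

p_plain = chi2_p_2x2(*money, *time_)
p_yates = chi2_p_2x2(*money, *time_, yates=True)
alpha_bonferroni = 0.05 / 4  # 4 contrasts in Experiments 2 and 3

print(f"uncorrected p          = {p_plain:.4f}")
print(f"continuity-corrected p = {p_yates:.4f}")
print(f"Bonferroni alpha       = {alpha_bonferroni:.4f}")
```

Under these assumed counts the sketch gives roughly 0.015 without and 0.029 with the correction, in line with the values quoted for this contrast, and neither falls below the adjusted α of 0.0125.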

This calls for a more appropriate analysis of frequency data:

- Log-linear analysis of observed cell frequencies
- Exact odds ratios of 2×2 sub-tables to test hypotheses using Effect Size CIs

(`Cheating` can be considered a dichotomous response, so logistic regression could also be used, see III. HAPPE-ing)

Note: Experiments 2 & 3 do not list *n* per condition; the most likely values for *n* are assumed (1. closest to an integer value; 2. as equal as possible; 3. add to total N):

Experiment 2

Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
---|---|---|
Money | Personality | 36 `*` 0.2778 `=` 10.0008 (8 × 10^{-4}) |
Time | Personality | 35 `*` 0.2857 `=` 9.9995 (5 × 10^{-4}) |
Money | Intelligence | 38 `*` 0.5 `=` 19 (0) |
Time | Intelligence | 33 `*` 0.303 `=` 9.999 (10 × 10^{-4}) |

Experiment 3

Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
---|---|---|
Money | Mirror | 31 `*` 0.387 `=` 11.997 (0.003) |
Time | Mirror | 28 `*` 0.321 `=` 8.988 (0.012) |
Money | No Mirror | 30 `*` 0.667 `=` 20.01 (0.01) |
Time | No Mirror | 31 `*` 0.355 `=` 11.005 (0.005) |
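The three criteria used to guess the cell sizes can be written as a small search. The Python sketch below is only an illustration under stated assumptions: the total N of 142 is simply the sum of the Experiment 2 cell sizes listed above, the search window of 25 to 45 and the tolerance of 0.02 are arbitrary choices of mine, and `plausible_ns` is a hypothetical helper name.

```python
from itertools import product

def plausible_ns(pct, n_min, n_max, tol=0.02):
    """Cell sizes n for which n * pct lies within tol of a whole number."""
    return [n for n in range(n_min, n_max + 1)
            if abs(n * pct - round(n * pct)) <= tol]

# Proportions cheating as listed above for Experiment 2
reported = {
    ("Money", "Personality"): 0.2778,
    ("Time", "Personality"): 0.2857,
    ("Money", "Intelligence"): 0.5,
    ("Time", "Intelligence"): 0.303,
}
TOTAL_N = 142          # assumption: sum of the cell sizes in the table above
N_MIN, N_MAX = 25, 45  # assumption: arbitrary search window around N / 4

# Criterion 1: n * proportion must be (nearly) an integer
options = [plausible_ns(p, N_MIN, N_MAX) for p in reported.values()]
# Criterion 3: the cell sizes must add up to the total N
solutions = [ns for ns in product(*options) if sum(ns) == TOTAL_N]
# Criterion 2: prefer the most equal split if several solutions remain
solutions.sort(key=lambda ns: max(ns) - min(ns))

for cell, n in zip(reported, solutions[0]):
    print(cell, "n =", n, "cheaters =", round(n * reported[cell]))
```

With these settings only one combination survives, which matches the cell sizes assumed above; a wider window or looser tolerance could admit other candidates, which is why the values remain assumptions.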

### 1. log-linear analysis of observed cell frequencies

Log-linear analysis, or Poisson regression using the generalised linear model, can be used to test whether relationships exist among the variables in a multi-way contingency table. Here I analyse the number of participants in each cell of the design: The observed frequencies take the role of the dependent variable and the levels of the design factors such as `Mediator`, `Prime` and `Cheating` are considered the levels of independent variables (another option would have been a logistic / probit regression with `Cheating` as the dependent binary / proportion variable).

Two types of result are given for each experiment:

*First*, a table listing deviance tests for the full (saturated) model. The analysis starts with the NULL model (all frequencies are equal) in the first row. Each subsequent row lists what happens to the deviance (of the model in the previous row) when a factor is added. A significant drop in deviance means adding the factor to the model contributes to predicting the difference between expected and observed frequencies. For hints of corroboration of the hypotheses reported in the paper, significant interactions between a design factor and `Cheating` are necessary.

*Second*, a mosaic plot is displayed. This is a graphical representation of the conditional cell frequencies. The mosaic plot also indicates which residual frequencies (observed – expected) are significantly below (red) or above (blue) the expected frequencies (residuals are interpretable as a Z-score). The coloured cells contribute most to a high and possibly significant χ² value.

Note:

The significance of the change in deviance can depend on the order in which factors are added to the model and is not the same as a significant beta weight in a regression model.

`> [1] "Experiment 1"`

```
> Analysis of Deviance Table
>
> Model: poisson, link: log
>
> Response: Count
>
> Terms added sequentially (first to last)
>
>
>                Df Deviance Resid. Df Resid. Dev Pr(>Chi)
> NULL                               5       24.8
> Cheating        1     9.33         4       15.4  0.00225 **
> Prime           2     0.02         2       15.4  0.98981
> Cheating:Prime  2    15.41         0        0.0  0.00045 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`> [1] "Experiment 2"`

```
> Analysis of Deviance Table
>
> Model: poisson, link: log
>
> Response: Count
>
> Terms added sequentially (first to last)
>
>
>                     Df Deviance Resid. Df Resid. Dev Pr(>Chi)
> NULL                                    7      19.64
> Cheating             1    13.86         6       5.78   0.0002 ***
> Prime                1     0.25         5       5.52   0.6146
> Test                 1     0.00         4       5.52   1.0000
> Cheating:Prime       1     1.51         3       4.02   0.2198
> Cheating:Test        1     2.53         2       1.48   0.1114
> Prime:Test           1     0.03         1       1.45   0.8609
> Cheating:Prime:Test  1     1.45         0       0.00   0.2284
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`> [1] "Experiment 3"`

```
> Analysis of Deviance Table
>
> Model: poisson, link: log
>
> Response: Count
>
> Terms added sequentially (first to last)
>
>
>                     Df Deviance Resid. Df Resid. Dev Pr(>Chi)
> NULL                                    7      11.50
> Cheating             1     2.14         6       9.36    0.144
> Prime                1     0.03         5       9.32    0.855
> Test                 1     0.03         4       9.29    0.855
> Cheating:Prime       1     4.24         3       5.05    0.040 *
> Cheating:Test        1     2.85         2       2.21    0.092 .
> Prime:Test           1     0.50         1       1.71    0.481
> Cheating:Prime:Test  1     1.71         0       0.00    0.191
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`> [1] "Experiment 4"`

```
> Analysis of Deviance Table
>
> Model: poisson, link: log
>
> Response: Count
>
> Terms added sequentially (first to last)
>
>
>                Df Deviance Resid. Df Resid. Dev Pr(>Chi)
> NULL                               5       21.3
> Cheating        1     4.22         4       17.1  0.03996 *
> Prime           2     0.29         2       16.8  0.86607
> Cheating:Prime  2    16.76         0        0.0  0.00023 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
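To make the deviance tables less of a black box: the first two rows of the Experiment 2 table can be checked by hand, because the NULL model and the `Cheating`-only model have simple closed-form fitted values (cell means). The Python sketch below is not the R code used for this post and relies on the cell counts assumed earlier for Experiment 2; the later rows would need an iteratively fitted model and are not reproduced here.

```python
import math

# Assumed Experiment 2 cell counts, reconstructed from the reported
# percentages; order: Money/Personality, Time/Personality,
# Money/Intelligence, Time/Intelligence
cheat_yes = [10, 10, 19, 10]
cheat_no = [26, 25, 19, 23]

def poisson_deviance(observed, fitted):
    """Residual deviance of a Poisson model: 2 * sum(O*ln(O/E) - (O - E))."""
    return 2 * sum(o * math.log(o / e) - (o - e)
                   for o, e in zip(observed, fitted))

observed = cheat_yes + cheat_no

# NULL model: every cell is fitted with the overall mean count
mean_all = sum(observed) / len(observed)
dev_null = poisson_deviance(observed, [mean_all] * len(observed))

# Cheating-only model: each cell is fitted with the mean of its Cheating level
mean_yes = sum(cheat_yes) / len(cheat_yes)
mean_no = sum(cheat_no) / len(cheat_no)
fitted = [mean_yes] * len(cheat_yes) + [mean_no] * len(cheat_no)
dev_cheating = poisson_deviance(observed, fitted)

drop = dev_null - dev_cheating
p_drop = math.erfc(math.sqrt(drop / 2))  # chi-square upper tail, 1 df

print(f"NULL deviance     = {dev_null:.2f}")
print(f"Cheating deviance = {dev_cheating:.2f}")
print(f"drop = {drop:.2f}, p = {p_drop:.4f}")
```

With the assumed counts this gives a NULL deviance of about 19.64 and a residual deviance of about 5.78 after adding `Cheating`, which agrees with the first two rows printed for Experiment 2.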

**Conclusion log-linear analysis:**

This alternative, and in my opinion more appropriate, analysis is in agreement with the results after correction for multiple comparisons and continuity:

- The mosaic plots show that there may be some unexpected factors driving the “effects” reported in the paper:
  - In Experiments 1 & 4 it is not so much the observed frequency of people that *did* cheat, but the number of participants that *did not* cheat that deviates from the expected frequencies based on table margins.
  - The `Money` prime caused *fewer* people to **NOT** cheat, whereas the `Time` prime caused *more* people to **NOT** cheat
- If there is a difference in amount of `Cheating` between samples, it is likely a “main effect” between the `Time` and `Money` prime (`Cheating:Prime` interaction); it is found to cause a significant drop in deviance in Experiments 1, 3 and 4.
- Experiment 2 stands out, because observed differences in `Cheating` are unlikely due to chance, but none of the other factors contribute to explain differences between expected and observed frequencies.

The point about the mosaic plots is not just semantics or methodologists’ nit-picking. What it tells us is that, e.g. in the mosaic plot Table.1.1, among the observed frequencies of `CheatYES`, the cell `Money` does not stand out much from `Time` and `Control` from what may be expected by chance; for `CheatNO` on the other hand, the cell `Money` does stand out as different.

### 2. Exact odds ratios of 2×2 subtables to test hypotheses using Effect Size CIs

**Effect Size Confidence Intervals:**

To get a clearer idea about the significance of the differences between cells, I calculate confidence intervals around the effect size associated with contingency tables. The CIs in Figure 1 below are based on the exact Odds Ratio (using the non-central hypergeometric distribution) for a 2×2 sub-table of the full design obtained from `Fisher's Exact Test`, testing against an odds ratio of 1.

```
> [1] "Figure 1. Exact log Odds Ratio's of 2x2 tables comparing frequency of Cheating between independent samples in each experiment."
```

Note: Here, the Confidence Levels have been adjusted to account for the fact that 3 (EXP1&4) and 4 (EXP2&3) subtables of the full design were compared (`1-(0.05 / #tests)`). The exact p-value from Fisher’s exact test reported in the Figure was multiplied by the number of comparisons in each experiment.
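For readers who want to see the mechanics of the adjustment, here is a minimal Python sketch for one sub-table, again using the counts assumed for Experiment 3, `No Mirror`. It computes a two-sided Fisher exact p-value directly from the hypergeometric distribution and the simple sample odds ratio `ad / bc`; note that this is not the conditional maximum-likelihood odds ratio with exact CI shown in Figure 1, and `fisher_two_sided` is a name invented for this example.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    the summed probability of all tables with the same margins that are
    no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))

# Assumed counts, Experiment 3, No Mirror: rows Money / Time,
# columns cheated / did not cheat
a, b, c, d = 20, 10, 11, 20
n_tests = 4  # sub-tables compared in Experiments 2 and 3

p_exact = fisher_two_sided(a, b, c, d)
p_adjusted = min(1.0, p_exact * n_tests)  # p multiplied by number of tests
conf_level = 1 - 0.05 / n_tests           # adjusted confidence level
sample_or = (a * d) / (b * c)             # simple (unconditional) odds ratio

print(f"Fisher p = {p_exact:.4f}, adjusted p = {p_adjusted:.4f}")
print(f"sample OR = {sample_or:.2f}, CI level = {conf_level:.4f}")
```

The multiplication by the number of tests is what moves a contrast like this one from nominally significant to a marginal case.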

### Conclusion Proportion data

- If there is an effect, it exists as a “main-effect” difference between the `Money` and `Time` primed samples in Experiment 1 and 4.
- Experiment 3 `No Mirror: Money - Time` is a marginal case.
- Experiment 2 did not yield any substantial effects.
- 4-5 out of 7 statistical inferences in the paper that are made based on proportion data should be considered invalid.

## II. Analysis of extent of cheating

The extent of `Cheating` concerns the difference between actual accuracy (which is not provided as a result) and reported accuracy by a participant.

Experiments 1-3 report analyses of extent of `Cheating` including means and SDs. Sample size assumptions for Experiments 2 and 3 are the same as above.

#### Compare Cohen’s d CIs

I created CIs around the effect sizes based on the means and SD reported for Experiments 1-3 using the `R` package `MBESS`.

```
> [1] "Figure 2. Cohen's d with exact CIs comparing extent of Cheating between independent samples in experiment 1-3."
```
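To show how an effect size and interval follow from reported means, SDs and cell sizes, here is a rough Python sketch. It is not what `MBESS` does: the exact CIs in Figure 2 rest on the noncentral *t* distribution, whereas this sketch swaps in the common large-sample normal approximation for the standard error of *d*. The means, SDs and cell sizes in the example call are hypothetical placeholders, not values from the paper.

```python
import math
from statistics import NormalDist

def cohens_d_ci(m1, sd1, n1, m2, sd2, n2, conf_level=0.95):
    """Cohen's d for two independent samples with an approximate CI.

    Uses the pooled SD and a large-sample normal approximation for the
    standard error of d (an approximation, not an exact noncentral-t CI).
    """
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                          / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return d, d - z * se, d + z * se

# Hypothetical summary statistics, for illustration only
d, lower, upper = cohens_d_ci(m1=2.0, sd1=1.0, n1=30, m2=1.5, sd2=1.0, n2=31)
print(f"d = {d:.2f}, approximate 95% CI [{lower:.2f}, {upper:.2f}]")
```

With cell sizes around 30, an interval of this kind is wide, so an observed *d* of moderate size can easily have a lower bound close to zero.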

### Conclusion Extent of Cheating

The pattern is the same as the previous analyses:

- Experiment 1 shows a clear effect between `Money` and `Time` samples
- Experiment 3 `No Mirror: Money - Time` is again a close call

## III. HAPPE-ing (Hypothesising After Post-Publication Evaluation)

Should reviewers have noticed these issues with data analysis?

**Yes, they should have!**

Even without re-analysing the published data as I have done here, the conclusions by the authors can be questioned based on a comparison of very elementary results:

> Across four experiments, using different primes and a variety of measures and tasks, we consistently found that shifting people’s attention to time decreases dishonesty. Priming time makes people reflect on who they are, and this self-reflection reduces their likelihood of behaving dishonestly.

The clue is to compare the results across the 4 experiments and evaluate whether it is valid to infer that the core postulates have been corroborated. The designs and materials are slightly different each time, but if outcomes (e.g., proportion cheating behaviour) vary systematically with one or more of the experimental differences, there may be another variable at work here.

One result that begs explanation is the drop in proportion `Cheating` in all the samples of Experiment 2 when compared to the other experiments. What is special about the procedure and methods? Regrettably more than 1 potential intervening factor changes with respect to Experiment 1.

A second odd omission in the interpretation of the results is the level of accuracy achieved by participants. In Experiments 1-3, the urge to cheat must have been *less* when a participant had achieved 90% accuracy. Experiment 4 is somewhat different in that the cheating opportunity concerns one “bottleneck” problem that is difficult to solve, but has to be correct in order to make other more easily solvable problems count in adding to the final reward. Here, accuracy could have an opposite effect in which less accurate participants cheat less. If 0 or only 1 extra item past the “bottleneck” item were solved, a participant might be less inclined to cheat than a participant who solved every problem except for the “bottleneck” item.

#### What is mediating what?

The figure below shows the interaction between the maximal financial incentive that could be awarded and the proportion cheating for each prime and experimental condition (indicating whether a mediator variable was manipulated in addition to being exposed to a prime). Note that the `Intelligence` and the `No Mirror` condition of Experiments 2 and 3 respectively are considered similar to Experiment 1 and 4, that is, they reflect a condition in which `Self-reflection` was not induced by any other means than priming:

This relationship can be tested in a generalised linear model, of course being fully aware that this is *exploratory HAPPE-ing*. I assume the samples from each experiment are independent and use the number of cheaters vs. no cheaters as the dependent binomial variable. The model contains only those effects for which data are available (e.g., no interactions with both `Prime` and `Mediator`).

Note: A generalised linear mixed model (GLMM) with sample ID as a random effect gives similar results.

```
>
> Call:
> glm(formula = cbind(CheatYES, CheatNO) ~ Reward + Prime + Mediator +
>     Reward * Prime + Reward * Mediator, family = binomial, data = reward)
>
> Deviance Residuals:
>    Min      1Q  Median      3Q     Max
> -1.153  -0.695  -0.122   0.251   1.956
>
> Coefficients:
>                                Estimate Std. Error z value Pr(>|z|)
> (Intercept)                     -0.4495     0.2195   -2.05   0.0405 *
> Reward                           0.0111     0.0219    0.51   0.6125
> PrimeNone                        0.5868     0.3973    1.48   0.1397
> PrimeMoney                       0.6040     0.2824    2.14   0.0325 *
> MediatorSelf-reflection         -0.8128     0.3147   -2.58   0.0098 **
> Reward:PrimeNone                 0.0167     0.0359    0.47   0.6416
> Reward:PrimeMoney                0.0698     0.0327    2.13   0.0329 *
> Reward:MediatorSelf-reflection  -0.0189     0.0434   -0.44   0.6626
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> (Dispersion parameter for binomial family taken to be 1)
>
>     Null deviance: 76.292  on 13  degrees of freedom
> Residual deviance: 11.035  on  6  degrees of freedom
> AIC: 82.48
>
> Number of Fisher Scoring iterations: 4
```

`> [1] "Null-model deviance test: p < 1.33525644154704e-11"`

In the table above the model `Intercept` corresponds to the odds of `Cheating` compared to the Null-model when the predictors have the values: `Prime` = `Time`, `Mediator` = `None` and `Reward` = 0. Compared to the overall probability of observing `Cheating` behaviour, it thus seems that when the `Time` prime is presented without an induction of `Self-reflection` and a financial reward incentive, the odds of `Cheating` drop.

This appears to be a corroboration of the second postulate, but note that in this analysis (just as in the previous analyses), there is no real difference between the `Time` prime and prime = `None`. The standard errors around these parameters are quite high. A clearer picture emerges when the Intercept is defined as `Prime` = `None`, `Mediator` = `None` and `Reward` = 0 and the Odds Ratios are compared (exponentiation of the parameter estimates):

`> [1] "Odds Ratios compared to Prime = None, with profile likelihood CI.95"`

```
>                                  OR 2.5 % 97.5 %
> (Intercept)                    1.15  0.60   2.21
> Reward                         1.03  0.97   1.09
> PrimeTime                      0.56  0.25   1.21
> PrimeMoney                     1.02  0.47   2.21
> MediatorSelf-reflection        0.44  0.24   0.81
> Reward:PrimeTime               0.98  0.92   1.05
> Reward:PrimeMoney              1.05  0.98   1.14
> Reward:MediatorSelf-reflection 0.98  0.90   1.07
```

The odds ratios in the table above are multiplicative changes to the odds of `Cheating` when the predictor increases by 1 unit. So an OR < 1 will decrease the odds of observing `Cheating` behaviour and an OR > 1 will increase it. The 95% CIs are based on the profile likelihood and show that in most cases the effect covers a range below and above 1. The range for the effect of `Self-reflection` is always below 1.
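The step from the first coefficient table (reference level `Prime` = `Time`) to the odds ratios with `Prime` = `None` as reference is plain arithmetic on the log-odds scale, and it can be checked without refitting anything. The Python sketch below only re-expresses the published point estimates; it does not reproduce the profile likelihood CIs, which require the fitted model.

```python
import math

# Log-odds coefficients from the model summary above (reference Prime = Time)
coef = {
    "(Intercept)": -0.4495,
    "Reward": 0.0111,
    "PrimeNone": 0.5868,
    "PrimeMoney": 0.6040,
    "MediatorSelf-reflection": -0.8128,
    "Reward:PrimeNone": 0.0167,
    "Reward:PrimeMoney": 0.0698,
    "Reward:MediatorSelf-reflection": -0.0189,
}

# Same model, re-expressed with Prime = None as the reference level:
# the old PrimeNone terms are absorbed into the intercept and Reward slope
releveled = {
    "(Intercept)": coef["(Intercept)"] + coef["PrimeNone"],
    "Reward": coef["Reward"] + coef["Reward:PrimeNone"],
    "PrimeTime": -coef["PrimeNone"],
    "PrimeMoney": coef["PrimeMoney"] - coef["PrimeNone"],
    "MediatorSelf-reflection": coef["MediatorSelf-reflection"],
    "Reward:PrimeTime": -coef["Reward:PrimeNone"],
    "Reward:PrimeMoney": coef["Reward:PrimeMoney"] - coef["Reward:PrimeNone"],
    "Reward:MediatorSelf-reflection": coef["Reward:MediatorSelf-reflection"],
}

# Odds ratios are the exponentiated log-odds coefficients
odds_ratios = {name: round(math.exp(b), 2) for name, b in releveled.items()}
for name, value in odds_ratios.items():
    print(f"{name:32s} {value:.2f}")
```

The rounded values agree with the OR column of the table above, which confirms that the two tables describe the same fitted model from two different reference points.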

One can interpret the modelled relationship between these variables as follows:

- There is a weak positive association between the `Maximal Financial Reward` and the `Probability of Cheating`
- The association changes with the value of `Prime`, becoming stronger when `Money` is primed, weaker when `Time` is primed
- The induction of `Self-reflection` does not cause the association to change; it changes the intercept, the base-line `Probability of Cheating` at `Reward` = 0
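These three points can be made concrete by turning the published coefficients into predicted probabilities with the inverse logit. The Python sketch below hard-codes the estimates from the model summary (reference level `Prime` = `Time`); the reward value of 10 is a hypothetical illustration, since the reward units are not restated here, and `p_cheat` is a helper name invented for this example.

```python
import math

def p_cheat(reward, prime="Time", self_reflection=False):
    """Predicted probability of cheating from the published point estimates."""
    eta = -0.4495 + 0.0111 * reward          # intercept and Reward slope (Time)
    if prime == "None":
        eta += 0.5868 + 0.0167 * reward      # PrimeNone, Reward:PrimeNone
    elif prime == "Money":
        eta += 0.6040 + 0.0698 * reward      # PrimeMoney, Reward:PrimeMoney
    elif prime != "Time":
        raise ValueError(f"unknown prime: {prime}")
    if self_reflection:
        eta += -0.8128 - 0.0189 * reward     # Mediator, Reward:Mediator
    return 1 / (1 + math.exp(-eta))          # inverse logit

for prime in ("Time", "None", "Money"):
    for reward in (0, 10):                   # 10 is a hypothetical reward value
        plain = p_cheat(reward, prime)
        reflect = p_cheat(reward, prime, self_reflection=True)
        print(f"{prime:5s} reward={reward:2d}  "
              f"p={plain:.2f}  with self-reflection p={reflect:.2f}")
```

In this sketch the predicted probability rises much faster with reward under the `Money` prime than under the `Time` prime, while `Self-reflection` lowers the curve for every prime. Keep in mind that the combination `None` plus `Self-reflection` is a model extrapolation.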

A graphical representation of the model predictions more clearly reveals this relationship:

### Conclusions, Discussion and further HAPPE-ing

- The significant results between `Time` and `Money` in Experiments 1 and 4 probably arise due to the increase in `Probability of Cheating` when there is a financial reward and `Money` is primed.
  - It is unlikely there are any other “real” differences in these data except for the induction of `Self-reflection`: Model predictions show it decreases the `Probability of Cheating` by the same amount for different primes
  - Note that there were no actual data points for `None` + `Self-reflection`
- The missing predictors in the `Probability of Cheating` analysis are the actual and reported *accuracy* of the performance (amount of correctly solved problems and money received respectively). These values cannot be inferred from the extent of cheating analyses. It seems reasonable to assume in most experiments there was less incentive to engage in `Cheating` by participants who were more accurate.
  - This brings up the question of whether the effects are driven by some sort of Speed-Accuracy instruction: Naturally, `Time = Money`, but taking the time to solve the problems may lead to higher accuracy and less incentive to cheat; likewise, a focus on getting as many answers as possible may introduce errors and promote cheating.

In science there is a moral obligation to do the best one can to be as accurate as possible and usually this means it is wise to be as modest as possible about one’s scientific claims. I am not an expert in this field, but the sheer number of questions that can be raised about the validity of the inferences made in this paper makes one wonder who the peers were that achieved consensus about the credibility of this research and what their area of expertise was.

I am not saying this is irrelevant, or poor research; the two effects that survive the scrutiny of **3PR** are certainly interesting. I am just a little worried this paper says more about the morality of contemporary scientific publishing than the scientific study of moral behaviour.

Some notes about this file:

- This file was created using Markdown in RStudio: Unless otherwise indicated in the code blocks (e.g., by **require**), the basic R packages are used.
- All the analyses are based on results reported in the publication.
- The one true gospel on statistical inference does not exist and more than one approach to analyse these data may be defensible.
- Therefore: Please be aware these comments and suggestions reflect my own preferences and standards in these matters. If you feel I should change some of my preferences and/or standards please let me know, because I review and adjust them on a regular basis.
