In the first of what will hopefully be many original-content HIBAR posts we host here, Lucy Cragg and Matthew Inglis raise some questions about the statistical analyses used in a recent high-profile paper by Starr et al. (2013), which claims that approximate number sense measured in infancy predicts mathematical abilities 3 years later. This paper has received a lot of media attention, including this piece by Virginia Hughes, which includes adorable example video footage of the baby testing.
Number Sense in Infancy Predicts Mathematical Abilities in Childhood
Lucy Cragg and Matthew Inglis
Dealing with numbers and quantities is an essential skill in our everyday lives, enabling us to compare the price of two items, follow a recipe, or split a dinner bill between friends. Recently, psychologists have proposed the existence of an Approximate Number System (ANS or ‘number sense’) which enables adults, children, and even infants to compare, order or add nonsymbolic sets of items without needing to count or rely on numerical symbols. Moreover, it has been suggested that the ANS may play an important role in the learning of formal, symbolic mathematics.
In a recent paper, published this week in PNAS, Starr, Libertus and Brannon tested this hypothesis by seeing how well a child’s number sense at 6 months old predicts their mathematics ability at 3 and a half years. At 6 months the infants were shown two simultaneous streams of pictures, one with an alternating number of dots and one with the same number of dots in each picture. The amount of time spent looking at each stream was measured and their preference for looking at the changing numbers was calculated. Half of the infants also completed a similar non-numerical task where either the colour or the size of an image changed or remained constant. Three years later, at 3 and a half years of age, the children were invited back to the lab to complete a nonsymbolic comparison task, in which they were shown a series of pairs of dot arrays and asked to select the array with the greater number of dots. This task is widely used as a measure of the acuity of the ANS in children and adults. They also completed the Test of Early Mathematics Ability (TEMA), a counting task, and an IQ measure (the Reynolds Intellectual Assessment Scales).
Starr et al. found a significant correlation between time spent looking at the changing number of dots at 6 months and both ANS acuity and mathematics ability (as measured by the TEMA) at 3 and a half, which wasn’t accounted for by general intelligence. This would provide strong evidence that early number sense is a building block for later mathematical ability. However, a closer look at the data suggests that we perhaps shouldn’t be so hasty. In the spirit of earlier posts on Had I Been A Reviewer, we ask four questions that occurred to us when reading the paper.
Question 1. The key question in the paper, whether there is a correlation between numerical preference scores and mathematics achievement, was investigated using a Pearson correlation coefficient, with the finding that there was a significant correlation, “r = .28, p = .03”. However, with the appropriate degrees of freedom (46) an r value of .28 is not significant on a two-tailed test, p = .054. The reported p = .03 is what a one-tailed test gives, since the one-tailed value is half the two-tailed one. This would appear to be an invalid use of a one-tailed test, by Lombardi & Hurlbert’s (2009) criteria. In view of these criteria, what justifies the use of a one-tailed test here, and why was it not flagged as being one-tailed in the paper?
In general, which of the p values reported in the paper are from one-tailed tests, which are from two-tailed tests, and which of the ‘significant’ values reported would still be significant if standard statistical practice were followed?
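For readers who want to check the arithmetic in Question 1, here is a minimal sketch using only the Python standard library. It converts r to a t statistic with t = r·sqrt(df / (1 − r²)) and then integrates the t density numerically. The function names (`t_pdf`, `two_tailed_p`, `r_to_t`) and the choice of Simpson's rule are ours for illustration, not anything taken from the paper.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value, 1 - P(-|t| < T < |t|), via Simpson's rule.

    steps must be even for Simpson's rule.
    """
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # probability mass on [-|t|, |t|]
    return 1 - central_mass

def r_to_t(r, df):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df / (1 - r * r))

# Values quoted in the post: r = .28 with 46 degrees of freedom.
r, df = 0.28, 46
t = r_to_t(r, df)
p2 = two_tailed_p(t, df)
print(f"t({df}) = {t:.3f}, two-tailed p = {p2:.3f}, one-tailed p = {p2 / 2:.3f}")
```

The two-tailed value comes out just above .05, and halving it gives a figure that rounds to the reported .03.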
Question 2. An inspection of Figure 2 suggests that the participants in the study did not show a preference for the changing dots at all: by eye, the mean numerical preference score seems to be around zero. To investigate, we extracted the data from Figure 2B using GraphClick (http://www.arizona-software.ch/graphclick). Calculating the mean numerical preference score of all participants yielded +0.05, a value which was not significantly different from zero, t(47) = .573, p = .569. This is surprising, because in the Supplementary Materials to the paper the authors write:
“The cross-sectional infant samples from which the participants were recruited demonstrated significant preference scores in all of the conditions tested [10 vs. 20: t(17) = 1.74; 8 vs. 16: t(15) = 3.11; 5 vs. 15: t(29) = 2.00; 6 vs. 18: t(41) = 1.71; 6 vs. 24: t(15) = 4.31; all values of P < 0.05]. Note that these studies had unequal sample sizes due to differences in the specific goals of each study.”
The degrees of freedom in these analyses suggest that the 48 participants in the study were sampled from a pool of 122 (each group contains df + 1 infants: 18 + 16 + 30 + 42 + 16 = 122). The original 122 apparently showed strong positive preference scores, but the 48 who ended up in this study did not. Why are these participants so unrepresentative of the larger group? In the Libertus & Brannon (2010) study, which reported some of these earlier data, groups which did not show a positive numerical change-detecting score (at the group level) were excluded from later stages in the longitudinal study (p. 903). What accounts for the different exclusion criteria adopted in this paper?
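The pool size can be recovered from the quoted degrees of freedom, since a one-sample t test on n infants has n − 1 degrees of freedom. The short sketch below does that sum. It assumes, as the post does, that no infant contributed to more than one condition; the variable names are ours.

```python
# Degrees of freedom as quoted from the Supplementary Materials.
reported_df = {
    "10 vs. 20": 17,
    "8 vs. 16": 15,
    "5 vs. 15": 29,
    "6 vs. 18": 41,
    "6 vs. 24": 15,
}

# A one-sample t test on n infants has n - 1 degrees of freedom.
sample_sizes = {condition: df + 1 for condition, df in reported_df.items()}
pool_size = sum(sample_sizes.values())

followed_up = 48  # participants in the longitudinal sample
print(sample_sizes)
print(f"pool = {pool_size}, followed up = {followed_up} "
      f"({followed_up / pool_size:.0%} of the pool)")
```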
Question 3. A critical test of the link between early numerical preference and later maths is that the relationship should be specific to detecting changes in number, not just a preference for things that change. Starr et al. investigated this by comparing the correlations between numerical preference and later ANS acuity and mathematics ability, and non-numerical preference and later ANS acuity and mathematics ability in the 24 children who completed both preference tasks. They found a stronger correlation for numerical preference than non-numerical preference for ANS acuity, but no difference in the size of correlations for mathematics ability.
It appears that the authors used a Fisher’s r-to-z test to compare their two correlation coefficients (again reporting a one-tailed p value), and not a Williams-Steiger test, which should be used when comparing dependent correlations (Howell, 1997, p. 264). Fisher’s r-to-z test is only valid when comparing two independent correlation coefficients (i.e. the null hypothesis is that the r in group 1 is equal to the r in group 2). In this paper there is only one group of participants, and the null hypothesis is r12 = r13 (i.e. the correlation between variables 1 and 2 is equal to the correlation between variables 1 and 3).
To calculate the p value associated with this null hypothesis, you need to know r23 (the correlation between variables 2 and 3, in this case the numerical and non-numerical preference scores). Unfortunately, the authors don’t report the correlation between the non-numerical preference scores and numerical preference scores in the paper, so it is impossible for readers to make this calculation. However, they do reference an earlier paper (Libertus & Brannon, 2010) which reports that in a reduced sample of 16 (out of 24) of these children, the r23 correlation was a non-significant -.15. If a Williams-Steiger test is calculated using this value, it seems that the two correlations are not significantly different, p = .08 (two-tailed). What was the correlation between non-numerical and numerical preference scores among the 24 participants in this study? Does an appropriate hypothesis test, such as the Williams-Steiger test, show that the correlations between numerical preference scores and ANS acuity and between non-numerical preference scores and ANS acuity are significantly different?
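As a rough illustration of the test discussed in Question 3, the sketch below implements Williams's t for two dependent correlations that share a variable, in the form usually attributed to Steiger (1980), with n − 3 degrees of freedom. Here variable 1 is ANS acuity, variable 2 is numerical preference and variable 3 is non-numerical preference. The values r12 = -.42, r23 = -.15 and n = 24 are taken from the post; the value used for r13 is a made-up placeholder, because the post does not quote it, so the printed result is not a reproduction of the p = .08 above. Function names are ours.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value via Simpson's rule (steps must be even)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

def williams_t(r12, r13, r23, n):
    """Williams's t for H0: r12 == r13 in a single sample of size n.

    Variable 1 is shared by both correlations; df = n - 3.
    """
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * (n - 1) / (n - 3) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)

n = 24
r12 = -0.42              # numerical preference vs. ANS acuity (from the post)
r23 = -0.15              # numerical vs. non-numerical preference (n = 16 subsample)
R13_HYPOTHETICAL = 0.10  # PLACEHOLDER: not reported in the post, not real data

t = williams_t(r12, R13_HYPOTHETICAL, r23, n)
p = two_tailed_p(t, n - 3)
print(f"t({n - 3}) = {t:.3f}, two-tailed p = {p:.3f}")
```

Unlike Fisher's r-to-z comparison, this statistic depends on r23, which is why the missing correlation matters for anyone trying to check the result.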
Question 4. As noted above, on page 3 the authors tested whether numerical preference scores correlated with ANS acuity and mathematics achievement in the reduced sample for which non-numerical preference scores were available. They found a significant correlation between numerical preference scores and ANS acuity (the authors write “r = -.42, P < 0.02”, whereas the two-tailed p for r(22) = -.42 is .041) but no significant correlation with mathematics achievement (r = .16).
This pattern of results in the smaller sample was attributed to “lack of statistical power”. However, in the larger sample the two effect sizes were more or less identical (rs = .28, -.29), so this seems a curious explanation. Given that these two tests were testing the same hypothesis, and that the authors interpreted one significant result as supporting the hypothesis, should a Bonferroni correction have been applied? If it had, neither (two-tailed) test would have been significant. Is this a problem for the authors’ interpretation?
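The Bonferroni point in Question 4 is simple arithmetic: with two tests of the same hypothesis, each p value is compared against α divided by 2. The sketch below applies this to the one two-tailed p value the post quotes for the reduced sample (.041 for r(22) = -.42). The other correlation (r = .16) was already non-significant at the uncorrected level, so it cannot pass a stricter one. Variable names are ours.

```python
alpha = 0.05
n_tests = 2  # ANS acuity and mathematics achievement, same hypothesis

# Bonferroni: divide the family-wise alpha by the number of tests.
threshold = alpha / n_tests

# Two-tailed p for r(22) = -.42, as quoted in the post.
p_ans_acuity = 0.041
survives = p_ans_acuity <= threshold

print(f"per-test threshold = {threshold:.3f}")
print(f"ANS acuity correlation significant after correction: {survives}")
```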
Howell, D. (1997). Statistical Methods for Psychology. Pacific Grove, CA: Duxbury.
Lombardi, C. M. & Hurlbert, S. H. (2009). Misprescription and misuse of one-tailed tests. Austral Ecology, 34, 447-468.
Libertus, M. E. & Brannon, E. M. (2010). Stable individual differences in number discrimination in infancy. Developmental Science, 13, 900-906.