*In the first of what will hopefully be many original-content HIBAR posts we host here, Lucy Cragg and Matthew Inglis raise some questions about the statistical analyses used in a recent high-profile paper by Starr et al. (2013), which claims that approximate number sense measured in infancy predicts mathematical abilities 3 years later. This paper has received a lot of media attention, including this piece by Virginia Hughes, which includes adorable example video footage of the baby testing :)*

**Number Sense in Infancy Predicts Mathematical Abilities in Childhood**

**Lucy Cragg and Matthew Inglis**

Dealing with numbers and quantities is an essential skill in our everyday lives, enabling us to compare the price of two items, follow a recipe, or split a dinner bill between friends. Recently, psychologists have proposed the existence of an *Approximate Number System* (ANS or ‘number sense’), which enables adults, children, and even infants to compare, order or add nonsymbolic sets of items without needing to count or rely on numerical symbols. Moreover, it has been suggested that the ANS may play an important role in the learning of formal, symbolic mathematics.

In a recent paper, published this week in PNAS, Starr, Libertus and Brannon tested this hypothesis by seeing how well a child’s number sense at 6 months old predicts their mathematics ability at 3 and a half years. At 6 months the infants were shown two simultaneous streams of pictures, one with an alternating number of dots and one with the same number of dots in each picture. The amount of time spent looking at each stream was measured and their preference for looking at the changing numbers was calculated. Half of the infants also completed a similar non-numerical task where either the colour or the size of an image changed or remained constant. Three and a half years later children were invited back to the lab to complete a nonsymbolic comparison task, in which they were shown a series of two arrays of dots and asked to select the one with the greater number of dots. This is widely used as a measure of the acuity of the ANS in children and adults. They also completed the Test of Early Mathematics Ability (TEMA), a counting task, and an IQ measure (the Reynolds Intellectual Assessment Scales).

Starr et al. found a significant correlation between time spent looking at the changing number of dots at 6 months and both ANS acuity and mathematics ability (as measured by the TEMA) at 3 and a half, which wasn’t accounted for by general intelligence. This would provide strong evidence that early number sense is a building block for later mathematical ability. However, a closer look at the data suggests that we perhaps shouldn’t be so hasty. In the spirit of earlier posts on *Had I Been A Reviewer*, we ask four questions that occurred to us when reading the paper.

*Question 1.* The key question in the paper, whether there is a correlation between numerical preference scores and mathematics achievement, was investigated using a Pearson correlation coefficient, with the finding that there was a significant correlation, “*r* = .28, *p* = .03”. However, with the appropriate degrees of freedom (46), an *r* value of .28 is not significant on a two-tailed test, *p* = .054; the reported value is what a one-tailed test would give. This would appear to be an invalid use of a one-tailed test, by Lombardi & Hurlbert’s (2009) criteria. In view of these criteria, what justifies the use of a one-tailed test here, and why was it not flagged as being one-tailed in the paper?

In general, which of the *p* values reported in the paper are from one-tailed tests, which are from two-tailed tests, and which of the ‘significant’ values reported would still be significant if standard statistical practice were followed?
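The arithmetic behind Question 1 can be checked with a few lines of Python. This is a minimal sketch rather than the authors’ analysis: it converts *r* to *t* with the standard formula *t* = *r*√(df / (1 − *r*²)) and integrates the Student *t* density numerically, so it needs only the standard library. The function names (`t_pdf`, `t_two_tailed_p`, `correlation_p`) are our own, and halving the two-tailed value assumes the effect lies in the predicted direction.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central_half

def correlation_p(r, df):
    """Return (two-tailed p, one-tailed p) for a Pearson r with df = N - 2."""
    t = r * math.sqrt(df / (1 - r * r))
    two_tailed = t_two_tailed_p(t, df)
    return two_tailed, two_tailed / 2  # halving assumes the predicted direction

two_tailed, one_tailed = correlation_p(0.28, 46)
print(f"r = .28, df = 46: two-tailed p = {two_tailed:.3f}, one-tailed p = {one_tailed:.3f}")
```

Only the one-tailed value falls below .05, which is the discrepancy the question is about.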

*Question 2. *An inspection of Figure 2 suggests that the participants in the study did not show a preference for the changing dots at all: by eye, the mean numerical preference score seems to be around zero. To investigate, we extracted the data from Figure 2B using GraphClick (http://www.arizona-software.ch/graphclick). Calculating the mean numerical preference scores of all participants yielded +0.05, a value which was not significantly different from zero, *t*(47)=.573, *p* = .569. This is surprising, because in the Supplementary Materials to the paper the authors write

“The cross-sectional infant samples from which the participants were recruited demonstrated significant preference scores in all of the conditions tested [10 vs. 20: *t*(17) = 1.74; 8 vs. 16: *t*(15) = 3.11; 5 vs. 15: *t*(29) = 2.00; 6 vs. 18: *t*(41) = 1.71; 6 vs. 24: *t*(15) = 4.31; all values of *P* < 0.05]. Note that these studies had unequal sample sizes due to differences in the specific goals of each study.”

The degrees of freedom in these analyses suggest that the 48 participants in the study were sampled from a pool of 122 (each one-sample *t* test has *n* − 1 degrees of freedom, so the five samples contained 18 + 16 + 30 + 42 + 16 = 122 infants). The original 122 apparently showed strong positive preference scores, but the 48 who ended up in this study did not. Why are these participants so unrepresentative of the larger group? In the Libertus & Brannon (2010) study, which reported some of this earlier data, groups which did not show a positive numerical change-detecting score (at the group level) were excluded from later stages in the longitudinal study (p. 903). What accounts for the different exclusion criteria adopted in this paper?

*Question 3.* A critical test of the link between early numerical preference and later maths is that the relationship should be specific to detecting changes in number, not just a preference for things that change. Starr et al. investigated this by comparing the correlations between numerical preference and later ANS acuity and mathematics ability with those between non-numerical preference and later ANS acuity and mathematics ability, in the 24 children who completed both preference tasks. They found a stronger correlation for numerical preference than non-numerical preference for ANS acuity, but no difference in the size of correlations for mathematics ability.

It appears that the authors used a Fisher’s *r* to *z* test to compare their two correlation coefficients (again reporting a one-tailed *p* value), and not a Williams-Steiger test, which should be used when comparing dependent correlations (Howell, 1997, p. 264). Fisher’s *r* to *z* test is only valid when comparing two *independent* correlation coefficients (i.e. the null hypothesis is that the *r* in group 1 is equal to the *r* in group 2). In this paper there is only one group of participants, and the null hypothesis is *r*_{12} = *r*_{13} (i.e. the correlation between variables 1 and 2 is equal to the correlation between variables 1 and 3).

To calculate the *p* value associated with this null hypothesis, you need to know *r*_{23} (the correlation between variables 2 and 3, in this case the numerical and non-numerical preference scores). Unfortunately, the authors don’t report the correlation between the non-numerical preference scores and numerical preference scores in the paper, so it is impossible for readers to make this calculation. However, they do reference an earlier paper (Libertus & Brannon, 2010) which reports that in a reduced sample of 16 (out of 24) of these children, the *r*_{23} correlation was a non-significant -.15. If a Williams-Steiger test is calculated using this value, it seems that the two correlations are not significantly different, *p* = .08 (two-tailed). What was the correlation between non-numerical and numerical preference scores among the 24 participants in this study? Does an appropriate hypothesis test, such as the Williams-Steiger test, show that the correlations between numerical preference scores and ANS acuity and between non-numerical preference scores and ANS acuity are significantly different?
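For readers who want to try the calculation, here is a minimal sketch of Williams’ *t* for two dependent correlations that share a variable, in the form given in standard textbooks such as Howell (1997). Variable 1 is taken to be ANS acuity, variable 2 numerical preference and variable 3 non-numerical preference. The values *r*_{12} = -.42, *r*_{23} = -.15 and *N* = 24 come from the text; the value used for *r*_{13} is a purely hypothetical placeholder, because that correlation is not quoted in this post, so the printed result is illustrative only and is not expected to match the .08 above. The function names are ours.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p value via Simpson's rule on the interval [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

def williams_t(r12, r13, r23, n):
    """Williams' t for H0: r12 == r13 in a single sample of size n; df = n - 3."""
    # Determinant of the 3 x 3 correlation matrix
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt(numerator / denominator), n - 3

R13_HYPOTHETICAL = 0.10  # placeholder only: NOT a value reported in the post

t_stat, df = williams_t(r12=-0.42, r13=R13_HYPOTHETICAL, r23=-0.15, n=24)
p_value = t_two_tailed_p(t_stat, df)
print(f"Williams t({df}) = {t_stat:.2f}, two-tailed p = {p_value:.3f}")
```

Unlike Fisher’s *r* to *z* test, the statistic depends on *r*_{23}, which is why the unreported correlation matters.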

*Question 4.* As noted above, on page 3 the authors tested whether numerical preference scores correlated with ANS acuity and mathematics achievement in the reduced sample for which non-numerical preference scores were available. They found a significant correlation between numerical preference scores and ANS acuity (the authors write “*r* = -.42, *P* < 0.02”, whereas the two-tailed *p* for *r*(22) = -.42 is .041) but no significant correlation with mathematics achievement (*r* = .16).

This pattern of results in the smaller sample was attributed to “lack of statistical power”. However, in the larger sample the two effect sizes were more or less identical (*r*s = .28, -.29), so this seems a curious explanation. Given that these two tests were testing the same hypothesis, and that the authors interpreted one significant result as supporting the hypothesis, should a Bonferroni correction have been applied? If it had, neither (two-tailed) test would have been significant. Is this a problem for the authors’ interpretation?
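As a rough check on the Bonferroni point, the sketch below computes two-tailed *p* values for the four correlations quoted in this post and compares each with .05 / 2 = .025. It rests on several assumptions of ours: that the family consists of two tests (ANS acuity and mathematics achievement), that df = 46 applies to both correlations in the larger sample, and that df = 22 applies to both in the reduced sample. The helper names are our own.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(r, df, steps=2000):
    """Two-tailed p value for a Pearson r with df = N - 2 (Simpson's rule)."""
    upper = abs(r) * math.sqrt(df / (1 - r * r))
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

ALPHA = 0.05
N_TESTS = 2                    # assumption: two tests of the same hypothesis
threshold = ALPHA / N_TESTS    # Bonferroni-corrected criterion

# (r, df) pairs as quoted in the post; df for the second row is assumed
correlations = {
    "larger sample, maths achievement": (0.28, 46),
    "larger sample, ANS acuity": (-0.29, 46),
    "reduced sample, ANS acuity": (-0.42, 22),
    "reduced sample, maths achievement": (0.16, 22),
}

p_values = {label: two_tailed_p(r, df) for label, (r, df) in correlations.items()}
for label, p in p_values.items():
    verdict = "significant" if p < threshold else "not significant"
    print(f"{label}: p = {p:.3f} ({verdict} at alpha = {threshold})")
```

Under these assumptions none of the four two-tailed values falls below the corrected criterion.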

**References**

Howell, D. (1997). *Statistical Methods for Psychology.* Pacific Grove, CA: Duxbury.

Lombardi, C. M. & Hurlbert, S. H. (2009). Misprescription and misuse of one-tailed tests. *Austral Ecology, 34,* 447-468.

Libertus, M. E. & Brannon, E. M. (2010). Stable individual differences in number discrimination in infancy. *Developmental Science, 13,* 900-906.

Excellent analysis. Here in Germany the article was described in the Süddeutsche Zeitung, one of Germany’s top newspapers, with the title “Endowment in Mathematics is Hereditary”. Needless to say the author really didn’t understand the paper itself or the analysis of the data.

That is why the peer review process needs to be improved tremendously. Once such results appear in a mainstream journal, they easily become great headlines for the mainstream press, which never spends time analysing such complex problems in detail.

Excellent job. It would be nice to hear a reply from the authors with respect to the above comments.

Great post. To be honest, the stats issues escaped me. I had a more fundamental concern with the way the infant data was pooled from five different studies with each infant making one of five different numerical comparisons. From page 4:

“studies varied in the number of elements presented in the changing and constant streams and the number of participants drawn from each of these cross-sectional studies varied as well: 6 vs. 24 (n=2), 5 vs.15 (n=18), 6 vs.18 (n=13), 8 vs.16 (n= 2), or 10 vs. 20 (n = 13).”

Everything hangs on the assumption that, despite the different numbers and ratios of objects in the displays, the preference scores can all be lumped in together as if the infants were all doing the same task.

The authors attempt to address this issue by “normalizing” the data:

“To enable comparison of preference scores across numerical conditions, preference scores were normalized by dividing each score by the highest score in its respective condition.”

Each child was given a preference score ranging from -1 to +1 that indicated whether they preferred the changing (positive) or non-changing (negative) display. Unless I’ve completely misunderstood, normalization involved dividing each infant’s preference score by the preference score of the infant in their study with the greatest preference for the changing display.

This makes all kinds of assumptions about the distributions of preference scores (which are bounded by plus and minus 1) and means that each child’s normalized score is a function of two numbers – their own preference score and the most extreme positive preference score in the sample. Given that the most extreme values are also likely to be the least accurate (regression to the mean), this suggests that the normalized data are going to be incredibly unreliable.

In fact, in Figure 2, two infants actually had “normalized” preference scores that were less than -1. This means that these infants showed a stronger preference for the non-changing display than any of the other infants showed for the changing display. In other words, by the standards of their group, they showed strong “number sense” but the “wrong” preference.

Good point Jon. In fact, even if you accept that the different numerical preference tasks are in principle equivalent, the way the authors apparently normalised the scores is, I think, rather problematic.

If you assume that the unstandardised scores come from a normal distribution, then when you have a large sample you’d expect a more extreme largest value than when you have a small sample, so everyone’s scores will be normalised in a more extreme way. In other words, individuals’ normalised scores are to some extent dependent on how many other participants were in the same condition as them.

To investigate the potential scale of this effect, I ran some quick simulations. According to the degrees of freedom reported in the Supplementary Materials, the sample sizes of the authors’ various conditions varied from 16 to 42. I generated 1000 datasets, assuming a population unstandardised mean of 0.4 and an SD of 0.2 (these numbers don’t matter very much, the basic point is the same regardless of what you choose here). When I assumed an N of 16 I got a grand normalised mean (the mean across all 1000 datasets) of 0.54, and when I assumed an N of 42 I got a grand normalised mean of 0.48. In other words, when you have unequal sample sizes this way of normalising the data may introduce artefactual effects (with potentially large effect sizes, d=0.3 in this case).
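A minimal sketch of a simulation along these lines is given below. It is not the code that produced the figures above: it assumes normally distributed raw scores (mean 0.4, SD 0.2) with no clipping to the ±1 range, 1000 datasets per sample size, and normalisation by dividing each score by the largest score in its dataset. The function name and the seed are arbitrary choices.

```python
import random
import statistics

def grand_normalised_mean(n, n_datasets=1000, mu=0.4, sd=0.2, seed=1):
    """Mean, across simulated datasets, of the mean max-normalised score."""
    rng = random.Random(seed)
    dataset_means = []
    for _ in range(n_datasets):
        # Raw (unstandardised) preference scores for one simulated condition
        scores = [rng.gauss(mu, sd) for _ in range(n)]
        top = max(scores)  # the highest score in the condition
        normalised = [s / top for s in scores]
        dataset_means.append(statistics.fmean(normalised))
    return statistics.fmean(dataset_means)

mean_n16 = grand_normalised_mean(16)
mean_n42 = grand_normalised_mean(42)
print(f"N = 16: grand normalised mean = {mean_n16:.2f}")
print(f"N = 42: grand normalised mean = {mean_n42:.2f}")
```

The larger condition tends to contain a more extreme maximum, so its scores are divided by a bigger number and its normalised mean comes out lower, even though both conditions share the same underlying distribution.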

I am out of my league here in both education and field, and I cannot freeze frames on my current computer. That said, repeated viewings of the “adorable footage” above leave me with the impression that the pictures with variable numbers of dots have more variation in dot size across successive frames than the pictures with a constant number of dots. By that, I mean that if the dot diameters ranged from one to five, the variable-number pictures had more steps between successive shots than the constant-number ones. I wish I had access to frame freeze to check the accuracy of my impression, as a greater variation in sizes between shots could be as attractive to infants as a variation in the number of dots.