From 2000 to 2004, I helped start an education technology company called the Grow Network. In the early days, we sat at makeshift desks in a converted cabbage warehouse, blowing on our fingers to keep them warm. Eventually we became a large and successful company, serving millions of students across the country. McGraw-Hill Education acquired us around the time I returned to academia, and my understanding is that they heat the building in winter now.
As the company's Vice President for Education and Product Development, I led psychometrics for Grow, which meant that I routinely worked with large-scale assessment data for hundreds of thousands of students in grades 3-8. From time to time, I wondered about the gender gap I had heard about between boys' and girls' mathematics achievement. But when I analyzed the raw data myself, I found no obvious effect. Naturally, I wasn't conducting anything like a careful study.
Now, in a very significant paper, Hyde et al. have shown convincingly, based on multi-state data for millions of students, that - whatever used to be the case in decades past - there is no gender gap in math achievement today. (Janet S. Hyde, Sara M. Lindberg, Marcia C. Linn, Amy B. Ellis, and Caroline C. Williams, "Gender Similarities Characterize Math Performance," Science, 25 July 2008, Vol. 321, no. 5888, pp. 494-495, DOI: 10.1126/science.1160364)
I hope that policy and advocacy groups will run with this message and work it into the culture of K-12 education - and American culture as a whole. I'll certainly look for more opportunities to spread the news.
Here begin some more detailed responses to the paper. Take these for what they are: the first responses of a reasonably intelligent non-expert.
The Hyde et al. paper is about two different things. The first concern of the paper is with math achievement in the mean. Here the findings are clear and straightforward: there is no math gap today, in the mean. This result forms the paper's headline: "Gender Similarities Characterize Math Performance."
The second concern of the paper is much more complex, having to do with gender disparities in the science and engineering pipeline that leads from college majors, to Ph.D. recipients, to tenured positions in academia. The gender gap tends to get more pronounced the further down the pipeline you look. Most notable is the scarcity of women in tenured positions in top-25 research universities.
The gender gap downstream is a very complicated problem, or set of problems, and frankly I do not think that children's math scores are the best place to begin thinking about them. Leaving aside my own instincts, however, it has certainly been suggested in the past that gender differences in math ability at the high end may go some way towards explaining the "tenure gap." The idea is that the means for males and females could be exactly the same, and yet if the variability were higher for males, then males would predominate at the high end of the achievement distribution (as well as at the low end, presumably!). Exactly how this disparity in achievement at the high end is supposed to contribute to the tenure gap is never spelled out, but it is at least an empirical question whether children's scores do or do not exhibit a gender discrepancy in variability (in either direction).
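To make the "same mean, different variability" idea concrete, here is a minimal sketch. It rests on assumptions that are mine, not the paper's: scores in each group are normally distributed with identical means, the two groups are the same size, and the male:female variance ratio is 1.16 (the estimate quoted later in this post). The function name `tail_ratio` is hypothetical. It finds the cutoff for the top 1% of the pooled population and reports how many males there are per female above that cutoff.

```python
from statistics import NormalDist


def tail_ratio(variance_ratio, top_fraction=0.01):
    """Return (cutoff, males per female) above the pooled top_fraction cutoff.

    Assumes normal scores, equal means, and equal group sizes.
    The female standard deviation is fixed at 1 as the unit of measurement.
    """
    female = NormalDist(mu=0.0, sigma=1.0)
    male = NormalDist(mu=0.0, sigma=variance_ratio ** 0.5)

    def pooled_tail(cutoff):
        # Share of the combined (50/50) population scoring above the cutoff.
        return 0.5 * (1 - female.cdf(cutoff)) + 0.5 * (1 - male.cdf(cutoff))

    # Bisection: the pooled tail shrinks as the cutoff rises.
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if pooled_tail(mid) > top_fraction:
            lo = mid
        else:
            hi = mid
    cutoff = (lo + hi) / 2
    return cutoff, (1 - male.cdf(cutoff)) / (1 - female.cdf(cutoff))


if __name__ == "__main__":
    cutoff, ratio = tail_ratio(1.16)
    print(f"cutoff: {cutoff:.2f} female SDs above the mean")
    print(f"males per female above the cutoff: {ratio:.2f}")
```

Under these assumptions the ratio comes out to roughly 1.6 males per female in the top 1%, and it is exactly 1 when the variance ratio is 1. Real score distributions need not be normal, so this illustrates the mechanism rather than predicting any observed ratio.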
Hyde et al. examined the high end of the score distribution for their data set. They did indeed find greater variability in male students' scores (see graphic further below). Greater variability implies that males will be overrepresented at high percentiles.
Actual gender ratios at high percentiles were not published for the full data set, though the researchers noted that male 11th-grade students in Minnesota did tend to be overrepresented at the 99th percentile and above. (Minnesota 11th-grade was the only state/grade combination for which actual gender ratios were quoted above the 99th percentile, although no reason was given to suspect that Minnesota grade 11 was atypical; the greater variability in male students' scores was significant across the full range of states and grades analyzed, as shown in the graphic further below).
I don't really understand the data on page 494 as they relate to the variance ratios quoted for Asian/Pacific Islanders in Table S2, but for the white students at least, I would call the tail effect substantial. In Minnesota at 11th grade, there was a 2-to-1 ratio of males to females at the 99th percentile and above. The authors note (p. 495) that,
If a particular specialty required mathematical skills at the 99th percentile, and the gender ratio is 2.0, we would expect 67% men in the occupation and 33% women. Yet today, for example, Ph.D. programs in engineering average only about 15% women.
In the conclusion of the paper, this observation is rendered as follows (p. 495):
There is evidence of slightly greater male variability in scores, although the causes remain unexplained. Gender differences in math performance, even among high scorers, are insufficient to explain lopsided gender patterns in participation in some STEM fields.
(One person's "substantial" is another person's "slight.") I think anybody would agree with the authors that a 2-to-1 ratio at the 99th percentile is insufficient to explain the lopsided gender patterns we observe today. 67-33 is simply not the same as 85-15. But the authors are clearly doing their rhetorical best to portray this glass as half-full; 67-33 is not the same as 50-50 either.
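The arithmetic behind these percentages is easy to check. The sketch below assumes only that "gender ratio" means males per female; the function names are my own, chosen for illustration. It converts a ratio into a male share, and a male share back into a ratio.

```python
def share_from_ratio(males_per_female):
    """Male share of a group that has `males_per_female` males per female."""
    return males_per_female / (1 + males_per_female)


def ratio_from_share(male_share):
    """Males per female in a group whose male share is `male_share`."""
    return male_share / (1 - male_share)


if __name__ == "__main__":
    # A 2-to-1 ratio at the 99th percentile, as quoted for Minnesota grade 11.
    print(f"2.0 males per female -> {share_from_ratio(2.0):.0%} men")
    # Engineering Ph.D. programs at about 15% women, i.e. about 85% men.
    print(f"85% men -> {ratio_from_share(0.85):.1f} males per female")
```

So a 2-to-1 ratio gives the quoted 67-33 split, while an 85-15 split corresponds to nearly 6 males per female, almost three times the ratio observed at the 99th percentile.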
The primary result of the paper - the lack of a gender gap in mean math achievement - looks ironclad. The secondary finding, that of greater variability among males, also looks incontrovertible - whatever we may think about its relevance to the pipeline problem. The variability effect is probably not new, but this paper, based as it is on the scores of millions of students ranging in age from 7 through 18, provides a convenient point of reference and a handy estimate (male:female variance ratio = 1.16).
(I created this histogram in Excel by hand-keying the data in Table S1 in the Supplemental Online Material. Two outliers have been dropped to keep the scale compact, specifically 1.76 and 2.39. The green shading indicates ratios less than unity. N=64 observations are shown.)
Hyde et al. also looked at performance on difficult questions as another lens on the problem. They found (p. 495) that
At grade 12, effect sizes for [hard] items ranged between 0 and 0.15 (average d= 0.07). At grade 8, effect sizes for these items ranged between 0 and 0.08 (average d= 0.05). Thus, even for difficult items requiring substantial depth of knowledge, gender differences were still quite small.
This shows that gender differences were small in the mean even for hard problems - a significant finding. But now the tail question arises once more: What about gender disparities at the high end of performance on the "subtest" of hard questions? Is the variability ratio for the hard subtest similar to the 1.16 quoted elsewhere in the paper? What male:female ratio do we find among those achieving perfect scores on the difficult subtest? I think this is an important question - possibly much more important than the overall variance ratio given in the paper - because, having looked at many standardized tests during my time at Grow, I have long known that many of the questions on accountability exams are "gimmes" that are not exactly selecting our future star researchers. Less anecdotally, Ginsburg et al. (p. xiii) reported in 2005 that (emphasis added)
The questions on Singapore’s high-stakes grade 6 Primary School Leaving Examination (PSLE) are more challenging than the released items on the U.S. grade 8 National Assessment of Education Progress (NAEP) and the items on the grade 8 state assessments. ... Singapore’s most challenging questions are designed to help Singapore identify the best students. These are more difficult than the most challenging questions on the state grade 8 assessments as well as on NAEP.
For this reason I was surprised by the tone of surprise in the paper's final sentence (p. 495):
An unexpected finding was that state assessments designed to meet NCLB requirements fail to test complex problem-solving of the kind needed for success in STEM careers, a lacuna that should be fixed.
The upshot of all this may be that PISA or TIMSS would be a better instrument for investigating the STEM pipeline than U.S. exams (whether state or NAEP).
The facts reported by Hyde et al. are facts, of course, but even so, all of this research seems backwards to me. Instead of sifting through the tea leaves of seven-year-olds' math scores, shouldn't we be looking directly at the traits of successful scientists? Though we'd probably find fairly high math ability in this group, I think we might find even greater deviations in certain relevant dimensions of personality. Think for example about the way curiosity factors into a scientific career. Or creativity. Then there's ambition... the ability to concentrate for extended periods of time... relative freedom from family responsibility during one's 20s and 30s... and a level of perseverance and attention to detail bordering on the obsessive-compulsive. Or who knows - maybe scientists would turn out to be just like the rest of us.