Sunday, July 27, 2008

The gender gap in math has disappeared.

A major paper published this week in Science is worth a look.

During the years from 2000 to 2004, I helped to start an education technology company called the Grow Network. In the early days, we sat at makeshift desks in a converted cabbage warehouse, blowing on our fingers to keep them warm. Eventually we became a large and successful company, serving millions of students across the country. McGraw-Hill Education acquired us around the time I returned to academia, and my understanding is that they heat the building in winter now.

As the company's Vice President for Education and Product Development, I led psychometrics for Grow, which meant that I routinely worked with large-scale assessment data for hundreds of thousands of students in grades 3-8. From time to time, I wondered about the gender gap I had heard about between boys' and girls' mathematics achievement. But when I analyzed the raw data myself, I found no obvious effect. Naturally, I wasn't conducting anything like a careful study.

Now in a very significant paper, Hyde et al. have shown convincingly based on multi-state data for millions of students that - whatever used to be the case in decades past - there is no gender gap today. (Gender Similarities Characterize Math Performance, by Janet S. Hyde, Sara M. Lindberg, Marcia C. Linn, Amy B. Ellis, Caroline C. Williams, Science 25 July 2008: Vol. 321. no. 5888, pp. 494 - 495, DOI: 10.1126/science.1160364)

I hope that policy and advocacy groups will run with this message and work it into the culture of K-12 education - and American culture as a whole. I'll certainly look for more opportunities to spread the news.


Here begin some more detailed responses to the paper. Take these for what they are: the first responses of a reasonably intelligent non-expert.

The Hyde et al. paper is about two different things. The first concern of the paper is with math achievement in the mean. Here the findings are clear and straightforward: there is no math gap today, in the mean. This result forms the paper's headline: "Gender Similarities Characterize Math Performance."

The second concern of the paper is much more complex, having to do with gender disparities in the science and engineering pipeline that leads from college majors, to Ph.D. recipients, to tenured positions in academia. The gender gap tends to get more pronounced the further down the pipeline you look. Most notable is the scarcity of women in tenured positions in top-25 research universities.

The gender gap downstream is a very complicated problem, or set of problems, and frankly I do not think that children's math scores are the best place to begin thinking about them. Leaving aside my own instincts, however, it has certainly been suggested in the past that gender differences in math ability at the high end may go some way towards explaining the "tenure gap." The idea is that the means for males and females could be exactly the same, and yet if the variability were higher for males, then males would predominate at the high end of the achievement distribution (as well as at the low end, presumably!). Exactly how this disparity in achievement at the high end is supposed to contribute to the tenure gap is never spelled out, but it is at least an empirical question whether children's scores do or do not exhibit a gender discrepancy in variability (in either direction).
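The tail argument above can be made concrete with a small sketch. It assumes things the paper does not claim: that scores for each group are normally distributed, that the two groups are equal in size, and that the means are identical. The function name `tail_ratio` and its parameters are made up for illustration.

```python
from statistics import NormalDist


def tail_ratio(variance_ratio, percentile, mean_diff=0.0):
    """Male:female ratio above a cutoff set at `percentile` of the combined
    population. Assumes normal scores and equal group sizes (illustrative
    assumptions only, not taken from the paper)."""
    female = NormalDist(0.0, 1.0)
    male = NormalDist(mean_diff, variance_ratio ** 0.5)
    target = 1.0 - percentile  # share of the combined population above the cutoff

    def share_above(cutoff):
        return 0.5 * ((1.0 - male.cdf(cutoff)) + (1.0 - female.cdf(cutoff)))

    # Bisection: share_above() decreases as the cutoff rises.
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if share_above(mid) > target:
            lo = mid
        else:
            hi = mid
    cutoff = (lo + hi) / 2.0
    return (1.0 - male.cdf(cutoff)) / (1.0 - female.cdf(cutoff))


if __name__ == "__main__":
    for p in (0.95, 0.99, 0.999):
        print(p, round(tail_ratio(1.16, p), 2))
```

With equal variances the ratio is exactly 1 at every cutoff. With a variance ratio above 1 it exceeds 1 and grows as the cutoff moves further into the tail, which is the effect described above.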

Hyde et al. examined the high end of the score distribution for their data set. They did indeed find greater variability in male students' scores (see graphic further below). This points to overrepresentation of males at high percentiles.

Actual gender ratios at high percentiles were not published for the full data set, though the researchers noted that male 11th-grade students in Minnesota did tend to be overrepresented at the 99th percentile and above. (Minnesota 11th-grade was the only state/grade combination for which actual gender ratios were quoted above the 99th percentile, although no reason was given to suspect that Minnesota grade 11 was atypical; the greater variability in male students' scores was significant across the full range of states and grades analyzed, as shown in the graphic further below).

I don't really understand the data on page 494 as they relate to the variance ratios quoted for Asian/Pacific Islanders in Table S2, but for the white students at least, I would call the tail effect substantial. In Minnesota at 11th grade, there was a 2-to-1 ratio of males to females at the 99th percentile and above. The authors note (p. 495) that,
"If a particular specialty required mathematical skills at the 99th percentile, and the gender ratio is 2.0, we would expect 67% men in the occupation and 33% women. Yet today, for example, Ph.D. programs in engineering average only about 15% women."

In the conclusion of the paper, this observation is rendered as follows (p. 495):
"There is evidence of slightly greater male variability in scores, although the causes remain unexplained. Gender differences in math performance, even among high scorers, are insufficient to explain lopsided gender patterns in participation in some STEM fields."

(One person's "substantial" is another person's "slight.") I think anybody would agree with the authors that a 2-to-1 ratio at the 99th percentile is insufficient to explain the lopsided gender patterns we observe today. 67-33 is simply not the same as 85-15. But the authors are clearly doing their rhetorical best to portray this glass as half-full; 67-33 is not the same as 50-50 either.
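The arithmetic behind these figures is easy to check. The sketch below converts a male:female ratio into percentage shares and back again; the function names are made up for illustration.

```python
def shares_from_ratio(ratio):
    """Turn a male:female ratio into (male share, female share)."""
    male = ratio / (1.0 + ratio)
    return male, 1.0 - male


def ratio_from_share(male_share):
    """Turn a male share (0..1) into the implied male:female ratio."""
    return male_share / (1.0 - male_share)


if __name__ == "__main__":
    male, female = shares_from_ratio(2.0)
    print(round(male * 100), round(female * 100))  # 67 33
    print(round(ratio_from_share(0.85), 2))        # 5.67
```

Put this way, an 85-15 split corresponds to a ratio of roughly 5.7 to 1, against the 2-to-1 ratio observed at the 99th percentile and the 1-to-1 ratio that a 50-50 split would imply.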


The primary result of the paper - the lack of a gender gap in math achievement - looks ironclad. The secondary finding, that of greater variability among males, is also obviously incontrovertible - whatever we may think about its relevance to the pipeline problem. The variability effect is probably not new, but this paper, based as it is on the scores of millions of students ranging in age from 7 through 18, provides a convenient point of reference and a handy estimate (variance ratio male:female = 1.16).

(I created this histogram in Excel by hand-keying the data in Table S1 in the Supplemental Online Material. Two outliers have been dropped to keep the scale compact, specifically 1.76 and 2.39. The green shading indicates ratios less than unity. N=64 observations are shown.)


Hyde et al. also looked at performance on difficult questions as another lens on the problem. They found (p. 495) that
"At grade 12, effect sizes for [hard] items ranged between 0 and 0.15 (average d= 0.07). At grade 8, effect sizes for these items ranged between 0 and 0.08 (average d= 0.05). Thus, even for difficult items requiring substantial depth of knowledge, gender differences were still quite small."

This shows that gender differences were small in the mean even for hard problems - a significant finding. But now the tail question arises once more: What about gender disparities at the high end of performance on the "subtest" of hard questions? Is the variance ratio for the hard subtest similar to the 1.16 quoted elsewhere in the paper? What male:female ratio do we find among those achieving perfect scores on the difficult subtest? I think this is an important question - possibly much more important than the overall variance ratio given in the paper - because, having looked at many standardized tests during my time at Grow, I have long known that many of the questions on accountability exams are "gimmes" that are not exactly selecting our future star researchers. Less anecdotally, Ginsburg et al. (p. xiii) reported in 2005 that (emphasis added)
"The questions on Singapore’s high-stakes grade 6 Primary School Leaving Examination (PSLE) are more challenging than the released items on the U.S. grade 8 National Assessment of Education Progress (NAEP) and the items on the grade 8 state assessments. ... Singapore’s most challenging questions are designed to help Singapore identify the best students. These are more difficult than the most challenging questions on the state grade 8 assessments as well as on NAEP."

For this reason I was surprised by the tone of surprise in the paper's final sentence (p. 495):
"An unexpected finding was that state assessments designed to meet NCLB requirements fail to test complex problem-solving of the kind needed for success in STEM careers, a lacuna that should be fixed."

The net result of all this discussion might be that PISA or TIMSS could be a better way to investigate the STEM pipeline than U.S. exams (whether state or NAEP).


The facts reported by Hyde et al. are facts, of course, but even so, all of this research seems backwards to me. Instead of sifting through the tea leaves of seven-year-olds' math scores, shouldn't we be looking directly at the traits of successful scientists? Though we'd probably find fairly high math ability in this group, I think we might find even greater deviations in certain relevant dimensions of personality. Think for example about the way curiosity factors into a scientific career. Or creativity. Then there's ambition...the ability to concentrate for extended periods of time...relative freedom from family responsibility during the 20's and 30's...and a level of perseverance and attention to detail bordering on the obsessive-compulsive. Or who knows - maybe scientists would turn out to be just like the rest of us.

Saturday, July 5, 2008

A Sound of Thunder

Most of the windows in our house look out across the little valley that separates New York's Taconic Range from the Green Mountains of Vermont. The elevation of the house, 1250 feet, gives us an eye-level view of approaching thunderstorms. On the hottest summer days, thunderheads gather on the far western horizon all afternoon, bunching higher and higher, until grey misty clouds spill over into the valley. The grey raft floats towards us, dropping a curtain of rain at its leading edge. Standing in the living room and looking out to westward, I can almost imagine I'm sailing into a squall.

Lightning occasionally strikes the marsh down below the house; the picture below shows a lightning bolt from a spectacular electrical storm just a few days ago.

This strike was about 300 yards from the house. We've had closer. A few years ago, our well pump motor was apparently fried by lightning that must have struck one of the trees on the south side of our yard. The current flowed into the ground and jumped to the wiring that leads into our mechanical room.


A couple of weeks ago, I was awakened during the night by a storm. The thunderclaps were loud, but it was actually the brightness of the lightning that disturbed my sleep. I lay for a time in the intermittent dark, calculating distances and velocities from the delay information, when it occurred to me that there was something peculiar about the sound of the thunder. Mixed in with the sounds of the storm, I heard a regular beat of deep booming sounds spaced a few seconds apart. The uniformity of the spacing in time marked the sounds as unnatural. It was then that I remembered an article I'd read a few days earlier in the local newspaper: Hail Cannon Stirs Complaints.

The regular booms I had heard were those of a "hail cannon." You can google around to learn more about these devices, but the point of them is to emit a loud sonic boom every few seconds, the goal being to suppress the formation of hailstones. Hail cannons were popular in Europe during the 1890's and 1900's, but they were gradually abandoned after they were perceived to be ineffective. Hail cannons have made a comeback in the last decade or two, and this year a large orchard in Bennington has purchased one and put it to use, causing dozens of complaints in Bennington and neighboring towns.

It is unlikely that hail cannons actually work. According to a 2006 paper by Dutch meteorologists Jon Wieringa and Iwan Holleman in Meteorologische Zeitschrift, the few publishable experiments that have been carried out yielded results ranging from mostly negative to inconclusive at best. And there appears to be no known meteorological mechanism that would lead one to believe in their effectiveness a priori. Wieringa and Holleman conclude that "the use of cannons or explosive rockets is waste of money and effort." The Commission for Atmospheric Sciences Management Group of the World Meteorological Organization also declared in 2007 that "...hail cannons...have no physical basis and are not approved."

Nevertheless, hail cannons are growing in popularity. In one sense, this is understandable. Hail is a very destructive force, causing billions of dollars of damage in the U.S. each year. And premiums for hail insurance are apparently extremely high. Caught between a hailstone and a hard place, a farmer (especially a gullible one) might view a $30,000 hail cannon as a lifesaver.

Of course, even a correct calculation about the risk of hail, the cost of insurance, and the cannon's own cost and effectiveness ignores another relevant factor, the property rights of the neighbors who are being rattled every time the cannon fires. In Bennington, a noise ordinance prohibits the cannon from being fired at night, but the orchard has violated the ordinance twice. The Town may be considering litigation against the orchard, and the neighbors may be doing so as well. In any event, I think it will take a judge to balance the perceived and real interests of the orchard against those of the surrounding property owners.

A healthy agricultural sector is in everybody's best interest. For this reason, some residents of the Town support the hail cannon, which the farmer has said he believes in "100 percent." But how can it be good for agriculture when a farmer spends $30,000 on a machine that in all likelihood fails to protect his crops? Would we applaud the farmer if he had spent $30,000 to have the Rite of the White Tiger performed on his property? Tibetan farmers used to believe in this anti-hail ritual 100 percent too. But belief alone didn't make it effective at preventing hail.

It can't be pleasant for farmers and car dealerships to have no good options for dealing with hail. But a noisy gimmick isn't a good option either. Perhaps Vermont and other states could consider setting up a taxpayer-subsidized hail insurance program. After all, insurance is a proven method for dealing with Acts of God. And it's also very quiet.