July 16, 2003

Apples to apples evaluation of charter schools

The Manhattan Institute Center for Civic Innovation has just released the Working Paper Apples to Apples: An Evaluation of Charter Schools Serving General Student Populations, by Jay P. Greene, Greg Forster, and Marcus A. Winters (July, 2003). In the report the authors confuse the ability of schools to improve themselves with their ability to improve their students, as this Web contribution will explain. It is an elementary error that completely invalidates the report.

I remark that in February 2003 the same authors produced a report, Testing High Stakes Tests: Can We Believe the Results of Accountability Tests? I wrote a Web review of that report in which I explained that the authors confused the predictive power of a high-stakes test with its validity as a measure of student learning. That too was an elementary error that completely invalidated the report. (In both cases the report's conclusions are plausible, but that is beside the point.)

The present Apples to Apples report sets out to compare the performance of charter schools with that of public schools serving similar populations. (Given the wide range of educational policies in place in charter schools as well as in public schools, I'm not sure that the question is all that interesting, but let's accept the question anyway.) In order to compare similar schools, the report focuses on charter schools that serve a general student population, and the control group of public schools is formed by taking, for each charter school, the nearest public school that also serves a general population.

The measure of performance is whatever standard statewide test is in place. Now I remind the reader of the concept of value-added assessment. See, for example:

Value-added assessment employs, ideally, performance data on individual pupils over multiple years, and looks at improvements over time. It is a way to factor out the effects of different student backgrounds, because these are, one assumes, reflected in their initial test performance. If one doesn't have data on individual pupils, then one can use data on grades within a school. In that case the incremental performance that one cares about is that between a certain grade in one year and the next higher grade the next year, on the assumption that this involves approximately the same student population.
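A minimal sketch of the grade-to-grade version, assuming school-level mean scale scores keyed by (year, grade). The numbers and the function name cohort_gains are hypothetical, chosen only for illustration; each gain compares grade g in one year with grade g+1 the following year.

```python
# Hypothetical mean scale scores for one school, keyed by (year, grade).
# These numbers are invented for illustration only.
scores = {
    (2001, 3): 200.0, (2001, 4): 215.0, (2001, 5): 230.0,
    (2002, 3): 201.0, (2002, 4): 217.0, (2002, 5): 231.0,
}

def cohort_gains(scores, year):
    """Gain from grade g in `year` to grade g+1 in `year + 1`.

    Assumes next year's grade g+1 is roughly the same group of pupils
    as this year's grade g. The top tested grade has no successor and
    is skipped.
    """
    gains = {}
    for (y, g), s in scores.items():
        nxt = scores.get((year + 1, g + 1))
        if y == year and nxt is not None:
            gains[g] = nxt - s
    return gains

print(cohort_gains(scores, 2001))
```

The grade-5 pupils of 2001 leave the school, so they contribute no gain; every remaining number follows a single cohort.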

Greene et al. could certainly have used such grade-to-grade value added assessment in their work. However, they did something different. They look at the overall performance of each school in one year and compare it to the overall school performance the next year. The school performance is measured in whatever way the state measures it: typically some average scale score or a percentile rank within the state. They do this for each tested subject separately, but not separately for each grade. They then compare the year-to-year changes in performance of the charter schools to the year-to-year changes in performance of the nearest public schools. They find, finally, a small (in fact, very small) advantage for charter schools on this measure. In the executive summary they express their observations as follows:
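For contrast, here is a rough sketch of the whole-school measure as described above. It assumes the school's yearly score is a simple mean over its tested grades; the data and function names are hypothetical, and the report's conversion of results into standard-deviation units is left out.

```python
from statistics import mean

# Hypothetical mean scale scores keyed by (year, grade); invented numbers.
charter = {(2001, 3): 200, (2001, 4): 210, (2001, 5): 220,
           (2002, 3): 202, (2002, 4): 212, (2002, 5): 222}
public = {(2001, 3): 205, (2001, 4): 215, (2001, 5): 225,
          (2002, 3): 206, (2002, 4): 216, (2002, 5): 226}

def school_score(scores, year):
    """One number per school per year: here, the mean over tested grades."""
    return mean(s for (y, _grade), s in scores.items() if y == year)

def school_change(scores, year):
    """Change in the whole-school score from `year` to `year + 1`."""
    return school_score(scores, year + 1) - school_score(scores, year)

def charter_advantage(charter, public, year):
    """Charter school's change minus the paired public school's change."""
    return school_change(charter, year) - school_change(public, year)

print(charter_advantage(charter, public, 2001))
```

Note that nothing here follows a cohort of pupils: the 2001 and 2002 scores describe partly different children.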

Measuring test score improvements in eleven states over a one-year period, this study finds that charter schools serving the general student population outperformed nearby regular public schools on math tests by 0.08 standard deviations, equivalent to a benefit of 3 percentile points for a student starting at the 50th percentile. These charter schools also outperformed nearby regular public schools on reading tests by 0.04 standard deviations, equal to a benefit of 2 percentile points for a student starting at the 50th percentile.
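The percentile figures in the quotation follow from a normal-distribution conversion of the effect sizes. A short check, assuming normally distributed scores and using Python's statistics.NormalDist; the function name is ours:

```python
from statistics import NormalDist

def percentile_after_gain(effect_size_sd, start_percentile=50.0):
    """Percentile reached by a student who starts at `start_percentile`
    and gains `effect_size_sd` standard deviations, assuming normally
    distributed test scores."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_size_sd)

for subject, d in [("math", 0.08), ("reading", 0.04)]:
    gain = percentile_after_gain(d) - 50.0
    print(f"{subject}: {d} SD -> +{gain:.1f} percentile points")
```

The gains come out at about 3.2 and 1.6 points, which round to the 3 and 2 points quoted.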

And so, the authors completely confuse a measure of the improvement of schools with a measure of the improvement of student performance. Charter schools could be performing wonderfully or they could be performing dismally relative to public schools in improving student performance, and it would not be seen in the whole-school, year-to-year test score improvements that are the basis of this report. It would be seen, of course, in traditional value-added assessment at the pupil or grade level.

Posted by Bas Braams at July 16, 2003 04:26 PM


Wouldn't the schoolwide measure be pretty much an average of the progress at each grade level? I guess it would be off by one grade, since the oldest students the first year would be gone by the second year, but it would be pretty close and simpler than comparing each grade level. Or am I missing something? Statistics is not my strong suit.

Posted by Joanne Jacobs at July 16, 2003 07:47 PM

The issue is precisely that the oldest students are gone the next year. Let's say that we are dealing with grade school, K-5, and the annual statewide tests start in grade 3 (a common situation), so our schools are rated on the performance of the kids in grades 3-5. Suppose that the charter schools do spectacularly well. The kids that enter in K are on average at the 20th percentile; by the time they are tested in 3rd grade they are at the 50th percentile, and by the time they are tested in 5th grade they have made it all the way up to the 80th percentile. The charter schools repeat this spectacular success from year to year. On the Greene et al. measure they would not be making any progress; they are flat at somewhere like the 65th percentile. If the charter schools are consistently a pathetic failure, again the Greene et al. measure won't show it. All that it shows is whether charter schools are improving over time relative to public schools. It is a measure of how well charter schools are improving their own performance, not of how well they are improving their students' performance.

Posted by Bas Braams at July 16, 2003 08:57 PM

I found the report somewhat confusing; I think it could be written more clearly. I think when he writes of standard deviations, for example, he is referring to test scale distributions, and not the standard deviations of his sample of score means. But he should be publishing the calculations and results of the latter statistical tests.

I don't think it is so bad that he is comparing school mean scores in one year to school mean scores in the next year. He can't get any better data than that, I'm sure. He's making do with what is available. And, they must be grade level scores. I don't think there is such a thing as a mean test score for an entire school with several grades mushed together.

I have more trouble with the assumption that he has comparable groups. The fact that he is comparing a charter school to the nearest public school (if I understood that part correctly) almost assures that they are not comparable, since presumably any kid at either school must have made a conscious, deliberate choice for one or the other, ...lots of selection bias (and the kids in charter schools are probably the ones with more initiative, more adventurous). I think it would be better to compare charter schools, or cities with charter schools, to public schools in other, demographically similar cities that have no charter schools, or with those entire cities ...a sort of paired cluster design.

Posted by Richard Phelps at July 16, 2003 11:32 PM

Well, OK, Bas has implied that I am being much too wimpy about this. Maybe he's right. The authors could have done a value-added study--it would have taken more time, would have been more complicated, and would have given them a somewhat smaller sample size. But, in the end, they would have had a better study. There are enough states that test in more than one grade, some even in consecutive grades, that the authors could have used synthetic cohorts.

Bas is right, I believe, that at best with the current study they are looking at how well the schools improved what they were doing in a year's time, and not how much they improved their students. This, in a way, biases the study in favor of the newer schools (the charter schools) which are higher up on the learning curve. Newer schools are likely to make larger incremental gains than older schools, because they start from a higher point on the learning curve.

Maybe I was being wimpy about it in part because I am astonished by, but clueless about, what is happening with MI education reports. This is the third one I have read in the past year, which amounts to all that I have read in the past year, that has been thoroughly suspicious. I asked a colleague who tends toward the "establishment" side of the fence about this and he said "they have an agenda, you know." Maybe they do; I certainly wouldn't know. Maybe they are so tired of the research fraud of "the other side" that they have become attracted to research without restraint.

The worrying aspect is that dozens of education reform groups around the country accept what they get from MI as gospel, and spread it. They assume, as I used to, that an MI report was dependably valid and reliable.

Posted by Richard Phelps at July 18, 2003 07:41 PM