April 2, 2009

Duncan Hypocrisy Watch

The NYT reports that during a press phone call yesterday, Education Secretary Arne Duncan "unleashed a barrage of dismal statistics about the South Carolina schools" whose Governor, Mark Sanford, "has told the Obama administration that he would not accept some $577 million in educational stimulus money for South Carolina unless he could use it to pay down state debt."

During the putative barrage of dismal statistics Duncan noted that "only 15 percent of the state’s black students are proficient in math and that the state has one of the nation’s worst high school graduation rates."

This is a pot-calling-the-kettle-black moment.

Duncan, up until he was tapped by the Obama administration, served as CEO of the Chicago Public Schools. Why don't we take a look at how well the Chicago Public Schools fared under Duncan's astute management?

In 2007, only 10% of 4th Grade black students in Chicago tested at the proficient level in Reading. (Table A-5) And, only 9% of 8th Grade black students tested at the proficient level. (Table A-6)

In 2007, only 8% of 4th Grade black students in Chicago tested at the proficient level in Math. (Table A-5) And, only 6% of 8th Grade black students tested at the proficient level. (Table A-6)

These dismal results were obtained with spending of between $13k - $14k per pupil -- far higher than what South Carolina spends. Apparently, how much a district spends has little to do with how well it educates.

Now that's some performance information on our public schools that is "embarrassing."

28 comments:

Anonymous said...

So, what's the answer here, then? Should we spend the next 4-8 years ignoring states with horrible records because the Secretary of Education came from a failing district? I get that Duncan wasn't the best choice here (hell, I am available... :) ) but now that we are stuck with him, we shouldn't call out failure?

Dick Schutz said...

The failure that the Secretary of Education and the Obama Administration should be calling out is the policy failure of "standards and accountability by standardized tests." When "proficiency" is nothing more than a cut score on a test that is sensitive only to racial/SES differences, not only South Carolina, but all kids are in trouble.

Hypocrisy + ignorance does not ordinarily equal "Change we can believe in."

KDeRosa said...

Anon, I'm merely noting the hypocrisy. We should call out failure, starting with the sec'y's own high-spendin' low-performin' district.

Dick, the problem isn't so much that the tests aren't sensitive to instruction, but that much of the vocabulary/background knowledge continues not to be taught in school (or before school in the home for the low-SES crowd).

Tracy W said...

The failure that the Secretary of Education and the Obama Administration should be calling out is the policy failure of "standards and accountability by standardized tests." When "proficiency" is nothing more than a cut score on a test that is sensitive only to racial/SES differences, not only South Carolina, but all kids are in trouble.

Dick, I notice that you cite no evidence that the tests in question are only sensitive to racial/SES differences. Nor do you offer any evidence of a better option than standardised tests.
Do you have any evidence to support your opinions here?

CrypticLife said...

I looked for other possible interpretations, and while doing so noticed that 15% is exactly the nation's performance for black students, though "large central city" is only 13%.

Either way, sad, sorry performance, of course. Is there any chance it will be better if Duncan doesn't have direct responsibility for a single school district? Could he be better at being sec of ed than superintendent?

Dick Schutz said...

"the problem isn't so much that the tests aren't sensitive to instruction, but that much of the vocabulary/background knowledge continues not to be taught in school (or before school in the home for the low-SES crowd)."

The two factors go hand in hand. The fact that the tests in large part measure background info (which isn't taught) is what makes them insensitive to instruction.

Tracy, there are a baker's dozen papers, with references, at
http://ssrn.com/author=1199505
that respond to your question from various angles.

Malcolm Kirkpatrick said...

(Anonymous): "Should we spend the next 4-8 years ignoring states with horrible records because the Secretary of Education came from a failing district?"

We should ignore, or better, mock, insult, and humiliate, people who suggest that, beyond a rather low level, increases in school budgets will generate increases in school performance (as measured by standardized tests of Reading, Science, and Mathematics).

(Schultz): "When 'proficiency' is nothing more than a cut score on a test that is sensitive only to racial/SES differences, not only South Carolina, but all kids are in trouble."

Scores on standardized tests of Reading, Science, and Mathematics respond to institutional variables. See John Chubb and Terry Moe, Politics, Markets, and America's Schools; also Herman Brutsaert's comparison of government and Catholic schools in Belgium. See also Lockheed and Jimenez's comparison of government schools and independent schools in developing countries. See also Lassibile and Gomes on country-level performance, related to the application of market mechanisms (vouchers, etc.).
More obviously, standardized tests of Reading, Science, and Math respond to the age at which students take the test; a 15 year-old and a 5 year-old from the same family (hence, same parent SES) will test differently.

RMD said...

dick,

I still don't see what you're advocating . . . it's hard to imagine a world without standardized tests.

Besides, I know extremely good charter schools that are somehow able to make those test scores move dramatically.

Libby Maxim said...

rmd

dont unleash Dick and standardized tests

i use a reading tool for instruction that is so transparent i need no STs to let me know how the child is doing. the child's movement through the instruction is the assessment. not only is this time saving but it is very instructional, not only for the child but for the teacher. both child and teacher understand the process and the child can see where he has been and where he has to go. very motivating for the child

at any moment, i can point to exactly what said child can and cannot do in terms of the alphabetic code, the child's fluency and comprehension merely by observing him using the reading instructional tool

quite nice, and when parents come and observe, they also know exactly where their child is and i do not have to waste time on the QRI or standardized tests

and parents do not leave thinking, what the heck can my child do

with most STs, no one including the teacher can actually point to where the instruction fell down and know how to fix it to help the child

they merely see a score and make guesses as to why the child did poorly

lib

Anonymous said...

I don't think it's hypocritical to say "You're doing pretty poorly, you need the money." If Duncan himself had refused federal funds despite his track record, that'd be hypocritical. As it is, I can easily imagine Duncan saying "Yeah, I need the help." That's what he's calling Sanford on.

Tracy W said...

The two factors go hand in hand. The fact that the tests in large part measure background info (which isn't taught) is what makes them insensitive to instruction.

Dick, your initial claim was, to quote you "When "proficiency" is nothing more than a cut score on a test that is sensitive only to racial/SES differences"

You have now changed your argument. If tests are sensitive to background information then if that background information is taught, then the test will be sensitive to instruction. So you now are effectively admitting you were wrong to say that the tests are only sensitive to racial/SES differences. But of course you're not going to acknowledge this.

Tracy, there are a baker's dozen papers, with references, at http://ssrn.com/author=1199505 that respond to your question from various angles.

Dick, I asked for evidence behind your assertion. Let's take the first paper on the link you provided, written by yourself. In this you just make unsupported assertions. For example you say "Next, we save the items that have the highest relationship with all the other items we’ve saved. That leaves us with the purest measure of the thingy we can get. And we call it an achievement test."

Now of course you might do this. But that is not evidence that any serious psychometrician does it this way, let alone that all standardised tests are designed this way. This is you making stuff up. You can't even make stuff up in an internally consistent way. For example you describe a method of constructing a test when, for some unknown reason, you chuck out the "easy" items and the "hard" items. Having done this you assert that "First, we’ve morphed from “latent trait” to “ability” and then to “achievement.” Second, we’ve sliced the latent trait into age/grade levels"
But nothing in the procedure you describe has anything to do with latent traits, or age or grade levels.

You then start going on about constructing a bell curve. You make this assertion that "The IRT scores come out of the computer forming a normalized distribution with a mean of 0 and a range of +/-3." This is despite me pointing out to you in painstaking detail that IRT does not create or require a normalised distribution curve for tests; for example, in the reference material you yourself point to it is clearly stated: "The shape of the test characteristic curve depends upon a number of factors, including the number of items, the item characteristic curve model employed, and the values of the item parameters. Because of this, there is no explicit formula, other than equation 4-1, for the test characteristic curve as there was for the item characteristic curve." (see chapter 4 of http://echo.edres.org:8080/irt/baker/final.pdf, and yes I have quoted this to you before, I am disappointed but not surprised to see that it has made no impact on your memory).

Conclusion - you have no evidence to support your assertions about standardised tests.

As for the alternative method you describe to summarise:
- specify "the capability to be delivered"
- construct a set of 5-9 performance indicators
- aggregate the information.

Now if you standardise the assessment of the 5-9 performance indicators, so that we can be confident that your rating of a student on them is going to be roughly equal to that of any other trained assessor (assuming that the student hasn't learnt anything new between the two assessments), then we have a standardised achievement test. The purpose of standardising achievement tests is so we can do comparisons of achievement even if we can't have the same person testing all students (for example, we might want to test more students than Dick can assess single-handedly, or we might want a test that will still be useful even if Dick drops dead, or we might want to introduce some controls for observer bias that means that Dick can't carry out all the tests single-handedly).

So you spend all this time dissing standardised achievement tests, and yet you then describe something so close to a standardised achievement test. Yet you have the chutzpah to criticise the Secretary of Education for not dissing standardised achievement tests.

Libby Maxim - it sounds to me like you are using a curriculum so well laid out that it effectively doubles as a standardised achievement test. Very smart design. I will defend separate standardised achievement tests though because they allow us to compare different curricula.

Dick Schutz said...

"it's hard to imagine a world without standardized tests."

Standardized tests were not used as measures of individual achievement before the mid-1960s. In fact test companies before then specifically warned that the tests were not fit for this use.

"I know extremely good charters schools that are somehow able to make those test scores move dramatically."

Certainly. Ken's graph shows such schools too. The point is that these effects are built around a non-replicable set of school personnel and selected students.

No other sector of life uses such convoluted and arcane measures. They'd be laughed out of existence in a minute.

The main function of the tests has been to hold harmless the unaccountables who mandate their use.

Tracy, you need to do more homework. It's not possible to thrash out these technical matters in blog comments that quickly deteriorate to "tis-taint" and "so's your old man" level of dialog.

Parentalcation said...

Let's also note that South Carolina has some of the toughest standards in the country, so their stats are a lot more honest than other places.

It's fairly easy to get a direct comparison of numbers using the NAEP data...

Black Students - SC leads 265 - 248 (+17 margin)

Parents Graduated College - SC leads 291 - 267 (+24)

Eligible for School Lunch - SC 269 - 257 (+12)

South Carolina beats Chicago in every category, using the exact same test.

(All 8th grade Math scores in 2007)

Tracy W said...

Standardized tests were not used as measures of individual achievement before the mid-1960s.

This is false as I have pointed out to you before. For example, O-levels were introduced in Britain in the 1950s. See http://en.wikipedia.org/wiki/Ordinary_Level
And my own grandmother who went to high school and university in the 1930s reported sitting exams designed to measure her individual achievement, and in particular to see if she could get into university.

No other sector of life uses such convoluted and arcane measures. They'd be laughed out of existence in a minute.

This is also false. Medicine uses standardised tests for diagnostic purposes - see http://eprints.soton.ac.uk/10479/

Tracy, you need to do more homework.

Dick, how would me doing more homework fix your errors? I quoted where the text you referred to contradicted your claim about IRT scores coming out of the computer forming a normalised distribution - that's all I can do. I don't see how me doing more homework will result in you starting to get things right; it never has in the past. I think that you are going to have to put some effort into fixing your mistakes yourself.

It's not possible to thrash out these technical matters in blog comments that quickly deteriorate to "tis-taint" and "so's your old man" level of dialog.

Your errors are far larger than merely technical matters. For example, as I pointed out above, you present a hypothetical situation about how to construct tests. You then draw conclusions that actually appear nowhere in the hypothetical situation you made, for example you state that " Second, we’ve sliced the latent trait into age/grade levels". But you didn't slice the latent trait into age/grade levels in your hypothetical situation. This is not a merely technical error, this is far more fundamental than that.

Of course I will continue to point out your technical errors as you make them, as I adore thrashing out technical matters like these. If you want to discuss them somewhere else on the WWW to which I can get access, please say where. But drop the idea that you only have some technical errors to fix, that's just as wrong as anything else you've said.

RMD said...

Dick said
"Certainly. Ken's graph shows such schools too. The point is that these effects are built around a non-replicable set of school personnel and selected students."

Not true.

These schools have 1 thing in common: they use Direct Instruction. (one of them also uses Precision Teaching to measure fluency)

Other than that, their selection process is random (within the body of students available in that district).

And each of them does WAYYYYY better than the corresponding public schools.

One thing I will give you on standardized tests . . . they are usually not very good diagnostic tools if you want to find out whether or not students have learned the material you're teaching. So if you're teaching math, you might have to dive deep into the results to see if your students are, for example, mastering fractions, since the exam in question might not cover fractions, or not cover them sufficiently. And the test might also not tell you where they're getting hung up.

But, as a whole, it's interesting how some high-performing schools see tests as a way to demonstrate their superior performance (DI schools), while others complain and moan about them.

Dick Schutz said...

"Standardized tests were not used as measures of individual achievement [in elementary/secondary schools in the US] before the mid-1960s."

"A standardised test battery for assessing vascular and neurological components of the hand-arm vibration syndrome"

Are you telling me that this battery involves marking bubbles for multiple choice items, Tracy?
LOL

The Social Science Research Network provides a mechanism for anyone caring to post a paper, and it also includes a mechanism for contacting an author, if one wishes to point out errors. That's the sort of homework I was suggesting you might wish to do.

Hey, RMD. I've acknowledged that replicable results can be attained using DI and other well-developed instructional architecture--even using instructionally insensitive standardized achievement tests. But this is not because they are charter schools.

It's true that item analyses of standardized achievement tests are informative. But that's a long way away from reporting scaled scores. How many item analyses have been in the news?

Tracy W said...

"Standardized tests were not used as measures of individual achievement [in elementary/secondary schools in the US] before the mid-1960s."

Thank you for correcting this. Please remember it in the future.

And I will note that the SAT was introduced in 1926 in the USA, meant as a standardised measure of individual aptitude (not achievement) and intended to reduce the number of Jews in Harvard and other Ivy League universities in the USA.
(see http://www.jewishachievement.com/domains/edu.html)

Personally I think replacing standardised aptitude tests with standardised achievement tests for making decisions about people's future is a good thing.

Are you telling me that this battery involves marking bubbles for multiple choice items, Tracy?

No Dick, and you are trying to change your argument again. You were talking about standardised achievement tests in general, you did not make a single mention of marking bubbles for multiple choice items. If you want to argue that no other field uses marking bubbles for multiple choice items then make that argument. You'll still be wrong in your earlier assertion that no other field uses standardised tests however.

Also standardised tests do not require marking bubbles for multiple choice items. The important thing about standardised tests is that different markers get roughly the same result given the same set of answers (obviously a testee might give different answers at different periods of time). Having the testees mark bubbles is one way of doing this but it is not essential.

The Social Science Research Network provides a mechanism for anyone caring to post a paper, and it also includes a mechanism for contacting an author, if one wishes to point out errors. That's the sort of homework I was suggesting you might wish to do.

Dick, you are the one who said "The IRT scores come out of the computer forming a normalized distribution with a mean of 0 and a range of +/-3." This is despite me pointing out to you in painstaking detail that IRT does not create or require a normalised distribution curve for tests; for example, in the reference material you yourself point to it is clearly stated: "The shape of the test characteristic curve depends upon a number of factors, including the number of items, the item characteristic curve model employed, and the values of the item parameters. Because of this, there is no explicit formula, other than equation 4-1, for the test characteristic curve as there was for the item characteristic curve." (see chapter 4 of http://echo.edres.org:8080/irt/baker/final.pdf)

Fixing this is in your court and the amount of homework I do is irrelevant. I have pointed this error out to you, amongst some other errors, but you're the one who made the errors and you are the only one who can fix them.

Dick Schutz said...

If anyone wants to argue that standardized achievement tests generated by Item Response Theory are "fit for use" as sensitive measures of the academic expertise of individual students, with "proficiency" results reported in terms of cut scores, "bring em on."

I've never heard anyone make that argument. The furthest proponents go is, "the tests do have flaws but there is no other way" (words to that effect).

I've described a "better way." It didn't come "off the top of my head." It's rooted soundly psychometrically and is being applied routinely with good effect throughout the corporate world. If anyone finds the alternative methodology flawed that too might be worth talking about.

Item response theory IS a fit application for several situations. Its application in connection with NCLB in the US--which has been my focus--is not one of these situations.

Malcolm Kirkpatrick said...

(Schutz): "If anyone wants to argue that standardized achievement tests generated by Item Response Theory are 'fit for use'.."

Why "quotes" here?

(Schutz): "...as sensitive measures of the academic expertise of individual students, with 'proficiency' results reported in terms of cut scores, 'bring em on'."

Mr. Schutz, Let's stick to one argument at a time. You wrote:...

(Schutz): "The failure that the Secretary of Education and the Obama Administration should be calling out is the policy failure of 'standards and accountability by standardized tests'. When 'proficiency' is nothing more than a cut score on a test that is sensitive only to racial/SES differences, not only South Carolina, but all kids are in trouble."

This makes sense only if you maintain (as you wrote in an earlier thread) that standardized tests of academic achievement are sensitive only to parent SES (and now race, evidently).

NAEP, TIMSS, and PISA respond to institutional and other variables beyond parent SES and parent race.

Dick Schutz said...

"Fit for use" didn't really require quote marks. I tend to overuse them in comments because they're the only way to get emphasis other than CAPS, which are too strong.

"NAEP, TIMSS, and PISA respond to institutional and other variables beyond parent SES and parent race."

Yes they do. But they are also insensitive to "instructional" differences. SES/ethnicity is the popular non-instructional variable of interest, and that's why I said "only"

It's not that IRT generated achievement tests don't measure instructional accomplishments at all. They just do so too insensitively to be of much use--and of negligible use as measures of individual achievement.

There are selection "problems" in interpreting TIMSS and PISA results. They haven't done much for any country or for global education that I can see. But the show goes on.

Malcolm Kirkpatrick said...

(Malcolm): " 'NAEP, TIMSS, and PISA respond to institutional and other variables beyond parent SES and parent race.' "

(Schutz): "Yes they do. But they are also insensitive to 'instructional' differences."

In Politics, Markets, and America's Schools Chubb and Moe assert that standardized test scores respond to the policy of "tracking" (by which they apparently meant ability grouping). Several studies relate teacher quality (not Ed school credentials) to student performance. Analysis of test score data indicates that classroom use of calculators and computers is counter-indicated. As Ken has argued, direct instruction (or Direct Instruction) outperforms discovery methods in Math and Whole Language methods in early reading instruction.

(Schutz): "SES/ethnicity is the popular non-instructional variable of interest, and that's why I said 'only'."

Odd use of "only", seems to me.

Tracy W said...

If anyone wants to argue that standardized achievement tests generated by Item Response Theory are "fit for use" as sensitive measures of the academic expertise of individual students, with "proficiency" results reported in terms of cut scores, "bring em on."

This argument will not alter the fact that you are wrong in your assertions that standardised tests are only sensitive to racial/SES differences. It will not alter the fact that you were wrong in asserting that standardised achievement tests were not used before the mid 1960s. It will not alter the fact that your assertion that IRT-designed tests churn out a normalised distribution curve is wrong. It will not lead you to create internally-consistent arguments.

I've described a "better way." It didn't come "off the top of my head." It's rooted soundly psychometrically and is being applied routinely with good effect throughout the corporate world.

I note that you cite no evidence to support your assertion that it is rooted soundly psychometrically, nor to support
your statement that it is being applied routinely with good effect. By your stated standards, why do you expect anyone to believe you?

Nor is there anything in your described system of 5-9 performance indicators that leads me to believe that it is incompatible with IRT. IRT is a way of using the pattern of right and wrong answers on a test to estimate the testee's performance on other more or less difficult tests. If students achieve some of your key performance indicators and not others, IRT could be used to estimate the ability level of said students and thus how well they are likely to achieve on other sets of key performance indicators related to whatever ability the key performance indicators are trying to measure - be that reading, mathematical, driving or whatever. (Of course IRT doesn't help us in estimating the underlying ability of any students who achieve all of the key performance indicators nor that of any students who achieve none of them.)
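The estimation step described above can be sketched in a few lines. This is a toy illustration, not production psychometrics: the item parameters are invented, and the grid-search maximiser stands in for the numerical methods real scoring software uses.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct answer
    at ability theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of an observed right/wrong pattern at ability theta."""
    total = 0.0
    for (a, b), right in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p) if right else math.log(1.0 - p)
    return total

def estimate_ability(items, responses):
    """Crude grid-search maximum-likelihood estimate of theta.
    (Real scoring uses Newton-Raphson or Bayesian methods; and the
    MLE runs off to the grid boundary for all-correct or all-wrong
    patterns, matching the caveat in the parenthetical above.)"""
    grid = [x / 10.0 for x in range(-40, 41)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

# Invented (discrimination, difficulty) parameters for three items
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
print(estimate_ability(items, [1, 1, 0]))  # stronger response pattern
print(estimate_ability(items, [1, 0, 0]))  # weaker response pattern
```

Two items passed out of three yields a higher ability estimate than one out of three, which is all the "pattern of right and wrong answers" machinery amounts to at this scale.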

I do however think that you are right in putting quote marks around the words "better way", I am exceedingly doubtful as to whether it is a "better way" since your method appears to only be different from a standardised test in being worse - in particular it does not include any controls for inter-marker reliability, which creates some problems in comparing results from different schools in any large country (or indeed any country which is not very small).

Item response theory IS a fit application for several situations. Its application in connection with NCLB in the US--which has been my focus--is not one of these situations.

I actually agree with you on this, as the NCLB is only interested in whether students reach certain minimum standards, not in estimating what their underlying ability is, which is what IRT seeks to estimate. I can agree with two statements in one of your blog comments - perhaps this is a record!

Dick Schutz said...

"NCLB is only interested in whether students reach certain minimum standards, not in estimating what their underlying ability is"

Precisely. That's what concerns US parents, citizenry, and government. In former President Bush's words, NCLB was to answer the question, "What is our children learning." Standard achievement tests, administered in grades after formal instruction in reading has ended in the general curriculum are a very blunt instrument for answering the question.

The "minimum" of reading is fundamental to all further acquisition of academic expertise, so its importance is not to be minimized.

Tracy W said...

Dick, you are changing your argument again. You were saying that "Item response theory IS a fit application for several situations. Its application in connection with NCLB in the US--which has been my focus--is not one of these situations."

The goals of the NCLB don't depend on IRT being used in the design of standardised achievement tests. A far more fundamental problem with the NCLB's standardised achievement tests, it strikes me, is that the individual states get to define them - another thing that has nothing to do with IRT.

Dick Schutz said...

NCLB mandates the use of standardized tests. All test publishers follow IRT in constructing the tests and reporting the results. The states can use whatever test they like. Where the variation among states comes into play is in the cut scores used to determine "adequate yearly progress" each year.

Tracy W said...

Dick - you cite no more evidence in support of your assertions here than you did in support of your assertion that your "better way" is rooted soundly psychometrically or that it is being applied with good effect.
Given your asserted standards about evidence, I am starting to wonder if even you believe the things you say or if this is all some form of performance art.

Dick Schutz said...

Tracy:
Re psychometric grounding. Check out "Guttman scales"; Wikipedia is a good place to start.

Re common application. Google for "business dashboards"
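For anyone who doesn't want to chase the Wikipedia article, the core idea of a Guttman scale fits in a short sketch. This is a toy illustration of the cumulative-pattern idea only, not Dick's actual methodology, and the error-counting convention shown is one of several used in the literature.

```python
def is_guttman_pattern(responses):
    """In a perfect Guttman scale (items ordered easiest to hardest),
    every pass precedes every fail -- e.g. (1, 1, 1, 0, 0)."""
    return list(responses) == sorted(responses, reverse=True)

def reproducibility(patterns):
    """Coefficient of reproducibility: 1 - errors / total responses,
    counting as an 'error' each response that deviates from the
    perfect cumulative pattern with the same number of passes."""
    errors = total = 0
    for pattern in patterns:
        ideal = sorted(pattern, reverse=True)
        errors += sum(1 for got, want in zip(pattern, ideal) if got != want)
        total += len(pattern)
    return 1.0 - errors / total

# Four hypothetical response patterns on a three-item scale
patterns = [(1, 1, 0), (1, 1, 1), (1, 0, 1), (0, 0, 0)]
print(reproducibility(patterns))
```

A coefficient of reproducibility of 0.90 or higher is the conventional threshold for treating a set of items as a usable Guttman scale.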

Tracy W said...

Dick, I didn't ask for advice on what might be an interesting topic to google nor do I want it. You are the one that asserted that your "better way" is rooted soundly psychometrically or that it is being applied with good effect. If you have actual evidence, provide a decent reference to it.

My Googling skills do not extend to being able to read your mind to figure out what evidence, if any, you based your assertions on. Especially since the last time you provided me with a link to an actual reference material, the Baker book, it turned out to thoroughly contradict what you claimed about IRT.