January 5, 2009

Bamboozling the Gifted

One of the reasons I've been neglecting the blog is that I've been forced to learn Pennsylvania's rules for gifted education.

Here's how gifted education is supposed to work according to the statute:

1. Student is identified as being gifted, i.e., an IQ of two standard deviations above the mean (with some leeway which allows schools to fudge the results a bit for students just missing the cutoff).

2. The gifted student's present level of educational performance is then determined to see where the student is academically. For example, a third grade student might be reading on a fifth grade level and doing math on a fourth grade level.

3. Then the student's instruction is supposed to be specially designed, i.e., individualized, to meet the needs of the student.

4. Annual goals (what the student is supposed to learn this year) and short-term learning objectives (the steps the student is to take to reach the goals) are then developed.

5. And the whole plan is memorialized in a written document (GIEP) which must be approved by the student's parents.

That's how things are supposed to work in theory. In actuality, things typically work a little differently. Here's how it works in practice in most school districts:

1. Student is identified as being gifted.

2. School district recommends that the student participate in its gifted pull-out program, which typically entails "enrichment," not acceleration.

3. Student receives some "differentiated" school work (i.e., semi-random worksheets) in class (because the courts have determined that a gifted pull-out program is not sufficient by itself).

4. Fuzzy goals and learning outcomes, typically subjective, unquantifiable, and/or untestable, are listed in the student's educational plan.

5. Plan is presented to the student's parents for approval without informing them that the district's recommendation is merely a preference and that other options are available to the student.


I'd characterize this as the school's way of discharging the regulatory burden of providing gifted education with the minimal amount of work and the minimal amount of additional academic expectations. Instead of the student's needs being paramount, as the law intends, the district's administrative convenience is paramount.

As a parent of a regular education student you basically have no say in how your child is educated in the public school system. You don't agree with the school's choice of fashionable curriculum? Too bad; move to a new school district. But once your child is identified as gifted (or "special" at the other extreme) they become statutorily protected. Now the parent does have a say. But unfortunately, most parents willingly (if perhaps unwittingly) sign away this right as soon as they accept the district's recommendation, which is, as I described above, specifically designed to appear to be doing something for the student without actually doing, or being responsible for accomplishing, much of anything.

The school's favorite way of accomplishing this goal is to specify academic "enrichment" for the student. So, what is enrichment? It's one of those education weasel words. It could mean almost anything. But I think my definition of enrichment is a good functional one:

Enrichment is not acceleration.

That cuts right to the chase. If the student is receiving enrichment, he's not receiving acceleration. He might be learning more, but that "more" being learned isn't the stuff needed to make it to the next level.

Let's say the gifted student is capable of learning 50% faster than the regular education instructional pace. This means that in two years the student is capable of learning three years of academic content. If the student were in third grade and were being accelerated, he'd be ready to tackle sixth grade level work by the end of fourth grade (two years). However, if the student were being enriched, he would likely only be prepared to do fifth grade work at the end of fourth grade.

Maybe an illustration would help.

The first three light blue ovals represent how much the regular student needs to learn. The light green circles represent how much our hypothetical gifted student learns in a given year (150%) in an enrichment program. The gifted student is clearly learning a lot more than the regular student for the three years of grades 3-5 depicted. At the end of those three years, however, the student is still only prepared to do sixth grade work.

Let's contrast this with an acceleration program.


The student has learned the same amount of material, but the learning is focused in the direction of what the student needs to know to progress through the grades. The result is that after the same three years of learning, the accelerated student is ready to do work at grade 7.5 instead of grade 6 as in the enrichment example above.
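
For those who prefer numbers to ovals, here's a minimal sketch of the arithmetic behind the two illustrations, using the same assumptions (a 150% learning pace, a grade 3 starting point, three years of instruction); the constants are illustrative only:

```python
# Minimal sketch of the arithmetic behind the illustrations above.
# Assumptions (illustrative only): the gifted student learns 1.5
# grade-years of content per school year, starting in grade 3.
PACE = 1.5        # gifted learning rate relative to the regular pace
START_GRADE = 3
YEARS = 3         # grades 3-5, as in the illustrations

def readiness(accelerated):
    """Grade level the student is prepared to do after YEARS of instruction."""
    if accelerated:
        # All of the extra learning is aimed at the next grade's material.
        return START_GRADE + PACE * YEARS
    # Enrichment: the student still learns 1.5x as much each year, but only
    # one grade-year of it moves him toward the next grade.
    return START_GRADE + 1.0 * YEARS

print(readiness(accelerated=False))   # 6.0 -> ready for sixth grade work
print(readiness(accelerated=True))    # 7.5 -> ready for grade 7.5 work
```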

Acceleration seems, at least to me, to be the preferred course of action for the gifted student. School districts, however, don't see it this way. The vast majority of schools only want to offer enrichment pull-out programs for their gifted students. Why do you suppose this is so?

I think that the reason is that there's increased accountability in accelerating the gifted student. In my example, the gifted student should be ready to do sixth grade level work by the end of two years instead of three. If the student isn't ready, then something has gone wrong and the student hasn't learned what he was supposed to. Someone is going to be blamed, and who wants that aggravation, especially considering these are the kids who should be coasting through the system and taking up less of the teacher's time, allowing her to focus on the other kids.

The other reason is that acceleration programs present administrative challenges for the school, since these gifted kids will have to be separately tracked or perhaps taught in a different grade for some subjects.

Nonetheless, the statute clearly places the student's needs above the administrative problems of the schools, so this last factor shouldn't be an issue in theory. In practice, you know it is. This is a monopoly we're dealing with, and monopolies don't care about their customers -- where else are they going to go? And who cares anyway? The same tax dollars are still going to flow into the coffers every year.

48 comments:

Stephen Downes said...

The presumption is that acceleration is good for gifted students. But I don't see a good reason to believe that this is true.

KDeRosa said...

I don't think I made that normative judgment in the post or implied that conclusion. I did state a preference, but not that one was better than the other. Maybe that's fodder for another post.

Parry Graham said...

Ken,

A couple thoughts.

First of all, I don't see enrichment and acceleration as mutually exclusive. I also don't see them as synonymous.

Second, in terms of schools' predilection to favor enrichment over acceleration, I think your second reason (administrative challenges with acceleration) is the more likely of the two. From a scheduling standpoint, acceleration (especially when you're talking about students who are at one level in one subject and a different level in another subject) is incredibly difficult.

Finally, I think that there is more complexity to the acceleration/enrichment dichotomy than you provide. For naturally sequenced curricula (math being the most obvious example), acceleration seems sensible, as children are introduced to more advanced concepts that would typically be introduced in later years according to the curriculum (e.g., multiplying and dividing fractions in 4th grade rather than in 6th). In this situation, “enrichment” could feel like a cop out: rather than moving on to multiplying fractions, a child just practices adding and subtracting increasingly larger fractions

For other subjects, however, I’m not sure I see as much of a distinction. For example, science and history allow for tons of enrichment (i.e., content depth) without acceleration. In other words, a gifted child can spend considerable time studying US History in greater depth, rather than moving on to World History, which might be the next year’s major topic.

Of course, gifted education usually focuses on math and reading. But even in reading, I see the same flexibility. A gifted child doesn’t have to move on to poems, even though that’s a topic for the following year, but can spend additional time on short stories, reading more complex stories and engaging in more sophisticated assignments in response to the reading. In any well-run language arts class, students are reading texts at various levels of complexity and engaging with texts at various levels of sophistication. You don’t have to be in a 5th grade class to be challenged to read and write at a 5th grade level.

Parry

Robert Sperry said...

The scarier presumption is that you would let the regular public school system near your child.

It’s good to read you again :)

Anonymous said...

The research solidly supports acceleration as an appropriate strategy to meet the needs of gifted students. Unfortunately, too many educators believe only myths and misinformation with respect to academic acceleration.

Ohio requires that every school district have an acceleration policy based on the(research-based)state model, but it is slow going getting districts to do more than pay lip service to the requirement. Ohio has begun audits to take a closer look at what is or is not happening for gifted students.

Yes, enrichment can indeed be integrated into any accelerated learning opportunity, but enrichment alone is a cop out.

If you talk to individual school district officials you find the real reason why acceleration is seldom employed -- money. Districts are funded based on student count. If students are allowed to progress at their own pace and graduate early, the money attached to that child leaves as well.

Anonymous said...

You mean to tell us you've been spending your time trying to turn educationese into something that makes sense, Ken?

The policy and practice re "gifted" is as archaic and misguided as policy and practice with respect to the "learning disabled."

A good antidote to clear your head of the educationese is Malcolm Gladwell's best seller, "Outliers." He convincingly demonstrates that "determined practice" is the key to expertise in all areas, irrespective of "talent" -- in this case, general ability.

That's pretty much a DI tenet too, isn't it.

The labeling of students based on an "intelligence test" is flat-out stupid. And arguing the relative merits "acceleration" vs "enrichment" just buys into the stupidity.

There's a lot of bamboozling going on here.

Anonymous said...

I think many teachers sincerely think that acceleration is a bad idea. They fear children will suffer socially.

Remember that many gifted students aren't just a year or two ahead. My husband skipped a grade and remained way ahead of his new classmates.

My ex-husband's daughter skipped high school and started college at 14. She loves it.

tm said...

A homeschooling friend of mine has been soliciting advice about how to approach teaching math to her precocious 7 year old--he's doing 4th grade math in her homeschool curriculum. She said a lot of other homeschoolers recommended enrichment. I told her that enrichment is what public schools do because they won't let the kids work at the speed they are capable of. I recommended making sure he truly understood the material by asking in-depth questions and approaching the problems from different angles than the curriculum does (that may help ferret out any issues where he only understands the math when it is phrased the same way as the practice problems). Then I told her to let him have at it. There is no reason to hold him back as long as he has a full understanding of what he has learned already.

Acceleration does not have to mean skipping grades and possibly being in a position to be out of lock-step with children of the same age. I was in a gifted class--where a whole class was taught with special curriculum and the best/better teachers in the school--in middle school and high school. All subjects were taught with more depth and more quickly than for average students. I am not addressing the cost factor, though my small school district found a way to manage. One can also point out that not every gifted student should be in the same class, but it sure beats being in a slower class.

I do not understand why it is assumed that being with children of the same age is supposed to be a good thing. I would like to see evidence that retarding a student's academic achievements in the name of age-leveling has benefitted them in the long term (adult salary, happiness, and responsibility--not some childhood adjustment period).

Anonymous said...

Thus has it always been.

I think the crux of the matter is that acceleration means more w-o-r-k for the teacher. If we "enrich" Johnny Smartkid, we might get him out of our hair for a couple hours a week. If we "accelerate" him, we have to come up with different stuff for him to do every day from what the main group is doing. Then we have to teach him and grade the assignments. Bad enough if you've got one in your class, but what if you have 3 and they're all moving at different paces??

Granted, it isn't a big deal to let a kid work ahead in a structured curriculum. Fine, let him have a couple extra Kumon sheets or whatever daily. Most teachers, however, have crappy textbooks built on crappy pedagogy that require oodles of teacher supplementation to work in the first place and do not lend themselves to working ahead.

Pull-out programs are little more than geekbranding.

And don't even get me started on "developmental appropriateness", that politically correct way of saying "we can't let Johnny be better than Susie at any cost." Except, of course, when it comes to sports. Then all bets are off. Let Johnny work as hard as he wants at throwing a ball, just keep him from doing division until the rest of class is ready!

Malcolm Kirkpatrick said...

Students serve as window-dressing in a massive make-work program. Enrollment determines the budget of the government schools. Schools therefore have no incentive to release students early. The crux of the matter is that most students could move through the English, History, Math and Science curriculum which a standard K-12 school takes 12 years to impart in four or five years less time, but then there would be fewer employed teachers, lower dues revenue for the NEA, AFT, and AFSCME, and fewer padded construction and supplies contracts.

Malcolm Kirkpatrick said...

Kim: "A homeschooling friend of mine has been soliciting advice about how to approach teaching math to her precocious 7 year old--he's doing 4th grade math in her homeschool curriculum."

Tell your friend to teach her child the notation of set theory and logic, Arithmetic through addition and subtraction of fractions, then Algebra. It worked for one of my students.

Kim: "I do not understand why it is assumed that being with children of the same age is supposed to be a good thing."

Age segregation is a horrible idea. Please read this one page Marvin Minsky comment on school and this article on artificially extended adolescence by Ted Kolderie.

Malcolm Kirkpatrick said...

Downes: "The presumption is that acceleration is good for gifted students. But I don't see a good reason to believe that this is true".

The alternative to acceleration is incarceration. In Hawaii, juvenile arrests fall when school is not in session. Juvenile hospitalizations for human-induced trauma fall when school is not in session.

"...(M)any well-known adolescent difficulties are not intrinsic to the teenage years but are related to the mismatch between adolescents' developmental needs and the kinds of experiences most junior high and high schools provide. When students need close affiliation, they experience large depersonalized schools; when they need to develop autonomy, they experience few opportunities for choice and punitive approaches to discipline..."(Linda Darling-Hammond, professor of education, Stanford University), Kohn, "Constant Frustration and Occasional Violence", American School Board Journal, September 1999.


In addition to the psychological benefits to children, accelerating children through the K-12 curriculum would reduce the burden on taxpayers.

Anonymous said...

As the product of a school that didn't offer enrichment or acceleration, I feel bamboozled...but....

Look at steps 1-5 for identifying and educating gifted students...shouldn't we be doing steps 2-5 for all students?

Stephen, I don't understand your comment, but I think the difference is that I'm giving the benefit of the doubt that the acceleration will follow a challenging but obtainable pace for each student. In that case, acceleration is providing more options to the student, and that is undoubtedly a good thing.

For example, I might want to take accelerated math courses and cover higher-level Calculus before I graduate high school...or I might just take accelerated courses and complete my math credits by freshman year, freeing up more time for athletics, or arts, or technology electives. (Creating extended gaps between the last high school math class and the first college math class is a problem, but a solvable one that definitely doesn't negate the entire concept of acceleration.)

Anonymous said...

Anon... says, "Look at steps 1-5 for identifying and educating gifted students...shouldn't we be doing steps 2-5 for all students?"

What steps are you referring to? I must have missed the staircase somehow. I'm guessing that the answer to your question is "yes" but I'd like to know what the steps are.

My position is that all kids should advance as rapidly as they can. The question is what are they advancing on or through. "Numbered Grades" are not a good basis for elementary education, and only in Math and Science are "Courses" a good basis in high school.

As an alternative, I've submitted that we need to define the academic expertise to be delivered, with stated prerequisites for beginning and transparent indicators for ending.

Malcolm Kirkpatrick said...

Schultz: "As an alternative, I've submitted that we need to define the academic expertise to be delivered, with stated prerequisites for beginning and transparent indicators for ending."

And then...? If you propose to use clear measures of student performance to discipline teachers whose students do not measure up, you will have a costly, losing political battle against the NEA/AFT/AFSCME cartel, who will observe (correctly) that out-of-school variables influence performance. If you propose to tie clear performance measures to incentives for students (such as testing out of school), I'm 100% behind you, but I wonder: what were we arguing about last week?

Malcolm Kirkpatrick said...

Parry: "From a scheduling standpoint, acceleration (especially when you're talking about students who are at one level in one subject and a different level in another subject) is incredibly difficult."

In the current system, yes. Schools have created this problem for themselves. Self-paced curricula would free schools from this difficulty, but then the public would see that professional teachers are not as vital a component of education as the NEA/AFT/AFSCME cartel would have you believe. The economist James Buchanan attributed his success, in part, to his education in a rural one-room schoolhouse, where older students had to work independently while the teacher worked with the younger students.

Anonymous said...

Malcolm says: "If you propose to use clear measures of student performance to discipline teachers whose students do not measure up, you will have a costly, losing political battle against the NEA/AFT/AFSCME cartel, who will observe (correctly) that out-of-school variables influence performance."

C'mon. This isn't about "disciplining teachers." The failure to deliver instructional accomplishments may well stem from flaws in the instructional products being used. It may stem from counter-productive practices that teachers have been taught to use without any basis. And it could well be that the teacher is "dogging it."

The thing is, at present there are NO means of sorting any of these matters out. All of the "accountability" is on the teachers' backs and we're using measures that are insensitive to instructional differences to reward and punish teachers.

And the insanity goes unnoticed in the rhetorical fog and the muck of "instructional resources."

Of course, "out-of-school variables" affect student performance. But correlational data using tests that are sensitive only to SES differences provide false information concerning those variables. More simply, if instruction teaches reliably teaches kids to read and to acquire other desired expertise, the correlation between SES and achievement is inherently reduced to near-zero.

The limitation is not with teachers and kids. The higher up the EdChain you go, the more glaring and consequential the weakness.

Malcolm Kirkpatrick said...

Schultz: "This isn't about 'disciplining teachers'. The failure to deliver instructional accomplishments may well be flaws in the instructional products being used. It may be in what the teachers have been taught to do without any basis that is counter-productive. And it could well be that the teacher is 'dogging it'."

--Someone-- selects "instructional products". I was a Secondary Math teacher for ten years (now I tutor). I made my own worksheets, since modern US K-12 Math textbooks distract and confuse more than they instruct. --Someone-- follows methods recommended by nitwit Professors of Education. I held myself responsible for taking students to the level that the teacher in the next class in the sequence would expect on day 1. That meant ignoring 90% of the advice which nitwit Professors of Education dispense.

Teachers inevitably share some of the responsibility for overall system performance (as measured by standardized tests), or why do we have teachers at all?

Standardized tests and juvenile crime rates indicate serious problems in the US State-monopoly school system. Standardized tests will not guide reform so long as insiders decide what tests to use and what use to make of test results.

Schultz: "All of the "accountability" is on the teachers backs...

Not that I have seen. That would be the case if parents had the power to decide which institution shall receive the taxpayers' K-12 education subsidy. Currently, no one except students and taxpayers suffers from poor system performance. There is no accountability at all.

Schultz: "...and we're using measure that are insensitive to instructional differences to reward and punish teachers...Of course, 'out-of-school variables' affect student performance. But correlational data using tests that are sensitive only to SES differences provide false information concerning those variables."

Please expand on this. What "measure...insensitive to instructional differences" are "we" using "to reward and punish teachers"? Which "tests...are sensitive only to SES differences" and what "false information" about "out-of-school" variables do these tests convey?

Anonymous said...

Malcolm says: "Teachers inevitably share some of the responsibility for overall system performance"

Of course. But how about the publishers and teacher educators at the top who you rightfully lambasted and then quickly let off the hook? I'm suggesting making these unaccountables share the responsibility with info that separates the sheep from the goats here also.

Malcolm says: "There is no accountability at all." Try telling that to the teachers and principals who have been beaten over the head with NCLB sanctions.

Malcolm says: "Please expand on this. What "measure...insensitive to instructional differences" are "we" using "to reward and punish teachers"?

Standardized achievement tests. The theory/practice underlying these tests does not permit anything other than arbitrary cut scores on an ungrounded scale that are interpreted as degrees of "proficiency."

The results are sensitive only to differences in SES, because the tests boil down to measures of "general ability." That is, irrespective of the subject matter label, the test results all correlate as highly with one another as the reliabilities of the tests permit.

The tests are sensitive only to SES differences because general ability correlates with SES due to the differential experiences of economic poverty and prosperity.

Malcolm Kirkpatrick said...

Mr. Schultz,

The point of my question: "And then...?" was that your defined "academic expertise to be delivered, with stated prerequisites for beginning and transparent indicators for ending" depends on some measure of student performance, and this is then useful only if it motivates someone to do something about poor performance. In the current system, teachers (broadly, including those promoted to administrators) must bear responsibility for inept instruction, including selection of wretched textbooks and the use of the inept methods of instruction promoted by colleges of Education. "I was only following orders" doesn't excuse poor performance.

Malcolm: "What 'measure...insensitive to instructional differences' are 'we' using 'to reward and punish teachers'?"
A.
Schultz: "Standardized achievement tests. The theory/practice underlying these tests does not permit anything other than arbitrary cut scores on an ungrounded scale that are interpreted as degrees of 'proficiency'."

Regardless of how they "are interpreted", the assertion that "the theory/practice underlying these tests does not permit anything other than arbitrary cut scores on an ungrounded scale" does not reflect reality. I wrote before:
1. "A measure is an order relation on a set" (with certain propertied which I won't detail here).
2. "A test is a procedure is a procedure of device for establishing a measure."
3. "A standard is a unit of measurement. A meter stick is a standard. A kilogram weight is a standard."
4. "A standardized test is a test which expresses its result in terms of a standard".
5. "Standardized tests permit intergroup comparison."

Nothing about standardized tests bars everything but "arbitrary cut scores on an ungrounded scale". The scale in your doctor's office is a standardized test.

B.
Schultz: "The results are sensitive only to differences in SES, because the tests boil down to measures of 'general ability'. That is, irrespective of the subject matter label, the test results all correlate as highly with one another as the reliabilities of the tests permit.

The tests are sensitive only to SES differences because general ability correlates with SES due to the differential experiences of economic poverty and prosperity.
"

Regardless of why IQ correlates with SES (and I agree that SAT and NAEP correlate with IQ), it is a blunt fact that standardized test scores correlate fairly strongly with other variables than SES. Some of these variables are institutional, such as age of compulsory attendance (later is better), district size (smaller is better) and variations in teacher quality. The standardized tests in common use are NOT "insensitive to instructional differences" and are NOT "sensitive only to SES differences".

Dick Schutz said...

The scale in my doctor's office was not constructed using Item Response Theory which inherently forces all standardized achievement tests into measures of general ability.

Strings of unrelated quotations regarding "tests" and citing irrelevant correlations miss the point and only add to the rhetorical fog.

Malcolm Kirkpatrick said...

Define "standardized test". I did. Your turn.

The doctor's scale measures mass and the NAEP 8th grade Math test measures Math performance. Both express their result in terms of a unit, a standard. In the case of the doctor's scale, the unit is the kilogram. In the case of the NAEP 8th grade Math test, the unit is the standard deviation from the mean. Both are standardized tests. The NAEP correlates strongly with district size (smaller is better) and with age (start) of compulsory attendance. Variations in teacher ability influence variations in student performance. So it's false that standardized tests of academic performance correlate "only" with parent SES.
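
As a rough sketch of what "the unit is the standard deviation from the mean" means in practice (the raw scores below are invented, not NAEP data; the point is only the unit conversion):

```python
import statistics

# Invented raw scores, not NAEP data.
scores = [212, 250, 263, 275, 281, 290, 304]
mean = statistics.mean(scores)
sd = statistics.pstdev(scores)

# Each result expressed in standard deviations from the mean (a z-score).
z_scores = [(s - mean) / sd for s in scores]
print([round(z, 2) for z in z_scores])
```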

Anonymous said...

Malcolm: We don’t have to invent definitions. Try this:

"A testing instrument, typically standardized and norm referenced, used to measure how much a child has learned in relation to educational objectives."

www.ccsso.org/projects/SCASS/projects/early_childhood_education_assessment_consortium/publications_and_products/2911.cfm#Definition

What the definition leaves out is the fine print. All conventional standardized achievement tests in use today are constructed following the protocol of Item Response Theory. This practice yields a distribution of scores on an ungrounded statistical scale, with degrees of “proficiency” representing nothing more than arbitrarily set cut scores.

Weight, length, and temperature are equal-interval scales. Item Response Theory inherently forces results into a Gaussian distribution, the bell-shaped curve, commonly referred to as a normal distribution. Interpretation is ungrounded other than in terms of relative position in the distribution.

A normal distribution is NOT the distribution one obtains with reliable instruction. With reliable instructional accomplishments, scores pile up at the top of the distribution.

Item Response Theory precludes this from happening. In short, NCLB is based on statistical ignorance that makes its aspirations mathematically impossible.

I'm not by any means the first to say this. Bob Linn was saying it very early everywhere he could find an audience, but the intelligence was ignored.

Malcolm Kirkpatrick said...

Mr. Schultz,

You quote the definition of "achievement test" not "standardized test". Please try again.

You say: "Weight, length, and temperature are equal-interval scales. Item Response Theory inherently forces results into a Gaussian distribution, the bell-shaped curve, commonly referred to as a normal distribution. Interpretation is ungrounded other than in terms of relative position in the distribution."

Weight, length, and temperature are not scales at all. "A is heavier than B", "M is longer than N", "T is hotter than U" are observations. We use units of measurement (standards) to express how much heavier, longer, or hotter.

All measurement is relative.

Nothing you have said supports the contention that standardized achievement tests respond only to SES.

Malcolm Kirkpatrick said...

Schultz: "A normal distribution is NOT the distribution one obtains with reliable instruction. With reliable instructional accomplishments, scores pile up at the top of the distribution."

Depends. Take some large number (say, 50) of teachers who deliver what you call "reliable instruction" in Alg I and a large number (say, 150) of teachers who deliver what you call "reliable instruction" in Alg II. Compose 1000 questions on the material we call "Alg I" and "Alg II". Create (say) 20 different 50-item tests from the list of 1000 questions.

Divide the Alg II teachers into three groups.

Over the span of (say) five years, pay students in the Alg I teachers' classes and pay students in the group I Alg II teachers' classes to take sample final exams composed of both Alg I and Alg II material.

Discard questions which everyone got correct, questions which everyone got wrong, and questions which added no new information (if someone got question #x right, they would get question #y right).

If you plot all scores on the remaining questions, you should get a bi-modal distribution, with the Alg I students distributed about a lower mode than the Alg II students' modal score.

Now take those questions which systematically discriminate between Alg I students and Alg II students.

Make a set of exams of these questions, and give them over a period of years to the students of the Alg II teachers in group II.

You could then --select-- questions which students are as likely to get correct as to get wrong. If you give this new test to students of teachers in group III, you will get a normal distribution of scores. If you want to avoid scaring students with difficult questions, you can use questions that, on average, 70% of students will get correct. This gives you a skewed distribution, but a smooth mapping exists which transforms your skewed distribution to a normal distribution. Just assign greater weight to more difficult questions.

All that matters is that the --relative position-- of students be the same by both scoring mechanisms.

You have a standardized test which reflects effective instruction.

There is nothing mystical or evil about standardized tests.
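
A toy simulation of the screening step I describe above might look like the sketch below. The ability and difficulty numbers are invented, and real test publishers use far more elaborate procedures; the sketch only shows how dropping uninformative items leaves a score distribution driven by who learned the material.

```python
import random

random.seed(0)

# Invented abilities and difficulties; Alg II students average higher ability.
N_ITEMS = 200
difficulties = [random.uniform(0, 10) for _ in range(N_ITEMS)]
alg1_students = [random.gauss(3, 1) for _ in range(300)]
alg2_students = [random.gauss(7, 1) for _ in range(300)]

def take_exam(ability):
    # Crude response model: correct when ability (plus noise) beats difficulty.
    return [ability + random.gauss(0, 1) > d for d in difficulties]

responses = [take_exam(a) for a in alg1_students + alg2_students]

# Discard items that everyone got right or everyone got wrong; they say
# nothing about relative position. (The full procedure above also drops
# items that add no new information.)
kept = [j for j in range(N_ITEMS)
        if 0 < sum(r[j] for r in responses) < len(responses)]

# Scores on the surviving items; a histogram of these would show the
# bi-modal shape described above, with separate Alg I and Alg II modes.
scores = [sum(r[j] for j in kept) for r in responses]
print(len(kept), min(scores), max(scores))
```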

Anonymous said...

The example you provide, Malcolm, is not in accord with Item Response Theory. The example is out of step with every standardized achievement test battery in common use.

Malcolm Kirkpatrick said...

Schultz: "The example is out of step with every standardized achievement test battery in common use."

What does that have to do with anything? I have described a process by which one could construct a standardized test of Math performance which would reflect differences in student ability induced by effective instruction.

In fact, the publishers of standardized tests follow a procedure something like the one I have described (but much more elaborate and well-considered): selecting questions for their ability to predict performance on other questions and to discriminate between students who know the material and those who do not. This is what makes tests useful at all.

Anonymous said...

Malcolm says: "I have describd a process by which one could construct a standardized test of Math performance which would reflect differences in student ability induced by effective instruction."

The operative word here is "could."
Study up on Item Response Theory, convince text publishers and users to abandon it and then get back to us.

Malcolm Kirkpatrick said...

A. Schultz: "A normal distribution is NOT the distribution one obtains with reliable instruction. With reliable instructional accomplishments, scores pile up at the top of the distribution."

As I demonstrated, that depends on how one constructs the test.

B. Nothing in what you have written supports the contention that standardized tests of academic performance respond only to parent SES. In fact, NAEP and TIMSS respond to institutional variables such as age (start) of compulsory attendance (later is better), district size (smaller is better), school size (smaller is better), State mandates of teacher credential requirements (Praxis and NTE are counter-indicated), and (Chubb and Moe) "institutional autonomy".

Tracy W said...

Standardized achievement tests. The theory/practice underlying these tests does not permit anything other than arbitrary cut scores on an ungrounded scale that are interpreted as degrees of "proficiency."

Please note that Dick Schutz has a bee in his bonnet about this for no reason I can figure out. We had a long debate a while ago in which he claimed that there was proof for this argument, but failed to provide it. I cannot completely rule out the possibility of some counter-intuitive mathematical proof that I haven't thought of, but there is nothing in the theory underlying these tests that prevents non-arbitrary cut scores grounded in something (what exactly depends on what you are testing). For example, if you want to test whether people can read a newspaper, you can validate a standardised test by defining more specifically what it means to read a newspaper (e.g., one component might be being able to tell the who, what, when and where from a front page story), finding a group of testees with known variation in skills (e.g., from complete illiterates in the language of the newspaper to experienced newspaper editors of that language), giving them test questions, seeing if their results match what you would expect, and reiterating until satisfied about the validity of your test. (There is more to test design than this, of course.)

Item Response Theory is about the probability of a person with a given ability level answering the question correctly. For example, someone totally illiterate has no ability at reading, and thus has roughly a zero chance of passing the front page story test (ignoring for the moment the existence of multi-choice tests). Someone who can barely write their name is also very unlikely to pass the front page story test. At the other end, newspapers tend to be pitched at a general audience, so for an English newspaper, learning Ancient Greek and Latin would not significantly increase the chances of someone already very literate in English being able to answer the question. In between the two extremes is a skill point where a reader has a 50% chance of being able to pass the test. The more discriminatory the test question, the sharper the slope. The difficulty of the question determines where the 50% is - if your test has lots of questions like "write your name" and "what does this label say?" when the label says "Danger", then you will find that a lot of adults in the local area of the newspaper will be able to get more than half the questions right; if the test has lots of questions about difficult things like the meaning of meta-physical poems and the like, then the number of people able to get 50% of the questions right should drop sharply relative to an easy test. A 50% probability of getting any individual question right does not translate automatically into a score of 50% for the whole test - it depends on the difficulty of the questions.

Now, returning to multi-choice questions. There is a chance that someone can get a multi-choice question right by pure chance. This has caused the introduction of the three parameter logistic model into Item Response Theory, which takes account of this probability.
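
For anyone who wants to see it concretely, here is a minimal sketch of the logistic item response function I'm describing, in its three-parameter form with a guessing floor for multiple-choice items; the parameter values below are arbitrary examples, not from any real test:

```python
import math

def p_correct(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Three-parameter logistic (3PL) item response function: the probability
    that a person of the given ability answers this item correctly.
    A nonzero guessing parameter models multiple-choice items that can be
    answered correctly by chance."""
    return guessing + (1.0 - guessing) / (
        1.0 + math.exp(-discrimination * (ability - difficulty)))

# Ability equal to the item's difficulty gives a 50% chance (plus any
# guessing floor); higher discrimination makes the curve's slope sharper.
print(p_correct(ability=0.0, difficulty=0.0))                    # 0.5
print(p_correct(ability=-2.0, difficulty=0.0, guessing=0.25))    # ~0.34
print(p_correct(ability=3.0, difficulty=0.0, guessing=0.25))     # ~0.96
```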

The results are sensitive only to differences in SES, because the tests boil down to measures of "general ability." That is, irrespective of the subject matter label, the test results all correlate as highly with one another as the reliabilities of the tests permit.

This is a new one. So a standardised driving test can never actually test driving skill? You are now claiming it is impossible for a person to improve their driving in a way that can be detected by a standardised test? And, say, that identical twins, brought up together by the same parents and so with the same SES, could never test differently on a standardised test of driving, no matter how different their driving abilities? Even if one twin lost his or her eyesight in an accident, they'd still drive just as well as their sibling, as they would still have the same SES?

But let me guess, Dick is not going to be able to back this claim up either.

Anonymous said...

"So a standardised driving test can never actually test driving skill?"

We've been talking about standardized ACHIEVEMENT tests that are developed using Item Response Theory.

Driving skill is tested by having a person get behind the wheel and observing the driving. There are written tests of knowledge of information that's in the State's drivers manual. But those tests are NOT constructed using Item Response Theory. Scores pile up at the top and IRT would never permit this.

For anyone interested in a reasonably light explanation of IRT, try Frank Baker's "The Basics of Item Response Theory"

http://echo.edres.org:8080/irt/baker/

The first page of Chapter 1 is as far as you need to read.

Frank states clearly that the goal of any test constructed using Item Response Theory is to measure a "Latent Trait." If anyone thinks academic achievement is a "latent trait," get back to us with some defense of that position.

This isn't my personal hobby horse. It's a very fundamental matter.

Tracy W said...

We've been talking about standardized ACHIEVEMENT tests that are developed using Item Response Theory. ...
Scores pile up at the top and IRT would never permit this.


This is just wrong. I described above how a reading test developed using IRT could produce a test in which scores piled up at the top - if you asked a set of questions that the testee group could easily answer. IRT is about the individual question, not about the results over the whole test. The test results as a whole depend on the questions asked.

In the link you provided, in chapter 8, http://echo.edres.org:8080/irt/baker/chapter8.pdf, the author describes several different types of tests:
"Screening tests -... [which] have the capability to distinguish rather sharply between examinees whose abilities are just below a given ability level and those who are at or above that level."
"Broad-ranged tests - ... used to measure ability over a wide range of underlying ability scale. "
"Peaked tests - ...designed to measure ability quite well in a region of the ability scale where most of the examinee's abilities will be located, and less well outside this region."

So the author of the book you use as a reference believes that IRT can be used to design tests that have different distributions of results. And there's nothing in the mathematics that I can see that would indicate he is wrong. But of course you will ignore this and just keep on claiming with no support at all that IRT tests can't do this.

Frank states clearly that the goal of any test constructed using Item Response Theory is to measure a "Latent Trait." If anyone thinks academic achievement is a "latent trait," get back to us with some defense of that position.

Well my understanding is that the definition of "latent trait" is one that can't be measured directly. So you can measure my height directly (get out your ruler), but you can't measure my reading skills or my driving ability directly. To quote from page 1 of chapter 1 of the link you provided:
"In academic areas, one can use descriptive terms such as reading ability and arithmetic ability. Each of these is what psychometricians refer to as an unobservable, or latent, trait. Although such a variable is easily described, and knowledgeable persons can list its attributes, it cannot be measured directly as can height or weight, for example, since the variable is a concept rather than a physical dimension."

So Frank not merely states that the goal of a test constructed using IRT is to measure a "latent trait", he also explicitly states that reading ability and arithmetic ability are latent traits.

You can estimate my latent reading ability by asking me various reading-related questions and seeing which ones I can answer correctly. You can estimate my latent driving ability by asking me to carry out various driving-related tasks, in different sorts of vehicles (automatic/manual/light truck) etc. The difference between these tests and measuring my height is that individual questions only contain limited information about my likely ability for other questions, eg if I can tell you what, when and where on a front page newspaper story, this doesn't tell you if I can read and comprehend an electrical engineering journal article (though it may raise your estimate of the probability that I can). But if you measure my height and it's 5'2", you can tell just from the one measurement that I can walk under a bar 5'4" above the floor and a 6' high bar, but would have to duck to get under a 5'high bar.

Now, if you are going to argue that academic achievement is not a latent trait, please tell me what the equivalent of the ruler is for measuring reading ability or arithmetic ability.

This isn't my personal hobby horse. It's a very fundamental matter.

So, since you believe it's so fundamental, why not try something original and provide the mathematical proof backing up your claim that IRT tests would never permit the scores to pile up at the top?

Tracy W said...

Parry: A gifted child doesn’t have to move on to poems, even though that’s a topic for the following year, but can spend additional time on short stories, reading more complex stories and engaging in more sophisticated assignments in response to the reading. In any well-run language arts class, students are reading texts at various levels of complexity and engaging with texts at various levels of sophistication. You don’t have to be in a 5th grade class to be challenged to read and write at a 5th grade level.

But this is dependent on the teacher. Yes, the gifted child can read more complex stories and engage in more sophisticated assignments. But does the teacher then read the gifted child's completed assignments and then provide relevant feedback? You may not need to be a 5th grade class to be challenged to read and write at a 5th grade level, but I think most people need a teacher who can challenge them to read and write at a 5th grade level. And a 2nd grade teacher may not know that much about 5th grade reading and writing (say he's been teaching 2nd grade for years and has forgotten how to go beyond that), or, far more likely, a 2nd grade teacher can just not have the time to prepare more sophisticated assignments and to mark them while providing decent feedback given all the other students in his class.

Similarly for history. The gifted student can just memorise more details about US history, or they can have a go at formulating historical arguments, interpreting original sources, understanding historiography, etc. The latter needs a teacher who can provide feedback on whether the gifted student is doing those things correctly. And just memorising details with no new skills gets boring fast, at least in my experience.

I don't know what my IQ is, but I learnt very quickly at school, and thus at primary school spent years being told to do artwork. I enjoyed art and that was certainly a more interesting use of my time than just another set of worksheet problems at the same level. But when I got to high school I wound up taking the Graphic Design course, and my teacher there was a professional graphic designer (as in he did it for pay in school holidays), and the feedback he could provide on my drawing just made me realise what a waste of time all that art practice had been at primary school. The feedback really improved my drawing. No criticism of my primary teachers, they were not trained in teaching art (teaching reading and arithmetic are their own specialised skill sets), and they had full classrooms at different skill levels taking up their time.

This is the problem with enrichment, schools ask teachers to do more without any idea of whether the teachers have the necessary skills or time to do it. Acceleration is more likely to get students to teachers who are specialised in the skills the students are likely to be learning.

Anonymous said...

Yes, Frank does fudge on morphing from Latent Trait into "Ability." And this is what all IRT theorists do. But the assumptions of IRT are that the latent trait is univariate and stable. That is, the "thing" exists in some amount and you can get at this "amount" by going through the IRT rigamarole.

That's just not what is involved in the acquisition of any specific academic expertise. The behavior morphs as a consequence of learning/teaching. IRT practice deals with this by regularly changing the test items from grade to grade. The definition of the "Latent Trait" is different at each graded level of the test. The only possible interpretation of the test results is ungrounded and relative--in terms of grade level or in terms of arbitrary cut scores on the statistical scale.

Has anyone ever seen a distribution of NAEP scores or of any other standardized achievement test scores that wasn't a normal distribution? No, you haven't, because the theory doesn't permit other than a normal distribution. The normal distribution is necessary to convert the scores at various grades to a single statistical scale.

Ask any reputable psychometrician, or consult any psychometric text, to confirm what I've been saying. Cherry picking excerpts out of context is unproductive.

It's easy to measure driving skills directly. Usually it's a go, no-go decision. But if there were any reason to do so, one could easily construct a Guttman scale (an ordered set, like measuring spoons and cups) that would range from Novice to Nascar driver.

The same can be done for reading. Arrange texts in terms of Alphabetic Code complexity into 5-11 Key Accomplishment Indicators.

A "mathematical proof" of this matter is beside the point. You can see it with your own eyes.

Tracy W said...

Yes, Frank does fudge on morphing from Latent Trait into "Ability."

You didn't answer my question. If academic abilities are not latent traits, then what is your equivalent of the ruler for reading ability, or arithmetic ability, or for that matter, the non-academic ability of driving?

I am tired of you not engaging with my counter-arguments. You answer my question here, and I will give you replies to the rest of your comments.

Anonymous said...

"what is your equivalent of the ruler for reading ability, or arithmetic ability, or for that matter, the non-academic ability of driving"

Well, I'm having a tough time getting my point across.

What I'm trying to say is that reading ABILITY and driving ABILITY are not "traits"--latent or otherwise. They exist as levels of expertise that morph during the acquisition of expertise.

A ruler is an equal-interval indicator for measuring length. There is no counterpart to such a scale in measuring achievement, because teaching/learning is not that kind of a phenomenon.

The closest legitimate counterpart is a Guttman-like scale with Key Performance Indicators as I've described. This orientation is commonplace today in Business Intelligence endeavors. I'm not just coming up with it off the top of my head.

Tracy W said...

What I'm trying to say is that reading ABILITY and driving ABILITY are not "traits"--latent or otherwise.


Thank you for this explanation. I don't know how you define trait, but the psychometricians are defining traits in a way that includes "reading ability" and "driving ability". The dictionary definition of "trait" is "habit, manner, custom, feature, attribute, quality, characteristic, idiosyncrasy, peculiarity, quirk, mannerism, oddity, trick; see also characteristic. See syn. study at quality." (see http://www.yourdictionary.com/trait). My reading ability or lack thereof is one of my qualities, attributes or features.

In the passage I quoted from http://echo.edres.org:8080/, the author explicitly states that psychometricians refer to reading ability and arithmetic ability as latent traits. This is a matter of definitions. Professions often have their own definitions that are slightly at odds to those of ordinary speech. This is clearly one of those cases.

A ruler is an equal-interval indicator for measuring length. There is no counterpart to such a scale in measuring achievement, because teaching/learning is not that kind of a phenomenon.

Which is what makes reading achievement, or driving achievement, or arithmetic achievement, a latent trait as defined by psychometricians.

Perhaps psychometricians should have called things that can't be directly measured, like reading ability or arithmetic ability, something other than "latent traits", as this name clearly has caused some confusion. But they didn't, and the definition of latent trait is stated as clearly covering reading ability or arithmetic ability. If you really dislike using the name "latent trait" for things like academic abilities or driving ability, I suggest mentally substituting another name whenever "latent trait" is used by psychometricians. Do you think that "non-directly-observable expertise level" would be a better name for your purposes?

As a thank you for answering my question, here are my responses to your other points from your previous comment.

How do you define univariate? How do you define stable?

I outlined earlier how an achievement test could be validated and grounded, so I don't know why you keep claiming that IRT results in tests that can only be ungrounded and relative. Yes, you can define different reading tests for each grade. There is however no point in redefining the latent trait at each grade level; indeed one point of IRT, compared to classical test theory, is that scores on "hard" tests can be compared with scores on "easy" tests (see chapter 1 of the link you provided). See pages 51 to 54, chapter 3, of the book you provided.

You claim that the theory doesn't permit other than a normal distribution, but provide no proof of this. I stated exactly how a test could provide a non-normal distribution (by containing a lot of easy questions that the test population could mostly answer, or by containing really hard questions), and I quoted, from the link you provided, a description of the different test distributions that could arise, depending on the design-purpose of the test.

I will explain another way that standardised tests constructed using IRT could return a non-normal distribution. IRT is designed for measuring "latent traits", or "non-directly-observable ability", or whatever you wish to call things like academic ability or driving ability. However, there is nothing stopping it from being used to measure something directly observable. One could construct a test, using IRT, to estimate a person's height. Instead of just hauling out the ruler, one could have a series of cards with marks on them and line the person up against them to see if they were taller or shorter than the mark. Now, there is always a slight error with physical measurements. So say that a mark is at 170.00 cm. People who are only 120 cm are basically certain to be measured as shorter than that, assuming that the examiner isn't doing something silly like reading the wrong card. People who are 220 cm tall are definitely going to be measured as taller than the 170 cm mark on the card. But someone who is 170.00 cm should have about a 50% chance of being measured as lower than the mark and 50% chance of being higher. Someone who is actually 169.0 cm has a non-zero chance of being wrongly measured as higher than 170 cm. The error range depends on various factors, eg if you are measuring a bunch of baldies you can measure a lot more accurately than if you are measuring a group of people some of whom have afros. So for a group of baldies we would see a lot sharper item response curve than for a more diverse group of hair styles - in the words of the link you provide, the item question would be more discriminatory.

Now if we had many cards in the right range we could zoom in on a person's approximate height using IRT, always assuming that the examiner is measuring and recording the answers correctly of course. Normally you would never do this for height, you'd just measure someone's height. But with latent traits you have to measure indirectly.

Now say I designed, using IRT, a test that really distinguished between people whose heights were between 150-170 cm, say breaking them down into 1 cm groups, but then lumped everyone over 170 cm together, and everyone shorter than 150 cm together. So in other words my examiners would be holding up cards to each testee that were 150 cm, 151 cm, 152 cm etc to 170 cm, and for each testee recording for each card if the person was shorter or taller than this card. And then I used this test on adult American males, who apparently have a height of about 176.2 cm on average (http://en.wikipedia.org/wiki/Human_height). I would get a result with most of the group piling up at the top. The probability distribution would flatten out at a bit above 170 cm (depending on how discriminatory the test items are). If I did the test on Vietnamese women, I would get a result with very few people at the top. Not a normal distribution, even though height is normally distributed.

Of course height is not an expertise level, and it is directly observable. The point of this example was to show that a test designed using IRT does not need to result in any particular normal distribution, by taking a concrete example.
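
If anyone wants to see the height example with actual numbers, here is a rough simulation of it; the population figures are the approximate ones cited above, and the measurement noise is invented:

```python
import random

random.seed(1)

# Cards marked 150 cm .. 170 cm; a testee "passes" a card when measured as
# taller than the mark, with a little invented measurement noise.
cards = list(range(150, 171))
# Approximate mean height for adult American males, as cited above.
heights = [random.gauss(176.2, 7.0) for _ in range(10000)]

def card_score(height):
    # Number of cards the person is measured as taller than.
    return sum(height + random.gauss(0, 0.5) > c for c in cards)

scores = [card_score(h) for h in heights]
ceiling = sum(s == len(cards) for s in scores)
print(f"{ceiling / len(scores):.0%} of testees score at the ceiling")
```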

If you want me to ask a reputable psychometrician, please provide a name, professional affiliation and email address of someone that you regard as a reputable psychometrician. I ask you for this because I strongly suspect that if I picked one your response would be that that person was not a reputable psychometrician.

As for your comment about cherry-picking excerpts out of context, if you think I have cherry-picked an excerpt out of context please say so, and say what the context actually is that I've missed.

I am afraid that I don't see the truth of your claims with my own eyes. Indeed, in the link you provided, I see a mathematical demonstration that IRT item parameters are independent of the test group (pages 51 to 54, chapter 3, of the book). Now it is possible that this demonstration is faulty, but if it is, I am missing the fault. And it is also possible that you are mistaken. That is why I asked you for a proof. If you want me to believe that you are right, provide the proof.

Anonymous said...

Just to further muddy the waters, the point should be made that in fact many tests in common use to evaluate "reading achievement" and "math achievement" are not ones that yield a Gaussian distribution. While the scores do not "pile up at the top," they do "pile up" in the upper range, with a relatively small number at the bottom. These include the standards-based assessments as developed by many states in response to NCLB.

They are constructed quite differently from norm-referenced measures, which do generate a normal distribution, hence the word "norm."

IRT is used in test construction for validity purposes but may or may not have any impact on the student scores, depending on how the test is evaluated. For example, when the tests are graded strictly by individual graders, embedded MC items may be used to cross-check for validity and reliability but do not "count" toward the student's score on the test. Anomalies are cause for the test to be double- or triple-graded or for examination of possible irregularities.

These tests are more sensitive to instructional effects precisely because they tend to be grounded in the local curricular requirements, whereas norm-referenced measures are not. SES is correlated with academic outcomes generally but does not correlate to results of criterion-based assessments as much as to results of norm-referenced testing instruments.

There is tremendous variability in outcomes between populations with identical SES, including some very high-SES populations.

Investigations have found significant differences in teaching and learning conditions in the respective populations. A colleague has some data that vividly depict precisely this phenomenon but alas I am not in a position to share it here. There is plenty of data out there already however.


Tracy W, are you sure you don't have a PhD in psychometrics? {grin}

KDeRosa said...

SES is correlated with academic outcomes generally but does not correlate to results of criterion-based assessments as much as to results of norm-referenced testing instruments.

I've been wondering why criterion based tests aren't used more since the SES/IQ correlation is less.

It would seem that criterion tests would be preferred over norm-based tests for this reason, and that criterion tests are better suited to properly written state standards.

Anonymous said...

The problem arises with the term "criterion." No current state (or local) standards that I'm aware of are written in a manner that permits a close mapping with any test. The focus of the standards is on "content," and we've been around that barn here before.

So what it comes down to is that the criterion tests have to sample. Multiple-matrix sampling can get around this, but then problems arise in reporting individual scores. Some states, most notably California, were grappling with these matters and making good progress in the early to mid 1970s.

Then, through a series of flukes, IRT enthusiasts gained control. And it was shown that what were then being called "criterion-referenced tests" correlated highly with "norm-referenced tests." Since norm-referenced tests have some desirable statistical properties (of course they do in the eyes of IRT proponents, who construct the tests), it was bye-bye criterion-referenced tests.

Text publishers these days do a lot of fancy footwork and hand waving to "align" their tests with "standards," but it's rhetorical.

I think we'd have to deal with current "criterion tests" on a case-by-case basis to sort the matter out. Certainly, as "disreputable" says, it's possible to construct tests that have sound psychometric properties and that even knock out the correlation with SES. But this just isn't possible if you have to generate a statistical scale that spans several grades.

Why aren't good criterion tests more common? For the same reason that DI and other legitimate reading instructional architectures aren't more common. You know those reasons as well as I do, Ken.

Anonymous said...

Reading expertise IS directly observable. Same with musical expertise, cooking expertise, driving expertise, expertise in using given software, and on and on. We don't rely on measures created by IRT in any of these areas.

If we did, we'd morph them from "latent traits" to "abilities" and we'd generate the same kind of paper-pencil tests as we do for reading and other academic domains.

But that would be beyond silly, wouldn't it? It's beyond silly to do it in academic instruction, but we've been conned into believing that "although standardized achievement tests have their flaws, there is no better alternative." And the straw man of "portfolios" is typically knocked down to "prove the point."

There ARE alternatives. They involve practices that are commonplace in the corporate world today.

If you want confirmation of what I've said about IRT from a psychometrician, try anyone at ETS, ACT, or any of the test publishing companies. They'll give you essentially the same information about IRT that I've given you.

The only difference is that they're quite satisfied with IRT and they seek the ungrounded distributions they get. It doesn't faze them at all to treat arbitrary cut scores as "proficiency levels."

Most of them believe deep down that "ability" IS "fixed, permanent, and stable" as Charles Murray tells it. They see nothing out of whack with achievement correlating with SES and ethnic/racial characteristics. It's consistent with IRT and with what "tests show."

The tests "show that" only because of the way they are constructed.

I've been trying to explain how the current state of instruction and testing derives from IRT. But I can put it more simply without reference to IRT:

If you test what kids have been taught, you'll see what they have learned. And the distribution of scores will pile up at the top. But if you don't test what kids have been taught, you're involving the differential experience reflected in SES.

Tracy W said...

Disreputable Psychometrician: I'm sure I've not got a PhD in psychometrics - I tried to integrate a single parameter logistic function last night and failed. :(

Dick Schutz: Reading expertise IS directly observable. Same with musical expertise, cooking expertise, driving expertise, expertise in using given software, and on and on.

No, reading expertise is not directly observable. If you know I can read and comprehend a newspaper, you don't therefore know if I can read and comprehend a biology journal article. You have to test my reading comprehension on biology journal articles separately. Driving expertise is not directly observable - if you know I can safely drive a manual car on a busy motorway, you don't know if I can drive a truck in snow and ice. You have to test my driving ability in a truck on snow and ice separately.
However, if you know that I am 5'2", you know that I can walk through a doorway 6' high and a doorway 7' high. You don't need to re-test me on each doorway.

You yourself said that there wasn't the equivalent of a ruler for reading expertise or driving expertise. This is why statisticians talk about reading expertise, musical expertise, cooking expertise, etc, as being not directly observable. They are fundamentally different to height or weight, which are directly observable.

There is nothing in IRT that requires the test to take the form of paper and pencil. IRT is about the probability of getting a correct answer, given the person's underlying ability, or expertise. If we set up a driving test where a driver had to weave in and out of traffic cones, and the "right answer" was all the cones still upright at the end, IRT could estimate for us, given sufficient sample data, the probability of a person not hitting any cones at a given underlying ability. If we added more test items, like a hill start without needing to restart the car, being able to do a 3-point turn in a certain small area, up to whatever tests NASCAR drivers must pass to get onto a team, we would have a lot more data and could make estimates about underlying driving ability across a range of outcomes using IRT. The advantage of IRT is that, given a driver's performance on the easy items (weaving through cones, etc.), we can estimate their likely performance on the hard items, provided of course that the driver's performance varies enough from perfect to make an estimate of their ability possible - we presumably couldn't estimate NASCAR drivers' performance on the racing circuit from the easy tests, since they would all get the easy items perfectly correct.
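
As a toy illustration of that last point, here is a two-parameter logistic sketch with invented difficulty and discrimination numbers for the driving items. It estimates a driver's ability from the easy items by a crude grid-search maximum likelihood and then predicts their chance on a much harder item:

import math

def p_correct(theta, difficulty, discrimination=1.0):
    # Two-parameter logistic item response function.
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Invented driving "items": (difficulty, discrimination)
easy_items = [(-1.0, 1.2),   # weave through the cones
              (-0.5, 1.0),   # 3-point turn in a small area
              ( 0.0, 1.0)]   # hill start without restarting the car
hard_item = (2.5, 1.5)       # something closer to a NASCAR tryout

observed = [1, 1, 0]         # pass, pass, fail on the easy items

def log_likelihood(theta):
    ll = 0.0
    for (b, a), x in zip(easy_items, observed):
        p = p_correct(theta, b, a)
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll

# Crude grid-search maximum-likelihood estimate of the driver's ability
theta_hat = max((t / 100 for t in range(-400, 401)), key=log_likelihood)
print(f"estimated ability: {theta_hat:.2f}")
print(f"predicted chance on the hard item: {p_correct(theta_hat, *hard_item):.2f}")

If the observed responses were all passes, as they would be for a NASCAR driver on these easy items, the likelihood would keep rising as the ability estimate grows and there would be no sensible estimate - which is exactly the point about needing performance that varies from perfect.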

Also, please answer my questions from earlier:
- how do you define the words "univariate" and "stable"?
- please provide me with the name, professional affiliation and email address of a psychometrician that you regard as reputable.
- where's the proof that IRT tests would never permit the scores to pile up at the top? I certainly don't see it with my own eyes, so if you want to convince me of this fundamental matter, provide the proof.

As for your comments about what psychometricians believe, you don't provide any evidence for this, so I place zero weight on your claims.

Anonymous said...

To humor Tracy W:

Univariate- “A frequency distribution of only one variate.” –as opposed to phenomena that are multivariate in nature.

www.answers.com/topic/univariate-distribution

Stable- “resistant to change of position or condition”

http://dictionary.die.net/stable

Reputable psychometricians—Any member of the Psychometrika Editorial Board certainly qualifies

www.psychometrika.org/journal/PMjBody.html

Proof that IRT tests would never permit the scores to pile up at the top:

Every distribution of standardized achievement test results for a large number of students ever published

Anonymous said...

I've been wondering why criterion based tests aren't used more since the SES/IQ correlation is less.

Ken, in point of fact, a great many of the tests in current use are criterion-based. Somewhere on the ed.gov website I found a document that listed all the state assessments used in conformance with NCLB and whether they were norm-referenced, criterion-based, or something else. Most states are using criterion-based tests that they have developed themselves, with assistance from ETS and similar consultants. This shouldn't surprise anyone since the purpose is to measure student performance on state expectations, and developing local tests is one way of ensuring a closer match between the two.

As it happens I went to a professional training program not long ago which featured people from ETS and elsewhere explaining how these tests are constructed and validated. It was quite interesting and much was new to me, as in our daily work assessing students we use many more norm-referenced measures than anything else. The standards-based tests are constructed in collaboration with local curriculum and instructional people so as to target skills and expectations actually covered in classrooms.

I have to admit I originally considered these assessments to be pretty woolly stuff. They are graded individually using rubrics and given levels from 1 to 4, 5, or 6 depending on the state, and consistency between graders is an ongoing issue. However, I came away with a much greater understanding of how the tests are constructed to actually measure classroom learning, how features are built in to maximize reliability (it will never be perfect), and how IRT is used as an adjunct to the essay-type constructed-response components to monitor validity and reliability, but not (in the cases we were shown) to affect students' scores.

Most of these tests are not standardized in the manner Schutz refers to; they are designed with the expectation that the majority of students -- 70-85%, depending on the assessment -- will be successful, and only a very small number will be complete failures, usually because they have not written anything at all or have written too little to yield a score of any kind.

The scores don't "pile up at the top" but they noticeably pile up at the upper end. The examples we were shown had 10-15% at the top, about 35% next to the top, and another 25-30% in the middle, with very few at the bottom. Instead of a bell curve the distribution resembled an upside-down, slightly misshapen L.

Calling these "standardized tests" is perhaps a source of confusion. They are standardized in that they are keyed to state teaching and learning standards (which are skill-based, not just content) and developed according to some fairly rigorous protocols that I was heretofore unfamiliar with, but they are not normed on sample populations and the like as the SAT and others are.

Anonymous said...

Hmm. The info about "criterion tests" is interesting. If the results are interpreted in terms of an IRT-generated scale, then all the talk about "standards" is window-dressing. If they're reported in some other way, I don't see how they can comply with "adequate yearly progress." Can you straighten me out on this, DP?

Are there any links to access any of these distributions and reports?

Tracy W said...

Dick - thank you for answering my questions. However, if you want to improve my mood, I suggest either backing up your implication earlier that I cherry-picked a quote out-of-context, or withdrawing it. (Claiming that this was a random comment with no such implication intended will not humour me).

Thank you for your definitions of univariate and stable. The definition you have supplied of stable is not necessary for IRT. To quote from the textbook you provided:

"However, if the examinee received remedial instruction between testings or if there were carryover effects, the examinee’s underlying ability level would bedifferent for each testing. Thus, the examinee’s underlying ability level is notimmutable. There are a number of applications of item response theory thatdepend upon an examinee’s ability level changing as a function of changes inthe educational context." (page 90, chapter 5)

IRT does assume that the ability of the examinee can be represented on a single scale from negative infinity to positive infinity, and IRT models the responses to individual questions. This is, however, not the same as assuming that what goes into making up that ability is driven by only one variate. For example, a person's height is a single number on a scale from zero to infinity, but their exact height is the result of some combination of, at a minimum, their genes, their age, and their nutrition during childhood.

Reputable psychometricians—Any member of the Psychometrika Editorial Board certainly qualifies

Thank you - and now I've realised that I had better ask you exactly what you are claiming about IRT, before contacting one of them.

Proof that IRT tests would never permit the scores to pile up at the top: Every distribution of standardized achievement test results for a large number of students ever published

This isn't actually a proof. A collection of data points is not a proof by mathematical standards. Another explanation of the observed data is that no one has wanted a standardised achievement test in which the scores pile up at the top. As I stated earlier, pages 51 to 54 of chapter 3 of the book you provided a link to show that scores can pile up at the top. You have not demonstrated any flaw in that presentation, so there is no proof. I don't know why you have taken against IRT so much, but it clearly ain't based on the mathematical properties of IRT.

However this does imply that if I find a distribution of standardised achievement test results for a large number of students with scores piling up at the top, you will change your mind about IRT. What is the minimum number of students you would count as large?

I'm off for a long weekend, and then will have to email a member of the Psychometrika Editorial Board, so it will be a while until my next comment.

Anonymous said...

To whom it may concern: I will have nothing more to say about IRT in this thread. IRT is another way of "Bamboozling the Gifted," but this is not the place to go into the gory details of the theory. My attempts to clarify have gone as far as they can go. Further, actually. Sorry about that, but it's the Internet.

Tracy W said...

IRT is another way of "Bamboozling the Gifted,"

Only in the sense that the underlying mathematics is somewhat complicated, particularly for the case of multiple-choice testing. Item Response Theory is a way of estimating a person's underlying expertise or ability from their answers to questions, and thus of estimating an examinee's likely performance on tests of a different degree of difficulty.

There may be systematic problems with the current development and use of standardised achievement tests, but those problems would apply equally, or more, to tests designed using classical test theory. They have nothing to do with IRT; they are a matter of test design, and in particular of validation or the lack thereof.