Making Sense of Student Performance Data

Kim Marshall draws on his 44 years’ experience as a teacher, principal, central office administrator, and writer to compile the Marshall Memo, a weekly digest of articles of interest to busy educators drawn from 64 publications. He shared one of my recent articles, co-authored with doctoral students Britnie Kane and Jonee Wilson, in his latest memo and gave me permission to post his succinct and useful summary.

In this American Educational Research Journal article, Ilana Seidel Horn, Britnie Delinger Kane, and Jonee Wilson (Vanderbilt University) report on their study of how seventh-grade math teams in two urban schools worked with their students’ interim assessment data. The teachers’ district, under pressure to improve test scores, paid teams of teachers and instructional coaches to write interim assessments. These tests, given every six weeks, were designed to measure student achievement and hold teachers accountable. The district also provided time for teacher teams to use the data to inform their instruction. Horn, Kane, and Wilson observed and videotaped seventh-grade data meetings in the two schools, visited classrooms, looked at a range of artifacts, and interviewed and surveyed teachers and district officials. They were struck by how different the team dynamics were in the two schools, which they called Creekside Middle School and Park Falls Middle School. Here’s some of what they found:

  • Creekside’s seventh-grade team operated under what the authors call an instructional management logic, focused primarily on improving the test scores of “bubble” students. The principal, who had been in the building for a number of years, was intensely involved at every level, attending team meetings and pushing hard for improvement on AYP proficiency targets. The school had a full-time data manager who produced displays of interim assessment and state test results, which were posted (with students’ names) in classrooms and elsewhere around the school. The principal also organized Saturday Math Camps for students who needed improvement. He visited classrooms frequently and had the school’s full-time math coach work with teachers whose students needed improvement. Interestingly, the math coach had a more sophisticated knowledge of math instruction than the principal, but the principal dominated team meetings.

In one data meeting, the principal asked teachers to look at interim assessment data to predict how their African-American students (the school’s biggest subgroup in need of AYP improvement) would do on the upcoming state test. The main focus was on these “bubble” students. “I have 18% passing, 27% bubble, 55% growth,” reported one teacher. The team was urged to motivate the targeted students, especially quiet, borderline kids, to personalize instruction, get marginal students to tutorials, and send them to Math Camp. The meeting spent almost no time looking at item results to diagnose ways in which teaching was effective or ineffective. The outcome: providing attention and resources to identified students. A critique: the team didn’t have at its fingertips the kind of item-by-item analysis of student responses necessary to have a discussion about improving math instruction, and the principal’s priority of improving the scores of the “bubble” students prevented a broader discussion of improving teaching for all seventh graders. “The prospective work of engaging students,” conclude Horn, Kane, and Wilson, “predominantly addressed the problem of improving test scores without substantially re-thinking the work of teaching, thus providing teachers with learning opportunities about redirecting their attention – and very little about the instructional nature of that attention… The summative data scores simply represented whether students had passed: they did not point to troublesome topics… By excluding critical issues of mathematics learning, the majority of the conversation avoided some of the potentially richest sources of supporting African-American bubble kids – and all students… Finally, there was little attention to the underlying reasons that African-American students might be lagging in achievement scores or what it might mean for the mostly white teachers to build motivating rapport, marking this as a colorblind conversation.”

  • The Park Falls seventh-grade team, working in the same district with the same interim assessments and the same pressure to raise test scores, used what the authors call an instructional improvement logic. The school had a brand-new principal, who was rarely in classrooms and team meetings, and an unhelpful math coach who had conflicts with the principal. This meant that teachers were largely on their own when it came to interpreting the interim assessments. In one data meeting, teachers took a diagnostic approach to the test data, using a number of steps that were strikingly different from those at Creekside:
  • Teachers reviewed a spreadsheet of results from the latest interim assessment and identified items that many students missed.
  • One teacher took the test himself to understand what the test was asking of students mathematically.
  • In the meeting, teachers had three things in front of them: the actual test, a data display of students’ correct and incorrect responses, and the marked-up test the teacher had taken.
  • Teachers looked at the low-scoring items one at a time, examined students’ wrong answers, and tried to figure out what students might have been thinking and why they went for certain distractors.
  • The team moved briskly through 18 test items, discussing possible reasons students missed each one – confusing notation, skipping lengthy questions, mixing up similar-sounding words, etc.

  • Teachers were quite critical of the quality of several test items – rightly so, say Horn, Kane, and Wilson – but this may have distracted them from the practical task of figuring out how to improve their students’ test-taking skills.

The outcome of the meeting: re-teaching topics with attention to sources of confusion. A critique: the team didn’t slow down and spend quality time on a few test items, followed by a more thoughtful discussion about successful and unsuccessful teaching approaches. “The tacit assumption,” conclude Horn, Kane, and Wilson, “seemed to be that understanding student thinking would support more-effective instruction… The Park Falls teachers’ conversation centered squarely on student thinking, with their analysis of frequently missed items and interpretations of student errors. This activity mobilized teachers to modify their instruction in response to identified confusion… Unlike the conversation at Creekside, then, this discussion uncovered many details of students’ mathematical thinking, from their limited grasp of certain topics to miscues resulting from the test’s format to misalignments with instruction.” However, the Park Falls teachers ran out of time and didn’t focus on next instructional steps. After a discussion of students’ confusion about the word “dimension,” for example, one teacher said, “Maybe we should hit that word.” [Creekside and Park Falls meetings each had their strong points, and an ideal team data-analysis process would combine elements from both: the principal providing overall leadership and direction but deferring to expert guidance from a math coach; facilitation to focus the team on a more-thorough analysis of a few items; and follow-up classroom observations and ongoing discussions of effective and less-effective instructional practices. In addition, it would be helpful to have higher-quality interim assessments and longer meetings to allow for fuller discussion. K.M.]

“Making Sense of Student Performance Data: Data Use Logics and Mathematics Teachers’ Learning Opportunities” by Ilana Seidel Horn, Britnie Delinger Kane, and Jonee Wilson in American Educational Research Journal, April 2015 (Vol. 52, #2, pp. 208-242)

First, Do No Harm

I have often wondered if teachers should have some form of a Hippocratic Oath, reminding themselves each day to first, do no harm.

Since the network of relationships in classrooms is so complex, it is often difficult to discern what we may do that causes children harm. Most of us have experienced the uncertainty of teaching, those dilemmas endemic to the classroom. Was it the right decision to stay firm on an assignment deadline for the child who always seems to misplace things, after giving several extensions? Or was there something more going on outside of the classroom that would alter that decision? Why did a student, who is usually amenable to playful teasing, suddenly storm out of the room today in the wake of such an interaction?

What I have arrived at is that there are levels of harm. The harm I describe in the previous examples can be repaired if teachers have relational competence — that is, if the lines of communication are open with their students so that children can share and speak up when a teacher missteps.

What I am coming to realize is that mathematics teachers have a particular responsibility when it comes to doing no harm. Mathematics, for better or worse, is our culture’s stand-in subject for being smart. That is, if you are good at math, you must be smart. If you are not good at math, you are not truly smart.

I am not saying I believe that, but it is a popular message. I meet accomplished adults all the time who confess their insecurities stemming from their poor performance in mathematics classes.

Here is an incomplete list of common instructional practices that, in my view, do harm to students’ sense of competence:

1. Timed math tests

Our assessments communicate to students what we value. Jo Boaler recently wrote about the problems timed tests pose for mathematical learning. Students who do well on them tend to see connections across the facts, while students who struggle tend not to. But if timed tests are the primary mode of assessment, then the students who struggle do not get many opportunities to develop those connections.

2. Not giving partial credit

Silly mistakes are par for the course in demanding problem solving. Teachers who use only multiple-choice tests or auto-grading do not get an opportunity to see students’ thinking. A wrong answer does not always indicate entirely wrong thinking. Students who are prone to getting the big idea but missing the details are regularly demoralized in mathematics classes.

Even worse, however, is …

3. Arbitrary grading that discounts sensemaking

Recently, a student I know had a construction quiz in a geometry class. The teacher marked her construction as “wrong” because she made her arcs below the line instead of above it, as the teacher had demonstrated. This teacher also counts answers as incorrect if the SAS Theorem is written as the SAS Postulate in proofs. Since different textbooks often name triangle congruence properties differently, this is an arbitrary distinction. This practice harms students by valuing imitation over sensemaking.

4. Moving the lesson along the path of “right answers”

Picture the following interaction:

Teacher: “Can anyone tell me which is the vertical angle here?”

Layla: “Angle C?”

Teacher: “No. Robbie?”

Robbie: “Angle D?”

Teacher: “Yes. So now we know that Angle D also equals 38°…”

That type of interaction, called initiation-response-evaluation, is the most common format of mathematical talk in classrooms. Why is it potentially harmful? Let’s think about what Layla learned. She learned that she was wrong and, if she was listening, she learned that Angle D was the correct answer. However, she never got explicit instruction on why Angle C was incorrect. Over time, students like Layla often withdraw their participation from classroom discussions.

On the other hand, teachers who work with Layla’s incorrect answer – or better yet, value it as a good “non-example” to develop the class’s understanding of vertical angles – increase student participation and mathematical confidence. And they are doing more to grow everybody’s understanding.
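For what it’s worth, the explanation Layla never heard takes only a couple of lines. Here is a sketch of it, assuming (since the original figure isn’t shown) that Angle C sits next to the given 38° angle along a straight line:

```latex
\[
\begin{aligned}
  m\angle C + 38^\circ &= 180^\circ \;\Rightarrow\; m\angle C = 142^\circ
    && \text{(Angle C and the given angle form a straight line, so they are supplementary)} \\
  m\angle D + m\angle C &= 180^\circ \;\Rightarrow\; m\angle D = 38^\circ
    && \text{(Angle D is supplementary to Angle C, which is exactly why vertical angles are equal)}
\end{aligned}
\]
```

Two sentences of that kind turn Layla’s answer into information about what she was attending to, instead of a simple miss.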

What are other kinds of teaching practices that stand to “harm” students?

A Cascade of Errors in Interim “Summative” Assessments

I have been working on a paper investigating how teachers interpret student performance data. The data come from their district’s interim summative assessments, tests given every 6-8 weeks to help teachers predict how students will perform on the high-stakes end-of-year tests. These interim assessments have taken on a very important place in these schools, which are under the threat of AYP sanctions.

The teachers are all working so hard to do right by their kids, but there is a cascade of errors in the whole system.

First, the assessments were internally constructed. Although they match the state standards and have been thoughtfully designed, they have not been psychometrically validated. That means that when they are used to measure, say, a student’s understanding of addition of fractions, it has not been established through repeated rounds of revision and analysis that this is what the items actually measure.
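To make that concrete, here is one small slice of the work validation involves: checking whether students who get a given item right also tend to do well on the rest of the test (an item-total, or point-biserial, correlation). This is only a sketch with invented 0/1 response data, not anything drawn from the district’s assessments:

```python
# Sketch of one basic psychometric check: the item-total (point-biserial)
# correlation, i.e., do students who answer an item correctly tend to
# score well on the remaining items? (Invented 0/1 data for illustration.)
from statistics import mean, pstdev

# rows = students, columns = items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

def item_total_correlation(responses, item):
    """Correlate one item with the total score on all the other items."""
    x = [row[item] for row in responses]
    y = [sum(row) - row[item] for row in responses]
    mx, my = mean(x), mean(y)
    cov = mean((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

for i in range(len(responses[0])):
    print(f"item {i}: item-total r = {item_total_correlation(responses, i):.2f}")
```

Items that correlate weakly (or negatively) with the rest of the test get flagged and revised; it takes repeated cycles of this kind of analysis, plus harder questions about what the items really measure, before a “fractions” score can be trusted to mean what it claims.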

Second, the proficiency cut points are arbitrary, yet NCLB has everybody worried about the percentage of students above proficiency. This is a national problem, as Andrew Ho laid out so eloquently in his 2008 article in Educational Researcher.
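To see why the cut points matter so much, here is a minimal sketch with invented scores (not the district’s data) showing how the percentage of students “above proficiency”, and even which class looks stronger, shifts with the choice of cut score:

```python
# Hypothetical illustration: the same two classrooms look very different
# depending on where the "proficient" cut score is placed.
# (Invented scores; not real assessment data.)

class_a = [58, 61, 63, 64, 66, 67, 69, 70, 72, 88]
class_b = [40, 45, 50, 55, 71, 73, 74, 75, 76, 77]

def percent_proficient(scores, cut):
    """Percentage of students at or above an arbitrary cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

for cut in (60, 65, 70):
    a = percent_proficient(class_a, cut)
    b = percent_proficient(class_b, cut)
    print(f"cut = {cut}: Class A {a:.0f}% proficient, Class B {b:.0f}% proficient")
```

With the cut at 60, Class A looks far stronger (90% vs. 60%); move it to 70 and the comparison reverses (30% vs. 60%), even though no student’s score has changed. That is the kind of distortion Ho describes: percent-proficient collapses an entire score distribution onto a single threshold, so the story the data tell depends heavily on where that threshold happens to sit.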

In the end, we are sacrificing validity for precision. We think these data reports tell us with great accuracy who is learning what and to what degree. But there is reason to believe that this cascade of errors leaves us with just another sorting and labeling mechanism, one that interferes with real teaching and learning.