FSA Technical Report: Looking For What Isn’t There

“Looking For What Isn’t There,” the Brief Version:

Almost since the day the Department of Education signed the contract with test developer American Institutes for Research (AIR), the validity of the Florida Standards and the Florida Standards Assessments (FSA) has been questioned. First, we were told the FSA had been field tested and validated in Utah. When no documentation was forthcoming, legislators demanded an independent validity study. They hired Alpine Testing Solutions, a company anything but independent from FSA creator AIR, and the validity study was released on 8/31/15. The results were anything but reassuring: it evaluated only 6 of the 17 new assessments and found the use of individual test scores "suspect" but, strangely, supported the use of scores for rating teachers, schools and districts. At the time, we were advised to "look for what isn't there," and we found no evidence the tests were shown to be valid, reliable or fair for at-risk sub-populations of students. "Looking for what isn't there" seemed like good advice, so when the 2015 FSA Technical Report was released, I started looking…

In a nutshell, despite its 897-page length, there seems to be a LOT that isn't in the 2015 FSA Technical Report. To summarize, here is a list of the most obvious omissions:

  • Though the report clearly states that validity depends on how test scores will be used, it appears to evaluate only the validity of score use at the individual student level. The use of test scores to evaluate schools and districts is mentioned, but there is no evidence those uses were ever evaluated.
  • The use of student scores to evaluate teachers (via VAM scores) is completely ignored and is left off the “Required Uses and Citations for the FSA” table, despite such use being statutorily mandated.
  • Despite the previous FCAT 2.0 Technical Report's concerns questioning "the inference that the state's accountability program is making a positive impact on student proficiency and school accountability without causing unintended negative consequences," no evaluation of these implication arguments is made for the FSA (and I don't believe that is because there ARE, in fact, no unintended consequences).
  • Missing attachments to the report include: multiple referenced appendices containing statistical data regarding test construction and administration, any validity documents or mention of the SAGE/Utah field study, any documentation of grade-level appropriateness, and any comparison of an individual's performance on computer-based versus paper-based versions of the test.
  • Results from the required "third-party, independent alignment study" conducted in February 2016 by HumRRO (You guessed it! They are associated with AIR and have a long history with the FLDOE).

Who is responsible for these documents that create more questions than they answer? Why aren't they being held accountable? Why, if virtually our entire accountability system depends on the use of test scores, isn't it a top priority to ensure these tests are fair, valid and reliable? When Jeb Bush said "If you don't measure, you don't really care," was he speaking of test validity? Because it appears the FLDOE really doesn't care.

Want more details? Our full blog is here:

FSA Technical Report: Looking For What Isn’t There

In early April 2016, the FSA Technical Report was FINALLY published. You can read it here. The Florida Department of Education (FLDOE) publishes a technical report annually following state testing. These reports review general information about the construction of the statewide assessments, statistical analysis of the results, and the meaning of scores on these tests. Per the 2014 FCAT Technical Report, these reports are designed to "help educators understand the technical characteristics of the assessments used to measure student achievement." Usually, the report comes out in the December following the spring test administration. This year, the FSA report was expected in January, following the completion of the "cut score process" by the Board of Education on January 6, 2016. Still, there were significant delays beyond what was expected.

When you look at the new FSA report, the first thing you notice is that its format is completely different from the previous technical reports. The 2014 FCAT report was published as one volume, 175 pages long, and referenced a "yearbook" of appendices that contained detailed statistics on the various assessments for the given academic year. The FSA report was published in 7 volumes, totaling 897 pages, and 5 of the 7 volumes reference multiple appendices containing specific statistical data regarding test construction and administration (like Operational Item Statistics, Field Test Item Statistics, Distribution of T Scores, Scale Scores, and Standard Errors) that are NOT attached to the report. (This is the first thing I found that wasn't there.)

What is the definition of Validity?

The two reports have slightly different definitions of “validity.” The 2014 FCAT Report (page 126) defined validity this way:

“Validation is the process of collecting evidence to support inferences from assessment results. A prime consideration in validating a test is determining if the test measures what it purports to measure.”

The 2015 FSA (Volume 4) report’s definition is more convoluted (page 4):

“Validity refers to the degree to which “evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Messick (1989) defines validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment.”

Both of these definitions emphasize evidence and theory to support inferences and interpretations of test scores. They are also perfect examples of the most obvious difference between the two reports: the 2014 report is written in relatively clear and concise English, while the new 2015 report is verbose, complicated and confusing. Seriously, has the definition of validity changed? Is it really necessary to reference sources defining validity? Why does this remind me of when John Oliver said "If you want to do something evil, put it inside something boring"?

More importantly, does either report actually demonstrate that its test measures what it purports to measure? Both definitions caution that it is really the use of the test score that is validated, not the score itself.

Let’s take a quick look at how the FSA scores are used:

[Image: Table 1, "Required Uses and Citations for the FSA" (2015 Volume 1, page 2)]

Table 1 (2015 Volume 1, page 2) delineates how FSA scores are to be used. In addition to student-specific uses, like 3rd grade retention and the high school graduation requirement, required FSA score uses include School Grades, School Improvement Ratings, District Grades, Differentiated Accountability and Opportunity Scholarships.

Interestingly, this list does NOT include Teacher Performance Pay (or VAM calculations), yet the use of student test scores in teacher evaluations is clearly mandated in F.S. 1012.34(3)(a)1. For many, the use of student scores to evaluate teachers is one of the most contentious parts of the state's accountability system. Is this an oversight, or is there reluctance to put the use of VAM to the validity test? Does it matter to the FLDOE whether VAM is a valid assessment of teacher quality, or do they plan on using it regardless? Remember what Jeb said? "If you don't measure, you don't really care," and it appears the FLDOE does not care.
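For readers who have never seen what a "VAM calculation" actually involves, here is a minimal, generic sketch of a covariate-adjusted value-added estimate on simulated data. It illustrates the general idea only; Florida's actual VAM specification is far more elaborate, and nothing in the Technical Report documents or validates it.

```python
"""A generic value-added model (VAM) sketch on simulated data.
The teacher names, score scales, and single prior-score covariate are
assumptions for illustration; this is NOT Florida's actual VAM."""
import numpy as np

rng = np.random.default_rng(1)
true_effects = {"Teacher A": 3.0, "Teacher B": 0.0, "Teacher C": -3.0}  # hypothetical

classes = []
for teacher, effect in true_effects.items():
    prior = rng.normal(300, 20, 100)                           # prior-year scale scores
    current = 60 + 0.8 * prior + effect + rng.normal(0, 10, 100)
    classes.append((teacher, prior, current))

# Step 1: fit one statewide prediction model (current score ~ prior score).
prior_all = np.concatenate([p for _, p, _ in classes])
current_all = np.concatenate([c for _, _, c in classes])
X = np.column_stack([np.ones(prior_all.size), prior_all])
beta, *_ = np.linalg.lstsq(X, current_all, rcond=None)

# Step 2: a teacher's "value added" is the average residual of their students,
# i.e., how far their students land above or below the statewide prediction.
for teacher, prior, current in classes:
    residuals = current - (beta[0] + beta[1] * prior)
    print(f"{teacher}: estimated value added = {residuals.mean():+.2f} points")
```

Whether an estimate like this is a valid measure of teacher quality is exactly the question the report never asks.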

So, did these Technical Reports validate their respective tests for these uses?

Not that I can tell.

Neither report includes (as far as I can tell) evaluations confirming the use of these tests for the determination of grade promotion or high school graduation. Indeed, the Alpine Validity report cautions against the use of individual scores, calling them "suspect" for some students. There appears to have been no attempt to document that FCAT or FSA test scores can accurately rate schools, districts or teachers.

In fact, the 2014 FCAT 2.0 Technical Report offered cautions about such use on page 137:

“At the aggregate level (i.e., school, district, or statewide), the implication validity of school accountability assessments can be judged by the impact the testing program has on the overall proficiency of students. Validity evidence for this level of inference will result from examining changes over time in the percentage of students classified as proficient. As mentioned before, there exists a potential for negative impacts on schools as well, such as increased dropout rates and narrowing of the curriculum. Future validity studies need to investigate possible unintended negative effects as well.”

The "Summary of Validity Evidence" in the 2014 Report is telling. While its authors conclude that the assessments appeared to be properly scored and that the scores could be generalized to the universe score for the individual, they had significant concerns regarding the extrapolation and implication arguments (emphasis mine):

“Less strong is the empirical evidence for extrapolation and implication. This is due in part to the absence of criterion studies. Because an ideal criterion for the FCAT 2.0 or EOC assessments probably cannot be found, empirical evidence for the extrapolation argument may need to come from several studies showing convergent validity evidence. Further studies are also needed to verify some implication arguments. This is especially true for the inference that the state’s accountability program is making a positive impact on student proficiency and school accountability without causing unintended negative consequences.”

In April 2015, I emailed Benjamin Palazesi, from the FLDOE, asking if such “further studies” were ever done to verify the implication arguments, as suggested in the FCAT 2.0 Report. His response? “Since the FCAT 2.0 Reading and Mathematics and Algebra and Geometry EOC Assessments are being replaced by the Florida Standards Assessments (FSA) in these subject areas, there are no plans to conduct a criterion study on these assessments, and we will evaluate the need for additional studies for FSA.”

Hmmm, there is no mention of implication arguments at all in the FSA Report. Do you think they believe there are no unintended negative consequences of the state's accountability program? Maybe they don't read our blog… unintended consequences seem to be a specialty of Florida's accountabaloney system. Eventually, the FLDOE will need to recognize and address the many unintended consequences of their current system, or such consequences can no longer be considered "unintended."

The validity of the FSA has been in question since it was announced that its questions would be "borrowed" from Utah's Student Assessment of Growth and Excellence (SAGE). On March 4, 2015 (watch the questioning here at 58:26 or read about the "fallout" here), Commissioner Pam Stewart testified in front of the Senate Education Appropriations Subcommittee and proclaimed that the FSA was field tested in Utah and that it was "absolutely psychometrically valid and reliable." At the time, Ms. Stewart promised to provide documentation to the Senate Education Subcommittee members. Some school board members from Utah were also interested in these documents, as they had not yet received any formal documentation regarding the validity or reliability of their own state test, SAGE (see letter from the Utah School Board here). No documents were ever delivered and, SURPRISE, there is no evidence of a "field test" in Utah or any SAGE validity documents in this 2015 Technical Report, either. (Now might be a good time for Commissioner Stewart to apologize for misleading the Senate Education Subcommittee.)

The 2015 Technical Report does include both the legislatively mandated Alpine Validity Study (Volume 7, Chapter 7) AND the Alpine presentation to the Senate (Volume 7, Chapter 6). Remember, the Alpine Validity Study, because of time constraints, chose not to assess validity for 11 of the 17 FSA tests, including the Algebra 2 and Geometry EOCs. The Alpine study also did NOT assess validity or fairness for at-risk populations of students, like ESE students or English Language Learners.

Another thing missing from these reports is any assurance that the level of performance tested is grade-level appropriate. Neither technical report compared student performance on the FSA/FCAT to performance on a nationally normed test. There is no evidence as to whether the 3rd grade Reading FSA, for example, actually tests 3rd grade reading levels (yet students are retained based on its results). This, I believe, is a major concern for parents and has seemingly been disregarded by the state in the pursuit of "rigor." Again, "if you don't measure, you don't really care," and the FLDOE appears not to care if children who can actually read at a 3rd grade level have been marked for retention.

There is a brief mention (one paragraph) of statistical fairness in items (Volume 4, page 60), utilizing Differential Item Functioning (DIF) analysis: "DIF analyses were conducted for all items to detect potential item bias from a statistical perspective across major ethnic and gender groups" (male/female, White/African American/Hispanic, English Language Learners and students with disabilities). DIF was also used in the 2014 report, but there it seems to have been used to eliminate biased questions that were being field tested. In the 2015 report, the DIF analysis appears to be offered as assurance of fairness across subpopulations.

In section 5.2, Volume 1, page 20, DIF analysis is described (emphasis mine).

“Identifying DIF was important because it provided a statistical indicator that an item may contain cultural or other bias. DIF-flagged items were further examined by content experts who were asked to reexamine each flagged item to make a decision about whether the item should have been excluded from the pool due to bias. Not all items that exhibit DIF are biased; characteristics of the educational system may also lead to DIF. For example, if schools in certain areas are less likely to offer rigorous Geometry classes, students at those schools might perform more poorly on Geometry items than would be expected, given their proficiency on other types of items. In this example, it is not the item that exhibits bias but rather the instruction.”

I am not a psychometrician, but I do wonder how a test that is used to rate not only students but also schools can determine that a question is not biased because students came from a low-performing school, especially since many "low-performing schools" contain an overrepresentation of students from at-risk sub-populations. Regardless, I suspect that determining whether individual test questions are biased is not the same thing as evaluating whether a test is fair and valid for those at-risk populations.
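To make the procedure the report describes a little more concrete, here is a minimal, illustrative sketch of a Mantel-Haenszel DIF check on a single item. The simulated responses, group labels and ETS-style flagging cutoffs are generic conventions assumed for illustration; they are not taken from AIR's actual FSA analysis. Note how the method only flags an individual question, which is a much narrower claim than "the test is fair for this subpopulation."

```python
"""An illustrative Mantel-Haenszel DIF check for a single test item.
A minimal sketch on simulated responses; group labels, score strata,
and flagging cutoffs are generic conventions, not AIR's procedure."""
import math
import random
from collections import defaultdict

random.seed(0)

# Simulated item responses: (score_stratum, group, answered_correctly).
# "reference" vs. "focal" are the conventional DIF group labels.
responses = []
for _ in range(2000):
    stratum = random.randint(0, 4)              # students matched on total test score
    group = random.choice(["reference", "focal"])
    p_correct = 0.3 + 0.1 * stratum             # same item difficulty for both groups here
    responses.append((stratum, group, random.random() < p_correct))

# Build a 2x2 table (group x correct/incorrect) within each score stratum.
tables = defaultdict(lambda: {"A": 0, "B": 0, "C": 0, "D": 0})
for stratum, group, correct in responses:
    cell = tables[stratum]
    if group == "reference":
        cell["A" if correct else "B"] += 1
    else:
        cell["C" if correct else "D"] += 1

# Mantel-Haenszel common odds ratio across strata.
num = den = 0.0
for cell in tables.values():
    n = sum(cell.values())
    num += cell["A"] * cell["D"] / n    # reference correct x focal incorrect
    den += cell["B"] * cell["C"] / n    # reference incorrect x focal correct
alpha_mh = num / den
delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta scale

# ETS-style categories (simplified): A = negligible, B/C = flag for content review.
if abs(delta_mh) < 1.0:
    flag = "A (negligible DIF)"
elif abs(delta_mh) < 1.5:
    flag = "B (moderate DIF, send to content review)"
else:
    flag = "C (large DIF, send to content review)"
print(f"MH odds ratio = {alpha_mh:.2f}, delta = {delta_mh:.2f}, flag = {flag}")
```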

Recent reports demonstrated that students who took the paper/pencil version of the PARCC test obtained higher scores than those who took the computer version. Was this evaluated for the FSA? Not in this Technical Report, where the only evaluation of the paper/pencil test appears to be content alignment for students with accommodations.
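For what it's worth, a basic mode-comparability check is not complicated. Below is a minimal sketch, on simulated data, of the kind of paper-versus-computer comparison one might expect such a report to include; the numbers are invented, and a real comparability study would also have to account for how students were assigned to each mode.

```python
"""A minimal paper-vs-computer mode comparison on simulated scale scores.
The values are invented for illustration; this is not actual FSA data."""
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
paper = rng.normal(loc=302, scale=20, size=500)       # hypothetical paper/pencil scores
computer = rng.normal(loc=298, scale=20, size=5000)   # hypothetical computer-based scores

# Welch's t-test: is there a statistically detectable difference in mean score?
t_stat, p_value = stats.ttest_ind(paper, computer, equal_var=False)

# Cohen's d: is the difference large enough to matter in practice?
pooled_sd = np.sqrt((paper.var(ddof=1) + computer.var(ddof=1)) / 2)
cohens_d = (paper.mean() - computer.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```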

One more thing to add to our list of things not in the report: page 20, section 3.5 of Volume 2 of the 2015 report states, "A third-party, independent alignment study was conducted in February 2016. Those results will be included in the 2015-2016 FSA Technical Report." According to FLDOE Deputy Commissioner Vince Verges, that report is being completed by the Human Resources Research Organization (HumRRO), who (SURPRISE!) were thanked, along with AIR, in the acknowledgements of the Alpine Validity Study as one of the "organizations and individuals that serve as vendors for the components of FSA that were included in the evaluation." Indeed, HumRRO has a long history with the FLDOE (read about it here). Seriously, the DOE needs a dictionary with the definitions of "third-party" and "independent," because HumRRO might be neither.

After looking at what isn't in the FSA Technical Report, I have come to a few conclusions:

  1. There was never a valid field test in Utah.
  2. The Alpine Validity Study was incomplete.
  3. These technical reports are more about describing test construction and administration than confirming validity.
  4. There remains no evidence that the FSA is fair, valid and reliable for at-risk subpopulations of students or for the myriad uses beyond the individual student.

Yet, we continue to use the FSA to retain students, deny diplomas and rank teachers, schools and districts. That is accountabaloney.

Why has so little time been spent assuring the validity of these tests? The FSA is the cornerstone of Florida's education accountability system, yet serious attention has never been paid to establishing its validity. Could it be because, as Jeb Bush has said so many times, "If you don't measure, you don't really care"? I am beginning to believe that is true, and I am wondering whom we should hold accountable for that.

SB1360: Baloney on Rye ADDENDUM

This is an addendum to our previous blog, "SB1360: Baloney on Rye is Still Full of Baloney":

It has been brought to our attention that it is unclear whether the SAT score targets described in SB 1360 reflect scores from the current SAT or from the "newly designed" SAT, which will debut later this year (info here and here). The redesigned SAT will have a maximum score of 1600, compared to the current SAT maximum of 2400. Since the new exam has yet to be administered, the percentile ranking of scores on it can only be predicted. It is estimated that a score of 1200 (the score required to be exempt from Florida's U.S. History EOC) will be closer to the 75th percentile on the new SAT (not the 15th percentile we stated in our blog).

Additional comparisons with SB1360's target scores for the ACT suggest that the exemption thresholds for the Algebra 1, Geometry and Algebra 2 EOCs may fall closer to the 50th through 75th percentiles. So, SB1360's required scores may be more "rigorous" than we first thought, but will they be appropriate? It turns out neither the old nor the new SAT assesses math skills beyond basic geometry. Why are we allowing scores on an assessment that does not test beyond basic geometry to exempt students from their Algebra 2 EOC, which covers up to trigonometry concepts? We hope the Senate Education committees will address this.

Since there are dramatic differences between the performance levels associated with the same reported score, we feel SB1360 needs to define exactly which SAT exam (old or new) it is referring to. We also question why Florida would put into statute target scores from an exam that has yet to be administered (even if it does have the same name). Are Florida students expected to field test the new SAT and then have those scores used for accountability purposes? Remember how well that worked out for the 2015 FSA?
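To see why naming the exam matters, consider a tiny sketch of how the same reported number lands at very different percentiles on the two scales. The lookup tables below are rough, assumed values consistent with the estimates above; they are not official College Board concordance data.

```python
"""Same number, different meaning: a 1200 on the old (2400-point) SAT vs.
the redesigned (1600-point) SAT. Percentile tables are rough assumed
values for illustration only, not official concordance data."""

OLD_SAT_PERCENTILES = {1200: 15, 1500: 50, 1800: 79, 2100: 96}   # 2400-point scale (assumed)
NEW_SAT_PERCENTILES = {800: 15, 1000: 45, 1200: 75, 1400: 94}    # 1600-point scale (assumed)

def approx_percentile(score: int, table: dict) -> int:
    """Return the percentile of the closest tabulated score (rough lookup)."""
    closest = min(table, key=lambda s: abs(s - score))
    return table[closest]

target = 1200  # SB1360's proposed U.S. History EOC exemption threshold
print(f"1200 on the old SAT: ~{approx_percentile(target, OLD_SAT_PERCENTILES)}th percentile")
print(f"1200 on the new SAT: ~{approx_percentile(target, NEW_SAT_PERCENTILES)}th percentile")
```

The statute's number stays the same; the standard it represents does not.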

Also, there are significant concerns regarding the math portions of the newly designed SAT, especially for low-income students and English language learners (more here). The new format of the math questions will require higher-level verbal and reasoning skills and is predicted to put English language learners and low-income children at a significant disadvantage. Given the ever-increasing population of low-income, English language learner, and immigrant students in our public schools (Miami-Dade is currently expecting ~8,000 new immigrant students this school year), why is Florida choosing an exam that would put those students, and their schools and districts, at a distinct disadvantage? How is that a fair accountability assessment?

Our initial blog may have underestimated the “rigor” of SB1360’s target scores. If they represent scores from the new SAT, they may be more “rigorous” than we thought. Does this make us feel any better about this bill?

No.

Reviewing the new SAT only raises more questions about the fairness of an accountability system that uses these scores as metrics and in this manner. Students with high standardized test scores (even in subjects unrelated to the course they are taking) will be exempt from taking final exams/EOCs. Students with lower standardized test scores (many of whom will be immigrants, disadvantaged and/or English language learners) will not only be required to take the exams, but those exams will be worth 30% of their course grades and (for Algebra 1) passing will be required for graduation. "Smart kids" (often wealthier, white students) will no longer need to take the U.S. History or other state EOCs. They will be exempt from the Algebra 2 EOC based on scores that don't test the course content; their course grades will reflect their classroom performance and will not suffer from poor performance on the EOC. It appears that students and schools with high test scores (like those in Seminole County, which has been lobbying hard for this bill, originally calling it the "Seminole Solution") will face significantly less testing than their less advantaged counterparts.

This does not describe a fair, equitable, uniform education system. This describes the misuse of standardized test scores.

This will not “fix” anything.

This is Accountabaloney.

Florida’s Middle School Math Problems: A Perfect Storm

Why did Florida's 8th grade National Assessment of Educational Progress (NAEP) math scores plummet? We believe it may be the result of a "perfect storm" created by Common Core Math and Accountabaloney…

Florida State University physics professor Paul Cottle, in a December 5, 2015, op-ed in the Tallahassee Democrat (read it here), sounded the alarm regarding recent dismal middle school math performance:

“Florida’s middle schools have fallen off a cliff in math, according to recently released results from the National Assessment of Educational Progress, an exam given to a sampling of students in nearly all states.

When NAEP was administered in 2013, it determined that 31 percent of Florida’s eighth graders were proficient in math. That was below the 2013 national average. But the news this year was much worse: Only 26 percent of the state’s eighth graders were found to be proficient in that subject.

It’s important to note that the national math proficiency rate for eighth graders declined as well – from 35 percent in 2013 to 33 percent in 2015.

But Florida’s decline was the nation’s largest. You might think that Florida’s educational leaders would mobilize an effort to address this crisis in middle school math. But you’d be wrong.”

Mr. Cottle goes on to suggest that attracting mathematically talented young people to teaching is the solution to our middle school math problems. While that may be part of the solution, we think the problem goes much deeper than teacher quality.

We do agree with Mr. Cottle, however, when he says, "There is one thing for sure: Pretending the problem doesn't exist isn't going to make it go away."

Lennie Jarratt wrote a column on 12/2/2015 discussing declining NAEP scores since the adoption of the Common Core State Standards (CCSS). The column reported that the 2015 NAEP scores showed "an across-the-board decrease in math test scores," the first drop in 25 years. It also pointed out that, after breaking down the data, the decline was greater in states that had adopted the Common Core math standards, and it voiced concern that the standards might be to blame:

“The math techniques now associated with Common Core-aligned math are solidly entrenched in many public education systems across the nation, even though in 2006 the National Council of Teachers of Mathematics called for an end to these techniques and a return to teaching the basics, i.e. direct instruction and memorization of basic facts. These basics provide a solid foundation for understanding, learning, and building future math concepts. Teachers who use Common Core-aligned math are similar to those who attempt to build a house without a foundation; the house is destined to crumble.”

When the National Council of Teachers of Mathematics is calling for a return to basics, one wonders why policymakers would not listen.

What has changed since the transition to the Common Core? In a column for the Brookings Institution, Tom Loveless outlines the differences between math instruction before and after the adoption of the CCSS. In a nutshell, he describes how the CCSS math sequence delays some basic math instruction, resulting in 6th graders now practicing basic division algorithms when they used to be focusing on the study of rational numbers (fractions, decimals, percentages).

Since the Florida Standards curriculum closely aligns with the Common Core, this means that Florida students are now wasting time on basic math into middle school, time that should be spent on the study of rational numbers.

Middle school math in Florida is also being squeezed from the top end, because there is a greater and greater push to move Algebra and Geometry classes into 6th and 7th grade. Florida's current A-F school grading system rewards schools that place students into these advanced math courses. Schools have responded by placing more and more students into Algebra 1, whether they are ready or not. Last fall, parents in Orange County were outraged to learn that their middle school Algebra 1 students had simultaneously been placed in remedial math courses, presumably to give those students extra time to prepare for the Algebra 1 End of Course exam. Placing students into advanced math when they are not properly prepared seems ill-advised.

In addition, the course content in Florida's Algebra 1 and 2 courses no longer resembles the courses students took pre-CCSS. In an attempt to make the Florida Standards more "rigorous" than regular CCSS, Florida added dozens of advanced math standards to the upper end of high school math. The trickle-down effect has resulted in approximately 1/3 of the previous Algebra 1 content now being taught in pre-algebra, 1/3 of the previous Algebra 2 content now being taught in Algebra 1, and Trigonometry now being taught in Algebra 2. This shifting of advanced content into lower-level math courses exacerbates the squeeze on time available to learn traditional middle school math material (fractions, decimals, percentages).

The shift has been so dramatic that we question whether middle school Algebra 1 teachers (with a middle school math credential which covers math content up to grade 9) are teaching out of subject when they are required to teach Algebra 2 content. Florida has a law that says families must be notified if their child’s teacher is teaching out of subject. Is your child’s Algebra 1 teacher certified to teach Algebra 2 content?

The combination of Common Core Math's delay of basic math instruction and Florida's drive to reward pushing increasingly difficult math into middle school has created a perfect storm. Its intensity is reflected in Florida's plummeting 8th grade math scores. It is another clear example of the destructive force of accountabaloney.

NAEP scores sounded the storm warning. We think that Florida’s educational leaders should be mobilizing an effort to address this crisis in middle school math. Do they have any such plans? Can they smell the baloney? We can only hope…

Batten down the hatches.

 

DISCLAIMER and a Call for Help from Florida's Math Teachers: It is not entirely clear what or who is responsible for placing Algebra 2 content into Florida's Algebra 1 course standards. Is it part of the CCSS or unique to the Florida Standards? We are not sure. If anyone has information regarding this or any other impacts of accountabaloney on Florida's math sequence, please contact us.