“Looking For What Isn’t There,” the Brief Version:
Almost since the day the Department of Education signed the contract with test developer American Institutes for Research (AIR), the validity of the Florida Standards and the Florida Standards Assessments (FSA) has been questioned. First, we were told the FSA had been field tested and validated in Utah. When no documentation was forthcoming, legislators demanded an independent validity study. They hired Alpine Testing Solutions, a company anything but independent from FSA creator AIR, and the validity study was released on 8/31/15. The results were anything but reassuring: the study evaluated only 6 of the 17 new assessments and found the use of individual test scores “suspect” but, strangely, supported the use of scores for the rating of teachers, schools and districts. At the time, we were advised to “look for what isn’t there,” and we found no evidence the tests were shown to be valid, reliable or fair for at-risk sub-populations of students. “Looking for what isn’t there” seemed like good advice, so when the 2015 FSA Technical Report was released, I started looking…
In a nutshell, despite its 897-page length, there seems to be a LOT that isn’t in the 2015 FSA Technical Report. To summarize, here is a list of the most obvious omissions:
- Though the report clearly states that validity is dependent on how test scores will be used, this report seems to only evaluate the validity of the use of scores at the individual student level. The use of test scores to evaluate schools and districts is mentioned but there is no evidence those uses were ever evaluated.
- The use of student scores to evaluate teachers (via VAM scores) is completely ignored and is left off the “Required Uses and Citations for the FSA” table, despite such use being statutorily mandated.
- Despite the previous FCAT 2.0 Technical Report’s concerns questioning “the inference that the state’s accountability program is making a positive impact on student proficiency and school accountability without causing unintended negative consequences,” no evaluation of these implication arguments is made for the FSA (and I don’t believe that is because there ARE, in fact, no unintended consequences).
- Missing attachments to the report include: multiple reported appendices containing statistical data regarding test construction and administration, any validity documents or mention of the SAGE/Utah field study, any documentation of grade-level appropriateness, and any comparison of an individual’s performance on both a computer-based and a paper-based test.
- Results from the required “third-party, independent alignment study” conducted in February 2016 by HumRRO (You guessed it! They are associated with AIR and they have a long history with the FLDOE).
Who is responsible for these documents that create more questions than they answer? Why aren’t they being held accountable? Why, if virtually our entire Accountability system is dependent on the use of test scores, isn’t it a top priority to ensure these tests are fair, valid and reliable? When Jeb Bush said “If you don’t measure, you don’t really care,” was he speaking of test validity? Because it appears the FLDOE really doesn’t care.
Want more details? Our full blog is here:
FSA Technical Report: Looking For What Isn’t There
In early April 2016, the FSA Technical Report was FINALLY published. You can read it here. The Florida Department of Education (FLDOE) publishes a technical report annually following state testing. In general, these reports review the construction of the statewide assessments, statistical analysis of the results, and the meaning of scores on these tests. Per the 2014 FCAT Technical Report, these reports are designed to “help educators understand the technical characteristics of the assessments used to measure student achievement.” Usually, the report comes out in the December following the spring test administration. This year, the FSA report was expected in January, following the completion of the “cut score process” by the Board of Education on January 6, 2016. Still, there were significant delays beyond what was expected.
When you look at the new FSA report, the first thing you notice is that its format is completely different from that of the previous technical reports. The 2014 FCAT report was published as one volume, 175 pages long, and referenced a “yearbook” of appendices that contained detailed statistics on the various assessments for the given academic year. The FSA report was published in 7 volumes, totaling 897 pages, and 5 of the 7 volumes reference multiple appendices containing specific statistical data regarding test construction and administration (like Operational Item Statistics, Field Test Item Statistics, Distribution of T Scores, Scale Scores, and Standard Errors) that are NOT attached to the report. (This is the first thing I found that wasn’t there.)
What is the definition of Validity?
The two reports have slightly different definitions of “validity.” The 2014 FCAT Report (page 126) defined validity this way:
“Validation is the process of collecting evidence to support inferences from assessment results. A prime consideration in validating a test is determining if the test measures what it purports to measure.”
The 2015 FSA (Volume 4) report’s definition is more convoluted (page 4):
“Validity refers to the degree to which ‘evidence and theory support the interpretations of test scores entailed by proposed uses of tests’ (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Messick (1989) defines validity as ‘an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment.’”
Both of these definitions emphasize evidence and theory to support inferences and interpretations of test scores. They are perfect examples of the most obvious difference between the two reports. The 2014 report is written in relatively clear and concise English and the new, 2015 report is verbose, complicated and confusing. Seriously, has the definition of validity changed? Is it really necessary to reference sources defining validity? Why does this remind me of when John Oliver said “If you want to do something evil, put it inside something boring”?
More importantly, does either report actually measure what it purports to? Both definitions caution that it is really the use of the test score that is validated, not the score itself.
Let’s take a quick look at how the FSA scores are used:
Table 1 (2015 Volume 1 page 2) delineates how FSA scores are to be used. In addition to student-specific uses, like 3rd grade retention and high school graduation requirements, required FSA score uses include School Grades, School Improvement Ratings, District Grades, Differentiated Accountability and Opportunity Scholarship.
Interestingly, this list does NOT include Teacher Performance Pay (or VAM calculations) yet the use of student test scores is clearly mandated in F.S. 1012.34(3)(a)1. For many, the use of student scores to evaluate teachers is one of the most contentious parts of the state’s accountability system. Is this an oversight or is there reluctance to put the use of VAM to the validity test? Does it matter to the FLDOE whether VAM is a valid assessment of teacher quality or do they plan on using it regardless? Remember what Jeb said? “If you don’t measure, you don’t really care,” and it appears the FLDOE does not care.
So, did these Technical Reports validate their respective tests for these uses?
Not that I can tell.
Neither report includes (as far as I can tell) evaluations confirming the use of these tests for the determination of grade promotion or high school graduation. Indeed, the Alpine Validity report cautions against the use of individual scores, calling them “suspect” for some students. There appears to have been no attempt to document that FCAT or FSA test scores can accurately rate schools, districts or teachers.
In fact, the 2014 FCAT 2.0 Technical Report cautioned against such use on page 137.
“At the aggregate level (i.e., school, district, or statewide), the implication validity of school accountability assessments can be judged by the impact the testing program has on the overall proficiency of students. Validity evidence for this level of inference will result from examining changes over time in the percentage of students classified as proficient. As mentioned before, there exists a potential for negative impacts on schools as well, such as increased dropout rates and narrowing of the curriculum. Future validity studies need to investigate possible unintended negative effects as well.”
The “Summary of Validity Evidence” in the 2014 Report is telling. While they conclude that the assessments appeared to be properly scored and the scores could be generalized to the universe score for the individual, they had significant concerns regarding the extrapolation and implication arguments (emphasis mine):
“Less strong is the empirical evidence for extrapolation and implication. This is due in part to the absence of criterion studies. Because an ideal criterion for the FCAT 2.0 or EOC assessments probably cannot be found, empirical evidence for the extrapolation argument may need to come from several studies showing convergent validity evidence. Further studies are also needed to verify some implication arguments. This is especially true for the inference that the state’s accountability program is making a positive impact on student proficiency and school accountability without causing unintended negative consequences.”
In April 2015, I emailed Benjamin Palazesi, from the FLDOE, asking if such “further studies” were ever done to verify the implication arguments, as suggested in the FCAT 2.0 Report. His response? “Since the FCAT 2.0 Reading and Mathematics and Algebra and Geometry EOC Assessments are being replaced by the Florida Standards Assessments (FSA) in these subject areas, there are no plans to conduct a criterion study on these assessments, and we will evaluate the need for additional studies for FSA.”
Hmmm, there is no mention of implication arguments at all in the FSA Report. Do you think they believe there are no unintended negative consequences due to the state’s accountability program? Maybe they don’t read our blog… unintended consequences seem to be a speciality of Florida’s accountabaloney system. Eventually, the FLDOE will need to recognize and address the many unintended consequences of their current system or such consequences can no longer be considered “unintended.”
The validity of the FSA has been in question since it was announced its questions would be “borrowed” from Utah’s Student Assessment of Growth and Excellence (SAGE). On March 4, 2015 (watch the questioning here at 58:26 or read about the “fall out” here), Commissioner Pam Stewart testified in front of the Senate Education Appropriations Subcommittee and proclaimed that the FSA was field tested in Utah and that it was “absolutely psychometrically valid and reliable.” At the time, Ms. Stewart promised to provide documentation to the Senate Education Subcommittee members. Some school board members from Utah were also interested in these documents, as they had not yet received any formal documentation regarding the validity or reliability of their own state test, SAGE (see letter from the Utah School Board here). No documents were ever delivered and, SURPRISE, there is no evidence of a “field test” in Utah or any SAGE Validity documents in this 2015 Technical report, either. (Now might be a good time for Commissioner Stewart to apologize for misleading this Senate Education Subcommittee.)
The 2015 Technical report does include both the legislatively mandated Alpine Validity Study (Volume 7 Chapter 7) AND the Alpine presentation to the Senate (Volume 7 Chapter 6). Remember, the Alpine Validity Study, because of time constraints, chose not to assess validity for 11 of the 17 FSA tests, including the Algebra 2 and Geometry EOC. The Alpine study also did NOT assess validity or fairness for at-risk populations of students, like ESE or English Language Learners.
Another thing missing from these reports is any assurance that the level of performance tested is grade level appropriate. Neither technical report compared student performance on the FSA/FCAT to performance on a nationally normed test. There is no measurement as to whether the 3rd grade Reading FSA, for example, actually tests 3rd grade reading levels (yet students are retained based on its results). This, I believe, is a major concern for parents and has been seemingly disregarded by the state in the pursuit of “rigor.” Again, “if you don’t measure, you don’t really care” and the FLDOE appears not to care if children who can actually read at a 3rd grade level have been marked for retention.
There is a brief mention (1 paragraph) of statistical fairness in items (Volume 4 page 60), utilizing Differential Item Functioning (DIF) analysis. “DIF analyses were conducted for all items to detect potential item bias from a statistical perspective across major ethnic and gender groups” (Male/female, White/African American/Hispanic, English Language Learner and Students with disabilities). DIF was also used in the 2014 report, but there it seems to have been used to eliminate biased questions that were being field tested. In the 2015 report, the implication is that the DIF analysis assures fairness across subpopulations.
In section 5.2, Volume 1, page 20, DIF analysis is described (emphasis mine).
“Identifying DIF was important because it provided a statistical indicator that an item may contain cultural or other bias. DIF-flagged items were further examined by content experts who were asked to reexamine each flagged item to make a decision about whether the item should have been excluded from the pool due to bias. Not all items that exhibit DIF are biased; characteristics of the educational system may also lead to DIF. For example, if schools in certain areas are less likely to offer rigorous Geometry classes, students at those schools might perform more poorly on Geometry items than would be expected, given their proficiency on other types of items. In this example, it is not the item that exhibits bias but rather the instruction.”
I am not a psychometrician, but I do wonder how a test that is used to rate not only students but also schools can determine a question is not biased because students came from a low-performing school, especially since many “low-performing schools” contain an overrepresentation of students from at-risk sub-populations. Regardless, I suspect that determining whether individual test questions are biased is not the same thing as evaluating whether a test is fair and valid for those at-risk populations.
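For readers who want to see what a DIF statistic looks like in practice, here is a minimal sketch in Python. The passages quoted above do not say which DIF statistic was used, so this assumes the Mantel-Haenszel procedure, one common choice, purely as an illustration. The function name, the group labels and every number in it are made up; none of it comes from the FSA report.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_dif(rows):
    """Illustrative Mantel-Haenszel DIF for ONE test item (hypothetical helper).

    rows: iterable of (group, total_score, correct) where group is "ref" or
    "focal", total_score is the student's overall test score, and correct is
    1 or 0 for this item. Returns (alpha, delta): the common odds ratio and
    the same value on the ETS delta scale.
    """
    # One 2x2 table per total-score level, so students are only compared
    # with students of similar overall proficiency.
    # Layout: [ref right, ref wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in rows:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("not enough data to estimate the odds ratio")

    alpha = num / den           # 1.0 means no DIF
    delta = -2.35 * log(alpha)  # negative: item is harder for the focal group
    return alpha, delta


def expand(group, score, right, wrong):
    """Build made-up response rows for one group at one total-score level."""
    return [(group, score, 1)] * right + [(group, score, 0)] * wrong


# Made-up data: at each matched score level the focal group gets this item
# right less often than the reference group.
rows = (
    expand("ref", 10, 30, 10) + expand("focal", 10, 20, 20)
    + expand("ref", 20, 36, 4) + expand("focal", 20, 30, 10)
)
alpha, delta = mantel_haenszel_dif(rows)
print(round(alpha, 2), round(delta, 2))  # 3.0 -2.58
```

Notice that the comparison is made only among students with the same total score. A statistic like this can flag an individual question, but it takes the total score as its yardstick, so it cannot tell you whether the total score itself is fair to a group of students.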
Recent reports demonstrated that students who took the paper/pencil version of the PARCC test obtained higher scores than those who took the computer version. Was this evaluated for the FSA? Not in this Technical report, where the only evaluation of the paper/pencil test appears to be content alignment for students with accommodations.
One more thing to add to our list of things not in the report: page 20, section 3.5 of Volume 2 of the 2015 report states “A third-party, independent alignment study was conducted in February 2016. Those results will be included in the 2015-2016 FSA Technical Report.” According to FLDOE Deputy Commissioner Vince Verges, that report is being completed by Human Resources Research Organization (HumRRO) who (SURPRISE!) were thanked, along with AIR, in the acknowledgements of the Alpine Validity Study as “organizations and individuals that serve as vendors for the components of FSA that were included in the evaluation.” Indeed, HumRRO has a long history with the FLDOE (read about it here). Seriously, the DOE needs a dictionary with the definitions of “third-party” and “independent” because HumRRO might be neither.
After looking for what isn’t in the FSA Technical Report, I have come to a few conclusions:
- There was never a valid field test in Utah.
- The Alpine Validity Study was incomplete.
- These technical reports are more about describing test construction and administration than confirming validity.
- There remains no evidence that the FSA is fair, valid and reliable for at-risk subpopulations of students or the myriad of uses outside of the individual student.
Yet, we continue to use the FSA to retain students, deny diplomas and rank teachers, schools and districts. That is accountabaloney.
Why has so little time been spent assuring the validity of these tests? The FSA is the cornerstone of Florida’s education accountability system. Why hasn’t serious attention been paid to assuring its validity? Could it be because, as Jeb Bush has said so many times, “If you don’t measure, you don’t really care”? I am beginning to believe that is true, and I am wondering whom we should hold accountable for that.