In the last post I talked about one method to estimate distributional differences from ordinal data, such as those reported by statewide accountability systems. In this post, we’ll put this method to work for the state of California. I’ll show how we can estimate school-level Hispanic/White achievement gaps for every school in the state that reports data on both groups. In California, this means the school must have at least 30 students in each group, for the corresponding grade.
The data
The primary data we’ll be looking at are available here. As I mentioned in the previous post, part of what I think is so cool about this method is that these data are reported across all states, so you could apply this method with any state. I chose California here because I have some experience with their specific data, I’m a west-coaster, and California is more interesting than Oregon (where I live) because it is much more diverse and has areas of dense population.
These data have a number of numeric codes in them that don’t make much sense without the code book, which is available here.
I’m also always interested in geographic variance in social things, including school performance, so I also like to try to grab the longitude and latitude of the schools. That’s available through a separate file, available here. Note that geographic information is available more generally for every public school in the country through the National Center for Education Statistics (NCES) Education Demographic and Geographic Estimates (EDGE) program.
Loading the data
We could, of course, just visit these websites and pull the data down and load it in manually, but that’s no fun. This is R. Let’s do it through code!
The file we want is at http://www3.cde.ca.gov/caasppresearchfiles/2018/sb/sb_ca2018_all_csv_v3.zip. The tricky part is, it’s in a zip file with one other file. One way to handle this is by creating a temporary directory, downloading the zip file there, then unzipping the file and pulling just the data we want out. In our case, the filename is the same as the zip file, but with a .txt extension. I’ll be using the tidyverse later anyway so I’ll do something like this
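A minimal sketch of that temp-directory approach follows. The helper name `read_zipped_file` is hypothetical, and I'm assuming the extracted .txt file is comma-delimited (the "csv" in the filename suggests so), which is why it uses `read_csv()`.

```r
library(readr)

# Hypothetical helper: download a zip archive into a temporary directory,
# extract only the named file, and read it as a comma-delimited table.
# Assumption: the extracted file is comma-delimited.
read_zipped_file <- function(url, file_in_zip) {
  tmp <- tempdir()
  zip_path <- file.path(tmp, basename(url))
  download.file(url, zip_path, mode = "wb")          # "wb" keeps the binary zip intact
  unzip(zip_path, files = file_in_zip, exdir = tmp)  # pull out just the one file
  read_csv(file.path(tmp, file_in_zip))
}

# Usage (needs network access, so it is left commented out):
# d <- read_zipped_file(
#   "http://www3.cde.ca.gov/caasppresearchfiles/2018/sb/sb_ca2018_all_csv_v3.zip",
#   "sb_ca2018_all_csv_v3.txt"
# )
```

Extracting a single named file avoids unpacking the second file in the archive, which we don't need.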
That gives us the basic file we want, but we don’t know what any of the subgroup IDs represent. To get that, we’ll have to download another datafile. This is another zip file, but note I’m using a slightly different approach below, which I can do because the zip file only contains a single file.
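Once the lookup file is loaded, attaching the labels is a join on the subgroup ID. Here is a small sketch with made-up data. The column names (`subgroup_id`, `student_group`, `subgroup`) and the ID values are assumptions for illustration, not the names used in the actual files.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the main file (hypothetical column names and IDs)
scores <- tibble(
  school_code = c("0100016", "0100016", "0100024"),
  subgroup_id = c(1, 2, 1),
  percentage_standard_met = c(20, 35, 25)
)

# Toy stand-in for the subgroup lookup file
subgroups <- tibble(
  subgroup_id = c(1, 2),
  student_group = "Ethnicity",
  subgroup = c("Hispanic or Latino", "White")
)

# Keep every row of the scores and add the matching labels
labelled <- left_join(scores, subgroups, by = "subgroup_id")
```

A left join keeps every row of the main file even if an ID has no match in the lookup, so unmatched IDs show up as `NA` rather than silently disappearing.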
It’s fairly difficult to see what’s going on here so let’s limit our data to only the things we really care about here. We’ll need the district and school codes, the group variables we just added in, and all the percentage in each category.
Now we’ll limit the data to only Hispanic/White students, which is the achievement gap we’ll investigate across schools. I don’t know the specific labels, so I’ll look at these first, then filter accordingly.
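The two steps just described (trimming the columns, then filtering to the two groups) might look something like this on toy data. All column names here are assumptions modeled on the "percentage_standard_" prefix mentioned below, and the numbers are made up.

```r
library(dplyr)
library(tibble)

# Toy data with hypothetical column names
d <- tibble(
  district_code = "10017",
  school_code = "0112607",
  grade = 11,
  student_group = "Ethnicity",
  subgroup = c("Hispanic or Latino", "White", "Asian"),
  test_id = "ELA",
  total_tested = c(40, 35, 31),
  percentage_standard_not_met = c(30, 15, 10),
  percentage_standard_nearly_met = c(30, 25, 20),
  percentage_standard_met = c(25, 35, 30),
  percentage_standard_exceeded = c(15, 25, 40)
)

# Keep the identifiers, the group labels, and the percentage columns
slim <- d %>%
  select(district_code, school_code, grade, student_group, subgroup, test_id,
         starts_with("percentage_standard_"))

# Look at the labels before filtering on them
label_counts <- count(slim, subgroup)

# Keep only the two groups that define the gap
hw <- slim %>%
  filter(subgroup %in% c("Hispanic or Latino", "White"))
```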
We now have a pretty basic dataset that we’re ready to use to estimate effect size. If you recall from the previous post, what we need is the cumulative percentage of students in each category, rather than the raw percents. I’m going to do this by first creating a lower category that has zero students in it. I’ll then reshape the data to a long(er) format and calculate the cumulative sum.
We need this because of the cumulative sum calculation that comes next. First though, let’s reshape the data. After the reshape, I do a tiny bit of cleanup so the category variable doesn’t repeat "percentage_standard_" over and over.
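As a rough sketch of the zero category and the reshape together, on toy data: the column names are assumptions, and I'm using `pivot_longer()` as a modern stand-in for whichever reshaping function you prefer. Its `names_prefix` argument handles the "percentage_standard_" cleanup in the same step.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data with hypothetical column names
hw <- tibble(
  school_code = "0100016",
  subgroup = c("Hispanic or Latino", "White"),
  percentage_standard_not_met = c(30, 15),
  percentage_standard_nearly_met = c(30, 25),
  percentage_standard_met = c(25, 35),
  percentage_standard_exceeded = c(15, 25)
)

long <- hw %>%
  # Empty bottom category, so each cumulative curve starts at zero
  mutate(percentage_standard_low = 0) %>%
  # One row per school/group/category; strip the repeated prefix
  pivot_longer(
    starts_with("percentage_standard_"),
    names_to = "category",
    names_prefix = "percentage_standard_",
    values_to = "percentage"
  )
```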
## 1 10017 0112607 11 Ethnicity Hispanic or L… Mathem…
## 2 10017 0112607 11 Ethnicity Hispanic or L… ELA
## 3 10017 0112607 13 Ethnicity Hispanic or L… ELA
## 4 10017 0112607 13 Ethnicity Hispanic or L… Mathem…
## 5 10017 0112607 11 Ethnicity White Mathem…
## 6 10017 0112607 11 Ethnicity White ELA
## 7 10017 0112607 13 Ethnicity White ELA
## 8 10017 0112607 13 Ethnicity White Mathem…
## 9 10017 0123968 3 Ethnicity Hispanic or L… Mathem…
## 10 10017 0123968 3 Ethnicity Hispanic or L… ELA
## # … with 802,385 more rows, and 2 more variables: category <chr>,
## # percentage <dbl>
Now we need to make sure the categories are ordered in ascending order within a school. The best way to do this, from my perspective, is to transform category into a categorical variable.
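Here is a small sketch of the ordering and the cumulative sum. The category labels are assumptions, and `cumm_perc` matches the column name in the output. The rows are deliberately scrambled to show that the factor levels, not the row order, control the result.

```r
library(dplyr)
library(tibble)

# Toy data: one school, one group, categories out of order on purpose
long <- tibble(
  school_code = "0100016",
  subgroup = "White",
  category = c("met", "low", "exceeded", "not_met", "nearly_met"),
  percentage = c(35, 0, 25, 15, 25)
)

# Lowest to highest performance category (assumed labels)
lvls <- c("low", "not_met", "nearly_met", "met", "exceeded")

cum <- long %>%
  mutate(category = factor(category, levels = lvls)) %>%
  arrange(school_code, subgroup, category) %>%  # sorts by factor level
  group_by(school_code, subgroup) %>%
  mutate(cumm_perc = cumsum(percentage)) %>%
  ungroup()

# cum$cumm_perc is 0, 15, 40, 75, 100
```

Sorting a character column would put "exceeded" first alphabetically, which is why the factor conversion matters here.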
## 6 65243 0100016 3 Ethnicity Hispanic or L… Mathem…
## 7 65243 0100016 3 Ethnicity Hispanic or L… Mathem…
## 8 65243 0100016 3 Ethnicity Hispanic or L… Mathem…
## 9 65243 0100016 3 Ethnicity Hispanic or L… Mathem…
## 10 65243 0100016 3 Ethnicity Hispanic or L… Mathem…
## # … with 802,385 more rows, and 3 more variables: category <fct>,
## # percentage <dbl>, cumm_perc <dbl>
And now we’re getting close. We just need a column for each group. We’ll drop the raw percentage (so rows are uniquely defined) and spread the cumulative sum into two columns according to the specific group.
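A sketch of this step on toy data follows. I'm using `pivot_wider()` here, which is the newer tidyr equivalent of `spread()`, and I'm assuming the group labels were already cleaned to snake case (as the output column names suggest).

```r
library(dplyr)
library(tidyr)
library(tibble)

lvls <- c("low", "not_met", "nearly_met", "met", "exceeded")

# Toy data: cumulative percentages for two groups in one school
cum <- tibble(
  school_code = "0100016",
  grade = 4,
  test_id = "ELA",
  subgroup = rep(c("hispanic_or_latino", "white"), each = 5),
  category = rep(lvls, 2),
  percentage = c(0, 30, 30, 25, 15, 0, 15, 25, 35, 25),
  cumm_perc = c(0, 30, 60, 85, 100, 0, 15, 40, 75, 100)
)

wide <- cum %>%
  select(-percentage) %>%  # otherwise rows differ by group and won't pair up
  pivot_wider(names_from = subgroup, values_from = cumm_perc)
```

If the raw percentage stayed in, the two groups would rarely share a row and each new column would be mostly `NA`.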
## # … with 219,915 more rows, and 2 more variables:
## # hispanic_or_latino <dbl>, white <dbl>
And now we’re very close, but if you look carefully you can see we have one issue remaining: every school has the low category reported for both groups. We need to remove schools that only have the low category reported (because they don’t actually have any real data reported). There are lots of ways to do this, of course, but a fairly straightforward way is to count the rows within each school/grade/test combination and make sure there are five observations (four categories, plus the low category). Then we’ll select for just those observations.
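One way to sketch that row-count filter, again on made-up data with assumed column names:

```r
library(dplyr)
library(tibble)

lvls <- c("low", "not_met", "nearly_met", "met", "exceeded")

# Toy data: the first school has all five rows, the second only "low"
wide <- tibble(
  school_code = c(rep("0100016", 5), "0100024"),
  grade = 4,
  test_id = "ELA",
  category = c(lvls, "low"),
  hispanic_or_latino = c(0, 30, 60, 85, 100, 0),
  white = c(0, 15, 40, 75, 100, 0)
)

# Keep only school/grade/test combinations with all five categories
complete <- wide %>%
  group_by(school_code, grade, test_id) %>%
  filter(n() == 5) %>%
  ungroup()
```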
## # … with 181,715 more rows, and 2 more variables:
## # hispanic_or_latino <dbl>, white <dbl>
And our data are finally finalized! 🥳
Produce estimates
First, let’s compute the area under the paired curves. To do this, we just use an x/y integration. This will give us one estimate for each school/test/grade combination. I’ll use the {pracma} package again. One small caveat here… to get the correct AUC, the cumulative percentages actually need to be cumulative proportions. We could have done this transformation above in our data prep (and maybe I should have done that) but you can also do it in the integration and it doesn’t change the results at all. We’ll take this approach.
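Here is a small worked example of that integration for a single school/grade/test combination, with made-up cumulative percentages. The division by 100 inside `trapz()` is the proportion conversion mentioned above.

```r
library(dplyr)
library(tibble)

# Toy data: cumulative percentages for one school/grade/test combination
complete <- tibble(
  school_code = "0100016",
  grade = 4,
  test_id = "ELA",
  hispanic_or_latino = c(0, 40, 70, 90, 100),
  white = c(0, 20, 50, 80, 100)
)

# Trapezoidal area under the paired curve, one value per combination
aucs <- complete %>%
  group_by(school_code, grade, test_id) %>%
  summarize(
    auc = pracma::trapz(hispanic_or_latino / 100, white / 100),
    .groups = "drop"
  )

# aucs$auc is 0.365 for these toy numbers
```

By hand, the four trapezoids are 0.4 × 0.1 + 0.3 × 0.35 + 0.2 × 0.65 + 0.1 × 0.9 = 0.365.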
As a reminder, these values represent the probability that a randomly selected student from the x axis group, in this case students coded Hispanic/Latino, would score above a randomly selected student from the y axis group, in this case students coded White.
Now, we can transform these values into effect sizes using sqrt(2)*qnorm(auc), where auc represents the values we just calculated.
# Convert each AUC to an effect size
v <- aucs %>%
  mutate(v = sqrt(2)*qnorm(auc))
v
## # A tibble: 36,345 x 5
## school_code grade test_id auc v
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 0100016 4 ELA 0.342 -0.577
## 2 0100016 4 Mathematics 0.339 -0.588
## 3 0100016 13 ELA 0.389 -0.400
## 4 0100016 13 Mathematics 0.420 -0.285
## 5 0100024 3 ELA 0.429 -0.253
## 6 0100024 3 Mathematics 0.484 -0.0565
## 7 0100024 4 ELA 0.534 0.119
## 8 0100024 4 Mathematics 0.477 -0.0803
## 9 0100024 5 ELA 0.478 -0.0792
## 10 0100024 5 Mathematics 0.445 -0.195
## # … with 36,335 more rows
And voilà! We have effect size estimates for every school in California that reported data on both groups.
Quick exploration
This is already a long post, so I’ll keep this brief, but let’s quickly explore the effect size estimates.
First, let’s just look at the distributions by content area.
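A plot along these lines could be sketched as below. The data are simulated placeholders, not the California estimates, and the choice of a histogram with a reference line at zero is my assumption about a reasonable display.

```r
library(ggplot2)

# Simulated placeholder data, NOT the real estimates
set.seed(1)
v <- data.frame(
  test_id = rep(c("ELA", "Mathematics"), each = 200),
  v = rnorm(400)
)

# One panel per content area; the dashed line marks "no gap"
p <- ggplot(v, aes(v)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~test_id) +
  labs(x = "Effect size (V)", y = "Number of estimates")
```

Printing `p` draws the figure. With the real data, you would pass the data frame of effect sizes instead of the simulated one.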
This gives us a quick understanding of the overall distribution. For the vast majority of schools, students coded Hispanic/Latino are scoring, on average, lower than students coded White. But this is not true for all schools. We can also see that these achievement disparities are, on average, slightly larger in Math than in ELA.
Notice, however, that there is considerable variability between schools. What drives this variability? This is currently my primary area of interest.
One more quick exploration: let’s look at the distributions by grade, using the {ggridges} package.
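# Assumed opening: the start of this call is missing, so the mapping is a guess
ggplot(v, aes(x = v, y = factor(grade))) +
  ggridges::geom_density_ridges() +
  labs(subtitle = "Effect sizes estimated from ordinal percent proficient data",
       caption = "Data obtained from the California Department of Education website: \n https://caaspp.cde.ca.gov/sb2018/ResearchFileList") +
  facet_wrap(~test_id)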
Is there evidence of the achievement gaps growing by grade? Maybe… let’s take a different look.
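The plot of grade-level means needs a summary table with a mean and a standard error for each grade and test. A sketch of how such a table could be built is below. The toy values are made up, and dropping non-finite effect sizes (which occur when an AUC is exactly 0 or 1) is my assumption about a sensible safeguard.

```r
library(dplyr)
library(tibble)

# Toy effect sizes (made up)
v <- tibble(
  grade = c(3, 3, 3, 4, 4, 4),
  test_id = "ELA",
  v = c(-0.2, -0.4, -0.6, -0.1, -0.3, -0.5)
)

grade_means <- v %>%
  filter(is.finite(v)) %>%  # qnorm() returns +/-Inf when auc is 0 or 1
  group_by(grade, test_id) %>%
  summarize(
    mean = mean(v),
    mean_se = sd(v) / sqrt(n()),  # standard error of the mean
    .groups = "drop"
  )
```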
ggplot(grade_means, aes(grade, mean)) +
  geom_errorbar(aes(ymax = mean + qnorm(0.975)*mean_se,
                    ymin = mean + qnorm(0.025)*mean_se)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~test_id) +
  scale_x_continuous(breaks = c(3:8, 11, 13))
Maybe some, but the evidence is not overwhelmingly strong in this case.
Conclusions
This was a long post, but an important one, I think. In the next post, I’ll talk about geographical variation in school-level achievement gaps, which will require linking the schools with their longitude and latitude. I’ll also look at things like census variables to see how they may relate to the between-school variability.
Thanks for reading! Please get in touch if you found it interesting, see areas that need correcting, or have follow-up questions.