Sharing some functions from my personal R package
Contents
In this post I basically just wanted to share some recent developments that I’ve made to my personal R package {sundry}. All of the recent advancements have been made to work with the tidyverse, so things like group_by
should work seamlessly. If you feel like giving the package a whirl, I’d love any feedback you have or bugs you may find. At this point the package is only on github. If there seems to be interest from others in using any of this functionality, I may submit it to CRAN. You can install it with devtools::install_github("DJAnderson_07/sundry")
Batch reading data
Probably my favorite new function is read_files
, which is basically meant to make batch reading in files easy. It uses purrr::map_df
by default, so all the data frames are bound into a single data frame. And, the part I think is really neat, is that it leverages the power of the {rio} package so you don’t really have to worry much about file types. In fact, the files you read in can all be of different types and it’s no big deal. Here’s an example, from the README.
First, we’ll load the tidyverse and sundry, then split the iris dataset by species.
library(sundry)
library(tidyverse)
by_species <- iris %>%
split(.$Species) %>%
map(select, -Species)
str(by_species)
## List of 3
## $ setosa :'data.frame': 50 obs. of 4 variables:
## ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ versicolor:'data.frame': 50 obs. of 4 variables:
## ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
## ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
## ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
## ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
## $ virginica :'data.frame': 50 obs. of 4 variables:
## ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
## ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
## ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
## ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
Next, we’ll export each dataset from the split as different file types. Note that for each dataset, I’m not only writing the file out in a different format (csv, EXCEL, and SPSS), but I’m also only writing out specific columns (which are not all in common). So these are all fairly different files at this point, but we can imagine them all being part of the same study.
rio::export(by_species$setosa[ ,1:2], "setosa.csv")
rio::export(by_species$versicolor[ ,2:4], "versicolor.xlsx")
rio::export(by_species$virginica[ ,2:3], "virginica.sav")
Now, we can import all of these datasets back into R with the sundry::read_files
function.
d <- read_files()
d
## # A tibble: 150 x 5
## file Sepal.Length Sepal.Width Petal.Length Petal.Width
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.10 3.50 NA NA
## 2 setosa 4.90 3.00 NA NA
## 3 setosa 4.70 3.20 NA NA
## 4 setosa 4.60 3.10 NA NA
## 5 setosa 5.00 3.60 NA NA
## 6 setosa 5.40 3.90 NA NA
## 7 setosa 4.60 3.40 NA NA
## 8 setosa 5.00 3.40 NA NA
## 9 setosa 4.40 2.90 NA NA
## 10 setosa 4.90 3.10 NA NA
## # ... with 140 more rows
d %>%
count(file)
## # A tibble: 3 x 2
## file n
## <chr> <int>
## 1 setosa 50
## 2 versicolor 50
## 3 virginica 50
This can be a little tricky though, because sometimes you might have the same file in multiple formats, or you may want to only read in data some datasets from a directory but not all. That’s where the optional pat
argument comes in. For example, let’s write out one additional csv file, and then read in only the csv files.
rio::export(by_species$virginica[ ,2:3], "virginica.csv")
d2 <- read_files(pat = "csv")
head(d2)
## # A tibble: 6 x 4
## file Sepal.Length Sepal.Width Petal.Length
## <chr> <dbl> <dbl> <dbl>
## 1 setosa 5.10 3.50 NA
## 2 setosa 4.90 3.00 NA
## 3 setosa 4.70 3.20 NA
## 4 setosa 4.60 3.10 NA
## 5 setosa 5.00 3.60 NA
## 6 setosa 5.40 3.90 NA
tail(d2)
## # A tibble: 6 x 4
## file Sepal.Length Sepal.Width Petal.Length
## <chr> <dbl> <dbl> <dbl>
## 1 virginica NA 3.30 5.70
## 2 virginica NA 3.00 5.20
## 3 virginica NA 2.50 5.00
## 4 virginica NA 3.00 5.20
## 5 virginica NA 3.40 5.40
## 6 virginica NA 3.00 5.10
Finally, we’ll clean up a bit by deleting all the files we wrote out for this example
fs::file_delete(c("setosa.csv", "versicolor.xlsx",
"virginica.sav", "virginica.csv"))
Quick descriptive statistics
I’m sure there are other packages that do similar things, but I often find myself just needing quick descriptives for a variable. That’s where sundry::descrips
comes in helpful. It’s relatively straightforward. You just supply the columns you want descriptive statistics on and, by default, it returns the number of cases, min and max values, as well as the mean and standard deviation.
storms %>%
descrips(wind, pressure)
## # A tibble: 2 x 6
## variable n min max mean sd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 pressure 10010 882 1022 992 19.5
## 2 wind 10010 10.0 160 53.5 26.2
The function also works well with dplyr::group_by
storms %>%
group_by(year) %>%
descrips(wind, pressure)
## # A tibble: 82 x 7
## year variable n min max mean sd
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1975 pressure 86.0 963 1014 995 15.2
## 2 1975 wind 86.0 20.0 100 50.9 23.6
## 3 1976 pressure 52.0 957 1012 989 15.3
## 4 1976 wind 52.0 20.0 105 59.9 24.8
## 5 1977 pressure 53.0 926 1015 995 20.4
## 6 1977 wind 53.0 20.0 150 54.0 29.6
## 7 1978 pressure 54.0 980 1012 1006 6.64
## 8 1978 wind 54.0 20.0 80.0 40.5 13.9
## 9 1979 pressure 301 924 1014 995 19.9
## 10 1979 wind 301 15.0 150 48.7 30.3
## # ... with 72 more rows
And finally, if you want different functions, you can supply them via the optional .funs
argument. Below, we’ll calculate the 25th, 50th, and 75th percentiles instead.
storms %>%
group_by(year) %>%
descrips(wind, pressure,
.funs = funs(qtile25 = quantile(., 0.25),
median,
qtile75 = quantile(., 0.75)))
## # A tibble: 82 x 5
## year variable qtile25 median qtile75
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1975 pressure 984 997 1011
## 2 1975 wind 25.0 52.5 65.0
## 3 1976 pressure 978 992 1000
## 4 1976 wind 38.8 60.0 80.0
## 5 1977 pressure 994 1001 1010
## 6 1977 wind 30.0 45.0 70.0
## 7 1978 pressure 1006 1007 1009
## 8 1978 wind 30.0 40.0 45.0
## 9 1979 pressure 988 1002 1008
## 10 1979 wind 25.0 35.0 65.0
## # ... with 72 more rows
Remove rows with complete missing data across a set of variables
In many datasets I work with, there are sets of variables that have complete missing data. I want to remove any rows that are missing acrross all of these variables. This is different from janitor::remove_empty_rows
, because these rows may have valid data on other variables, just not across the set of variables I’m interested in. Below is an example from the Oregon Department of Education on schools, where the number and percent of students scoring in each statewide proficiency category on the statewide test are missing if the n size is too small (for confidentiality purposes).
d <- rio::import("http://www.oregon.gov/ode/educator-resources/assessment/TestResults2017/pagr_schools_ela_tot_ecd_ext_gnd_lep_1617.xlsx",
setclass = "tbl_df",
na = c("--", "*")) %>%
janitor::clean_names()
d %>%
select(district_id, number_level_4:percent_level_1)
## # A tibble: 23,760 x 9
## district_id number_level_4 percent_level_4 number_level_3
## <dbl> <chr> <chr> <chr>
## 1 2063 <NA> <NA> <NA>
## 2 2063 <NA> <NA> <NA>
## 3 2063 <NA> <NA> <NA>
## 4 2063 <NA> <NA> <NA>
## 5 2063 <NA> <NA> <NA>
## 6 2063 <NA> <NA> <NA>
## 7 2063 <NA> <NA> <NA>
## 8 2063 <NA> <NA> <NA>
## 9 2063 <NA> <NA> <NA>
## 10 2063 <NA> <NA> <NA>
## # ... with 23,750 more rows, and 5 more variables: percent_level_3 <chr>,
## # number_level_2 <chr>, percent_level_2 <chr>, number_level_1 <chr>,
## # percent_level_1 <chr>
Note that there are many more columns here, but I’m only showing the ones that I’m interested in. I want to remove and rows that are missing across these variables, which I can do with sundry::rm_empty_rows
.
d %>%
rm_empty_rows(number_level_4:percent_level_1) %>%
select(district_id, number_level_4:percent_level_1)
## # A tibble: 15,081 x 9
## district_id number_level_4 percent_level_4 number_level_3
## <dbl> <chr> <chr> <chr>
## 1 2113 5 45.5 5
## 2 2113 5 29.4 8
## 3 2113 2 20 5
## 4 2113 5 33.299999999999997 7
## 5 2113 2 20 5
## 6 2113 0 0 6
## 7 2113 7 53.8 5
## 8 2113 5 41.7 4
## 9 2113 5 41.7 6
## 10 2113 - - -
## # ... with 15,071 more rows, and 5 more variables: percent_level_3 <chr>,
## # number_level_2 <chr>, percent_level_2 <chr>, number_level_1 <chr>,
## # percent_level_1 <chr>
In the above, rows with partial missing data across the set of columns are still returned. The function can also be provided without any column arguments, and the function will then mimic the behavior or janitor::remove_empty_rows
.
Filter by functions
For plotting purposes, in particular, I often find myself needing to filter a dataset according to values that can be obtained from functions, such as the min and the max. Below is an example, where the dataset is filtered to return only the rows where the wind speed is equal to the minimum or the maximum wind speed.
storms %>%
filter_by_funs(wind, funs(min, max)) %>%
select(fun, name, year, wind)
## # A tibble: 11 x 4
## fun name year wind
## <chr> <chr> <dbl> <int>
## 1 min Bonnie 1986 10
## 2 min Bonnie 1986 10
## 3 min AL031987 1987 10
## 4 min AL031987 1987 10
## 5 min AL031987 1987 10
## 6 min Alberto 1994 10
## 7 min Alberto 1994 10
## 8 min Alberto 1994 10
## 9 min Alberto 1994 10
## 10 max Gilbert 1988 160
## 11 max Wilma 2005 160
So in this case, the minimum wind speed is 10, and the maximum is 160. The first 9 rows all match the minimum wind speed, while the 10th and 11th rows match the max. You can quickly identify which function the row was selected on by the fun column that is added to the data frame. Again, this function works well with dplyr::group_by
, and should work with any function.
storms %>%
group_by(year) %>%
filter_by_funs(wind, funs(min, max)) %>%
select(fun, name, year, wind) %>%
arrange(year)
## # A tibble: 387 x 4
## # Groups: year [41]
## fun name year wind
## <chr> <chr> <dbl> <int>
## 1 min Caroline 1975 20
## 2 min Caroline 1975 20
## 3 max Caroline 1975 100
## 4 max Caroline 1975 100
## 5 min Gloria 1976 20
## 6 min Gloria 1976 20
## 7 max Belle 1976 105
## 8 max Belle 1976 105
## 9 min Anita 1977 20
## 10 min Clara 1977 20
## # ... with 377 more rows
Conclusions
There are a few other functions in the package that might be helfpul, but I didn’t want this post to get too long. I would love any feedback!
Author Daniel Anderson
LastMod 2018-04-08 (89314bf)