# A tibble: 4 × 6
id q1 q2 q3 q4 q5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 3 3 3
2 2 3 1 2 1 5
3 3 1 2 3 1 2
4 4 2 1 5 1 4
tidyverse functions you might not know about
Sometimes you ask for help and people provide you with solutions using functions you didn’t know exist. That happened to me today and made me realise that I should look through the list of functions provided in the tidyverse
packages I regularly use. So I did that and I discovered some cool functions that I should definitely be using!
I’ve listed some handy functions here and some use cases I’ve had for them.
dplyr
If you want to check out the list of dplyr
functions yourself, you can find them at https://dplyr.tidyverse.org/reference/index.html
Combining values from multiple columns row-wise (c_across()
)
Scenario: You have set of variables you need to create a total score for, like the Kessler K-5.
c_across()
provides a neat way of doing this and is designed to work with data grouped using rowwise()
:
|>
k5 rowwise() |>
mutate(
k5_score = sum(c_across(q1:q5)),
|>
) ungroup()
# A tibble: 4 × 7
id q1 q2 q3 q4 q5 k5_score
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 3 3 3 19
2 2 3 1 2 1 5 12
3 3 1 2 3 1 2 9
4 4 2 1 5 1 4 13
My original approach would’ve used a combination of rowSums()
and pick()
:
<- k5 |>
k5 mutate(
k5_score = rowSums(pick(q1:q5))
) k5
# A tibble: 4 × 7
id q1 q2 q3 q4 q5 k5_score
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 3 3 3 19
2 2 3 1 2 1 5 12
3 3 1 2 3 1 2 9
4 4 2 1 5 1 4 13
The advantage of my original approach is you don’t need to specify rowwise()
and ungroup()
, but this only works for sums and means because base only provides rowSums
and rowMeans
. If you want to do any other operations, c_across()
is your friend!
Set values to missing under a condition (na_if()
)
I would normally use case_when()
to set observations to missing when they meet particular conditions, but na_if()
seems like a more direct way of doing that:
<- k5 |>
k5 mutate(
k5_score = na_if(k5_score, 13)
) k5
# A tibble: 4 × 7
id q1 q2 q3 q4 q5 k5_score
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 3 3 3 19
2 2 3 1 2 1 5 12
3 3 1 2 3 1 2 9
4 4 2 1 5 1 4 NA
ifelse
but tidyverse
style (if_else()
)
I would normally have used case_when()
in this case too, as I find it easier read what’s happening in each condition, but occasionally I find myself using ifelse
for convenience when I only need to use one condition.
if_else()
just seems like a better version of ifelse()
. The main appeal is the inclusion of the missing
argument, which lets you specify what you want missing values to be coded as:
|>
k5 mutate(
k5_category = if_else(k5_score < 12, "low","high", missing = "missing")
)
# A tibble: 4 × 8
id q1 q2 q3 q4 q5 k5_score k5_category
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 5 5 3 3 3 19 high
2 2 3 1 2 1 5 12 high
3 3 1 2 3 1 2 9 low
4 4 2 1 5 1 4 NA missing
Using another dataset to update rows (rows_update()
)
This was a great recommendation by someone who helped me solve a problem efficiently. I had one dataset consisting of uncleaned responses to a paper questionnaire, which had been digitsed using OCR. I had a secondary dataset where the original paper questionnaire had been audited when the OCR seemed to have done a poor job.
Sticking with the K-5 data, imagine we have the audited data as follows:
# A tibble: 2 × 4
id variable original audited
<dbl> <chr> <dbl> <dbl>
1 3 q1 2 1
2 3 q2 4 5
All I had to do was use pivot_longer()
to put the uncleaned data into long format and then use rows_update()
to find and replace the corresponding values, then pivot_wider()
back into the original format.
|>
k5 pivot_longer(
!id,
names_to = "variable",
values_to = "response"
|>
) rows_update(
|>
audit rename(response = audited) |>
select(id, variable, response),
by = c("id","variable")
|>
) pivot_wider(
names_from = "variable",
values_from = "response"
)
# A tibble: 4 × 7
id q1 q2 q3 q4 q5 k5_score
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 5 3 3 3 19
2 2 3 1 2 1 5 12
3 3 1 5 3 1 2 9
4 4 2 1 5 1 4 NA
Note that inside of rows_update()
I take audit
, rename the column with the correct values to match the column name in the data
, then keep only the columns used to identify the unique rows (id
and variable
) and the value used for updating (response
).
There’s a whole suite of row-specific functions (see here), but I have found rows_update()
being especially helpful.
Adding rows to a dataset (bind_rows()
)
I feel like everyone probably knows about this, but I wanted to quickly mention that I recently discovered I was using it inefficiently! bind_rows()
accepts multiple things to be bound, meaning you can write:
|> bind_rows(df2, df3, df4) df1
Instead of what I was previously doing:
|>
df1 bind_rows(df2) |>
bind_rows(df3) |>
bind_rows(df4)
tibble
You can find the tibble
reference list here https://tibble.tidyverse.org/reference/index.html
Adding rows in specific locations (add_row()
)
The use-case I had for this was actually solved by just using gtsummary
, but before I discovered that package I was making summary tables by hand. I’d do this by combining the output from tabyl()
and summarise()
into one data frame. For example:
<- k5 |>
summary summarise(
n = n(),
mean = mean(q1),
sd = sd(q1),
min = min(q1),
max = max(q1)
|>
) pivot_longer(everything())
summary
# A tibble: 5 × 2
name value
<chr> <dbl>
1 n 4
2 mean 2.75
3 sd 1.71
4 min 1
5 max 5
But the number of participants is different to summaries of responses to Q1, so I want to add a row to make this clear. This is easily done by using the .before =
argument:
|>
summary add_row(
name = "Q1 responses",
.before = 2
)
# A tibble: 6 × 2
name value
<chr> <dbl>
1 n 4
2 Q1 responses NA
3 mean 2.75
4 sd 1.71
5 min 1
6 max 5
Converting lists to tibbles (enframe()
)
I’ve been doing a lot of work with APIs and getting data returned as JSON, which the httr2
package helpfully converts into a more usable format in R. Unfortunately, what I get tends to be a lot of nested lists that I need to combine together into a rectangular format.
<- c("Victoria","Western Australia","Tasmania")
my_list
my_list
[1] "Victoria" "Western Australia" "Tasmania"
Enter enframe()
. It works with vectors and converts them into a tibble with two columns. Very simple, very useful.
enframe(my_list)
# A tibble: 3 × 2
name value
<int> <chr>
1 1 Victoria
2 2 Western Australia
3 3 Tasmania
tidyr
See the tidyr
reference list here https://tidyr.tidyverse.org/reference/index.html
Getting values out of list columns (unnest()
)
Building on the previous example, you might end up with a column that contains lists, but you want those values as columns (or rows).
<- tibble(
my_list state = list(tibble("Victoria","Western Australia","Tasmania")),
territory = list(tibble("Australian Capital Territory","Northern Territory"))
)
my_list
# A tibble: 1 × 2
state territory
<list> <list>
1 <tibble [1 × 3]> <tibble [1 × 2]>
These lists can be extracted using unnest
:
unnest(my_list, col = c(state, territory))
# A tibble: 1 × 5
`"Victoria"` `"Western Australia"` `"Tasmania"` "Australian Capital Territor…¹
<chr> <chr> <chr> <chr>
1 Victoria Western Australia Tasmania Australian Capital Territory
# ℹ abbreviated name: ¹`"Australian Capital Territory"`
# ℹ 1 more variable: `"Northern Territory"` <chr>
There’s also unnest_longer()
and unnest_wider()
, depending on the shape you.
Propagating non-missing values row-wise (fill()
)
This has been such a useful function in so many situations for me, yet somehow I frequently forget about it!
My most recent use-case was creating a data dictionary based on metadata output from REDCap. In the questionnaire, section headers are used to paginate the different sections, but in the metadata the header is attached to a single variable:
<- tibble(
dictionary variable = c("name","dob","q1","q2","q3","q4","q5"),
header = c("About you",NA,"Psychological distress",NA,NA,NA,NA)
)
dictionary
# A tibble: 7 × 2
variable header
<chr> <chr>
1 name About you
2 dob <NA>
3 q1 Psychological distress
4 q2 <NA>
5 q3 <NA>
6 q4 <NA>
7 q5 <NA>
When creating the data dictionary, I want the header to be present for every variable as it can be useful to identify or analyse data by section:
|>
dictionary fill(header, .direction = "down")
# A tibble: 7 × 2
variable header
<chr> <chr>
1 name About you
2 dob About you
3 q1 Psychological distress
4 q2 Psychological distress
5 q3 Psychological distress
6 q4 Psychological distress
7 q5 Psychological distress
The .direction
argument accepts “down”, “up”, “downup”, and “updown”, meaning you can fill in multiple directions. This is especially powerful when you combine it with group_by()
, allowing you to fill without worrying about propagating into the wrong group.