tidyverse functions you might not know about

R
programming
data
Author

Ben Harrap

Published

May 23, 2025

Sometimes you ask for help and people provide you with solutions using functions you didn’t know exist. That happened to me today and made me realise that I should look through the list of functions provided in the tidyverse packages I regularly use. So I did that and I discovered some cool functions that I should definitely be using!

I’ve listed some handy functions here and some use cases I’ve had for them.

dplyr

If you want to check out the list of dplyr functions yourself, you can find them at https://dplyr.tidyverse.org/reference/index.html

Combining values from multiple columns row-wise (c_across())

Scenario: You have set of variables you need to create a total score for, like the Kessler K-5.

# A tibble: 4 × 6
     id    q1    q2    q3    q4    q5
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     5     5     3     3     3
2     2     3     1     2     1     5
3     3     1     2     3     1     2
4     4     2     1     5     1     4

c_across() provides a neat way of doing this and is designed to work with data grouped using rowwise():

k5 |> 
  rowwise() |> 
  mutate(
    k5_score = sum(c_across(q1:q5)),
  ) |> 
  ungroup()
# A tibble: 4 × 7
     id    q1    q2    q3    q4    q5 k5_score
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     1     5     5     3     3     3       19
2     2     3     1     2     1     5       12
3     3     1     2     3     1     2        9
4     4     2     1     5     1     4       13

My original approach would’ve used a combination of rowSums() and pick():

k5 <- k5 |> 
  mutate(
    k5_score = rowSums(pick(q1:q5))
  )
k5
# A tibble: 4 × 7
     id    q1    q2    q3    q4    q5 k5_score
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     1     5     5     3     3     3       19
2     2     3     1     2     1     5       12
3     3     1     2     3     1     2        9
4     4     2     1     5     1     4       13

The advantage of my original approach is you don’t need to specify rowwise() and ungroup(), but this only works for sums and means because base only provides rowSums and rowMeans. If you want to do any other operations, c_across() is your friend!

Set values to missing under a condition (na_if())

I would normally use case_when() to set observations to missing when they meet particular conditions, but na_if() seems like a more direct way of doing that:

k5 <- k5 |> 
  mutate(
    k5_score = na_if(k5_score, 13)
  )
k5
# A tibble: 4 × 7
     id    q1    q2    q3    q4    q5 k5_score
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     1     5     5     3     3     3       19
2     2     3     1     2     1     5       12
3     3     1     2     3     1     2        9
4     4     2     1     5     1     4       NA

ifelse but tidyverse style (if_else())

I would normally have used case_when() in this case too, as I find it easier read what’s happening in each condition, but occasionally I find myself using ifelse for convenience when I only need to use one condition.

if_else() just seems like a better version of ifelse(). The main appeal is the inclusion of the missing argument, which lets you specify what you want missing values to be coded as:

k5 |> 
  mutate(
    k5_category = if_else(k5_score < 12, "low","high", missing = "missing")
  )
# A tibble: 4 × 8
     id    q1    q2    q3    q4    q5 k5_score k5_category
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <chr>      
1     1     5     5     3     3     3       19 high       
2     2     3     1     2     1     5       12 high       
3     3     1     2     3     1     2        9 low        
4     4     2     1     5     1     4       NA missing    

Using another dataset to update rows (rows_update())

This was a great recommendation by someone who helped me solve a problem efficiently. I had one dataset consisting of uncleaned responses to a paper questionnaire, which had been digitsed using OCR. I had a secondary dataset where the original paper questionnaire had been audited when the OCR seemed to have done a poor job.

Sticking with the K-5 data, imagine we have the audited data as follows:

# A tibble: 2 × 4
     id variable original audited
  <dbl> <chr>       <dbl>   <dbl>
1     3 q1              2       1
2     3 q2              4       5

All I had to do was use pivot_longer() to put the uncleaned data into long format and then use rows_update() to find and replace the corresponding values, then pivot_wider() back into the original format.

k5 |> 
  pivot_longer(
    !id,
    names_to = "variable",
    values_to = "response"
    ) |>
  rows_update(
    audit |> 
      rename(response = audited) |> 
      select(id, variable, response),
    by = c("id","variable")
  ) |> 
  pivot_wider(
    names_from = "variable",
    values_from = "response"
  )
# A tibble: 4 × 7
     id    q1    q2    q3    q4    q5 k5_score
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
1     1     5     5     3     3     3       19
2     2     3     1     2     1     5       12
3     3     1     5     3     1     2        9
4     4     2     1     5     1     4       NA

Note that inside of rows_update() I take audit, rename the column with the correct values to match the column name in the data, then keep only the columns used to identify the unique rows (id and variable) and the value used for updating (response).

There’s a whole suite of row-specific functions (see here), but I have found rows_update() being especially helpful.

Adding rows to a dataset (bind_rows())

I feel like everyone probably knows about this, but I wanted to quickly mention that I recently discovered I was using it inefficiently! bind_rows() accepts multiple things to be bound, meaning you can write:

df1 |> bind_rows(df2, df3, df4)

Instead of what I was previously doing:

df1 |> 
  bind_rows(df2) |> 
  bind_rows(df3) |> 
  bind_rows(df4)

tibble

You can find the tibble reference list here https://tibble.tidyverse.org/reference/index.html

Adding rows in specific locations (add_row())

The use-case I had for this was actually solved by just using gtsummary, but before I discovered that package I was making summary tables by hand. I’d do this by combining the output from tabyl() and summarise() into one data frame. For example:

summary <- k5 |>
  summarise(
    n = n(),
    mean = mean(q1),
    sd = sd(q1),
    min = min(q1),
    max = max(q1)
  ) |>
  pivot_longer(everything())

summary
# A tibble: 5 × 2
  name  value
  <chr> <dbl>
1 n      4   
2 mean   2.75
3 sd     1.71
4 min    1   
5 max    5   

But the number of participants is different to summaries of responses to Q1, so I want to add a row to make this clear. This is easily done by using the .before = argument:

summary |> 
  add_row(
    name = "Q1 responses",
    .before = 2
  )
# A tibble: 6 × 2
  name         value
  <chr>        <dbl>
1 n             4   
2 Q1 responses NA   
3 mean          2.75
4 sd            1.71
5 min           1   
6 max           5   

Converting lists to tibbles (enframe())

I’ve been doing a lot of work with APIs and getting data returned as JSON, which the httr2 package helpfully converts into a more usable format in R. Unfortunately, what I get tends to be a lot of nested lists that I need to combine together into a rectangular format.

my_list <- c("Victoria","Western Australia","Tasmania")

my_list
[1] "Victoria"          "Western Australia" "Tasmania"         

Enter enframe(). It works with vectors and converts them into a tibble with two columns. Very simple, very useful.

enframe(my_list)
# A tibble: 3 × 2
   name value            
  <int> <chr>            
1     1 Victoria         
2     2 Western Australia
3     3 Tasmania         

tidyr

See the tidyr reference list here https://tidyr.tidyverse.org/reference/index.html

Getting values out of list columns (unnest())

Building on the previous example, you might end up with a column that contains lists, but you want those values as columns (or rows).

my_list <- tibble(
  state = list(tibble("Victoria","Western Australia","Tasmania")),
  territory = list(tibble("Australian Capital Territory","Northern Territory"))
    )

my_list
# A tibble: 1 × 2
  state            territory       
  <list>           <list>          
1 <tibble [1 × 3]> <tibble [1 × 2]>

These lists can be extracted using unnest:

unnest(my_list, col = c(state, territory))
# A tibble: 1 × 5
  `"Victoria"` `"Western Australia"` `"Tasmania"` "Australian Capital Territor…¹
  <chr>        <chr>                 <chr>        <chr>                         
1 Victoria     Western Australia     Tasmania     Australian Capital Territory  
# ℹ abbreviated name: ¹​`"Australian Capital Territory"`
# ℹ 1 more variable: `"Northern Territory"` <chr>

There’s also unnest_longer() and unnest_wider(), depending on the shape you.

Propagating non-missing values row-wise (fill())

This has been such a useful function in so many situations for me, yet somehow I frequently forget about it!

My most recent use-case was creating a data dictionary based on metadata output from REDCap. In the questionnaire, section headers are used to paginate the different sections, but in the metadata the header is attached to a single variable:

dictionary <- tibble(
  variable = c("name","dob","q1","q2","q3","q4","q5"),
  header = c("About you",NA,"Psychological distress",NA,NA,NA,NA)
)

dictionary
# A tibble: 7 × 2
  variable header                
  <chr>    <chr>                 
1 name     About you             
2 dob      <NA>                  
3 q1       Psychological distress
4 q2       <NA>                  
5 q3       <NA>                  
6 q4       <NA>                  
7 q5       <NA>                  

When creating the data dictionary, I want the header to be present for every variable as it can be useful to identify or analyse data by section:

dictionary |> 
  fill(header, .direction = "down")
# A tibble: 7 × 2
  variable header                
  <chr>    <chr>                 
1 name     About you             
2 dob      About you             
3 q1       Psychological distress
4 q2       Psychological distress
5 q3       Psychological distress
6 q4       Psychological distress
7 q5       Psychological distress

The .direction argument accepts “down”, “up”, “downup”, and “updown”, meaning you can fill in multiple directions. This is especially powerful when you combine it with group_by(), allowing you to fill without worrying about propagating into the wrong group.