library(tidyverse)
addresses <- tibble(
unit = c("Shop 4","Unit 185", NA, NA),
street = c("2 Frencham Pl","20 Allara St","1 McCoy Cct","130 Garran Rd Acton ACT 2601"),
suburb = c("Downer","Canberra","Acton", NA),
postcode = c("2602","2601","2601", NA)
)Joining strings with missing data together in R
Joining multiple string columns together in R isn’t a particularly tricky thing to do. Until recently though, I found it unreasonably annoying trying to join strings where missing data is present. For example, let’s say we’ve got the following address data:
The first two have a value for each element, while the third address doesn’t have a unit, and the fourth for some reason has all the elements in street. What I want to do is combine all of these address elements into a single string and ignore missing values. Let’s review the ways we can go about this.
tidyr::unite()
I’ll cut to the chase – tidyr::unite() is perfect for this:
addresses |>
bind_cols(
unite(
data = addresses,
col = "address",
sep = " ",
na.rm = TRUE
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 2601
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton A…
I only recently learned about unite() (thanks anonymous cat data scientist!) and was surprised to find it in tidyr and not in stringr. Anyway, it’s perfect!
base::paste()
We could use paste(), except that it includes NA as the letters NA in the output, which is not what I want.
addresses |>
mutate(
address = paste(
unit, street, suburb, postcode,
sep = " "
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 NA 1 McCoy Cct Acton …
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> NA 130 Garran Rd Acto…
We could do some cleaning to overcome this issue:
addresses |>
replace_na(
replace = list(unit = "", suburb = "", postcode = "")
) |>
mutate(
address = paste(
unit, street, suburb, postcode,
sep = " "
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 "Shop 4" 2 Frencham Pl "Downer" "2602" "Shop 4 2 Frencha…
2 "Unit 185" 20 Allara St "Canberra" "2601" "Unit 185 20 Alla…
3 "" 1 McCoy Cct "Acton" "2601" " 1 McCoy Cct Act…
4 "" 130 Garran Rd Acton ACT 2601 "" "" " 130 Garran Rd A…
But the empty strings are now causing excessive separators to be included. We can fix this with str_trim() and str_squish(), but it’s getting silly at this point. Just use unite()!
addresses |>
replace_na(
replace = list(unit = "", suburb = "", postcode = "")
) |>
mutate(
address = str_trim(
paste(
unit, street, suburb, postcode,
sep = " "
)
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 "Shop 4" 2 Frencham Pl "Downer" "2602" Shop 4 2 Frencham…
2 "Unit 185" 20 Allara St "Canberra" "2601" Unit 185 20 Allar…
3 "" 1 McCoy Cct "Acton" "2601" 1 McCoy Cct Acton…
4 "" 130 Garran Rd Acton ACT 2601 "" "" 130 Garran Rd Act…
base::sprintf()
Another base option is to use sprintf(), which seems like we’d still need to do some extra work to handle the NA.
addresses |>
mutate(
address = sprintf(
"%s %s %s %s",
unit, street, suburb, postcode
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 NA 1 McCoy Cct Acton …
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> NA 130 Garran Rd Acto…
stringr::str_c()
We could use str_c() but we’ll have the same issues as we did with paste. Notice though that str_c returns NA if any of the input strings are NA.
addresses |>
mutate(
address = str_c(
unit, street, suburb, postcode,
sep = " "
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 <NA>
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> <NA>
stringr::str_flatten()
Interestingly, we can achieve the same result as unite() if we perform str_flatten() on a row-wise basis. I doubt this is a good idea though, as row-wise stuff tends to get very expensive, very quickly.
addresses |>
rowwise() |>
mutate(
address = str_flatten(
c(unit, street, suburb, postcode),
collapse = " ",
na.rm = TRUE
)
)# A tibble: 4 × 5
# Rowwise:
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 2601
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton A…
glue::glue()
glue is another option and arguably works better than paste or str_c because of the .na option letting us handle the NA within the function itself.
library(glue)
addresses |>
mutate(
address = glue(
"{unit}","{street}","{suburb}","{postcode}",
.na = "",
.sep = " "
)
)# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <glue>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 26…
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton …
Again though, we’d need to trim the leading whitespace created.
Summary
There’s not much to say except use unite()!