library(tidyverse)
<- tibble(
addresses unit = c("Shop 4","Unit 185", NA, NA),
street = c("2 Frencham Pl","20 Allara St","1 McCoy Cct","130 Garran Rd Acton ACT 2601"),
suburb = c("Downer","Canberra","Acton", NA),
postcode = c("2602","2601","2601", NA)
)
Joining strings with missing data together in R
Joining multiple string columns together in R isn’t a particularly tricky thing to do. Until recently though, I found it unreasonably annoying trying to join strings where missing data is present. For example, let’s say we’ve got the following address data:
The first two have a value for each element, while the third address doesn’t have a unit
, and the fourth for some reason has all the elements in street
. What I want to do is combine all of these address elements into a single string and ignore missing values. Let’s review the ways we can go about this.
tidyr::unite()
I’ll cut to the chase – tidyr::unite()
is perfect for this:
|>
addresses bind_cols(
unite(
data = addresses,
col = "address",
sep = " ",
na.rm = TRUE
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 2601
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton A…
I only recently learned about unite()
(thanks anonymous cat data scientist!) and was surprised to find it in tidyr
and not in stringr
. Anyway, it’s perfect!
base::paste()
We could use paste()
, except that it includes NA
as the letters NA in the output, which is not what I want.
|>
addresses mutate(
address = paste(
unit, street, suburb, postcode,sep = " "
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 NA 1 McCoy Cct Acton …
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> NA 130 Garran Rd Acto…
We could do some cleaning to overcome this issue:
|>
addresses replace_na(
replace = list(unit = "", suburb = "", postcode = "")
|>
) mutate(
address = paste(
unit, street, suburb, postcode,sep = " "
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 "Shop 4" 2 Frencham Pl "Downer" "2602" "Shop 4 2 Frencha…
2 "Unit 185" 20 Allara St "Canberra" "2601" "Unit 185 20 Alla…
3 "" 1 McCoy Cct "Acton" "2601" " 1 McCoy Cct Act…
4 "" 130 Garran Rd Acton ACT 2601 "" "" " 130 Garran Rd A…
But the empty strings are now causing excessive separators to be included. We can fix this with str_trim()
and str_squish()
, but it’s getting silly at this point. Just use unite()
!
|>
addresses replace_na(
replace = list(unit = "", suburb = "", postcode = "")
|>
) mutate(
address = str_trim(
paste(
unit, street, suburb, postcode,sep = " "
)
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 "Shop 4" 2 Frencham Pl "Downer" "2602" Shop 4 2 Frencham…
2 "Unit 185" 20 Allara St "Canberra" "2601" Unit 185 20 Allar…
3 "" 1 McCoy Cct "Acton" "2601" 1 McCoy Cct Acton…
4 "" 130 Garran Rd Acton ACT 2601 "" "" 130 Garran Rd Act…
base::sprintf()
Another base option is to use sprintf()
, which seems like we’d still need to do some extra work to handle the NA
.
|>
addresses mutate(
address = sprintf(
"%s %s %s %s",
unit, street, suburb, postcode
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 NA 1 McCoy Cct Acton …
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> NA 130 Garran Rd Acto…
stringr::str_c()
We could use str_c()
but we’ll have the same issues as we did with paste
. Notice though that str_c
returns NA
if any of the input strings are NA
.
|>
addresses mutate(
address = str_c(
unit, street, suburb, postcode,sep = " "
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 <NA>
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> <NA>
stringr::str_flatten()
Interestingly, we can achieve the same result as unite()
if we perform str_flatten()
on a row-wise basis. I doubt this is a good idea though, as row-wise stuff tends to get very expensive, very quickly.
|>
addresses rowwise() |>
mutate(
address = str_flatten(
c(unit, street, suburb, postcode),
collapse = " ",
na.rm = TRUE
) )
# A tibble: 4 × 5
# Rowwise:
unit street suburb postcode address
<chr> <chr> <chr> <chr> <chr>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 2601
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton A…
glue::glue()
glue
is another option and arguably works better than paste
or str_c
because of the .na
option letting us handle the NA
within the function itself.
library(glue)
|>
addresses mutate(
address = glue(
"{unit}","{street}","{suburb}","{postcode}",
.na = "",
.sep = " "
) )
# A tibble: 4 × 5
unit street suburb postcode address
<chr> <chr> <chr> <chr> <glue>
1 Shop 4 2 Frencham Pl Downer 2602 Shop 4 2 Frencham Pl …
2 Unit 185 20 Allara St Canberra 2601 Unit 185 20 Allara St…
3 <NA> 1 McCoy Cct Acton 2601 1 McCoy Cct Acton 26…
4 <NA> 130 Garran Rd Acton ACT 2601 <NA> <NA> 130 Garran Rd Acton …
Again though, we’d need to trim the leading whitespace created.
Summary
There’s not much to say except use unite()
!