Not all missings are created equal

programming

Author

Ben Harrap

Published

July 14, 2025

I posted this question on Bluesky the other day, because I wasn’t sure whether a row-wise or column-wise solution would perform better for the problem I was trying to solve.

The feedback was consistent in stating that column-wise operations would be faster. So what did I do? I wrote a row-wise solution!

Ok hear me out though, there was a good reason and I’ll go through the problem and my thinking here, because the problem isn’t as simple as I’d laid out in my Bluesky post.

The problem

People do questionnaires (this is not the problem). Some people skip a question or two, some people skip a whole set, and some people get bored and stop half-way. These all lead to missing data, but the pattern of missingness is different and is informative when thinking about designing the next survey wave’s questionnaire.

The problem is how to efficiently identify these types of missingness in the data and code them as such.

Let’s start with a data dictionary some example data to help illustrate.

The data

library(tidyverse)
library(gt)
set.seed(4)

survey_data <- tibble(
  id = c(1, 2, 3, 4, 5),
  name_first = c("Tom", "Penny", "Trevor", "Ursula", "Jenny"),
  name_last = c("Smith", "Jones", NA, "Smith", "Jones"),
  age = c(62,52,37,44,NA),
  x1 = runif(5, 0, 2),
  x2 = runif(5, 0, 2),
  x3 = runif(5, 0, 2),
  x_text = c("Lorem", NA, NA, "Lorem", NA),
  y1 = runif(5, 0, 2),
  y2 = runif(5, 0, 2),
  y3 = runif(5, 0, 2),
  z1 = runif(5, 0, 2),
  z2 = runif(5, 0, 2),
  z3 = runif(5, 0, 2)
) |>
  mutate(
    across(matches("\\d"), \(x) if_else(x > 1, NA, round(x))),
    across(everything(), as.character)
    )

gt(survey_data)

id	name_first	name_last	age	x1	x2	x3	x_text	y1	y2	y3	z1	z2	z3
1	Tom	Smith	62	NA	1	NA	Lorem	1	NA	NA	NA	NA	NA
2	Penny	Jones	52	0	NA	1	NA	NA	NA	1	0	1	0
3	Trevor	NA	37	1	NA	0	NA	NA	NA	NA	NA	NA	NA
4	Ursula	Smith	44	1	NA	NA	Lorem	NA	1	NA	NA	1	0
5	Jenny	Jones	NA	NA	0	1	NA	NA	NA	NA	1	0	NA

In this example, we’ve got the data in a tabular format with responses as rows and variables as columns. There’s an ID, the respondent’s names and age, then three sets of questions:

The x1 to x_text questions are all part of the x group of questions
The y1 to y3 and z1 to z3 questions are all part of the y and z group of questions, respectively

Looking at this data, we can see that:

Tom didn’t answer any of the z questions
Penny and Ursula have valid responses for at least one question from each group
Trevor didn’t give his last name and stopped responding after the x questions
Jenny didn’t respond to the y questions but answered at least one z question

A dictionary

Like any good dataset, a data dictionary is available:

dictionary <- tibble(
  variable = names(survey_data),
  type = c("numeric",rep("string",2),"numeric",rep("factor",3),"string",rep("factor",6)),
  variable_group = c("id", "name", "name", "age", rep("x", 4), rep("y", 3), rep("z", 3))
)

gt(dictionary)

variable	type	variable_group
id	numeric	id
name_first	string	name
name_last	string	name
age	numeric	age
x1	factor	x
x2	factor	x
x3	factor	x
x_text	string	x
y1	factor	y
y2	factor	y
y3	factor	y
z1	factor	z
z2	factor	z
z3	factor	z

It’s rudimentary but will serve its purpose to illustrate how we can use it in data cleaning and the reason for why I opted for a row-wise solution.

My solution

The column-wise solution for identifying the drop-offs is fairly straightforward, but I wasn’t sure how to incorporate missingness at the question group level where the participant had not dropped off (e.g. Jenny). Maybe I’m too tidyverse-pilled, but I found it much easier to solve this problem using group_by().

To start with, we need to transpose the data such that participant responses are in columns and the variables are in rows.

survey_data_tranposed <- survey_data |>
  select(any_of(dictionary$variable)) |> 
1  t() |>
2  as.data.frame() |>
3  rownames_to_column(var = "variable")

gt(survey_data_tranposed)

1: t() transposes the data into a matrix
2: Convert it back into a data frame
3: Then take the row names (which were previously column names) and turn them into the variable column.

variable	V1	V2	V3	V4	V5
id	1	2	3	4	5
name_first	Tom	Penny	Trevor	Ursula	Jenny
name_last	Smith	Jones	NA	Smith	Jones
age	62	52	37	44	NA
x1	NA	0	1	1	NA
x2	1	NA	NA	NA	0
x3	NA	1	0	NA	1
x_text	Lorem	NA	NA	Lorem	NA
y1	1	NA	NA	NA	NA
y2	NA	NA	NA	1	NA
y3	NA	1	NA	NA	NA
z1	NA	0	NA	NA	1
z2	NA	1	NA	1	0
z3	NA	0	NA	0	NA

Now that the data’s in this format, we can join information from the dictionary

survey_data_tranposed <- survey_data_tranposed |>
  left_join(dictionary, by = "variable") |>
  relocate(c(variable_group,type), .after = variable)

gt(survey_data_tranposed)

variable	variable_group	type	V1	V2	V3	V4	V5
id	id	numeric	1	2	3	4	5
name_first	name	string	Tom	Penny	Trevor	Ursula	Jenny
name_last	name	string	Smith	Jones	NA	Smith	Jones
age	age	numeric	62	52	37	44	NA
x1	x	factor	NA	0	1	1	NA
x2	x	factor	1	NA	NA	NA	0
x3	x	factor	NA	1	0	NA	1
x_text	x	string	Lorem	NA	NA	Lorem	NA
y1	y	factor	1	NA	NA	NA	NA
y2	y	factor	NA	NA	NA	1	NA
y3	y	factor	NA	1	NA	NA	NA
z1	z	factor	NA	0	NA	NA	1
z2	z	factor	NA	1	NA	1	0
z3	z	factor	NA	0	NA	0	NA

Then I wrote the following function, which incorporates the information we just joined in how missingness categories are identified. The code annotations below the code chunk explain each step.

code_missingness <- function(data, column) {
  output <- data |>
1    mutate(drop_off = row_number() > max(which(!is.na(!!sym(column))))) |>
    group_by(variable_group) |>
    mutate(
2      drop_off = min(drop_off),
3      n_vars = max(row_number()),
      n_missing = sum(is.na(!!sym(column))),
4      missingness = case_when(
5        drop_off == TRUE ~ "-777777",
6        n_vars > 1 & n_missing == n_vars ~ "-888888",
7        n_missing != n_vars &
          is.na(!!sym(column)) &
          type == "factor" ~ "0",
8        is.na(!!sym(column)) ~ "-999999",
9        .default = !!sym(column)
      )
    ) |> 
    ungroup() |> 
    select(missingness) |> 
10    rename(!!sym(column) := missingness) |>
    as.vector()
  
  return(output)
}

1: For each row in the column, !is.na() returns TRUE (1) is non-missing or FALSE (0) if missing. which() then returns a vector of indices where the non-missing values are. max() then identifies the last row where non-missing data exists. Up to and including this row, drop_off is FALSE, becoming TRUE for every row thereafter.
2: We then need to update drop_off if there are any non-missing responses within the question group. I’ll explain why later, but the group minimum of drop_off will set all rows to FALSE if any FALSE exists.
3: Then the number of variables within a group n_vars and number of missing responses within a group n_missing are calculated
4: With these elements now created, we can use case_when() to apply the missing coding
5: Due to step 2, we can be confident that drop_off == TRUE correctly identifies where responses have dropped off
6: The value -888888 identifies missingness at the question group level. If the number of missing responses within the question group is equal to the number of questions within said group, that is missing at the question group-level. n_vars > 1 is included since I consider groups to mean a collection of 2 or more questions.
7: In this condition, any missing values is.na(!!sym(column)) for a factor question type == "factor" where there are non-missing values within the question group n_missing != n_vars are coded to 0
8: Any remaining missing values can be coded as plain old -999999
9: Any rows that didn’t meet any of these rules are left as-is
10: Finally, rename the missingness column to whatever the current value of the column input is, since we’re going to iterate over every column

Now that this function is defined, we can apply it to the relevant columns of the data, which are currently named V1 through V5:

responses <- names(survey_data_tranposed |> select(matches("V\\d")))

# Use purrr:map to iterate over the columns identified in `responses`
survey_data_coded <- map(responses, \(x) code_missingness(survey_data_tranposed, x)) |>
  bind_cols()

survey_data_tranposed |>
  bind_cols(survey_data_coded |>
              rename_with( ~ paste0(., "_new"), matches("V\\d"))) |> 
  select(variable, variable_group, type, starts_with(c("V1","V2","V3","V4","V5"))) |> 
  gt()

variable	variable_group	type	V1	V1_new	V2	V2_new	V3	V3_new	V4	V4_new	V5	V5_new
id	id	numeric	1	1	2	2	3	3	4	4	5	5
name_first	name	string	Tom	Tom	Penny	Penny	Trevor	Trevor	Ursula	Ursula	Jenny	Jenny
name_last	name	string	Smith	Smith	Jones	Jones	NA	-999999	Smith	Smith	Jones	Jones
age	age	numeric	62	62	52	52	37	37	44	44	NA	-999999
x1	x	factor	NA	0	0	0	1	1	1	1	NA	0
x2	x	factor	1	1	NA	0	NA	0	NA	0	0	0
x3	x	factor	NA	0	1	1	0	0	NA	0	1	1
x_text	x	string	Lorem	Lorem	NA	-999999	NA	-999999	Lorem	Lorem	NA	-999999
y1	y	factor	1	1	NA	0	NA	-777777	NA	0	NA	-888888
y2	y	factor	NA	0	NA	0	NA	-777777	1	1	NA	-888888
y3	y	factor	NA	0	1	1	NA	-777777	NA	0	NA	-888888
z1	z	factor	NA	-777777	0	0	NA	-777777	NA	0	1	1
z2	z	factor	NA	-777777	1	1	NA	-777777	1	1	0	0
z3	z	factor	NA	-777777	0	0	NA	-777777	0	0	NA	0

It looks to have done the job I intended it to do:

Trevor has -999999 for last name and the x text variable, and -777777 from questions y onward
Jenny has group-level missingness -888888 for the y questions, because she goes on to answer the z questions so has not dropped off
Missing responses to factor questions in a group are coded to 0 when a non-missing response exists on any other question

Some explanations

The reason why I update the value of drop_off after grouping (annotation 2) is because of the underlying structure of the data I’m working with, which isn’t quite reflected in the example data. When outputting checkboxes, REDCap (a survey platform) auto-populates unchecked boxes with a 0 as opposed to missing. To work around this in the context of missingness, I first replace every 0 for these checkbox variables with NA so that I’m left with only values of 1 or NA. You might be thinking “someone could reasonably have answered the question by not ticking any boxes”, except that we provide ‘no’ and ‘unsure’ checkboxes to reasonably cover all response options. So if these options are 0 in addition to the other options, I feel more confident in saying the whole question group was skipped.

The original reason for asking the question was about performance, and this approach is probably slower than a column-wise solution. It takes around 10 minutes to iterate over ~6,500 responses to ~550 variables across ~100 question groups. In the future this isn’t going to be an issue because the data cleaning is going to be run on a frequent basis as data comes in (so smaller batches), rather than the entire set of responses.