I posted this question on Bluesky the other day, because I wasn’t sure whether a row-wise or column-wise solution would perform better for the problem I was trying to solve.
The feedback was consistent in stating that column-wise operations would be faster. So what did I do? I wrote a row-wise solution!
Ok hear me out though, there was a good reason and I’ll go through the problem and my thinking here, because the problem isn’t as simple as I’d laid out in my Bluesky post.
The problem
People do questionnaires (this is not the problem). Some people skip a question or two, some people skip a whole set, and some people get bored and stop half-way. These all lead to missing data, but the pattern of missingness is different and is informative when thinking about designing the next survey wave’s questionnaire.
The problem is how to efficiently identify these types of missingness in the data and code them as such.
Let’s start with a data dictionary some example data to help illustrate.
In this example, we’ve got the data in a tabular format with responses as rows and variables as columns. There’s an ID, the respondent’s names and age, then three sets of questions:
The x1 to x_text questions are all part of the x group of questions
The y1 to y3 and z1 to z3 questions are all part of the y and z group of questions, respectively
Looking at this data, we can see that:
Tom didn’t answer any of the z questions
Penny and Ursula have valid responses for at least one question from each group
Trevor didn’t give his last name and stopped responding after the x questions
Jenny didn’t respond to the y questions but answered at least one z question
A dictionary
Like any good dataset, a data dictionary is available:
It’s rudimentary but will serve its purpose to illustrate how we can use it in data cleaning and the reason for why I opted for a row-wise solution.
My solution
The column-wise solution for identifying the drop-offs is fairly straightforward, but I wasn’t sure how to incorporate missingness at the question group level where the participant had not dropped off (e.g. Jenny). Maybe I’m too tidyverse-pilled, but I found it much easier to solve this problem using group_by().
To start with, we need to transpose the data such that participant responses are in columns and the variables are in rows.
Then take the row names (which were previously column names) and turn them into the variable column.
variable
V1
V2
V3
V4
V5
id
1
2
3
4
5
name_first
Tom
Penny
Trevor
Ursula
Jenny
name_last
Smith
Jones
NA
Smith
Jones
age
62
52
37
44
NA
x1
NA
0
1
1
NA
x2
1
NA
NA
NA
0
x3
NA
1
0
NA
1
x_text
Lorem
NA
NA
Lorem
NA
y1
1
NA
NA
NA
NA
y2
NA
NA
NA
1
NA
y3
NA
1
NA
NA
NA
z1
NA
0
NA
NA
1
z2
NA
1
NA
1
0
z3
NA
0
NA
0
NA
Now that the data’s in this format, we can join information from the dictionary
survey_data_tranposed <- survey_data_tranposed |>left_join(dictionary, by ="variable") |>relocate(c(variable_group,type), .after = variable)gt(survey_data_tranposed)
variable
variable_group
type
V1
V2
V3
V4
V5
id
id
numeric
1
2
3
4
5
name_first
name
string
Tom
Penny
Trevor
Ursula
Jenny
name_last
name
string
Smith
Jones
NA
Smith
Jones
age
age
numeric
62
52
37
44
NA
x1
x
factor
NA
0
1
1
NA
x2
x
factor
1
NA
NA
NA
0
x3
x
factor
NA
1
0
NA
1
x_text
x
string
Lorem
NA
NA
Lorem
NA
y1
y
factor
1
NA
NA
NA
NA
y2
y
factor
NA
NA
NA
1
NA
y3
y
factor
NA
1
NA
NA
NA
z1
z
factor
NA
0
NA
NA
1
z2
z
factor
NA
1
NA
1
0
z3
z
factor
NA
0
NA
0
NA
Then I wrote the following function, which incorporates the information we just joined in how missingness categories are identified. The code annotations below the code chunk explain each step.
For each row in the column, !is.na() returns TRUE (1) is non-missing or FALSE (0) if missing. which() then returns a vector of indices where the non-missing values are. max() then identifies the last row where non-missing data exists. Up to and including this row, drop_off is FALSE, becoming TRUE for every row thereafter.
2
We then need to update drop_off if there are any non-missing responses within the question group. I’ll explain why later, but the group minimum of drop_off will set all rows to FALSE if any FALSE exists.
3
Then the number of variables within a group n_vars and number of missing responses within a group n_missing are calculated
4
With these elements now created, we can use case_when() to apply the missing coding
5
Due to step 2, we can be confident that drop_off == TRUE correctly identifies where responses have dropped off
6
The value -888888 identifies missingness at the question group level. If the number of missing responses within the question group is equal to the number of questions within said group, that is missing at the question group-level. n_vars > 1 is included since I consider groups to mean a collection of 2 or more questions.
7
In this condition, any missing values is.na(!!sym(column)) for a factor question type == "factor" where there are non-missing values within the question group n_missing != n_vars are coded to 0
8
Any remaining missing values can be coded as plain old -999999
9
Any rows that didn’t meet any of these rules are left as-is
10
Finally, rename the missingness column to whatever the current value of the column input is, since we’re going to iterate over every column
Now that this function is defined, we can apply it to the relevant columns of the data, which are currently named V1 through V5:
responses <-names(survey_data_tranposed |>select(matches("V\\d")))# Use purrr:map to iterate over the columns identified in `responses`survey_data_coded <-map(responses, \(x) code_missingness(survey_data_tranposed, x)) |>bind_cols()survey_data_tranposed |>bind_cols(survey_data_coded |>rename_with( ~paste0(., "_new"), matches("V\\d"))) |>select(variable, variable_group, type, starts_with(c("V1","V2","V3","V4","V5"))) |>gt()
variable
variable_group
type
V1
V1_new
V2
V2_new
V3
V3_new
V4
V4_new
V5
V5_new
id
id
numeric
1
1
2
2
3
3
4
4
5
5
name_first
name
string
Tom
Tom
Penny
Penny
Trevor
Trevor
Ursula
Ursula
Jenny
Jenny
name_last
name
string
Smith
Smith
Jones
Jones
NA
-999999
Smith
Smith
Jones
Jones
age
age
numeric
62
62
52
52
37
37
44
44
NA
-999999
x1
x
factor
NA
0
0
0
1
1
1
1
NA
0
x2
x
factor
1
1
NA
0
NA
0
NA
0
0
0
x3
x
factor
NA
0
1
1
0
0
NA
0
1
1
x_text
x
string
Lorem
Lorem
NA
-999999
NA
-999999
Lorem
Lorem
NA
-999999
y1
y
factor
1
1
NA
0
NA
-777777
NA
0
NA
-888888
y2
y
factor
NA
0
NA
0
NA
-777777
1
1
NA
-888888
y3
y
factor
NA
0
1
1
NA
-777777
NA
0
NA
-888888
z1
z
factor
NA
-777777
0
0
NA
-777777
NA
0
1
1
z2
z
factor
NA
-777777
1
1
NA
-777777
1
1
0
0
z3
z
factor
NA
-777777
0
0
NA
-777777
0
0
NA
0
It looks to have done the job I intended it to do:
Trevor has -999999 for last name and the x text variable, and -777777 from questions y onward
Jenny has group-level missingness -888888 for the y questions, because she goes on to answer the z questions so has not dropped off
Missing responses to factor questions in a group are coded to 0 when a non-missing response exists on any other question
Some explanations
The reason why I update the value of drop_off after grouping (annotation 2) is because of the underlying structure of the data I’m working with, which isn’t quite reflected in the example data. When outputting checkboxes, REDCap (a survey platform) auto-populates unchecked boxes with a 0 as opposed to missing. To work around this in the context of missingness, I first replace every 0 for these checkbox variables with NA so that I’m left with only values of 1 or NA. You might be thinking “someone could reasonably have answered the question by not ticking any boxes”, except that we provide ‘no’ and ‘unsure’ checkboxes to reasonably cover all response options. So if these options are 0 in addition to the other options, I feel more confident in saying the whole question group was skipped.
The original reason for asking the question was about performance, and this approach is probably slower than a column-wise solution. It takes around 10 minutes to iterate over ~6,500 responses to ~550 variables across ~100 question groups. In the future this isn’t going to be an issue because the data cleaning is going to be run on a frequent basis as data comes in (so smaller batches), rather than the entire set of responses.