Ben Harrap
March 3, 2025
In particular, what makes a good variable naming convention for a longitudinal or panel survey? I wanted to answer this question so I could propose a good naming convention for a project at work. I have lots of opinions about this, largely based on working with variable names I don’t like, but I thought I should see what others thought so I posed the question on BlueSky…
… and I got just as many comments on what not to do as what to do! So let’s go through the do’s and don’ts and I’ll use the following question to demonstrate.
How would you rate your general health?
Poor
Fair
Good
Excellent
In terms of features, this question:
Let’s start out with what makes for a bad naming convention. If we figure out what we don’t like, we can start to figure out what we do.
Making variable names up as you go along is clearly a bad idea. It creates much more work as you’ll have to either remember every specific variable name or look at the data dictionary every time. Planning variable names ahead of time is also important: if you make up a few names and then try to fit new names into the pattern you’ve just made up, you’ll quickly run into issues where the names don’t work. So at the very least we need a naming convention.
For our general health question, maybe we called it generalhealth. Not terrible, but this isn’t going to be adequate in the context of an entire survey, as hopefully you’ll come to realise.
Making variable names as short as possible will save a few keystrokes at the cost of constantly referring back to the data dictionary to make sure you’re using the right one.
We could call the general health question genh or gh or ghr (general health rating). Yes, they’re short, but see how I had to explain what ghr stood for? It’s not immediately clear.
Also, text-completion exists in many IDEs, so don’t prioritise brevity!
Conversely, prioritising interpretability can lead to excessively long variable names, which is at the other extreme!
wave_1_health_general_health_general_health_rating_ordinal encodes lots of information - the wave, the section, the sub-section, some of the question wording, and the type of data. This kind of length might work for your own solo projects but not for a dataset that’s going to be used by lots of people.
Variable names that look alike are more common with conventions that prioritise brevity, as they tend to use abbreviations and avoid delimiters, which makes variables difficult for our brains to process quickly and accurately. Imagine the convention {wave}{respondent}{topic}{2-letter ID} resulted in the variables bcgenmo and bcgemno. They’re from entirely different topics, gen and gem, but it is very easy to mix them up visually and through typos.
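If you wanted to catch look-alike names before releasing a dataset, one option is to compute the edit distance between every pair of names and eyeball the close ones. This is only a rough sketch using base R’s adist(); the helper name find_lookalikes and the threshold of 2 are my own assumptions, not an established rule.

```r
# Hypothetical helper: flag pairs of variable names within a small edit distance.
find_lookalikes <- function(var_names, max_dist = 2) {
  d <- adist(var_names)  # pairwise Levenshtein distances between all names
  # Keep each pair once (upper triangle) where the distance is small
  close <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
  data.frame(
    name_1 = var_names[close[, "row"]],
    name_2 = var_names[close[, "col"]],
    distance = d[close],
    stringsAsFactors = FALSE
  )
}

# bchlthra is a made-up third name, included only for contrast
lookalikes <- find_lookalikes(c("bcgenmo", "bcgemno", "bchlthra"))
print(lookalikes)
```

Here bcgenmo and bcgemno are flagged because they differ by only two substitutions, while the made-up bchlthra is far enough from both to pass.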
camelCase, snake_case, kebab-case, SCREAMING-KEBAB-CASE, they’re all candidates but switching between cases is a no-no for variable names. It might be useful in other situations where you use one case for functions, another for variable names. But we’re just talking about variable names here, so don’t use more than one case.
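A single case is easy to enforce with a regular expression. The sketch below assumes the chosen case is lowercase snake_case; the function name is_snake_case and the exact pattern are my own choices, so adjust them to whichever case you settle on.

```r
# Assumed rule: lowercase snake_case, starting with a letter.
is_snake_case <- function(var_names) {
  grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$", var_names)
}

candidates <- c("general_health", "generalHealth", "general-health", "GENERAL_HEALTH")

# List the names that break the rule
candidates[!is_snake_case(candidates)]
```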
Sticking to a single language was a good point from my BlueSky post (thanks Russell!). In my context in Australia, I’d use English, and only English, to name the variables in my dataset.
People are going to have an easier time if we call the general health question general_health instead of allgemeine_gesundheit. If you were collecting data in Germany, allgemeine_gesundheit might make more sense.
Ok, using letters to represent wave numbers is probably a controversial one because I see it done time and time again. Yes, it’s a concise and convenient way of representing a number. However, it only works up to 26 waves (our survey’s going to run forever, right!?). It’s also a pain trying to remember which letter is which number and it forces you to write extra code to translate letters into numbers (e.g. match("k",letters[1:26])).
Imagine we’ve got 20 waves of survey data, what wave is the general health question hhlthgenord in? I’m sat here counting out letters on my fingers because I don’t know off the top of my head.
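To show the kind of extra code this forces on you, here is a small sketch that pulls the wave number out of a letter-prefixed name using match() against R’s built-in letters vector. The helper name wave_from_letter is made up for illustration, and it assumes the wave letter is always the first character.

```r
# Translate a leading wave letter into a wave number (a = 1, b = 2, ...).
# Assumes the wave is always encoded as the first character of the name.
wave_from_letter <- function(var_name) {
  match(substr(var_name, 1, 1), letters)
}

wave_from_letter("hhlthgenord")           # h is the 8th letter
wave_from_letter(c("asdtype", "ewluge"))  # a and e
```

With a numeric wave component like w08 none of this lookup would be needed.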
Also, it can lead to odd variable names - some examples from the wild include:
asdtype - Wave 1 survey type, not type of autism spectrum disorder
ewluge - A question about geriatric care in wave 5, not a disdain for sleds
These are fairly innocuous examples but I’ve seen some inappropriately named variables too.
You might be tempted to incorporate the question number into the variable name. Don’t.
Imagine the general health question is number 50 in wave 1, so we call it a50genh. In preparation for wave 2 we examine individual question response rates - looks like rates of missing data increase with increasing question number. We could address that in wave 2 by randomising the order of sections or putting the general health question earlier in the questionnaire. Good ideas, but either of these choices breaks the naming convention.
The same goes for using the section. The general health question might’ve been the only health question in the first wave, so it went in the ‘about you’ section and was called aaboutgenh. For wave 2 though we got funding from the Department of Health and they want to add a section on health. Now the general health question has moved to the ‘health’ section. Calling it bhealthgenh is encoding the section in wave 2 but is inconsistent with the wave 1 name. Calling it baboutgenh encodes the incorrect section in the name. Not good!
We have an idea of what makes a bad convention, so what makes a good one?
Naming conventions are built on components, with their combination resulting in variable names. Components you might use include:
Once you’ve identified the important components, next decide on how they will be combined - and stick to this formula!
Applying the convention {wave}_{theme}_{keyword} to the general health question in wave 1 could lead to the variable name w01_health_general.
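One way to make sure the formula is always applied the same way is to build names with a function rather than typing them out. This is a minimal sketch; make_var_name is a hypothetical helper and it assumes each component has already been formatted (e.g. the wave is already "w01").

```r
# Hypothetical helper: join pre-formatted components with a single delimiter,
# following the {wave}_{theme}_{keyword} convention.
make_var_name <- function(wave, theme, keyword) {
  paste(wave, theme, keyword, sep = "_")
}

make_var_name("w01", "health", "general")

# paste() is vectorised, so the same question across waves comes for free
make_var_name(c("w01", "w02"), "health", "general")
```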
The ordering of components should have a logic or hierarchy to it. I’d recommend following a hierarchy of information scope.
For example, the hierarchy in {wave}_{theme}_{keyword} goes from broad to narrow - {wave} relates to the whole survey and is the broadest in scope, {theme} relates to the theme or concept the question is getting at, but isn’t necessarily restricted to that question alone, and {keyword} has the most narrow scope, drawing on the specific wording of a question.
Also, think about how you might be working with the variables - R’s dplyr provides functions like ends_with() and starts_with() for a reason. From the palmer penguins dataset, consider the variables bill_length_mm, bill_depth_mm, and flipper_length_mm. If we wanted to work with measurements relating to bills, these variables are readily identifiable (starts_with("bill_")). Similarly, if we wanted to change millimetres to centimetres, we can identify these variables using ends_with("_mm").
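The same idea works without any packages. The sketch below uses base R’s startsWith() and endsWith() on a plain vector of names as a stand-in for the dplyr helpers; the vector is typed out by hand from the palmerpenguins column names rather than loaded from the package, so treat it as illustrative.

```r
# Column names of the palmerpenguins data, typed out by hand for illustration.
var_names <- c("species", "island", "bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g", "sex", "year")

# Everything about bills, thanks to the shared first component
bill_vars <- var_names[startsWith(var_names, "bill_")]

# Everything measured in millimetres, thanks to the shared last component
mm_vars <- var_names[endsWith(var_names, "_mm")]

# If the values were converted to centimetres, the names could follow suit
cm_names <- sub("_mm$", "_cm", mm_vars)
```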
This also makes it easy to see sets of related variables in the data dictionary and produces sensible options in auto-complete lists. Contrast these two approaches (I made up a few extra variables):
bill_length_mm
bill_depth_mm
bill_width_mm
bill_colour
bill_density
length_bill_mm
depth_bill_mm
width_bill_mm
colour_bill
density_bill
Having bill as the first component ({bodypart}?) makes it easy to spot variables relating to bills. Having {measurement} as the first component is more visually cluttered and in a dictionary of hundreds of variables would be harder to spot. You should consider the focus of the data though: maybe measurements are the focus, in which case it would be more useful to have measurement as the first component.
This consideration might not be relevant if you only have a small number of variables, but once you start hitting a large enough number that your data dictionary requires you to scroll up and down to see everything, it becomes useful to require each component to be the same length. Compare the following variable names following {topic}_{element}_{measurement}:
No restriction on component length:
penguin_species
location_island
penguin_bill_length
penguin_bill_depth
penguin_bill_width
penguin_bill_colour
penguin_bill_density
penguin_bill_mass
penguin_flipper_length
penguin_flipper_depth
penguin_flipper_width
penguin_flipper_density
penguin_flipper_colour
penguin_flipper_mass
penguin_sex
study_year
4 characters, 3 characters, 4 characters:
peng_spe
loca_isl
peng_bil_leng
peng_bil_dept
peng_bil_widt
peng_bil_colo
peng_bil_dens
peng_bil_mass
peng_fli_leng
peng_fli_dept
peng_fli_widt
peng_fli_dens
peng_fli_colo
peng_fli_mass
peng_sex
stud_yea
Ignoring whether the abbreviations are good or not (I just cut them off), I find having the variable names the same length means I’m having to do less visual processing when looking at them (I’m not distracted by the varying length) and can focus more easily on finding the variable I’m looking for.
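If you do go with fixed-length components, it’s worth checking the names programmatically. Here’s a rough sketch; has_fixed_widths is a hypothetical helper that assumes underscore-delimited names and allows trailing components to be absent.

```r
# Hypothetical check: does every component of every name have the expected width?
# Names with fewer components than `widths` are checked against the leading widths.
has_fixed_widths <- function(var_names, widths) {
  parts <- strsplit(var_names, "_", fixed = TRUE)
  vapply(parts, function(p) {
    length(p) <= length(widths) && all(nchar(p) == widths[seq_along(p)])
  }, logical(1))
}

# 4 characters, 3 characters, 4 characters
widths_ok <- has_fixed_widths(
  c("peng_bil_leng", "peng_sex", "penguin_bill_length"),
  widths = c(4, 3, 4)
)
widths_ok
```

The first two names pass and the unabbreviated one fails, which is the kind of thing that’s easy to miss by eye in a long dictionary.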
You might find yourself in a circumstance where a component is optional, such as in the above example - the bill and flipper both have measurements like length and depth, which are encoded in the third component, but this component is only included where there are multiple measurements of the same thing. Ideally all components are required; however, if you are going to use an optional component, restrict the convention to one optional component and stick it at the end of the variable name. This prevents the absence of the component from impacting on how the variables look in a list:
Optional {measure} 4 characters, {topic} 4 characters, {element} 3 characters:
peng_spe
loca_isl
leng_peng_bil
dept_peng_bil
widt_peng_bil
colo_peng_bil
dens_peng_bil
mass_peng_bil
leng_peng_fli
dept_peng_fli
widt_peng_fli
dens_peng_fli
colo_peng_fli
mass_peng_fli
peng_sex
stud_yea
{topic} 4 characters, {element} 3 characters, optional {measure} 4 characters:
peng_spe
loca_isl
peng_bil_leng
peng_bil_dept
peng_bil_widt
peng_bil_colo
peng_bil_dens
peng_bil_mass
peng_fli_leng
peng_fli_dept
peng_fli_widt
peng_fli_dens
peng_fli_colo
peng_fli_mass
peng_sex
stud_yea
We know that questionnaire section is a poor component because it breaks as soon as questions are moved between sections. Instead, try and choose components that are unlikely to change over time. For example, the general health question asks about health so we could say its theme or focus is health. In our previous example then, {wave}_{theme}_{keyword} is not going to break when the question moves from ‘about you’ to ‘health’, but {wave}_{section}_{keyword} will.
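A practical payoff of stable components is that matching questions across waves becomes trivial: strip the wave prefix and compare what’s left. In this sketch, every name other than the general health one is made up, and strip_wave is a hypothetical helper that assumes a two-digit w## prefix.

```r
# Made-up variable lists for two waves; only *_health_general comes from the post.
wave1 <- c("w01_health_general", "w01_about_age")
wave2 <- c("w02_health_general", "w02_about_age", "w02_health_smoking")

# Hypothetical helper: drop the {wave} component, keeping {theme}_{keyword}
strip_wave <- function(x) sub("^w[0-9]{2}_", "", x)

# Questions asked in both waves
in_both <- intersect(strip_wave(wave1), strip_wave(wave2))

# Questions new in wave 2
new_in_w2 <- setdiff(strip_wave(wave2), strip_wave(wave1))
```

Had the section been part of the name, the general health question would have shown up as dropped in wave 1 and new in wave 2.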
A side effect of choosing components that are resilient to changes in survey design is that if a circumstance arises that looks to break a variable name, it could be indicating that you need a new question instead.
The goal is to create variable names that are succinct.
succinct, adjective. marked by compact precise expression without wasted words
This is as much an art as a science, so don’t be surprised if it’s difficult!
Take the previous example wave_1_health_general_health_general_health_rating_ordinal. The section and sub-sections are redundant, repeating the words health and general. We could instead switch these components for the theme, which is still health, and get to wave_1_health_general_rating_ordinal. wave_1 could then be shortened to w01, and we could provide a schema for data type abbreviations:
str = strings
cat = categorical
ord = ordinal
num = numeric
So we switch from {wave_#}_{section}_{subsection}_{keywords}_{type} to {wave}_{theme}_{keyword}_{type} and end up with w01_health_general_ord.
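A convention like this can be written down as a regular expression, which lets you both validate names and split them back into their components. This is a sketch under my own assumptions: a two-digit wave, lowercase letters only for theme and keyword, and the four type abbreviations listed above. parse_var_name is a hypothetical helper.

```r
# Assumed pattern for {wave}_{theme}_{keyword}_{type}
pattern <- "^(w[0-9]{2})_([a-z]+)_([a-z]+)_(str|cat|ord|num)$"

# Hypothetical helper: return the components as a named list,
# or NULL if the name doesn't follow the convention.
parse_var_name <- function(var_name) {
  m <- regmatches(var_name, regexec(pattern, var_name))[[1]]
  if (length(m) == 0) return(NULL)
  setNames(as.list(m[-1]), c("wave", "theme", "keyword", "type"))
}

parsed <- parse_var_name("w01_health_general_ord")
str(parsed)

# The long version doesn't match the convention, so this returns NULL
rejected <- parse_var_name("wave_1_health_general_rating_ordinal")
```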
On abbreviations, I would suggest using them sparingly, rather than not at all. Sometimes there’s an important bit of information you want to encode in a variable or set of variables, but you can’t do it without using an abbreviation.
Related to brevity, it’s important to think about your end-users and the software they use. Stata and SAS only support variable names up to 32 characters in length (SPSS allows up to 64 bytes), so 32 characters is the hard maximum for your variable names. Variable names shouldn’t start with numbers either, which is why my examples for {wave} have been w01 not 01.
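Both of these constraints are easy to check in bulk. The sketch below assumes the 32-character limit discussed above; check_names is a hypothetical helper, not something from a package.

```r
# Hypothetical helper: report name length and whether a name starts with a digit.
check_names <- function(var_names, max_chars = 32) {
  data.frame(
    name = var_names,
    n_chars = nchar(var_names),
    too_long = nchar(var_names) > max_chars,
    starts_with_digit = grepl("^[0-9]", var_names),
    stringsAsFactors = FALSE
  )
}

results <- check_names(c(
  "w01_health_general_ord",
  "wave_1_health_general_health_general_health_rating_ordinal",
  "01_health_general"
))
print(results)
```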
In coming up with my proposed convention, I kept in mind that people use datasets in long and wide format. I was debating whether to include wave as a component, w01_health_general, or to store it as a separate variable. Moving wave to be a variable instead of a component didn’t mean I now had an extra four characters (w01_) to work with though - if Stata users reshape the dataset into wide format using wave, they’ll run into problems with variable names that are 29 characters or more.
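The arithmetic here is simple enough to check in code. This sketch assumes that going wide adds a four-character wave label like w01_ to each name and that the limit is 32 characters; fits_when_wide is a hypothetical helper and the strrep() names are placeholders of a given length.

```r
# Assumption: widening adds a 4-character wave label such as "w01_" to each name.
fits_when_wide <- function(stubs, wave_label = "w01_", max_chars = 32) {
  nchar(stubs) + nchar(wave_label) <= max_chars
}

# A real-looking name plus placeholders of exactly 28 and 29 characters
stubs <- c("health_general", strrep("x", 28), strrep("x", 29))
wide_ok <- fits_when_wide(stubs)
wide_ok
```

Names of 28 characters still fit once the label is added; 29 is where it tips over.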
While this might seem like over-engineering, it was a legitimate requirement of the end-users.
Aspects of the survey can also inform the components you choose, and also how you define those components.
I’ve used w01 as an example of how wave could be encoded in the variable name. The reason for w01 and not w1 is that this allows for 10 or more waves to be encoded in a way that sorts correctly:
sort(c("w1","w2","w3","w10","w11","w12"))
[1] "w1" "w10" "w11" "w12" "w2" "w3"
sort(c("w01","w02","w03","w10","w11","w12"))
[1] "w01" "w02" "w03" "w10" "w11" "w12"
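You don’t need to pad by hand. A minimal sketch using sprintf(), assuming two digits is enough (i.e. no more than 99 waves):

```r
waves <- c(1L, 2L, 3L, 10L, 11L, 12L)

# %02d pads with a leading zero to a width of two digits
padded <- sprintf("w%02d", waves)
padded

# Sorting the padded labels keeps them in wave order
sorted_padded <- sort(padded)

# Getting the number back out is a one-liner
wave_number <- as.integer(sub("^w", "", padded))
```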
The Longitudinal Study of Indigenous Children gathers responses from children, their parents, their teacher or carer, and their school principal. Respondent is included as a component in their variable names:
a = parent 1
b = parent 2
c = study child
d = teacher/carer
e = principal
The use of single letters here is prioritising brevity over interpretability, but you can see how this element of the survey informed the naming convention.
Using a delimiter to separate the components of your variable names makes them much easier for humans to read; just make sure you pick one and stick with it. Which delimiter do you prefer?
bill_length_mm
bill.length.mm
bill-length-mm
I prefer underscores, I think they’re the easiest to read. But you also need to consider your end-user’s software. Programs like Stata don’t support dashes or dots in variable names, so you’re kinda forced into using underscores anyway. Fine by me!
The convention you choose might act as a guide or it might act as a rule. If it’s a guide, there’s going to be some discretion involved in choosing what to use for certain elements, like {keyword}. For the general health rating, both w01_health_general and w01_health_rating are suitable options, but you have to choose one.
Regardless of whether the convention is applied as a guide or rule, make sure you have a human review the resulting list of variable names; it’s important that you don’t end up with names that are offensive.
Here’s a few other things that aren’t necessarily do’s or don’ts:
{wave} into a variable, rather than using it in the variable name
pid is sufficient over admin_pid
Emily Riederer wrote a great blog post on this topic, which helped inform my thinking - https://www.emilyriederer.com/post/column-name-contracts/
Wikipedia has articles on:
The tidyverse style guide has opinions about names and more - https://style.tidyverse.org/
@online{harrap2025variable,
author = {Benjamin Harrap},
title = {What makes a good variable naming convention},
year = {2025},
url = {https://benharrap.com/post/2025-03-03-variable-naming-convention},
}