- 1
-
Start recording a log of the session, outputting to
log.smcl - 2
-
Use the
autodata, similar tomtcars - 3
- Create a two-way table
- 4
- Tabulate a single variable
- 5
- Produce descriptives of a single variable
- 6
- Then for multiple variables
- 7
-
Fit a regression model with
mpgas the outcome - 8
-
Produce predictive margins at a severak values of
weight - 9
- Stop recording to the log file
Using R to extract results from Stata log files
Taking results from Stata and inserting them into a report is a colossal pain in the ass, to put it mildly. There are a few custom commands out there, like baselinetable, that help ease the pain somewhat, but copy-pasting results from console output or log files seems inevitable and introduces human error.
It doesn’t have to be this way though! Recently, my predominantly Stata-using team has started making an effort to use Quarto for some regular reports we need to deliver. Naturally I’m thrilled, because Quarto is such a useful tool and it’ll save us a lot of time and error copy-pasting results from Stata. The only problem is we need Stata output in a format that we can read in for our Quarto report.
My solution for this? Write R functions that extract results from Stata’s log files! mind_blown.gif
What’s a log file?
Every Stata user should already be familiar with log files, but if you’re not, they just log all of the commands you run during a Stata session, along with their outputs, in an .smcl file. This stands for “Stata Markup and Control Language” and is pronounced “smickle” according to Stata’s help files. I don’t care to get into the detail of how these work, all I need to know is I can read it in to R as a text file.
To demonstrate, I ran the following commands in Stata
If you want to see what the file looks like, it’s here.
The guts of it
Now we’ve got a log file to play with, we can read it into R and have a look.
data <- read.delim("log.smcl")
data[8:22,] [1] "{com}. tab headroom foreign"
[2] " {txt}Headroom {c |} Car origin"
[3] " (in.) {c |} Domestic Foreign {c |} Total"
[4] "{hline 11}{c +}{hline 22}{c +}{hline 10}"
[5] " 1.5 {c |}{res} 3 1 {txt}{c |}{res} 4 "
[6] "{txt} 2.0 {c |}{res} 10 3 {txt}{c |}{res} 13 "
[7] "{txt} 2.5 {c |}{res} 4 10 {txt}{c |}{res} 14 "
[8] "{txt} 3.0 {c |}{res} 7 6 {txt}{c |}{res} 13 "
[9] "{txt} 3.5 {c |}{res} 13 2 {txt}{c |}{res} 15 "
[10] "{txt} 4.0 {c |}{res} 10 0 {txt}{c |}{res} 10 "
[11] "{txt} 4.5 {c |}{res} 4 0 {txt}{c |}{res} 4 "
[12] "{txt} 5.0 {c |}{res} 1 0 {txt}{c |}{res} 1 "
[13] "{txt}{hline 11}{c +}{hline 22}{c +}{hline 10}"
[14] " Total {c |}{res} 52 22 {txt}{c |}{res} 74 "
[15] "{com}. tab rep78"
There’s a bunch of markup here that Stata uses to format the file nicely when you view it using Stata. It’s a bit ugly to look at as plain-text but there are some helpful things to notice here, which we can use to extract our results:
- Command lines begin with
{com}.(see lines 1 and 15) - The results lines end with a number (lines 5 through 12, and 14)
The output looks a little complicated with how it’s ordered, because it’s tabulating two variables. Line 5 has results, but doesn’t begin with {txt}. The first value in lines 5 to 12 are the values of headroom, while the remaining values correspond to ‘Domestic’, ‘Foreign’ and ‘Total’.
Let’s check the output for the other commands though. Here’s the summarise (or su) output:
data[33:43,] [1] "{com}. su price"
[2] "{txt} Variable {c |} Obs Mean Std. dev. Min Max"
[3] "{hline 13}{c +}{hline 57}"
[4] "{space 7}price {c |}{res} 74 6165.257 2949.496 3291 15906"
[5] "{com}. su price weight length gear_ratio"
[6] "{txt} Variable {c |} Obs Mean Std. dev. Min Max"
[7] "{hline 13}{c +}{hline 57}"
[8] "{space 7}price {c |}{res} 74 6165.257 2949.496 3291 15906"
[9] "{txt}{space 6}weight {c |}{res} 74 3019.459 777.1936 1760 4840"
[10] "{txt}{space 6}length {c |}{res} 74 187.9324 22.26634 142 233"
[11] "{txt}{space 2}gear_ratio {c |}{res} 74 3.014865 .4562871 2.19 3.89"
The results also end with a number here, but they don’t have a trailing whitespace. We can fix that later using stringr::str_trim(). Otherwise it looks fairly straightforward. Lines 2, 6, tell us what the results correspond to.
Let’s check the regression output:
data[44:60,] [1] "{com}. reg mpg displacement gear_ratio weight price"
[2] "{txt} Source {c |} SS df MS Number of obs ={res} 74"
[3] "{txt}{hline 13}{c +}{hline 34} F(4, 69) = {res} 32.92"
[4] "{txt} Model {c |} {res} 1603.26274 4 400.815685 {txt}Prob > F ={res} 0.0000"
[5] "{txt} Residual {c |} {res} 840.196719 69 12.176764 {txt}R-squared ={res} 0.6561"
[6] "{txt}{hline 13}{c +}{hline 34} Adj R-squared ={res} 0.6362"
[7] "{txt} Total {c |} {res} 2443.45946 73 33.4720474 {txt}Root MSE = {res} 3.4895"
[8] "{txt}{hline 13}{c TT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[9] "{col 1} mpg{col 14}{c |} Coefficient{col 26} Std. err.{col 38} t{col 46} P>|t|{col 54} [95% con{col 67}f. interval]"
[10] "{hline 13}{c +}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[11] "displacement {c |}{col 14}{res}{space 2} .0088833{col 26}{space 2} .0117481{col 37}{space 1} 0.76{col 46}{space 3}0.452{col 54}{space 4}-.0145535{col 67}{space 3} .03232"
[12] "{txt}{space 2}gear_ratio {c |}{col 14}{res}{space 2} .9008024{col 26}{space 2} 1.645546{col 37}{space 1} 0.55{col 46}{space 3}0.586{col 54}{space 4}-2.381972{col 67}{space 3} 4.183577"
[13] "{txt}{space 6}weight {c |}{col 14}{res}{space 2}-.0063068{col 26}{space 2} .0012248{col 37}{space 1} -5.15{col 46}{space 3}0.000{col 54}{space 4}-.0087502{col 67}{space 3}-.0038635"
[14] "{txt}{space 7}price {c |}{col 14}{res}{space 2}-.0001173{col 26}{space 2} .0001687{col 37}{space 1} -0.70{col 46}{space 3}0.489{col 54}{space 4}-.0004538{col 67}{space 3} .0002193"
[15] "{txt}{space 7}_cons {c |}{col 14}{res}{space 2} 36.595{col 26}{space 2} 6.734683{col 37}{space 1} 5.43{col 46}{space 3}0.000{col 54}{space 4} 23.15968{col 67}{space 3} 50.03033"
[16] "{txt}{hline 13}{c BT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[17] "{res}"
It’s very busy but the same principle applies — result lines (11 to 15) end with a number.
Finally, the margins output:
data[61:85,] [1] "{com}. margins, at(weight=(2000(500)4500))"
[2] "{res}"
[3] "{txt}{col 1}Predictive margins{col 61}{lalign 13:Number of obs}{col 74} = {res}{ralign 2:74}"
[4] "{txt}{col 1}Model VCE: {res:OLS}"
[5] "{txt}{p2colset 1 13 13 2}{...}"
[6] "{p2col:Expression:}{res:Linear prediction, predict()}{p_end}"
[7] "{p2colreset}{...}"
[8] "{lalign 7:1._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:2000}}"
[9] "{lalign 7:2._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:2500}}"
[10] "{lalign 7:3._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:3000}}"
[11] "{lalign 7:4._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:3500}}"
[12] "{lalign 7:5._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:4000}}"
[13] "{lalign 7:6._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:4500}}"
[14] "{res}{txt}{hline 13}{c TT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[15] "{col 14}{c |}{col 26} Delta-method"
[16] "{col 14}{c |} Margin{col 26} std. err.{col 38} t{col 46} P>|t|{col 54} [95% con{col 67}f. interval]"
[17] "{hline 13}{c +}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
[18] "{space 9}_at {c |}"
[19] "{space 10}1 {c |}{col 14}{res}{space 2} 27.72685{col 26}{space 2} 1.312853{col 37}{space 1} 21.12{col 46}{space 3}0.000{col 54}{space 4} 25.10778{col 67}{space 3} 30.34593"
[20] "{txt}{space 10}2 {c |}{col 14}{res}{space 2} 24.57344{col 26}{space 2} .75454{col 37}{space 1} 32.57{col 46}{space 3}0.000{col 54}{space 4} 23.06817{col 67}{space 3} 26.07871"
[21] "{txt}{space 10}3 {c |}{col 14}{res}{space 2} 21.42002{col 26}{space 2} .4063483{col 37}{space 1} 52.71{col 46}{space 3}0.000{col 54}{space 4} 20.60938{col 67}{space 3} 22.23067"
[22] "{txt}{space 10}4 {c |}{col 14}{res}{space 2} 18.26661{col 26}{space 2} .714807{col 37}{space 1} 25.55{col 46}{space 3}0.000{col 54}{space 4} 16.84061{col 67}{space 3} 19.69261"
[23] "{txt}{space 10}5 {c |}{col 14}{res}{space 2} 15.11319{col 26}{space 2} 1.267604{col 37}{space 1} 11.92{col 46}{space 3}0.000{col 54}{space 4} 12.58439{col 67}{space 3} 17.642"
[24] "{txt}{space 10}6 {c |}{col 14}{res}{space 2} 11.95978{col 26}{space 2} 1.858154{col 37}{space 1} 6.44{col 46}{space 3}0.000{col 54}{space 4} 8.252865{col 67}{space 3} 15.66669"
[25] "{txt}{hline 13}{c BT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"
The main thing to notice is that the values of weight at which the estimates are produced are in lines 8 to 13, while the estimates themselves are on separate lines (19 to 24).
Separate out the results
I could’ve split the analysis up across multiple log files. For example, outputting tabulations into tabulation.log, the summaries into summaries.log, the regression model into regression.log and the margins into margins.log. There’s not much need though, since we can use the command rows to subset the data. We’ll need them like this for our Quarto report anyway, plus each command seems to need slightly different work.
library(tidyverse)
data <- data |>
mutate(
# Create a variable to identify each command
command = if_else(
str_detect(X.smcl., "\\{com\\}\\."),
true = X.smcl.,
false = NA
)
) |>
# Fill downward to catch all the lines of output from that command
fill(command, .direction = "down")Notice the double backslashes in str_detect() - we need to escape the special characters since we’re using a regular expression here.
Now we’ve identified the lines from each command, we can separate out our results.
tab_headroom_foreign <- filter(data, command == "{com}. tab headroom foreign")
tab_rep78 <- filter(data, command == "{com}. tab rep78")
summaries <- filter(data, command == "{com}. su price weight length gear_ratio")
regression <- filter(data, command == "{com}. reg mpg displacement gear_ratio weight price")
margins <- filter(data, command == "{com}. margins, at(weight=(2000(500)4500))")One-way tables
Starting with the simplest output, let’s extract the counts of each level of rep78:
tab_rep78 <- tab_rep78 |>
mutate(
results = case_when(
1 str_detect(X.smcl., "((-)?\\.)?\\d+(\\.\\d+)?$")
2 ~ str_extract_all(X.smcl.,"((-)?\\.)?\\d+(\\.\\d+)?"),
.default = NA
),
4 X.smcl. = str_squish(str_remove_all(X.smcl.,"\\{.+\\}"))
) |>
3 hoist(
"results",
repair_record = 1,
frequency = 2,
percent = 3,
cumulative = 4
) |>
select(-command)
tab_rep78- 1
-
This regex is looking for strings that optionally start with a negative or decimal (
((-)?\\.)?), followed by any number of digits (\\d+), optionally followed by a decimal and any number of digits ((\\.\\d+)), at the end of the string$ - 2
-
stringr::str_extract_all()then extracts all instances of this pattern anywhere in the string (we omit the$), returning a list-column calledresults - 3
-
tidyr::hoist()then extracts the elements ofresultsinto their own column, to which we assign names - 4
- I just formatted the output to look nicer on this blog post, this step is unnecessary here
X.smcl. repair_record frequency percent cumulative
1 . tab rep78 78 <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA>
3 record 1978 Freq. Percent Cum. <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA>
5 1 2 2.90 2.90 1 2 2.90 2.90
6 8 11.59 14.49 2 8 11.59 14.49
7 30 43.48 57.97 3 30 43.48 57.97
8 18 26.09 84.06 4 18 26.09 84.06
9 11 15.94 100.00 5 11 15.94 100.00
10 <NA> <NA> <NA> <NA>
11 Total 69 100.00 69 100.00 <NA> <NA>
Notice that the ‘Total’ row (11) hasn’t quite worked. We could tinker with the regex or mutate() the correct values in. Or we could just use janitor::adorn_totals() to make our lives easier:
library(janitor)
tab_rep78 |>
filter(!is.na(cumulative)) |>
select(repair_record:percent) |>
mutate(across(everything(), as.numeric)) |>
adorn_totals("row") repair_record frequency percent
1 2 2.90
2 8 11.59
3 30 43.48
4 18 26.09
5 11 15.94
Total 69 100.00
Regression output
The combination of str_extract_all() and hoist() is what this all boils down to, but there’s a bit of tweaking we need to do for the regression output. The regex in the previous example was extracting all instances of digits with optional decimals, however in the regression output there’s spacing markup like {hline 13}, which we don’t want to extract.
We could fix this by modifying the regex, but I’m not a fan of making large, complicated expressions if I can avoid it, so instead we can just replace anything inside of a curly bracket with a space, using str_replace_all().
regression |>
mutate(
X.smcl. = str_replace_all(X.smcl., "\\{[\\w|\\d|\\s]+\\}", " "),
results = case_when(
str_detect(X.smcl., "((-)?\\.)?\\d+(\\.\\d+)?$") ~ str_extract_all(
X.smcl.,
"((-)?\\.)?\\d+(\\.\\d+)?"
),
.default = NA
)
) |>
hoist(
"results",
estimate = 1,
std_err = 2,
t = 3,
p_val = 4,
ci_lower = 5,
ci_upper = 6
) |>
filter(!is.na(ci_upper)) |>
select(-command) |>
mutate(
X.smcl. = str_sub(
X.smcl.,
start = str_locate(X.smcl., "[:graph:]+")[, "start"],
end = str_locate(X.smcl., "[:graph:]+")[, "end"]
)
) |>
rename(variable = X.smcl.) variable estimate std_err t p_val ci_lower ci_upper
1 displacement .0088833 .0117481 0.76 0.452 -.0145535 .03232
2 gear_ratio .9008024 1.645546 0.55 0.586 2.381972 4.183577
3 weight -.0063068 .0012248 5.15 0.000 -.0087502 -.0038635
4 price -.0001173 .0001687 0.70 0.489 -.0004538 .0002193
5 _cons 36.595 6.734683 5.43 0.000 23.15968 50.03033
The last thing to do here is create the variable column by extracting the variable names using str_sub and the start/end locations of the first place that numbers/letters/punctuation are detected.
Margins output
For the margins we can recycle the same code to extract the results, although we need to do a bit more work to name the margins correctly.
margins |>
mutate(
X.smcl. = str_replace_all(X.smcl., "\\{[\\w|\\d|\\s]+\\}", " "),
results = case_when(
str_detect(X.smcl., "\\d+(\\.\\d+)?$") ~ str_extract_all(X.smcl.,"((-)?\\.)?\\d+(\\.\\d+)?"),
.default = NA
)
) |>
hoist(
"results",
margin = 1,
estimate = 2,
std_err = 3,
t = 4,
p_val = 5,
ci_lower = 6,
ci_upper = 7
) |>
filter(!is.na(ci_upper)) |>
select(-command, -X.smcl.) |>
mutate(
margin = case_when(
margin == "1" ~ "weight = 2000",
margin == "2" ~ "weight = 2500",
margin == "3" ~ "weight = 3000",
margin == "4" ~ "weight = 3500",
margin == "5" ~ "weight = 4000",
margin == "6" ~ "weight = 4500",
)
) margin estimate std_err t p_val ci_lower ci_upper
1 weight = 2000 27.72685 1.312853 21.12 0.000 25.10778 30.34593
2 weight = 2500 24.57344 .75454 32.57 0.000 23.06817 26.07871
3 weight = 3000 21.42002 .4063483 52.71 0.000 20.60938 22.23067
4 weight = 3500 18.26661 .714807 25.55 0.000 16.84061 19.69261
5 weight = 4000 15.11319 1.267604 11.92 0.000 12.58439 17.642
6 weight = 4500 11.95978 1.858154 6.44 0.000 8.252865 15.66669
I could’ve written code to automatically extract the margin value and assign it, instead of manually coding the value of margin, but I want to finish this blog post so *waves hands*.
Extracting a two-way table
tab_headroom_foreign |> select(-command) X.smcl.
1 {com}. tab headroom foreign
2 {txt}Headroom {c |} Car origin
3 (in.) {c |} Domestic Foreign {c |} Total
4 {hline 11}{c +}{hline 22}{c +}{hline 10}
5 1.5 {c |}{res} 3 1 {txt}{c |}{res} 4
6 {txt} 2.0 {c |}{res} 10 3 {txt}{c |}{res} 13
7 {txt} 2.5 {c |}{res} 4 10 {txt}{c |}{res} 14
8 {txt} 3.0 {c |}{res} 7 6 {txt}{c |}{res} 13
9 {txt} 3.5 {c |}{res} 13 2 {txt}{c |}{res} 15
10 {txt} 4.0 {c |}{res} 10 0 {txt}{c |}{res} 10
11 {txt} 4.5 {c |}{res} 4 0 {txt}{c |}{res} 4
12 {txt} 5.0 {c |}{res} 1 0 {txt}{c |}{res} 1
13 {txt}{hline 11}{c +}{hline 22}{c +}{hline 10}
14 Total {c |}{res} 52 22 {txt}{c |}{res} 74
I left this to last because it looks annoying to do. Unfortunately, my patience for dealing with Stata log files has run out as I write this blog post, so let’s pretend this last one is homework for you to figure out ;)
Hopefully though I’ve demonstrated that, with a little bit of work, you can get results out of Stata log files. Once you’ve done that, it’s business as usual as you write up your Quarto report!