Using R to extract results from Stata log files

Stata

programming

Author

Ben Harrap

Published

September 24, 2025

Taking results from Stata and inserting them into a report is a colossal pain in the ass, to put it mildly. There are a few custom commands out there, like baselinetable, that help ease the pain somewhat, but copy-pasting results from console output or log files seems inevitable and introduces human error.

It doesn’t have to be this way though! Recently, my predominantly Stata-using team has started making an effort to use Quarto for some regular reports we need to deliver. Naturally I’m thrilled, because Quarto is such a useful tool and it’ll save us a lot of time and error copy-pasting results from Stata. The only problem is we need Stata output in a format that we can read in for our Quarto report.

My solution for this? Write R functions that extract results from Stata’s log files! mind_blown.gif

What’s a log file?

Every Stata user should already be familiar with log files, but if you’re not, they just log all of the commands you run during a Stata session, along with their outputs, in an .smcl file. This stands for “Stata Markup and Control Language” and is pronounced “smickle” according to Stata’s help files. I don’t care to get into the detail of how these work, all I need to know is I can read it in to R as a text file.

To demonstrate, I ran the following commands in Stata

1log using log
2sysuse auto, clear
3tab headroom foreign
4tab rep78
5su price
6su price weight length gear_ratio
7reg mpg displacement gear_ratio weight price
8margins, at(weight=(2000(500)4500))
9log close

1: Start recording a log of the session, outputting to log.smcl
2: Use the auto data, similar to mtcars
3: Create a two-way table
4: Tabulate a single variable
5: Produce descriptives of a single variable
6: Then for multiple variables
7: Fit a regression model with mpg as the outcome
8: Produce predictive margins at a severak values of weight
9: Stop recording to the log file

If you want to see what the file looks like, it’s here.

The guts of it

Now we’ve got a log file to play with, we can read it into R and have a look.

data <- read.delim("log.smcl")
data[8:22,]

 [1] "{com}. tab headroom foreign"                                               
 [2] "  {txt}Headroom {c |}      Car origin"                                     
 [3] "     (in.) {c |}  Domestic    Foreign {c |}     Total"                     
 [4] "{hline 11}{c +}{hline 22}{c +}{hline 10}"                                  
 [5] "       1.5 {c |}{res}         3          1 {txt}{c |}{res}         4 "     
 [6] "{txt}       2.0 {c |}{res}        10          3 {txt}{c |}{res}        13 "
 [7] "{txt}       2.5 {c |}{res}         4         10 {txt}{c |}{res}        14 "
 [8] "{txt}       3.0 {c |}{res}         7          6 {txt}{c |}{res}        13 "
 [9] "{txt}       3.5 {c |}{res}        13          2 {txt}{c |}{res}        15 "
[10] "{txt}       4.0 {c |}{res}        10          0 {txt}{c |}{res}        10 "
[11] "{txt}       4.5 {c |}{res}         4          0 {txt}{c |}{res}         4 "
[12] "{txt}       5.0 {c |}{res}         1          0 {txt}{c |}{res}         1 "
[13] "{txt}{hline 11}{c +}{hline 22}{c +}{hline 10}"                             
[14] "     Total {c |}{res}        52         22 {txt}{c |}{res}        74 "     
[15] "{com}. tab rep78"

There’s a bunch of markup here that Stata uses to format the file nicely when you view it using Stata. It’s a bit ugly to look at as plain-text but there are some helpful things to notice here, which we can use to extract our results:

Command lines begin with {com}. (see lines 1 and 15)
The results lines end with a number (lines 5 through 12, and 14)

The output looks a little complicated with how it’s ordered, because it’s tabulating two variables. Line 5 has results, but doesn’t begin with {txt}. The first value in lines 5 to 12 are the values of headroom, while the remaining values correspond to ‘Domestic’, ‘Foreign’ and ‘Total’.

Let’s check the output for the other commands though. Here’s the summarise (or su) output:

data[33:43,]

 [1] "{com}. su price"                                                                             
 [2] "{txt}    Variable {c |}        Obs        Mean    Std. dev.       Min        Max"            
 [3] "{hline 13}{c +}{hline 57}"                                                                   
 [4] "{space 7}price {c |}{res}         74    6165.257    2949.496       3291      15906"          
 [5] "{com}. su price weight length gear_ratio"                                                    
 [6] "{txt}    Variable {c |}        Obs        Mean    Std. dev.       Min        Max"            
 [7] "{hline 13}{c +}{hline 57}"                                                                   
 [8] "{space 7}price {c |}{res}         74    6165.257    2949.496       3291      15906"          
 [9] "{txt}{space 6}weight {c |}{res}         74    3019.459    777.1936       1760       4840"    
[10] "{txt}{space 6}length {c |}{res}         74    187.9324    22.26634        142        233"    
[11] "{txt}{space 2}gear_ratio {c |}{res}         74    3.014865    .4562871       2.19       3.89"

The results also end with a number here, but they don’t have a trailing whitespace. We can fix that later using stringr::str_trim(). Otherwise it looks fairly straightforward. Lines 2, 6, tell us what the results correspond to.

Let’s check the regression output:

data[44:60,]

 [1] "{com}. reg mpg displacement gear_ratio weight price"                                                                                                                                       
 [2] "{txt}      Source {c |}       SS           df       MS      Number of obs   ={res}        74"                                                                                              
 [3] "{txt}{hline 13}{c +}{hline 34}   F(4, 69)        = {res}    32.92"                                                                                                                         
 [4] "{txt}       Model {c |} {res} 1603.26274         4  400.815685   {txt}Prob > F        ={res}    0.0000"                                                                                    
 [5] "{txt}    Residual {c |} {res} 840.196719        69   12.176764   {txt}R-squared       ={res}    0.6561"                                                                                    
 [6] "{txt}{hline 13}{c +}{hline 34}   Adj R-squared   ={res}    0.6362"                                                                                                                         
 [7] "{txt}       Total {c |} {res} 2443.45946        73  33.4720474   {txt}Root MSE        =   {res} 3.4895"                                                                                    
 [8] "{txt}{hline 13}{c TT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"                                                                                                           
 [9] "{col 1}         mpg{col 14}{c |} Coefficient{col 26}  Std. err.{col 38}      t{col 46}   P>|t|{col 54}     [95% con{col 67}f. interval]"                                                   
[10] "{hline 13}{c +}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"                                                                                                                 
[11] "displacement {c |}{col 14}{res}{space 2} .0088833{col 26}{space 2} .0117481{col 37}{space 1}    0.76{col 46}{space 3}0.452{col 54}{space 4}-.0145535{col 67}{space 3}   .03232"            
[12] "{txt}{space 2}gear_ratio {c |}{col 14}{res}{space 2} .9008024{col 26}{space 2} 1.645546{col 37}{space 1}    0.55{col 46}{space 3}0.586{col 54}{space 4}-2.381972{col 67}{space 3} 4.183577"
[13] "{txt}{space 6}weight {c |}{col 14}{res}{space 2}-.0063068{col 26}{space 2} .0012248{col 37}{space 1}   -5.15{col 46}{space 3}0.000{col 54}{space 4}-.0087502{col 67}{space 3}-.0038635"    
[14] "{txt}{space 7}price {c |}{col 14}{res}{space 2}-.0001173{col 26}{space 2} .0001687{col 37}{space 1}   -0.70{col 46}{space 3}0.489{col 54}{space 4}-.0004538{col 67}{space 3} .0002193"     
[15] "{txt}{space 7}_cons {c |}{col 14}{res}{space 2}   36.595{col 26}{space 2} 6.734683{col 37}{space 1}    5.43{col 46}{space 3}0.000{col 54}{space 4} 23.15968{col 67}{space 3} 50.03033"     
[16] "{txt}{hline 13}{c BT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"                                                                                                           
[17] "{res}"

It’s very busy but the same principle applies — result lines (11 to 15) end with a number.

Finally, the margins output:

data[61:85,]

 [1] "{com}. margins, at(weight=(2000(500)4500))"                                                                                                                                         
 [2] "{res}"                                                                                                                                                                              
 [3] "{txt}{col 1}Predictive margins{col 61}{lalign 13:Number of obs}{col 74} = {res}{ralign 2:74}"                                                                                       
 [4] "{txt}{col 1}Model VCE: {res:OLS}"                                                                                                                                                   
 [5] "{txt}{p2colset 1 13 13 2}{...}"                                                                                                                                                     
 [6] "{p2col:Expression:}{res:Linear prediction, predict()}{p_end}"                                                                                                                       
 [7] "{p2colreset}{...}"                                                                                                                                                                  
 [8] "{lalign 7:1._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:2000}}"                                                                                                               
 [9] "{lalign 7:2._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:2500}}"                                                                                                               
[10] "{lalign 7:3._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:3000}}"                                                                                                               
[11] "{lalign 7:4._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:3500}}"                                                                                                               
[12] "{lalign 7:5._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:4000}}"                                                                                                               
[13] "{lalign 7:6._at: }{space 0}{lalign 6:weight} = {res:{ralign 4:4500}}"                                                                                                               
[14] "{res}{txt}{hline 13}{c TT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"                                                                                               
[15] "{col 14}{c |}{col 26} Delta-method"                                                                                                                                                 
[16] "{col 14}{c |}     Margin{col 26}   std. err.{col 38}      t{col 46}   P>|t|{col 54}     [95% con{col 67}f. interval]"                                                               
[17] "{hline 13}{c +}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"                                                                                                          
[18] "{space 9}_at {c |}"                                                                                                                                                                 
[19] "{space 10}1  {c |}{col 14}{res}{space 2} 27.72685{col 26}{space 2} 1.312853{col 37}{space 1}   21.12{col 46}{space 3}0.000{col 54}{space 4} 25.10778{col 67}{space 3} 30.34593"     
[20] "{txt}{space 10}2  {c |}{col 14}{res}{space 2} 24.57344{col 26}{space 2}   .75454{col 37}{space 1}   32.57{col 46}{space 3}0.000{col 54}{space 4} 23.06817{col 67}{space 3} 26.07871"
[21] "{txt}{space 10}3  {c |}{col 14}{res}{space 2} 21.42002{col 26}{space 2} .4063483{col 37}{space 1}   52.71{col 46}{space 3}0.000{col 54}{space 4} 20.60938{col 67}{space 3} 22.23067"
[22] "{txt}{space 10}4  {c |}{col 14}{res}{space 2} 18.26661{col 26}{space 2}  .714807{col 37}{space 1}   25.55{col 46}{space 3}0.000{col 54}{space 4} 16.84061{col 67}{space 3} 19.69261"
[23] "{txt}{space 10}5  {c |}{col 14}{res}{space 2} 15.11319{col 26}{space 2} 1.267604{col 37}{space 1}   11.92{col 46}{space 3}0.000{col 54}{space 4} 12.58439{col 67}{space 3}   17.642"
[24] "{txt}{space 10}6  {c |}{col 14}{res}{space 2} 11.95978{col 26}{space 2} 1.858154{col 37}{space 1}    6.44{col 46}{space 3}0.000{col 54}{space 4} 8.252865{col 67}{space 3} 15.66669"
[25] "{txt}{hline 13}{c BT}{hline 11}{hline 11}{hline 9}{hline 8}{hline 13}{hline 12}"

The main thing to notice is that the values of weight at which the estimates are produced are in lines 8 to 13, while the estimates themselves are on separate lines (19 to 24).

Separate out the results

I could’ve split the analysis up across multiple log files. For example, outputting tabulations into tabulation.log, the summaries into summaries.log, the regression model into regression.log and the margins into margins.log. There’s not much need though, since we can use the command rows to subset the data. We’ll need them like this for our Quarto report anyway, plus each command seems to need slightly different work.

library(tidyverse)
data <- data |> 
  mutate(
    # Create a variable to identify each command
    command = if_else(
      str_detect(X.smcl., "\\{com\\}\\."),
      true = X.smcl.,
      false = NA
    )
  ) |> 
  # Fill downward to catch all the lines of output from that command
  fill(command, .direction = "down")

Notice the double backslashes in str_detect() - we need to escape the special characters since we’re using a regular expression here.

Now we’ve identified the lines from each command, we can separate out our results.

tab_headroom_foreign <- filter(data, command == "{com}. tab headroom foreign")
tab_rep78 <- filter(data, command == "{com}. tab rep78")
summaries <- filter(data, command == "{com}. su price weight length gear_ratio")
regression <- filter(data, command == "{com}. reg mpg displacement gear_ratio weight price")
margins <- filter(data, command == "{com}. margins, at(weight=(2000(500)4500))")

One-way tables

Starting with the simplest output, let’s extract the counts of each level of rep78:

tab_rep78 <- tab_rep78 |> 
  mutate(
    results = case_when(
1        str_detect(X.smcl., "((-)?\\.)?\\d+(\\.\\d+)?$")
2        ~ str_extract_all(X.smcl.,"((-)?\\.)?\\d+(\\.\\d+)?"),
        .default = NA
      ),
4    X.smcl. = str_squish(str_remove_all(X.smcl.,"\\{.+\\}"))
  ) |> 
3  hoist(
    "results",
    repair_record = 1,
    frequency = 2,
    percent = 3,
    cumulative = 4
  ) |> 
  select(-command)

tab_rep78

1: This regex is looking for strings that optionally start with a negative or decimal (((-)?\\.)?), followed by any number of digits (\\d+), optionally followed by a decimal and any number of digits ((\\.\\d+)), at the end of the string $
2: stringr::str_extract_all() then extracts all instances of this pattern anywhere in the string (we omit the $), returning a list-column called results
3: tidyr::hoist() then extracts the elements of results into their own column, to which we assign names
4: I just formatted the output to look nicer on this blog post, this step is unnecessary here

                          X.smcl. repair_record frequency percent cumulative
1                     . tab rep78            78      <NA>    <NA>       <NA>
2                                          <NA>      <NA>    <NA>       <NA>
3  record 1978 Freq. Percent Cum.          <NA>      <NA>    <NA>       <NA>
4                                          <NA>      <NA>    <NA>       <NA>
5                   1 2 2.90 2.90             1         2    2.90       2.90
6                   8 11.59 14.49             2         8   11.59      14.49
7                  30 43.48 57.97             3        30   43.48      57.97
8                  18 26.09 84.06             4        18   26.09      84.06
9                 11 15.94 100.00             5        11   15.94     100.00
10                                         <NA>      <NA>    <NA>       <NA>
11                Total 69 100.00            69    100.00    <NA>       <NA>

Notice that the ‘Total’ row (11) hasn’t quite worked. We could tinker with the regex or mutate() the correct values in. Or we could just use janitor::adorn_totals() to make our lives easier:

library(janitor)
tab_rep78 |> 
  filter(!is.na(cumulative)) |> 
  select(repair_record:percent) |> 
  mutate(across(everything(), as.numeric)) |> 
  adorn_totals("row")

 repair_record frequency percent
             1         2    2.90
             2         8   11.59
             3        30   43.48
             4        18   26.09
             5        11   15.94
         Total        69  100.00

Regression output

The combination of str_extract_all() and hoist() is what this all boils down to, but there’s a bit of tweaking we need to do for the regression output. The regex in the previous example was extracting all instances of digits with optional decimals, however in the regression output there’s spacing markup like {hline 13}, which we don’t want to extract.

We could fix this by modifying the regex, but I’m not a fan of making large, complicated expressions if I can avoid it, so instead we can just replace anything inside of a curly bracket with a space, using str_replace_all().

regression |>
  mutate(
    X.smcl. = str_replace_all(X.smcl., "\\{[\\w|\\d|\\s]+\\}", " "),
    results = case_when(
      str_detect(X.smcl., "((-)?\\.)?\\d+(\\.\\d+)?$") ~ str_extract_all(
        X.smcl.,
        "((-)?\\.)?\\d+(\\.\\d+)?"
      ),
      .default = NA
    )
  ) |>
  hoist(
    "results",
    estimate = 1,
    std_err = 2,
    t = 3,
    p_val = 4,
    ci_lower = 5,
    ci_upper = 6
  ) |>
  filter(!is.na(ci_upper)) |>
  select(-command) |>
  mutate(
    X.smcl. = str_sub(
      X.smcl.,
      start = str_locate(X.smcl., "[:graph:]+")[, "start"],
      end = str_locate(X.smcl., "[:graph:]+")[, "end"]
    )
  ) |>
  rename(variable = X.smcl.)

      variable  estimate  std_err    t p_val  ci_lower  ci_upper
1 displacement  .0088833 .0117481 0.76 0.452 -.0145535    .03232
2   gear_ratio  .9008024 1.645546 0.55 0.586  2.381972  4.183577
3       weight -.0063068 .0012248 5.15 0.000 -.0087502 -.0038635
4        price -.0001173 .0001687 0.70 0.489 -.0004538  .0002193
5        _cons    36.595 6.734683 5.43 0.000  23.15968  50.03033

The last thing to do here is create the variable column by extracting the variable names using str_sub and the start/end locations of the first place that numbers/letters/punctuation are detected.

Margins output

For the margins we can recycle the same code to extract the results, although we need to do a bit more work to name the margins correctly.

margins |>
  mutate(
    X.smcl. = str_replace_all(X.smcl., "\\{[\\w|\\d|\\s]+\\}", " "),
    results = case_when(
      str_detect(X.smcl., "\\d+(\\.\\d+)?$") ~ str_extract_all(X.smcl.,"((-)?\\.)?\\d+(\\.\\d+)?"),
      .default = NA
    )
  ) |>
  hoist(
    "results",
    margin = 1,
    estimate = 2,
    std_err = 3,
    t = 4,
    p_val = 5,
    ci_lower = 6,
    ci_upper = 7
  ) |> 
  filter(!is.na(ci_upper)) |> 
  select(-command, -X.smcl.) |> 
  mutate(
    margin = case_when(
      margin == "1" ~ "weight = 2000",
      margin == "2" ~ "weight = 2500",
      margin == "3" ~ "weight = 3000",
      margin == "4" ~ "weight = 3500",
      margin == "5" ~ "weight = 4000",
      margin == "6" ~ "weight = 4500",
    )
  )

         margin estimate  std_err     t p_val ci_lower ci_upper
1 weight = 2000 27.72685 1.312853 21.12 0.000 25.10778 30.34593
2 weight = 2500 24.57344   .75454 32.57 0.000 23.06817 26.07871
3 weight = 3000 21.42002 .4063483 52.71 0.000 20.60938 22.23067
4 weight = 3500 18.26661  .714807 25.55 0.000 16.84061 19.69261
5 weight = 4000 15.11319 1.267604 11.92 0.000 12.58439   17.642
6 weight = 4500 11.95978 1.858154  6.44 0.000 8.252865 15.66669

I could’ve written code to automatically extract the margin value and assign it, instead of manually coding the value of margin, but I want to finish this blog post so *waves hands*.

Extracting a two-way table

tab_headroom_foreign |> select(-command)

                                                                      X.smcl.
1                                                 {com}. tab headroom foreign
2                                         {txt}Headroom {c |}      Car origin
3                            (in.) {c |}  Domestic    Foreign {c |}     Total
4                                    {hline 11}{c +}{hline 22}{c +}{hline 10}
5              1.5 {c |}{res}         3          1 {txt}{c |}{res}         4 
6  {txt}       2.0 {c |}{res}        10          3 {txt}{c |}{res}        13 
7  {txt}       2.5 {c |}{res}         4         10 {txt}{c |}{res}        14 
8  {txt}       3.0 {c |}{res}         7          6 {txt}{c |}{res}        13 
9  {txt}       3.5 {c |}{res}        13          2 {txt}{c |}{res}        15 
10 {txt}       4.0 {c |}{res}        10          0 {txt}{c |}{res}        10 
11 {txt}       4.5 {c |}{res}         4          0 {txt}{c |}{res}         4 
12 {txt}       5.0 {c |}{res}         1          0 {txt}{c |}{res}         1 
13                              {txt}{hline 11}{c +}{hline 22}{c +}{hline 10}
14           Total {c |}{res}        52         22 {txt}{c |}{res}        74

I left this to last because it looks annoying to do. Unfortunately, my patience for dealing with Stata log files has run out as I write this blog post, so let’s pretend this last one is homework for you to figure out ;)

Hopefully though I’ve demonstrated that, with a little bit of work, you can get results out of Stata log files. Once you’ve done that, it’s business as usual as you write up your Quarto report!