Graphing in Stata

Stata has great graphing, but most people don't think so after the first few uses and revert to Excel (or say how much better it is). This is probably because the default colours, look, scales etc. of Stata graphs aren't very nice, whereas Excel lets you do it all by hand.

If you're writing a paper, essay, article etc. it's good practice to make sure your work is reproducible. This should also be the case for graphs, since they are often the main motivation for a research question or the most effective way to present results. If you do a lot of calculations in Excel and then make a graph by hand, this isn't straightforward. Using Stata also means you can keep pre-made code with your preferred options and copy, paste and re-use it in similar settings.

This is an intro to some basic graphs in Stata that I use for introductory courses. It has 5 parts:

  1. The graphing syntax
  2. More detailed formatting
  3. Overlaying
  4. Bar charts
  5. Exporting to Excel

It's meant to show the flexibility of Stata graphs, and provide some example code to play around with. Probably the best thing to do is to copy and paste any code that looks interesting into a .do file and include/exclude options one at a time to see how they change the charts.

I'll update this from time to time based on questions and things that come up.

Last updated: October 20, 2020

Basic graphing syntax

We can start by loading one of Stata’s system datasets on cars, and produce a basic graph - how are price and miles per gallon correlated?

sysuse auto, clear
graph twoway scatter price mpg

To play around with the formatting, just add a comma at the end of the command and start adding options. Start by fixing the axis labels:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) 
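
The rule 0(2000)16000 inside ylab() is a Stata number list: start at 0, go up in steps of 2000, stop at 16000. As a small sketch (the steps and values picked here are only for illustration), the same axes can be labelled with a coarser rule or with an explicit list of values:

```stata
sysuse auto, clear
* Check the range of each variable before choosing labels
summarize price mpg
* Rule form: first(step)last
twoway scatter price mpg, ylab(0(4000)16000) xlab(10(10)40)
* Explicit list of values; the two forms can also be mixed
twoway scatter price mpg, ylab(0 5000 10000 15000) xlab(10 20(5)40)
```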

With the axis labels fixed, this already looks much better. But the colour scheme is quite ugly. One way to change this is to use one of Stata’s built-in schemes:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	scheme(s2mono)

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	scheme(s1rcolor)
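
Rather than adding scheme() to every command, a scheme can also be set once for the whole session. A minimal sketch, assuming s1color is the scheme you want (any installed scheme name would do):

```stata
sysuse auto, clear
* List the schemes installed on this machine
graph query, schemes
* Use s1color for the rest of the session
set scheme s1color
twoway scatter price mpg
```

Adding the permanently option to set scheme keeps the choice for future sessions as well.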

But it’s good to get to grips with the graph syntax to make your own, and present them exactly how you’d like them - or in line with company/organisation branding. With a few basic options the basic scatter plot can be tidied up a lot:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	mcolor(red%40) msymbol(Oh) ///
	graphregion(fcolor(white) lcolor(white)) ///
	title("Price and miles per gallon", margin(b=5) pos(11) color(black))

A good way to figure out what each option does is just to copy and paste this code and re-run it, taking the options out one by one. Here's Stata's graph twoway manual with all options for all graphs, but you need to follow a link-trail to figure most things out.
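
Once you have a set of options you like, one way to re-use them is to keep them in a local macro. This is only a sketch: the macro name mystyle is made up, and because local macros disappear when a do-file finishes, the lines need to be run together rather than one at a time.

```stata
sysuse auto, clear
* Hypothetical macro name; holds options shared by several graphs
local mystyle graphregion(fcolor(white) lcolor(white)) ///
	ylab(, angle(0) labsize(small) nogrid) xlab(, labsize(small))
* The same look applied to two different scatter plots
twoway scatter price mpg, `mystyle' mcolor(red%40)
twoway scatter weight length, `mystyle' mcolor(midblue%40)
```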

More detailed formatting syntax

Going back to the simple scatter graph that we modified slightly:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	mcolor(red%40) msymbol(Oh) ///
	graphregion(fcolor(white) lcolor(white)) ///
	title("Price and miles per gallon", margin(b=5) pos(11) color(black))

We can clean it up even more. For example, the dots (markers in Stata speak) can have differently coloured outlines and fills, and the shape can be changed:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	mlcolor(red%40) mfcolor(midblue%40) msymbol(triangle) ///
	graphregion(fcolor(white) lcolor(white)) ///
	title("Price and miles per gallon", margin(b=5) pos(11) color(black))
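
To see which marker shapes are available, Stata can list the symbol names and draw a reference chart of them. The last command below is just one illustrative combination of marker options, not a recommendation:

```stata
sysuse auto, clear
* List the available marker symbol names
graph query symbolstyle
* Draw a reference chart of the symbols
palette symbolpalette
* Large hollow diamonds with a thin red outline
twoway scatter price mpg, msymbol(Dh) msize(large) mlcolor(red) mlwidth(thin)
```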

It doesn't look good on this graph, but it is possible to add labels to the markers:

twoway scatter price mpg, ///
	ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
	xlab(10(5)40, labsize(small)) ///
	mlcolor(red%40) mfcolor(midblue%40) msymbol(triangle) ///
	mlab(make) mlabsize(small) mlabpos(12)  ///
	graphregion(fcolor(white) lcolor(white)) ///
	title("Price and miles per gallon", margin(b=5) pos(11) color(black))
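
Labels get less cluttered if only a few points carry them. A sketch of one way to do this, using a made-up helper variable toplabel and an arbitrary cut-off of 10,000 dollars:

```stata
sysuse auto, clear
* Hypothetical helper variable: holds the make only for the priciest cars
generate str18 toplabel = make if price > 10000
twoway scatter price mpg, ///
	mlab(toplabel) mlabsize(small) mlabpos(12) ///
	graphregion(fcolor(white) lcolor(white))
```

Cars below the cut-off have an empty label, so nothing is drawn next to them.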

It's also possible to change the axes and gridlines on the tidied-up scatter plot:

twoway scatter price mpg, ///
  ylab(0(2000)16000, angle(0) labsize(small) glpattern(dash) glcolor(black%20) glwidth(thin)) ///
  xlab(10(5)40, labsize(small)) ///
  mlcolor(red) mfcolor(black) msymbol(triangle) ///
  xscale(range(8) noextend lcolor(red)) ///
  yscale(range(0) noextend lcolor(red)) plotregion(margin(b=5)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  title("Price and miles per gallon", margin(b=5) pos(11) color(black))

Of course these are silly modifications to highlight changes, but it is possible to create some very nice graphs using the options:

twoway scatter price mpg, ///
  ylab(0(2000)16000, angle(0) labsize(small) glcolor(white) glwidth(medthick)) ///
  xlab(10(5)40, labsize(small) grid glcolor(white) glwidth(medthick)) ///
  mlcolor(black) mfcolor(black%20) ///
  graphregion(fcolor(white) lcolor(white)) ///
  title("Price and miles per gallon", margin(b=5) pos(11) color(black)) ///
  xscale(lcolor(white) titlegap(5)) yscale(lcolor(white) titlegap(5)) ///
  plotregion(fcolor(gs15)) ///
  ytitle("Price ({c $|})") 
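
Once a graph looks right it can be saved from the do-file too, which keeps the whole workflow reproducible. A minimal sketch, where the graph name price_mpg and the file names are placeholders:

```stata
sysuse auto, clear
* Give the graph a name so it can be referred to later
twoway scatter price mpg, name(price_mpg, replace)
* PNG for documents; width() sets the size in pixels
graph export "price_mpg.png", name(price_mpg) replace width(2000)
* .gph keeps a copy that can be reopened in Stata
graph save price_mpg "price_mpg.gph", replace
```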

Overlaying

It's common to want to overlay charts. For example, in the scatter plot it might be a good idea to have a best-fitting line overlaid:

twoway (scatter price mpg) || (lfit price mpg)
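
The command above mixes two ways of separating plots, brackets and double bars. Either one on its own does the same job, so these two lines should draw the same chart:

```stata
sysuse auto, clear
* Brackets only
twoway (scatter price mpg) (lfit price mpg)
* Double bars only
twoway scatter price mpg || lfit price mpg
```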

To tidy it up, the specific options for each chart have to go inside the brackets, and master options outside:

twoway (scatter price mpg, mcolor(red%40) msymbol(Oh)) ///
  || (lfit price mpg, lcolor(black)), ///
  ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
  xlab(10(5)40, labsize(small)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  ytitle("Price ({c $|})") ///
  title("Price and miles per gallon", margin(b=5) pos(11) color(black))

It's possible to format and move the legend around as well:

twoway (scatter price mpg, mcolor(red%40) msymbol(Oh)) ///
  || (lfit price mpg, lcolor(black)), ///
  ylab(0(2000)16000, angle(0) labsize(small) nogrid) ///
  xlab(10(5)40, labsize(small)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  ytitle("Price ({c $|})") ///
  title("Price and miles per gallon", margin(b=5) pos(11) color(black)) ///
  legend(order(1 "Actual values" 2 "Fitted values") region(lcolor(white)) ring(0) pos(1) rows(2))
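
Overlaying also works for subsets of the data, since each plot can take its own if condition. As a sketch (the colours are arbitrary), here domestic and foreign cars get separate markers and the legend is relabelled to match:

```stata
sysuse auto, clear
twoway (scatter price mpg if foreign == 0, mcolor(red%40)) ///
  (scatter price mpg if foreign == 1, mcolor(midblue%40)) ///
  (lfit price mpg, lcolor(black)), ///
  graphregion(fcolor(white) lcolor(white)) ///
  legend(order(1 "Domestic" 2 "Foreign" 3 "Fitted values") ring(0) pos(1) rows(3))
```

The numbers in order() follow the order in which the plots are listed.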

Bar charts

The flexibility above applies to all types of twoway graphs. One of the most common graphs to use is a bar chart. Although Stata has a twoway bar, it also has a specific graph bar command. This is because twoway plots plot the values of two variables against one another, whereas bar charts typically plot means or totals across categories. For example, to look at the average price by whether a car is domestic (to America) or foreign-made, Stata's graph bar is ideal:

graph bar price, over(foreign)

And it can be tidied up in a similar way as before:

graph bar price, over(foreign, label(labsize(large))) ///
  bar(1, lcolor(black) fcolor(gs12)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  plotregion(lcolor(black)) ///
  ylab(0(1000)7000, ang(0) nogrid) ///
  ytitle("Average price ({c $|})"" ")

The default stat of graph bar is the mean within groups, but it can also report many other statistics, like the median:

graph bar (median) price, over(foreign, label(labsize(large))) ///
  bar(1, lcolor(black) fcolor(gs12)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  plotregion(lcolor(black)) ///
  ylab(0(1000)7000, ang(0) nogrid) ///
  ytitle("Median price ({c $|})"" ")

It's also possible to have two bars, and add a few more options:

graph bar price weight, over(foreign, label(labsize(large))) ///
  bargap(-30) blab(bar, pos(inside) format(%5.0fc) size(medium)) ///
  bar(1, lcolor(black) fcolor(gs12)) ///
  bar(2, lcolor(black) fcolor(white)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  plotregion(lcolor(black)) ///
  ylab(0(1000)7000, ang(0) nogrid) ///
  ytitle("Average price ({c $|}) / Average weight (lbs)"" ") ///
  legend(order(1 "Price" 2 "Weight") region(lcolor(none) fcolor(none)) rows(2) ring(0) pos(1))
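
The two bars don't have to be two different variables. Two statistics of the same variable work as well, as long as each gets its own name. In this sketch mean_price and med_price are made-up names:

```stata
sysuse auto, clear
* Name each statistic so the same variable can be used twice
graph bar (mean) mean_price=price (median) med_price=price, ///
  over(foreign) ///
  graphregion(fcolor(white) lcolor(white)) ///
  legend(order(1 "Mean price" 2 "Median price"))
```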

One problem with the graph bar command is that it's not possible to overlay charts. For example, we might want to display the mean price and an estimate of a confidence interval. For that, we need to use Stata's twoway bar alongside the collapse command:

collapse (mean) mean_price=price (semean) se_price=price, by(foreign)
* Create confidence intervals (using a normal approximation for ease)
gen upper = mean_price + 1.96*se_price
gen lower = mean_price - 1.96*se_price 

twoway (bar mean_price foreign) || (rcap upper lower foreign), xlab(0 1)
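
With groups this small, a t critical value is a bit more careful than 1.96. A sketch of that variant, assuming the same collapse as above plus a count of observations (n and tcrit are made-up names), ending with the same bar and interval chart:

```stata
sysuse auto, clear
collapse (mean) mean_price=price (semean) se_price=price ///
  (count) n=price, by(foreign)
* t critical value with n-1 degrees of freedom, two-sided 95% interval
generate tcrit = invttail(n - 1, 0.025)
generate upper = mean_price + tcrit*se_price
generate lower = mean_price - tcrit*se_price
list foreign mean_price lower upper
twoway (bar mean_price foreign) || (rcap upper lower foreign), xlab(0 1)
```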

We can add a third plot to this to highlight the actual point of the mean, and tidy it up:

twoway (bar mean_price foreign, barwidth(0.75) lcolor(black) fcolor(gs12)) ///
  || (rcap upper lower foreign, lcolor(midblue)) ///
  || (scatter mean_price foreign, mcolor(midblue%40)), ///
  xlab(0 "Domestic" 1 "Foreign") ylab(0(1000)7000, ang(0) nogrid) ///
  plotregion(margin(b=0 l=5 r=5)) ///
  graphregion(fcolor(white) lcolor(white)) ///
  ytitle("Price ({c $|})"" ") xtitle("") ///
  legend(order(1 "Price" 2 "95% CI") region(lcolor(white)))
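
Keep in mind that collapse replaces the data in memory with the group means. One way to get the original data back afterwards is to wrap the whole thing in preserve and restore. A minimal sketch:

```stata
sysuse auto, clear
* Take a snapshot of the full dataset
preserve
collapse (mean) mean_price=price (semean) se_price=price, by(foreign)
generate upper = mean_price + 1.96*se_price
generate lower = mean_price - 1.96*se_price
twoway (bar mean_price foreign, barwidth(0.75)) ///
  (rcap upper lower foreign), xlab(0 "Domestic" 1 "Foreign")
* Bring the full dataset back into memory
restore
describe, short
```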

Exporting data to Excel

One option is to collapse Stata data into the form you want for a graph and export it to Excel. For example, in the case of the simple bar charts above:

sysuse auto, clear
collapse (mean) mean_price=price (semean) se_price=price, by(foreign)
gen upper = mean_price + 1.96*se_price
gen lower = mean_price - 1.96*se_price 

label var mean_price "Price ($)"
lab var se_price "SE of the mean"
lab var upper "Upper 95% CI limit"
lab var lower "Lower 95% CI limit"

export excel using "MyBook.xlsx", replace firstrow(varlab) keepcellfmt
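
If you'd rather have the variable names than the labels in the first row, firstrow(variables) does that. A small sketch, writing to a made-up file name and reading it back to check what was written:

```stata
sysuse auto, clear
collapse (mean) mean_price=price (semean) se_price=price, by(foreign)
* "MyBook_names.xlsx" is a hypothetical file name
export excel using "MyBook_names.xlsx", replace firstrow(variables)
* Read the file back in to check what was written
import excel using "MyBook_names.xlsx", firstrow clear
describe
```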

This way, at least your calculations are reproducible, although the graph will be done by hand. Another shortcut is to set up the graph style you want in Excel after you first output the data.

It's then possible just to update the relevant cells in order to update the graph automatically. Say, as a silly example, we get new data on car prices in which all prices have been shifted up by $1000:

* Reload the full data, then make the change to imitate new data
sysuse auto, clear
replace price = price+1000

collapse (mean) mean_price=price (semean) se_price=price, by(foreign)
gen upper = mean_price + 1.96*se_price
gen lower = mean_price - 1.96*se_price 

label var mean_price "Price ($)"
lab var se_price "SE of the mean"
lab var upper "Upper 95% CI limit"
lab var lower "Lower 95% CI limit"

export excel using "MyBook.xlsx", firstrow(varlab) keepcellfmt sheet("Sheet1") sheetmodify cell("A1")