Getting started in Stata

This is an intro. to the very basics of Stata. It's meant for people with no experience using Stata and has 6 parts:
  1. The Stata interface
  2. Basic operations
  3. Preliminaries
  4. Loading and exploring data
  5. Manipulating and creating variables
  6. Some descriptive analysis

I'll update this from time to time.

Last updated: October 20, 2020

1.   The Stata interface

Stata has an interactive User Interface (UI), the Stata viewer:

[Figure: Stata UI]

You can easily work interactively in the Stata viewer by loading data using Ctrl + o and selecting the dataset you would like to work with, then executing individual commands in the command prompt. However, if you're producing work for a project, essay, thesis or paper it is good practice to ensure your work is reproducible.

There are also lots of things you can't do in the command prompt. For these reasons, it is best to use Stata's do-file editor: .do files are text documents in which you can write a whole list of commands and then ask Stata to "do" them, either all together or one-by-one. To open a do file, click the do-file editor button on the menu bar at the top of the Stata viewer:

Once you've written some commands in a .do file, you can hit Ctrl + d to execute the whole file in Stata. If you only want to run one or more lines, you can highlight them and hit Ctrl + d.

Commenting in a .do file is useful to keep track of what you have done, both for your future self and others looking to replicate/understand your work.

There are three ways to make a comment: a "*" at the beginning of any line prevents that line from being executed by Stata; "//" anywhere on a line turns whatever follows it on that line into a comment, which the do-file editor shows in green (leave a space before "//" when it follows a command); and wrapping text in "/* ... */" allows you to make comments over more than one line:

* One asterisk: single line comment

// Slashes: single line comment in-line //

/* Asterisk and slashes for longer
multi-line comments, maybe giving more
detail about a block of code or results */

There are lots of examples of using comments in the rest of this intro.
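There is also a fourth marker, "///", which is used in the graph code at the end of this intro: it comments out the rest of the line and tells Stata that the command carries on onto the next line. Here is a minimal sketch with all four together. The scalar name total is just an illustration. Run it from a .do file rather than the command prompt, because "//", "///" and "/* ... */" are only recognised in do-files:

```stata
* A full-line comment: Stata skips this line
display 2 + 2    // an in-line comment after a command

/* A block comment
   spanning two lines */

* "///" continues a command onto the next line
scalar total = 1 + 2 ///
    + 3
display total
```

The two display commands should print 4 and 6. Long commands with many options are where "///" earns its keep.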

Stata also has excellent help facilities. In the Stata command prompt you can type three things:

help [keyword]			// information for a specific command
search [keyword]		// search the Stata manual for help
findit [keyword]		// search the Stata manual + internet for help

Most of your queries can be answered by using these commands, or by searching the internet and Stata's comprehensive manuals for its commands (something I do on a daily basis). For example, googling "Stata manual regress" returns this detailed 25-page document on every aspect of Stata's regress command, for linear regressions.

2.   Basic operations

Before using Stata to load and analyse some data, it is useful to run through some of the basic operations. Stata is a very sophisticated calculator, but it is still "just" a calculator: it interprets the commands or blocks of code you ask it to execute much as a calculator interprets simple calculations:

*** Do some sums
di 2+2
4
di 2*2
4
di 2/2
1
di (2*2)/(2+2)
1	

The di command is short for display. This is common with Stata commands - you don't need to type their full names, but can type only the first few letters and Stata will recognise what you mean. How short you can go is shown by the underlined portion of the command in its manual. See, for example, the manual for display. We can do the same calculations by defining scalars:

* Defining some scalars
scalar define a = 2	
sca de b = 3		//Note you can shorten the command 
scalar c = a*b 		//or even omit the word define - it's optional and the default sub-command
di c
6

Note the in-line comments, and that the single "=" is the assignment operator, i.e. it assigns the object on the left-hand side the value on the right-hand side. We can do the same thing with text, or strings, or with both numbers and text:

* The same with strings
di "This is a string"
scalar a = "This is a string"
di a
This is a string 

scalar a = 2
di "This is a number: " a
This is a number: 2
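Beyond the arithmetic shown so far, display can evaluate powers, comparisons and logical statements too. The following is a small sketch of the operators that come up later in this intro. The values in the comments are what each line should print:

```stata
* Arithmetic: ^ raises to a power
display 2^3                      // 8
display sqrt(16)                 // 4

* Relational operators return 1 (true) or 0 (false)
display 3 > 2                    // 1
display 2 == 3                   // 0  (== tests equality; = assigns)
display 2 != 3                   // 1

* Logical operators: & (and), | (or), ! (not)
display (3 > 2) & (2 == 3)       // 0
display (3 > 2) | (2 == 3)       // 1

* + also joins strings together
display "Stata" + " " + "intro"  // Stata intro
```

The difference between "=" and "==" matters most once you start using if conditions on variables.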

Section 13.2 of this manual explains all of Stata's "operators". You can also do some matrix algebra:

* Matrices - "," for new column, "\" for new row, whole matrix enclosed in "()"
matrix A=(1, 2 \ 2, 1)
matrix list A

symmetric A[2,2]
    c1  c2
r1   1
r2   2   1


matrix B = 2*A
mat list B 				// again, note the shortening of matrix command

symmetric B[2,2]
    c1  c2
r1   2
r2   4   2

This manual gives you all you'll need to know about Stata's basic matrix functions/operators. Stata also has another, newer matrix language called Mata, however it's not necessary to know it for anything this intro. covers.
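As a rough sketch of a few more matrix operations, using the same matrix A as above (the names At, Ainv and P are just illustrative):

```stata
matrix A = (1, 2 \ 2, 1)
matrix At = A'              // transpose (A is symmetric, so At equals A)
matrix Ainv = inv(A)        // inverse of A
matrix P = A*Ainv           // a matrix times its inverse: the 2x2 identity matrix
matrix list Ainv
matrix list P
display A[1,2]              // pick out a single element: row 1, column 2
display det(A)              // determinant, here 1*1 - 2*2
```

Multiplying a matrix by its inverse is a handy sanity check: up to rounding, P should have ones on the diagonal and zeros elsewhere.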

3.   Preliminaries

Before you get into loading and analysing data, there are some preliminary set-up steps that'll make working with Stata more streamlined. There's a small block of code I have at the top of my .do file that outlines what the file contains, defines where Stata should look for anything I want to upload or save, and establishes some preferences.

For example, the beginning of a .do file containing the next three parts of this course might start like this:

/*************************************************************
This file contains code for an introductory course in Stata
Structure:
	1. Preliminaries
	2. Loading and exploring data
	3. Some descriptive analysis

Last updated: 24/06/2020
*************************************************************/

* Some simple preferences
clear all				//Clear everything in Stata and start fresh
set maxvar 5000				//Allow datasets with up to 5,000 variables to be loaded (Stata/SE and MP only; the minimum is 2,048)
set more off, permanently		//Automatically show all output without asking - handy for longer tasks

* Set base directory: where Stata should start to look for files I call/save
cd "C:\Users\markm\Documents\StataIntro"

As you go on and learn even more about Stata, you can add more preferences to this block, install commands, and define other aspects of your workflow like locations to save output. It's not essential to start .do files like this, but it's a good habit to get into to help keep your code organised.

The cd command here stands for change directory. Following it with that particular file path tells Stata to set the current directory to the StataIntro folder that I've created in my Documents.

The current directory is the folder Stata will automatically look in for data files you want to load, and where it will automatically save images or data files you create. You can tell Stata to navigate to sub-folders in this directory, or to look in the folder above it.
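Moving around from the current directory might look like the sketch below. It assumes you are happy for a Data sub-folder to be created in the current directory if one is not already there:

```stata
pwd                      // print the current directory
capture mkdir "Data"     // create a Data sub-folder; capture ignores the error if it already exists
cd "Data"                // move down into the sub-folder
pwd
cd ..                    // move back up one level
pwd
```

In practice I rarely cd into sub-folders. It is usually simpler to stay in the base directory and write relative paths such as "Data\Growth.dta", as in the next section.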

When starting a project it is useful to create a new folder and some sub-folders to keep your work and any results/output organised. For example, for this intro. I have created a main "StataIntro" folder, then within it folders for Data and Figures:

[Figure: the StataIntro folder containing the Data and Figures folders]

4.   Loading and exploring data

Stata has its own format for datasets, indicated by the file extension .dta. You can also easily load Excel sheets (.xls, .xlsx) or CSV (.csv) data into Stata. In my Data folder I have a small dataset saved in all three formats*.

* Load Stata format
use "Data\Growth.dta", clear
browse

* or load from an Excel workbook
import excel using "Data\Growth.xlsx", clear firstrow		//firstrow tells Stata the top row has variable names
br 															//browse can be shortened to br

The browse command here tells Stata to open up the data browser. Once you run it, a window like this will pop up:

You can then look at the variables in the dataset to see what they actually contain, their format, units, which observations have missing information etc. This comes in handy if you are trying to figure out why you are getting weird results.
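CSV files are loaded with import delimited. The name of my CSV file isn't shown above, so rather than guess it, this sketch writes Stata's built-in auto dataset out to a temporary CSV file and reads it back in. The macro name csvcopy is just an illustration. Run the lines together from a .do file, because the temporary file name is forgotten once the run ends:

```stata
sysuse auto, clear                                      // example dataset shipped with Stata
tempfile csvcopy                                        // a temporary file name, removed when the do-file finishes
export delimited using "`csvcopy'", replace             // write the data out as comma-separated text
import delimited using "`csvcopy'", clear varnames(1)   // varnames(1): row 1 holds the variable names
describe
```

For a CSV in the Data folder, the import line would take the same form with the file's path in place of the temporary name.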

With data loaded, we can start to figure out what we are working with. There are a few commands that are great for this:

* List all of the variables in the dataset:
describe 


Contains data from C:\Users\markm\Documents\StataIntro\Data\Growth.dta
  obs:            65                          
 vars:             8                          2 Aug 2006 09:11
 size:         3,055                          
------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------
country_name    str19   %19s                  
growth          float   %9.0g                 
oil             float   %9.0g                 
rgdp60          float   %9.0g                 
tradeshare      float   %9.0g                 
yearsschool     float   %9.0g                 
rev_coups       float   %9.0g                 
assasinations   float   %9.0g                 
------------------------------------------------------------------------------------------------------------------


* or just a few:
desc country_name growth tradeshare oil


              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------
country_name    str19   %19s
growth          float   %9.0g
tradeshare      float   %9.0g
oil             float   %9.0g

The describe command gives us a brief description of the variables, showing us that country_name is a text - or string - variable, indicated by its type being str19. It also shows us that tradeshare and oil are numeric variables as their type is listed as float. None of the variables have a label, which can make it difficult to know what they actually represent, however here we have an accompanying guide that tells us what each variable is. We will come back to labels shortly. For a bit more detail we can use codebook:


codebook country_name tradeshare oil


------------------------------------------------------------------------------------------------------------------------------------------
country_name                                                                                                                   (unlabeled)
------------------------------------------------------------------------------------------------------------------------------------------

                  type:  string (str19)

         unique values:  65                       missing "":  0/65

              examples:  "Denmark"
                         "India"
                         "New Zealand"
                         "Spain"

               warning:  variable has embedded blanks

------------------------------------------------------------------------------------------------------------------------------------------
tradeshare                                                                                                                     (unlabeled)
------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [.14050198,1.9926157]        units:  1.000e-08
         unique values:  65                       missing .:  0/65

                  mean:   .564703
              std. dev:    .28927

           percentiles:        10%       25%       50%       75%       90%
                           .299406   .393251   .543337   .681555   .830238

------------------------------------------------------------------------------------------------------------------------------------------
oil                                                                                                                            (unlabeled)
------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,0]                        units:  1
         unique values:  1                        missing .:  0/65

            tabulation:  Freq.  Value
                            65  0

The codebook command is particularly useful for finding out more about numeric variables. The command above shows us that while tradeshare and oil are both numeric, tradeshare has 65 unique values and ranges from roughly 0.14 to about 2, whereas oil ranges from 0 to 0 and has 1 unique value - in other words it has the same value for every observation. It also shows that country_name has 65 unique values, 1 for each observation. We can use the list command to get a list of all of these:


list country_name

     +---------------------+
     |        country_name |
     |---------------------|
  1. |               India |
  2. |           Argentina |
  3. |               Japan |
  4. |              Brazil |
  5. |       United States |
     |---------------------|
  6. |          Bangladesh |
  7. |               Spain |
  8. |            Colombia |
  9. |                Peru |
 10. |               Haiti |
     |---------------------|
 11. |           Australia |
 12. |               Italy |
 13. |              Greece |
 14. |              France |
 15. |               Zaire |
     |---------------------|
 16. |             Uruguay |
 17. |              Mexico |
 18. |            Pakistan |
 19. |               Niger |
 20. |             Bolivia |
     |---------------------|
 21. |             Germany |
 22. |              Canada |
 23. |      United Kingdom |
 24. |         New Zealand |
 25. |         Philippines |
     |---------------------|
 26. |             Finland |
 27. |           Venezuela |
 28. |  Korea, Republic of |
 29. |           Guatemala |
 30. |            Honduras |
     |---------------------|
 31. |         El Salvador |
 32. |               Chile |
 33. |            Thailand |
 34. |              Sweden |
 35. |             Senegal |
     |---------------------|
 36. | Trinidad and Tobago |
 37. |             Ecuador |
 38. |             Denmark |
 39. |         Switzerland |
 40. |             Austria |
     |---------------------|
 41. |            Zimbabwe |
 42. |            Paraguay |
 43. |          Costa Rica |
 44. |            Portugal |
 45. |                Togo |
     |---------------------|
 46. |             Iceland |
 47. |              Israel |
 48. |        South Africa |
 49. |              Norway |
 50. |        Sierra Leone |
     |---------------------|
 51. |  Dominican Republic |
 52. |               Ghana |
 53. |           Sri Lanka |
 54. |       Taiwan, China |
 55. |              Panama |
     |---------------------|
 56. |    Papua New Guinea |
 57. |               Kenya |
 58. |             Ireland |
 59. |             Jamaica |
 60. |         Netherlands |
     |---------------------|
 61. |              Cyprus |
 62. |            Malaysia |
 63. |             Belgium |
 64. |           Mauritius |
 65. |               Malta |
     +---------------------+
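Listing every row is fine with 65 observations, but gets unwieldy with bigger datasets. The list command accepts in and if qualifiers to narrow things down. The sketch below uses Stata's built-in auto dataset so it runs without the Growth files. With the Growth data loaded, the equivalent first line would be something like list country_name tradeshare in 1/5:

```stata
sysuse auto, clear                     // example dataset shipped with Stata
list make price in 1/5                 // first five observations only
list make mpg if mpg > 30              // only observations that meet a condition
count if mpg > 30                      // how many observations that is
```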

The summarize command (shortened to sum below) gives basic summary statistics, and even more detail with the detail option:


sum tradeshare


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
  tradeshare |         65     .564703    .2892703    .140502   1.992616


sum tradeshare, detail


                         tradeshare
-------------------------------------------------------------
      Percentiles      Smallest
 1%      .140502        .140502
 5%     .1604051        .156623
10%     .2994059       .1577032       Obs                  65
25%     .3932509       .1604051       Sum of Wgt.          65

50%     .5433371                      Mean            .564703
                        Largest       Std. Dev.      .2892703
75%     .6815552       1.105364
90%     .8302383       1.115917       Variance       .0836773
95%     1.105364       1.127937       Skewness       1.996913
99%     1.992616       1.992616       Kurtosis        10.6097

You can also print a list of the statistics the summarize command calculates and work with them as scalars:


sum tradeshare

sum tradeshare, detail
 
-- output omitted --

return list 


scalars:
                  r(N) =  65
              r(sum_w) =  65
               r(mean) =  .5647030305403929
                r(Var) =  .0836773165437838
                 r(sd) =  .2892703174260778
           r(skewness) =  1.996913186886965
           r(kurtosis) =  10.60969926467839
                r(sum) =  36.70569698512554
                r(min) =  .1405019760131836
                r(max) =  1.992615699768066
                 r(p1) =  .1405019760131836
                 r(p5) =  .1604050844907761
                r(p10) =  .2994059324264526
                r(p25) =  .3932509124279022
                r(p50) =  .5433371067047119
                r(p75) =  .6815551519393921
                r(p90) =  .8302383422851563
                r(p95) =  1.105364203453064
                r(p99) =  1.992615699768066

 scalar a = r(N)
 scalar b = r(mean)
 scalar c = r(sum)

 di a*b
 36.70569698512554

 di c
 36.70569698512554
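A common use of these stored results is to standardise a variable. The r() values are overwritten by the next command that stores results, so use them straight away or copy them into scalars as above. This is a minimal sketch using Stata's built-in auto dataset. The variable name mpg_z is just an illustration:

```stata
sysuse auto, clear                                  // example dataset shipped with Stata
quietly summarize mpg                               // quietly suppresses the printed table
generate double mpg_z = (mpg - r(mean)) / r(sd)     // subtract the mean, divide by the std. dev.
summarize mpg_z                                     // mean should be (almost exactly) 0, std. dev. 1
```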

If you're working with binary, discrete, or categorical variables, then tabulate shows you how observations are distributed across their values. We only have one such variable at the minute, oil, and remember it is all zeros:

tab oil 


        oil |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         65      100.00      100.00
------------+-----------------------------------
      Total |         65      100.00
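Since oil takes a single value, that table is not very informative. To see tabulate on variables that do vary, here is a sketch using Stata's built-in auto dataset, where foreign is a binary variable and rep78 is a five-point repair record with some missing values:

```stata
sysuse auto, clear              // example dataset shipped with Stata
tabulate foreign                // one-way table: domestic vs. foreign cars
tabulate rep78, missing         // missing: also count observations with no value recorded
tabulate rep78 foreign          // two-way table: rows are rep78, columns are foreign
```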

Once we have a good idea of what the variables represent and what type of data they contain, we can start to manipulate them and create new variables we might want to use in any analysis.
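On the missing labels noted in the describe output: they can be added with label variable. So that the sketch runs without the Growth files, it creates two empty variables with the same names. The label text is borrowed from the axis titles used in the graph at the end of this intro:

```stata
clear
set obs 2                                   // two empty observations, just to have something to label
generate growth = .
generate tradeshare = .
label variable growth "GDP growth (%)"
label variable tradeshare "Share of GDP in international trade"
describe                                    // the labels now appear in the last column
```

With the Growth data loaded, only the two label variable lines are needed.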

5.   Manipulating and creating variables

Often datasets don't contain the exact variables required for the analysis you want to carry out. Or they do have the variable, but it is not in the right format. With Stata it is very easy to "clean" and create variables. We can start by creating a new variable that is assasinations per thousand as opposed to million. In general the syntax for this type of operation is gen newvar = somefunction(oldvar):

sum assasinations

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
assasinati~s |         65    .2775641    .4915284          0   2.466667

gen assasinations_pth = assasinations*1000
sum assasinations assasinations_pth


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
assasinati~s |         65    .2775641    .4915284          0   2.466667
assasinati~h |         65    277.5641    491.5284          0   2466.667
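The right-hand side of gen can be any expression, not just a rescaling. As a small sketch on a made-up five-row dataset (the variable names are purely illustrative):

```stata
clear
set obs 5
generate x = _n                 // _n is the observation number: 1, 2, 3, 4, 5
generate x_sq = x^2             // square
generate ln_x = ln(x)           // natural log
generate x_half = x/2           // division
generate x_sum = x + x_sq       // combine two existing variables
list
```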

The somefunction(oldvar) can be a range of things using the operations shown in Section 2 - addition (+), subtraction (-), division (/), squares (^2), multiplication (*). To modify an existing variable, the replace command can be used. For example, say you want to treat every country with more than 1,000 assasinations per thousand the same:

replace assasinations_pth = 1001 if assasinations_pth > 1000
sum assasinations_pth


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
assasinati~h |         65    219.1949    305.4257          0       1001

This also introduces logical statements - the if above tells Stata only to make the change if assasinations_pth is above 1,000. You can string more than one of these together with "&" if, for example, you only want to make a replacement for one country (note the double "==", which tests for equality, rather than the single "=" used for assignment):

replace assasinations_pth = 1001 if assasinations_pth > 1000 & country_name == "Argentina"
list country_name assasinations_pth

              country_name   assasi~h  
  1.                 India   866.6667  
  2.             Argentina       1001  
  3.                 Japan        200  
  4.                Brazil        100  
  5.         United States   433.3333  
  6.            Bangladesh        175  
  7.                 Spain   1433.333  
  8.              Colombia   766.6666  
  9.                  Peru   566.6667  
 10.                 Haiti        200  
 11.             Australia   66.66667  
 12.                 Italy       1200  
 13.                Greece   166.6667  
 14.                France        300  
 15.                 Zaire   55.55556  
 16.               Uruguay   166.6667  
 17.                Mexico   166.6667  
 18.              Pakistan   266.6667  
 19.                 Niger          0  
 20.               Bolivia        200  
 21.               Germany   233.3333  
 22.                Canada   66.66667  
 23.        United Kingdom   333.3333  
 24.           New Zealand          0  
 25.           Philippines   1033.333  
 26.               Finland          0  
 27.             Venezuela        100  
 28.    Korea, Republic of        100  
 29.             Guatemala   2466.667  
 30.              Honduras   66.66667  
 31.           El Salvador   1733.333  
 32.                 Chile   466.6667  
 33.              Thailand   33.33334  
 34.                Sweden          0  
 35.               Senegal   66.66667  
 36.   Trinidad and Tobago          0  
 37.               Ecuador          0  
 38.               Denmark          0  
 39.           Switzerland          0  
 40.               Austria          0  
 41.              Zimbabwe   233.3333  
 42.              Paraguay   33.33334  
 43.            Costa Rica          0  
 44.              Portugal          0  
 45.                  Togo   33.33334  
 46.               Iceland          0  
 47.                Israel        200  
 48.          South Africa   366.6667  
 49.                Norway          0  
 50.          Sierra Leone   33.33334  
 51.    Dominican Republic        200  
 52.                 Ghana        100  
 53.             Sri Lanka        200  
 54.         Taiwan, China   66.66667  
 55.                Panama   66.66667  
 56.      Papua New Guinea          0  
 57.                 Kenya   144.4445  
 58.               Ireland   66.66667  
 59.               Jamaica   133.3333  
 60.           Netherlands          0  
 61.                Cyprus   166.6667  
 62.              Malaysia   33.33334  
 63.               Belgium          0  
 64.             Mauritius          0  
 65.                 Malta          0  
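To practise with conditions without touching the real data, here is a sketch on a five-country extract. The values are the ones from the list above divided by 1,000. The variable name high is just an illustration. One thing to watch for: Stata treats a missing value as larger than any number, so a condition like "> 1000" is also true for missing values unless you rule them out with !missing():

```stata
clear
input str19 country_name float assasinations
"India"   .8666667
"Japan"   .2
"Spain"   1.433333
"Italy"   1.2
"Niger"   0
end

generate assasinations_pth = assasinations*1000

* A 0/1 indicator: 1 if above 1,000 and not missing, 0 otherwise
generate byte high = assasinations_pth > 1000 & !missing(assasinations_pth)

* & means "and": both conditions must hold
replace assasinations_pth = 1001 if assasinations_pth > 1000 & country_name == "Spain"

* | means "or": either condition is enough
count if country_name == "India" | country_name == "Japan"
list
```

Only Spain is changed by the replace. Italy is also above 1,000 but fails the second condition.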

6.   Some initial descriptive analysis

Once you've loaded and explored the properties of the data, you can start to do some analysis. For descriptives, Stata has great graphing commands. For example, the graph and graph twoway commands offer lots of common graphs: scatter plots, bar charts, line graphs, box plots etc. Often we want to look at the relationship between two variables. For example, in the growth data, looking at the relationship between trade and GDP growth might be interesting:

corr growth tradeshare 
(obs=65)

             |   growth tradeshare
-------------+------------------
      growth |   1.0000
  tradeshare |   0.3517   1.0000

There seems to be a reasonable correlation there, but the descriptive staple for this type of analysis is a scatter plot:

[graph] twoway scatter growth tradeshare

[Figure: scatter plot of growth against tradeshare]

graph is in square brackets because you don't need it for the command to work (leave the brackets themselves out when you type the command). From the graph, it seems as though one outlying observation for both tradeshare and growth might be driving the relationship. Let's see what it looks like without this country:

[graph] twoway scatter growth tradeshare if tradeshare <=1.5

[Figure: scatter plot of growth against tradeshare, excluding the outlier]

Now the relationship isn't nearly as strong. This is reflected in the correlation between the two without this country (note the number of observations):

corr growth tradeshare if tradeshare <= 1.5
(obs=64)

             |   growth tradeshare
-------------+------------------
      growth |   1.0000
  tradeshare |   0.2113   1.0000

If you want to actually use the graph from Stata, you might want to change how it looks - default graphs in Stata are a bit ugly. Stata gives you complete flexibility in changing the aspects of the graph. For example, the background colour can be made white, gridlines changed, axis labels altered, and the dots changed:

twoway scatter growth tradeshare if tradeshare <= 1.5, ///
	graphregion(color(white) lcolor(white)) ///
	plotregion(lcolor(black)) ///
	mcolor(red%20) msymbol(square) ///
	ylabel(-2(2)8, angle(0) nogrid) ytitle("GDP growth (%)") ///
	xtitle("Share of GDP in international trade") ///
	title("GDP growth and international trade") ///
	note(" " "Note: one outlying country with high trade-share and GDP growth omitted", size(small))

graph export "Figures\Trade-Growth.png", replace


[Figure: formatted scatter plot of growth against tradeshare]

Copy and paste this code into a .do file (with the data loaded) and you can comment out each line in turn to figure out what it does. I also have a short note on graphing in Stata for more on basic graphs.

Notes:
*The data are from James H. Stock and Mark W. Watson's "Introduction to Econometrics", Third Edition.