Introduction
One of the great things about statistics programs is that, once you know how, you can generate data rather than typing it in by hand. Stata offers many ways to generate data, and hundreds of pages of Stata documentation cover this topic. Fortunately, most students working on academic papers, research essays, or theses have only a few basic data generation needs in Stata, one of which we will cover in this blog post.
Subjects: An Example
Imagine that you’ve copied in data from 200 subjects without actually numbering them. For instance, you might have height and weight data for 200 subjects, but no subject numbers. We used the following code to create this data, which also shows you how to use Stata to generate normally distributed numbers with specified means and standard deviations:
set obs 200
drawnorm height, mean(70) sd(3)
drawnorm weight, mean(170) sd(15)
label variable height "Height"
label variable weight "Weight"
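Because the data above are random draws, the values change on every run unless a seed is set. The following is a minimal sketch of one alternative way to simulate the same kind of data, using Stata's rnormal() function with a seed so the draws can be reproduced. The seed value 12345 is an arbitrary assumption, not something from the original example, and the resulting numbers will not match those produced by drawnorm.

```stata
* Sketch only: an alternative to drawnorm, not the code used for this post
clear
* Arbitrary seed (an assumption) so the same draws come back on every run
set seed 12345
set obs 200
* rnormal(m, s) draws from a normal distribution with mean m and SD s
gen height = rnormal(70, 3)
gen weight = rnormal(170, 15)
label variable height "Height"
label variable weight "Weight"
* Sample means and SDs should land near the values requested above
summarize height weight
```

Running summarize afterward is a quick way to confirm that the simulated variables look roughly as intended before moving on.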
Understandably, you don’t want to create a variable called subject and type in 1 to 200 by hand. Fortunately, Stata has powerful data generation capabilities that can help. Simply type in the following code:
gen subj = _n
label variable subj "Subject"
order subj
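Here, _n refers to the number of the current observation (row), so the first row gets 1, the second gets 2, and so on up to the last row. The total number of observations, which you already know to be 200, is stored separately in _N. The code above creates a new variable named subj, labels that variable “Subject,” and moves the variable to the first column in your dataset. Your dataset now has three columns (subj, height, and weight), with subj running from 1 to 200.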
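The difference between _n (the current observation number) and _N (the total number of observations) is easy to check on a tiny dataset. The sketch below assumes a throwaway dataset of 5 observations; the variable names obs_no and total are hypothetical and used only for illustration.

```stata
* Sketch only: a 5-row toy dataset to compare _n and _N
clear
set obs 5
* _n differs from row to row: 1, 2, 3, 4, 5
gen obs_no = _n
* _N is the same in every row: the total number of observations
gen total = _N
list
* These checks stop with an error if the claims above are false
assert obs_no[1] == 1
assert obs_no[_N] == 5
assert total == 5
```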
Let’s say that, for whatever reason, your subject numbering doesn’t run sequentially from 1 to N, with N being the total number of subjects. Let’s say your subject numbering starts at 10 and, with 200 subjects, ends at 209. If that’s the case, you would code the subject variable in this way:
drop subj
gen subj = _n+9
label variable subj "Subject"
order subj
As you now know, _n starts the numbering at 1 and stops at _N, the number of rows in your dataset. Therefore, if you want to start at 10, you would add 9 to _n, which, in Stata, you would express as _n+9.
What if, for an equally unusual reason, your numbering goes up in intervals of 10? How could you number subjects 10, 20, 30, etc.? Try this code:
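When you shift the numbering like this, it is worth confirming the first and last subject numbers. Since _n runs from 1 to _N, the expression _n+9 runs from 10 to _N+9. Here is a minimal sketch of that check, assuming a fresh dataset of 200 observations and run from a do-file:

```stata
* Sketch only: verify the range produced by an offset of 9
clear
set obs 200
gen subj = _n + 9
* summarize stores the smallest and largest values in r(min) and r(max)
summarize subj
assert r(min) == 10
assert r(max) == _N + 9
```

If either assert fails, Stata stops with an error, which makes an off-by-one mistake in the offset hard to miss.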
drop subj
gen subj_a = _n
gen subj = subj_a*10
drop subj_a
label variable subj "Subject"
order subj
You can tell Stata to multiply, divide, add, subtract, and carry out combinations of operations. In the code above, gen subj = subj_a*10 created a new variable, subj, that is 10 times _n, so your 200 subjects are numbered 10, 20, 30, and so on up to 2000.
In future blog posts, we will explore how Stata can be used for more varied and complex forms of data creation.
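The helper variable subj_a is not strictly required, because _n can be used directly in an expression: gen subj = _n*10 gives the same result in one step. More generally, any evenly spaced numbering can be written as a starting value plus (_n - 1) times a step. The sketch below assumes two hypothetical local macros, start and step, which are not part of the original example. Because local macros only last for a single run, the lines should be executed together from a do-file.

```stata
* Sketch only: start and step are hypothetical names chosen for illustration
clear
set obs 200
local start 10
local step 10
* First row gets start; each later row adds one more step
gen subj = `start' + (_n - 1) * `step'
label variable subj "Subject"
order subj
* Check the first and last subject numbers against the formula
assert subj[1] == `start'
assert subj[_N] == `start' + (_N - 1) * `step'
```

Changing start to 10 and step to 1 reproduces the numbering from the earlier _n+9 example, so one expression covers both cases.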
BridgeText can help you with all of your statistical analysis needs.