Data Management in Stata

                             Ista Zahn

                                 IQSS


                        October 12th 2012



                      The Institute
                      for Quantitative Social Science
                      at Harvard University


Ista Zahn (IQSS)        Data Management in Stata   October 12th 2012   1 / 51
Outline

1   Introduction

2   Generating and replacing variables

3   By processing

4   Missing values

5   Variable types

6   Merging, appending, and joining

7   Creating summarized data sets

8   Wrap-up


Introduction

Materials and Setup




    USERNAME: dataclass
    PASSWORD: dataclass
    Find class materials at:
Scratch > DataClass > StataDatMan
    Copy this folder to your desktop!




Workshop Description




   This is an introduction to data management in Stata
   Assumes basic knowledge of Stata
   Not appropriate for people who are already very familiar with Stata
   If you are catching on before the rest of the class, experiment with
   the command features described in the help files




Organization




   Please feel free to ask questions at any point if they are relevant to
   the current topic (or if you are lost!)
   There will be a Q&A after class for more specific, personalized
   questions
   Collaboration with your neighbors is encouraged
   If you are using a laptop, you will need to adjust paths accordingly




Opening Files in Stata


   Look at bottom left hand corner of Stata screen
         This is the directory Stata is currently reading from
   Files are located in the StataDatMan folder on the Desktop
   Start by telling Stata where to look for these

    // change directory
    cd "C:/Users/dataclass/Desktop/StataDatMan"

    // Use dir to see what is in the directory:
    dir

    // use the gss data set
    use gss.dta




Generating and replacing variables

Logic Statements Useful for Data Manipulation



        == equal to (tests equality)
          = assigns a value
         != not equal to
          > greater than
         >= greater than or equal to
          < less than
         <= less than or equal to
          & and
          | or
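
As a minimal sketch of how these operators combine in an if qualifier, the following uses a small made-up data set. The variable names age and income and all values are hypothetical and do not come from gss.dta.

```stata
// toy data, for illustration only
clear
input age income
25 10
40 25
62 18
35 .
end

// == tests equality; = assigns the result to a new variable
gen byte is40 = (age == 40)

// & requires both conditions, | requires at least one
count if age > 30 & income >= 18 & !missing(income)
count if age < 30 | age > 60
```

The parentheses around age == 40 are optional; they only make it easier to see which = assigns and which == compares.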




Basic Data Manipulation Commands




  Basic commands you’ll use for generating new variables or recoding
  existing variables:
         generate (gen)
         egen
         replace
         recode
  There are many different ways of accomplishing the same thing in Stata
         Find what is comfortable (and easy) for you




Generate and Replace



   The gen command is often used together with replace and logic
   statements, as in this example:

   // create "hapnew" variable
   gen hapnew = . // set to missing
   // set to 1 if happy equals 1
   replace hapnew=1 if happy==1
   // set to 1 if happy and hapmar are both greater than 3
   replace hapnew=1 if happy>3 & hapmar>3
   // set to 3 if happy or hapmar is greater than 4
   // (missing values count as greater than any number, so they match too)
   replace hapnew=3 if happy>4 | hapmar>4
   tab hapnew // tabulate the new variable
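
The same generate-then-replace pattern can be tried without the class data. This is only a sketch: the toy values and the assumption that happy is coded 1 to 3 are made up for illustration, and veryhappy and veryhappy2 are hypothetical names.

```stata
// toy data, for illustration only (happy assumed to be coded 1-3)
clear
input happy
1
2
3
.
end

gen veryhappy = .                                     // start as missing
replace veryhappy = 1 if happy == 1                   // very happy
replace veryhappy = 0 if happy > 1 & !missing(happy)  // everyone else who answered

// the same result in one line: a logical expression is 1 if true, 0 if false
gen veryhappy2 = (happy == 1) if !missing(happy)
tab veryhappy veryhappy2, missing
```

Both versions leave the respondent with a missing answer as missing rather than silently coding them 0.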




Recode



  The recode command is basically generate and replace combined
  You can recode an existing variable OR use recode to create a new
  variable
   // recode the wrkstat variable
   recode wrkstat (1=8) (2=7) (3=6) (4=5) (5=4) (6=3) (7=2) (8=1)
   // recode wrkstat into a new variable named wrkstat2
   recode wrkstat (1=8), gen(wrkstat2)
   // tabulate workstat
   tab wrkstat




Basic Rules for Recode




         Rule                Example        Meaning
         # = #               3 = 1          3 recoded to 1
         # # = #             2 . = 9        2 and . recoded to 9
         #/# = #             1/5 = 4        1 through 5 recoded to 4
         nonmissing = #      nonmiss = 8    all other nonmissing recoded to 8
         missing = #         miss = 9       all other missing recoded to 9
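
A short sketch of several of these rules in one command, using a made-up variable (score and score_grp are hypothetical names, not part of the class data):

```stata
// toy data, for illustration only
clear
input score
1
3
5
7
.
end

// 1 through 3 become 1, 4 through 7 become 2, missing becomes 9
recode score (1/3 = 1) (4/7 = 2) (missing = 9), gen(score_grp)
list score score_grp
```

Because gen() is specified, score itself is left unchanged and the recoded values go into score_grp.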




egen

   egen stands for “extensions to generate”
   Contains a variety of more sophisticated functions
   Type “help egen” in Stata to get a complete list of functions
   Let’s create a new variable that counts the number of “yes” responses
   on computer, email and internet use:

   // count number of yes on comp email and interwebs
   egen compuser= anycount(usecomp usemail usenet), values(1)
   tab compuser
   // assess how much missing data each participant has:
   egen countmiss = rowmiss(age-wifeft)
   codebook countmiss
   // compare values on multiple variables
   egen ftdiff=diff(wkft//)
   codebook ftdiff
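
To see what these egen functions return, here is a sketch on four made-up respondents. The variable names are borrowed from the example above, but the values are invented, and nyes, nmiss, and differ are hypothetical names.

```stata
// toy data, for illustration only
clear
input usecomp usemail usenet
1 1 1
1 2 .
2 2 2
. . .
end

egen nyes   = anycount(usecomp usemail usenet), values(1)  // how many equal 1
egen nmiss  = rowmiss(usecomp usemail usenet)              // how many are missing
egen differ = diff(usecomp usemail usenet)                 // 1 if the values are not all equal
list
```

All three functions work across variables within each row, which is what distinguishes them from egen functions such as mean() that work down a column.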


By processing

The “By” Command


  Sometimes, you’d like to create variables based on different categories
  of a single variable
        For example, say you want to look at happiness based on whether an
        individual is male or female
  The “by” prefix does just this:

   // tabulate happy separately for male and female
   bysort sex: tab happy
   // generate summary statistics using bysort
   bysort state: egen stateincome = mean(income)
   bysort degree: egen degreeincome = mean(income)
   bysort marital: egen marincomesd = sd(income)
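
A self-contained sketch of by processing on made-up data (all values are invented; meaninc, rank, and groupsize are hypothetical names). It also shows _n and _N, which inside a by group refer to the row number within the group and the group size.

```stata
// toy data, for illustration only
clear
input str1 sex income
"f" 10
"f" 20
"m" 30
"m" 50
"m" 40
end

bysort sex: egen meaninc = mean(income)   // group mean attached to every row
bysort sex (income): gen rank = _n        // position within group, sorted by income
bysort sex: gen groupsize = _N            // number of rows in the group
list, sepby(sex)
```

Unlike collapse, this keeps one row per person and simply adds the group-level values as new columns.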




The “By” Command

  Some commands won’t work with by prefix, but have by options:

   // generate separate histograms for female and male
   hist happy, by(sex)




Missing values

Missing Values




   Always consider how missing values are coded when recoding
   variables
   Stata’s symbol for a missing value is “.”
   Stata treats “.” as larger than any number
         What are the implications of this?




Missing Values

   If we want to generate a new variable that identifies highly educated
   women, we might use the command:

   // generate and replace without considering missing values
   gen hi_ed=0
   replace hi_ed=1 if wifeduc>15
   // What happens to our missing values?
   tab hi_ed, mi nola
   // Instead, we might try:
   drop hi_ed
   // gen hi_ed, but don’t set a value if wifeduc is missing
   gen hi_ed = 0 if wifeduc != .
   // only replace non-missing
   replace hi_ed=1 if wifeduc >15 & wifeduc !=.
   tab hi_ed, mi //check to see that missingness is preserved

   Note that you need the “mi” option to tab to view your missing data
   values
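
The effect of missing values on a comparison can be checked directly on a few made-up rows. The values below are invented; the missing() function is one way to write the same test as wifeduc != .

```stata
// toy data, for illustration only
clear
input wifeduc
12
16
20
.
end

count if wifeduc > 15                        // counts 3: the missing row is included
count if wifeduc > 15 & !missing(wifeduc)    // counts 2

// a logical expression with an if qualifier keeps missing as missing
gen hi_ed = (wifeduc > 15) if !missing(wifeduc)
tab hi_ed, missing
```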
Missing Values


   What if you used a numeric value originally to code missing data (e.g.,
   “999”)?
   The mvdecode command will convert all these values to missing

   mvdecode _all, mv(999)

   “_all” is a variable list that tells Stata to do this to all variables
   Use this command carefully!
         If you have any variables where “999” is a legitimate value, Stata is
         going to recode it to missing
         As an alternative, you could list variable names separately rather
         than using “_all”
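
A sketch of the safer alternative, naming only the variable that uses 999 as a missing code. The variables rating and sales and their values are hypothetical.

```stata
// toy data, for illustration only: 999 means "missing" in rating,
// but is a legitimate value in sales
clear
input rating sales
4   999
999 1200
2   350
end

// only rating is converted; sales keeps its real 999
mvdecode rating, mv(999)
list
```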




Variable types

Variable Types




   Stata uses two main types of variables: String and Numeric
   String variables are typically used for text variables
   To be able to perform any mathematical operations, your variables
   need to be in a numeric format




     Variable Types



        Stata’s numeric variable types:

      Storage                                              Closest to 0
      type                 Minimum              Maximum    without being 0  Bytes
      -----------------------------------------------------------------------------
      byte                    -127                  100    +/-1               1
      int                  -32,767               32,740    +/-1               2
      long          -2,147,483,647        2,147,483,620    +/-1               4
      float   -1.70141173319*10^38  1.70141173319*10^38    +/-10^-38          4
      double  -8.9884656743*10^307  8.9884656743*10^307    +/-10^-323         8
      -----------------------------------------------------------------------------
      Precision for float is 3.795x10^-8.
      Precision for double is 1.414x10^-16.
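
One practical consequence of these precisions is that a float cannot store a value such as 0.1 exactly. The sketch below uses made-up variables (xf, xd, and bigid are hypothetical names) to show the effect and two common ways around it.

```stata
clear
set obs 1
gen float  xf = 0.1
gen double xd = 0.1

count if xf == 0.1          // 0: the float holds only an approximation of 0.1
count if xf == float(0.1)   // 1: compare at float precision
count if xd == 0.1          // 1: the double matches the precision of the literal

gen long bigid = 123456789  // a float could not hold this many digits exactly
compress                    // shrink each variable to the smallest type that fits
describe
```

Since gen creates floats by default, long ID numbers are usually better stored as long or double.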




Variable Types




   How can I deal with those annoying string variables?
   Sometimes you need to convert to/from string variables

   // not run; generate "newvar" equal to numeric version of var1
   destring var1, gen(newvar)
   // not run; generate "newvar" equal to string version of var1
   tostring var1, gen(newvar)
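
A runnable sketch of a round trip between string and numeric, on made-up data (agestr, age, and agestr2 are hypothetical names). It assumes one value is not a number at all, which is a common reason a variable was read as a string in the first place.

```stata
// toy data, for illustration only
clear
input str6 agestr
"34"
"51"
"n/a"
end

// "n/a" is not a number, so destring needs force to turn it into missing
destring agestr, gen(age) force
// back to string; the new variable must not already exist
tostring age, gen(agestr2)
describe
list
```

Without the force option, destring refuses to convert a variable that contains nonnumeric characters, which is a useful safeguard.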




Date and Time Variable Types




   Stata offers several options for date and time variables
   Generally, Stata will read date/time variables as strings
   You’ll need to convert string variables to numeric date/time values in
   order to perform any mathematical operations
   Once data is in date/time form, Stata uses display formats to identify
   these variables
         %tc, %td, %tw, etc.




     Variable Types: Date and Time



          Format  String-to-numeric conversion function
          -------+-----------------------------------------
          %tc     clock(string, mask)
          %tC     Clock(string, mask)
          %td     date(string, mask)
          %tw     weekly(string, mask)
          %tm     monthly(string, mask)
          %tq     quarterly(string, mask)
          %th     halfyearly(string, mask)
          %ty     yearly(string, mask)
          %tg     no function necessary; read as numeric
          -------------------------------------------------




Variable Types: Date and Time



    Format  Meaning     Value = -1      Value = 0       Value = 1
    %tc     clock       31dec1959       01jan1960       01jan1960
                        23:59:59.999    00:00:00.000    00:00:00.001
    %td     days        31dec1959       01jan1960       02jan1960
    %tw     weeks       1959w52         1960w1          1960w2
    %tm     months      1959m12         1960m1          1960m2
    %tq     quarters    1959q4          1960q1          1960q2
    %th     half-years  1959h2          1960h1          1960h2
    %tg     generic     -1              0               1




Variable Types: Date and Time


   To convert a string variable into date/time format, first select the
   date/time format you’ll be using (e.g., %tc, %td, %tw, etc.)
   Let’s say we create a string variable, today’s date (today) that we
   want to format
   // create string variable and convert to date
   gen today = "Feb 18, 2011"
   gen date1 = date(today, "MDY")
   tab date1
   // use the format command to change how the date is displayed
   // format so humans can read the date (%td is the daily date format)
   format date1 %td
   tab date1
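
Once a date is numeric, ordinary arithmetic and the date functions work on it. The sketch below extends the example with a second made-up date; gap, yr, and mo are hypothetical variable names.

```stata
// toy data, for illustration only
clear
input str12 today
"Feb 18, 2011"
"Mar 01, 2011"
end

gen date1 = date(today, "MDY")   // days since 01jan1960
format date1 %td                 // display as 18feb2011
gen gap = date1 - date1[1]       // dates are numbers, so subtraction works
gen yr  = year(date1)
gen mo  = month(date1)
list
```

The format command changes only how the value is displayed; the stored number of days is the same before and after.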




Variable Types



   What if you have a variable “time” formatted as
   DDMMYYYYhhmmss?
   // Not run: conceptual example only
   // in the mask, hours, minutes, and seconds are lowercase
   generate double time2 = clock(time, "DMYhms")
   tab time2
   format time2 %tc
   tab time2

   The “double” storage type is necessary for all clock values
   Clock values count milliseconds since 01jan1960, which are too large
   for the default float type to store exactly



Exercise 1: Generate, Replace, Recode & Egen
Open the datafile, gss.dta.
  1   Generate a new variable that represents the squared value of age.
  2   Recode values “99” and “98” on the variable, “hrs1” as “missing.”
  3   Generate a new variable equal to “1” if income is greater than “19”.
  4   Recode the marital variable into a “string” variable and then back into
      a numeric variable.
  5   Create a new variable that counts the number of times a respondent
      answered “don’t know” in regard to the following variables: life,
      richwork, hapmar.
  6   Create a new variable that counts the number of missing responses for
      each respondent.
  7   Create a new variable that associates each individual with the average
      number of hours worked among individuals with matching educational
      degrees.

Merging, appending, and joining

Merging Datasets




   Merge in Stata is for adding new variables from a second dataset to
   the dataset you’re currently working with
         Current active dataset = master dataset
         Dataset you’d like to merge with master = using dataset
   If you want to add OBSERVATIONS, you’d use “append” (we’ll go
   over that next)




Merging Datasets




   Several different ways that you might be interested in merging data
         Two datasets with same participant pool, one row per participant (1:1)
         A dataset with one participant per row with a dataset with multiple
         rows per participant (1:many or many:1)




Merging Datasets




   Stata will create a new variable (“_merge”) that describes the source
   of the data
         Use the “nogenerate” option if you don’t want _merge created
         Use the “generate(varname)” option to give _merge your own variable name
   Need to add IDs to your dataset?

   // create a variable "id" equal to the row number
   generate id = _n
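
Before merging on an ID it can help to confirm that the ID really is unique. A minimal sketch on an empty five-row data set:

```stata
clear
set obs 5
generate id = _n
isid id                // stops with an error if id does not uniquely identify rows
duplicates report id   // summarizes how many copies of each id exist
```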




Merging Datasets




   Before you begin:
         Identify the “ID” that you will use to merge your two datasets
         Determine which variables you’d like to merge
         In Stata >= 11, data does NOT have to be sorted
         Variable types must match across datasets
         Can use “force” option to get around this, but not recommended




Merging Datasets



    Let’s say I want to perform a 1:1 merge using the dataset “data2” and
    the ID, “participant”
     merge 1:1 participant using data2.dta
    Now, let’s say that I have one dataset with individual students
    (master) and another dataset with information about the students’
    schools called “school”
     // Not run: conceptual example only. Merge school and student data
     merge m:1 schoolID using school.dta
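
A self-contained sketch of the student and school merge, using made-up data and a temporary file in place of school.dta. The variables studentID and enrollment and all values are hypothetical; because it relies on tempfile, it is meant to be run from a do-file.

```stata
// school-level data (one row per school), saved to a temporary file
clear
input schoolID enrollment
1 500
2 300
end
tempfile school
save `school'

// student-level data (many rows per school) becomes the master
clear
input studentID schoolID
101 1
102 1
103 2
104 3
end

merge m:1 schoolID using `school'
tab _merge
list
```

Student 104 attends a school that is not in the school file, so that row is kept with _merge equal to 1 (master only) and a missing enrollment.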




Merging Datasets




   What if my school dataset was the master and my student dataset
   was the merging dataset?

   // Not run: conceptual example only.
   merge 1:m schoolID using student.dta

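   It is also possible to do a many:many merge (m:m)
         Rows are paired within each ID in the order they appear, so the result
         depends on how both datasets are sorted
         This is rarely what you want; joinby is usually a better choice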




Merging Datasets



   Update and replace options:
         In standard merge, the master dataset is the authority and WON’T
         CHANGE
         What if your master dataset has missing data and some of those values
         are not missing in your using dataset?
         Specify “update”- it will fill in missing without changing nonmissing
   What if you want data from your using dataset to overwrite that in
   your master?
         Specify “replace update”- it will replace master data with using data
         UNLESS the value is missing in the using dataset
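
A sketch of the update option on made-up data (id and income values are invented, and the temporary file stands in for a using dataset). It is meant to be run from a do-file because it uses tempfile.

```stata
// using data, saved to a temporary file
clear
input id income
1 100
2 250
3 300
end
tempfile newer
save `newer'

// master data: id 1 has a missing income, id 2 has a conflicting value
clear
input id income
1 .
2 200
3 300
end

merge 1:1 id using `newer', update
list
```

With update alone, id 1 is filled in from the using data and id 2 keeps the master value; _merge records these as 4 (missing updated) and 5 (nonmissing conflict). Adding replace would also overwrite the value for id 2.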




Appending Datasets




   Sometimes, you’ll have observations in two different datasets, or you’d
   like to add observations to an existing dataset
   Append will simply add observations to the end of the observations in
   the master
   // Not run: conceptual example. add rows of data from dataset2
   append using dataset2




Appending Datasets



   Some options with Append:
         generate(newvar) will create variable that identifies source of
         observation

   // Not run: conceptual example.
   append using dataset1, generate(observesource)

   “force” will allow for data type mismatches (again, this is not
   recommended)
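
A runnable sketch of append with generate(), using two made-up batches of observations (id, score, and source are hypothetical names). Run it from a do-file, since it uses tempfile.

```stata
// second batch of observations, saved to a temporary file
clear
input id score
3 70
4 80
end
tempfile wave2
save `wave2'

// first batch becomes the master
clear
input id score
1 90
2 60
end

append using `wave2', generate(source)   // source = 0 for master rows, 1 for appended rows
list
```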




Joinby



   Merge will add new observations from using that do not appear in
   master
   Sometimes, you need to add variables from using but do not want
   observations that appear only in using added to your master

   // Not run: conceptual example. Similar to merge but drops non-matches
   joinby participant using dataset1

   Any observations in using that are NOT in master will be omitted
   By default, observations in master that are NOT in using are omitted too;
   add the unmatched(master) option to keep them
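
The difference between the default behavior of joinby and the unmatched(master) option can be sketched on made-up data. The variables visit and age and all values are hypothetical; run it from a do-file because it uses tempfile and preserve.

```stata
// using data: participant 3 has no row in the master
clear
input participant visit
1 1
1 2
3 1
end
tempfile visits
save `visits'

// master data: participant 2 has no row in the using data
clear
input participant age
1 30
2 40
end

// default: only participants present in both data sets are kept
preserve
joinby participant using `visits'
list
restore

// unmatched(master) keeps every master participant, matched or not
joinby participant using `visits', unmatched(master)
list
```

In the first result only participant 1 remains (once per visit). In the second, participant 2 is also kept, with visit set to missing, and participant 3 is still dropped.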




Creating summarized data sets

Collapse




   Collapse will take master data and create a new dataset of summary
   statistics
   Useful in hierarchical linear modeling if you’d like to create aggregate,
   summary statistics
   Can generate group summary data for many descriptive stats
         Mean, median, sd, sum, min, max, percentiles, standard errors
   Can also attach weights




Collapse




   Before you collapse
         Save your master dataset and then save it again under a new name
               This will prevent collapse from writing over your original data
         Consider issues of missing data. Do you want Stata to use all possible
         observations? If not:
               cw (casewise) option will make casewise deletions




Collapse



   Let’s say you have a dataset with patient information from multiple
   hospitals
   You want to generate mean levels of patient satisfaction for EACH
   hospital

   // Not run: conceptual example. calculate average ptsatisfaction by hospital
   save originaldata
   collapse (mean) ptsatisfaction, by(hospital)
   save hospitalcollapse




Collapse



   You could also generate different statistics for multiple variables

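   // Not run: mean ptsatisfaction, median ptincome, and sd of ptsatisfaction by hospital
   // (Stata needs a distinct name for the second use of ptsatisfaction)
   collapse (mean) ptsatisfaction (median) ptincome (sd) sdptsat=ptsatisfaction, by(hospital)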

   What if you want to rename your new variables in this process?

   // Same as previous example, but rename variables
   collapse (mean) ptsatmean=ptsatisfaction (median) ptincmed=ptincome ///
    (sd) sdptsat=ptsatisfaction, by(hospital)
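
The renamed collapse can be tried on a few made-up patients. The values below are invented, and preserve and restore are used here as one possible alternative to saving the data under a new name first; they work as shown when run from a do-file.

```stata
// toy data, for illustration only
clear
input hospital ptsatisfaction ptincome
1 4 30
1 2 50
2 5 40
2 3 60
2 4 80
end

preserve   // keep a copy of the patient-level data
collapse (mean) ptsatmean=ptsatisfaction (median) ptincmed=ptincome ///
    (sd) sdptsat=ptsatisfaction, by(hospital)
list       // one row per hospital
restore    // the original patient-level rows are back
```

If the collapsed data are needed later, add a save command between list and restore.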




Exercise 2: Merge, Append, and Joinby
Open the dataset, gss2.dta
  1   The gss2 dataset contains only half of the variables that are in the
      complete gss dataset. Merge dataset gss1 with dataset gss2. The
      identification variable is “id.”
  2   Open the dataset, gss.dta
  3   Merge in data from the “marital.dta” dataset, which includes income
      information grouped by individuals’ marital status. The marital
      dataset contains collapsed data regarding average statistics of
      individuals based on their marital status.
  4   Additional observations for the gssAppend.dta dataset can be found in
      “gssAddObserve.dta.” Create a new dataset that combines the
      observations in gssAppend.dta with those in gssAddObserve.dta.
  5   Create a new dataset that summarizes mean and standard deviation of
      income based on individuals’ degree status (“degree”). In the process
      of creating this new dataset, rename your three new variables.
Wrap-up

Help Us Make This Workshop Better




   Please take a moment to fill out a very short feedback form
   These workshops exist for you–tell us what you need!
   http://tinyurl.com/StataDatManFeedback




Additional resources



   training and consulting
         IQSS workshops:
         http://projects.iq.harvard.edu/rtc/filter_by/workshops
         IQSS statistical consulting: http://rtc.iq.harvard.edu
   Stata resources
         UCLA website: http://www.ats.ucla.edu/stat/Stata/
         Great for self-study
         Links to resources
   Stata website: http://www.stata.com/help.cgi?contents
   Email list: http://www.stata.com/statalist/




R learning by examples
 

Similar a Data Management in Stata

Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data setsCreditas
 
Data ware house design
Data ware house designData ware house design
Data ware house designSayed Ahmed
 
Data ware house design
Data ware house designData ware house design
Data ware house designSayed Ahmed
 
Writing maintainable Oracle PL/SQL code
Writing maintainable Oracle PL/SQL codeWriting maintainable Oracle PL/SQL code
Writing maintainable Oracle PL/SQL codeDonald Bales
 
Writing Maintainable Code
Writing Maintainable CodeWriting Maintainable Code
Writing Maintainable CodeDonald Bales
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresSteven Johnson
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Cincinnati Tableau User Group Event #1
Cincinnati Tableau User Group Event #1Cincinnati Tableau User Group Event #1
Cincinnati Tableau User Group Event #1Russell Spangler
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
 
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19 156 Page .docx
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19   156   Page .docxIMG1.jpgIMG2.jpgIMG3.jpg2016 6 19   156   Page .docx
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19 156 Page .docxwilcockiris
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
MS ACCESS Tutorials
MS ACCESS TutorialsMS ACCESS Tutorials
MS ACCESS TutorialsDuj Law
 

Similar a Data Management in Stata (20)

Pig - Analyzing data sets
Pig - Analyzing data setsPig - Analyzing data sets
Pig - Analyzing data sets
 
Introduction to dbms
Introduction to dbmsIntroduction to dbms
Introduction to dbms
 
Data ware house design
Data ware house designData ware house design
Data ware house design
 
Data ware house design
Data ware house designData ware house design
Data ware house design
 
Writing maintainable Oracle PL/SQL code
Writing maintainable Oracle PL/SQL codeWriting maintainable Oracle PL/SQL code
Writing maintainable Oracle PL/SQL code
 
Writing Maintainable Code
Writing Maintainable CodeWriting Maintainable Code
Writing Maintainable Code
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science MeetupAgile Data Science 2.0 - Big Data Science Meetup
Agile Data Science 2.0 - Big Data Science Meetup
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
SQL for interview
SQL for interviewSQL for interview
SQL for interview
 
Cincinnati Tableau User Group Event #1
Cincinnati Tableau User Group Event #1Cincinnati Tableau User Group Event #1
Cincinnati Tableau User Group Event #1
 
10 class.pptx
10 class.pptx10 class.pptx
10 class.pptx
 
Stored procedure tunning
Stored procedure tunningStored procedure tunning
Stored procedure tunning
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19 156 Page .docx
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19   156   Page .docxIMG1.jpgIMG2.jpgIMG3.jpg2016 6 19   156   Page .docx
IMG1.jpgIMG2.jpgIMG3.jpg2016 6 19 156 Page .docx
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
MS ACCESS Tutorials
MS ACCESS TutorialsMS ACCESS Tutorials
MS ACCESS Tutorials
 

Último

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Último (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Data Management in Stata

  • 1. Data Management in Stata Ista Zahn IQSS October 12th 2012 The Institute for Quantitative Social Science at Harvard University Ista Zahn (IQSS) Data Management in Stata October 12th 2012 1 / 51
  • 2. Outline
      1. Introduction
      2. Generating and replacing variables
      3. By processing
      4. Missing values
      5. Variable types
      6. Merging, appending, and joining
      7. Creating summarized data sets
      8. Wrap-up
  • 3. Topic 1: Introduction
  • 4. Introduction: Materials and Setup
      USERNAME: dataclass
      PASSWORD: dataclass
      Find class materials at: Scratch > DataClass > StataDatMan
      Copy this folder to your desktop!
  • 5. Introduction: Workshop Description
      This is an introduction to data management in Stata
      Assumes basic knowledge of Stata
      Not appropriate for people who are already very familiar with Stata
      If you are catching on before the rest of the class, experiment with the command features described in the help files
  • 6. Introduction: Organization
      Please feel free to ask questions at any point if they are relevant to the current topic (or if you are lost!)
      There will be a Q&A after class for more specific, personalized questions
      Collaboration with your neighbors is encouraged
      If you are using a laptop, you will need to adjust paths accordingly
  • 7. Introduction: Opening Files in Stata
      Look at the bottom left-hand corner of the Stata screen
      This is the directory Stata is currently reading from
      Files are located in the StataDatMan folder on the Desktop
      Start by telling Stata where to look for these
        // change directory
        cd "C:/Users/dataclass/Desktop/StataDatMan"
        // Use dir to see what is in the directory:
        dir
        // use the gss data set
        use gss.dta
  • 8. Topic 2: Generating and replacing variables
  • 9. Generating and replacing variables: Logic Statements Useful for Data Manipulation
      Operator   Meaning
      ==         equal to (status quo)
      =          used in assigning values
      !=         not equal to
      >          greater than
      >=         greater than or equal to
      &          and
      |          or
  • 10. Generating and replacing variables: Basic Data Manipulation Commands
      Basic commands you’ll use for generating new variables or recoding existing variables:
        egen
        replace
        recode
      There are many different ways to accomplish the same thing in Stata
      Find what is comfortable (and easy) for you
  • 11. Generating and replacing variables: Generate and Replace
      The gen command is often used with logic statements, as in this example:
        // create "hapnew" variable
        gen hapnew = . // set to missing
        // set to 1 if happy equals 1
        replace hapnew=1 if happy==1
        // set to 1 if happy and hapmar are both greater than 3
        replace hapnew=1 if happy>3 & hapmar>3
        // set to 3 if happy or hapmar is greater than 4
        replace hapnew=3 if happy>4 | hapmar>4
        tab hapnew // tabulate the new variable
  • 12. Generating and replacing variables: Recode
      The recode command is basically generate and replace combined
      You can recode an existing variable OR use recode to create a new variable
        // recode the wrkstat variable
        recode wrkstat (1=8) (2=7) (3=6) (4=5) (5=4) (6=3) (7=2) (8=1)
        // recode wrkstat into a new variable named wrkstat2
        recode wrkstat (1=8), gen(wrkstat2)
        // tabulate wrkstat
        tab wrkstat
  • 13. Generating and replacing variables: Basic Rules for Recode
      Rule             Example      Meaning
      # = #            3=1          3 recoded to 1
      # # = #          2 . = 9      2 and . recoded to 9
      #/# = #          1/5=4        1 through 5 recoded to 4
      nonmissing = #   nonmiss=8    nonmissing recoded to 8
      missing = #      miss=9       missing recoded to 9
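The recode rules above can be tried on the auto dataset that ships with Stata. This is a minimal sketch, assuming that dataset is available through sysuse; the new variable name rep3 and the three-group mapping are invented for illustration and are not part of the workshop materials.

```stata
// load the example dataset bundled with Stata
sysuse auto, clear
// collapse the 1-5 repair record into three groups in a new variable
// (rep3 is an illustrative name; missing values stay missing)
recode rep78 (1/2=1) (3=2) (4/5=3) (missing=.), generate(rep3)
// cross-tabulate old against new to check the mapping, including missing
tabulate rep78 rep3, missing
```

Using generate() leaves the original variable untouched, which makes it easy to check the mapping before relying on it.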
  • 14. Generating and replacing variables: egen
      egen means “extension” to generate
      Contains a variety of more sophisticated functions
      Type “help egen” in Stata to get a complete list of functions
      Let’s create a new variable that counts the number of “yes” responses on computer, email and internet use:
        // count number of yes on comp email and interwebs
        egen compuser = anycount(usecomp usemail usenet), values(1)
        tab compuser
        // assess how much missing data each participant has:
        egen countmiss = rowmiss(age-wifeft)
        codebook countmiss
        // compare values on multiple variables
        egen ftdiff=diff(wkft//)
        codebook ftdiff
  • 15. Topic 3: By processing
  • 16. By processing: The “By” Command
      Sometimes, you’d like to create variables based on different categories of a single variable
      For example, say you want to look at happiness based on whether an individual is male or female
      The “by” prefix does just this:
        // tabulate happy separately for male and female
        bysort sex: tab happy
        // generate summary statistics using bysort
        bysort state: egen stateincome = mean(income)
        bysort degree: egen degreeincome = mean(income)
        bysort marital: egen marincomesd = sd(income)
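The same bysort pattern can be run without the class data. The sketch below assumes the bundled auto dataset; the variable names mean_price and price_dev are made up for illustration.

```stata
// load the example dataset bundled with Stata
sysuse auto, clear
// group mean of price, repeated on every row within each level of foreign
bysort foreign: egen mean_price = mean(price)
// each car's deviation from its own group's mean
generate price_dev = price - mean_price
// summarize separately for domestic and foreign cars
bysort foreign: summarize price mean_price price_dev
```

Because egen writes the group statistic onto every row, it can be combined with ordinary generate statements afterwards, as in the deviation computed here.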
  • 17. By processing: The “By” Command
      Some commands won’t work with the by prefix, but have by options:
        // generate separate histograms for female and male
        hist happy, by(sex)
  • 18. Topic 4: Missing values
  • 19. Missing values: Missing Values
      Always consider how missing values are coded when recoding variables
      Stata’s symbol for a missing value is “.”
      Stata treats “.” as larger than any nonmissing number
      What are the implications of this?
  • 20. Missing values: Missing Values
      If we want to generate a new variable that identifies highly educated women, we might use the command:
        // generate and replace without considering missing values
        gen hi_ed=0
        replace hi_ed=1 if wifeduc>15
        // What happens to our missing values when we tabulate?
        tab hi_ed, mi nola
        // Instead, we might try:
        drop hi_ed
        // gen hi_ed, but don’t set a value if wifeduc is missing
        gen hi_ed = 0 if wifeduc != .
        // only replace non-missing
        replace hi_ed=1 if wifeduc >15 & wifeduc !=.
        tab hi_ed, mi // check to see that missingness is preserved
      Note that you need the “mi” option to tab to view your missing data values
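The effect of missing values on a "greater than" condition can be seen directly by counting. This is a small sketch assuming the bundled auto dataset, in which rep78 has some missing values; it is not part of the class materials.

```stata
// load the example dataset bundled with Stata
sysuse auto, clear
// how many observations are missing on rep78?
count if missing(rep78)
// missing sorts above every number, so this count includes the missing rows
count if rep78 > 4
// excluding missing explicitly gives the intended count
count if rep78 > 4 & !missing(rep78)
```

The last two counts differ by exactly the number of missing observations reported by the first count.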
  • 21. Missing values: Missing Values
      What if you originally used a numeric value to code missing data (e.g., “999”)?
      The mvdecode command will convert all these values to missing
        mvdecode _all, mv(999)
      The “_all” keyword tells Stata to do this to all variables
      Use this command carefully!
      If you have any variables where “999” is a legitimate value, Stata is going to recode it to missing
      As an alternative, you could list variable names separately rather than using “_all”
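Listing variables explicitly, as suggested above, might look like the following. The tiny dataset and the variable names (id, income, hrs) are invented for this sketch.

```stata
// build a small illustrative dataset where 999 marks missing data
clear
input id income hrs
1 45 40
2 999 38
3 30 999
end
// convert 999 to missing only in the listed variables, leaving id alone
mvdecode income hrs, mv(999)
list
```

Naming the variables keeps identifiers and other columns where 999 could be a real value out of the conversion.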
  • 22. Topic 5: Variable types
  • 23. Variable types: Variable Types
      Stata uses two main types of variables: String and Numeric
      String variables are typically used for text variables
      To be able to perform any mathematical operations, your variables need to be in a numeric format
  • 24. Variable types: Variable Types
      Stata’s numeric variable types:
                                                                Closest to 0
      type      Minimum                 Maximum                 without being 0   bytes
      ----------------------------------------------------------------------------------
      byte      -127                    100                     +/-1              1
      int       -32,767                 32,740                  +/-1              2
      long      -2,147,483,647          2,147,483,620           +/-1              4
      float     -1.70141173319*10^38    1.70141173319*10^38     +/-10^-38         4
      double    -8.9884656743*10^307    8.9884656743*10^307     +/-10^-323        8
      ----------------------------------------------------------------------------------
      Precision for float is 3.795x10^-8.
      Precision for double is 1.414x10^-16.
  • 25. Variable types: Variable Types
      How can I deal with those annoying string variables?
      Sometimes you need to convert to/from string variables
        // not run; generate "newvar" equal to numeric version of var1
        destring var1, gen(newvar)
        // not run; generate "newvar" equal to string version of var1
        tostring var1, gen(newvar)
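A runnable round trip between string and numeric storage could look like this. The data and the variable names (score_str, score_num, score_back) are made up for the sketch.

```stata
// build a small dataset with numbers stored as text
clear
input str5 score_str
"10"
"25"
"7"
end
// numeric copy of the string variable
destring score_str, generate(score_num)
// string copy of the numeric variable
tostring score_num, generate(score_back)
// confirm the storage type of each variable
describe
```

Using generate() in both directions keeps the original variable, so the converted values can be compared against it before anything is dropped.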
  • 26. Variable types: Date and Time Variable Types
      Stata offers several options for date and time variables
      Generally, Stata will read date/time variables as strings
      You’ll need to convert string variables in order to perform any mathematical operations
      Once data is in date/time form, Stata uses several display formats to identify these variables: %tc, %td, %tw, etc.
  • 27. Variable types: Variable Types: Date and Time
      Format    String-to-numeric conversion function
      -------+-----------------------------------------
      %tc       clock(string, mask)
      %tC       Clock(string, mask)
      %td       date(string, mask)
      %tw       weekly(string, mask)
      %tm       monthly(string, mask)
      %tq       quarterly(string, mask)
      %th       halfyearly(string, mask)
      %ty       yearly(string, mask)
      %tg       no function necessary; read as numeric
      -------------------------------------------------
  • 28. Variable types: Variable Types: Date and Time
      Format   Meaning      Value = -1               Value = 0                Value = 1
      %tc      clock        31dec1959 23:59:59.999   01jan1960 00:00:00.000   01jan1960 00:00:00.001
      %td      days         31dec1959                01jan1960                02jan1960
      %tw      weeks        1959w52                  1960w1                   1960w2
      %tm      months       1959m12                  1960m1                   1960m2
      %tq      quarters     1959q4                   1960q1                   1960q2
      %th      half-years   1959h2                   1960h1                   1960h2
      %tg      generic      -1                       0                        1
  • 29. Variable types: Variable Types: Date and Time
      To convert a string variable into date/time format, first select the date/time format you’ll be using (e.g., %tc, %td, %tw, etc.)
      Let’s say we create a string variable holding today’s date (today) that we want to format
        // create string variable and convert to date
        gen today = "Feb 18, 2011"
        gen date1 = date(today, "MDY")
        tab date1
        // use the format command to change how the date is displayed
        // format so humans can read the date
        format date1 %td
        tab date1
  • 30. Variable types: Variable Types
      What if you have a variable “time” formatted as DDMMYYYYhhmmss?
        // Not run: conceptual example only
        generate double time2 = clock(time, "DMYhms")
        tab time2
        format time2 %tc
        tab time2
      The “double” storage type is necessary for all clock formats
      Clock values are large numbers (milliseconds since 01jan1960), and double tells Stata to store them without losing precision
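The date and clock conversions on the last two slides can be combined into one small runnable sketch. The string values and variable names (day_str, stamp_str, day_num, stamp_num) are invented here; the strings use separators, which is an assumption about how the raw data are laid out.

```stata
// one-row dataset with a string date and a string date-time (made-up values)
clear
set obs 1
generate str11 day_str = "12 Oct 2012"
generate str20 stamp_str = "12oct2012 14:30:00"
// daily date: stored as days since 01jan1960
generate day_num = date(day_str, "DMY")
format day_num %td
// clock value: stored as milliseconds since 01jan1960, hence double
generate double stamp_num = clock(stamp_str, "DMYhms")
format stamp_num %tc
list
```

The format commands only change how the numbers are displayed; the stored values remain plain numerics that can be subtracted or compared.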
  • 31. Variable types: Exercise 1: Generate, Replace, Recode & Egen
      Open the datafile, gss.dta.
      1. Generate a new variable that represents the squared value of age.
      2. Recode values “99” and “98” on the variable “hrs1” as “missing.”
      3. Generate a new variable equal to “1” if income is greater than “19”.
      4. Recode the marital variable into a “string” variable and then back into a numeric variable.
      5. Create a new variable that counts the number of times a respondent answered “don’t know” in regard to the following variables: life, richwork, hapmar.
      6. Create a new variable that counts the number of missing responses for each respondent.
      7. Create a new variable that associates each individual with the average number of hours worked among individuals with matching educational degrees.
  • 32. Topic 6: Merging, appending, and joining
  • 33. Merging, appending, and joining: Merging Datasets
      Merge in Stata is for adding new variables from a second dataset to the dataset you’re currently working with
      Current active dataset = master dataset
      Dataset you’d like to merge with master = using dataset
      If you want to add OBSERVATIONS, you’d use “append” (we’ll go over that next)
  • 34. Merging, appending, and joining: Merging Datasets
      There are several different ways that you might be interested in merging data
      Two datasets with the same participant pool, one row per participant (1:1)
      A dataset with one participant per row merged with a dataset with multiple rows per participant (1:many or many:1)
  • 35. Merging, appending, and joining: Merging Datasets
      Stata will create a new variable (“_merge”) that describes the source of the data
      Use the “nogenerate” option if you don’t want _merge created
      Use the “generate(varname)” option to give the _merge variable your own name
      Need to add IDs to your dataset?
        // create a variable "id" equal to the row number
        generate id = _n
  • 36. Merging, appending, and joining: Merging Datasets
      Before you begin:
      Identify the “ID” that you will use to merge your two datasets
      Determine which variables you’d like to merge
      In Stata >= 11, data does NOT have to be sorted
      Variable types must match across datasets
      Can use the “force” option to get around this, but not recommended
  • 37. Merging, appending, and joining: Merging Datasets
      Let’s say I want to perform a 1:1 merge using the dataset “data2” and the ID “participant”
        merge 1:1 participant using data2.dta
      Now, let’s say that I have one dataset with individual students (master) and another dataset with information about the students’ schools called “school”
        // Not run: conceptual example only. Merge school and student data
        merge m:1 schoolID using school.dta
  • 38. Merging, appending, and joining: Merging Datasets
      What if my school dataset was the master and my student dataset was the using dataset?
        // Not run: conceptual example only.
        merge 1:m schoolID using student.dta
      It is also possible to do a many:many merge
      Data needs to be sorted in both the master and using datasets
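The student and school merge described above can be made runnable with two tiny made-up datasets. All names and values below (studentID, schoolID, schoolname, the tempfile school) are hypothetical and stand in for the conceptual school.dta and student data.

```stata
// using dataset: one row per school
clear
input schoolID str10 schoolname
1 "North"
2 "South"
end
tempfile school
save `school'

// master dataset: one row per student, several students per school
clear
input studentID schoolID
101 1
102 1
103 2
104 3
end
// many students map to one school row
merge m:1 schoolID using `school'
// _merge shows which rows matched and which came from only one file
tabulate _merge
list
```

Student 104 belongs to a school that is absent from the school file, so its row appears as master-only in _merge, which is a quick way to spot ID problems after a merge.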
  • 39. Merging, appending, and joining: Merging Datasets
      Update and replace options:
      In a standard merge, the master dataset is the authority and WON’T CHANGE
      What if your master dataset has missing data and some of those values are not missing in your using dataset?
      Specify “update”: it will fill in missing values without changing nonmissing ones
      What if you want data from your using dataset to overwrite that in your master?
      Specify “update replace”: it will replace master data with using data UNLESS the value is missing in the using dataset
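A small sketch of the update option, using invented data: the master has missing scores for two ids, and the using file (held in a tempfile with the hypothetical name using_scores) has values for all three.

```stata
// using dataset: complete scores
clear
input id score
1 70
2 95
3 60
end
tempfile using_scores
save `using_scores'

// master dataset: two scores are missing
clear
input id score
1 .
2 80
3 .
end
// fill in missing master values from using; nonmissing master values are kept
merge 1:1 id using `using_scores', update
list
```

With update alone, id 2 keeps its master value of 80. Adding the replace option as well would overwrite it with the using value.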
  • 40. Merging, appending, and joining: Appending Datasets. Sometimes, you’ll have observations in two different datasets, or you’d like to add observations to an existing dataset. Append will simply add observations to the end of the observations in the master.
        // Not run: conceptual example. add rows of data from dataset2
        append using dataset2
  • 41. Merging, appending, and joining: Appending Datasets. Some options with append: generate(newvar) will create a variable that identifies the source of each observation.
        // Not run: conceptual example.
        append using dataset1, generate(observesource)
    “force” will allow for data type mismatches (again, this is not recommended).
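As a small illustration of append with a source indicator, the sketch below stacks two invented datasets; the variable name source and the temporary file are assumptions for this example only. Run it as a do-file.

```stata
// Toy dataset to be appended (hypothetical values)
clear
input id score
3 70
4 65
end
tempfile wave2
save `wave2'

// Toy master dataset (hypothetical values)
clear
input id score
1 90
2 85
end

append using `wave2', generate(source)
list
// source == 0 for rows that were already in the master,
// source == 1 for rows that came from the using file
```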
  • 42. Merging, appending, and joining: Joinby. Merge will add new observations from using that do not appear in master. Sometimes, you need to add variables from using but want to be sure the list of participants in your master does not change.
        // Not run: conceptual example. Similar to merge but drops non-matches
        joinby participant using dataset1
    Any observations in using that are NOT in master will be omitted. By default, joinby also drops master observations that have no match in using; add the unmatched(master) option to keep them.
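The effect of joinby on unmatched observations can be checked with a toy example. The participant, age, and visit values below are invented, and the temporary file name is an assumption; run it as a do-file.

```stata
// Toy "using" dataset: repeated visits (hypothetical values)
clear
input participant visit
1 1
1 2
3 1
end
tempfile visits
save `visits'

// Toy "master" dataset: one row per participant (hypothetical values)
clear
input participant age
1 30
2 41
end

preserve
joinby participant using `visits'
list     // only participant 1 remains (two rows, one per visit)
restore

joinby participant using `visits', unmatched(master)
list     // participants 1 and 2 remain; participant 3 (using only) is omitted
```

If the goal is simply to keep the master's list of participants in a 1:1 or m:1 setting, `merge` with the `keep(master match)` option is another route to the same result.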
  • 43. Topic 7: Creating summarized data sets
  • 44. Creating summarized data sets: Collapse. Collapse will take master data and create a new dataset of summary statistics. Useful in hierarchical linear modeling if you’d like to create aggregate summary statistics. Can generate group summary data for many descriptive stats: mean, median, sd, sum, min, max, percentiles, standard errors. Can also attach weights.
  • 45. Creating summarized data sets: Collapse. Before you collapse: save your master dataset and then save it again under a new name. This will prevent collapse from writing over your original data. Consider issues of missing data. Do you want Stata to use all possible observations? If not, the cw (casewise) option will make casewise deletions.
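To see what the cw option changes, the sketch below collapses the same invented data twice. The hospital values are made up, and preserve/restore is used here as an alternative way to protect the data in memory (an assumption of this example, not the approach described above). Run it as a do-file.

```stata
// Toy patient data; one ptincome value is missing (hypothetical values)
clear
input hospital ptsatisfaction ptincome
1 4 30000
1 2 50000
2 5 40000
2 3 .
end

preserve
collapse (mean) ptsatisfaction ptincome, by(hospital)
list     // hospital 2: mean ptsatisfaction is 4 (both rows used)
restore

preserve
collapse (mean) ptsatisfaction ptincome, by(hospital) cw
list     // hospital 2: mean ptsatisfaction is 5 (row with missing ptincome dropped)
restore
```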
  • 46. Creating summarized data sets: Collapse. Let’s say you have a dataset with patient information from multiple hospitals. You want to generate mean levels of patient satisfaction for EACH hospital.
        // Not run: conceptual example. calculate average ptsatisfaction by ho
        save originaldata
        collapse (mean) ptsatisfaction, by(hospital)
        save hospitalcollapse
  • 47. Creating summarized data sets: Collapse. You could also generate different statistics for multiple variables.
        // create mean ptsatisfaction, median ptincome, sd ptsatisfaction for
        collapse (mean) ptsatisfaction (median) ptincome (sd) ptsatisfaction,
    What if you want to rename your new variables in this process?
        // Same as previous example, but rename variables
        collapse (mean) ptsatmean=ptsatisfaction (median) ptincmed=ptincome (sd) sdptsat=ptsatisfaction, by(hospital)
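A runnable version of the renaming pattern, on invented data, might look like the sketch below. The values are hypothetical, and the `///` line continuation only works in a do-file. When the same source variable feeds two statistics (here, the mean and sd of ptsatisfaction), each result needs its own target name; without distinct names Stata should report a name conflict, because both results would be stored under ptsatisfaction.

```stata
// Toy patient data (hypothetical values)
clear
input hospital ptsatisfaction ptincome
1 4 30000
1 2 50000
2 5 40000
2 3 60000
end

// newname=oldname gives each summary statistic its own variable
collapse (mean) ptsatmean=ptsatisfaction (median) ptincmed=ptincome ///
    (sd) sdptsat=ptsatisfaction, by(hospital)
list
// hospital 1: ptsatmean 3, ptincmed 40000
// hospital 2: ptsatmean 4, ptincmed 50000
```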
  • 48. Creating summarized data sets: Exercise 2: Merge, Append, and Joinby. Open the dataset, gss2.dta
      1   The gss2 dataset contains only half of the variables that are in the complete gss dataset. Merge dataset gss1 with dataset gss2. The identification variable is “id.”
      2   Open the dataset, gss.dta
      3   Merge in data from the “marital.dta” dataset, which includes income information grouped by individuals’ marital status. The marital dataset contains collapsed data regarding average statistics of individuals based on their marital status.
      4   Additional observations for the gssAppend.dta dataset can be found in “gssAddObserve.dta.” Create a new dataset that combines the observations in gssAppend.dta with those in gssAddObserve.dta.
      5   Create a new dataset that summarizes mean and standard deviation of income based on individuals’ degree status (“degree”). In the process of creating this new dataset, rename your three new variables.
  • 49. Topic 8: Wrap-up
  • 50. Wrap-up: Help Us Make This Workshop Better. Please take a moment to fill out a very short feedback form. These workshops exist for you; tell us what you need! http://tinyurl.com/StataDatManFeedback
  • 51. Wrap-up: Additional resources
      Training and consulting
        IQSS workshops: http://projects.iq.harvard.edu/rtc/filter_by/workshops
        IQSS statistical consulting: http://rtc.iq.harvard.edu
      Stata resources
        UCLA website: http://www.ats.ucla.edu/stat/Stata/ (great for self-study; links to resources)
        Stata website: http://www.stata.com/help.cgi?contents
        Email list: http://www.stata.com/statalist/