PROC SQL - What Does It Offer Traditional SAS® Programming?

                        Ian Whitlock, Westat Inc.


Abstract

PROC SQL plays two important roles in SAS:

      1.   To connect SAS with other database systems

      2.   To help the typical SAS data processing programmer

This paper looks at the second role. SAS programming problems are presented where PROC SQL has a distinct advantage over SAS programs that do not use this procedure. By focusing on such problems with coded solutions, I hope to answer the question, "When is it most appropriate to consider using PROC SQL in a SAS program?"


Introduction

The typical SAS programmer needs PROC SQL because:

      1.   It is superb at accessing data stored in multiple data sets at different levels.

      2.   It can easily produce a Cartesian product.

      3.   It can perform matching where the condition of a match is not equality.

      4.   It is good at summarization.

      5.   With the introduction of Release 6.11, it can make arrays of macro variables, or do away with the need for these arrays by assigning a whole column of values to one macro variable.

      6.   Macro - SQL interaction enhances both the macro language and SQL.

In addition to the direct values listed above, one should not underestimate the value of SQL training in teaching data organization. An example of how SQL can teach data organization is given at the end of the summarizing section.


Matching Multiple Data Sets at Different Levels

PROC SQL provides a powerful tool when extracting data from different data sets at several different levels. It not only provides simpler code; it provides a new way of looking at these problems.

Suppose we have data at the state, county, and city level stored in three data sets.

      state  ( state, region, ... )
      county ( cntyid, state, cnty, area, ... )
      city   ( city, cntyid, area, pop, ... )

Prepare a report of all cities in the Midwest with populations over 100,000, showing the ratio of the city area to the enclosing county area.

A SAS procedural solution demands that we decide whether to start with states or cities, specify all the sorts needed for the various DATA step merges, and specify those merges in detail, ending up with a PROC PRINT. A sketch of such a solution is shown after the SQL version below.

In contrast, SQL asks the fundamental questions:

      1.   What are the data sets?
      2.   What are the subsetting conditions?
      3.   What are the linking conditions?
      4.   What columns should appear?

      proc sql ;
         select st.state ,
                cn.cnty ,
                ct.city ,
                ct.area / cn.area as arearato
            from state as st ,
                 county as cn ,
                 city as ct
            where ct.pop > 100000 and
                  st.region = 'MW' and
                  st.state = cn.state and
                  cn.cntyid = ct.cntyid
         ;
      quit ;
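
For contrast, here is a minimal sketch of the procedural solution described above. It is not taken from the paper; the step and intermediate data set names are mine, and only the variables come from the layouts listed earlier.

      /* Keep midwestern states, then attach counties,
         then attach the large cities county by county. */
      proc sort data = state ( where = ( region = 'MW' ) )
                out = mwstate ;
         by state ;
      run ;

      proc sort data = county
                out = county2 ( rename = ( area = cnarea ) ) ;
         by state ;
      run ;

      data stcnty ;
         merge mwstate ( in = instate ) county2 ( in = incnty ) ;
         by state ;
         if instate and incnty ;
      run ;

      proc sort data = stcnty ;
         by cntyid ;
      run ;

      proc sort data = city ( where = ( pop > 100000 ) )
                out = bigcity ;
         by cntyid ;
      run ;

      data report ;
         merge stcnty ( in = incnty ) bigcity ( in = incity ) ;
         by cntyid ;
         if incnty and incity ;
         arearato = area / cnarea ;      /* city area / county area */
      run ;

      proc print data = report ;
         var state cnty city arearato ;
      run ;

Six steps, two renames, and two explicit sort orders, compared with one SELECT statement.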

In all the remaining example code, the PROC SQL statement and the QUIT statement will be omitted.


Cartesian Product

Cartesian product matches are far more common than one-to-one matches, but the MERGE statement assumes one-to-one matching within BY-groups. To find a Cartesian product match, let's look at a codebook example. I have three data sets:

      specs ( variable , format )
      freq  ( variable , format , value , count )
      fmts  ( format , value , label )

The report might look something like this.

      first variable using fmt1name format

          value   label   count
            1     first     500
            2     sec         0
            3     rem       300

      second variable using fmt2name format

          etc.

Before tackling the problem, let's look at the code for joining SPECS and FMTS. The problem involves a Cartesian product because a format may be associated with more than one variable in SPECS, and formats typically have more than one value. Thus FORMAT does not determine a single record in either data set; hence a merge by FORMAT will not work. Note that accomplishing this sort of combination in a DATA step without SQL requires fairly sophisticated SAS techniques.

      create view sfm as
      select s.variable ,
             s.format ,
             fm.value ,
             fm.label
         from specs as s , fmts as fm
         where s.format = fm.format
      ;

It is easy to see how to produce the report with a DATA _NULL_ step once the right information is in a single data set ( VARIABLE, FORMAT, VALUE, LABEL, and COUNT ). Here is the SQL code to produce that file.

      create table report as
      select coalesce ( sfm.variable , fq.variable )
                as variable ,
             coalesce ( sfm.format , fq.format )
                as format ,
             coalesce ( sfm.value , fq.value )
                as value ,
             sfm.label ,
             coalesce ( fq.count , 0 ) as count
         from sfm full join freq as fq
            on sfm.variable = fq.variable and
               sfm.format = fq.format and
               sfm.value = fq.value
         order by variable , format , value
      ;

Note that in two SQL statements we have done a lot of the work toward creating a codebook. If one could produce SPECS, FREQ, and FMTS easily, then one could produce a codebook for any properly formatted SAS data set. The FMTS file is easily produced from the CNTLOUT= data set of PROC FORMAT. The FREQ file requires some macro code. We will postpone discussion of the SPECS file to a later section.
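
Here is one way the DATA _NULL_ step mentioned above might look. This is a sketch, not code from the paper; it assumes the REPORT table just created and uses print columns chosen arbitrarily.

      data _null_ ;
         set report ;
         by variable ;              /* REPORT is ordered by variable */
         file print ;
         if first.variable then
            do ;
               put // variable 'variable using ' format 'format' / ;
               put @13 'value' @21 'label' @35 'count' ;
            end ;
         put @13 value @21 label @35 count ;
      run ;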

Fuzzy Matching

Fuzzy matching comes in two varieties. In date (or time) line matching, one file holds a specific date (or time) and one wants the corresponding record that holds a range of dates (or times). For SAS dates DATE, BEGDATE, and ENDDATE the WHERE clause might be

      where date between begdate and enddate

For efficiency reasons it is important to add an equi-condition whenever possible. In date (or time) matches one often has an ID that must also match; hence the condition becomes

      where a.id = b.id and
            date between begdate and enddate
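
A complete query of this kind might look like the sketch below. The table names VISITS and RATES and the column RATE are illustrative assumptions, not part of the paper's example.

      create table visit_rate as
         select v.id , v.date , r.rate
            from visits as v , rates as r
            where v.id = r.id and
                  v.date between r.begdate and r.enddate
      ;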
                                                                                 group by school
In the other kind of fuzzy matching one cannot trust the identifying variables. Suppose we want to match on social security numbers, SSN, but expect transposition errors and single digit mutations. Now the WHERE clause might be

      where sum ( substr(a.SSN,1,1) =
                     substr(b.SSN,1,1) ,
                  substr(a.SSN,2,1) =
                     substr(b.SSN,2,1) ,
                  ....
                  substr(a.SSN,9,1) =
                     substr(b.SSN,9,1)
                ) >= 7

To make this an equi-join we might add

      and substr(a.zip,1,3) =
             substr(b.zip,1,3)

or some other relatively safe blocking variable.
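
For illustration, here is the whole query written out; it is a sketch rather than code from the paper, with A and B as the two input tables and ID assumed as a key to carry along.

      create table candidates as
         select a.id as a_id , b.id as b_id ,
                a.SSN as a_ssn , b.SSN as b_ssn
            from a , b
            where substr(a.zip,1,3) = substr(b.zip,1,3) and
                  sum ( substr(a.SSN,1,1) = substr(b.SSN,1,1) ,
                        substr(a.SSN,2,1) = substr(b.SSN,2,1) ,
                        substr(a.SSN,3,1) = substr(b.SSN,3,1) ,
                        substr(a.SSN,4,1) = substr(b.SSN,4,1) ,
                        substr(a.SSN,5,1) = substr(b.SSN,5,1) ,
                        substr(a.SSN,6,1) = substr(b.SSN,6,1) ,
                        substr(a.SSN,7,1) = substr(b.SSN,7,1) ,
                        substr(a.SSN,8,1) = substr(b.SSN,8,1) ,
                        substr(a.SSN,9,1) = substr(b.SSN,9,1)
                      ) >= 7
         ;

Because each comparison evaluates to 1 or 0, the SUM function simply counts how many of the nine digits agree.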


Summarizing

One PROC SQL step can do the job of a PROC SUMMARY step followed by a merge of the results with the original data. For example, suppose we have a weighted student sample including many different schools. We want the percentage weight of each student in a school. Then we might have:

      select school , student , wght ,
             100*wght/sum(wght) as pctwt
         from studsamp
         group by school
      ;

In this case one gets a message that summary data was remerged with the original data, but that is precisely what we wanted.

Now suppose we want to look at all the students from any school which has some student contributing more than 20% of the weight. The code might be

      select stu.*
         from studsamp as stu ,
              ( select distinct school
                   from studsamp
                   group by school
                   having wght/sum(wght) > .2
              ) as want
         where stu.school = want.school
         order by stu.school , stu.wght desc
      ;

The technique is important because there are many times one wants to view everyone in a group if anyone in the group has some property. SQL provides a natural idiom for producing the report.

Knowing SQL should also make one more sensitive to bad patterns of storing data. For example, a common question on SAS-L is how to array data. Given the data

      ID       DATE        COUNT

       1     5jun1993        50
       1     16oct1993       25
       1     21dec1993        8
       2     14may1990       16
       2     27jan1991        3

how do you produce one record per ID with as many date and count fields as needed, say ID, DATE1 - DATE32 and COUNT1 - COUNT32? Another common question is how to work with the arrayed data. For example, how can you compute the rate of decrease in count per month and per year for each ID? The answer is a trivial SQL problem when the data are stored as they were originally given.

      select id ,
             (max(count) - min(count)) /
             intck('month',min(date),max(date))
                 as decpmon ,
             calculated decpmon * 12 as decpyr
         from origdata
         group by id
      ;

After arraying, it becomes a harder problem. Perhaps if SAS programmers learned SQL, and how to solve problems without arrays, then they would also learn the advantages of storing data in a non-arrayed form. With SQL training, one comes to realize the importance of putting the information into the data instead of into the variable names.
Of course, this also means that the usefulness of SQL is highly dependent on how well the data are stored, but it would be wrong to conclude that one might as well avoid learning SQL because of bad data management practices.
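
To make the contrast concrete, here is a rough sketch, not from the paper, of the same calculation once the data have been arrayed; ARRAYED and RATES are hypothetical data set names.

      data rates ;
         set arrayed ;                    /* one record per ID with
                                             date1-date32, count1-count32 */
         array dt {32} date1  - date32 ;
         array ct {32} count1 - count32 ;
         mindate = . ; maxdate = . ;
         mincnt  = . ; maxcnt  = . ;
         do i = 1 to 32 ;
            if dt{i} > . then
               do ;
                  if mindate = . or dt{i} < mindate then mindate = dt{i} ;
                  if dt{i} > maxdate then maxdate = dt{i} ;
               end ;
            if ct{i} > . then
               do ;
                  if mincnt = . or ct{i} < mincnt then mincnt = ct{i} ;
                  if ct{i} > maxcnt then maxcnt = ct{i} ;
               end ;
         end ;
         if intck('month', mindate, maxdate) > 0 then
            do ;
               decpmon = (maxcnt - mincnt) /
                         intck('month', mindate, maxdate) ;
               decpyr  = decpmon * 12 ;
            end ;
         keep id decpmon decpyr ;
      run ;

The bookkeeping that the SQL aggregates did for free now has to be written by hand.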

Macro Lists Via PROC SQL

PROC SQL's ability to assign a whole column of values to a macro variable has drastically changed how one writes macro code. Consider the splitting problem. Given a data set ALL with a variable SPLIT naming a member, split ALL into separate data sets by the variable SPLIT. Before Version 6.11 one had to use CALL SYMPUT to create an array of data set names and values and then write a monster SELECT statement. The whole thing had to be in a macro in order to process the array repetitively. Now one might view it as a problem of producing two lists:

      1.   The names of the data sets
      2.   WHEN / OUTPUT statements for a SELECT block

The first case is easily handled by

      select distinct 'lib.'||split
             into :datalist separated by ' '
         from all ;

The second is more of the same, only harder. (Note the embedded quotation marks, needed because SPLIT is a character variable.)

      select distinct
             'when ("'||split||'") output lib.'||split
          into :whenlist separated by ';'
          from all
      ;

Now the code to produce the split is trivial and need not even be housed in a macro.

      data &datalist ;
         set all ;
         select ( split ) ;
            &whenlist ;
            otherwise ;
         end ;
      run ;
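
As a concrete illustration (mine, not the paper's): if SPLIT takes only the two values A and B, the generated step resolves to roughly

      data lib.A lib.B ;
         set all ;
         select ( split ) ;
            when ("A") output lib.A ;
            when ("B") output lib.B ;
            otherwise ;
         end ;
      run ;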




In the section on the Cartesian product, we postponed discussion of the data set SPECS. It could be generated from one of the "dictionary" files documented in Technical Report P-222. Suppose we are interested in making a codebook for the data set LIB.MYDATA; then the following code could generate SPECS.

      create table specs as
         select name as variable ,
                case
                   when format = '' and type = 'char'
                        then '$char.'
                   when format = '' and type = 'num'
                        then 'best.'
                   else format
                end as format
            from dictionary.columns
            where libname = 'LIB' and
                  memname = 'MYDATA'
      ;

To prepare for doing the frequencies needed to make the data set FREQ we could use the array form of generating macro variables from a column.

      select variable ,
             format
          into :var1 - :var9999 ,
               :fmt1 - :fmt9999
         from specs
      ;
      %let nvar = &sqlobs ;

The frequency data sets can then be generated with PROC FREQ and the macro code below (the %DO loop must itself live inside a macro).

      proc freq data = lib.mydata ;
         %do i = 1 %to &nvar ;
            tables &&var&i / out = &&var&i ;
            format &&var&i &&fmt&i ;
         %end ;
      run ;

We still have not combined the frequency data sets into one data set, but that task can be left to a competent macro programmer, even one who doesn't know SQL (assuming that that is not a contradiction in terms).
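
One possible way to finish the job, sketched here rather than taken from the paper: stack the per-variable output data sets with a small macro, turning each variable's values into a common character VALUE column. It assumes each output data set holds the original variable plus COUNT, and that a formatted character value is an acceptable join key for the codebook.

      %macro makefreq ;
         %local i ;
         data freq ;
            length variable $ 32 format $ 49 value $ 200 ;
            set
            %do i = 1 %to &nvar ;
               &&var&i ( in = in&i )
            %end ;
            ;
            %do i = 1 %to &nvar ;
               if in&i then
                  do ;
                     variable = "&&var&i" ;
                     format   = "&&fmt&i" ;
                     /* formatted value as a character join key */
                     value    = left( put( &&var&i , &&fmt&i ) ) ;
                  end ;
            %end ;
            keep variable format value count ;
         run ;
      %mend makefreq ;

      %makefreq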

Macro - SQL Interaction

The previous section showed how the making of lists has had a dramatic effect on the way one codes macro problems involving lists. Now we consider a more complex interaction between PROC SQL and macro, where macro code is used to write the SQL code in a loop and the whole problem is much easier, precisely because it is SQL code.

Suppose we have a data set, W, containing the variables NAME and GROUP.

      NAME    GROUP

        A       1
        B       1
        B       2
        C       2
        D       2
        D       3
        E       3
        F       4
        G       4
        G       5
        H       5
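
To run the example, W can be created with a short DATA step (not shown in the paper):

      data w ;
         input name $ group ;
         datalines ;
      A 1
      B 1
      B 2
      C 2
      D 2
      D 3
      E 3
      F 4
      G 4
      G 5
      H 5
      ;
      run ;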
                                                           arbitrary level, we need a macro %DO-loop. This
We want to collapse groups to the lowest level. For example, since A and B belong to group 1 and B and C belong to group 2, all members of group 2 are part of group 1 because the two groups share the member B. Once this is seen, one can add group 3 to the new group 1 because of the common member D. Thus group 1 covers A, B, C, D, and E. Similarly F, G, and H ultimately belong to group 4. More formally, two groups are in the same chain if there is a sequence of groups, containing the given groups, such that each consecutive pair of groups contains a common name. Using this definition the data set consists of disjoint chains. The problem is to write a program identifying each chain by the minimum group number in the chain.

The intuitive argument given in the previous paragraph uses two kinds of minimization.

      1.   Find the minimum group (call it MINGROUP) for all records having the same name (e.g., NAME = 'B' has MINGROUP = 1).

      2.   Find the minimum of all MINGROUP values for all names in a common group (e.g., GROUP = 2 has MINGROUP = 1).

PROC SQL is well suited to both kinds of minimization. In the first case we might have

      create table t as
         select name , group ,
                min (group) as mingroup
            from dataset
            group by name ;

In the second case we might have

      create table t as
         select name , group ,
                min (mingroup) as mingroup
            from t
            group by group ;

These two operations must be repeated over and over until no new minimums are found, since each new extension of a group may mean further collapsing. To express the iteration of this code to an arbitrary level, we need a macro %DO loop. This time we will present the complete macro, %GROUPIT.

For generality, we make parameters to name the input and output data sets and the variables represented by NAME, GROUP, and MINGROUP. The parameter MAX is added to ensure that the macro does not execute for an excessively long time. (Since the algorithm does converge, one could do away with this parameter or set it to the number of observations.)

      %macro groupit
             ( data=&syslast,        /* input data         */
               out=_DATA_,           /* resulting data     */
               name=name,            /* name variable      */
               group=group,          /* group variable     */
               mingroup=mingroup,    /* minimum group      */
               max=20                /* limit #iterations  */
             ) ;

      /* ------------------------------------
         minimize group on name and then group
         repeat until max iterations or done
      ------------------------------------ */
      %local i done ;

      proc sql ;

         /* ------------------------------
            initial set up - get first
            minimums, start numbered
            sequence of data sets
         ------------------------------ */

         create table __t0 as
         select &name , &group ,
                min (&group) as &mingroup
            from &data
            group by &name ;

         create table __t0 as
         select &name , &group ,
                min (&mingroup) as &mingroup
            from __t0
            group by &group ;

         /* ------------------------------
            iterate until done or too many
            iterations
         ------------------------------ */

         %do %until (&done or &i > &max) ;
            %let i = %eval ( &i + 1 ) ;

            create table __t&i as
            select &name , &group ,
                   min (&mingroup) as &mingroup
               from __t%eval(&i-1)
               group by &name
            ;

            create table __t&i as
            select &name , &group ,
                   min (&mingroup) as &mingroup
               from __t&i
               group by &group
            ;

            /* are we finished? */
            reset noprint ;
            select w1.&name
               from __t%eval(&i-1) as w1 ,
                    __t&i as w2
               where w1.&name = w2.&name and
                     w1.&group = w2.&group and
                     w1.&mingroup ^= w2.&mingroup
            ;

            %let done = %eval ( not &sqlobs ) ;
            reset print ;
            drop table __t%eval(&i-1) ;
         %end ;          /* end iterative loop */

         %if not &done %then
            %put WARNING(GROUPIT): Process stopped by condition MAX=&max ;
         %else
            %do ;
               create table &out as
               select &name , &group ,
                      &mingroup
                  from __t&i
                  order by &name , &group
               ;
               drop table __t&i ;
            %end ;

      quit ;
      %mend groupit ;

      %groupit ( data = w , name = name , group = group , out = w2 )

      proc print data = w2 ;
      run ;
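
The paper does not show the output, but the intuitive argument above tells us what to expect: every row of the first chain gets MINGROUP 1 and every row of the second chain gets MINGROUP 4, roughly

      name    group    mingroup

        A       1          1
        B       1          1
        B       2          1
        C       2          1
        D       2          1
        D       3          1
        E       3          1
        F       4          4
        G       4          4
        G       5          4
        H       5          4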

Conclusion

I have pointed out six areas where SQL code excels. My conclusion is that a good SAS programmer can no longer ignore PROC SQL and remain good.

The author can be contacted by mail at:

      Westat Inc.
      1650 Research Boulevard
      Rockville, MD 20850-3129

or by e-mail at:

      whitloi1@westat.com

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
