SlideShare una empresa de Scribd logo
1 de 8
Descargar para leer sin conexión
CT-03

                          PRXChange: Accept No Substitutions
                                  Kenneth W. Borowiak, PPD, Inc.

Abstract

SAS® provides a variety of functions for removing and replacing text, such as COMPRESS,
TRANSLATE & TRANWRD. However, when the replacement is conditional upon the text
around the string the logic can become long and difficult to follow. The PRXCHANGE
function is ideal for complicated text replacements, as it leverages the power of regular
expressions. The PRXCHANGE function not only encapsulates the functionality of traditional
character string functions, but exceeds them because of the tremendous flexibility afforded
by concepts such as predefined and user-defined character classes, capture buffers, and
positive and negative look-arounds.

Introduction

SAS provides a variety of ways to find and replace characters or strings in character fields.
Some of the traditional functions include the TRANWRD and TRANSLATE functions, but
these are limited to static strings and characters. The COMPBL function can be used to
reduce consecutive whitespace characters to a single whitespace character. The COMPRESS
function can used to eliminate characters, and this function was enhanced in Version 9.0
to include a third argument to delete or keep classes of characters. While these traditional
character functions are useful for relatively easy tasks, more difficult tasks involving text
substitutions and extractions often involve nesting of functions and conditional logic, which
can be cumbersome to follow and maintain.

With the release of Version 9.0, SAS has introduced Perl-style regular expressions through
the use of the PRX functions and call routines. This rich and powerful language for pattern
matching is ideal for text substitutions, as they allow the user to leverage predefined sets of
characters, customized boundaries and manipulation of captured text. This paper explores
some of the key functionality of the PRXCHANGE function. Some of the basic concepts of
regular expressions (a.k.a regex) are discussed in introductory papers by Borowiak [2008]
and Cassell [2007] and those unfamiliar with PRX should refer to those papers before
proceeding with this paper.

The PRXCHANGE function takes the following form:

prxchange( regular expression|id, occurrence, source )

    ●   regular expression or id : The substitution regex can either be entered directly in the
        first argument or a variable with the precompiled result from a call to the PRXPARSE
        function. The substitution regex takes the form:
                s/matching expression/replacement expression/modifiers
                      The s before the first delimiter is required1
                      The matching expression is the regex of the string to match
                      The replacement expression applied to the matched string
                      Modifiers control the behaviour of the matching part of the regex (i.e.
                      such i, x or o )

1 This is unlike the PRXMATCH function, where the m is optional before the first delimiter in the regex
(e.g ‘m/^d/’ ).


                                                   1
●    occurrence : The number of times to perform the the match and substitution. Valid
        values are positive integers. A value of -1 is also valid, which performs replacement
        as many times as it finds the matching pattern.
   ●    source - The character string or field where the where the pattern is to be searched.

Consider the example in Figure 1, where a new variable NAME2 is created in a PROC SQL
step to replace all occurrences of the letter a with the letter e.

Figure 1 - Replacement of a to e



proc sql outobs=5 ;
 select name
        , prxchange( ‘s/a/e/i’, -1, name ) as name2
 from     sashelp.class
 order by name2
 ;
 quit ;

   Name        name2

  Barbara      Berbere

   Carol        Cerol

   Henry        Henry

   Jeffrey      Jeffrey

   James        Jemes



Since the i modifier is used, it makes the pattern matching case-insensitive, so occurrences
of a and A will be replaced by e. The value of the second argument is -1, so all occurrences
of a are replaced when found in the variable NAME.

Compression

A special case of a find-and-replace operation is compression, where the replacement is
nothing. This is an operation that is often performed using the COMPRESS function. In the
query below in Figure 3, both the COMPRESS and PRXCHANGE functions are used to remove
the vowels from the variable NAME into the variables NAME2 and NAME3, respectively.

Figure 3 - Remove all vowels


proc sql outobs=4;
  select    name
          , compress( name, ‘aeiou’, ‘i’ ) as name2
         , prxchange( ‘s/[aeiou]//i’, -1, name ) as name3
  from      sashelp.class



                                              2
order by name3
  ;
  quit ;

   Name      name2    name3

  Barbara     Brbr     Brbr

   Carol      Crl      Crl

   Henry      Hnry     Hnry

   Judy       Jdy      Jdy



Now consider a slightly more restricted case where you want to remove vowels but the
vowel can’t be preceded by the letters l, m or n, while ignoring case. This is relatively
easy condition to implement with regular expressions by using a positive look-behind
assertion (?<=), as demonstrated in Figure 4.


Figure 4 - Remove vowels not preceded by l ,m or n


proc sql ;
  select name
         , prxchange( ‘s/(?<=[lmn])[aeiou]//i’, -1, name ) as name3
  from     sashelp.class
  where prxmatch( ‘m/(?<=[lmn])[aeiou]/i’, name )>0
  order by name3
  ;
  quit ;

   Name       name3

   Alice       Alce

   James      Jams

   Jane        Jan

   Janet       Jant

   Louise     Luise

   Mary        Mry

   Philip     Philp

   Ronald     Ronld

  Thomas      Thoms




                                            3
William         Willam


Look-arounds are zero-width assertions (i.e. they do not consume characters in the string)
that only confirm whether a match is possible by checking conditions around a specific
location in the pattern . Look-arounds can be positive or negative and can be forward or
backward looking, and they are discussed in more detail in Borowiak [2006] and Dunn
[2011]. To implement a solution to the problem in Figure 4 with the COMPRESS function
one need to use a DO loop with the SUBSTR function to check the condition, which rules out
using a SQL step.

Dedaction

Another useful type of substitution example is dedaction, or, the ‘blacking out’ of sensitive
information. Consider the case where a pattern is matched and the characters are replaced
by a static string, as in Figure 5 below, where you want to match the actual digits in the
social security number in the free-text field and replace them with the letter x.

Figure 5 - Replace digits of social security numbers with an x


data SSN ;
  input SSN $20. ;
datalines ;
123-54-2280
#987-65-4321
S.S. 666-77-8888
246801357
soc # 133-77-2000
ssnum 888_22-7779
919-555-4689
call me 1800123456
;
run ;

proc sql feedback ;
  select ssn
         , prxchange( ‘s/(?<!d)d{3}[-_]?d{2}[-_]?d{4}(?!d)/xxxxxxxxx/io’, -1, ssn)
              as ssn2
   from ssn
  ;
  quit ;

             SSN                    ssn2

        123-54-2280               xxxxxxxxx


       #987-65-4321              #xxxxxxxxx


      S.S. 666-77-8888          S.S. xxxxxxxxx




                                                 4
246801357                     xxxxxxxxx


     soc # 133-77-2000            soc # xxxxxxxxx


    ssnum 888_22-7779            ssnum xxxxxxxxx


       919-555-4689                919-555-4689


    call me 1800123456         call me 18001234567




Capture Buffers

Like many other programming languages, regular expressions allow you to use parenthesis
to add clarity by grouping logical expressions together. A result of using grouping
parentheses is that it creates temporary variables, which can be used in the substitution
part of a regex. Consider an extension of the example in Figure 5 where we continue to
replace the digits of social security numbers with an x, but want to maintain any of the
dashes or underscores in the original variables.


Figure 6 - Maintain dashes and underscores in social security number dedaction


proc sql feedback ;
  select ssn
          , prxchange('s/(?<!d)d{3}(-|_)?d{2}(-|_)?d{4}(?!d)/xxx$1xx$2xxxx/io', -1
              , ssn ) as ssn2
  from      ssn
  ;
quit ;


        SSN                    ssn2

     123-54-2280           xxx-xx-xxxx


    #987-65-4321           #xxx-xx-xxxx


   S.S. 666-77-8888      S.S. xxx-xx-xxxx


     246801357              xxxxxxxxx


  soc # 133-77-2000      soc # xxx-xx-xxxx




                                                    5
ssnum 888_22-7779       ssnum xxx_xx-xxxx


    919-555-4689             919-555-4689


 call me 18001234567      call me 18001234567



Note the regular expression actually has four pairs of parentheses. The negative look-behind
(i.e. (?<!d) ) and negative look-ahead assertions (i.e. (?!d) ) are non-capturing. The
sub-expression (-|_), which appears the second and third pairs of parentheses, are the two
capturing parentheses, which correspond to the variables $1 and $2. And though the match
of the sub-expressions in the two capturing parentheses are optional, as they are followed
by the ? quantifier, the variables $1 and $2 can be populated with null values.

Case-Folding Prefixes and Spans

When performing a replacement, users have the ability to manipulate captured variables by
changing their case. Consider the example in Figure 7, where the titles Dr, Mr, and Mrs
are entered in various ways but need to be ‘propcased’.


Figure 7 - Propcase courtesy titles


data guest_list ;
 input attendees $30. ;
datalines;
MR and MRS DRaco Malfoy
mr and dr M Johnson
MrS. O.M. Goodness
DR. Evil
mr&mrs R. Miller
;
run ;

proc sql ;
 select attendees
        , prxchange( 's/b(d|m)(r(?!a)s?)/u$1L$2E/io', -1, attendees ) as attendees2
 from guest_list
 ;
 quit ;


         attendees                    attendees2

 MR and MRS DRaco Malfoy       Mr and Mrs DRaco Malfoy


    mr and dr M Johnson          Mr and Dr M Johnson




                                                   6
MrS. O.M. Goodness                       Mrs. O.M. Goodness


             DR. Evil                             Dr. Evil


       mr&mrs R. Miller                       Mr&Mrs R. Miller



The regular expression looks for a d or m at the beginning of a word and the matched
character is put in the first capture buffer. It then looks to match r (as long as it is not
followed by an a) followed an optional s and puts the characters in the second capture
buffer. The replacement expression then makes use of case-folding prefixes and spans
to control the case of the capture buffers. u makes the next character that follows it
uppercase, which would be the d or m in first capture buffer. The case-folding span L
makes characters that follow lowercase until the end of the replacement or until disabled by
E, which is applied to the second capture buffer. Case-folding functionality was introduced
in SAS V9.2.

More on Lookarounds

The final example in Figure 9 demonstrates inserting text at a location where both a positive
look-behind (?<=) and look-ahead (?=) assertion is satisfied. Consider the task where a
space character is to be inserted between two consecutive colons. One might be tempted
to write a regex to match the two colons and replace it with three characters (i.e : : ).
However, an efficient regex would be able to identify the location that is preceded and
followed by a colon and insert a single space character.

Figure 8 - Inserting a space between consecutive colons


data colons;
 length string string2 $200 ;
 string='a::::b:::::';
 string2=prxchange("s/(?<=:)(?=:)/ /",-1,string) ;
 run ;


    string                  string2

   a::::b:::::          a: : : :b: : : : :



This solution will not provide the correct result in versions prior to SAS V9.2, as there was
bug in the Perl 5.6.1 engine that the PRX functions use.

Conclusion

For simple text substitutions, using traditional SAS functions may suffice. However, as a
substitution task becomes more complicated, multiple lines of code can often be reduced to
a single regular expression within PRXCHANGE due to the tremendous flexibility they offer.



                                                                 7
References

Borowiak, Kenneth W. (2006), “Perl Regular Expressions 102”. Proceedings of the 19th
Annual Northeast SAS Users Group Conference, USA.
http://www.nesug.org/proceedings/nesug06/po/po16.pdf

Borowiak, Kenneth W. (2008), “PRX Functions and Call Routines: There is Hardly Anything
Regular About Them!”. Proceedings of the Twenty First Annual Northeast SAS Users Group
Conference, USA.
http://www.nesug.org/proceedings/nesug08/bb/bb11.pdf

Cassell, David L., ”The Basics of the PRX Functions” SAS Global Forum 2007
http://www2.sas.com/proceedings/forum2007/223-2007.pdf

Dunn, Toby, ”Grouping, Atomic Groups, and Conditions: Creating If-Then statements in Perl
RegEx” SAS Global Forum 2011
http://support.sas.com/resources/papers/proceedings11/245-2011.pdf

Friedl, Jeffrey E.F., Mastering Regular Expressions 3rd Edition


Acknowledgements

The authors would like to thank Jenni Borowiak, Kipp Spanbauer and Kunal Agnihotri for
their insightful comments on this paper.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks
or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration.

Disclaimer

The content of this paper are the works of the author and do not necessarily represent the
opinions, recommendations, or practices of PPD, Inc.

Contact Information

Your comments and questions are valued and encouraged.
Contact the authors at:

Ken Borowiak
3900 Paramount Parkway
Morrisville NC 27560

ken.borowiak@ppdi.com
ken.borowiak@gmail.com




                                              8

Más contenido relacionado

La actualidad más candente

Database Systems - Normalization of Relations(Chapter 4/3)
Database Systems - Normalization of Relations(Chapter 4/3)Database Systems - Normalization of Relations(Chapter 4/3)
Database Systems - Normalization of Relations(Chapter 4/3)Vidyasagar Mundroy
 
Database normalization
Database normalizationDatabase normalization
Database normalizationJignesh Jain
 
Introduction Oracle Database 11g Release 2 for developers
Introduction Oracle Database 11g Release 2 for developersIntroduction Oracle Database 11g Release 2 for developers
Introduction Oracle Database 11g Release 2 for developersLucas Jellema
 
Sentient Arithmetic and Godel's Incompleteness Theorems
Sentient Arithmetic and Godel's Incompleteness TheoremsSentient Arithmetic and Godel's Incompleteness Theorems
Sentient Arithmetic and Godel's Incompleteness TheoremsKannan Nambiar
 
Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013Prosanta Ghosh
 
Yara user's manual 1.6
Yara user's manual 1.6Yara user's manual 1.6
Yara user's manual 1.6Vijay Kumar
 
Chuẩn hóa CSDL
Chuẩn hóa CSDLChuẩn hóa CSDL
Chuẩn hóa CSDLphananhvu
 
Sulpcegu5e ppt 2_1
Sulpcegu5e ppt 2_1Sulpcegu5e ppt 2_1
Sulpcegu5e ppt 2_1silvia
 
3.2 javascript regex
3.2 javascript regex3.2 javascript regex
3.2 javascript regexJalpesh Vasa
 
Database normalization
Database normalizationDatabase normalization
Database normalizationEdward Blurock
 

La actualidad más candente (20)

Database Systems - Normalization of Relations(Chapter 4/3)
Database Systems - Normalization of Relations(Chapter 4/3)Database Systems - Normalization of Relations(Chapter 4/3)
Database Systems - Normalization of Relations(Chapter 4/3)
 
Database normalization
Database normalizationDatabase normalization
Database normalization
 
Normalization in DBMS
Normalization in DBMSNormalization in DBMS
Normalization in DBMS
 
Unit04 dbms
Unit04 dbmsUnit04 dbms
Unit04 dbms
 
Introduction Oracle Database 11g Release 2 for developers
Introduction Oracle Database 11g Release 2 for developersIntroduction Oracle Database 11g Release 2 for developers
Introduction Oracle Database 11g Release 2 for developers
 
Sentient Arithmetic and Godel's Incompleteness Theorems
Sentient Arithmetic and Godel's Incompleteness TheoremsSentient Arithmetic and Godel's Incompleteness Theorems
Sentient Arithmetic and Godel's Incompleteness Theorems
 
Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013Dbms ii mca-ch4-relational model-2013
Dbms ii mca-ch4-relational model-2013
 
vim - Tips and_tricks
vim - Tips and_tricksvim - Tips and_tricks
vim - Tips and_tricks
 
Yara user's manual 1.6
Yara user's manual 1.6Yara user's manual 1.6
Yara user's manual 1.6
 
Chuẩn hóa CSDL
Chuẩn hóa CSDLChuẩn hóa CSDL
Chuẩn hóa CSDL
 
Chapter8
Chapter8Chapter8
Chapter8
 
Dbms 4NF & 5NF
Dbms 4NF & 5NFDbms 4NF & 5NF
Dbms 4NF & 5NF
 
Regular expression examples
Regular expression examplesRegular expression examples
Regular expression examples
 
Sulpcegu5e ppt 2_1
Sulpcegu5e ppt 2_1Sulpcegu5e ppt 2_1
Sulpcegu5e ppt 2_1
 
Ch3
Ch3Ch3
Ch3
 
Assignment#08
Assignment#08Assignment#08
Assignment#08
 
Sql (DBMS)
Sql (DBMS)Sql (DBMS)
Sql (DBMS)
 
3.2 javascript regex
3.2 javascript regex3.2 javascript regex
3.2 javascript regex
 
Database normalization
Database normalizationDatabase normalization
Database normalization
 
Normalization in DBMS
Normalization in DBMSNormalization in DBMS
Normalization in DBMS
 

Similar a PRXChange: Accept No Substitutions

A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...
A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...
A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...Ken Borowiak
 
DEE 431 Introduction to MySql Slide 6
DEE 431 Introduction to MySql Slide 6DEE 431 Introduction to MySql Slide 6
DEE 431 Introduction to MySql Slide 6YOGESH SINGH
 
Maxbox starter20
Maxbox starter20Maxbox starter20
Maxbox starter20Max Kleiner
 
SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)Logan Palanisamy
 
27.2.9 lab regular expression tutorial
27.2.9 lab   regular expression tutorial27.2.9 lab   regular expression tutorial
27.2.9 lab regular expression tutorialFreddy Buenaño
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsRaj Gupta
 
Python - Regular Expressions
Python - Regular ExpressionsPython - Regular Expressions
Python - Regular ExpressionsMukesh Tekwani
 
Regular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsRegular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsDanny Bryant
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in PythonSujith Kumar
 
Module 3 - Regular Expressions, Dictionaries.pdf
Module 3 - Regular  Expressions,  Dictionaries.pdfModule 3 - Regular  Expressions,  Dictionaries.pdf
Module 3 - Regular Expressions, Dictionaries.pdfGaneshRaghu4
 
Loops and functions in r
Loops and functions in rLoops and functions in r
Loops and functions in rmanikanta361
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functionsVikas Gupta
 
Python regular expressions
Python regular expressionsPython regular expressions
Python regular expressionsKrishna Nanda
 
Regular Expressions -- SAS and Perl
Regular Expressions -- SAS and PerlRegular Expressions -- SAS and Perl
Regular Expressions -- SAS and PerlMark Tabladillo
 

Similar a PRXChange: Accept No Substitutions (20)

A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...
A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...
A SAS&lt;sup>®&lt;/sup> Users Guide to Regular Expressions When the Data Resi...
 
DEE 431 Introduction to MySql Slide 6
DEE 431 Introduction to MySql Slide 6DEE 431 Introduction to MySql Slide 6
DEE 431 Introduction to MySql Slide 6
 
Maxbox starter20
Maxbox starter20Maxbox starter20
Maxbox starter20
 
SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)SQL for pattern matching (Oracle 12c)
SQL for pattern matching (Oracle 12c)
 
Compiler notes--unit-iii
Compiler notes--unit-iiiCompiler notes--unit-iii
Compiler notes--unit-iii
 
ITW 1.pptx
ITW 1.pptxITW 1.pptx
ITW 1.pptx
 
27.2.9 lab regular expression tutorial
27.2.9 lab   regular expression tutorial27.2.9 lab   regular expression tutorial
27.2.9 lab regular expression tutorial
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Python - Regular Expressions
Python - Regular ExpressionsPython - Regular Expressions
Python - Regular Expressions
 
Regular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsRegular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular Expressions
 
Regular expressions in Python
Regular expressions in PythonRegular expressions in Python
Regular expressions in Python
 
Module 3 - Regular Expressions, Dictionaries.pdf
Module 3 - Regular  Expressions,  Dictionaries.pdfModule 3 - Regular  Expressions,  Dictionaries.pdf
Module 3 - Regular Expressions, Dictionaries.pdf
 
A regex ekon16
A regex ekon16A regex ekon16
A regex ekon16
 
Loops and functions in r
Loops and functions in rLoops and functions in r
Loops and functions in r
 
SQL select statement and functions
SQL select statement and functionsSQL select statement and functions
SQL select statement and functions
 
dbms first unit
dbms first unitdbms first unit
dbms first unit
 
Python regular expressions
Python regular expressionsPython regular expressions
Python regular expressions
 
Regular Expressions -- SAS and Perl
Regular Expressions -- SAS and PerlRegular Expressions -- SAS and Perl
Regular Expressions -- SAS and Perl
 
Intoduction to php strings
Intoduction to php  stringsIntoduction to php  strings
Intoduction to php strings
 
Les08
Les08Les08
Les08
 

Último

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 

Último (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 

PRXChange: Accept No Substitutions

  • 1. CT-03 PRXChange: Accept No Substitutions Kenneth W. Borowiak, PPD, Inc. Abstract SAS® provides a variety of functions for removing and replacing text, such as COMPRESS, TRANSLATE & TRANWRD. However, when the replacement is conditional upon the text around the string the logic can become long and difficult to follow. The PRXCHANGE function is ideal for complicated text replacements, as it leverages the power of regular expressions. The PRXCHANGE function not only encapsulates the functionality of traditional character string functions, but exceeds them because of the tremendous flexibility afforded by concepts such as predefined and user-defined character classes, capture buffers, and positive and negative look-arounds. Introduction SAS provides a variety of ways to find and replace characters or strings in character fields. Some of the traditional functions include the TRANWRD and TRANSLATE functions, but these are limited to static strings and characters. The COMPBL function can be used to reduce consecutive whitespace characters to a single whitespace character. The COMPRESS function can used to eliminate characters, and this function was enhanced in Version 9.0 to include a third argument to delete or keep classes of characters. While these traditional character functions are useful for relatively easy tasks, more difficult tasks involving text substitutions and extractions often involve nesting of functions and conditional logic, which can be cumbersome to follow and maintain. With the release of Version 9.0, SAS has introduced Perl-style regular expressions through the use of the PRX functions and call routines. This rich and powerful language for pattern matching is ideal for text substitutions, as they allow the user to leverage predefined sets of characters, customized boundaries and manipulation of captured text. This paper explores some of the key functionality of the PRXCHANGE function. Some of the basic concepts of regular expressions (a.k.a regex) are discussed in introductory papers by Borowiak [2008] and Cassell [2007] and those unfamiliar with PRX should refer to those papers before proceeding with this paper. The PRXCHANGE function takes the following form: prxchange( regular expression|id, occurrence, source ) ● regular expression or id : The substitution regex can either be entered directly in the first argument or a variable with the precompiled result from a call to the PRXPARSE function. The substitution regex takes the form: s/matching expression/replacement expression/modifiers The s before the first delimiter is required1 The matching expression is the regex of the string to match The replacement expression applied to the matched string Modifiers control the behaviour of the matching part of the regex (i.e. such i, x or o ) 1 This is unlike the PRXMATCH function, where the m is optional before the first delimiter in the regex (e.g ‘m/^d/’ ). 1
  • 2. occurrence : The number of times to perform the the match and substitution. Valid values are positive integers. A value of -1 is also valid, which performs replacement as many times as it finds the matching pattern. ● source - The character string or field where the where the pattern is to be searched. Consider the example in Figure 1, where a new variable NAME2 is created in a PROC SQL step to replace all occurrences of the letter a with the letter e. Figure 1 - Replacement of a to e proc sql outobs=5 ; select name , prxchange( ‘s/a/e/i’, -1, name ) as name2 from sashelp.class order by name2 ; quit ; Name name2 Barbara Berbere Carol Cerol Henry Henry Jeffrey Jeffrey James Jemes Since the i modifier is used, it makes the pattern matching case-insensitive, so occurrences of a and A will be replaced by e. The value of the second argument is -1, so all occurrences of a are replaced when found in the variable NAME. Compression A special case of a find-and-replace operation is compression, where the replacement is nothing. This is an operation that is often performed using the COMPRESS function. In the query below in Figure 3, both the COMPRESS and PRXCHANGE functions are used to remove the vowels from the variable NAME into the variables NAME2 and NAME3, respectively. Figure 3 - Remove all vowels proc sql outobs=4; select name , compress( name, ‘aeiou’, ‘i’ ) as name2 , prxchange( ‘s/[aeiou]//i’, -1, name ) as name3 from sashelp.class 2
  • 3. order by name3 ; quit ; Name name2 name3 Barbara Brbr Brbr Carol Crl Crl Henry Hnry Hnry Judy Jdy Jdy Now consider a slightly more restricted case where you want to remove vowels but the vowel can’t be preceded by the letters l, m or n, while ignoring case. This is relatively easy condition to implement with regular expressions by using a positive look-behind assertion (?<=), as demonstrated in Figure 4. Figure 4 - Remove vowels not preceded by l ,m or n proc sql ; select name , prxchange( ‘s/(?<=[lmn])[aeiou]//i’, -1, name ) as name3 from sashelp.class where prxmatch( ‘m/(?<=[lmn])[aeiou]/i’, name )>0 order by name3 ; quit ; Name name3 Alice Alce James Jams Jane Jan Janet Jant Louise Luise Mary Mry Philip Philp Ronald Ronld Thomas Thoms 3
  • 4. William Willam Look-arounds are zero-width assertions (i.e. they do not consume characters in the string) that only confirm whether a match is possible by checking conditions around a specific location in the pattern . Look-arounds can be positive or negative and can be forward or backward looking, and they are discussed in more detail in Borowiak [2006] and Dunn [2011]. To implement a solution to the problem in Figure 4 with the COMPRESS function one need to use a DO loop with the SUBSTR function to check the condition, which rules out using a SQL step. Dedaction Another useful type of substitution example is dedaction, or, the ‘blacking out’ of sensitive information. Consider the case where a pattern is matched and the characters are replaced by a static string, as in Figure 5 below, where you want to match the actual digits in the social security number in the free-text field and replace them with the letter x. Figure 5 - Replace digits of social security numbers with an x data SSN ; input SSN $20. ; datalines ; 123-54-2280 #987-65-4321 S.S. 666-77-8888 246801357 soc # 133-77-2000 ssnum 888_22-7779 919-555-4689 call me 1800123456 ; run ; proc sql feedback ; select ssn , prxchange( ‘s/(?<!d)d{3}[-_]?d{2}[-_]?d{4}(?!d)/xxxxxxxxx/io’, -1, ssn) as ssn2 from ssn ; quit ; SSN ssn2 123-54-2280 xxxxxxxxx #987-65-4321 #xxxxxxxxx S.S. 666-77-8888 S.S. xxxxxxxxx 4
  • 5. 246801357 xxxxxxxxx soc # 133-77-2000 soc # xxxxxxxxx ssnum 888_22-7779 ssnum xxxxxxxxx 919-555-4689 919-555-4689 call me 1800123456 call me 18001234567 Capture Buffers Like many other programming languages, regular expressions allow you to use parenthesis to add clarity by grouping logical expressions together. A result of using grouping parentheses is that it creates temporary variables, which can be used in the substitution part of a regex. Consider an extension of the example in Figure 5 where we continue to replace the digits of social security numbers with an x, but want to maintain any of the dashes or underscores in the original variables. Figure 6 - Maintain dashes and underscores in social security number dedaction proc sql feedback ; select ssn , prxchange('s/(?<!d)d{3}(-|_)?d{2}(-|_)?d{4}(?!d)/xxx$1xx$2xxxx/io', -1 , ssn ) as ssn2 from ssn ; quit ; SSN ssn2 123-54-2280 xxx-xx-xxxx #987-65-4321 #xxx-xx-xxxx S.S. 666-77-8888 S.S. xxx-xx-xxxx 246801357 xxxxxxxxx soc # 133-77-2000 soc # xxx-xx-xxxx 5
  • 6. ssnum 888_22-7779 ssnum xxx_xx-xxxx 919-555-4689 919-555-4689 call me 18001234567 call me 18001234567 Note the regular expression actually has four pairs of parentheses. The negative look-behind (i.e. (?<!d) ) and negative look-ahead assertions (i.e. (?!d) ) are non-capturing. The sub-expression (-|_), which appears the second and third pairs of parentheses, are the two capturing parentheses, which correspond to the variables $1 and $2. And though the match of the sub-expressions in the two capturing parentheses are optional, as they are followed by the ? quantifier, the variables $1 and $2 can be populated with null values. Case-Folding Prefixes and Spans When performing a replacement, users have the ability to manipulate captured variables by changing their case. Consider the example in Figure 7, where the titles Dr, Mr, and Mrs are entered in various ways but need to be ‘propcased’. Figure 7 - Propcase courtesy titles data guest_list ; input attendees $30. ; datalines; MR and MRS DRaco Malfoy mr and dr M Johnson MrS. O.M. Goodness DR. Evil mr&mrs R. Miller ; run ; proc sql ; select attendees , prxchange( 's/b(d|m)(r(?!a)s?)/u$1L$2E/io', -1, attendees ) as attendees2 from guest_list ; quit ; attendees attendees2 MR and MRS DRaco Malfoy Mr and Mrs DRaco Malfoy mr and dr M Johnson Mr and Dr M Johnson 6
  • 7. MrS. O.M. Goodness Mrs. O.M. Goodness DR. Evil Dr. Evil mr&mrs R. Miller Mr&Mrs R. Miller The regular expression looks for a d or m at the beginning of a word and the matched character is put in the first capture buffer. It then looks to match r (as long as it is not followed by an a) followed an optional s and puts the characters in the second capture buffer. The replacement expression then makes use of case-folding prefixes and spans to control the case of the capture buffers. u makes the next character that follows it uppercase, which would be the d or m in first capture buffer. The case-folding span L makes characters that follow lowercase until the end of the replacement or until disabled by E, which is applied to the second capture buffer. Case-folding functionality was introduced in SAS V9.2. More on Lookarounds The final example in Figure 9 demonstrates inserting text at a location where both a positive look-behind (?<=) and look-ahead (?=) assertion is satisfied. Consider the task where a space character is to be inserted between two consecutive colons. One might be tempted to write a regex to match the two colons and replace it with three characters (i.e : : ). However, an efficient regex would be able to identify the location that is preceded and followed by a colon and insert a single space character. Figure 8 - Inserting a space between consecutive colons data colons; length string string2 $200 ; string='a::::b:::::'; string2=prxchange("s/(?<=:)(?=:)/ /",-1,string) ; run ; string string2 a::::b::::: a: : : :b: : : : : This solution will not provide the correct result in versions prior to SAS V9.2, as there was bug in the Perl 5.6.1 engine that the PRX functions use. Conclusion For simple text substitutions, using traditional SAS functions may suffice. However, as a substitution task becomes more complicated, multiple lines of code can often be reduced to a single regular expression within PRXCHANGE due to the tremendous flexibility they offer. 7
  • 8. References Borowiak, Kenneth W. (2006), “Perl Regular Expressions 102”. Proceedings of the 19th Annual Northeast SAS Users Group Conference, USA. http://www.nesug.org/proceedings/nesug06/po/po16.pdf Borowiak, Kenneth W. (2008), “PRX Functions and Call Routines: There is Hardly Anything Regular About Them!”. Proceedings of the Twenty First Annual Northeast SAS Users Group Conference, USA. http://www.nesug.org/proceedings/nesug08/bb/bb11.pdf Cassell, David L., ”The Basics of the PRX Functions” SAS Global Forum 2007 http://www2.sas.com/proceedings/forum2007/223-2007.pdf Dunn, Toby, ”Grouping, Atomic Groups, and Conditions: Creating If-Then statements in Perl RegEx” SAS Global Forum 2011 http://support.sas.com/resources/papers/proceedings11/245-2011.pdf Friedl, Jeffrey E.F., Mastering Regular Expressions 3rd Edition Acknowledgements The authors would like to thank Jenni Borowiak, Kipp Spanbauer and Kunal Agnihotri for their insightful comments on this paper. SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Disclaimer The content of this paper are the works of the author and do not necessarily represent the opinions, recommendations, or practices of PPD, Inc. Contact Information Your comments and questions are valued and encouraged. Contact the authors at: Ken Borowiak 3900 Paramount Parkway Morrisville NC 27560 ken.borowiak@ppdi.com ken.borowiak@gmail.com 8