Introduction
The latest IronScripter challenge, Building a PowerShell Command Inventory, helps us to understand our library of PowerShell code.
It is also a good way to introduce regular expressions, most commonly called regex.
Regex and PowerShell
Before we tackle the challenge, let’s briefly discuss regex and how you can use (or probably already have used) regex in PowerShell.
What is Regex
Regex is a pattern used to match text. A regex pattern can contain letters, numbers, spaces, other characters, operators, and other constructs.
The regex engine contains categories, like characters, escape characters, character classes, anchors, grouping constructs, quantifiers, and more. This allows regex patterns to be very simple or incredibly complex.
There are numerous articles on regex and many questions about it on public forums. Stack Overflow alone has over 227,700 questions tagged with regex.
How PowerShell Uses Regex
If you’ve ever used Select-String, -match, -replace, or -split, you have used regex.
You may have used switch before, but you may not have realized that you can use regex patterns as conditions with switch -regex.
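As a minimal sketch of regex conditions in a switch statement, the function below classifies a line of text. The function name Get-LineKind and its three categories are hypothetical and exist only for illustration.

```powershell
# Get-LineKind is a hypothetical helper used only for illustration.
function Get-LineKind {
    param([string]$Line)

    # Each condition is a regex; break stops at the first condition that matches.
    switch -Regex ($Line) {
        '^\s*#' { 'Comment'; break }
        '^\s*$' { 'Blank'; break }
        default { 'Code' }
    }
}

Get-LineKind -Line '# a note'   # Comment
Get-LineKind -Line '   '        # Blank
Get-LineKind -Line 'Get-Date'   # Code
```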
Match Text
Consider the following comparisons:
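A minimal sketch of three such comparisons. The sample values are assumptions chosen to fit the description, not necessarily the exact statements used originally.

```powershell
# Sample values are assumed for illustration.
$first  = 'Challenge' -match 'hall'     # True: 'hall' appears inside 'Challenge'
$second = 'Challenge' -match 'HALL'     # True: -match ignores case
$third  = 'Challenge' -cmatch 'HALL'    # False: -cmatch is case sensitive

$first, $second, $third
```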
In the first statement, the -match operator checks if the text hall is contained in Challenge and returns true.
The next statement also returns true because regular expressions are case-insensitive by default in PowerShell.
In the last statement, we force case sensitivity by using -cmatch.
Regex Character Classes
Regex places special meaning on some characters.
For instance, the period (.) is treated as a wildcard for a single character.
The backslash (\) escapes a character or is used to denote a character class.
To match on a period, you can’t use . alone; you must escape it like this: \.
Word (\w), white-space (\s), and digit (\d) are character classes that will match on a single character of the respective type.
To match on the opposite, use the uppercase version, like \W for any non-word character such as white-space or punctuation.
Also, brackets can surround a character group.
To match on any character a through e, you can use the [a-e] character set.
You can also negate a character set using the caret (^) after the first bracket, such as [^abcd].
This negated character set will match any single character other than a, b, c, or d.
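A few quick checks can make these rules concrete. The sample strings below are assumptions chosen only to exercise each construct.

```powershell
# Sample strings are illustrative only.
$wildcard = 'v1x0' -match 'v1.0'        # True: an unescaped . matches any character
$literal  = 'v1x0' -match 'v1\.0'       # False: \. matches only a literal period
$digit    = 'Build 42' -match '\d'      # True: the text contains a digit
$nonWord  = 'fox' -match '\W'           # False: every character is a word character
$range    = 'fig' -match '[a-e]'        # False: f, i, and g fall outside a-e
$negated  = 'abcde' -match '[^abcd]'    # True: e is not one of a, b, c, or d
$allInSet = 'abcd' -match '[^abcd]'     # False: every character is in the excluded set
```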
Regex Quantifiers
In the previous section, you may have noticed that many of the classes match on a single character. Regex has quantifiers that can be applied immediately after the class.
Here are some common quantifiers:
- * - matches the previous element zero or more times
- + - matches the previous element one or more times
- ? - matches the previous element zero or one time
- {n} - matches the previous element exactly n times
Using the -Split operator, let’s examine how we can combine quantifiers with a character class for specific results.
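One way to see the effect of the quantifier is to split the same text with and without it. The sample text is an assumption made up for this sketch.

```powershell
# Sample text is assumed for illustration.
$text = 'Regex is fun. PowerShell makes it easy'

# \W+ treats a run of non-word characters (such as '. ') as one delimiter.
$withPlus = $text -split '\W+'

# \W splits on every single non-word character, so '. ' yields an empty string.
$withoutPlus = $text -split '\W'

$withPlus.Count       # 7
$withoutPlus.Count    # 8
```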
-Split returns substrings by splitting the text on \W+, that is, on one or more non-word characters.
Spaces, or white-spaces, and periods are non-word characters.
The first sentence’s period and the space that follows it are matched together by the \W+ pattern because of the + quantifier.
Here’s another simple -Split example.
The second pattern matches on nn in Anna and it is treated as a character set to split on.
Regex Anchors and Alternation
The next two regex constructs were the first ones that I used many years ago when I supported Linux.
A regex pattern with an anchor matches when the text is in the position or grouping indicated by the anchor.
Here are the meta-character anchors:
- ^ - match must start at the beginning of the string
- $ - match must occur at the end of the string, or before a newline at the end of the string
- \b - match must occur on a boundary between a word character and a non-word character
- \B - match must not occur on a \b boundary
Alternation constructs enable either/or matching.
The most common alternation construct is the vertical bar (|), sometimes called the pipeline character, especially in PowerShell.
You may have come across some code that looks like the following.
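A minimal sketch of alternation, with a few anchor checks added for comparison. The use of the [regex]::Matches() static method here is an assumption made for illustration.

```powershell
$sentence = 'The quick brown fox jumped over the lazy dog.'

# Alternation: collect every occurrence of either word.
$animals = [regex]::Matches($sentence, 'fox|dog') | ForEach-Object { $_.Value }

# Anchors: position matters as well as content.
$startsWithThe = $sentence -match '^The'     # True
$endsWithDog   = $sentence -match 'dog$'     # False: the sentence ends with a period
$wholeWord     = 'foxes' -match '\bfox\b'    # False: no word boundary after fox

$animals    # fox, dog
```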
The regex pattern will match on the words fox and dog (strictly speaking, each letter is matched in turn).
“The quick brown fox jumped over the lazy dog.”
Regex Grouping
The last regex topic we need to cover before delving into the solution for the challenge is grouping.
As in math and PowerShell expression statements, parentheses, ( and ), provide the foundation for grouping.
Each sub-expression between ( and ) is captured.
The advanced solution uses named groups, which take the form (?<group-name> sub-expression).
You can define a non-capturing group using (?: sub-expression).
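The grouping forms can be sketched with -match and the automatic $Matches variable. The group names Verb and Noun and the sample text are assumptions for this example.

```powershell
# Named groups: each capture is available in $Matches under its group name.
$line = 'Get-ChildItem -Path C:\Scripts'
if ($line -match '(?<Verb>\w+)-(?<Noun>\w+)') {
    $verb = $Matches['Verb']    # Get
    $noun = $Matches['Noun']    # ChildItem
}

# Non-capturing group: (?:Get|Set) groups the alternatives but is not captured,
# so the noun is the first numbered capture.
if ('Set-Item' -match '(?:Get|Set)-(\w+)') {
    $firstCapture = $Matches[1]      # Item
    $captureCount = $Matches.Count   # 2: the whole match plus one capture
}
```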
Intermediate Challenge
With a regex primer behind us, we can now turn to the first challenge, which asks us to count how many lines of code we have in our repertoire. Regex plays a role in the part that requires us to skip empty or blank lines.
Sample Output
This one was fairly simple.
I used Get-ChildItem with -Recurse to get a list of all the PowerShell files, as designated by the extensions ps1 and psm1.
Then, within a ForEach-Object loop, I read each file with Get-Content.
Next, I pipe the file contents into a Where-Object clause that performs a match on any non-whitespace character.
In regex terms, this is \S (uppercase S).
This keeps only the lines that are not empty or blank.
Lastly, I return a PSCustomObject with the required fields and counts.
Get-Content reads a file line by line and produces an array of strings.
If you want to read the complete file as a single string object, you must include the -Raw switch.
This is really useful when you are reading the contents of a JSON file and passing them as an argument to ConvertFrom-Json, whose -InputObject parameter expects a single string, not the array that you get without the -Raw switch.
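The steps above can be sketched roughly as follows. This is not the author's function: the name Measure-ScriptLine and the output properties Path, Files, and Lines are assumptions, and a foreach statement stands in for the ForEach-Object loop for brevity.

```powershell
# Measure-ScriptLine and its output property names are hypothetical.
function Measure-ScriptLine {
    param([string]$Path = '.')

    # Collect the PowerShell files by extension.
    $files = @(Get-ChildItem -Path $Path -Recurse -File |
            Where-Object { $_.Extension -in '.ps1', '.psm1' })

    $lineCount = 0
    foreach ($file in $files) {
        # \S matches any non-white-space character, so empty and blank lines are dropped.
        $lineCount += @(Get-Content -LiteralPath $file.FullName |
                Where-Object { $_ -match '\S' }).Count
    }

    [pscustomobject]@{
        Path  = $Path
        Files = $files.Count
        Lines = $lineCount
    }
}
```

Calling Measure-ScriptLine -Path C:\Scripts (a placeholder path) would return a single summary object for that folder tree.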
Advanced Challenge
The advanced challenge wants us to get a list of commands that we use in the same scripts that we just inventoried. This list of commands should be sorted by the number of times used.
For extra credit, we should be able to detect and expand aliases and, as an extra challenge, provide an array of files that contain the command.
I thought about the heart of this challenge.
Getting a command, in the Verb-Noun format, would be relatively simple with the right regex.
Discovering aliases used would be a bit harder.
But, why stop there?
Why not include CmdletBinding or Parameter attributes?
How often do you use trap or a try/catch block?
It would be nice to have a tool that parses your PowerShell code and reveals what parts of the PowerShell language you frequently use. And I wanted to have this information by file and where in the file the structure was found, namely line number and index within that line. So that’s what I built to solve the advanced challenge.
I crafted some regex patterns for each of these code constructs.
- Verb-Noun
- DotNetObjects
- -f operator
- Function
- Class
- Variable declaration
- CmdletBinding
- Parameter
- Param declaration
- DynamicParam declaration
- Try/Catch/Finally
- Trap
- Enum definition
- Loop statements (for, foreach, do/while, do/until, while)
- Switch statements
PowerShell Code Structure Regex
The most critical regex is the one that detects a PowerShell command.
Lucky for us, a PowerShell command is in the form of Verb-Noun.
How do we make a regex pattern to match on this?
First, there can be any number of spaces before and after the command.
The verb and noun components will each consist of one or more word characters.
And we need to handle that dash (-).
Based on the previous paragraph, we can create this regex pattern: \s+(\w+\-\w+)\s+
This should read as “one or more white-space characters, then a group of one or more word characters followed immediately by a literal dash and one or more word characters, then one or more white-space characters”.
While this may appear to be adequate, and in another use case might be, it would be better if we named the group so we can use the group name in cataloging the code structure.
We now have the pattern \s+(?<PSVerbNoun>\w+\-\w+)\s+ with the group named PSVerbNoun.
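A quick check of the named pattern against a sample line. The sample lines are assumptions; the pattern is the one given above.

```powershell
$pattern = '\s+(?<PSVerbNoun>\w+\-\w+)\s+'

# Sample line assumed for illustration.
$line = '$files = Get-ChildItem -Path $path'
if ($line -match $pattern) {
    $command = $Matches['PSVerbNoun']    # Get-ChildItem
}

# Because of the \s+ on both sides, a command with no surrounding white-space is not matched.
$bareCommand = 'Get-Date' -match $pattern    # False
```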
In Get-PSCodeStructure, I created an ordered hashtable with each of the required code type regex patterns in sequence.
Note that the key is not important other than establishing the hashtable.
The regex patterns can be pulled from the hashtable using its Values property, $PSPatterns.Values.
This collection of values can then be concatenated using -join and the regex alternation character, the vertical bar (|).
The complete, and now much more complex, regex pattern is made using $RegExPattern = $PSPatterns.Values -join '|'
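A reduced sketch of that idea with three entries. Only the Verb-Noun pattern comes from the text above; the keys and the Function and Try patterns are assumptions and are simpler than what a full solution would need.

```powershell
# Keys and the last two patterns are assumed for illustration.
$PSPatterns = [ordered]@{
    VerbNoun = '\s+(?<PSVerbNoun>\w+\-\w+)\s+'
    Function = '^\s*function\s+(?<PSFunction>[\w\-]+)'
    Try      = '\b(?<PSTry>try)\s*\{'
}

# Join the patterns with the alternation character so any one of them can match.
$RegExPattern = $PSPatterns.Values -join '|'

if ('try {' -match $RegExPattern) {
    $matchedGroup = $Matches['PSTry']    # try
}
```

Because the hashtable is ordered, the alternatives are tried in the order they were declared.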
PowerShell Regex Matches
In order for our named groups to function as we need, we need something other than Select-String -AllMatches.
Also, according to the documentation, the $Matches hashtable will only contain the first occurrence of any matching pattern.
This only gives us the first match, $Variable = Get-Content $path.
This gives us both matches, “$Variable = Get-Content -Path $path”, but $Matches does not have the second match.
Because of this limitation, we have to use the .NET [regex] class.
Let’s look at the class constructor overloads.
We need the string pattern and, optionally, we can supply regex options and a timeout. Unlike PowerShell, the .NET class is case sensitive, so we need to instruct it to ignore case. Note: We won’t be using the matchTimeout parameter.
The regex class has a method called Matches() which will provide us all matches.
This produces the following output.
We then need to filter on the groups that matched (Success is true) and are named (Name not an integer).
And this gives us what we ultimately needed.
- The Name is the group name of the specific code structure type.
- The Value is the captured value from the pattern.
- The Index is the position where the match was found.
We use these three to build the PSCustomObject, which is output to the pipeline.
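The whole sequence, from constructing the [regex] object to filtering the groups, can be sketched as below. The two-part pattern is a simplified assumption, not the full pattern from Get-PSCodeStructure, and the Name property on each group requires a reasonably recent .NET runtime.

```powershell
# Simplified, assumed pattern with two named groups.
$pattern = '(?<PSVariable>\$\w+\s*=)|(?<PSVerbNoun>\w+\-\w+)'

# The .NET class is case sensitive unless told otherwise.
$options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
$regex = [regex]::new($pattern, $options)

$line = '$Variable = Get-Content -Path $path'

$found = foreach ($match in $regex.Matches($line)) {
    # Keep groups that matched and whose name is not a number such as 0.
    $match.Groups |
        Where-Object { $_.Success -and $_.Name -notmatch '^\d+$' } |
        Select-Object -Property Name, Value, Index
}

$found
```

With this sample line, the result holds one PSVariable entry and one PSVerbNoun entry, each with its own Index.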
Handling False Positives
So far, we have a regex pattern that will match on Verb-Noun.
Unfortunately, at least the way I’ve written it, this pattern will lead to false positives.
Clearly, Volume-licensed is not the name of a PowerShell command. I needed something to negate the false positives.
The method I chose was to check the Verb of the matched value against a list of approved PowerShell verbs.
Matching on approved verbs will skip any commands you use that do not have approved verbs.
For instance, the Encode-Sqlname and Decode-Sqlname commands from the module SqlPS would not match and, therefore, would not be in our inventory.
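A minimal sketch of such a filter, assuming the approved verbs come from Get-Verb. The candidate list is made up for illustration and the author's actual check may differ.

```powershell
# Approved verbs as reported by the current PowerShell session.
$approvedVerbs = (Get-Verb).Verb

# Candidate matches assumed for illustration.
$candidates = 'Get-Content', 'Volume-licensed', 'Encode-Sqlname'

# Keep only candidates whose text before the dash is an approved verb.
$commands = $candidates | Where-Object { ($_ -split '-')[0] -in $approvedVerbs }

$commands    # Get-Content
```

The false positive Volume-licensed is dropped, and so is Encode-Sqlname, which shows the trade-off described above.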
Perhaps someone with greater regex-fu or a better idea on how to filter out false positives can comment below.
Capture Code Structure into Variable
The FileCount and Elapsed Time are written to the Information Stream.
I think the Information Stream is underutilized.
It’s a great way to provide the user information without cluttering the console or the standard output stream, as Write-Host or Write-Output would do.
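As a small sketch of the idea, the function below returns data on the output stream and reports timing on the Information Stream. The function name Get-Inventory and the message text are hypothetical.

```powershell
# Get-Inventory is a hypothetical function used only for illustration.
function Get-Inventory {
    [CmdletBinding()]
    param()

    $timer = [System.Diagnostics.Stopwatch]::StartNew()

    # Regular output goes to the pipeline.
    'result'

    # Side information goes to the Information Stream instead.
    Write-Information "Elapsed Time: $($timer.Elapsed)"
}

# -InformationVariable captures the records without mixing them into $output.
$output = Get-Inventory -InformationVariable info
```

The caller can also display the messages as they are written by adding -InformationAction Continue.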
Sample Object
Let’s take a look at the first discovered code structure.
We have all of the critical pieces of data we would need about this structure. We know the file, the structure type, what line contains it, where it is in the line, and the command itself.
The AliasName property will contain the alias and the Command property will contain the full command name.
Currently, the regex pattern for detecting aliases does not discern if the alias is used within a comment. In fact, none of the regex patterns can discern if the code type is used within a comment.
Again, perhaps someone with greater regex-fu or a better idea on how to filter out false positives can comment below.
Count of Structure Types
We can use Group-Object to get a count of the code structure types.
Throw in Sort-Object and Select-Object to order and trim the results.
Looks like I need to go back and remove some Aliases.
Count of Verb-Noun Commands and Aliases
This shows that I have 732 PowerShell commands in the scripts within this folder. These may include commands in comments.
Advanced Extra Credit Challenge
From the list above, we can also see that I have used 10 aliases. Let’s check those out and how I pulled those out of the code.
In the begin block of Get-PSCodeStructure, you will find $Aliases = Get-Alias.
After the regex patterns have gathered any matches, I split the line and start iterating through each ‘word’.
Since foreach and select are also part of Verb-Noun commands, I first attempt to match on them specifically.
Next, I check that the ‘word’ contains only letters and attempt to match it against all the names in $Aliases.
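A simplified sketch of the word-by-word alias check. It leaves out the special handling of foreach and select, the sample line is an assumption, and it relies on the gci and select aliases being defined in the session, as they are by default.

```powershell
$Aliases = Get-Alias

# Sample line assumed for illustration.
$line = 'gci $path | select Name'

$found = foreach ($word in ($line -split '\s+')) {
    # Only consider 'words' made up entirely of letters.
    if ($word -match '^[a-z]+$') {
        $alias = $Aliases | Where-Object { $_.Name -eq $word } | Select-Object -First 1
        if ($alias) {
            [pscustomobject]@{
                AliasName = $word
                Command   = $alias.Definition
            }
        }
    }
}

$found
```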
Advanced Extra Challenge
As an extra challenge, we were asked to include a property that is an array of the filenames where the command exists.
Lucky for us, the FileName is tucked away in the Group property.
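A minimal sketch of pulling the file names out of each group. The sample objects and the output property name Files are assumptions; only Command and FileName are taken from the properties mentioned in this article.

```powershell
# Sample data assumed for illustration.
$CodeStructure = @(
    [pscustomobject]@{ Command = 'Get-Content'; FileName = 'a.ps1' }
    [pscustomobject]@{ Command = 'Get-Content'; FileName = 'b.ps1' }
    [pscustomobject]@{ Command = 'Get-Content'; FileName = 'a.ps1' }
    [pscustomobject]@{ Command = 'Get-Item'; FileName = 'a.ps1' }
)

$inventory = $CodeStructure |
    Group-Object -Property Command |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Name, Count, @{
        Name       = 'Files'
        # Each group keeps its original objects, so FileName can be read from Group.
        Expression = { $_.Group.FileName | Sort-Object -Unique }
    }

$inventory
```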
Solution
Here are the two functions I wrote to solve this challenge.
Performance
One consideration in processing numerous files is performance.
I tested using PowerShell 7’s ForEach-Object -Parallel and a standard foreach statement on a folder path containing 157 files.
Here are the results.
Iteration | Foreach-Object -Parallel | foreach |
---|---|---|
Run 1 | 0h 0m 24.223s | 0h 0m 25.638s |
Run 2 | 0h 0m 25.4s | 0h 0m 19.659s |
Run 3 | 0h 0m 29.780s | 0h 0m 27.159s |
There’s not much difference between the elapsed times.
However, I did notice that ForEach-Object -Parallel consumed more processor and memory, using up to 450MB and up to 80% CPU.
The foreach statement only consumed up to 150MB and up to 25% CPU.
Based on these findings, I chose to use the foreach statement only.
Other Notes
For the Advanced challenge, we need to pass a path.
I wanted to provide the user a way to supply a single file or a path.
If provided a path, any ps1 or psm1 files would be selected.
I also provided a -Recurse switch that allows the user to include files from all downstream paths.
With these criteria in mind, here’s how I did that.
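One way these criteria could be met is sketched below. This is not the author's code: the function name Get-ScriptFile is hypothetical, and filtering on the Extension property is just one of several ways to select the files.

```powershell
# Get-ScriptFile is a hypothetical helper used only for illustration.
function Get-ScriptFile {
    param(
        [Parameter(Mandatory)]
        [string]$Path,

        [switch]$Recurse
    )

    if (Test-Path -LiteralPath $Path -PathType Leaf) {
        # A single file was supplied; return it as is.
        Get-Item -LiteralPath $Path
    }
    else {
        # A folder was supplied; select ps1 and psm1 files, optionally from subfolders.
        Get-ChildItem -LiteralPath $Path -File -Recurse:$Recurse |
            Where-Object { $_.Extension -in '.ps1', '.psm1' }
    }
}
```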
Additional Information
To learn more about regex, here are a few resources that go much deeper into the topic than this article.
- PowerShell: The many ways to use regex on Kevin Marquette’s blog
- A Practical Guide for Using Regex in PowerShell on Josh Duffney’s blog
- About Regular Expressions
- .NET Quick Reference on Regular Expression Language
For several years now, I have crafted my regex using the following online validator tools. There are others; these are just the ones I find familiar and easy to use.
Summary
When I began working on this Iron Scripter challenge, I was only considering writing a short article on my solution. I quickly realized, however, that the heart of the challenge involves regular expressions. And I suspect that many PowerShell scripters have only a little experience with or knowledge of this complex subject.
My hopes for this article are twofold:
- You have gained a better understanding of regex and how you can use and write regex patterns in PowerShell.
- You have gained an interest in participating in the Iron Scripter challenges, or have had your interest bolstered. You can learn a great deal while solving the challenges.
If you have suggestions for better regex patterns or a better way to handle false positives for Verb-Noun and aliases, please let me know in the comments below.
If you have any general questions on Regex or PowerShell, feel free to leave them in the comments or ask me on Dave Carroll.