19. Regular expressions

InstallBuilder supports using regular expressions for processing text. They can be used for a large number of tasks, such as checking whether a text matches a specified pattern or extracting text from the output of a command.

InstallBuilder supports extended regular expressions. This is the most commonly used syntax for regular expressions and is similar to that used in most programming languages.

Regular expressions can be used by the <regExMatch> rule to verify whether a text matches a pattern. They can also be used by the <setInstallerVariableFromRegEx> action to replace or extract part of a given text, or by the <substitute> action to replace text matching a regular expression within a file.
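
As a rough sketch of the <substitute> case, the following replaces the port in a Listen directive inside a configuration file. The file path, the pattern and the ${port} variable are illustrative assumptions, not values taken from a real project:

  <substitute>
    <!-- Hypothetical file path; adjust to the file you need to modify -->
    <files>${installdir}/conf/httpd.conf</files>
    <!-- Treat each pattern below as a regular expression -->
    <type>regexp</type>
    <substitutionList>
      <substitution>
        <!-- "Listen", one or more whitespace characters, then one or more digits -->
        <pattern>Listen\s+[0-9]+</pattern>
        <!-- ${port} is assumed to have been set earlier in the project -->
        <value>Listen ${port}</value>
      </substitution>
    </substitutionList>
  </substitute>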

19.1. Basics of regular expressions

Regular expressions allow defining a substring in a text through a pattern. This pattern can be as simple as a literal string, for example to check whether the stdout of a program contains "success":

  <regExMatch>
    <logic>matches</logic>
    <pattern>success</pattern>
    <text>${program_stdout}</text>
  </regExMatch>

Or complex enough to allow extracting a port number from a configuration file:

  <setInstallerVariableFromRegEx>
    <name>port</name>
    <pattern>.*\n\s*Listen\s+(\d+).*</pattern>
    <substitution>\1</substitution>
    <text>${httpdConf}</text>
  </setInstallerVariableFromRegEx>

A pattern can be constructed from one or more branches (sub-patterns), separated by the | character, meaning that if the text matches any of the branches, it matches the full pattern. For example, the pattern success|done|started matches either "success", "done" or "started".
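
Used in a rule, such a pattern might look like the following sketch, which assumes that ${program_stdout} holds the output of a previously executed program:

  <regExMatch>
    <logic>matches</logic>
    <!-- Three branches: the rule passes if any one of them is found in the text -->
    <pattern>success|done|started</pattern>
    <text>${program_stdout}</text>
  </regExMatch>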

Each character, a group of characters or a potential match is called an atom. For example done consists of 4 atoms - d, o, n and e. The pattern ok|yes consists of two branches, one with o and k atoms and another with y, e and s atoms.

Regular expressions can also use special characters:

  • ^ - Means the start of a line or the text. The pattern ^yes specifies that the text must start with the yes string to match the regular expression.
  • $ - Means end of line or text. The pattern yes$ specifies that the text must end with the yes string to match the regular expression.
  • . - Means any character. For example te.t will match both "text" and "test".

If you need to specify one of those characters as a literal, you can escape it using a backslash (\) character. For example, done. will match any text that has the word "done" followed by any character, but the expression done\. will only match the literal "done.".
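
The anchor and escaping rules can be combined. The following sketch checks whether a text starts with the literal "done."; the ${program_stdout} variable is assumed to hold the output of a previously executed program:

  <regExMatch>
    <logic>matches</logic>
    <!-- ^ anchors the match at the start of the text; \. matches a literal dot -->
    <pattern>^done\.</pattern>
    <text>${program_stdout}</text>
  </regExMatch>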

Certain characters preceded by a backslash also have a special meaning:

  • \e - the ESC character, which has an ASCII value of 27
  • \r - carriage return character, which has an ASCII value of 13
  • \n - newline character, which has an ASCII value of 10
  • \t - horizontal tab character, which has an ASCII value of 9
  • \v - vertical tab character, which has an ASCII value of 11
  • \uABCD - where ABCD are exactly four hexadecimal digits, specifies the Unicode character U+ABCD; for example \u0041 maps to the character A
  • \B - synonym for \ that can be used to reduce backslash doubling - for example \\\n and \B\n are synonyms, but the latter is more readable
  • \s - matches any whitespace character (new lines, tabs or spaces)

Regular expressions also accept quantifiers, which specify how many times the preceding atom should be matched:

  • ? - Specifies that the preceding atom should match 0 or 1 times - for example colou?r matches both "color" and "colour"
  • * - Specifies that the preceding atom should match 0 or more times - for example \s* matches an empty string or any number of whitespace characters
  • + - Specifies that the preceding atom should match 1 or more times - for example /+ matches one or more consecutive slash characters
  • {m} - Specifies that the preceding atom should match exactly m times - for example -{20} matches a series of 20 consecutive hyphen characters
  • {m,} - Specifies that the preceding atom should match at least m times - for example \s{1,} matches a series of at least 1 whitespace character
  • {m,n} - Specifies that the preceding atom should match between m and n times

Unlike branches and |, quantifiers only operate on the last atom. In the pattern colou?r, only the u character (the atom preceding the ? quantifier) is affected by the quantifier, not the entire colou expression.
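
As a small illustration, the rule below accepts either spelling of the word. The ${description} variable is a hypothetical name used only for this sketch:

  <regExMatch>
    <logic>matches</logic>
    <!-- The ? applies to the u only, so both "color" and "colour" match -->
    <pattern>colou?r</pattern>
    <text>${description}</text>
  </regExMatch>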

Grouping and bracket expressions, which are described later, can be used along with quantifiers in more complex scenarios.

The * and + quantifiers are greedy by default. This means that they match the longest possible substring that still allows the rest of the expression to match. For example, the expression ^.*-A matches the longest substring that starts at the beginning of the text and ends with -A. For the string test1-A-test2-A-test3-B, it matches test1-A-test2-A.

In many cases the shortest match is more useful. In that case, the non-greedy counterparts *? and +? can be used. They work the same way, except that the shortest substring matching the pattern is captured (test1-A in the previous example). This is commonly used when extracting part of a text.
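
The sketch below applies a non-greedy quantifier to the same sample string. It relies on round-bracket groups and the \1 reference, which are covered in the sections on grouping and substitution, and the variable name first is made up for this example. With the non-greedy (.*?), the variable is expected to end up containing test1; with a greedy (.*) it would contain test1-A-test2 instead:

  <setInstallerVariableFromRegEx>
    <!-- Hypothetical variable name that receives the result -->
    <name>first</name>
    <!-- (.*?) captures as little as possible before the first "-A" -->
    <pattern>^(.*?)-A.*$</pattern>
    <!-- Replace the whole match with the text captured by the first group -->
    <substitution>\1</substitution>
    <text>test1-A-test2-A-test3-B</text>
  </setInstallerVariableFromRegEx>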

19.2. Bracket expressions

Regular expressions can specify a subset of characters to match, specified within square brackets. For example the following will match both "disk drive" and "disc drive":

  <regExMatch>
    <logic>matches</logic>
    <pattern>dis[ck] drive</pattern>
    <text>${program_stdout}</text>
  </regExMatch>

Please note that in this example the bracket expression matches exactly one character, because it is not followed by a quantifier ("diskc drive" will not match).

It is also possible to specify a range of characters in the format a-b, where a is the first character and b is the last character to match. For example, [A-Z] specifies any upper-case letter. Multiple ranges can be combined, such as [A-Za-z0-9], which specifies upper- and lower-case letters and all digits.

The following will match between 8 and 20 characters, consisting of letters and digits only:

  <regExMatch>
    <logic>matches</logic>
    <pattern>^[A-Za-z0-9]{8,20}$</pattern>
    <text>${program_stdout}</text>
  </regExMatch>

In the example above, the bracket expression is considered a single atom, therefore the {8,20} quantifier applies to the whole [A-Za-z0-9] expression. The ^ and $ characters cause the expression to only match if the entire text matches the expression.

If you need to include the literal - in the matching characters, it must be specified as the last character in the bracket expression: [A-Za-z0-9-].
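
For instance, the following sketch accepts only names made of letters, digits and hyphens. The ${serviceName} variable is a hypothetical name used for illustration:

  <regExMatch>
    <logic>matches</logic>
    <!-- The trailing - inside the brackets is a literal hyphen, not a range -->
    <pattern>^[A-Za-z0-9-]+$</pattern>
    <text>${serviceName}</text>
  </regExMatch>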

Regular expressions also support character classes, which can be used as shorthand for commonly used sets of characters:

  • [[:alpha:]] - A letter
  • [[:upper:]] - An upper-case letter
  • [[:lower:]] - A lower-case letter
  • [[:digit:]] - A decimal digit
  • [[:xdigit:]] - A hexadecimal digit
  • [[:alnum:]] - An alphanumeric (letter or digit)
  • [[:print:]] - An alphanumeric (same as alnum)
  • [[:blank:]] - A space or tab character
  • [[:space:]] - A character producing white space in the text
  • [[:punct:]] - A punctuation character
  • [[:graph:]] - A character with a visible representation
  • [[:cntrl:]] - A control character

The following is equivalent to the previous example, using character classes:

  <regExMatch>
    <logic>matches</logic>
    <pattern>^[[:alnum:]]{8,20}$</pattern>
    <text>${program_stdout}</text>
  </regExMatch>

The following abbreviations are also available for some of the character classes:

  • \d is equivalent to [[:digit:]]
  • \s is equivalent to [[:space:]]
  • \w is equivalent to [[:alnum:]_] (alphanumeric characters plus the underscore)
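
As an illustration of these abbreviations, the sketch below checks that a text consists of three dot-separated numbers, such as 1.20.3. The ${versionstring} variable is assumed to have been set elsewhere in the project:

  <regExMatch>
    <logic>matches</logic>
    <!-- \d+ matches one or more digits; \. matches a literal dot -->
    <pattern>^\d+\.\d+\.\d+$</pattern>
    <text>${versionstring}</text>
  </regExMatch>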

19.3. Grouping

Atoms in regular expressions can also be grouped by using round brackets. Grouping can be used along with branches. The following example will match if a version begins with 9. or 10.:

  <regExMatch>
    <logic>matches</logic>
    <pattern>^(9|10)\.</pattern>
    <text>${versionstring}</text>
  </regExMatch>

The | character inside a group only separates the branches within that group; it does not affect the parts of the pattern outside the group.

It is also possible to group one or more characters and apply a quantifier to the entire group. The pattern I am (very\s+)*happy will match "I am happy", "I am very happy", "I am very very happy" and so on.

The very\s+ pattern will match the text "very" followed by at least 1 white space. Then, the * quantifier is applied to the entire (very\s+) group, which means 0 or more occurrences of "very" followed by at least 1 white space.
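
Wrapped in a rule, the pattern could be used as in the following sketch. The ^ and $ anchors are added here so that the whole text has to match, and the ${answer} variable is a hypothetical name:

  <regExMatch>
    <logic>matches</logic>
    <!-- The * applies to the whole (very\s+) group, allowing it 0 or more times -->
    <pattern>^I am (very\s+)*happy$</pattern>
    <text>${answer}</text>
  </regExMatch>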

19.4. Substituting text using regular expressions

The <setInstallerVariableFromRegEx> action can be used to do regular expression substitution in a text.

The example below replaces any run of whitespace characters in the ${text} variable with a single space, storing the outcome in the result variable:

  <setInstallerVariableFromRegEx>
    <name>result</name>
    <pattern>[[:space:]]+</pattern>
    <substitution> </substitution>
    <text>${text}</text>
  </setInstallerVariableFromRegEx>

Grouping can also be used to capture certain values, which can then be used for replacing text as well as for extracting part of a text. Each group can be referenced in the <substitution> tag by specifying \n, where n is a digit between 1 and 9 corresponding to the number of the matched group (in this context \n does not mean the newline character).

For example the following can be used to extract an extension from a filename:

  <setInstallerVariableFromRegEx>
    <name>extension</name>
    <pattern>.*\.([^\.]+)$</pattern>
    <substitution>\1</substitution>
    <text>${filename}</text>
  </setInstallerVariableFromRegEx>

Since ([^\.]+) is the first group used in the expression, the \1 in the <substitution> tag references the characters matched by it.
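
Several group references can be combined in one substitution. The following sketch rearranges a date such as 2024-01-31 into 31/01/2024; the variable names isoDate and displayDate are assumptions made for this example:

  <setInstallerVariableFromRegEx>
    <!-- Hypothetical variable name that receives the rearranged date -->
    <name>displayDate</name>
    <!-- Group 1 is the year, group 2 the month, group 3 the day -->
    <pattern>^([0-9]{4})-([0-9]{2})-([0-9]{2})$</pattern>
    <!-- Emit the captured groups in reverse order, separated by slashes -->
    <substitution>\3/\2/\1</substitution>
    <!-- ${isoDate} is assumed to hold a date in YYYY-MM-DD form -->
    <text>${isoDate}</text>
  </setInstallerVariableFromRegEx>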

In order to extract individual values from a hyphen-separated text such as 1234-5678-ABCD, we can use the following:

  <setInstallerVariableFromRegEx>
    <name>value1</name>
    <pattern>^(.*?)-(.*?)-(.*?)$</pattern>
    <substitution>\1</substitution>
    <text>${value}</text>
  </setInstallerVariableFromRegEx>
  <setInstallerVariableFromRegEx>
    <name>value2</name>
    <pattern>^(.*?)-(.*?)-(.*?)$</pattern>
    <substitution>\2</substitution>
    <text>${value}</text>
  </setInstallerVariableFromRegEx>
  <setInstallerVariableFromRegEx>
    <name>value3</name>
    <pattern>^(.*?)-(.*?)-(.*?)$</pattern>
    <substitution>\3</substitution>
    <text>${value}</text>
  </setInstallerVariableFromRegEx>

For the sample text above, this yields 1234 as value1, 5678 as value2 and ABCD as value3.

The same pattern can be used in combination with <regExMatch> to validate the input, for example:

  <throwError>
    <text>Invalid value for field: ${value}</text>
    <ruleList>
      <regExMatch>
        <logic>does_not_match</logic>
        <text>${value}</text>
        <pattern>^(.*?)-(.*?)-(.*?)$</pattern>
      </regExMatch>
    </ruleList>
  </throwError>

In certain cases, grouping is needed for matching more complex patterns, but the group should not be available for referencing. In this case, the contents of the group have to start with ?:.

The following example will match values separated by either - or the text " hyphen ". The separator will not be captured as a numbered group, even though it is grouped:

  <setInstallerVariableFromRegEx>
    <name>value1</name>
    <pattern>^(.*?)(?:-| hyphen )(.*?)(?:-| hyphen )(.*?)$</pattern>
    <substitution>\1</substitution>
    <text>${value}</text>
  </setInstallerVariableFromRegEx>
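
Because the separator groups are not numbered, \2 still refers to the second value rather than to the first separator. A sketch of extracting it, assuming ${value} holds a text such as "1234 hyphen 5678-ABCD" (in which case value2 is expected to be 5678):

  <setInstallerVariableFromRegEx>
    <name>value2</name>
    <!-- Only the three (.*?) groups are numbered; the (?:...) groups are skipped -->
    <pattern>^(.*?)(?:-| hyphen )(.*?)(?:-| hyphen )(.*?)$</pattern>
    <substitution>\2</substitution>
    <text>${value}</text>
  </setInstallerVariableFromRegEx>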