Lab Exercise regex1

This is a lab exercise on developing secure software. For more information, see the introduction to the labs.

Goal

Learn how to create simple regular expressions for input validation.

Background

Regular expressions (regexes) are a widely used notation for expressing text patterns. Regexes can be used to validate input; when used correctly, they can counter many attacks.

Different regex languages have slightly different notations, but they have much in common. Here are some basic rules for regex notations:

  1. The most trivial rule is that a letter or digit matches itself. That is, the regex “d” matches the letter “d”. Most implementations use case-sensitive matches by default, and that is usually what you want.
  2. Another rule is that square brackets surround a rule that specifies any of a number of characters. If the square brackets surround just alphanumerics, then the pattern matches any of them. So [brt] matches a single “b”, “r”, or “t”. Inside the brackets you can include ranges of symbols separated by dash ("-"), so [A-D] will match one character, which can be one A, one B, one C, or one D. You can do this more than once. For example, the term [A-Za-z] will match one character, which can be an uppercase Latin letter or a lowercase Latin letter. (This text assumes you're not using a long-obsolete character system like EBCDIC.)
  3. If you follow a pattern with “*”, that means “0 or more times”. In almost all regex implementations (except POSIX BRE), following a pattern with "+" means "1 or more times". So [A-D]* will match 0 or more letters as long as every letter is an A, B, C, or D.
  4. You can use "|" to identify options, any of which are acceptable. When validating input, you should surround the collection of options with parentheses, because "|" has a low precedence. So for example, "(yes|no)" is a way to match either "yes" or "no".
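
The four rules above can be tried out directly. The following is a minimal Python sketch, not part of the lab itself. The helper name `matches` is made up for this illustration, and it uses `re.fullmatch` so that each pattern must cover the whole string, which keeps the focus on the notation.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Rule 1: a letter or digit matches itself; matching is case-sensitive by default.
print(matches("d", "d"))             # True
print(matches("d", "D"))             # False

# Rule 2: square brackets match any one of the listed characters or ranges.
print(matches("[brt]", "r"))         # True
print(matches("[brt]", "x"))         # False
print(matches("[A-Za-z]", "q"))      # True
print(matches("[A-Za-z]", "7"))      # False

# Rule 3: "*" means 0 or more times, "+" means 1 or more times.
print(matches("[A-D]*", ""))         # True
print(matches("[A-D]*", "ABBA"))     # True
print(matches("[A-D]*", "ABE"))      # False
print(matches("[A-D]+", ""))         # False

# Rule 4: "|" separates options; the parentheses group them.
print(matches("(yes|no)", "no"))     # True
print(matches("(yes|no)", "maybe"))  # False
```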

Task Information

We want to use regexes to validate input. That is, the input should completely match the regex pattern. To do this, use the regex engine's default mode (not a "multiline" mode), prepend one anchor symbol to the pattern, and append another. Unfortunately, different platforms use different symbols for a complete match against an input. The following table summarizes what you should prepend and append on many platforms (for their default regex system).

Platform                                             Prepend        Append
POSIX BRE, POSIX ERE, and ECMAScript (JavaScript)    “^”            “$”
Java, .NET, PHP, Perl, and PCRE                      “^” or “\A”    “\z”
Golang, Rust crate regex, and RE2                    “^” or “\A”    “$” or “\z”
Python                                               “^” or “\A”    “\Z” (not “\z”)
Ruby                                                 “\A”           “\z”

For example, to validate in ECMAScript (JavaScript) that an input must be either “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z” (note that the Python patterns end with “\Z”, not “$”).
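
To see why the choice of anchors matters, here is a minimal Python sketch of whole-input validation. The function name `is_valid` and the sample inputs are illustrative assumptions. The pattern is the “^(ab|de)\Z” example from the text, applied with `re.search` in its default (non-multiline) mode.

```python
import re

# Pattern from the text: "^" anchors the start and "\Z" the very end of the input.
VALID = re.compile(r"^(ab|de)\Z")

def is_valid(text: str) -> bool:
    """Return True only if the whole input is exactly "ab" or "de"."""
    return VALID.search(text) is not None

print(is_valid("ab"))     # True
print(is_valid("de"))     # True
print(is_valid("abde"))   # False
print(is_valid("xab"))    # False
print(is_valid("ab\n"))   # False: "\Z" does not allow a trailing newline

# In Python, "$" also matches just before a newline at the end of the string,
# so it is not a strict end-of-input anchor.
print(re.search(r"^(ab|de)$", "ab\n") is not None)   # True

# Without parentheses, "|" splits the whole pattern into "^ab" or "de\Z".
print(re.search(r"^ab|de\Z", "abXYZ") is not None)   # True
```

The last two checks illustrate two easy mistakes: ending a Python pattern with “$” instead of “\Z”, and leaving out the parentheses around the options.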

More information is available in the OpenSSF guide Correctly Using Regular Expressions for Secure Input Validation.

Interactive Lab

Please create regular expression (regex) patterns that meet the criteria below.

Use the “hint” and “give up” buttons if necessary.

Part 1

Create a regular expression, for use in ECMAScript (JavaScript), that only matches the letters "Y" or "N".


Part 2

Create a regular expression, for use in ECMAScript (JavaScript), that only matches one or more uppercase Latin letters (A through Z).


Part 3

Create a regular expression, for use in ECMAScript (JavaScript), that only matches the words "true" or "false".


Part 4

Create a regular expression that only matches one or more uppercase Latin letters (A through Z). However, this time, do it for Python (not JavaScript).


Part 5

Create a regular expression that only matches one uppercase Latin letter (A through Z), followed by a dash ("-"), followed by one or more digits. This time, do it for Ruby (not JavaScript or Python).




This lab was developed by David A. Wheeler at The Linux Foundation.