by the OpenSSF Best Practices Working Group
This is detailed rationale for the document Correctly Using Regular Expressions for Secure Input Validation.
If you just want to know what to do, you can stop reading now, and instead consult Correctly Using Regular Expressions for Secure Input Validation. However, if you want to know why we make these recommendations, here is our detailed rationale, with commentary and supporting evidence. We’ve examined the specifications of various systems, and in some cases, written sample code to verify what implementations do.
A key question for each platform (such as a programming language) is determining if a regular expression like /x$/ only matches inputs like “ax” or if it will also match other inputs such as “ax\n”. Similarly, we must check if /^d/ matches only “dog”, or if it will match other beginnings like “\ndog” or “x\ndog”. We also need to determine if there are symbols for beginning of string (typically “\A”) or end of string (typically “\z” though Python uses “\Z”).
For more information, see Seth Larson’s Regex character “$” doesn’t mean “end-of-string” which identified the problem of many people thinking “$” always means “end of string”. See also Mastering Regular Expressions by Jeffrey E.F. Friedl (especially 3rd edition pages 129-130), which noted these variations but didn’t note that many people misunderstand them. We should also note the xkcd cartoon I know regular expressions. The site “regular-expressions.info” section “anchors” discusses this, though its text can be misleading; as of 2024-04-09 near its beginning it says that “$ matches right after the last character in the string. c$ matches c in abc…” and only later in the text does it note that its behavior varies by language.
In this rationale, we’ll provide a brief history of regular expression implementations, which explains how we got here. This is followed by a survey of various platforms.
Some historical context explains how we got to this complicated state:
The error of using the anchor “$” to mean “match end of string” on platforms where it doesn’t mean “end of string” appears to be especially prevalent. As of April 2024, MITRE’s “CWE-625: Permissive Regular Expression” discusses the problem of permissive regexes, and it specifically notes the need for anchors. However, it incorrectly recommends using “$” to match the end of a string in Perl, where “$” is not the end-of-string marker. Thus, we’ve especially focused on checking whether the pattern “x$” matches “x\n” (if it does, then by definition it’s permissive). Permissive systems may allow additional matches (e.g., carriage return).
The MDN page on regular expressions explains the ECMAScript (JavaScript) regular expression notation (for a detailed view see the ECMAScript® 2025 Language Specification (Draft ECMA-262 / April 25, 2024)). In ECMAScript (JavaScript) in its default mode:
When doing input validation you typically don’t want group match results, so in most cases you’ll want to use the method test (which simply returns true or false depending on whether the pattern matches).
The proposal “Regular Expression Buffer Boundaries for ECMAScript” seeks to “introduce \A and \z character escapes to Unicode-mode regular expressions as synonyms for ^ and $ that are not affected by the m (multiline) flag.”
The GNU Gnulib library implements POSIX BRE and POSIX ERE. This library is widely used to implement regular expressions in C, C++, and various command-line tools. The Gnulib documentation and the regular-expressions.info GNU page note that in the Gnulib implementation of POSIX BRE and POSIX ERE, “the anchor ` (backtick) matches at the very start of the subject string, while ' (single quote) matches at the very end.”
Since input validation normally doesn’t use a multi-line mode, using the standard “^” and “$” also works. Since those are standard, there’s no obvious reason to point out these Gnulib extensions just for input validation.
Go’s standard library includes regexp, and it uses the syntax of RE2. Thus, by default, ^ and \A are start-of-string, while $ and \z are end-of-string. This uses a non-backtracking implementation and thus is immune to ReDoS attacks.
In the case of Java, experiments with every implementation we’ve tried show that “$” matches only the end-of-string and that “$” is not permissive. Similarly, “^” by default matches only the beginning of a string. This is consistent with the Java 8 documentation.
However, the Oracle Java 21 documentation for java.util.regex.Pattern and some other documentation instead says that “$” is permissive. We see no evidence that this is true, or that this change has been implemented, so we suspect this is an error in the Oracle Java 21 documentation. However, since this is a claimed difference that might be implemented at some time in the future, using “\z” instead of “$” might be a safer choice.
As of 2022, 46% of Java programs use JDK 8 (released in 2014) instead of a more recent version, so Java 8’s results are important. The Java 8 documentation for java.util.regex.Pattern makes it clear that “$” only matches the end of the string (Java 8 is not permissive). Its first part says:
By itself this Java 8 text is ambiguous, because it describes “\z” as “end of the input” but “$” as “end of a line”, and for a typical reader it’s not clear that this difference in terminology matters. However, this ambiguity is later resolved in the text, which says that “by default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.” This added text seems to clearly state that “$” would not match an input with an extra newline at the end by default. As of 2024-03-24, the same text applies to Pattern for the draft specification for Java SE 20 & JDK 20 (DRAFT 20-valhalla+1-75).
Unfortunately, the Oracle Java 21 documentation for java.util.regex.Pattern says that the meaning of “$” is different. It uses the same bulleted text shown earlier, but changes the text to make “$” permissive. It does this by adding the following text instead: “If MULTILINE mode is not activated, the regular expression ^ ignores line terminators and only matches at the beginning of the entire input sequence. The regular expression $ matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.” In short, this documentation claims that “$” is permissive, just like Perl and Python.
In April-May 2024, Nikita Koselev ran a number of tests with Java on a variety of versions. He wrote this small Java class to test if “$” is permissive:
    public class RegexMatchTest {
        public static void main(String[] args) {
            // Define the test string and the regex pattern
            String testString = "x\n";
            String pattern = "x$";

            // Check if the pattern matches the test string
            boolean isMatch = testString.matches(pattern);

            // Get JVM version information
            String javaVersion = System.getProperty("java.version");
            String javaRuntimeVersion = System.getProperty("java.runtime.version");

            // Output the JVM version and the result
            System.out.println("Java Version: " + javaVersion);
            System.out.println("Java Runtime Version: " + javaRuntimeVersion);

            // Output the result
            System.out.println("Testing if 'x$' matches 'x\\n': " + isMatch);
            System.out.println("Explanation: 'x$' is expected to match strings that end with 'x' right before a newline.");
            System.out.println("Result: The pattern " + (isMatch ? "matches" : "does not match") + " the string 'x\\n'.");
        }
    }
All tests indicate that “$” is not permissive in Java implementations. Tests were run on the compilers from Amazon for both Java 8 (8.0.412-amzn, Java version 1.8.0_412, Java runtime version 1.8.0_412-b08) and Java 21 (21.0.3-amzn, Java version 21.0.3, Java runtime version 21.0.3+9-LTS) on Ubuntu. We also tested on Windows 10 with Java 8 and Java 17. In all cases we found that testing if ‘x$’ matches ‘x\n’ returned “false” (that is, “$” is not permissive). This was also tested on a Java online system (which we believe uses Java 17) at <https://onecompiler.com/java/42by2e6vc>.
This assertion that “$” is not permissive is also consistent with the posting How to do in Java’s “Regex – Match Start or End of String (Line Anchors)” which says that “The dollar $ matches the position after the last character in the string.”
However, both Seth Larson’s Regex character “$” doesn’t mean “end-of-string” (which identified the problem of many people thinking “$” always means “end of string”) and Mastering Regular Expressions by Jeffrey E.F. Friedl (especially 3rd edition pages 129-130) state that in Java “$” matches the end of string or just before a line terminator. That is, these claim that “$” is permissive. These claims are understandable, but they appear to be incorrect.
In all implementations we’ve tested, “$” only matches the end of string by default.
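When reproducing tests like these, it helps to keep in mind that `String.matches` succeeds only when the entire input matches the pattern, whereas `Matcher.find()` searches for the pattern within the input, so the two calls can report different results for the same pattern and input. The following is a minimal sketch, not part of the tests described above, that probes “x$” and “x\z” against “x\n” both ways. The class name `DollarAnchorProbe` and the helper `found` are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class DollarAnchorProbe {
    // Returns true if the regex is found anywhere in the input (search semantics).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "x\n";

        // String.matches requires the whole input to match the pattern.
        System.out.println("\"x\\n\".matches(\"x$\"): " + input.matches("x$"));

        // Matcher.find searches for the pattern within the input.
        System.out.println("find \"x$\" in \"x\\n\": " + found("x$", input));
        System.out.println("find \"x\\z\" in \"x\\n\": " + found("x\\z", input));
    }
}
```

Running this on the JDK you deploy shows directly how each call treats a trailing newline. Validation code is best tested with the same call (`matches` or `find`) that it uses in production.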
There are multiple versions of .NET, which can make discussions on .NET complex. “.NET Standard is a formal specification of .NET APIs that are available on multiple .NET implementations. The motivation behind .NET Standard was to establish greater uniformity in the .NET ecosystem. .NET 5 and later versions adopt a different approach to establishing uniformity that eliminates the need for .NET Standard in most scenarios. However, if you want to share code between .NET Framework and any other .NET implementation, such as .NET Core, your library should target .NET Standard 2.0.” (.NET Standard) There are also various target frameworks in SDK-style projects, specifically .NET 8.
Still, regular expressions are a basic part of .NET (via System.Text.RegularExpressions).
Microsoft’s .NET “Regular Expression Language - Quick Reference” says the following, clearly documenting that “$” is permissive:
^ By default, the match must start at the beginning of the string; in multiline mode, it must start at the beginning of the line.
$ By default, the match must occur at the end of the string or before \n at the end of the string; in multiline mode, it must occur before the end of the line or before \n at the end of the line.
\A The match must occur at the start of the string.
\Z The match must occur at the end of the string or before \n at the end of the string.
\z The match must occur at the end of the string.
The Regex Hero page “.NET Regex Reference” says the same thing. Seth Larson’s Regex character “$” doesn’t mean “end-of-string” (which identified the problem of many people thinking “$” always means “end of string”) and Mastering Regular Expressions by Jeffrey E.F. Friedl (especially 3rd edition pages 129-130) also clearly state that “$” in .NET/C# is permissive.
As explained in “Learn .NET Fundamentals / Regular expression options” section “Nonbacktracking mode”, “By default, .NET’s regex engine uses backtracking to try to find pattern matches. A backtracking engine is one that tries to match one pattern, and if that fails, goes backs and tries to match an alternate pattern, and so on. A backtracking engine is very fast for typical cases, but slows down as the number of pattern alternations increases, which can lead to catastrophic backtracking. The RegexOptions.NonBacktracking option, which was introduced in .NET 7, doesn’t use backtracking and avoids that worst-case scenario. Its goal is to provide consistently good behavior, regardless of the input being searched. The RegexOptions.NonBacktracking option doesn’t support everything the other built-in engines support. In particular, the option can’t be used in conjunction with RegexOptions.RightToLeft or RegexOptions.ECMAScript. It also doesn’t allow for the following constructs…”
Microsoft recommends that, “When using System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can provide input to RegularExpressions, causing a Denial-of-Service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.”
In PHP the “PCRE extension is a core PHP extension, so it is always enabled.” This is the usual library for PHP, and unsurprisingly has PCRE semantics.
Per anchors, by default, “A dollar character ($) is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default).” It notes that “The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.”
As noted in PHP’s PCRE pattern modifiers, if you set the D (PCRE_DOLLAR_ENDONLY) modifier, “a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.”
POSIX BRE and POSIX ERE are defined in the POSIX standard. For our purposes we’ll use The Open Group Base Specifications Issue 7, 2018 edition because it is publicly available. It has some helpful comments in its section on regcomp (which compiles regular expressions), but its real meat is in its chapter on regular expressions.
In both BRE and ERE notation, by default “^” means beginning-of-string and “$” means end-of-string, per sections “9.3.8 BRE Expression Anchoring” and “9.4.9 ERE Expression Anchoring”.
The regcomp function (which compiles regular expressions) accepts a “REG_NEWLINE” flag, to help text editors search many lines. If REG_NEWLINE is set, the interpretation changes: a “^” matches the zero-length string immediately after a <newline> in string, and “$” matches the zero-length string immediately before a <newline> in string. There’s no way in the POSIX specification to separately match the beginning or the end of a string when REG_NEWLINE is enabled, which is why \A, \Z, and \z were later created by Perl. When validating input from untrusted users the REG_NEWLINE option is normally not used.
Perl documentation for perlre (perl regular expressions) describes its support for regular expressions. Version 5.38.2 documents the following, where “/m” is the “multiple lines” modifier (the multiple lines modifier is not enabled by default):
The Perl Compatible Regular Expressions (PCRE) library “is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5. PCRE was originally written for the Exim MTA, but is now used by many… [programs]”. PCRE includes various extensions not in the Perl implementation.
The “CIRCUMFLEX AND DOLLAR” section describes “^” and “$”. By default “$” is “an assertion that is true only if the current matching point is at the end of the subject string, or immediately before a newline at the end of the string (by default), unless PCRE2_NOTEOL is set.” The meaning of “$” can be changed to match only the very end of the string by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This does not affect the \Z assertion.
PCRE supports various named options which are converted into a bit pattern (and thus PCRE doesn’t standardize text option flags). The main 32 options are documented in pcre2_compile; a full description of options and extended options is in PCRE2 API.
Setting both PCRE2_ANCHORED and PCRE2_ENDANCHORED forces a full-string match, but it also disables JIT compilation, so don’t do that unnecessarily.
The Python3 language documentation on re notes that its operations are “similar to those found in Perl”, but note that they are similar, not identical. In this library:
As with many languages, there are alternative libraries. The Python3 documentation specifically mentions the “third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.”
Python3’s regular expression library “re” has the method “fullmatch” which exactly matches the string (like prepending “\A(?:” and appending “)\Z”). However, this can’t always be used. Flask is a common server-side web application framework for Python3, and a common way to validate data in Flask is Webargs (here’s an example of a recommendation). The validators of Webargs reuse marshmallow.validate, which has a marshmallow.validate.Regexp but no equivalent marshmallow.validate.FullRegexp. Thus, you still need to prefix and suffix regular expressions sometimes.
As of 2024-03-24, Tutorialspoint incorrectly claims that “$ matches the end of a string” in Python. StackOverflow answer 1218783 is also incorrect.
RE2 is a regular expression library using a non-backtracking implementation approach. Such implementations don’t have catastrophic cases and are sometimes orders of magnitude faster, but they’re less featureful (e.g., they don’t support backreferences). RE2’s speed is compelling in many cases, so RE2 ended up being used in many places.
The RE2 syntax page notes that the flag “m” enables “^ and $ match begin/end line in addition to begin/end text (default false)”:
As documented in the Ruby version 3.3.0 documentation on class Regexp:
As noted in the Ruby on Rails guide on security, “A common pitfall in Ruby’s regular expressions is to match the string’s beginning and end by ^ and $, instead of \A and \z.” The Brakeman tool warns in many cases when ^ and $ are used in Ruby regular expressions (instead of \A and \z).
Rust doesn’t include a regular expression library in its default set of libraries. The crate regex is widely used in Rust development, so that’s what we used here. In crate regex:
The following survey table shows specifics for a number of common platforms using their default/built-in regex system in default mode (e.g., not multiline). We abbreviate Portable Operating System Interface (POSIX) Extended Regular Expressions (ERE), and POSIX Basic Regular Expressions (BRE).
| Platform | Start of text symbol(s) | End of text symbol(s) | $ Permissive | Notes |
| --- | --- | --- | --- | --- |
| ECMAScript (JavaScript) | ^ | $ | No | Use test. Adding \A and \z has been proposed. |
| Golang | ^, \A | $, \z | No | Uses RE2. |
| Java | ^, \A | $, \z | Yes | Oracle Java 21 documentation disagrees. |
| .NET | ^, \A | \z | Yes | |
| PHP | ^, \A | \z | Yes | Using PCRE built-in. |
| POSIX BRE | ^ | $ | No | |
| POSIX ERE | ^ | $ | No | |
| Perl/PCRE | ^, \A | \z | Yes | |
| Python3 | ^, \A | \Z | Yes | End is capital \Z. Prefer using “fullmatch” method. |
| Ruby | \A | \z | Yes | Always use \A…\z |
| Rust | ^, \A | $, \z | No | Using crate regex. |
The “$ Permissive” column indicates whether or not the “$” is permissive in the default (not multiline) mode. A “$” is permissive if it would also match at least a newline at the end of the string being validated (it may match other sequences). That is, if the input string “cat\n” (where \n is a newline) would match the regex string “^cat$” then $ is permissive.
Here’s the summary information as text:
Beyond releasing this guide, here are some ways we can reduce the incidence of incorrect regular expressions leading to vulnerabilities.
We plan to modify the OpenSSF fundamentals course. Tools will miss things, not everyone uses tools, and developers will sometimes ignore tool reports if they believe the tool is incorrect.
We should encourage modifying various static tools (e.g., linters, style checkers, SAST) to detect and warn on these errors in using regexes. In particular, where “$” is permissive, warning on “$” but allowing “\n?\z” doesn’t limit functionality and makes the result clearer. Good examples of these are various tools in the Ruby ecosystem; Ruby has very unusual rules for ^ and $, so they’ve seen the problem more often and thus have tools specifically to look for these problems.
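As a rough illustration of the kind of check a linter could perform, here is a deliberately simplistic sketch that flags a regex source string ending in an unescaped “$”. The class name `TrailingDollarCheck` and the method `endsWithUnescapedDollar` are hypothetical and not taken from any existing tool. A real checker would parse the pattern, since this sketch misses cases such as “$” inside a group or alternation and does not understand quoting constructs.

```java
public class TrailingDollarCheck {
    // Returns true if the regex source ends with a '$' that is not escaped,
    // i.e. it is preceded by an even number (possibly zero) of backslashes.
    static boolean endsWithUnescapedDollar(String regex) {
        if (!regex.endsWith("$")) {
            return false;
        }
        int backslashes = 0;
        for (int i = regex.length() - 2; i >= 0 && regex.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 0;
    }

    public static void main(String[] args) {
        // Java string literals; "\\" denotes one backslash in the regex source.
        String[] samples = {"^[a-z]+$", "^[a-z]+\\z", "price\\$", "^a\\\\$"};
        for (String sample : samples) {
            String verdict = endsWithUnescapedDollar(sample) ? "warn" : "ok";
            System.out.println(sample + " -> " + verdict);
        }
    }
}
```

On a platform where “$” is permissive, a tool built on this idea could suggest “\z” (or “\n?\z” where a trailing newline really is acceptable) in place of the flagged “$”.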
Modify fuzzers to add extra newlines at the end of inputs. Another approach would be to interpose regex compilation and warn about problematic regex patterns, especially in systems that have a permissive $ anchor.
Include tests that start with valid values but extend them, and add newlines to valid data to see if it slips through. More generally, include tests that are almost correct inputs to ensure they are correctly rejected.
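One way to write such tests is to generate “near miss” variants of each known-valid value and confirm the validator rejects all of them. The sketch below assumes the validator can be expressed as a `Predicate<String>`. The class `NearMissInputs`, its methods, and the particular variants chosen are illustrative assumptions rather than an established test suite.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class NearMissInputs {
    // Builds inputs that are almost valid: the valid value with extra
    // newlines or characters added before or after it.
    static List<String> nearMisses(String valid) {
        return Arrays.asList(
            valid + "\n", valid + "\r\n", "\n" + valid,
            valid + "x", "x\n" + valid, valid + "\nx");
    }

    // Returns every near-miss input that the validator wrongly accepts.
    static List<String> wronglyAccepted(Predicate<String> validator, List<String> validInputs) {
        List<String> accepted = new ArrayList<>();
        for (String valid : validInputs) {
            for (String candidate : nearMisses(valid)) {
                if (validator.test(candidate)) {
                    accepted.add(candidate);
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Example validator: digits only, anchored with \A and \z.
        Pattern digits = Pattern.compile("\\A[0-9]+\\z");
        Predicate<String> validator = s -> digits.matcher(s).find();

        List<String> leaks = wronglyAccepted(validator, Arrays.asList("42", "7"));
        System.out.println("Near-miss inputs wrongly accepted: " + leaks.size());
    }
}
```

A test suite would fail the build whenever the returned list is non-empty, and the same helper can be pointed at any other validator in the code base.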
Many developers believe that regex notation is the same everywhere, even though it isn’t. It would be dangerous for existing systems to change the meaning of their existing symbols. However, we could take steps so that more regex symbols did mean the same thing everywhere. E.g.:
Such changes would take years to adopt. Even worse, these changes might not be accepted in some cases because some people may think that it is adequate for something to be merely possible. We don’t agree; we think it’s important to make it easy to do the secure action, not just possible, and it’s best to make avoidable mistakes less likely. These changes require implementations in many systems and modifications of many specifications; doing this has been historically challenging. Still, such changes would reduce the likelihood of these problems worldwide.
We would like to thank the following contributors:
If you have any additions, changes, or corrections you’d like to suggest, please open an issue or open a pull request. We appreciate your contributions!
This document is released under the Creative Commons CC-BY-4.0 license.