by the OpenSSF Best Practices Working Group
A key part of developing secure software is input validation, that is, checking input (at least all untrusted input) so that only valid data is accepted. For example, if a value is supposed to be an integer, then the software must only accept integers for that value and reject anything else. Sometimes a particular data type (like an integer or email address) is so common that libraries and frameworks include validators for it; in such cases, consider using them. However, many applications have application-specific patterns that also need input validation.
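As a minimal sketch of the integer case, the Python function below accepts only a decimal integer in an allowed range and rejects everything else. The function name `parse_port` and the range 1 to 65535 are assumptions chosen for illustration; they do not come from this guide.

```python
def parse_port(text: str) -> int:
    """Return text as a port number, or raise ValueError if it is not valid."""
    # Accept only ASCII digits. str.isdigit() alone would also accept some
    # non-ASCII characters (for example superscript digits), and int() alone
    # would accept surrounding whitespace, a sign, and underscores.
    if not (text.isascii() and text.isdigit()):
        raise ValueError("not a non-negative decimal integer")
    # Bound the length so a huge string of digits is rejected cheaply.
    if len(text) > 5:
        raise ValueError("too many digits")
    value = int(text)
    # Hypothetical application rule: the value must be in the range 1..65535.
    if not 1 <= value <= 65535:
        raise ValueError("out of range")
    return value
```

The sketch is an allowlist: it describes what is valid and rejects anything that does not fit, rather than trying to enumerate bad inputs.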
Regular expressions (aka regexes) can be a great way to validate input against specialized patterns. They’re widely available, widely understood, flexible, and efficient. However, they must be used correctly. A lot of advice is wrong or omits key points. In particular:
Doing input validation wrong, such as incorrectly using “^” or “$”, could lead to vulnerabilities.
When using regexes for secure validation of untrusted input, do the following so they’ll be correctly interpreted:
Platform | Prepend | Append | $ Permissive? |
POSIX BRE, POSIX ERE, and ECMAScript (JavaScript) | “^” (not “\A”) | “$” (not “\z” nor “\Z”) | No |
Perl, .NET/C# | “^” or “\A” | “\z” (not “$”) | Yes |
Java | “^” or “\A” | “\z”; “$” works but some documents conflict | No |
PHP | “^” or “\A” | “\z”; “$” with “D” modifier | Yes |
PCRE | “^” or “\A” | “\z”; “$” with PCRE2_DOLLAR_ENDONLY | Yes |
Golang, Rust crate regex, and RE2 | “^” or “\A” | “\z” or “$” | No |
Python | “^” or “\A” | “\Z” (not “$” nor “\z”) | Yes |
Ruby | “\A” (not “^”) | “\z” (not “$”) | Yes |
For example, to validate in JavaScript that the input is only “ab” or “de”, use the regex “^(ab|de)$”. To validate the same thing in Python, use “^(ab|de)\Z” or “\A(ab|de)\Z”. Note that the “$” anchor has different meanings among platforms and is often misunderstood; on many platforms it’s permissive by default and doesn’t match only the end of the input. On a platform where “$” is permissive, consider replacing “$” with an explicit form that states what you intend (e.g., “\n?\z”). Consider preferring “\A” and “\z” where they are supported (this is necessary when using Ruby).
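The difference between the permissive “$” and the strict “\Z” in Python can be seen in a short sketch. The names `PERMISSIVE`, `STRICT`, and `is_valid` are hypothetical and chosen only for this illustration.

```python
import re

# "$" in Python is permissive: without re.MULTILINE it matches at the end of
# the input and also just before a single trailing newline.
PERMISSIVE = re.compile(r"^(ab|de)$")

# "\A" and "\Z" in Python match only at the very start and very end of input.
STRICT = re.compile(r"\A(ab|de)\Z")


def is_valid(text: str) -> bool:
    """Return True only if text is exactly "ab" or "de"."""
    return STRICT.search(text) is not None


# The permissive pattern accepts input with a trailing newline.
print(PERMISSIVE.search("ab\n") is not None)
# The strict pattern rejects the same input.
print(is_valid("ab\n"))
```

With the permissive pattern, an attacker-supplied trailing newline passes validation and may then reach a log file, a header, or a command where the newline has meaning.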
In addition, ensure your regex is not vulnerable to a Regular Expression Denial of Service (ReDoS) attack. A ReDoS “is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size)”. Many regex implementations are “backtracking” implementations, that is, they try all possible matches. In these implementations, a poorly written regular expression can be exploited by an attacker to take a vast amount of time.
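As a hedged illustration, the sketch below contrasts a pattern with nested quantifiers, a commonly cited ReDoS shape, with an equivalent pattern that leaves the engine only one way to match. Python's `re` module is a backtracking implementation. The names `VULNERABLE`, `SAFER`, `MAX_INPUT_LENGTH`, and `is_valid`, and the limit of 64 characters, are assumptions made for this example, not recommendations from this guide.

```python
import re

# Nested quantifiers: on a backtracking engine, an input such as many "a"
# characters followed by "b" makes the engine try a huge number of ways to
# split the run of "a" characters before it can report a failure.
VULNERABLE = re.compile(r"\A(a+)+\Z")

# Accepts exactly the same strings, but each input can match in only one way,
# so there is nothing for the engine to backtrack over.
SAFER = re.compile(r"\Aa+\Z")

# Hypothetical application limit; rejecting overlong input before matching
# bounds the work any pattern can be asked to do.
MAX_INPUT_LENGTH = 64


def is_valid(text: str) -> bool:
    """Return True if text is one or more "a" characters within the limit."""
    if len(text) > MAX_INPUT_LENGTH:
        return False
    return SAFER.search(text) is not None
```

Running `VULNERABLE.search("a" * 30 + "b")` is deliberately left out of the sketch because it can take a very long time; `is_valid` returns promptly on the same input. Other mitigations exist, such as using a non-backtracking engine like RE2, which the table above lists.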
For detailed rationale, along with other information such as contributor credits, see Correctly Using Regular Expressions for Secure Input Validation - Rationale.
Our thanks to Seth Larson, whose article Regex character “$” doesn’t mean “end-of-string” raised awareness of some of the problems discussed here.
This document is released under the Creative Commons CC-BY-4.0 license.