Converting data between different encodings, or filtering out untrusted characters and strings, can allow malicious content to slip through input sanitization.
Encoding changes, such as converting from `UTF-8` to pure `ASCII`, can turn non-functional payloads, such as `<script生>`, into functional `<script>` tags. Mixed encoding modes can also play a role; see CWE-180: Incorrect Behavior Order: Validate Before Canonicalize. The recommendation by [Batchelder 2022] to use a single encoding and mode is only applicable within a single project or supplier. The recommendation by [W3c.org 2015] to always choose `UTF-8` provides no guarantee either, because some Python installations on Windows default to `Windows-1252`.
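As a rough illustration of the point about platform defaults, the following sketch shows how the same bytes read back differently depending on the encoding the reader assumes. The helper name `roundtrip` and the sample text are made up for this illustration; `cp1252` is Python's codec name for Windows-1252.

```python
import locale


def roundtrip(text: str, write_encoding: str, read_encoding: str) -> str:
    """Encode with one encoding and decode with another (hypothetical helper)."""
    return text.encode(write_encoding).decode(read_encoding)


# The encoding Python prefers for text data depends on platform and locale.
print("Preferred encoding:", locale.getpreferredencoding(False))

sample = "生"
same = roundtrip(sample, "utf-8", "utf-8")    # writer and reader agree
mixed = roundtrip(sample, "utf-8", "cp1252")  # reader assumes Windows-1252

print("UTF-8 to UTF-8 preserved the text:", same == sample)
print("UTF-8 to cp1252 preserved the text:", mixed == sample)
```

Passing `encoding=` explicitly to `open()`, `str.encode()` and `bytes.decode()` avoids depending on the platform default, but it only helps when both sides of the exchange agree on the value.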
`example01.py` is a heavily simplified pair of functions that simulate two completely different systems using different encodings. The variable `floppy` stands in for the data at rest and data in transit. In a real-world scenario, `write_message` and `read_message` would be delivered independently, each with its own encoding.
# SPDX-FileCopyrightText: OpenSSF project contributors
# SPDX-License-Identifier: MIT
"""Code Example"""
import re
import unicodedata


def write_message(input_string: str):
    """Normalize and validate untrusted string before storing
    Parameters:
        input_string(string): String to validate
    """
    message = unicodedata.normalize("NFC", input_string)

    # validate, exclude dangerous tags:
    for tag in re.findall("<[^>]*>", message):
        if tag in ["<script>", "<img", "<a href"]:
            raise ValueError("Invalid input tag")
    return message.encode("utf-8")


def read_message(message: bytes):
    """Simulating another part of the system displaying the content.
    Args:
        message (bytes): bytearray with some data
    """
    print(message.decode("ascii", "ignore"))


#####################
# attempting to exploit above code example
#####################
# attacker:
floppy = write_message("<script生>")
# victim:
read_message(floppy)
Output of example01.py:
<script>
`example01.py` reduces the `UTF-8` encoded data to the 128 characters of `ASCII`. The non-ASCII character in `<script生>` is silently dropped, which turns a previously non-functional string into a working `<script>` tag. Whether such an event takes place depends heavily on the client, the trust relationship, and the chain of events.
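The weak point in the reader is the `"ignore"` error handler, which drops every byte that is not valid ASCII. The following is a minimal sketch, assuming the reader knows the writer used UTF-8 and is allowed to reject anything else. The function name `read_message_strict` and the small deny list are assumptions for illustration and are not part of `example01.py`.

```python
import re


def read_message_strict(message: bytes) -> str:
    """Decode with the writer's encoding and fail closed (hypothetical sketch)."""
    # errors="strict" raises UnicodeDecodeError instead of dropping bytes
    text = message.decode("utf-8", errors="strict")
    # Validate again after decoding, since this is the form that gets displayed
    for name in re.findall(r"<\s*/?\s*([^\s>/]+)", text):
        if name.lower() in ("script", "img", "a"):
            raise ValueError("Invalid input tag")
    return text


payload = "<script生>".encode("utf-8")

# Lossy decode as in example01.py: the bytes of the non-ASCII character vanish
lossy = payload.decode("ascii", "ignore")

# A strict ASCII decode refuses the same bytes instead of dropping them
try:
    payload.decode("ascii", "strict")
    strict_ascii_failed = False
except UnicodeDecodeError:
    strict_ascii_failed = True

# Decoding with the matching encoding keeps the payload non-functional
intact = read_message_strict(payload)
```

A deny list like the one above is easy to bypass and is only meant to show where validation has to happen: after the data has reached the form in which it will be used.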
A compliant solution will have to adhere to at least:
Reduction of data into a subset is not limited to strings and characters.
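The same kind of loss can be sketched with numbers. The example below is an illustration built on assumptions and is not taken from the source: a hypothetical `store_as_byte` helper validates an integer and only afterwards reduces it to 8 bits, so a value that passed validation reads back as the forbidden one. The `store_as_byte_checked` variant refuses values that do not fit and validates the reduced form.

```python
FORBIDDEN_ID = 0  # hypothetical privileged identifier


def store_as_byte(user_id: int) -> bytes:
    """Validate first, then reduce to one byte (problematic order, hypothetical)."""
    if user_id == FORBIDDEN_ID:
        raise ValueError("Forbidden identifier")
    # Masking keeps only the lowest 8 bits, so 256 becomes 0
    return bytes([user_id & 0xFF])


def store_as_byte_checked(user_id: int) -> bytes:
    """Refuse values that do not fit instead of silently reducing them."""
    # int.to_bytes raises OverflowError when the value needs more than one byte
    data = user_id.to_bytes(1, "big")
    if data[0] == FORBIDDEN_ID:
        raise ValueError("Forbidden identifier")
    return data


stored = store_as_byte(256)
print("Stored value equals the forbidden identifier:", stored[0] == FORBIDDEN_ID)
```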
Tool | Version | Checker | Description |
---|---|---|---|
Bandit | 1.7.4 on Python 3.10.4 | Not Available | |
Flake8 | 8-4.0.1 on Python 3.10.4 | Not Available | |

[Batchelder 2022] | Ned Batchelder, Pragmatic Unicode, or, How do I stop the pain? [online], Available from: https://www.youtube.com/watch?v=sgHbC6udIqc [Accessed 4 April 2025] |
[W3c.org 2015] | Character encodings for beginners [online], Available from: https://www.w3.org/International/questions/qa-what-is-encoding [Accessed 4 April 2025] |