Converting data between different encodings, or filtering out untrusted characters and strings, can allow malicious content to slip past input sanitization.
Encoding changes, such as converting from UTF-8 to pure ASCII, can turn non-functional payloads, such as <script生>, into functional <script> tags. Mixed encoding modes (CWE-180: Incorrect Behavior Order: Validate Before Canonicalize) can also play a role. The recommendation by Batchelder 2022 to use a single type of encoding and mode only helps within a single project or supplier. The recommendation by W3c.org 2015 to always choose UTF-8 provides no guarantee, and is already undermined by some Python installations on Windows defaulting to the Windows-1252 encoding.
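The platform-dependent default mentioned above can be inspected from Python itself. The following is a minimal sketch using only the standard library; the file name and variable names are illustrative assumptions, and the printed value varies with platform, locale, and whether Python's UTF-8 mode is enabled.

```python
import locale
import os
import tempfile

# Encoding Python falls back to when open() is called without encoding=...
# The value depends on the platform, the locale and Python's UTF-8 mode.
print(locale.getpreferredencoding(False))

text = "<script生>"
# Hypothetical file standing in for data at rest
path = os.path.join(tempfile.mkdtemp(), "floppy.txt")

# Stating the encoding on both sides avoids relying on the platform default.
with open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()
```

Stating the encoding explicitly only helps when both the writer and the reader are under the same control, which is the limitation described above.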
The example01.py code is a deliberately simplified pair of functions that simulate two completely different systems using different encodings. Data at rest and data in transit are both simulated by a variable named floppy. In a real-world scenario, write_message and read_message would be delivered independently, each with its own encoding.
# SPDX-FileCopyrightText: OpenSSF project contributors
# SPDX-License-Identifier: MIT
"""Code Example"""
import re
import unicodedata
def write_message(input_string: str):
    """Normalize and validate untrusted string before storing
    Parameters:
        input_string(string): String to validate
    """
    message = unicodedata.normalize("NFC", input_string)
    # validate, exclude dangerous tags:
    for tag in re.findall("<[^>]*>", message):
        if tag in ["<script>", "<img", "<a href"]:
            raise ValueError("Invalid input tag")
    return message.encode("utf-8")
def read_message(message: bytes):
    """Simulating another part of the system displaying the content.
    Args:
        message (bytes): bytearray with some data
    """
    print(message.decode("ascii", "ignore"))
#####################
# attempting to exploit above code example
#####################
# attacker:
floppy = write_message("<script生>")
# victim:
read_message(floppy)
Output of example01.py:
<script>
The example01.py code writes the message <script生> as UTF-8 and then decodes it as 7-bit ASCII (128 characters) while ignoring errors. The bytes of 生 are silently dropped, which turns a previously harmless string into a working <script> tag. Whether this can be exploited depends heavily on the client, the trust relationship, and the chain of events.
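The collapse into a working tag depends on the error handler passed to bytes.decode. As a small illustration (the variable names are assumptions made for this sketch), the same bytes can be decoded as ASCII with each of the standard handlers:

```python
payload = "<script生>".encode("utf-8")

# "ignore" silently drops the non-ASCII bytes, leaving a working tag
ignored = payload.decode("ascii", "ignore")

# "replace" substitutes U+FFFD for undecodable bytes, so the tag stays broken
replaced = payload.decode("ascii", "replace")

# "strict" (the default) refuses to decode and raises instead
try:
    payload.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ignored)
print(strict_failed)
```

Only the "ignore" handler produces <script>; the other two either keep the tag non-functional or surface the mismatch as an exception.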
A compliant solution will have to adhere to at least:
Reduction of data into a subset is not limited to strings and characters.
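As a rough illustration of what a more defensive reader for example01.py might look like, the sketch below assumes both systems can agree on UTF-8, decodes strictly, and validates only after canonicalizing. The function name read_message_checked and the deny-list pattern are hypothetical and not part of the original example.

```python
import re
import unicodedata

# Hypothetical deny-list pattern, for illustration only; real applications
# are usually better served by an allow-list or an HTML-escaping library.
TAG_PATTERN = re.compile(r"<\s*/?\s*(script|img|a)\b", re.IGNORECASE)


def read_message_checked(message: bytes) -> str:
    """Decode strictly as UTF-8, canonicalize, then validate before use."""
    # errors="strict" raises UnicodeDecodeError rather than dropping bytes
    text = message.decode("utf-8", errors="strict")
    text = unicodedata.normalize("NFC", text)
    if TAG_PATTERN.search(text):
        raise ValueError("Invalid input tag")
    return text
```

With this reader, the bytes of <script生> are returned unchanged because nothing is dropped during decoding, while a plain <script> tag or malformed UTF-8 raises an exception.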
| Tool | Version | Checker | Description | 
|---|---|---|---|
| Bandit | 1.7.4 on Python 3.10.4 | Not Available | |
| Flake8 | 8-4.0.1 on Python 3.10.4 | Not Available | |

| Reference | Source |
|---|---|
| [Batchelder 2022] | Ned Batchelder, Pragmatic Unicode, or, How do I stop the pain? [online]. Available from: https://www.youtube.com/watch?v=sgHbC6udIqc [Accessed 4 April 2025] |
| [W3c.org 2015] | Character encodings for beginners [online]. Available from: https://www.w3.org/International/questions/qa-what-is-encoding [Accessed 4 April 2025] |