Hiding Data with Homoglyphs: Exploiting Unicode Lookalikes

TL;DR

Homoglyphs—visually similar characters from different writing systems—can be used to covertly encode information in plain sight. By carefully substituting identical-looking characters, attackers can create hidden data channels that remain invisible to both humans and many automated systems.

🌐 Introduction

Homoglyphs are characters that look alike but have different codepoints. For example, the Latin letter a (U+0061) and the Cyrillic letter а (U+0430) are visually indistinguishable to most people, but to a computer they are completely different symbols.

This subtle difference enables a form of Unicode steganography, where data is encoded not by adding extra characters, but by swapping existing ones with similar-looking alternatives. Unlike emoji-based steganography, homoglyph substitution leaves no obvious trace, making it especially dangerous for phishing, code obfuscation, and covert communications.

TL;DR

Homoglyph steganography hides data within ordinary text by replacing certain characters with visually indistinguishable equivalents from other Unicode ranges. This makes detection challenging, as the text appears completely normal to the human eye.

💡 Technical Foundations

1️⃣ What Are Homoglyphs?

Unicode was designed to support all writing systems, which means many characters share similar or identical shapes. These visual similarities create opportunities for confusion—or deliberate exploitation.

Some common homoglyph pairs:

Character	Unicode Codepoint	Script
`a`	`U+0061`	Latin
`а`	`U+0430`	Cyrillic
`o`	`U+006F`	Latin
`ο`	`U+03BF`	Greek
`i`	`U+0069`	Latin
`і`	`U+0456`	Cyrillic
`p`	`U+0070`	Latin
`р`	`U+0440`	Cyrillic

When rendered in most fonts, these pairs are indistinguishable.
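A quick way to confirm that the pairs in the table really are distinct is to print their code points. The snippet below is a minimal sketch; `toCodePoint` is a hypothetical helper name chosen for illustration, and the lookalikes are written as escapes so the script of each character is unambiguous.

```javascript
// Format the first character of a string as a "U+XXXX" label.
// `toCodePoint` is an illustrative helper, not part of any library.
function toCodePoint(char) {
  return "U+" + char.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Pairs from the table above: [Latin letter, lookalike].
const pairs = [
  ["a", "\u0430"], // Cyrillic а
  ["o", "\u03BF"], // Greek ο
  ["i", "\u0456"], // Cyrillic і
  ["p", "\u0440"], // Cyrillic р
];

for (const [latin, lookalike] of pairs) {
  // Strict equality compares code units, so every pair compares as different.
  console.log(toCodePoint(latin), toCodePoint(lookalike), latin === lookalike);
}
```

Each line prints two different code points followed by `false`, even though the glyphs may render identically.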

2️⃣ Encoding Data Using Homoglyphs

The basic idea is simple:

1. Select a set of characters with lookalike alternatives.
2. Treat the choice of glyph as a binary signal:
   - Base character (e.g., Latin a) = 0
   - Homoglyph (e.g., Cyrillic а) = 1
3. Replace characters in the text according to the data you want to hide.

This creates a covert channel that requires no additional visible characters or unusual formatting.
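Two practical questions follow from this scheme: how a message becomes a bit string, and how many bits a given cover text can hold. The sketch below is one possible answer, assuming UTF-8 bytes at 8 bits each and a carrier set of four Latin letters (a, o, p, i). Both the framing and the helper names `textToBits` and `capacity` are assumptions made for illustration.

```javascript
// Letters that have a lookalike in this scheme (assumed carrier set).
const carriers = new Set(["a", "o", "p", "i"]);

// Convert a string to a bit string: UTF-8 bytes, 8 bits per byte, MSB first.
function textToBits(message) {
  const bytes = new TextEncoder().encode(message);
  return Array.from(bytes, (b) => b.toString(2).padStart(8, "0")).join("");
}

// Count how many bits a cover text can carry: one bit per carrier letter.
function capacity(coverText) {
  return Array.from(coverText).filter((ch) => carriers.has(ch)).length;
}

console.log(textToBits("A")); // "01000001"
console.log(capacity("password")); // 3 (p, a, o)
```

With one bit per carrier letter, a single ASCII character of payload needs eight carrier letters, so the cover text has to be considerably longer than the hidden message.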

3️⃣ Example Encoding Scheme

Here’s a simple JavaScript implementation that encodes binary data using Latin and Cyrillic homoglyphs:

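// Lookalikes are written as escapes so the script survives copy and paste.
const homoglyphMap = {
  a: "\u0430", // Latin a (U+0061) -> Cyrillic а (U+0430)
  o: "\u043E", // Latin o (U+006F) -> Cyrillic о (U+043E)
  p: "\u0440", // Latin p (U+0070) -> Cyrillic р (U+0440, the letter "er")
  i: "\u0456", // Latin i (U+0069) -> Cyrillic і (U+0456)
};

function encodeMessage(text, dataBits) {
  let bitIndex = 0;
  // Array.from iterates by code point, so astral characters are not split.
  return Array.from(text)
    .map((char) => {
      // Only carrier characters consume a bit; everything else passes through.
      if (homoglyphMap[char] && bitIndex < dataBits.length) {
        const bit = dataBits[bitIndex++];
        return bit === "1" ? homoglyphMap[char] : char;
      }
      return char;
    })
    .join("");
}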

Example usage:

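const hidden = encodeMessage("password", "101");
console.log(hidden); // "рasswоrd": Cyrillic р and о, everything else Latin

"password" contains only three carrier letters (p, a, o), so it can hold at most three bits; any extra bits passed to encodeMessage are silently dropped. To humans, "рasswоrd" looks identical to "password", but it is not the same string.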

4️⃣ Decoding the Hidden Data

To extract the hidden message, we reverse the mapping:

const reverseMap = Object.fromEntries(
  Object.entries(homoglyphMap).map(([latin, cyrillic]) => [cyrillic, latin])
);
 
function decodeMessage(stegoText) {
  const bits = [];
  for (const char of stegoText) {
    if (reverseMap[char]) {
      bits.push("1");
    } else if (homoglyphMap[char]) {
      bits.push("0");
    }
  }
  return bits.join("");
}

console.log(decodeMessage(hidden)); // "101": one bit per carrier letter (p, a, o) in "password"
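Because `decodeMessage` emits a bit for every carrier letter, any carriers left unused after the payload come back as trailing `0` bits, and the receiver needs some framing convention to recover the message. The sketch below shows one such convention, assuming the sender packed UTF-8 bytes at 8 bits each and that the payload contains no trailing NUL bytes. `bitsToText` is a hypothetical helper name, and the framing is an assumption, not part of the scheme described above.

```javascript
// Convert a decoded bit string back to text.
// Assumes UTF-8 bytes, 8 bits per byte, MSB first (illustrative framing).
function bitsToText(bits) {
  const bytes = [];
  // Ignore any incomplete trailing group of fewer than 8 bits.
  for (let i = 0; i + 8 <= bits.length; i += 8) {
    bytes.push(parseInt(bits.slice(i, i + 8), 2));
  }
  // Unused carrier letters decode as 0 bits, so drop trailing zero bytes.
  while (bytes.length > 0 && bytes[bytes.length - 1] === 0) {
    bytes.pop();
  }
  return new TextDecoder().decode(new Uint8Array(bytes));
}

console.log(bitsToText("0100100001101001")); // "Hi"
console.log(bitsToText("01001000011010010000000000")); // "Hi" (padding ignored)
```

A length prefix or an explicit terminator would be more robust than trimming zero bytes, at the cost of a few extra carrier letters.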

🔬 Practical Applications

1️⃣ Covert Communications

Homoglyph substitution allows two parties to exchange hidden data without raising suspicion. For example, a chat message could appear completely harmless while secretly carrying instructions for malware or other sensitive data.

2️⃣ Code Obfuscation and Attacks

Attackers can use homoglyphs to create visually identical but semantically different code. This was demonstrated by the Trojan Source research, whose homoglyph variant is tracked as CVE-2021-42694 (the better-known CVE-2021-42574 covers the related bidirectional-override technique).

Example:

// Looks safe:
if (user.isAdmin) {
  grantAccess();
}
 
// May actually contain homoglyphs that change the logic entirely!

To the naked eye, a version of this block containing homoglyphs would look exactly the same, but the interpreter would see different characters and therefore resolve different identifiers.
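The effect is easy to reproduce with property names. In this minimal sketch the lookalike name is written with an escape so the substitution is visible in source; the object and variable names are made up for illustration.

```javascript
// Two property names that render alike but differ in one code point.
// The second uses Cyrillic capital А (U+0410) in place of Latin A (U+0041).
const latinName = "isAdmin";
const spoofedName = "is\u0410dmin";

const user = { [latinName]: false };

// Writing to the lookalike key creates a second property
// and leaves the real flag untouched.
user[spoofedName] = true;

console.log(user.isAdmin); // false
console.log(user[spoofedName]); // true
console.log(Object.keys(user).length); // 2
console.log(latinName === spoofedName); // false
```

JavaScript accepts Cyrillic letters in identifiers, so the same confusion can occur with variable and function names typed directly into source code, not only with computed keys.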

3️⃣ Domain Name Spoofing (IDN Homograph Attacks)

Homoglyphs are also used in phishing:

paypaI.com  // Using a capital "I" instead of lowercase "l"
раураl.com  // Using Cyrillic letters

In many fonts these domains are very hard to tell apart from paypal.com, which makes them effective for social engineering attacks.
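One way to see through such a domain is to look at its ASCII form. The sketch below relies on the standard `URL` class (available as a global in Node.js and in browsers), which converts non-ASCII host names to their `xn--` Punycode form. The exact encoded label is deliberately not shown here; only the prefix is checked.

```javascript
// The genuine, all-ASCII domain is left unchanged by the URL parser.
const genuine = new URL("https://paypal.com/");

// Cyrillic р, а, у, р, а followed by Latin l, written as escapes.
const spoofed = new URL("https://\u0440\u0430\u0443\u0440\u0430l.com/");

console.log(genuine.hostname); // "paypal.com"
console.log(spoofed.hostname.startsWith("xn--")); // true
console.log(genuine.hostname === spoofed.hostname); // false
```

Showing or logging the `xn--` form instead of the rendered Unicode makes the difference obvious to a reviewer.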

🛡️ Security Implications

1️⃣ Why It’s Hard to Detect

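- The text looks perfectly normal to humans.
- Many editors and fonts render lookalike characters identically and give no hint that a different code point is present.
- Automated systems usually compare code points, not appearance, so a filter or search that looks for the Latin spelling simply does not match the altered text.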

This makes homoglyph-based steganography stealthier than zero-width characters or even emoji-based techniques.

2️⃣ Detection Strategies

To mitigate risks, developers and security teams can:

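- Normalize Unicode (NFC or NFKC) before processing, keeping in mind that normalization only folds canonical and compatibility variants such as fullwidth letters. It does not map Cyrillic or Greek lookalikes to Latin, so it is not sufficient on its own.
- Check for confusable characters using the data and algorithms in Unicode Technical Standard #39 (Unicode Security Mechanisms), for example through the spoof-checking API in the ICU libraries.
- Use diff and review tools that highlight non-ASCII or confusable characters in code reviews.
- Block mixed-script content where it is not needed.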
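Two of these strategies can be sketched with built-in JavaScript features alone. The code below first shows what NFKC normalization does and does not fold, then flags strings that mix scripts. It is a simplified illustration: `scriptsIn` and `isMixedScript` are hypothetical helpers that check only Latin, Cyrillic, and Greek, and they are far less thorough than a real confusables check.

```javascript
// Normalization folds compatibility variants but not cross-script lookalikes.
console.log("\uFF41".normalize("NFKC") === "a"); // true: fullwidth ａ becomes Latin a
console.log("\u0430".normalize("NFKC") === "a"); // false: Cyrillic а stays Cyrillic

// Unicode property escapes let a regex test for a script directly.
const scriptTests = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// List which of the checked scripts appear in a string.
function scriptsIn(text) {
  return Object.keys(scriptTests).filter((name) => scriptTests[name].test(text));
}

// Flag text that contains more than one of the checked scripts.
function isMixedScript(text) {
  return scriptsIn(text).length > 1;
}

console.log(scriptsIn("p\u0430ssword")); // ["Latin", "Cyrillic"]
console.log(isMixedScript("password")); // false
console.log(isMixedScript("p\u0430ssword")); // true
```

A check like this misses strings written entirely in one non-Latin script and will also flag legitimate multilingual text, so it fits best on fields such as identifiers, usernames, and domain labels where a single script is expected.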

💁🏼‍♀️ Summary

Homoglyph-based steganography demonstrates how even ordinary-looking text can be weaponized. By swapping characters for visually identical alternatives, it’s possible to hide binary data, create covert communication channels, or launch subtle attacks on software systems and end-users.

This technique is particularly challenging because it exploits human perception itself—we simply can’t distinguish between similar glyphs without specialized tools.

As Unicode continues to evolve and incorporate more scripts, the attack surface for homoglyph-based exploits will only grow. Security professionals must stay vigilant and integrate Unicode-aware detection mechanisms into their workflows to prevent abuse.

💻 dawid.dev
