Hiding Data in Emoji: An Introduction to Unicode Steganography

TL;DR

Emoji can hide arbitrary data in plain sight. Unicode defines 256 invisible variation selectors, one for each possible byte value, and they survive most text processing even when attached to a character that has no defined variations. Appending a run of them to an emoji (or to any other character) creates a covert data channel with no visible footprint.

🌐 Introduction

Steganography, the art of hiding data within other data, has traditionally focused on media files such as images or audio. Paul Butler’s research demonstrates a different approach: encoding arbitrary data within a seemingly innocent emoji, creating a communication channel that is invisible to the eye and works across most modern platforms.

💡 Technical Foundations

1️⃣ Understanding Unicode and Emoji

Unicode represents text as a sequence of codepoints. Each codepoint is essentially a number that the Unicode Consortium has assigned meaning to. Codepoints are written as U+XXXX, where XXXX is a number in hexadecimal format.
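
To make the notation concrete, here is a minimal sketch that prints the code points of a string in U+XXXX form. The helper names `formatCodePoint` and `listCodePoints` are made up for this illustration; `codePointAt` and string iteration are standard JavaScript.

```javascript
// Format a code point in the conventional U+XXXX notation (at least four hex digits).
function formatCodePoint(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Spreading a string walks it by code point, not by UTF-16 code unit,
// so characters outside the Basic Multilingual Plane stay in one piece.
function listCodePoints(text) {
  return [...text].map((ch) => formatCodePoint(ch.codePointAt(0)));
}

console.log(listCodePoints("A"));         // [ 'U+0041' ]
console.log(listCodePoints("\u{1F468}")); // [ 'U+1F468' ] (the man emoji)
console.log("\u{1F468}".length);          // 2, because length counts UTF-16 code units
```

The gap between `length` and the number of code points matters later: the hidden payload is easy to miss on screen but not in these counts.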

Modern emoji rely on several Unicode mechanisms that can be manipulated:

Base Characters: The fundamental emoji codepoint (e.g., U+1F468 for the man emoji)
Variation Selectors: Modifiers that change appearance (text vs. emoji style)
Zero-Width Joiner (ZWJ, U+200D): An invisible character that joins adjacent emoji into a single glyph
Skin Tone Modifiers: Change the appearance of human emoji

👨 (U+1F468) - Base emoji (man)
👨‍💻 (U+1F468 + ZWJ + U+1F4BB) - Compound emoji (man technologist)
👨🏽 (U+1F468 + U+1F3FD) - Skin tone modifier (man: medium skin tone)
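
The three sequences above can be built directly from their code points. This is a small sketch using the standard `String.fromCodePoint`; the constant names are chosen for this example only.

```javascript
const MAN = 0x1f468;
const LAPTOP = 0x1f4bb;          // personal computer
const ZWJ = 0x200d;              // zero-width joiner
const MEDIUM_SKIN_TONE = 0x1f3fd;

const technologist = String.fromCodePoint(MAN, ZWJ, LAPTOP);
const manMediumSkin = String.fromCodePoint(MAN, MEDIUM_SKIN_TONE);

// Each sequence renders as a single glyph where the font supports it,
// yet it is made of several code points underneath.
console.log(technologist, [...technologist].length);   // 3 code points
console.log(manMediumSkin, [...manMediumSkin].length); // 2 code points
```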

2️⃣ Variation Selectors as Data Carriers

Unicode designates 256 codepoints as “variation selectors,” numbered VS-1 to VS-256. These characters have no on-screen representation of their own but are used to modify the presentation of the preceding character.

Importantly, most Unicode characters don’t have variations associated with them. Since Unicode is an evolving standard aiming to be future-compatible, variation selectors are preserved during transformations, even if their meaning isn’t known by the code processing them.

This gives us a way to “hide” data in Unicode characters!

Variation selectors are split into two ranges of codepoints:

The original set of 16 selectors: U+FE00 .. U+FE0F
The remaining 240 selectors: U+E0100 .. U+E01EF
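
As a quick sanity check on those two ranges, the sketch below counts the selectors and tests whether a code point is one of them. `VS_RANGES` and `isVariationSelector` are names invented for this example.

```javascript
// The two variation selector ranges listed above.
const VS_RANGES = [
  { start: 0xfe00, end: 0xfe0f },   // VS-1 .. VS-16
  { start: 0xe0100, end: 0xe01ef }, // VS-17 .. VS-256
];

function isVariationSelector(codePoint) {
  return VS_RANGES.some(({ start, end }) => codePoint >= start && codePoint <= end);
}

// 16 + 240 selectors: exactly one per possible byte value.
const total = VS_RANGES.reduce((sum, { start, end }) => sum + (end - start + 1), 0);
console.log(total);                        // 256
console.log(isVariationSelector(0xfe0f));  // true
console.log(isVariationSelector(0x1f60a)); // false (a regular emoji)
```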

3️⃣ Encoding Scheme

Because there are exactly 256 variation selectors, each byte value maps to one selector: bytes 0 to 15 map to U+FE00..U+FE0F, and bytes 16 to 255 map to U+E0100..U+E01EF.

// Convert a byte to a variation selector
function byteToVariationSelector(byte) {
  if (byte < 16) {
    return String.fromCodePoint(0xfe00 + byte);
  } else {
    return String.fromCodePoint(0xe0100 + (byte - 16));
  }
}

// Encode a sequence of bytes
function encode(base, bytes) {
  let result = base;
  for (const byte of bytes) {
    result += byteToVariationSelector(byte);
  }
  return result;
}

This approach allows us to encode arbitrary binary data after any existing character, such as an emoji.
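
The channel is not free in terms of size. The sketch below estimates the cost under the assumption that the text is stored or sent as UTF-8: selectors in the first range take three bytes each, those in the second range take four. `byteToVariationSelector` repeats the mapping above with an added range check, and `encodedSizeInUtf8` is a hypothetical helper written for this example.

```javascript
// Same mapping as above, with a guard against values that are not bytes.
function byteToVariationSelector(byte) {
  if (!Number.isInteger(byte) || byte < 0 || byte > 255) {
    throw new RangeError(`Not a byte: ${byte}`);
  }
  return String.fromCodePoint(byte < 16 ? 0xfe00 + byte : 0xe0100 + (byte - 16));
}

// Hypothetical helper: UTF-8 size of the selectors needed for a payload.
function encodedSizeInUtf8(payload) {
  const encoder = new TextEncoder();
  let total = 0;
  for (const byte of payload) {
    total += encoder.encode(byteToVariationSelector(byte)).length;
  }
  return total;
}

console.log(encodedSizeInUtf8([0x00])); // 3 (U+FE00 is three UTF-8 bytes)
console.log(encodedSizeInUtf8([0x68])); // 4 (U+E0158 is four UTF-8 bytes)
console.log(encodedSizeInUtf8([0x68, 0x65, 0x6c, 0x6c, 0x6f])); // 20
```

So a hidden payload costs roughly four times its own size on the wire, which is worth keeping in mind on platforms that limit message size in bytes.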

🔬 Practical Application

1️⃣ Encoding Example

Encoding the word “hello” (bytes [0x68, 0x65, 0x6c, 0x6c, 0x6f]) using the 😊 emoji:

// Example usage
console.log(encode("😊", [0x68, 0x65, 0x6c, 0x6c, 0x6f]));

Result: 😊󠅘󠅕󠅜󠅜󠅟

To human eyes, it appears as a regular emoji, but it contains hidden data!
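
The payload is invisible on screen but not to code. As a sketch, the string below is rebuilt from the code points that the mapping produces for "hello", and then inspected; the variable names are illustrative only.

```javascript
// The smiling emoji followed by the selectors for h, e, l, l, o.
const encoded = String.fromCodePoint(
  0x1f60a, 0xe0158, 0xe0155, 0xe015c, 0xe015c, 0xe015f
);

console.log(encoded.length);      // 12 UTF-16 code units (2 per code point here)
console.log([...encoded].length); // 6 code points: 1 emoji + 5 selectors

// Dump the hidden part in hexadecimal.
const hiddenCodePoints = [...encoded].slice(1).map((ch) => ch.codePointAt(0).toString(16));
console.log(hiddenCodePoints); // [ 'e0158', 'e0155', 'e015c', 'e015c', 'e015f' ]
```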

2️⃣ Decoding

Recovering the hidden data is equally straightforward:

function variationSelectorToByte(variationSelector) {
  const codePoint = variationSelector.codePointAt(0);
  if (codePoint >= 0xfe00 && codePoint <= 0xfe0f) {
    return codePoint - 0xfe00;
  } else if (codePoint >= 0xe0100 && codePoint <= 0xe01ef) {
    return codePoint - 0xe0100 + 16;
  }
  return null;
}

function decode(str) {
  const result = [];

  // for...of walks the string by code point, so selectors outside the BMP stay intact
  for (const char of str) {
    const byte = variationSelectorToByte(char);
    if (byte !== null) {
      result.push(byte);
    } else if (result.length > 0) {
      // A non-selector after the payload marks the end of the hidden data
      break;
    }
    // Otherwise we are still on the base character(s) before the payload; skip
  }

  return result;
}

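// Convert bytes back to a string, interpreting them as UTF-8
// (String.fromCharCode would only be correct for ASCII payloads)
function bytesToString(bytes) {
  return new TextDecoder().decode(new Uint8Array(bytes));
}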
 
const encoded = encode("😊", [0x68, 0x65, 0x6c, 0x6c, 0x6f]);
console.log(bytesToString(decode(encoded))); // "hello"
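
Since the channel carries raw bytes, it is not limited to ASCII. Below is a self-contained round-trip sketch that assumes the payload is UTF-8 text; `hideText` and `revealText` are hypothetical wrappers around the same mapping, not part of the original research code.

```javascript
// Byte -> selector and selector -> byte, as in the functions above.
const toSelector = (byte) =>
  String.fromCodePoint(byte < 16 ? 0xfe00 + byte : 0xe0100 + (byte - 16));

const toByte = (codePoint) => {
  if (codePoint >= 0xfe00 && codePoint <= 0xfe0f) return codePoint - 0xfe00;
  if (codePoint >= 0xe0100 && codePoint <= 0xe01ef) return codePoint - 0xe0100 + 16;
  return null;
};

// Hypothetical helper: hide a UTF-8 encoded message after a base character.
function hideText(base, message) {
  const bytes = new TextEncoder().encode(message);
  return base + Array.from(bytes, (byte) => toSelector(byte)).join("");
}

// Hypothetical helper: read the first run of selectors back as UTF-8 text.
function revealText(carrier) {
  const bytes = [];
  for (const ch of carrier) {
    const byte = toByte(ch.codePointAt(0));
    if (byte !== null) bytes.push(byte);
    else if (bytes.length > 0) break; // end of the first hidden run
  }
  return new TextDecoder().decode(new Uint8Array(bytes));
}

const carrier = hideText("\u{1F60A}", "caf\u00e9 \u2615");
console.log(revealText(carrier));  // "café ☕"
console.log([...carrier].length);  // 10: one emoji plus nine selectors
```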

🛡️ Security Implications

1️⃣ Detection and Prevention

This technique is difficult to detect because:

The encoded emoji appears normal to humans
Counting visible characters won’t reveal anomalies (code point and byte counts do grow, but few people check them)
Standard text processing preserves the encoding
Most platforms transmit Unicode characters faithfully

Detection challenges include (the figures below are quoted without a source and are best read as rough estimates):

78% of IDEs don’t render variation selectors
92% of code review tools ignore non-printing Unicode characters
Almost no malware scanners check for variation selector patterns
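
On the defensive side, a scanner does not need to be sophisticated. The sketch below rests on one assumption: legitimate text places at most one variation selector after a character (for example U+FE0F to request emoji presentation), so any run of two or more is suspicious. `findSelectorRuns` and `stripSelectorRuns` are hypothetical names, and a single hidden byte would slip past this particular policy.

```javascript
function isVariationSelector(codePoint) {
  return (codePoint >= 0xfe00 && codePoint <= 0xfe0f) ||
         (codePoint >= 0xe0100 && codePoint <= 0xe01ef);
}

// Return every run of at least minLength consecutive variation selectors.
// Positions are counted in code points, not UTF-16 code units.
function findSelectorRuns(text, minLength = 2) {
  const runs = [];
  let current = null;
  let index = 0;
  for (const ch of text) {
    if (isVariationSelector(ch.codePointAt(0))) {
      if (current === null) current = { start: index, length: 0 };
      current.length += 1;
    } else {
      if (current !== null && current.length >= minLength) runs.push(current);
      current = null;
    }
    index += 1;
  }
  if (current !== null && current.length >= minLength) runs.push(current);
  return runs;
}

// Remove the flagged runs while leaving lone selectors untouched.
function stripSelectorRuns(text, minLength = 2) {
  const drop = new Set();
  for (const { start, length } of findSelectorRuns(text, minLength)) {
    for (let i = start; i < start + length; i++) drop.add(i);
  }
  return [...text].filter((_, i) => !drop.has(i)).join("");
}

// A smiley carrying two hidden bytes, then a heart with a legitimate U+FE0F.
const suspicious = String.fromCodePoint(0x1f60a, 0xe0158, 0xe0155) + " ok \u2764\ufe0f";
console.log(findSelectorRuns(suspicious)); // [ { start: 1, length: 2 } ]
console.log(stripSelectorRuns(suspicious) === "\u{1F60A} ok \u2764\ufe0f"); // true
```

A stricter policy could also reject any selector from the U+E0100..U+E01EF range, which rarely appears in everyday text, at the cost of breaking the ideographic variation sequences that use it.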

2️⃣ Potential Applications and Concerns

Potential applications of this technique are diverse:

Legitimate:

Digital watermarking
Metadata preservation
Creative data embedding
Privacy-preserving communication

Concerning:

Covert communications
Malware command-and-control
Data exfiltration
Bypassing content monitoring

💁🏼‍♀️ Summary

The emoji data smuggling technique demonstrates the unexpected flexibility of Unicode, particularly how a standard designed for representing text can be repurposed for covert data transfer. While fascinating from a technical perspective, it highlights ongoing challenges in information security and content monitoring.

For developers, this research emphasizes the importance of understanding the full implications of the technologies we use. Unicode’s complexity creates both opportunities and challenges that extend far beyond simple text representation.

As communication platforms continue to normalize rich text features like emoji, we can expect to see more creative applications—both benign and potentially problematic—of these underlying technical capabilities.

💻 dawid.dev
