DevToolNow

Character Encoding: ASCII, UTF-8, UTF-16 — What Engineers Actually Need

DevToolNow Editorial Team · ~11 min read

Character encoding is the source of most "weird character" bugs — mojibake, broken emoji, "이게 뭐지" (Korean for "what is this?") turning into "ì´ê²Œ ë­ì§€". The fix is rarely complex; the confusion is. This guide separates the concepts (Unicode, code points, encodings) and gives the rules that prevent encoding bugs in production.

1. Unicode vs encoding — the fundamental distinction

Unicode is a character set. It assigns a number (a "code point", written U+XXXX) to every character humans use:

  • 'A' → U+0041
  • '한' → U+D55C
  • '🎉' → U+1F389
  • '☕' → U+2615

Unicode 16.0 (released 2024) defines 154,998 characters. The maximum possible code point is U+10FFFF (about 1.1 million slots).

An encoding is a way to turn code points into bytes for storage. UTF-8, UTF-16, UTF-32 are three different encodings of the same Unicode characters.
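The distinction is easy to see in code. A Node.js sketch (using the built-in Buffer API) shows one character, one code point, and two different byte layouts:

```javascript
const s = '한'; // one character, code point U+D55C

// The code point is encoding-independent:
console.log(s.codePointAt(0).toString(16)); // "d55c"

// Encodings differ only in how that number becomes bytes:
const utf8  = Buffer.from(s, 'utf8');    // <ed 95 9c> — 3 bytes
const utf16 = Buffer.from(s, 'utf16le'); // <5c d5>    — 2 bytes, little-endian
console.log(utf8.length, utf16.length);  // 3 2
```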

2. ASCII — the 7-bit ancestor

ASCII (American Standard Code for Information Interchange) defines 128 characters (0x00–0x7F): English letters, digits, basic punctuation, control codes. Each character is exactly 1 byte (with the high bit always 0).

ASCII alone can't represent any non-English language. But its 128 characters are a subset of Unicode (they're U+0000 through U+007F), and UTF-8 was designed to encode them identically — which is why ASCII files are valid UTF-8.
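A quick Node.js check of that subset property — pure ASCII bytes decode as UTF-8 without change:

```javascript
// 'H' = 0x48, 'i' = 0x69 — both below 0x80, high bit 0
const raw = Buffer.from([0x48, 0x69]);
console.log(raw.toString('ascii')); // "Hi"
console.log(raw.toString('utf8'));  // "Hi" — identical: ASCII is a UTF-8 subset
```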

3. UTF-8 — the modern default

UTF-8 (Ken Thompson and Rob Pike, 1992) is the dominant encoding on the web (98%+ as of 2024 per W3Techs). The genius is variable-length encoding:

Code point range     Bytes  Byte pattern (binary)                 Example
U+0000 – U+007F      1      0xxxxxxx                              ASCII (A–Z, 0–9)
U+0080 – U+07FF      2      110xxxxx 10xxxxxx                     Latin extended, Greek, Cyrillic
U+0800 – U+FFFF      3      1110xxxx 10xxxxxx 10xxxxxx            CJK (한, 中, 日)
U+10000 – U+10FFFF   4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   Emoji, rare CJK
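Each row of the table can be checked directly in Node.js — the encoded byte count follows the code point range:

```javascript
const utf8Bytes = ch => Buffer.from(ch, 'utf8').length;

console.log(utf8Bytes('A'));  // 1 — U+0041, ASCII range
console.log(utf8Bytes('é'));  // 2 — U+00E9, Latin extended
console.log(utf8Bytes('한')); // 3 — U+D55C, BMP range
console.log(utf8Bytes('🎉')); // 4 — U+1F389, above the BMP
```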

Self-synchronizing: from any byte, you can find the start of the next character without scanning from the beginning. This makes UTF-8 robust to truncation and parsing errors.
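A minimal sketch of what self-synchronization buys you: continuation bytes always match 10xxxxxx, so from any offset you can walk back to a character boundary (charStart is an illustrative helper, not a standard API):

```javascript
// Continuation bytes are 10xxxxxx, i.e. (b & 0xC0) === 0x80
function charStart(buf, i) {
  while (i > 0 && (buf[i] & 0xC0) === 0x80) i--; // skip continuation bytes
  return i;
}

const b = Buffer.from('a한b', 'utf8'); // bytes: 61 | ed 95 9c | 62
console.log(charStart(b, 2)); // 1 — offset 2 is mid-'한'; the character starts at 1
console.log(charStart(b, 4)); // 4 — offset 4 is 'b', already a boundary
```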

Storage cost (English): Same as ASCII (1 byte per character).

Storage cost (Korean / Chinese / Japanese): 3 bytes per character. A Korean string takes ~3× the space of the equivalent English text (and 1.5× its UTF-16 size, since BMP CJK characters are 2 bytes there). Worth knowing for very large CJK datasets, but rarely a reason to choose UTF-16 instead.
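The cost figures are easy to verify in Node.js:

```javascript
const ko = '안녕하세요'; // 5 Hangul syllables, all in the BMP
console.log(Buffer.from(ko, 'utf8').length);      // 15 — 3 bytes each
console.log(Buffer.from(ko, 'utf16le').length);   // 10 — 2 bytes each
console.log(Buffer.from('hello', 'utf8').length); // 5  — 1 byte each
```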

4. UTF-16 — the JavaScript and Java legacy

UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use 1 code unit (2 bytes). Characters above (emoji, rare CJK) use a surrogate pair — 2 code units (4 bytes total).

// JavaScript / Java strings are UTF-16
'A'.length         // 1 (one code unit)
'한'.length        // 1 (one code unit, BMP)
'🎉'.length        // 2 ❗ surrogate pair

// To count user-perceived characters
[...'🎉'].length             // 1
Array.from('🎉').length      // 1

// Even better: count grapheme clusters (handles 👨‍👩‍👦)
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👦')].length   // 1
'👨‍👩‍👦'.length                       // 8 ❗ (3 emoji × 2 code units + 2 ZWJs)

Why UTF-16 exists: Java (1996) and Windows NT (1993) committed to 16-bit characters (UCS-2) back when all of Unicode fit in 16 bits. When Unicode grew past U+FFFF, surrogate pairs were retrofitted, turning UCS-2 into UTF-16. JavaScript inherits the choice via the ECMAScript spec.
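The surrogate-pair mapping itself is simple arithmetic. A worked sketch for U+1F389 (🎉): subtract 0x10000, then split the remaining 20 bits into two halves of 10:

```javascript
const cp   = 0x1F389 - 0x10000;     // 0x0F389 — 20 bits remain
const high = 0xD800 + (cp >> 10);   // 0xD83C — high (lead) surrogate
const low  = 0xDC00 + (cp & 0x3FF); // 0xDF89 — low (trail) surrogate

// JavaScript's UTF-16 code units agree:
console.log('🎉'.charCodeAt(0) === high); // true
console.log('🎉'.charCodeAt(1) === low);  // true
```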

5. The mojibake catalog

Mojibake = wrong encoding interpretation. Common patterns:

Original   What you see     Cause
한글       í•œê¸€          UTF-8 bytes interpreted as Latin-1/Win-1252
🎉         ? or an error    4-byte character stored in a MySQL utf8 (3-byte max) column
café       café             UTF-8 read as Latin-1 (most common)
naïve      naïve            Same Latin-1 misinterpretation of UTF-8
한글       \uD55C\uAE00     JSON over-escaping (some libraries default to ASCII-only)

6. The MySQL utf8 trap

MySQL's utf8 charset is not real UTF-8. It's a 3-byte-max subset that cannot store characters above U+FFFF — including all emoji and many rare CJK characters.

Use utf8mb4 (added in MySQL 5.5.3, 2010). In MySQL 8.0, utf8 is an alias for the 3-byte utf8mb3 — treat both as legacy names to avoid:

-- Database level
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Table level
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Connection level (in your client config)
SET NAMES utf8mb4;

7. Practical rules

  1. Use UTF-8 everywhere. Files, databases, HTTP, source code. Default for all new systems.
  2. Set Content-Type charset. Content-Type: text/html; charset=utf-8 on every HTML response. (application/json needs no charset parameter — JSON is UTF-8 by definition per RFC 8259.)
  3. HTML <meta>. <meta charset="utf-8"> as the very first thing in <head>.
  4. MySQL: utf8mb4, never utf8. Set at database, table, column, AND connection level.
  5. Don't use BOM in UTF-8. Breaks bash scripts, CSV parsers, JSON parsers.
  6. Don't trust string.length. In JavaScript, use Intl.Segmenter for user-visible character counts.
  7. Compare normalized. "café" can be 1 code point (U+00E9) or 2 (U+0065 + U+0301). Use String.prototype.normalize('NFC') before comparison.
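Rule 7 in code — the same rendered string, two different code point sequences, equal only after normalization:

```javascript
const composed   = 'caf\u00E9';  // é as one code point (U+00E9)
const decomposed = 'cafe\u0301'; // e + combining acute accent (U+0301)

console.log(composed === decomposed); // false — same rendering, different code points
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true
```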

FAQ

Q. Why does my emoji look like two characters in JavaScript's string.length?

A. JavaScript strings are UTF-16, and emoji like 🎉 require a 'surrogate pair' (two 16-bit code units) because they're outside the Basic Multilingual Plane. So '🎉'.length === 2. Use [...'🎉'].length or Array.from('🎉').length to get the user-perceived character count (1). For grapheme cluster counting, use Intl.Segmenter (available in Node 16+ and modern browsers).

Q. Should I use BOM (Byte Order Mark)?

A. Almost never for UTF-8. The BOM is meaningful for UTF-16 (where byte order matters) but the UTF-8 BOM (EF BB BF) breaks many parsers — bash scripts fail, JSON parsers throw, CSV readers see a corrupted first column. Some Microsoft tools insert it; most modern toolchains strip it. Default: don't use UTF-8 BOM.
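A defensive sketch for inbound text, assuming you may receive files from BOM-inserting tools (stripBom is an illustrative helper, not a standard API):

```javascript
// A UTF-8 BOM (EF BB BF) decodes to U+FEFF as the first code unit
function stripBom(s) {
  return s.charCodeAt(0) === 0xFEFF ? s.slice(1) : s;
}

console.log(stripBom('\uFEFFhello')); // "hello"
console.log(stripBom('hello'));       // "hello" — unchanged when no BOM
```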

Q. What's the difference between Unicode and UTF-8?

A. Unicode is the character set — a registry mapping characters to numbers (code points). 'A' is U+0041, '한' is U+D55C, '🎉' is U+1F389. UTF-8 is one of several encodings that turn code points into bytes for storage and transmission. UTF-16 and UTF-32 are alternatives. Unicode says 'what', UTF-* says 'how to store it'.

Q. Is UTF-8 always backward-compatible with ASCII?

A. Yes for the first 128 code points (U+0000 to U+007F) — they're encoded identically in 1 byte. ASCII files are valid UTF-8 with no changes. But ASCII is not forward-compatible — it can only represent those 128 characters. Any '8-bit ASCII' (Latin-1, Windows-1252) is not real ASCII and is NOT compatible with UTF-8 — these are the source of most mojibake.

Q. What encoding should databases use?

A. MySQL: utf8mb4 (NOT 'utf8' — that's a 3-byte legacy encoding that can't store emoji). PostgreSQL: UTF8 by default, no caveat. Use COLLATE utf8mb4_unicode_ci or utf8mb4_0900_ai_ci for case-insensitive matching across languages. Set this at database, table, AND column level — new tables inherit the database default, but existing tables and columns keep whatever charset they were created with.

About the DevToolNow Editorial Team

DevToolNow's editorial team is made up of working software developers who use these tools every day. Every guide is reviewed against primary sources — IETF RFCs, W3C/WHATWG specifications, MDN Web Docs, and project repositories on GitHub — before publication. We update articles when standards change so the guidance stays current.

Sources we cite: IETF RFCs · MDN Web Docs · WHATWG · ECMAScript spec · Official project READMEs on GitHub