DevToolNow

Character Encoding: ASCII, UTF-8, UTF-16 — What Engineers Actually Need

DevToolNow Editorial Team · ~11 min read

Character encoding is the source of most "weird character" bugs — mojibake, broken emoji, "이게 뭐지" (Korean for "what is this?") turning into "ì´ê²Œ ë­ì§€". The fix is rarely complex; the confusion is. This guide separates the concepts (Unicode, code points, encodings) and gives the rules that prevent encoding bugs in production.

1. Unicode vs encoding — the fundamental distinction

Unicode is a character set. It assigns a number (a "code point", written U+XXXX) to every character humans use:

  • 'A' → U+0041
  • '한' → U+D55C
  • '🎉' → U+1F389
  • '☕' → U+2615

Unicode 16.0 (released 2024) defines 154,998 characters. The maximum possible code point is U+10FFFF (about 1.1 million slots).

An encoding is a way to turn code points into bytes for storage. UTF-8, UTF-16, UTF-32 are three different encodings of the same Unicode characters.
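The distinction is easy to see in code. A Node.js sketch (using the built-in Buffer API) shows one character, one code point, and two different byte layouts:

```javascript
const s = '한'; // one character, code point U+D55C

// The code point is encoding-independent:
console.log(s.codePointAt(0).toString(16)); // "d55c"

// Encodings differ only in how that number becomes bytes:
const utf8  = Buffer.from(s, 'utf8');    // <ed 95 9c> — 3 bytes
const utf16 = Buffer.from(s, 'utf16le'); // <5c d5>    — 2 bytes, little-endian
console.log(utf8.length, utf16.length);  // 3 2
```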

2. ASCII — the 7-bit ancestor

ASCII (American Standard Code for Information Interchange) defines 128 characters (0x00–0x7F): English letters, digits, basic punctuation, control codes. Each character is exactly 1 byte (with the high bit always 0).

ASCII alone can't represent any non-English language. But its 128 characters are a subset of Unicode (they're U+0000 through U+007F), and UTF-8 was designed to encode them identically — which is why ASCII files are valid UTF-8.
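A quick Node.js check of that subset property — pure ASCII bytes decode as UTF-8 without change:

```javascript
// 'H' = 0x48, 'i' = 0x69 — both below 0x80, high bit 0
const raw = Buffer.from([0x48, 0x69]);
console.log(raw.toString('ascii')); // "Hi"
console.log(raw.toString('utf8'));  // "Hi" — identical: ASCII is a UTF-8 subset
```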

3. UTF-8 — the modern default

UTF-8 (Ken Thompson and Rob Pike, 1992) is the dominant encoding on the web (98%+ as of 2024 per W3Techs). The genius is variable-length encoding:

Code point range     Bytes  Byte pattern (binary)                 Example
U+0000 – U+007F      1      0xxxxxxx                              ASCII (A–Z, 0–9)
U+0080 – U+07FF      2      110xxxxx 10xxxxxx                     Latin extended, Greek, Cyrillic
U+0800 – U+FFFF      3      1110xxxx 10xxxxxx 10xxxxxx            CJK (한, 中, 日)
U+10000 – U+10FFFF   4      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   Emoji, rare CJK
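Each row of the table can be checked directly in Node.js — the encoded byte count follows the code point range:

```javascript
const utf8Bytes = ch => Buffer.from(ch, 'utf8').length;

console.log(utf8Bytes('A'));  // 1 — U+0041, ASCII range
console.log(utf8Bytes('é'));  // 2 — U+00E9, Latin extended
console.log(utf8Bytes('한')); // 3 — U+D55C, BMP range
console.log(utf8Bytes('🎉')); // 4 — U+1F389, above the BMP
```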

Self-synchronizing: from any byte, you can find the start of the next character without scanning from the beginning. This makes UTF-8 robust to truncation and parsing errors.
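A minimal sketch of what self-synchronization buys you: continuation bytes always match 10xxxxxx, so from any offset you can walk back to a character boundary (charStart is an illustrative helper, not a standard API):

```javascript
// Continuation bytes are 10xxxxxx, i.e. (b & 0xC0) === 0x80
function charStart(buf, i) {
  while (i > 0 && (buf[i] & 0xC0) === 0x80) i--; // skip continuation bytes
  return i;
}

const b = Buffer.from('a한b', 'utf8'); // bytes: 61 | ed 95 9c | 62
console.log(charStart(b, 2)); // 1 — offset 2 is mid-'한'; the character starts at 1
console.log(charStart(b, 4)); // 4 — offset 4 is 'b', already a boundary
```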

Storage cost (English): Same as ASCII (1 byte per character).

Storage cost (Korean / Chinese / Japanese): 3 bytes per character. A Korean string takes ~3× the space of the equivalent English text (and 1.5× its UTF-16 size, since BMP CJK characters are 2 bytes there). Worth knowing for very large CJK datasets, but rarely a reason to choose UTF-16 instead.
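The cost figures are easy to verify in Node.js:

```javascript
const ko = '안녕하세요'; // 5 Hangul syllables, all in the BMP
console.log(Buffer.from(ko, 'utf8').length);      // 15 — 3 bytes each
console.log(Buffer.from(ko, 'utf16le').length);   // 10 — 2 bytes each
console.log(Buffer.from('hello', 'utf8').length); // 5  — 1 byte each
```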

4. UTF-16 — the JavaScript and Java legacy

UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use 1 code unit (2 bytes). Characters above (emoji, rare CJK) use a surrogate pair — 2 code units (4 bytes total).

// JavaScript / Java strings are UTF-16
'A'.length         // 1 (one code unit)
'한'.length        // 1 (one code unit, BMP)
'🎉'.length        // 2 ❗ surrogate pair

// To count user-perceived characters
[...'🎉'].length             // 1
Array.from('🎉').length      // 1

// Even better: count grapheme clusters (handles 👨‍👩‍👦)
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👦')].length   // 1
'👨‍👩‍👦'.length                       // 8 ❗ (3 emoji × 2 code units + 2 ZWJs)

Why UTF-16 exists: Java (1996) and Windows NT (1993) committed to 16-bit characters (UCS-2) back when all of Unicode fit in 16 bits. When Unicode grew past U+FFFF, surrogate pairs were retrofitted, turning UCS-2 into UTF-16. JavaScript inherits the choice via the ECMAScript spec.
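The surrogate-pair mapping itself is simple arithmetic. A worked sketch for U+1F389 (🎉): subtract 0x10000, then split the remaining 20 bits into two halves of 10:

```javascript
const cp   = 0x1F389 - 0x10000;     // 0x0F389 — 20 bits remain
const high = 0xD800 + (cp >> 10);   // 0xD83C — high (lead) surrogate
const low  = 0xDC00 + (cp & 0x3FF); // 0xDF89 — low (trail) surrogate

// JavaScript's UTF-16 code units agree:
console.log('🎉'.charCodeAt(0) === high); // true
console.log('🎉'.charCodeAt(1) === low);  // true
```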

5. The mojibake catalog

Mojibake = wrong encoding interpretation. Common patterns:

Original   What you see     Cause
한글       í•œê¸€          UTF-8 bytes interpreted as Latin-1/Win-1252
🎉         ? or an error    4-byte character stored in a MySQL utf8 (3-byte max) column
café       café             UTF-8 read as Latin-1 (most common)
naïve      naïve            Same Latin-1 misinterpretation of UTF-8
한글       \uD55C\uAE00     JSON over-escaping (some libraries default to ASCII-only)

6. The MySQL utf8 trap

MySQL's utf8 charset is not real UTF-8. It's a 3-byte-max subset that cannot store characters above U+FFFF — including all emoji and many rare CJK characters.

Use utf8mb4 (added in MySQL 5.5.3, 2010). In MySQL 8.0, utf8 is an alias for the 3-byte utf8mb3 — treat both as legacy names to avoid:

-- Database level
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Table level
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Connection level (in your client config)
SET NAMES utf8mb4;

7. Practical rules

  1. Use UTF-8 everywhere. Files, databases, HTTP, source code. Default for all new systems.
  2. Set Content-Type charset. Content-Type: text/html; charset=utf-8 on every HTML response. (application/json needs no charset parameter — JSON is UTF-8 by definition per RFC 8259.)
  3. HTML <meta>. <meta charset="utf-8"> as the very first thing in <head>.
  4. MySQL: utf8mb4, never utf8. Set at database, table, column, AND connection level.
  5. Don't use BOM in UTF-8. Breaks bash scripts, CSV parsers, JSON parsers.
  6. Don't trust string.length. In JavaScript, use Intl.Segmenter for user-visible character counts.
  7. Compare normalized. "café" can be 1 code point (U+00E9) or 2 (U+0065 + U+0301). Use String.prototype.normalize('NFC') before comparison.
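Rule 7 in code — the same rendered string, two different code point sequences, equal only after normalization:

```javascript
const composed   = 'caf\u00E9';  // é as one code point (U+00E9)
const decomposed = 'cafe\u0301'; // e + combining acute accent (U+0301)

console.log(composed === decomposed); // false — same rendering, different code points
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true
```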

FAQ

Q. Why does my emoji look like two characters in JavaScript's string.length?

A. JavaScript strings are UTF-16, and emoji like 🎉 require a 'surrogate pair' (two 16-bit code units) because they're outside the Basic Multilingual Plane. So '🎉'.length === 2. Use [...'🎉'].length or Array.from('🎉').length to get the user-perceived character count (1). For grapheme cluster counting, use Intl.Segmenter (available in Node 16+ and modern browsers).

Q. Should I use BOM (Byte Order Mark)?

A. Almost never for UTF-8. The BOM is meaningful for UTF-16 (where byte order matters) but the UTF-8 BOM (EF BB BF) breaks many parsers — bash scripts fail, JSON parsers throw, CSV readers see a corrupted first column. Some Microsoft tools insert it; most modern toolchains strip it. Default: don't use UTF-8 BOM.
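A defensive sketch for inbound text, assuming you may receive files from BOM-inserting tools (stripBom is an illustrative helper, not a standard API):

```javascript
// A UTF-8 BOM (EF BB BF) decodes to U+FEFF as the first code unit
function stripBom(s) {
  return s.charCodeAt(0) === 0xFEFF ? s.slice(1) : s;
}

console.log(stripBom('\uFEFFhello')); // "hello"
console.log(stripBom('hello'));       // "hello" — unchanged when no BOM
```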

Q. What's the difference between Unicode and UTF-8?

A. Unicode is the character set — a registry mapping characters to numbers (code points). 'A' is U+0041, '한' is U+D55C, '🎉' is U+1F389. UTF-8 is one of several encodings that turn code points into bytes for storage and transmission. UTF-16 and UTF-32 are alternatives. Unicode says 'what', UTF-* says 'how to store it'.

Q. Is UTF-8 always backward-compatible with ASCII?

A. Yes for the first 128 code points (U+0000 to U+007F) — they're encoded identically in 1 byte. ASCII files are valid UTF-8 with no changes. But ASCII is not forward-compatible — it can only represent those 128 characters. Any '8-bit ASCII' (Latin-1, Windows-1252) is not real ASCII and is NOT compatible with UTF-8 — these are the source of most mojibake.

Q. What encoding should databases use?

A. MySQL: utf8mb4 (NOT 'utf8' — that's a 3-byte legacy encoding that can't store emoji). PostgreSQL: UTF8 by default, no caveat. Use COLLATE utf8mb4_unicode_ci or utf8mb4_0900_ai_ci for case-insensitive matching across languages. Set this at database, table, AND column level — new tables inherit the database default, but existing tables and columns keep whatever charset they were created with.

About the DevToolNow Editorial Team

DevToolNow's editorial team is made up of working software developers who use these tools every day. Every guide is reviewed against primary sources — IETF RFCs, W3C/WHATWG specifications, MDN Web Docs, and project repositories on GitHub — before publication. We update articles when standards change so the guidance stays current.

Sources we cite: IETF RFCs · MDN Web Docs · WHATWG · ECMAScript spec · Official project READMEs on GitHub