Character Encoding: ASCII, UTF-8, UTF-16 — What Engineers Actually Need
Character encoding is the source of most "weird character" bugs — mojibake, broken emoji, "이게 뭐지" turning into "ì´ê²Œ ëì§€". The fix is rarely complex; the confusion is. This guide separates the concepts (Unicode, code points, encodings) and gives the rules that prevent encoding bugs in production.
1. Unicode vs encoding — the fundamental distinction
Unicode is a character set. It assigns a number (a "code point", written U+XXXX) to every character humans use:
- 'A' → U+0041
- '한' → U+D55C
- '🎉' → U+1F389
- '☕' → U+2615
Unicode 16.0 (released 2024) defines 154,998 characters. The maximum possible code point is U+10FFFF (about 1.1 million slots).
An encoding is a way to turn code points into bytes for storage. UTF-8, UTF-16, UTF-32 are three different encodings of the same Unicode characters.
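The distinction is easy to see in code. A minimal sketch in JavaScript (TextEncoder always produces UTF-8; the UTF-16 size is derived from JavaScript's own code units):

```javascript
// One character, one code point, different byte representations.
const ch = '한'; // code point U+D55C

// Unicode (the "what"): the code point number
const codePoint = ch.codePointAt(0); // 0xD55C

// UTF-8 (one "how"): 3 bytes for a BMP CJK character
const utf8 = new TextEncoder().encode(ch); // Uint8Array [0xED, 0x95, 0x9C]

// UTF-16 (another "how"): JavaScript strings are UTF-16 internally,
// so .length counts 16-bit code units — one here, i.e. 2 bytes
const utf16CodeUnits = ch.length;

console.log(codePoint.toString(16)); // "d55c"
console.log(utf8.length);            // 3 bytes in UTF-8
console.log(utf16CodeUnits * 2);     // 2 bytes in UTF-16
```

Same character, same code point, but 3 bytes in UTF-8 versus 2 in UTF-16 — the encoding choice only changes the byte layer.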
2. ASCII — the 7-bit ancestor
ASCII (American Standard Code for Information Interchange) defines 128 characters (0x00–0x7F): English letters, digits, basic punctuation, control codes. Each character is exactly 1 byte (with the high bit always 0).
ASCII alone can't represent any non-English language. But its 128 characters are a subset of Unicode (they're U+0000 through U+007F), and UTF-8 was designed to encode them identically — which is why ASCII files are valid UTF-8.
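That subset relationship can be verified directly — a small sketch showing that encoding pure-ASCII text as UTF-8 yields exactly the original byte values, each with the high bit clear:

```javascript
// ASCII text encoded as UTF-8 produces identical 1-byte-per-character output.
const ascii = 'Hello, World!';
const bytes = new TextEncoder().encode(ascii);

// Same length: one byte per character, no multi-byte sequences
console.log(bytes.length === ascii.length); // true

// Every byte equals the ASCII code and stays below 0x80
const allAscii = [...bytes].every((b, i) => b === ascii.charCodeAt(i) && b < 0x80);
console.log(allAscii); // true
```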
3. UTF-8 — the modern default
UTF-8 (Ken Thompson and Rob Pike, 1992) is the dominant encoding on the web (98%+ as of 2024 per W3Techs). The genius is variable-length encoding:
| Code point range | Bytes | Byte pattern (binary) | Example |
|---|---|---|---|
| U+0000 – U+007F | 1 | 0xxxxxxx | ASCII (A-Z, 0-9) |
| U+0080 – U+07FF | 2 | 110xxxxx 10xxxxxx | Latin extended, Greek, Cyrillic |
| U+0800 – U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx | CJK (한, 中, 日) |
| U+10000 – U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | Emoji, rare CJK |
Self-synchronizing: from any byte, you can find the start of the next character without scanning from the beginning. This makes UTF-8 robust to truncation and parsing errors.
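The property follows from the byte patterns in the table: continuation bytes always match 10xxxxxx, so a decoder that lands mid-character can skip forward to the next lead byte. A minimal sketch of that resynchronization:

```javascript
// Find the start of the next character from an arbitrary byte offset.
// UTF-8 continuation bytes match the pattern 10xxxxxx (0x80–0xBF).
function nextCharStart(bytes, offset) {
  let i = offset;
  while (i < bytes.length && (bytes[i] & 0b11000000) === 0b10000000) {
    i++; // skip continuation bytes until we hit a lead byte
  }
  return i; // index of the next lead byte (or end of buffer)
}

const bytes = new TextEncoder().encode('a한b'); // 0x61, 0xED 0x95 0x9C, 0x62
console.log(nextCharStart(bytes, 2)); // 4 — landed inside 한, resyncs at 'b'
console.log(nextCharStart(bytes, 1)); // 1 — already on a lead byte
```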
Storage cost (English): Same as ASCII (1 byte per character).
Storage cost (Korean / Chinese / Japanese): 3 bytes per character. A Korean string takes ~3× the space of equivalent English. Worth considering for very large CJK datasets, but rarely worth choosing UTF-16 over.
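The ~3× figure is easy to verify with TextEncoder (the sample strings below are illustrative):

```javascript
// UTF-8 byte counts: English vs Korean vs emoji.
const utf8ByteLength = (s) => new TextEncoder().encode(s).length;

console.log(utf8ByteLength('hello'));      // 5  — 1 byte per character
console.log(utf8ByteLength('안녕하세요')); // 15 — 3 bytes per character
console.log(utf8ByteLength('🎉'));         // 4  — above U+FFFF, 4 bytes
```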
4. UTF-16 — the JavaScript and Java legacy
UTF-16 uses 16-bit code units. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) use 1 code unit (2 bytes). Characters above (emoji, rare CJK) use a surrogate pair — 2 code units (4 bytes total).
```javascript
// JavaScript / Java strings are UTF-16
'A'.length  // 1 (one code unit)
'한'.length // 1 (one code unit, BMP)
'🎉'.length // 2 ❗ surrogate pair

// To count user-perceived characters
[...'🎉'].length        // 1
Array.from('🎉').length // 1

// Even better: count grapheme clusters (handles 👨‍👩‍👦)
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
[...seg.segment('👨‍👩‍👦')].length // 1
'👨‍👩‍👦'.length // 8 ❗ (3 emoji × 2 code units + 2 ZWJ joiners)
```

Why UTF-16 exists: Java (1996) and Windows NT (1993) adopted 16-bit characters (UCS-2) back when every Unicode code point fit in 16 bits. When Unicode grew past U+FFFF, surrogate pairs were retrofitted, turning UCS-2 into UTF-16. JavaScript inherits the choice via the ECMAScript spec.
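The surrogate pair itself can be computed by hand: for a code point above U+FFFF, subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch of that algorithm:

```javascript
// Encode a code point above U+FFFF as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const v = codePoint - 0x10000;    // 20 bits remain after subtraction
  const high = 0xD800 + (v >> 10);  // top 10 bits → high surrogate
  const low = 0xDC00 + (v & 0x3FF); // bottom 10 bits → low surrogate
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1F389); // 🎉
console.log(hi.toString(16), lo.toString(16)); // "d83c" "df89"
console.log('🎉'.charCodeAt(0) === hi);        // true
console.log('🎉'.charCodeAt(1) === lo);        // true
```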
5. The mojibake catalog
Mojibake = wrong encoding interpretation. Common patterns:
| Original | What you see | Cause |
|---|---|---|
| 한글 | í•œê¸€ | UTF-8 bytes interpreted as Latin-1/Win-1252 |
| 🎉 | ? or insert error | MySQL utf8 (3-byte max) column can't store 4-byte UTF-8 characters |
| café | café | UTF-8 read as Latin-1 (most common) |
| naïve | naïve | Same — Latin-1 misinterpretation of UTF-8 |
| 한글 | \uD55C\uAE00 | JSON over-escaping (some libraries default to ASCII-only) |
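The Latin-1 pattern can be reproduced in Node.js with Buffer (latin1 here stands in for any misconfigured decoder):

```javascript
// Mojibake reproduced: UTF-8 bytes decoded as Latin-1.
const original = 'café';
const utf8Bytes = Buffer.from(original, 'utf8'); // 63 61 66 c3 a9
const misread = utf8Bytes.toString('latin1');    // decoder guesses wrong
console.log(misread); // "café" — é became two characters

// The reverse trip recovers the text — which is why mojibake is often
// fixable as long as the underlying bytes were never destroyed.
const recovered = Buffer.from(misread, 'latin1').toString('utf8');
console.log(recovered); // "café"
```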
6. The MySQL utf8 trap
MySQL's utf8 charset is not real UTF-8. It's a 3-byte-max subset that cannot store characters above U+FFFF — including all emoji and many rare CJK characters.
Use utf8mb4 (added in MySQL 5.5.3, 2010). Treat utf8 as a legacy alias to avoid forever:
```sql
-- Database level
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Table level
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

-- Connection level (in your client config)
SET NAMES utf8mb4;
```
7. Practical rules
- Use UTF-8 everywhere. Files, databases, HTTP, source code. Default for all new systems.
- Set the Content-Type charset. Send `Content-Type: text/html; charset=utf-8` on every HTML response. Same for JSON.
- HTML `<meta>`. Put `<meta charset="utf-8">` as the very first thing in `<head>`.
- MySQL: utf8mb4, never utf8. Set it at database, table, column, AND connection level.
- Don't use a BOM in UTF-8. It breaks bash scripts, CSV parsers, and JSON parsers.
- Don't trust string.length. In JavaScript, use `Intl.Segmenter` for user-visible character counts.
- Compare normalized. The "é" in "café" can be 1 code point (U+00E9) or 2 (U+0065 + U+0301). Call `String.prototype.normalize('NFC')` on both sides before comparison.
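The normalization rule in action — both spellings render identically but differ at the code-point level:

```javascript
// NFC vs NFD: same visible text, different code points.
const precomposed = 'caf\u00E9';  // é as one code point (U+00E9)
const decomposed = 'cafe\u0301';  // e + combining acute accent (U+0301)

console.log(precomposed === decomposed); // false ❗ naive comparison fails
console.log(precomposed.normalize('NFC') === decomposed.normalize('NFC')); // true
console.log(precomposed.length, decomposed.length); // 4 5
```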
FAQ
Q. Why does my emoji look like two characters in JavaScript's string.length?
A. JavaScript strings are UTF-16, and emoji like 🎉 require a 'surrogate pair' (two 16-bit code units) because they're outside the Basic Multilingual Plane. So '🎉'.length === 2. Use [...'🎉'].length or Array.from('🎉').length to get the user-perceived character count (1). For grapheme cluster counting, use Intl.Segmenter (available in Node 16+ and modern browsers).
Q. Should I use BOM (Byte Order Mark)?
A. Almost never for UTF-8. The BOM is meaningful for UTF-16 (where byte order matters) but the UTF-8 BOM (EF BB BF) breaks many parsers — bash scripts fail, JSON parsers throw, CSV readers see a corrupted first column. Some Microsoft tools insert it; most modern toolchains strip it. Default: don't use UTF-8 BOM.
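A defensive pattern when ingesting files from mixed toolchains — a small sketch that strips a leading UTF-8 BOM from raw bytes if one is present (note that TextDecoder('utf-8') already skips a BOM by default; this helper is for pipelines that work on bytes directly):

```javascript
// Strip a UTF-8 BOM (EF BB BF) from the start of a byte buffer, if present.
function stripUtf8Bom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return bytes.subarray(3); // view past the BOM, no copy
  }
  return bytes; // no BOM: return unchanged
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]); // BOM + "hi"
const clean = stripUtf8Bom(withBom);
console.log(new TextDecoder().decode(clean)); // "hi"
```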
Q. What's the difference between Unicode and UTF-8?
A. Unicode is the character set — a registry mapping characters to numbers (code points). 'A' is U+0041, '한' is U+D55C, '🎉' is U+1F389. UTF-8 is one of several encodings that turn code points into bytes for storage and transmission. UTF-16 and UTF-32 are alternatives. Unicode says 'what', UTF-* says 'how to store it'.
Q. Is UTF-8 always backward-compatible with ASCII?
A. Yes for the first 128 code points (U+0000 to U+007F) — they're encoded identically in 1 byte. ASCII files are valid UTF-8 with no changes. But ASCII is not forward-compatible — it can only represent those 128 characters. Any '8-bit ASCII' (Latin-1, Windows-1252) is not real ASCII and is NOT compatible with UTF-8 — these are the source of most mojibake.
Q. What encoding should databases use?
A. MySQL: utf8mb4 (NOT 'utf8' — that's a 3-byte legacy encoding that can't store emoji). PostgreSQL: UTF8 by default, no caveat. Use COLLATE utf8mb4_unicode_ci or utf8mb4_0900_ai_ci for case-insensitive matching across languages. Set the charset at database, table, AND column level — defaults are inherited only when an object is created, so existing columns keep their old charset until you convert them.
About the DevToolNow Editorial Team
DevToolNow's editorial team is made up of working software developers who use these tools every day. Every guide is reviewed against primary sources — IETF RFCs, W3C/WHATWG specifications, MDN Web Docs, and project repositories on GitHub — before publication. We update articles when standards change so the guidance stays current.
Sources we cite: IETF RFCs · MDN Web Docs · WHATWG · ECMAScript spec · Official project READMEs on GitHub