🔡 How Unicode Characters Become Bytes in Go: From Code Points to UTF-8

Dushyanth

Ever wondered how text in languages like Telugu, emojis, or even simple letters like 'A' is stored in Go?
It all comes down to Unicode and UTF-8, and Go makes working with them surprisingly clean.

Let’s peel back the layers of abstraction and see what really happens under the hood when you write a character in Go.

Unicode vs UTF-8 vs Go Types

Concept | What It Is
Unicode | A universal set of character codes (code points), one number per symbol
UTF-8   | A way to store those code points using 1–4 bytes
Go      | Uses rune to store a Unicode code point, and []byte to store its UTF-8 bytes
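To make the distinction concrete, here is a minimal sketch of the same one-character string seen as runes and as bytes. The helper name describe is made up for this illustration; len and utf8.RuneCountInString come from Go and its standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe is a hypothetical helper for this example: it returns the
// number of UTF-8 bytes in s and the number of code points (runes) in s.
func describe(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "త"
	b, n := describe(s)
	fmt.Println(b, n)      // 3 1
	fmt.Println([]rune(s)) // [3108]
	fmt.Println([]byte(s)) // [224 176 164]
}
```

Note that len on a string counts bytes, not characters, which is why the two numbers differ.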

Let's Take a Character: త

This is a Telugu letter, pronounced “ta”.

Step 1: Unicode

Every character has a unique code point in the Unicode spec.

r := 'త'

fmt.Printf("Unicode: U+%04X\n", r) // Output: U+0C24

So 'త' has Unicode code point U+0C24 = 3108 in decimal.
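As a small check of that equivalence, the sketch below compares the different ways of writing the same code point and converts the number back into a character. The helper toChar is a made-up name for illustration only.

```go
package main

import "fmt"

// toChar is a hypothetical helper: it turns a numeric code point
// into a one-character string.
func toChar(cp int32) string {
	return string(rune(cp))
}

func main() {
	r := 'త'
	// The literal, the hex value, the decimal value and the \u escape
	// all denote the same number.
	fmt.Println(r == 0x0C24, r == 3108, r == '\u0C24') // true true true
	fmt.Printf("%T\n", r)                               // int32
	fmt.Println(toChar(0x0C24))                         // త
}
```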

Step 2: Convert to UTF-8

The Unicode code point is abstract — we need a way to store it in memory. That’s where UTF-8 comes in.

UTF-8 stores U+0C24 as:

  • Bytes: [224, 176, 164]

  • Hex: [0xE0, 0xB0, 0xA4]

  • Binary:

      11100000 10110000 10100100
    

UTF-8 uses variable-length encoding. Since త (U+0C24) falls in the U+0800–U+FFFF range, it uses 3 bytes.
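The three bytes can be derived by hand. A 3-byte UTF-8 sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx, where the x positions are filled with the 16 bits of the code point. The sketch below does that bit-packing and compares the result with the standard library. The function encode3 is a hypothetical helper written for this illustration; it assumes its input is already in the 3-byte range and does no validation.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encode3 packs a code point from the U+0800..U+FFFF range into the
// three-byte layout 1110xxxx 10xxxxxx 10xxxxxx.
// Illustration only: no range or surrogate checks are performed.
func encode3(r rune) []byte {
	return []byte{
		0xE0 | byte(r>>12),         // 1110 prefix + top 4 bits
		0x80 | (byte(r>>6) & 0x3F), // 10 prefix + middle 6 bits
		0x80 | (byte(r) & 0x3F),    // 10 prefix + low 6 bits
	}
}

func main() {
	r := 'త'
	manual := encode3(r)

	// Reference encoding from the standard library.
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)

	fmt.Printf("% X\n", manual)               // E0 B0 A4
	fmt.Println(bytes.Equal(manual, buf[:n])) // true
}
```

For U+0C24 (0000 1100 0010 0100) the top 4 bits are 0000, the middle 6 are 110000 and the low 6 are 100100, which gives exactly the three bytes listed above.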

Step 3: See It in Go

Here's a complete Go program to visualise this transformation:

package main

import (
    "fmt"
)

func printUTF8Encoding(s string) {
    fmt.Printf("Input: %q\n\n", s)
    n := 0 // character counter: the index from range is a byte offset, not a character position
    for _, r := range s {
        n++
        utf8Bytes := []byte(string(r))
        binaryUnicode := fmt.Sprintf("%016b", r)

        fmt.Printf("Character #%d: %q\n", n, r)
        fmt.Printf("→ Unicode:  U+%04X (decimal: %d)\n", r, r)
        fmt.Printf("→ Binary (Unicode): %s\n", insertEvery4Bits(binaryUnicode))
        fmt.Printf("→ UTF-8 Bytes: %v\n", utf8Bytes)

        fmt.Print("→ Hex:       ")
        for _, b := range utf8Bytes {
            fmt.Printf("0x%02X ", b) // always two hex digits, so 0x09 rather than 0x9
        }

        fmt.Print("\n→ Binary:    ")
        for _, b := range utf8Bytes {
            fmt.Printf("%08b ", b)
        }

        fmt.Println("\n---")
    }
}

func insertEvery4Bits(s string) string {
    out := ""
    for i, c := range s {
        if i > 0 && (len(s)-i)%4 == 0 {
            out += " "
        }
        out += string(c)
    }
    return out
}

func main() {
    printUTF8Encoding("త")
}

🧪 Output:

Input: "త"

Character #1: 'త'
→ Unicode:  U+0C24 (decimal: 3108)
→ Binary (Unicode): 0000 1100 0010 0100
→ UTF-8 Bytes: [224 176 164]
→ Hex:       0xE0 0xB0 0xA4
→ Binary:    11100000 10110000 10100100
---

🔍 Summary

Layer       | Value
Unicode     | U+0C24 (decimal: 3108)
UTF-8 Bytes | [224, 176, 164]
Go rune     | int32 → 3108
Go []byte   | UTF-8 bytes

✨ Try More!

Pass any string into the function, such as "😊", "hi", "తెలుగు", or "你好", and you'll see how many bytes Go uses for each character.
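For a quick preview of what to expect from those inputs, this sketch prints the UTF-8 length of each code point. The helper byteLens is a made-up name for this example; utf8.RuneLen is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteLens is a hypothetical helper: it returns, for each code point
// in s, how many bytes its UTF-8 encoding takes.
func byteLens(s string) []int {
	var lens []int
	for _, r := range s {
		lens = append(lens, utf8.RuneLen(r))
	}
	return lens
}

func main() {
	for _, s := range []string{"hi", "తెలుగు", "你好", "😊"} {
		fmt.Printf("%s: %d bytes total, per code point %v\n", s, len(s), byteLens(s))
	}
}
```

ASCII letters take 1 byte each, the Telugu and Chinese characters take 3, and the emoji takes 4. Also note that "తెలుగు" contains 6 code points, because each vowel sign is a separate code point from the consonant it attaches to.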
