Everything that happens when you click a button

Shantanu Sharma

Working in product and analytics over the last few years, I have interacted closely with developers, analysts and QAs - often stepping in to write code myself - and yet when I think about end-to-end system design, even in my own products, I realize most of my knowledge is still pretty fragmented.

Developers definitely have a more granular understanding of the frameworks they implement, but I’ve often observed a limited grasp of the ‘why’ behind those frameworks, as well as a gap in understanding how things take shape beyond the application layer.

As things are moving, AI will likely add another layer on top of this for code enthusiasts and budding coders, where programming becomes assembling pre-built template code with plug-and-play variables and on-demand snippets, with smarter systems course-correcting their suggestions via automated test frameworks.

While this will definitely bridge the gap for folks from non-tech backgrounds, they will miss a base-level understanding of programming - one I often find missing in myself as well. This is the case for most product managers, and many technical PMs too.

This blog is therefore designed for us to learn together (whether you are a total noob or a curious developer) - the unseen, in-depth journey of information behind the simplest of actions we take for granted: clicking a button.

📌
While I have tried to explain as much jargon as possible, this also isn’t a surface-level introduction blog, and will require some patience. One thing I can promise you though - however long the blog gets, it won’t be boring (I mean like I’ll try).

The Journey Begins! - Tap that App (or Website)

At the very start of our journey, when you click any button, if it’s a fake button, nothing will happen.
That’s it. Journey end. Blog over. Yay.

Unfortunately, since no one makes silly buttons anymore, developers link an action (usually placing a link/URL) within each button to trigger a specific result.
We are going to explore two key user actions in this blog:

  1. Clicking a button on a website (ex. Google after you type in your query)

  2. Clicking a button on a streaming app (ex. Netflix to watch a movie)

For either of the use cases, clicking on the button triggers the URL placed inside, which fetches information from the website’s servers hundreds of miles away from you, if not more.

So how does my request become signals, bro? How does the internet even work?

There are five really cool layers in the transmission of our request:

  1. Application Layer (HTTPS/FTP/DNS..)
    Your generated request (URL) is readied for transmission from your device via HTTP, a protocol for sharing data over this fat-ass internet.
    Modern browsers enforce HTTPS, where the extra S means ‘Secure’ - your encrypted request is protected by a TLS certificate to ensure your data (the request payload) is understood only by the server of the website you are accessing.
    Before your request is transmitted, the client performs DNS (Domain Name System) resolution - seeking the IP address of the destination server, to know who to communicate with in the first place - and then performs a TLS handshake with the server, confirming that both the client and server can send and receive data with each other (more on this later).

     # Your encrypted GET request over HTTPS (after the TLS handshake)
     # Illustrative sketch with AES-256-GCM (pip install cryptography)
     import os
     from cryptography.hazmat.primitives.ciphers.aead import AESGCM
     session_key = AESGCM.generate_key(bit_length=256)  # agreed during the handshake
     encrypted_request = AESGCM(session_key).encrypt(
         os.urandom(12),  # nonce; unique per message
         b"GET /search?q=what+is+love HTTP/1.1\r\nHost: www.google.com",
         None,
     )
    

    There are other protocols for sharing data too - like FTP (File Transfer Protocol) for transferring files between devices, SMTP for emails, and WebSockets for real-time chatting with the broskis.

  2. Transport Layer (TCP/UDP)
    Your request, wrapped in HTTP, is now split into segments/packets and assigned port numbers. Splitting data ensures any corruption in one segment doesn’t affect the others, and since each segment is checksummed and acknowledged individually, corrupted or lost segments can simply be retransmitted without resending everything.
    Port numbers identify which application on each machine the segments belong to, while the chosen transport protocol defines whether they travel via TCP (reliable but slower) or UDP (faster but unreliable).

  3. Network Layer (IP)
    Your data packets, with TCP headers, are now assigned their source and destination IP addresses — pretty much like when you send mail, so they know exactly where to go and where they are from.
    Look at this cute-ass envelope:

     [ IP Packet Header ] -------------------------------
     | Version: IPv4           | Length: 20 bytes       |
     | Source IP: 192.168.1.100| Destination IP: 142.250.190.46 |
     | TTL: 64 (Time to Live)  | Protocol: TCP (6)      |
     | Checksum: 0x3A7B        |                        |
     ---------------------------------------------------
     [ TCP Header ] ------------------------------------
     | Source Port: 54321      | Dest Port: 443 (HTTPS) |
     | Sequence Number: 12345  | Ack Number: 67890      |
     | Flags: SYN/ACK          | Window Size: 65535     |
     ---------------------------------------------------
     [ Data Payload ] ----------------------------------
     | "GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n" |
     ---------------------------------------------------
    
    💡
    After this layer, data is often displayed in hex - a compact, human-readable representation of binary data.
  4. Data Link Layer
    Each packet is wrapped into a frame to be converted into signals, and within your network, frames are also stamped with MAC addresses (the unique ID of your physical device) to authenticate your request with your local router, along with checksums for error detection.

     [ IP Packet ]  
     | Source IP: 192.168.1.100 | Dest IP: 142.250.190.46 | Data: "GET /" |  
       ↓  
     [ Ethernet Frame ]  
     | Dest MAC: AA:BB:CC:DD:EE:FF | Src MAC: 00:1A:2B:3C:4D:5E | Type: IPv4 | Data (IP Packet) | CRC |
    

    Now your request is finally ready to be beamed across the physical layer!

  5. Physical Layer (Fibre Optic cables, Wireless, Ethernet)
    These frames are converted to signals (light pulses in fibre optics, electric pulses in ethernet, and electro-magnetic waves in Wifi) and vroom-vroomed across the globe!
    Modulation schemes like QAM, QPSK and OFDM define how these signals are encoded and travel.

Deep-Diving Data Transmission: Prepping Your Request

At the very, very start of things, when your device calls a URL, you are essentially placing a call to talk to a specific person (URL) in a specific organisation (domain).

In order to access information from this URL, your device wraps its request in the HTTPS protocol, which:

  1. Defines how information is formatted for the receiver

  2. Ensures, via the server’s TLS certificate, that only the correct receiver (the server) can access the information in your request.

While your request is getting ready for its date with the server, your device also sends a UDP request (the fast but less secure protocol) to your internet provider to hit the Domain Name System (DNS - essentially an address book for looking up IP addresses), to find the IP address of the destination server.

[Your Device]                          [ISP's DNS Server]                          [Google's Server]
      |                                         |                                         |
      | -- "What's the IP of google.com?" (UDP) -->                                         |
      |                                         |                                         |
      |                                         | -- Checks its records or queries others -→|
      |                                         |                                         |
      | <-- "Google.com = 142.250.190.46" (UDP) --                                         |
      |                                         |                                         |
      | -- TCP SYN to 142.250.190.46 (HTTPS) --------------------------------------------->|
      |                                         |                                         |

Once it has this IP address, your device has a correctly formatted request and knows where to send it. Now it has to get the request ready for sending.

💡
Your client doesn’t always need to request the IP address of the destination server and launch this UDP request - your browser and OS check multiple caches first to see if you already have the IP address.
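If you’re curious, you can reproduce this lookup with Python’s standard library (the returned IP will vary by region and time):

# DNS resolution via your OS resolver (and its caches)
import socket

print(socket.gethostbyname("www.google.com"))  # e.g. 142.250.190.46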

A TCP handshake is essentially a combination of three messages between client (you) and the destination server where:
1. Client sends a SYN packet (“Bro, am I audible?”)
2. Server responding with a SYN-ACK packet to acknowledge (“Yes bro, you are audible. Am I audible?”)
3. Client sending ACK packet to acknowledge back (“Yes bro, you are also audible.”)

# Python pseudo-code for the TCP 3-way handshake
client.send(flags="SYN", seq=1000)                # Step 1: Client → Server (SYN)
server.send(flags="SYN-ACK", seq=2000, ack=1001)  # Step 2: Server → Client (SYN-ACK)
client.send(flags="ACK", seq=1001, ack=2001)      # Step 3: Client → Server (ACK)

The request is then dissected into multiple segments, and each segment is assigned a sequence number (to identify which order to read them in while decoding) and the port number for the TCP connection.
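A toy sketch of that segmentation (the segment size is shrunk for illustration; real TCP segments typically carry up to ~1460 bytes of payload):

# Toy segmentation: split the request into sequence-numbered segments
request = b"GET /search?q=what+is+love HTTP/1.1\r\nHost: www.google.com"
MSS = 16  # illustrative; real segments carry far more per segment

segments = [{"seq": offset, "payload": request[offset:offset + MSS]}
            for offset in range(0, len(request), MSS)]

# The receiver reorders by sequence number and stitches the data back together
reassembled = b"".join(s["payload"] for s in sorted(segments, key=lambda s: s["seq"]))
assert reassembled == request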

Your operating system then wraps each segment with the source and destination IP address information (as mentioned above in the blog).

Now that your request is ready to be sent to the destination server, it also needs some information to travel to your ISP first. So, in order to communicate with your router, your request is also wrapped with bits signifying your device’s MAC address.

Your device’s MAC address is already on the list of approved devices that your router allows connections from, and once cross-checked, the router lets the device share packets of information with the Internet Service Provider (ISP), which is further connected to multiple servers over the internet through cables, signal towers or satellites.

💡
A key thing to note is that on shared media like Wi-Fi, the router’s frames physically reach every connected device (and vice versa) - each device simply discards frames not addressed to its MAC.

The data link layer also gives your request a checksum - a CRC (Cyclic Redundancy Check) that stores a value computed from the frame’s bits. The same value is recalculated at the destination and compared with the stored one to check whether any corruption has happened.
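Here’s the idea with Python’s built-in CRC-32, the same family of check that Ethernet frames use:

# CRC check: sender and receiver compute the same function over the payload
import binascii

frame_payload = b"GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n"
crc_at_sender = binascii.crc32(frame_payload)

# ...frame travels across the network...

crc_at_receiver = binascii.crc32(frame_payload)
print("Frame intact:", crc_at_sender == crc_at_receiver)  # False if any bit flipped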

Now your request is fully prepared to be sent over the internet. It has confirmed communication with the destination IP, has been assigned all the necessary headers, payload and error checks, is approved to be sent to your router, and then further from your ISP on to the destination server of the organisation you are trying to hit.

Let’s Get Physical, Physical

Now comes the physical layer, which I honestly love the most.

Each information character is first converted into 8-bit ASCII/UTF-8 binary data, represented as a stream of 1s and 0s:
(The ASCII values for uppercase letters range from 65 to 90, while the ASCII values for lowercase letters range from 97 to 122.)
So ‘Hello’ (I’m not actually saying hello to you, this is an example) is:
H → 72  = 01001000
e → 101 = 01100101
l → 108 = 01101100
l → 108 = 01101100
o → 111 = 01101111

‘Hello’ = 01001000 01100101 01101100 01101100 01101111
where each character is defined by 8 bits — eight 1s and 0s.
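You can reproduce this yourself in a couple of lines of Python:

# Each character → its 8-bit binary representation
message = "Hello"
print(" ".join(format(ord(ch), "08b") for ch in message))
# 01001000 01100101 01101100 01101100 01101111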

These bits are then encoded into ‘symbols’ - packets of energy used to transmit information, like voltage level or intensity of a light pulse.


Optional (but fun): Why binary? Why not encode information in, say, base 3, 4 or anything else?

The short answer is - it’s definitely possible to do so, but it’ll be a pain in the ass with no strong gain.

Binary digits (bits) in the physical layer are essentially electric or light pulses (like turning a switch on and off) that act as information, and this is precisely how data is transmitted through fibre optics (through light pulses travelling in fibre cables), ethernet (electric signals in copper wires) or wifi (electromagnetic waves travelling whatever because they’re badass EM waves).
Now, as we’ll discuss ahead in the blog, pulses in the physical layer indeed are complex enough to carry various phases and amplitudes, each combination signifying a symbol containing bits.

With a base of say 3 (instead of 2), where 0 = low-energy pulse, 1 = mid-energy pulse, 2 = high-energy pulse, we could indeed have a more information-dense system, since for n digits, ternary can represent 3^n values while binary only provides 2^n.

This Ternary system (0/1/2) is also mathematically complete - where each logical function can be expressed using the system (e.g., using {TAND, TOR, TNOT}).

However, this system would mean disrupting the legacy physical transistors we already operate (the physical layer for Boolean logic gates) with no expressive power advantage - just more complexity.

At its core, Boolean logic (AND, OR, NOT) is the simplest possible system that can express all computable functions while being physically robust. Ternary systems only make the number encoding denser, adding unnecessary complexity to an efficient system, not to mention the massive overhead of changing existing systems to work with three states.

Another key reason is the loss of accuracy with three states. Signals attenuate (lose power over distance), suffer from noise, or get interfered with. The more states we pack into the same signal range, the more likely it is for a pulse to be incorrectly decoded, which has dangerous implications.

For this reason, it practically works in the opposite direction at scale.
Transistors work on CMOS (Complementary Metal-Oxide-Semiconductor) technology, which switches transistors on (1) or off (0) via voltage thresholds.
For ternary systems, CPUs would need 3-state ALUs to process data, and electronic systems would require larger transistors to maintain signal integrity - opposing Moore’s law, which predicts 2x transistors per chip every 2 years as they keep shrinking.

In software too, existing algorithms around hashing, encryption and error correction would all need to be re-created, and it would be just an all-round chaos.

💡
An interesting bit of trivia here is the Setun, a Soviet computer from 1958 that ran on balanced ternary (-1, 0, +1). Despite being more data-dense, CMOS binary chips out-scaled it by the 1970s - a binary CMOS chip with 2x the transistors beat ternary’s theoretical 1.58x density gain.

Anyway, returning to our core use case, our information packets now need to be sent as pulses/voltages. We could design them dumbly as no pulse = 0, pulse = 1, but if the absence of a pulse meant 0, you wouldn’t be able to differentiate between sending a 0 and sending no message at all.
At the same time, sending each bit as an individual pulse would be very inefficient. Therefore, better protocols were created to encode information into signals in really intelligent ways, which I’ll go into next!

Regardless of the medium (light/electric/EM), let’s understand some protocols in encoding information onto signals/pulses:

  1. On-Off Keying (Cutest)
    OOK sends one bit per pulse, which is extremely inefficient (and thus slow), but is the simplest way to encode info onto pulses. Here, you take a high state of energy as 1 (large amplitude in light, high current in electric) and a low state as 0, while no pulse acts as a gap.
    OOK was used around the time your mama was young, so it’s obviously really, really archaic (there, I said it).

  2. Manchester Encoding (Semi-cute)
    Like all things Manchester, this was great at some point but now is pretty okay-ish.

    In the past, a key issue in sending signals was that in order to separate between different sequences (to create a gap in between say one word and the next), systems would rely on a gap time. However, there would often be sync issues and delays in relay, even within a packet, leading to incorrect decoding by the receiver.

    Manchester encoding solved for this by encoding 1 as a transition from low-to-high and 0 as high-to-low signal. The mid-point of each packet/signal would be taken as reference by the receiver, whose clock would adjust its sync based on the length of the transition, thereby course-correcting for any time drift.
    Manchester coding works by exclusive-ORing (XOR) the original data with the clock signal - a regular pulse (like a metronome) that synchronizes data transmission so sender and receiver agree on when to read each bit.

    However, encoding signals in transitions takes time, eats bandwidth, and is also prone to interference, so Manchester is largely used where simplicity matters more than speed (like in RFID systems for payments and toll collection, over radio signals or Ethernet).
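    A minimal sketch of that XOR idea in Python, using the convention above (1 = low-to-high, 0 = high-to-low):

     # Manchester encoding: XOR each data bit with a (high, low) clock per bit period
     def manchester_encode(bits):
         # 0 → (1, 0) = high-to-low, 1 → (0, 1) = low-to-high
         return [b ^ c for b in bits for c in (1, 0)]

     print(manchester_encode([1, 0, 1]))  # [0, 1, 1, 0, 0, 1]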

  3. Phase and Amplitude Modulation (Most common)
    Modulation schemes like QPSK and QAM group multiple bits into symbols -

    1. QPSK (Quadrature Phase Shift Keying): Groups 2 bits into one symbol, using four phase shifts to represent the 2-bit symbols:

      • 00 – 0° phase shift

      • 01 – 90° phase shift

      • 11 – 180° phase shift

      • 10 – 270° phase shift

QPSK is robust in noisy environments (satellite TV, 4G) but has low spectral efficiency (only 2 bits per symbol).

    2. QAM (Quadrature Amplitude Modulation): Encodes multiple bits per symbol by varying both amplitude and phase.
    Higher-order QAM (e.g., 256-QAM) packs more bits per signal, but due to smaller differences between each state, only works in environments where noise is low.

    The number before QAM denotes the number of symbol combinations it can transmit (for example, 256-QAM has 16 different amplitudes times 16 different phase shifts, with each combination representing one 8-bit symbol).
    Here, the no. of bits per symbol (b) =

    $$b = \log_2(n)$$

    where n = no. of symbols in the QAM constellation. (16/64/256/1024-QAM houses 4/6/8/10 bits per symbol respectively).
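    The same formula, checked quickly in Python:

     # Bits per symbol: b = log2(n) for a constellation of n symbols
     import math

     for n in (16, 64, 256, 1024):
         print(f"{n}-QAM → {int(math.log2(n))} bits per symbol")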

💡
An important point to note here is that the physical layer does not care how bits come together to make up words (whether it is byte-based or 16-bit encoding) - the application layer is responsible for this. What the physical layer does is add a preamble and start-of-frame detection (always a constant length), followed by headers containing the MAC addresses (again constant), the Ethertype (which defines how the data is to be interpreted), the actual payload of the message, and then a mathematical check to ensure everything’s tight when the message is finally received.

While early Wi-Fi used 16-QAM, packing 4 bits into one symbol, the latest Wi-Fi technology uses 256 or 1024-QAM.

A logical question that follows is - wouldn’t splitting signals across 32 amplitudes and 32 phase shifts create a much higher probability of errors during transmission? The gap between one symbol and another is very, very small in this 32x32 grid, not to mention that we cram 10 bits into each symbol.

The fact is, just like me, it is indeed fragile. But just like me, it compensates this with other, cooler shit.

New generations of routers are adept in solving for this by implementing -

  1. OFDM (Orthogonal Frequency Division Multiplexing)
    Instead of sending one 1024-QAM symbol at a time, the data is split across hundreds of subcarriers that move data slower individually but reduce errors. Being orthogonal to each other, signals also do not interfere with each other.

  2. LDPC Coding (Low-Density Parity Check)
    Extra bits are added for error-correction. This helps in recovering any corrupted bits by cross-checking patterns.

  3. Beamforming (Not adopted as much)
    Routers focus signals directly towards your device, reducing noise. The con is that it isn’t very accurate for mobile devices, where your device is hardly stationary.

  4. Adaptive Modulation, where, when a signal is weak, the system falls back to QPSK so that larger gaps between symbols make separating signals from noise easier.
    (A weak signal essentially means a poor signal-to-noise ratio, and since noise is high, densely packed information is much more prone to errors. Imagine a packed pub where you voice 100s of orders to the bartender, vs only shouting 3 orders loudly to get your message across.)

Using these methods, your signals efficiently reach the router with minimum corruption to the data within.

From there on, after approving your MAC address, the router strips the MAC address headers from the request and forwards it to the ISP, usually through cables.

But how do these signals travel thousands of miles from your ISP to the destination servers?

Signals from your ISP travel in fibre optic cables and coaxial cables - even under oceans - hopping from one destination to the next, finally reaching the destination server. Over mobile networks, this data flows wirelessly from cell tower to cell tower nationally, but usually still requires under-sea fibre optic cables to travel from one country to another.


Backend Shenanigans on Reaching Destination

Finally, after all this foreplay, when your signals finally reach the destination server, they’re usually met by Load Balancers - the bouncers of networking.

Also, in reality, the "destination server" is rarely a single machine. Instead, it’s often:

  • A cluster of servers (scaled horizontally for redundancy/performance).

  • Distributed across groups (e.g., cloud regions, edge nodes).

Load balancers validate your request (via IP, hostname and cookies), decrypt HTTPS using the TLS certificate, and also decide which exact server within this group the request should go to.

Your decrypted HTTP request, once parsed (understood), contains:

  • HTTP Method (GET, POST)
    Whether the client wants to fetch information (GET) or send information (POST). “Get” requests are like loading a webpage to view content on it, getting restaurants near your location on Swiggy, etc. “Post” requests are like adding a comment on an Instagram reel, uploading a profile picture, submitting a form, etc.

  • Path (/search)
    The intended vertical/path of the request within the organisation - like /search to search on Google, /results to fetch results on YouTube, /home for a website’s home page, /browse for Netflix. This is where the backend’s business logic fetches the requested information by querying databases or checking caches (more on this later).

  • Query Params (q=best+pizza+near+me)
    The exact attributes you are requesting - like the title of a movie, or the search query you typed into Google.

  • Headers (cookies, auth tokens).
    Used to authenticate the request (auth tokens) and maintain the session for a user (cookies). A session is essentially a user’s ongoing experience with a website/app that they don’t want to reset between requests.
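The path and query params are easy to see if you pull a URL apart with Python’s standard library:

# Splitting a request URL into its path and query params
from urllib.parse import urlparse, parse_qs

url = "https://www.google.com/search?q=best+pizza+near+me"
parts = urlparse(url)
print(parts.path)             # /search
print(parse_qs(parts.query))  # {'q': ['best pizza near me']}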

After decrypting the request, load balancers usually route requests through any of the following ways:

  • Round Robin (rotate servers evenly).

  • Least Connections (sends to the least busy server).

  • IP Hash (same user → same server to maintain session consistency).
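A toy sketch of the three strategies (the server list and the hashing choice are illustrative):

# Toy load balancer routing strategies
import hashlib
import itertools

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

rotation = itertools.cycle(servers)          # Round Robin: rotate evenly
print(next(rotation), next(rotation))        # 10.0.0.1 10.0.0.2

def least_connections(active):               # Least Connections: least busy wins
    return min(active, key=active.get)

print(least_connections({"10.0.0.1": 12, "10.0.0.2": 3, "10.0.0.3": 8}))  # 10.0.0.2

def ip_hash(client_ip):                      # IP Hash: same user → same server
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(ip_hash("203.0.113.7"))                # always the same server for this IP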

💡
Routing is important because if all traffic was routed to a single server, it would like break and shit. Load balancers balance your load (no sex jokes from you please) so the thick flow of data is sprayed evenly across servers (I’m allowed to make them).

After being sent to the required server, the request is processed by the backend, the brain of the product (or like the heart or whatever).

The backend consists of code written over frameworks to define business logic into algorithms and automate it, handle servers and storage, house central data (and backups of it) and ensure all processes and transactions for the business stay tight!

After receiving a request, the backend essentially performs the following actions -

  1. Authenticate Request (typical Bouncer)
    The Web Server validates the request using JWT (JSON Web Token)/OAuth middleware to ensure that the client:

    • Is indeed a user of the platform,

    • Is interacting in a session that is still active (user isn’t logged out),

    • Is able to see content based on their user role (premium content only to paid users).

  • For JWT, the token is stored in your browser’s cookies (which is why deleting them can often log you out of a website) and is carried within each request. The server decodes the token and approves or rejects the signature within to authenticate your session.

  • OAuth, on the other hand, is traditionally used to delegate access to resources without sharing credentials - often via third parties.
    Here, the third party, like “Sign in with Google”:

    • Provides the backend server a (short-lived) access token to allow you access, and shares your data with the server.

    • This access token comes with a refresh token (longer-lived), so that as long as the refresh token is active, the backend server can keep generating access tokens to maintain your session.

    • However, when the refresh token itself expires, expired access tokens can no longer log you back into the system. (For ex. if you want to ban a user, you can expire their refresh token.)

    • The two tokens exist because access tokens are more likely to fall into the wrong hands, being shared over API calls, while refresh tokens are more secure - so even if someone is misusing the APIs to fetch your information, that particular access token will expire after a point.

    • So to temporarily end someone’s session, you can expire their access token. And, to permanently ban them, you can expire their refresh token.

💡
A key difference is that ‘authentication’ - verifying a user’s identity - is done by the organisation in JWT, while in OAuth it is done by the third party (Google). ‘Authorisation’ - deciding a user’s level of access - is done by the organisation in both cases: either via a ‘role’ parameter within the JWT payload after the token is decoded, or via a ‘scope’ parameter within the JSON token generated in OAuth.
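To make this concrete, here is a hypothetical sketch of a JWT issue-and-verify flow using the PyJWT library (the secret, claims and role are made-up assumptions):

# Hypothetical JWT flow (pip install PyJWT)
import time
import jwt

SECRET_KEY = "server-side-secret"  # never leaves the backend

# Issued at login: identity + role (for authorisation) + expiry
token = jwt.encode(
    {"user_id": 42, "role": "premium", "exp": time.time() + 3600},
    SECRET_KEY,
    algorithm="HS256",
)

# Checked on every request that carries the token
try:
    payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
    print("Authenticated:", payload["user_id"], "| role:", payload["role"])
except jwt.ExpiredSignatureError:
    print("Session expired - please log in again")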
  2. Pull the Requested Information
    The backend calls any external/internal APIs (essentially HTTP requests) to fetch further information, and runs the code built around its specific business logic to process information and update systems.

    • Talks to other services if it requires external data, or to update any external dependency.

    • Checks its notes first - Before doing heavy lifting, it checks Redis/Memcached (its short-term memory) to see if the information is already there. For static stuff (like images), it just grabs them from the CDN—no fuss.

    • If the cache comes up empty, it sighs and queries the database properly.
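    A minimal cache-aside sketch of that check-the-notes-first pattern (the Redis connection and the stubbed query function are illustrative):

     # Cache-aside: try the cache, fall back to the database, remember the answer
     import json
     import redis  # pip install redis

     cache = redis.Redis(host="localhost", port=6379)

     def run_expensive_db_query(query):
         return [{"title": "Best pizza near me"}]  # stand-in for the real DB call

     def get_results(query):
         cached = cache.get(query)
         if cached:                                     # cache hit: skip the database
             return json.loads(cached)
         results = run_expensive_db_query(query)        # cache miss: heavy lifting
         cache.set(query, json.dumps(results), ex=300)  # remember it for 5 minutes
         return results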

  3. Run Async Processes
    Async processes start from your trigger, but never block the information you need. (For ex. Swiggy doesn’t need to update all of its systems after receiving your order before displaying ‘Order Confirmed’ on your screen.)
    Async processes run on Kafka/Celery, wherein:

    • Kafka acts as the bulletin board for the list of heavy tasks to pick up, and

    • Celery executes them in the background.
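    A hypothetical Celery sketch of that split (the app name, broker URL and task are made up):

     # Confirm the order immediately; queue the heavy work for later
     from celery import Celery

     app = Celery("orders", broker="redis://localhost:6379/0")

     @app.task
     def update_downstream_systems(order_id):
         ...  # notify restaurant, update inventory, trigger analytics

     def handle_order(order_id):
         update_downstream_systems.delay(order_id)  # queued, runs in the background
         return {"status": "Order Confirmed"}       # the user sees this immediately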

  4. Finally, Sending Information to Frontend
    Responses are sent to the frontend (that loads on your browser/app) either through SSR (Server-Side Rendering) or CSR (Client Side Rendering):

    • CSR (DIY Furniture Approach) -
      In traditional frontend frameworks, the browser receives bare-bones HTML from the backend, plus the JavaScript needed to build the page piece by piece. CSR is slower on the client’s (your) end, and often creates the loading bar you usually see on websites.

    • SSR (Pre-Built Delivery Approach) -
      The backend server (using frameworks like Next.js, Django or Flask) pre-builds the full HTML, CSS, JS before sending it to your browser.
      Don’t worry, the frontend developer still has to write the code, but now the backend generates the complete page (with HTML, CSS, JS added) and then sends it across to your device.
      This makes loading the full page seem instant.

💡
SSR is usually used for building the initial/home screens of sites, after which each button acts as a link to a specific backend microservice for the corresponding vertical, which then triggers client-side rendering (CSR) by sending JSON data over API calls.

After processing, your information is now ready for its journey back to you, but how does information show up on your screen?


Coming Home to You: Frontend Shenanigans

When the information reaches your device, your browser parses the received page HTML from top to bottom, taking the metadata from <head> tag and the visible content from <body>.

<!-- Original HTML File -->
<html>
<head>
    <title>Irritatingly long explanation for system design</title>
    <meta charset="UTF-8">
</head>
<body>
    <header>
        <h1>Welcome, welcome</h1>
    </header>
    <main>
        <p>This is <span>important</span> text. I am <span>important</span> person.</p>
        <ul>
            <li>You can follow me on X at @yesyesmovealong.</li>
            <li>Pliss follow.🥺</li>
        </ul>
    </main>
</body>
</html>

HTML data when parsed is usually stored as DOM (Document Object Model) — a tree-like structure which essentially acts as the blueprint from which browser builds a webpage.

<!-- DOM File -->
document
└── html
    ├── head
    │   ├── title → "Irritatingly long explanation for system design"
    │   └── meta → charset="UTF-8"
    └── body
        ├── header
        │   └── h1 → "Welcome, welcome"
        └── main
            ├── p → "This is important text. I am important person."
            │   ├── span → "important"
            │   └── span → "important"
            └── ul
                ├── li → "You can follow me on X at @yesyesmovealong."
                └── li → "Pliss follow.🥺"

Along with HTML, how your content is styled is defined by CSS (Cascading Style Sheets) - ‘cascading’ because multiple rules can apply to the same element, with the later (or more specific) rule taking precedence - which arrives alongside the HTML.

Alongside HTML and CSS, JavaScript handles interactivity - click events, API calls, transitions, live chat, etc. For larger apps, people use TypeScript, a stricter version of JavaScript with types (to catch errors early).

In order to provide you a seamless experience, and for developers to build fast, modern frontend frameworks today provide ready-made building blocks - buttons, forms, layouts and other common components - so developers don’t have to rewrite code manually. Frameworks also offer state management (dynamically tracking user input and data changes) and define how changes to the frontend are applied, making the whole process enriching for us viewers.

Some of the popular frontend frameworks are:

  1. React
    With every little change to the DOM, the browser usually wastes time searching and redrawing parts of the page, making it extremely inefficient.
    React (a frontend framework) creates a lightweight copy of the DOM and applies the developer’s changes only to the copy. It then cross-checks this copy against the actual DOM, sending only the parts that have changed to it.
    This is critical for apps like Instagram where your feed loads and updates continuously.

  2. Vue
    The core idea behind Vue is reactive data binding - linking data/variables to all of their references in the UI, so any change there updates everything that depends on it, without touching the rest of the DOM.
    Gmail uses this in search, where results instantly filter without refreshing the page.

  3. Svelte
    Svelte solves for the overhead from running virtual DOMs and tracking dependencies, created in React and Vue.
    Overhead usually slows down initial page load and consumes memory, so Svelte compiles away the framework - converting your high-level code to optimised, vanilla JavaScript during build time (before the app runs in your browser).
    Essentially, instead of rendering and comparing, it precisely points out which part to change in the original DOM itself without disturbing or needing to recreate it.
    Razorpay uses Svelte since payment UIs require fast loading times and smooth animations.

💡
Svelte has the best performance amongst the three, and uses the least memory, but as a growing framework still has less libraries and community support than React or Vue.

Final Renderings

Anyway, going back to our use case, once the DOM is updated, the browser determines the final styles for every element, combining information from inline styles in the DOM, style data in CSS and related information in JavaScript.

The browser then calculates the exact position and size of each element, dependencies on other elements (which likely have relative padding) and creates the layout.

Then it goes into painting and compositing - converting element data into pixel data held in memory, and finally combining the painted layers (with all the overlaps and animations) into a final image.

The GPU flips the frame buffer, swapping the old image with the new one almost instantly (~16 ms per frame at 60 fps) - and you see the updated information.

Finally, the rendered page is shown onto your browser/app, completing the cycle of information behind a single click (*phew*).


👋
Well, that’s it! I have a small section below if you also want to read the use case for streaming content (and what different protocols are used there), but I hope you liked the overall blog above. Post your comments below, and if you liked it - do follow me at @yesyesmovealong on X or on LinkedIn where I post other silly things.

(Alt) What If You Were Watching Content Instead

  1. Transport Layer -
    If we were transporting media data instead of largely text and static images, we would likely use the below protocols built on top of TCP/UDP -

    1. Over TCP (High reliability):

      • HLS  - Relies on HTTP/TCP for chunked delivery. Chops video into small HTTP files.
        Used by Netflix, YouTube (fallback), Disney+. High latency (10–30 sec buffer).

      • LL-HLS - TCP-based, but with shorter chunks and HTTP/2 push to reduce latency to ~3 sec. Used by Apple TV+ and live shopping apps

      • MPEG-DASH - Like HLS but codec-agnostic (supports AV1, H.265); open standard alternative to HLS. YouTube’s primary protocol, also TCP-based.

      • Smooth Streaming - Legacy Microsoft protocol (TCP), largely replaced by DASH.

      • RTMP - Ancient TCP-based protocol (Flash era), now only used for ingest (e.g., sending streams to Twitch).

    2. Over UDP (High speed):

      • SRT - Adds retransmission (ARQ) and encryption to UDP. Faster than TCP (1–3 sec latency) but still reliable. Simpler than RIST in forward correction.
        Used for live production — Live sports (ESPN), news (Al Jazeera).

      • WebRTC - Pure UDP + FEC. Peer-to-peer (P2P) when possible, else relays via TURN servers. Avoids server hops (sub-500ms latency). For sub-second latency in calls. Has scaling issues (hard for 1,000+ viewers). Used by Zoom, Google Meet.

      • RIST - UDP with TCP-like recovery. Like SRT but with heavier FEC (Forward Error Correction). Can lose 20% packets and still recover (critical for satellite feeds).
        Used for Broadcast TV use case (2–5 sec latency) — by Fox Sports, BBC.

    3. Over Both/Hybrid:

      • QUIC - UDP-based but emulates TCP reliability.
        Used by YouTube/Cloudflare to speed up HLS/DASH.

      • NDI - Local LAN only. Sends uncompressed video over local networks. Needs zero-config, ultra-low latency (<100ms). Uses UDP for video/RTP and TCP for control.
        Used by OBS, CNN studios internally.

      • WebSocket - TCP-based, used for real-time data (chat, trading) but rarely for raw video. Persistent TCP connection for real-time data with no HTTP overhead (good for chat, stocks).
        Used by Slack (chat), Robinhood (stock prices).

  2. Improved Caching -
    Apart from this, platforms like Netflix have also employed their own caching servers inside ISP data centers in most countries they operate in — called Open Connect.

    When we stream, we actually fetch cached content (that they pre-cache based on what’s hot in a certain region) to give us the seamless streaming experience we are so used to.

  3. Encoding and Transcoding -
    For encoding and transcoding videos, backend systems for content-streaming platforms convert raw videos into multiple bitrates (e.g., 480p, 1080p, 4K) — so there are different packets of data to choose from depending on the end user’s (your) internet speed as well as device capability.

    Dynamic Adaptive Streaming (DASH/HLS) also ensures that the content adjusts quality based on network speed.
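    A toy sketch of that adaptive behaviour (the renditions and bitrate thresholds are made up):

     # Pick the highest rendition the measured network speed can sustain
     RENDITIONS = {480: 1.5, 1080: 5.0, 2160: 15.0}  # resolution → Mbps needed

     def pick_rendition(measured_mbps):
         affordable = [r for r, need in RENDITIONS.items() if need <= measured_mbps]
         return max(affordable) if affordable else min(RENDITIONS)

     print(pick_rendition(6.2))  # → 1080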

  4. Storage -
    For storage, videos are usually stored in distributed storage (AWS S3, Google Cloud Storage). Origin servers here handle the metadata (what to stream) while CDNs deliver the actual video.

  5. Rendering -
    Finally, the frontend (video player) fetches the manifest file, downloads chunks, and plays them seamlessly. For buffering, the player pre-loads next 3–5 chunks to avoid stuttering, and handles seek requests (jumping to a specific point in the video).


Alright fin. (Follow me at @yesyesmovealong on X or on LinkedIn).

Written by

Shantanu Sharma

I lead product at Orbit Farming, where our goal is to build India's largest farm mechanisation hub. Prior to this, I led product development at Gramhal, an NGO building social networks to reduce information asymmetry in rural India, following some stints in the Indian ed-tech space in product management and analytics. Strategizing for long-term impact, encouraging direct conversations, and having a get-things-done culture are aspects I deeply admire in an organization. That, and just some free coffee. :)