As an AI/ML practitioner, you've probably experienced the challenges of GPU access. While platforms like Google Colab and Kaggle offer free GPU support for training and fine-tuning models through notebooks, things get complicated when you need to build AI-powered software, because you won't always be working in a notebook. Fine-tuning a model on free cloud GPU resources is fine, but when it's time to integrate that model into your application for a Proof of Concept (POC) or demo, limited GPU access becomes a real pain point.

These days, there are numerous paid GPU instances and shared GPU services available. However, choosing the right GPU service can be tricky, even for production environments, when considering factors like uptime, cost, resources, and maintenance. Who wants to invest time and energy researching GPU providers just for a POC or short demo? Additionally, paying for GPUs in this case can be challenging for many learners.

Long story short: today, we'll explore how to use Google Colab's free 15GB T4 GPU for inference from your local VS Code and terminal.

Who should read this article?

AI & SWE practitioners without dedicated GPUs in their personal devices
Those seeking a quick hack to use their models from anywhere
Application developers working without a GPU instance budget for their development server
Students presenting AI projects
Practitioners needing to show quick demos
Small teams building Proof of Concept (POC) of their SaaS

This list may grow as we discover more use cases. Now, let's dive in!

Method & Code

In this quick tutorial, I'm using the pretrained EXAONE-3.5-2.4B-Instruct model, which requires approximately 10GB of disk storage, ~3GB of RAM, and ~6GB of GPU memory. Feel free to use your own trained or fine-tuned model, or any other model, but remember that Google Colab's free T4 GPU has a 15GB memory limit. Our implementation consists of two parts: (1) loading the model and creating an inference endpoint in Google Colab, and (2) sending HTTP requests to the inference endpoint from your local computer.
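To check whether a different model is likely to fit within the 15GB limit, a back-of-the-envelope estimate helps. The sketch below assumes that the "2.4B" in the model name means roughly 2.4 billion parameters and that the weights are loaded in bfloat16 (2 bytes per parameter). It covers the weights only; activations and the generation cache add more on top, which is consistent with the ~6GB figure quoted above. The function name is my own, not part of any library.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough size of the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


T4_MEMORY_GB = 15  # free Colab T4 limit mentioned in this article

# Assumption: ~2.4 billion parameters stored as bfloat16 (2 bytes each)
weights_gb = estimate_weight_memory_gb(2.4e9, bytes_per_param=2)
fits_on_t4 = weights_gb < T4_MEMORY_GB

print(f"Approximate weight memory: {weights_gb:.1f} GB")
print(f"Weights fit on a 15GB T4: {fits_on_t4}")
```

Treat the result as a lower bound: if the weights alone are already close to 15GB, the model will almost certainly not run comfortably on the free tier.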

1. Google Colab

As mentioned earlier, we will run our model on Google Colab's GPU, and you can run inference against it from your local VS Code, terminal, or anywhere else. Start by creating a new notebook in Google Colab and make sure you switch the runtime to GPU.
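Before installing anything, it is worth confirming that the notebook really is attached to a GPU. Here is a small sketch that shells out to `nvidia-smi`, which is available on GPU runtimes; the helper name `gpu_summary` is hypothetical. On a CPU-only runtime it simply returns `None` instead of failing.

```python
import shutil
import subprocess
from typing import Optional


def gpu_summary() -> Optional[str]:
    """Return the GPU name and total memory from nvidia-smi, or None if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not installed, so no NVIDIA GPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # the tool exists but cannot talk to a GPU
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected: switch the Colab runtime type to a GPU.")
```

If the message says no GPU was detected, change the runtime type from the Colab menu and rerun the cell.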

1.1. Setup Model

# Install Dependencies
! pip install torch transformers huggingface_hub structlog

# Import model dependencies
import warnings
import structlog
from huggingface_hub import login
from torch import bfloat16
from transformers import AutoModelForCausalLM, AutoTokenizer
warnings.filterwarnings("ignore", category=RuntimeWarning)

logger = structlog.get_logger(__name__)

# Add Hugging Face Access Token to download the opensource model
HF_TOKEN = "Your_HuggingFace_Token" # https://huggingface.co/settings/tokens
login(HF_TOKEN)

# Function to download the model and tokenizer from Hugging Face
def setup_model(
    model_name_or_local_path:str="LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct",
    device="auto",
):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_local_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_local_path,
        torch_dtype=bfloat16,
        trust_remote_code=True,
        device_map=device,
    )
    return model, tokenizer

# Function to run inference on the model
def infer_model(prompt, tokenizer, model, max_tokens=200, device="cuda"):
    # Accept either a plain string or a ready-made list of chat messages
    if isinstance(prompt, str):
        messages = [
            {"role": "system",
             "content": "You are a helpful chatbot."},
            {"role": "user", "content": prompt}
        ]
    else:
        messages = prompt
    logger.info("Inference Started", messages=messages)
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt"
    )
    output = model.generate(
        input_ids.to(device),
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_tokens,
        do_sample=False,
    )
    completion = tokenizer.decode(output[0])
    # Extract the generated reply from the completion. Adjust the markers to match your model's chat template
    content = completion.split("[|assistant|]")[-1].split("[|endofturn|]")[0]
    logger.info("Generation Done", content=content)
    return completion, content

# Set up the model and tokenizer.
# This downloads the model from Hugging Face and caches it
# If you already have the preferred model on disk, pass the model path instead
model, tokenizer = setup_model(
    model_name_or_local_path="LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
)

Perfect! Your model and tokenizer are now configured. You'll see the model running on Google Colab's GPU as expected.

[Note: The resource consumption shown below is higher because I've been running the notebook for about an hour.]

Resource usage on Google Colab
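One detail in `infer_model` deserves a closer look: the reply is cut out of the decoded completion by splitting on the `[|assistant|]` and `[|endofturn|]` markers. The sketch below isolates that step so you can try it on a string without loading the model. The sample text is hand-written for illustration, not real model output, and the `strip()` call is my addition to drop surrounding whitespace.

```python
def extract_reply(
    completion: str,
    start_marker: str = "[|assistant|]",
    end_marker: str = "[|endofturn|]",
) -> str:
    """Return the text between the last start marker and the next end marker."""
    after_start = completion.split(start_marker)[-1]  # everything after the last assistant tag
    reply = after_start.split(end_marker)[0]          # everything before the closing tag
    return reply.strip()


# Hand-written sample that only imitates the marker layout
sample = (
    "[|system|]You are a helpful chatbot.[|endofturn|]\n"
    "[|user|]Hi\n"
    "[|assistant|]Hello! How can I help?[|endofturn|]"
)
print(extract_reply(sample))  # Hello! How can I help?
```

If you swap in another model, pass its own markers as `start_marker` and `end_marker`. When the markers are missing, the function falls back to returning the whole string, so the output is worth a quick look after changing models.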

1.2. Tunneling

Now it is time to create a publicly accessible endpoint in the same notebook, so that we can access the model from anywhere and pass the prompt and arguments to it from a local machine.

First things first. Sign in or sign up at ngrok.com and obtain your personal authtoken from here. Although this is enough for this tutorial, I recommend getting a static Ngrok domain. After signing in to Ngrok, you can create a static domain from here. Do not worry, it is also free!

Ngrok Free Quota

Why is a static Ngrok domain better?

You can proceed without a static domain, but each time you start the Ngrok tunnel it generates a new public URL for model access. That means every time you send a request to the inference endpoint, you first have to make sure you are using the current public URL. A static domain keeps the same endpoint URL across runs, making inference much more convenient.
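If you stay on a changing URL, one way to limit the hassle on the client side is to keep the base URL in a single place, such as an environment variable, rather than hard-coding it in every script. This is only a sketch: the variable name `COLAB_INFERENCE_URL` and the helper are my own choices, and the default is the example URL used later in this article.

```python
import os

# Example tunnel URL from this article; replace it with your own
DEFAULT_BASE_URL = "https://closely-vital-puma.ngrok-free.app"


def inference_endpoint(path: str = "/gpu-inference") -> str:
    """Build the endpoint URL from COLAB_INFERENCE_URL, falling back to the default."""
    base_url = os.environ.get("COLAB_INFERENCE_URL", DEFAULT_BASE_URL)
    return base_url.rstrip("/") + path  # avoid a double slash when joining


print(inference_endpoint())
```

With this in place, a new tunnel URL only requires updating the environment variable before running your client code.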

Now, let's implement the code:

# Install ngrok dependencies
! pip install pyngrok --upgrade
! pip install flask-ngrok --upgrade
! pip install flask-cors

# Import ngrok dependencies
from flask import Flask, request, jsonify
from flask_cors import CORS
from pyngrok import ngrok

NGROK_TOKEN = "YOUR_NGROK_TOKEN"
ngrok.kill()
! ngrok config add-authtoken $NGROK_TOKEN

# Functions to run ngrok with a Flask development server and process inference requests

def process_inference(payload:dict, device="cuda"):
    prompt = payload.get('prompt')
    max_tokens = payload.get('max_tokens', 500)
    logger.info("Inference Starting", prompt=prompt, max_tokens=max_tokens)
    completion, content = infer_model(prompt, tokenizer, model, max_tokens=max_tokens, device=device)
    return content

def run_app():
    app_port = 5000
    ngrok_static_domain = "Your_Static_Domain" # Your static domain (if any), otherwise comment
    app = Flask(__name__)
    CORS(app)
    # Open the tunnel before starting Flask, because app.run() blocks the cell
    public_url = ngrok.connect(
        addr=app_port,
        domain=ngrok_static_domain  # Your static domain (if any), otherwise comment
    )
    logger.info(f"NGROK Public URL: {public_url}")

    @app.route('/gpu-inference', methods=['POST'])
    def flask_inference():
        try:
            payload = request.json
            logger.info("Received inference request", payload=payload)
            result = process_inference(payload=payload)
            return jsonify({"status": "success","result": result}), 200
        except Exception as e:
            return jsonify({"status": "error","message": str(e)}), 500
    # Serve on the same port the tunnel forwards to
    app.run(port=app_port)

# Run your Flask & Tunnel
run_app()

Boom! Your inference endpoint is now ready. You can access your model running on Google Colab's GPU from anywhere for inference. The model is also accessible from your local VS Code or terminal without requiring a dedicated GPU in your computer. A relief, right?

Notice the line Public URL: NgrokTunnel: "https://closely-vital-puma.ngrok-free.app" -> "http://localhost:5000" below. You can now send an HTTP request to https://closely-vital-puma.ngrok-free.app/gpu-inference for inference.
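The endpoint reads two keys from the JSON body: `prompt`, which can be a plain string or a list of chat messages, and `max_tokens`, which defaults to 500 on the server when omitted. As a sketch of how you might build that body safely on the client, the hypothetical helper below validates both fields before anything is sent. The specific checks are my own suggestion, not something the server enforces.

```python
from typing import Union


def build_payload(prompt: Union[str, list], max_tokens: int = 500) -> dict:
    """Validate the inputs and return the JSON body for the /gpu-inference endpoint."""
    if isinstance(prompt, str):
        if not prompt.strip():
            raise ValueError("prompt must not be empty")
    elif isinstance(prompt, list):
        if not prompt:
            raise ValueError("message list must not be empty")
        for message in prompt:
            # Each chat message needs the same keys the notebook uses: role and content
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                raise ValueError("each message needs 'role' and 'content' keys")
    else:
        raise TypeError("prompt must be a string or a list of chat messages")
    if not isinstance(max_tokens, int) or max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return {"prompt": prompt, "max_tokens": max_tokens}


single_turn = build_payload("Hello!", max_tokens=100)
multi_turn = build_payload([
    {"role": "system", "content": "You are a helpful chatbot."},
    {"role": "user", "content": "Hello!"},
])
print(single_turn)
print(multi_turn["max_tokens"])
```

Passing a message list is useful when you want your own system prompt or a multi-turn conversation, since a plain string always gets the default "You are a helpful chatbot." system message.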

This concludes the Google Colab setup. Let's move on to running inference from your local environment.

2. Inference from Local VS Code & Terminal

We'll demonstrate how to perform inference using the Python 3 interpreter, the Ubuntu terminal, Windows PowerShell, and VS Code, one by one.

2.1. Python Interpreter

taufiq_wsl@Taufiq:~$ python3
Python 3.12.3 (main, Nov  6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> headers = {'Content-Type': 'application/json'}
>>> endpoint = "https://closely-vital-puma.ngrok-free.app/gpu-inference"
>>> prompt = "Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security? I am specifically asking for Dhaka City."
>>> payload = {"prompt":prompt, "max_tokens":300}
>>> response = requests.post( endpoint, json=payload, headers=headers, timeout=30)
>>> response.json()
{'result': "In Dhaka City, Bangladesh, both Artificial Intelligence (AI) and Cyber Security are rapidly growing fields with competitive salaries, but they tend to attract different skill sets and roles, which can influence salary levels based on expertise and experience. Here’s a general comparison based on current trends:\n\n### Cyber Security:\n1. **High Demand**: Cyber Security is one of the fastest-growing sectors globally, including in Bangladesh.\n2. **Salary Range**: Salaries in Cyber Security can vary widely depending on experience and specific roles:\n   - **Entry Level**: BDT 300,000 - BDT 500,000 per annum\n   - **Mid-Level**: BDT 500,000 - BDT 1,000,000 per annum\n   - **Senior Level**: BDT 1,000,000+ per annum\n3. **Key Roles**: Network Security Engineers, Security Analysts, Penetration Testers, Security Architects, Compliance Officers.\n\n### Artificial Intelligence (AI):\n1. **Emerging Field**: AI is still developing in Bangladesh compared to Cyber Security, but it’s gaining traction.\n2. **Salary Range**: Salaries in AI can be competitive but might be slightly lower due to the field's relative youth and less established infrastructure:\n   - **", 'status': 'success'}
>>>
>>> # Seems a cutoff in the response. Let's increase token limit and tweak prompt to write concisely
>>> prompt = "Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security? I am specifically asking for Dhaka City. Please answer very concisely"
>>> payload = {"prompt":prompt, "max_tokens":500}
>>> response = requests.post( endpoint, json=payload, headers=headers, timeout=30)
>>> response.json()
{'result': "In Dhaka City, **Cyber Security** generally offers higher salaries compared to AI for similar roles due to increasing demand and criticality in Bangladesh's tech landscape.", 'status': 'success'}
>>>
>>> prompt = "Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security? I am specifically asking for Dhaka City. Please answer very concisely but include the salary range for junior engineer."
>>> payload = {"prompt":prompt, "max_tokens":500}
>>> response = requests.post( endpoint, json=payload, headers=headers, timeout=30)
>>> response.json()
{'result': 'In Dhaka City, **Cyber Security** generally offers higher salaries compared to AI for junior engineers. \n\n**Salary Range (Junior Engineer):**\n- **Cyber Security:** BDT 300,000 - BDT 500,000 annually.\n- **AI:** BDT 250,000 - BDT 400,000 annually.', 'status': 'success'}
>>>

2.2. Ubuntu Terminal (Bash)

taufiq_wsl@Taufiq:~$ curl -X POST "https://closely-vital-puma.ngrok-free.app/gpu-inference" -H "Content-Type: application/json" -d '{"prompt":"Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security? I am specifically asking for Dhaka City.","max_tokens":700}' -m 30
{"result":"In Dhaka City, Bangladesh, both Artificial Intelligence (AI) and Cyber Security are rapidly growing fields with competitive salaries, but they tend to attract different skill sets and roles, which can influence salary levels based on expertise and experience. Here\u2019s a general comparison based on current trends:\n\n### Cyber Security:\n1. **High Demand**: Cyber Security is one of the fastest-growing sectors globally, including in Bangladesh.\n2. **Salary Range**: Salaries in Cyber Security can vary widely depending on experience and specific roles:\n   - **Entry Level**: BDT 300,000 - BDT 500,000 per annum\n   - **Mid-Level**: BDT 500,000 - BDT 1,000,000 per annum\n   - **Senior Level**: BDT 1,000,000+ per annum\n3. **Key Roles**: Network Security Engineers, Security Analysts, Penetration Testers, Security Architects, Compliance Officers.\n\n### Artificial Intelligence (AI):\n1. **Emerging Field**: AI is still developing in Bangladesh compared to Cyber Security, but it\u2019s gaining traction.\n2. **Salary Range**: Salaries in AI can be competitive but might be slightly lower due to the field's relative youth and less established infrastructure:\n   - **Entry Level**: BDT 250,000 - BDT 400,000 per annum\n   - **Mid-Level**: BDT 400,000 - BDT 800,000 per annum\n   - **Senior Level**: BDT 800,000+ per annum\n3. **Key Roles**: Machine Learning Engineers, Data Scientists, AI Researchers, AI Product Managers, AI Ethicists.\n\n### Conclusion:\nWhile Cyber Security tends to offer slightly higher salaries across most levels due to its established presence and critical importance in protecting digital assets, both fields are lucrative. **Cyber Security** might edge out slightly due to its broader applicability and higher demand for specialized skills in Dhaka City. 
However, **AI** is rapidly expanding and could offer significant opportunities for high earners as the technology matures further in Bangladesh.\n\nFor precise figures, it would be beneficial to consult recent job listings, salary surveys specific to Dhaka City, or professional networking platforms like LinkedIn to get the most current and detailed insights.","status":"success"}
taufiq_wsl@Taufiq:~$

2.3. Windows PowerShell

PowerShell 7.4.6
PS C:\Users\DELL> curl.exe -X POST "https://closely-vital-puma.ngrok-free.app/gpu-inference" -H "Content-Type: application/json" -d '{"prompt":"Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security? I am specifically asking for Dhaka City.","max_tokens":700}' -m 30
{"result":"In Dhaka City, Bangladesh, both Artificial Intelligence (AI) and Cyber Security are rapidly growing fields with competitive salaries, but they tend to attract different skill sets and roles, which can influence salary levels based on expertise and experience. Here\u2019s a general comparison based on current trends:\n\n### Cyber Security:\n1. **High Demand**: Cyber Security is one of the fastest-growing sectors globally, including in Bangladesh.\n2. **Salary Range**: Salaries in Cyber Security can vary widely depending on experience and specific roles:\n   - **Entry Level**: BDT 300,000 - BDT 500,000 per annum\n   - **Mid-Level**: BDT 500,000 - BDT 1,000,000 per annum\n   - **Senior Level**: BDT 1,000,000+ per annum\n3. **Key Roles**: Network Security Engineers, Security Analysts, Penetration Testers, Security Architects, Compliance Officers.\n\n### Artificial Intelligence (AI):\n1. **Emerging Field**: AI is still developing in Bangladesh compared to Cyber Security, but it\u2019s gaining traction.\n2. **Salary Range**: Salaries in AI can be competitive but might be slightly lower due to the field's relative youth and less established infrastructure:\n   - **Entry Level**: BDT 250,000 - BDT 400,000 per annum\n   - **Mid-Level**: BDT 400,000 - BDT 800,000 per annum\n   - **Senior Level**: BDT 800,000+ per annum\n3. **Key Roles**: Machine Learning Engineers, Data Scientists, AI Researchers, AI Product Managers, AI Ethicists.\n\n### Conclusion:\nWhile Cyber Security tends to offer slightly higher salaries across most levels due to its established presence and critical importance in protecting digital assets, both fields are lucrative. **Cyber Security** might edge out slightly due to its broader applicability and higher demand for specialized skills in Dhaka City. 
However, **AI** is rapidly expanding and could offer significant opportunities for high earners as the technology matures further in Bangladesh.\n\nFor precise figures, it would be beneficial to consult recent job listings, salary surveys specific to Dhaka City, or professional networking platforms like LinkedIn to get the most current and detailed insights.","status":"success"}
PS C:\Users\DELL>

2.4. VS Code

# Write in a '.py' file
import requests

def inference_from_vscode(prompt, max_tokens=700, timeout=30):
    headers = {"Content-Type": "application/json"}
    endpoint = "https://closely-vital-puma.ngrok-free.app/gpu-inference"
    payload = {"prompt": prompt, "max_tokens": max_tokens}
    response = requests.post(endpoint, json=payload, headers=headers, timeout=timeout)
    return response.json()

if __name__ == "__main__":
    prompt = "Hi Exa! Tell me which filed has highest salary in Bangladesh between AI and Cyber Security?"
    response = inference_from_vscode(prompt)
    print(response["result"])

Run the above code as a Python file and you'll get output like the one below.

Output:

In Bangladesh, both Artificial Intelligence (AI) and Cyber Security are rapidly growing fields with significant demand, but the salary structures can vary based on several factors including experience level, specific roles, company size, and location within the country. Here’s a general comparison based on recent trends:

### Cyber Security:
1. **Entry-Level Positions**: Typically, entry-level positions like Junior Security Analyst or Security Engineer can start around BDT 300,000 to BDT 500,000 per annum.
2. **Mid-Level Positions**: Mid-level roles such as Security Architect or Senior Security Analyst might range from BDT 500,000 to BDT 1,000,000 annually.
3. **Senior Positions**: Senior roles like Chief Information Security Officer (CISO) or Director of Security can command salaries upwards to BDT 1,500,000 to BDT 3,000,000 or more annually.

### Artificial Intelligence (AI):
1. **Entry-Level Positions**: For AI roles like Machine Learning Engineer or Data Scientist, entry-level salaries can start around BDT 300,000 to BDT 600,000 per annum.
2. **Mid-Level Positions**: Mid-level roles such as Senior Machine Learning Engineer or AI Research Scientist might range from BDT 600,000 to BDT 1,200,000 annually.
3. **Senior Positions**: Senior positions like Chief AI Officer or Director of AI Research can earn significantly more, often ranging from BDT 1,200,000 to BDT 3,000,000 or more annually.

### Summary:
- **Cyber Security** tends to have a strong presence in foundational roles with competitive salaries, especially as the field matures and demand increases.
- **AI** is growing rapidly, particularly in research and development roles, which can offer higher salaries due to the specialized nature of the work and the increasing importance of AI technologies across various industries.

### Factors Influencing Salary:
- **Experience**: More experienced professionals typically earn higher salaries.
- **Company Size and Industry**: Larger companies and those in sectors heavily reliant on technology (like finance, healthcare, and tech startups) often offer higher salaries.
- **Skill Set**: Specialized skills in AI, particularly in areas like deep learning, natural language processing, etc., can command premium salaries.

Given these points, **AI** roles, especially at senior levels, often have the potential to offer higher salaries due to the specialized nature of the work and the broader applicability of AI technologies across multiple industries. However, the exact figures can vary widely based on individual circumstances and market conditions.
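The VS Code client above prints `response["result"]` directly, which raises a `KeyError` if the server returns its error body (`{"status": "error", "message": ...}`) instead. A small sketch of a more defensive way to unwrap the response follows; the helper name `unwrap_response` is hypothetical and the exception types are simply my choice.

```python
def unwrap_response(body: dict) -> str:
    """Return the generated text, or raise with the server's message on failure."""
    status = body.get("status")
    if status == "success":
        return body["result"]
    if status == "error":
        message = body.get("message", "no message provided")
        raise RuntimeError(f"Inference failed on the server: {message}")
    raise ValueError(f"Unexpected response body: {body!r}")


# Sample bodies shaped like the two responses the endpoint can return
ok_body = {"status": "success", "result": "Hello from Colab!"}
error_body = {"status": "error", "message": "CUDA out of memory"}

print(unwrap_response(ok_body))
try:
    unwrap_response(error_body)
except RuntimeError as exc:
    print(exc)
```

In the client, you would call `unwrap_response(response.json())` in place of indexing into the dictionary.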

Conclusion

In this tutorial, we've explored how to leverage Google Colab's free GPU for inference from local devices. While some cloud providers offer limited free GPU access for application development (Flask, Django, etc.), in most cases they require an educational or organizational email address. Instead, we've focused on a solution accessible to everyone.

Feel free to check out the complete Google Colab Notebook for this tutorial. The code is available for you to copy and adapt for your use cases. If you’re curious to learn a bit more about the EXAONE-3.5 model, please refer to this article titled “EXAONE-3.5-2.4B: A Ultra-lightweight but High Performing LLM on Just 6GB GPU”. Good luck with your AI journey. Thank you!

About the Author: Taufiq is an Artificial Intelligence Engineer from Bangladesh, currently working remotely at LaLoka Labs, Tokyo, Japan as an NLP Engineer & Backend Developer (Level 3). Feel free to connect with him on LinkedIn: https://www.linkedin.com/in/taufiq-khan-tusar/.

Hacks to use Google Colab Free GPU on Local VS Code or Terminal for Inference