Install Gemma 4 Free - Step-by-Step Guide

How to Install Gemma 4 Free Version
A step-by-step guide to getting Google's powerful open-weights AI model running on your machine — no prior experience required.
Get Started
System Requirements
Before You Begin
What Is Gemma 4?
Gemma 4 is Google DeepMind's latest open-weights language model, available free for personal and research use. It delivers state-of-the-art text generation, coding assistance, and question answering — all running locally on your own hardware, with no cloud subscription required.
The Free Version gives you access to the base model weights, letting you run inference, fine-tune, and experiment without usage limits or API costs. Whether you're a developer, student, or curious enthusiast, Gemma 4 is designed to be accessible and powerful.
Why Gemma 4?
100% free for personal use
Runs locally — your data stays private
No API key or subscription needed
Supports multiple hardware setups
Active open-source community
System Requirements
Make Sure Your System Is Ready
💻 Operating System
Windows 10/11, macOS 12+, or any modern Linux distribution (Ubuntu 20.04+ recommended)
🧠 RAM
Minimum 8 GB RAM. 16 GB or more is strongly recommended for smooth performance.
💾 Disk Space
At least 10 GB of free storage for the model weights and dependencies.
🎮 GPU (Optional)
An NVIDIA GPU with 6 GB+ VRAM dramatically speeds up inference. CPU-only mode is supported.
Step 1 of 7
Create a Hugging Face Account
Gemma 4's model weights are hosted on Hugging Face, the leading platform for open-source AI models. You'll need a free account to accept Google's usage licence and download the model files.
Head to huggingface.co and click "Sign Up" in the top-right corner. Fill in your name, email address, and choose a password. After verifying your email, your account will be ready in under two minutes.
Already have a Hugging Face account? Skip ahead to accepting the model licence.
Step 2 of 7
Accept the Gemma 4 Licence
Before downloading, you must accept Google's Gemma Terms of Use. Visit the official Gemma 4 model page on Hugging Face — search for "google/gemma-4" in the search bar. On the model page, you will see a licence agreement prompt. Read through the terms and click "Agree and access repository". Access is granted instantly and automatically.
You must be logged in to your Hugging Face account to see and accept the licence agreement. The download button will not appear until this step is completed.
Step 3 of 7
Install Python on Your Machine
Gemma 4 requires Python 3.9 or higher. Open your terminal (or Command Prompt on Windows) and check if Python is already installed by typing:
python --version
If Python is not installed, download the latest stable release from python.org. Run the installer and make sure to tick "Add Python to PATH" during setup on Windows — this is a common step that's easy to miss.
macOS and most Linux distributions come with Python pre-installed. You may only need to upgrade to a newer version.
Step 4 of 7
Set Up a Virtual Environment
It is best practice to install Gemma 4 inside a virtual environment — an isolated Python workspace that keeps its dependencies separate from the rest of your system. This prevents version conflicts and makes it easy to remove everything cleanly later.
01
Open your terminal
Navigate to the folder where you want to store your Gemma project using the cd command.
02
Create the environment
Run python -m venv gemma-env to create a new virtual environment named gemma-env.
03
Activate it
On Windows: gemma-env\Scripts\activate — On macOS/Linux: source gemma-env/bin/activate
04
Confirm activation
Your terminal prompt should now show (gemma-env) at the start — you're inside the environment.
Step 5 of 7
Install Required Libraries
With your virtual environment active, you'll install the core Python libraries needed to load and run Gemma 4. These include Transformers (by Hugging Face), PyTorch, and a few supporting packages. Run the following command in your terminal:
pip install transformers torch accelerate bitsandbytes
This may take a few minutes depending on your internet speed. Once complete, all the necessary components will be installed and ready. If you have an NVIDIA GPU, also run:
pip install nvidia-cublas-cu12
The accelerate library enables multi-GPU and CPU offloading, while bitsandbytes allows 4-bit quantisation to reduce memory usage significantly.
Step 6 of 7
Generate Your Hugging Face Access Token
To download gated models like Gemma 4, you need a personal access token from Hugging Face. This token proves your identity and confirms you've accepted the licence terms.
Log in to your Hugging Face account, click your profile picture in the top-right, then go to Settings → Access Tokens. Click "New token", give it a name (e.g. "gemma-install"), and select the Read role. Click "Generate" and copy the token — it looks like hf_xxxxxxxxxx.
Save your token somewhere safe immediately. Hugging Face will only show it to you once. If you lose it, you'll need to generate a new one.
Once you have your token, authenticate your terminal session by running:
huggingface-cli login
Paste your token when prompted and press Enter.
Step 7 of 7
Download the Gemma 4 Model Weights
You're now ready to download Gemma 4! The model weights can be downloaded directly using Python. Create a new file called download_gemma.py and add the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-4-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)
Run it with python download_gemma.py. The download is approximately 5–9 GB depending on the model variant selected.
Choosing the Right Model Variant
Gemma 4 comes in several sizes. Choosing the right one for your hardware ensures the best balance of speed, quality, and memory usage.
Gemma-4-2B
Best for: Low-resource machines, quick experiments
RAM needed: ~6 GB
Lightweight and fast — great for laptops and machines without a dedicated GPU.
Gemma-4-7B
Best for: Balanced performance on consumer hardware
RAM needed: ~14 GB
The sweet spot for most users — strong reasoning with manageable resource demands.
Gemma-4-27B
Best for: High-end workstations and servers
RAM needed: ~55 GB
Maximum capability for complex tasks — requires significant GPU memory or quantisation.
Pro Tip
Speed Up with 4-Bit Quantisation
If your machine has limited VRAM or RAM, 4-bit quantisation lets you run larger models by compressing the model weights. This can cut memory usage by up to 75% with minimal quality loss — a game-changer for consumer hardware.
Modify your loading code to use the BitsAndBytesConfig as shown below:
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto"
)
Run Your First Inference
With the model downloaded and loaded, it's time to run your first prompt! Add the following code to generate a response from Gemma 4:
input_text = "Explain quantum computing in simple terms."

inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Run the script with python your_script.py. Within a few seconds — or moments if you have a GPU — Gemma 4 will generate its response directly in your terminal. Congratulations, you're now running a state-of-the-art AI model locally!
Alternative Method
Install via Ollama (Easier Option)
If the Python setup feels complex, Ollama is a fantastic alternative. It wraps the entire model management process into a single, easy-to-use command-line tool — no virtual environments or token setup needed.
Step 1: Download Ollama from ollama.com and install it like any regular application.
Step 2: Open a terminal and run:
ollama run gemma4
Ollama automatically downloads the model and starts an interactive chat session. It's the fastest path from zero to running Gemma 4.
Ollama supports macOS, Windows, and Linux and handles GPU acceleration automatically.
Using Gemma 4 with a Web Interface
Prefer a visual chat interface instead of the terminal? Open WebUI paired with Ollama gives you a browser-based chat experience similar to ChatGPT — all running locally on your machine.
Install Open WebUI
Run pip install open-webui in your terminal after ensuring Ollama is already installed and running.
Launch the Server
Start the interface with open-webui serve and wait for the local server to initialise.
Open in Browser
Navigate to http://localhost:3000 in your web browser to access the chat dashboard.
Select Gemma 4
Choose gemma4 from the model dropdown and start chatting with a clean, intuitive interface.
Troubleshooting Common Issues
❌ "CUDA out of memory" error
Your GPU doesn't have enough VRAM. Switch to a smaller model variant (e.g. 2B) or enable 4-bit quantisation using BitsAndBytesConfig as shown in the earlier step.
❌ "401 Unauthorised" when downloading
Your Hugging Face token is missing or expired. Re-run huggingface-cli login and paste a fresh token. Make sure you've accepted the Gemma 4 licence on the model page.
❌ Slow generation speed on CPU
CPU inference is significantly slower than GPU. Consider enabling device_map="cpu" with quantisation, or try Ollama which optimises CPU performance more aggressively.
❌ "Module not found" errors
Your virtual environment may not be activated. Run the activation command again and re-install dependencies with pip install transformers torch accelerate.
Installation Summary
The Full Process at a Glance
Following these seven steps takes most users between 20 and 45 minutes depending on internet speed and hardware setup. The Ollama method can cut this down to under 5 minutes for those who prefer simplicity over flexibility.
What Can You Do with Gemma 4?
Code Generation
Ask Gemma 4 to write, review, debug, or explain code across dozens of programming languages. It excels at Python, JavaScript, and SQL.
Content Writing
Generate essays, summaries, emails, marketing copy, and creative writing with nuanced, context-aware output.
Research Assistant
Ask complex questions, summarise documents, and get detailed explanations on virtually any topic — entirely offline.
Next Steps & Further Learning
Explore the Community
The Hugging Face community has thousands of Gemma-based fine-tunes, adapters, and example notebooks. Browse the Models section and filter by "gemma" to discover specialised variants for coding, medicine, and more.
Fine-Tune for Your Use Case
Using the PEFT library and LoRA adapters, you can fine-tune Gemma 4 on your own dataset with as little as 6 GB of VRAM. The Hugging Face documentation has detailed guides to get you started.
Useful Resources
huggingface.co/google/gemma-4
ollama.com — easy local model runner
github.com/huggingface/transformers
Hugging Face Forums & Discord
Google DeepMind Gemma blog
open-webui.com — browser UI
You're Ready to Run Gemma 4!
You've successfully installed Google's Gemma 4 Free Version and run your first AI inference locally. From here, the possibilities are virtually endless — build apps, automate tasks, or simply explore the frontiers of open-source AI from the comfort of your own machine.
🚀 Keep Experimenting
Try different prompts, adjust generation parameters, and explore what Gemma 4 does best.
🤝 Join the Community
Share your projects and get help from thousands of developers on the Hugging Face forums.
🔄 Stay Updated
Watch the Gemma model page for new releases, fine-tunes, and performance improvements.
Visit Gemma on Hugging Face
Try Ollama