Recently, Google DeepMind announced Gemma:
A family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models
I was curious whether I could run these models on a local server. Gemma features both a 7B and a 2B parameter model, so I thought I'd have a shot on one of my cheap 4-CPU Intel servers that has 16GB of RAM.
The 2B parameter model, at 2 bytes / parameter, should be around 4GB:
2 x 1024 x 1024 x 1024 parameters x 2 bytes / parameter = 4,294,967,296 bytes
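A quick back-of-the-envelope check of that figure in Python:
# Rough size estimate: "2B" parameters, counted here as 2 * 1024^3,
# at 2 bytes per parameter (e.g. bfloat16/float16 weights).
params = 2 * 1024**3
bytes_per_param = 2
print(params * bytes_per_param)            # 4294967296
print(params * bytes_per_param / 1024**3)  # 4.0 (GiB)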
I'm following the instructions on the Kaggle page.
I also had to download my Kaggle API token into ~/.kaggle/kaggle.json. This was required in order to run the Keras option below.
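As a sanity check (my own, not part of the Kaggle instructions), the token file should contain a username/key pair and should only be readable by you, otherwise the Kaggle tooling complains. A quick way to verify both:
import json, os, stat

path = os.path.expanduser("~/.kaggle/kaggle.json")
with open(path) as fid:
    creds = json.load(fid)
print(sorted(creds))                         # expect ['key', 'username']
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # kaggle warns if others can read it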
The first option is to use the Python keras package, which I've used in the past.
The total space taken by the model is 4.7G.
Python setup:
pyenv install 3.10.10
pyenv virtualenv 3.10.10 gemma-keras
pyenv activate gemma-keras
pip install --upgrade keras-nlp-nightly
# Alternatively, from https://github.com/keras-team/keras-nlp
# pip install --upgrade keras-nlp
# pip install --upgrade keras>=3
Sample script, gemma.py:
import keras_nlp
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
print(gemma_lm.generate("Keras is a", max_length=30))
I needed to create and run this script inside the downloaded model directory, or I would see errors when importing the keras_nlp library in Python.
When this script was run the first time, it re-downloaded the 4.7G model into a new directory, which was super annoying:
~/.cache/kagglehub/models/keras/gemma/keras/gemma_2b_en/2/model.weights.h5
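Two untested workarounds for the duplicate download come to mind: kagglehub documents a KAGGLEHUB_CACHE environment variable for relocating its cache, and newer keras_nlp releases appear to accept a local preset directory in from_preset. A sketch, with hypothetical paths:
import os
# Option 1: move kagglehub's cache; must be set before keras_nlp is imported.
os.environ["KAGGLEHUB_CACHE"] = "/data/kagglehub"  # hypothetical location

import keras_nlp

# Option 2: load straight from the directory downloaded via the Kaggle page,
# assuming your keras_nlp version accepts a local preset directory here.
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("./gemma_2b_en")
print(gemma_lm.generate("Keras is a", max_length=30))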
Invoking the script:
$ time python gemma.py
Keras is a popular deep learning library for Python. It is easy to use and has a wide range of features. One of the features of
real 1m38.011s
user 1m52.609s
sys 0m17.086s
So it took 98 seconds to produce about 30 tokens, including the model load time. Ha! I was surprised that it actually worked; not all of the other methods did.
I also tried the Flax option, because this page said that it could be run with either a CPU, GPU, or TPU.
I downloaded the 2b model, which takes 3.7GB:
$ du -sh
3.7G .
Python setup:
pyenv install 3.10.10
pyenv virtualenv 3.10.10 gemma
pyenv activate gemma
pip install git+https://github.com/google-deepmind/gemma.git
Invoking the sample script:
git clone https://github.com/google-deepmind/gemma.git
cd gemma
python examples/sampling.py \
--path_checkpoint=../model/2b/ \
--path_tokenizer=../model/tokenizer.model \
--string_to_sample="Where is Salt Lake City?"
This looked good at first:
Loading the parameters from /home/schaubj/work/jschaub30/gemma/model/2b/
I0223 14:37:24.434225 135111827680128 checkpointer.py:164] Restoring item from /home/schaubj/work/jschaub30/gemma/model/2b.
I0223 14:37:34.196148 135111827680128 xla_bridge.py:689] Unable to initialize backend 'cuda':
I0223 14:37:34.196315 135111827680128 xla_bridge.py:689] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0223 14:37:34.198884 135111827680128 xla_bridge.py:689] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
I0223 14:37:34.299071 135111827680128 checkpointer.py:167] Finished restoring checkpoint from /home/schaubj/work/jschaub30/gemma/model/2b.
Parameters loaded.
...but crashed after about 30sec with this helpful error message:
Killed
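Those "Unable to initialize backend" lines, by the way, are only INFO messages; they just mean JAX fell back to the CPU. A quick way to confirm which devices JAX ends up using:
import jax
print(jax.devices())  # e.g. [CpuDevice(id=0)] when only the CPU backend loads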
I decided to run my viz_workload profiler to try and uncover any problems.
The results are here.
The memory profile immediately stood out. The used memory (in red) peaked at 15.1 GB right when the workload crashed.
I then checked the dmesg log and saw many memory-related errors:
[ 330.616577] Out of memory: Killed process 2646 (sampling.py) total-vm:24639840kB, anon-rss:15396924kB, file-rss:256kB, shmem-rss:0kB, UID:1000 pgtables:35264kB oom_score_adj:0
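Short of a full profiler, a dependency-free way to catch this kind of failure is to launch the sampler from a small wrapper and read the child process's peak RSS afterwards. A sketch (Linux-only; paths assumed to match the command above):
import resource, subprocess, sys

cmd = [sys.executable, "examples/sampling.py",
       "--path_checkpoint=../model/2b/",
       "--path_tokenizer=../model/tokenizer.model",
       "--string_to_sample=Where is Salt Lake City?"]
ret = subprocess.run(cmd).returncode
# On Linux, ru_maxrss for waited-for children is reported in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"exit code {ret}, peak child RSS ~ {peak_kb / 1024**2:.1f} GiB")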
The transformers description looks promising:
A transformers implementation of Gemma-2b-instruct. It is a good choice for Python developers that want a quick and easy way to prompt a model, or even begin developing LLM applications. It is a 2B parameter, instruction-tuned LLM.
The downloaded model size was:
$ du -sh
15G .
This is much larger than expected; maybe a mistake?
The setup is easy enough:
pyenv virtualenv 3.10.10 gemma-transform
pyenv activate gemma-transform
pip install --upgrade pip
pip install transformers
pip install torch # not documented! Took forever
Sample script, which I modified to read the access API token from a local .token file:
from transformers import AutoTokenizer, AutoModelForCausalLM
with open(".token", "r") as fid:
access_token = fid.read().strip()
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", token=access_token)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", token=access_token)
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
Invoking the script:
$ time python gemma.py
Loading checkpoint shards: 100%|█████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.47s/it]
<bos>Write me a poem about Machine Learning.
Machines, they weave and they learn,
From the data, they discern.
Algorithms, a symphony,
Unleash the power of
real 0m28.090s
user 1m12.143s
sys 0m22.906s
So this is much quicker than keras: 28 seconds, including the model load, for about 30 new tokens.
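One knob worth noting for memory-constrained machines: from_pretrained accepts a torch_dtype argument, so the weights can be loaded in bfloat16 instead of the float32 default. A hedged variant of the script above, not benchmarked here:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

with open(".token", "r") as fid:
    access_token = fid.read().strip()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it", token=access_token)
# torch_dtype=torch.bfloat16 should roughly halve the in-memory weight footprint
# compared to a float32 load; the effect on CPU speed is untested here.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it", token=access_token, torch_dtype=torch.bfloat16
)

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt")
print(tokenizer.decode(model.generate(**input_ids, max_new_tokens=30)[0]))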