Notebooks
G
Google Gemini
[Gemma 2]Deploy With VLLM

[Gemma 2]Deploy With VLLM

recurrentgemmagemmapaligemmagemma-cookbookcodegemma
Copyright 2024 Google LLC.
[ ]

Gemma - deploy with vLLM

This notebook demonstrates how you can deploy a Gemma model with vLLM and query it. vLLM is a fast and easy-to-use library for LLM inference and serving, and has built-in support for Gemma deployment.

Setup

Select the Colab runtime

To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

  1. In the upper-right of the Colab window, select ▾ (Additional connection options).
  2. Select Change runtime type.
  3. Under Hardware accelerator, select T4 GPU.

Gemma setup on Hugging Face

vLLM uses Hugging Face under the hood. So you will need to:

Installation

[ ]
Requirement already satisfied: vllm in /usr/local/lib/python3.10/dist-packages (0.4.2)
Requirement already satisfied: cmake>=3.21 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.27.9)
Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from vllm) (1.11.1.1)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from vllm) (5.9.5)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from vllm) (0.1.99)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from vllm) (1.25.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vllm) (2.31.0)
Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from vllm) (9.0.0)
Requirement already satisfied: transformers>=4.40.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (4.41.1)
Requirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.19.1)
Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from vllm) (0.111.0)
Requirement already satisfied: openai in /usr/local/lib/python3.10/dist-packages (from vllm) (1.30.2)
Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/dist-packages (from vllm) (0.29.0)
Requirement already satisfied: pydantic>=2.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.7.1)
Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.20.0)
Requirement already satisfied: prometheus-fastapi-instrumentator>=7.0.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (7.0.0)
Requirement already satisfied: tiktoken==0.6.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.6.0)
Requirement already satisfied: lm-format-enforcer==0.9.8 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.9.8)
Requirement already satisfied: outlines==0.0.34 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.34)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from vllm) (4.11.0)
Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.14.0)
Requirement already satisfied: ray>=2.9 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.23.0)
Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.10/dist-packages (from vllm) (12.550.52)
Requirement already satisfied: vllm-nccl-cu12<2.19,>=2.18 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.18.1.0.4.0)
Requirement already satisfied: torch==2.3.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.3.0+cu121)
Requirement already satisfied: xformers==0.0.26.post1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.26.post1)
Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (0.3.3)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (24.0)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (6.0.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (3.1.4)
Requirement already satisfied: lark in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.1.9)
Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.6.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (2.2.1)
Requirement already satisfied: diskcache in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (5.6.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.11.4)
Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.58.1)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.4.2)
Requirement already satisfied: referencing in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.35.1)
Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (4.19.2)
Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken==0.6.0->vllm) (2023.12.25)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (3.3)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2023.6.0)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (8.9.2.26)
Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.3.1)
Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.0.2.54)
Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (10.3.2.106)
Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.4.5.107)
Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.0.106)
Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.20.5)
Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)
Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.3.0)
Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.3.0->vllm) (12.5.40)
Requirement already satisfied: starlette<1.0.0,>=0.30.0 in /usr/local/lib/python3.10/dist-packages (from prometheus-fastapi-instrumentator>=7.0.0->vllm) (0.37.2)
Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (0.7.0)
Requirement already satisfied: pydantic-core==2.18.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (2.18.2)
Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (8.1.7)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.0.8)
Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (3.20.3)
Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.3.1)
Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.4.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2024.2.2)
Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.19.1->vllm) (0.23.1)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (0.4.3)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (4.66.4)
Requirement already satisfied: fastapi-cli>=0.0.2 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.4)
Requirement already satisfied: httpx>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.27.0)
Requirement already satisfied: python-multipart>=0.0.7 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.9)
Requirement already satisfied: ujson!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (5.10.0)
Requirement already satisfied: orjson>=3.2.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (3.10.3)
Requirement already satisfied: email_validator>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (2.1.1)
Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.14.0)
Requirement already satisfied: httptools>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.6.1)
Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (1.0.1)
Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.19.0)
Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.21.0)
Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (12.0)
Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (3.7.1)
Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai->vllm) (1.7.0)
Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (1.3.1)
Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai->vllm) (1.2.1)
Requirement already satisfied: dnspython>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from email_validator>=2.0.0->fastapi->vllm) (2.6.1)
Requirement already satisfied: typer>=0.12.3 in /usr/local/lib/python3.10/dist-packages (from fastapi-cli>=0.0.2->fastapi->vllm) (0.12.3)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->fastapi->vllm) (1.0.5)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->outlines==0.0.34->vllm) (2.1.5)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (23.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (2023.12.1)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (0.18.1)
Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->outlines==0.0.34->vllm) (0.41.1)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.3.0->vllm) (1.3.0)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (1.5.4)
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (13.7.1)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (2.16.1)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (0.1.2)

Inference

Import vLLM

[ ]

Instantiate vLLM's engine (note that Colab's free T4 GPU only supports float32).

[ ]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]
INFO 05-24 06:23:04 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=google/gemma-2b)
tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]
generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]
INFO 05-24 06:23:06 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 05-24 06:23:08 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 05-24 06:23:08 selector.py:32] Using XFormers backend.
WARNING 05-24 06:23:10 gemma.py:54] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (`gelu_pytorch_tanh`). If you want to use the legacy `gelu`, edit the config JSON to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
INFO 05-24 06:23:10 weight_utils.py:199] Using model weights format ['*.safetensors']
model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]
model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]
INFO 05-24 06:24:08 model_runner.py:175] Loading model weights took 9.3440 GB
INFO 05-24 06:24:16 gpu_executor.py:114] # GPU blocks: 1724, # CPU blocks: 7281
INFO 05-24 06:24:19 model_runner.py:937] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-24 06:24:19 model_runner.py:941] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 05-24 06:24:29 model_runner.py:1017] Graph capturing finished in 11 secs.

Run inference with a single prompt

[ ]
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]

Prompt: The capital of USA is, Gemma output:  planning to invest in sustainable transportation and is ready to accept transformational advances in technology

Run batch inference

[ ]
[ ]
Processed prompts: 100%|██████████| 3/3 [00:00<00:00,  3.59it/s]

Prompt: The best thing about California is, Gemma output:  that there are always new ways to spend time and enjoy the magnificent state as a

Prompt: Physics studies, Gemma output:  matter, its properties, how it interacts with other matter, and other fields of

Prompt: The best place for sightseeing in Japan is, Gemma output:  Kyoto. It is the very first city you will visit in Japan when you get