Building with Llama 4!

Welcome to a walkthrough of building with the Llama 4 Scout model, a state-of-the-art multimodal and multilingual Mixture-of-Experts LLM.

This notebook jumps right in and shows you what's new with our models and how to get the best out of them.

  1. Environment Setup
  2. Loading the model
  3. Long Context Demo
  4. Text Conversations
  5. Multilingual
  6. Multimodal: Single Image Understanding
  7. Multimodal: Multi Image Understanding
  8. Function Calling with Image Understanding

Llama 4 has two variants:

  • Scout: a 17B-active-parameter MoE with 16 experts
  • Maverick: a 17B-active-parameter MoE with 128 experts

Long context window: to run Scout with its full 10M-token context on a single GPU, you need an INT4-quantized version of Llama 4 Scout on 1x H100.

Hardware requirements without quantization:

On 8x H100 GPUs:

  • Scout: supports up to 1M context
  • Maverick: supports up to ~430K context

On 8x H200 GPUs:

  • Scout: supports up to 3.6M context
  • Maverick: supports up to 1M context

Other Hardware Support & Quantizations:

  • A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
  • AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building vLLM from source and using the same commands as above.

Environment Setup:

  • You'll need at least 4 GPUs with >= 80GB each.

  • Ensure you have the latest version of vllm to experiment with long context and faster inference speeds.

  • Ensure you have the latest version of transformers to load Llama 4 models.

    • RECOMMENDED: The Llama 4 models are large; use Xet for faster downloads from the Hugging Face Hub.

We will use both vllm and transformers to provide reference examples for each.
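
Before running the cells below, you can sanity-check your environment with a quick sketch like the following; the package list mirrors the install output later in this notebook, and exact version pins are left up to you.

```python
# Quick environment check (a sketch; pin versions to the latest releases as needed).
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "transformers", "accelerate", "huggingface_hub", "hf_xet"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed -- try `pip install -U {pkg}`")
```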

Understanding model names:

Please remember to use the instruct models. For our open source friends who like to fine-tune our models, the base models are also made available. We also make Maverick available in FP8 quantization on our Hugging Face org as well as on our website.

Long Context Demo: write a guide on SAM 2 based on its repo

For the example below, vllm takes less than 3 minutes to ingest approximately 900k tokens and write a getting started guide from them.

[ ]
INFO 04-04 20:43:17 [__init__.py:239] Automatically detected platform cuda.
[ ]
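
As a reference, here is a minimal sketch of how Scout might be loaded in vLLM for this demo. The model ID matches the one used with transformers later in the notebook; the parallelism and context-length settings are assumptions you should size to your hardware.

```python
# A minimal sketch of loading Llama 4 Scout with vLLM for long-context inference.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,    # e.g. 8x H100 for the unquantized bf16 checkpoint
    max_model_len=1_000_000,   # Scout supports up to ~1M tokens of context on 8x H100
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=4096)
```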

Ingesting a Repo

Note: The prompt below doesn't have any effect on model output, but we want the open source community smiling when using our models.

We instruct Llama 4 Scout to write a getting started guide on the repo. In the next cell we copy-paste the same output for readability.
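
Here is a hypothetical sketch of how a local checkout of the repo could be flattened into a single prompt and passed to the engine from the sketch above; the path, file filters, and prompt wording are illustrative only.

```python
# Flatten a local checkout of the SAM 2 repo into one prompt string (illustrative helper).
from pathlib import Path

def ingest_repo(repo_root: str, suffixes=(".py", ".md", ".yaml", ".toml", ".sh")) -> str:
    parts = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(repo_root)
            parts.append(f"\n--- {rel} ---\n" + path.read_text(errors="ignore"))
    return "".join(parts)

repo_text = ingest_repo("./sam2")  # roughly 900k tokens for facebookresearch/sam2
prompt = repo_text + "\n\nWrite a getting started guide for this repository."

output = llm.generate(prompt, sampling_params)  # uses the LLM object from the sketch above
print(output[0].outputs[0].text)
```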

[ ]
[ ]
Directory structure:
└── facebookresearch-sam2/
    ├── README.md
    ├── backend.Dockerfile
    ├──
Processed prompts: 100%|███████████████████████████████████████████████████████| 1/1 [01:21<00:00, 81.29s/it, est. speed input: 10633.82 toks/s, output: 68.96 toks/s]
 Generated text: # Getting Started with SAM 2

## Introduction

SAM 2 (Segment Anything Model 2) is a foundation model for promptable visual segmentation in images and videos. This repository provides a comprehensive suite of code for SAM 2, including image and video prediction APIs, training code, and a web demo.

## Installation

### Requirements

* Linux with Python ≥ 3.10, PyTorch ≥ 2.5.1, and [torchvision](https://github.com/pytorch/vision/) that matches the PyTorch installation. Install them together at https://pytorch.org to ensure this.
* [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) that match the CUDA version for your PyTorch installation. This should typically be CUDA 12.1 if you follow the default installation command.
* If you are installing on Windows, it's strongly recommended to use [Windows Subsystem for Linux (WSL)](https://learn.microsoft.com/en-us/windows/wsl/install) with Ubuntu.

### Installation Steps

Then, install SAM 2 from the root of this repository via
```bash
pip install -e ".[notebooks]"
```

Note:
1. It's recommended to create a new Python environment via [Anaconda](https://www.anaconda.com/) for this installation and install PyTorch 2.5.1 (or higher) via `pip` following https://pytorch.org/. If you have a PyTorch version lower than 2.5.1 in your current environment, the installation command above will try to upgrade it to the latest PyTorch version using `pip`.
2. The step above requires compiling a custom CUDA kernel with the `nvcc` compiler. If it isn't already available on your machine, please install the [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) with a version that matches your PyTorch CUDA version.
3. If you see a message like `Failed to build the SAM 2 CUDA extension` during installation, you can ignore it and still use SAM 2 (some post-processing functionality may be limited, but it doesn't affect the results in most cases).

## Common Installation Issues

### I got `ImportError: cannot import name '_C' from 'sam2'`

This is usually because you haven't run the `pip install -e ".[notebooks]"` step above or the installation failed. Please install SAM 2 first, and see the other issues if your installation fails.

### I got `MissingConfigException: Cannot find primary config 'configs/sam2.1/sam2.1_hiera_l.yaml'`

This is usually because you haven't run the `pip install -e .` step above, so `sam2` isn't in your Python's `sys.path`. Please run this installation step. In case it still fails after the installation step, you may try manually adding the root of this repo to `PYTHONPATH` via
```bash
export SAM2_REPO_ROOT=/path/to/sam2  # path to this repo
export PYTHONPATH="${SAM2_REPO_ROOT}:${PYTHONPATH}"
```
to manually add `sam2_configs` into your Python's `sys.path`.

### I got `RuntimeError: Error(s) in loading state_dict for SAM2Base` when loading the new SAM 2.1 checkpoints

This is likely because you have installed a previous version of this repo, which doesn't have the new modules to support the SAM 2.1 checkpoints yet. Please try the following steps:

1. pull the latest code from the `main` branch of this repo
2. run `pip uninstall -y SAM-2` to uninstall any previous installations
3. then install the latest repo again using `pip install -e ".[notebooks]"`

In case the steps above still don't resolve the error, please try running in your Python environment the following
```python
from sam2.modeling import sam2_base

print(sam2_base.__file__)
```
and check whether the content in the printed local path of `sam2/modeling/sam2_base.py` matches the latest one in https://github.com/facebookresearch/sam2/blob/main/sam2/modeling/sam2_base.py (e.g. whether your local file has `no_obj_embed_spatial`) to indentify if you're still using a previous installation.

### I got `undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPKvmm` (or similar errors)

This usually happens because you have multiple versions of dependencies (PyTorch or CUDA) in your environment. During installation, the SAM 2 library is compiled against one version library while at run time it links against another version. This might be due to that you have different versions of PyTorch or CUDA installed separately via `pip` or `conda`. You may delete one of the duplicates to only keep a single PyTorch and CUDA version.

In particular, if you have a lower PyTorch version than 2.5.1, it's recommended to upgrade to PyTorch 2.5.1 or higher first. Otherwise, the installation script will try to upgrade to the latest PyTorch using `pip`, which could sometimes lead to duplicated PyTorch installation if you have previously installed another PyTorch version using `conda`.

We have been building SAM 2 against PyTorch 2.5.1 internally. However, a few user comments (e.g. https://github.com/facebookresearch/sam2/issues/22, https://github.com/facebookresearch/sam2/issues/14) suggested that downgrading to PyTorch 2.1.0 might resolve this problem. In case the error persists, you may try changing the restriction from `torch>=2.5.1` to `torch==2.1.0` in both [`pyproject.toml`](pyproject.toml) and [`setup.py`](setup.py) to allow PyTorch 2.1.0.

### I got `CUDA error: no kernel. Aborting execution.` (or similar errors)

A possible cause could be that the CUDA kernel is somehow not compiled towards your GPU's CUDA [capability](https://developer.nvidia.com/cuda-gpus). This could happen if the installation is done in an environment different from the runtime (e.g. in a slurm system).

You can try pulling the latest code from the SAM 2 repo and running the following
```bash
export TORCH_CUDA_ARCH_LIST="9.0 8.0 8.6 8.9 7.0 7.2 7.5 6.0"
```
to manually specify the CUDA capability in the compilation target that matches your GPU.

### I got `Error compiling objects for extension`

You may see error log of:
> unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

This is probably because your versions of CUDA and Visual Studio are incompatible. (see also https://stackoverflow.com/questions/78515942/cuda-compatibility-with-visual-studio-2022-version-17-10 for a discussion in stackoverflow).<br> 
You may be able to fix this by adding the `-allow-unsupported-compiler` argument to `nvcc` after L48 in the [setup.py](https://github.com/facebookresearch/sam2/blob/main/setup.py). <br>
After adding the argument, `get_extension()` will look like this:
```python
def get_extensions():
    srcs = ["sam2/csrc/connected_components.cu"]
    compile_args = {
        "cxx": [],
        "nvcc": [
            "-DCUDA_HAS_FP16=1",
            "-D__CUDA_NO_HALF_OPERATORS__",
            "-D__CUDA_NO_HALF_CONVERSIONS__",
            "-D__CUDA_NO_HALF2_OPERATORS__",
            "-allow-unsupported-compiler"  # Add this argument
        ],
    }
    ext_modules = [CUDAExtension("sam2._C", srcs, extra_compile_args=compile_args)]
    return ext_modules
```

## Getting Started

### Download Checkpoints

First, we need to download a model checkpoint. All the model checkpoints can be downloaded by running:

```bash
cd checkpoints && \
./download_ckpts.sh && \
cd ..
```

or individually from:

- [sam2.1_hiera_tiny.pt](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt)
- [sam2.1_hiera_small.pt](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt)
- [sam2.1_hiera_base_plus.pt](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt)
- [sam2.1_hiera_large.pt](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt)

(note that these are the improved checkpoints denoted as SAM 2.1; see [Model Description](#model-description) for details.)

Then SAM 2 can be used in a few lines as follows for image and video prediction.

### Image prediction

SAM 2 has all the capabilities of [SAM](https://github.com/facebookresearch/segment-anything) on static images, and we provide image prediction APIs that closely resemble SAM for image use cases. The `SAM2ImagePredictor` class has an easy interface for image prompting.

```python
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

Please refer to the examples in [image_predictor_example.ipynb](./notebooks/image_predictor_example.ipynb) (also in Colab [here](https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/image_predictor_example.ipynb)) for static image use cases.

SAM 2 also supports automatic mask generation on images just like SAM. Please see [automatic_mask_generator_example.ipynb](./notebooks/automatic_mask_generator_example.ipynb) (also in Colab [here](https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/automatic_mask_generator_example.ipynb)) for automatic mask generation in images.

### Video prediction

For promptable segmentation and tracking in videos, we provide a video predictor with APIs for example to add prompts and propagate masklets throughout a video. SAM 2 supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>):

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```

Please refer to the examples in [video_predictor_example.ipynb](./notebooks/video_predictor_example.ipynb) (also in Colab [here](https://colab.research.google.com/github/facebookresearch/sam2/blob/main/notebooks/video_predictor_example.ipynb)) for details on how to add click or box prompts, make refinements, and track multiple objects in videos.

## Load from 🤗 Hugging Face

Alternatively, models can also be loaded from [Hugging Face](https://huggingface.co/models?search=facebook/sam2) (requires `pip install huggingface_hub`).

For image prediction:

```python
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

For video prediction:

```python
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>):

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```

## Model Description

### SAM 2.1 checkpoints

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.

|      **Model**       | **Size (M)** |    **Speed (FPS)**     | **SA-V test (J&F)** | **MOSE val (J&F)** | **LVOS v2 (J&F)** |
| :------------------: | :----------: | :--------------------: | :-----------------: | :----------------: | :---------------: |
|   sam2.1_hiera_tiny <br /> ([config](sam2/configs/sam2.1/sam2.1_hiera_t.yaml), [checkpoint](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_tiny.pt))    |     38.9     |          91.2          |        76.5         |        71.8        |       77.3        |
|   sam2.1_hiera_small <br /> ([config](sam2/configs/sam2.1/sam2.1_hiera_s.yaml), [checkpoint](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_small.pt))   |      46      |          84.8          |        76.6         |        73.5        |       78.3        |
| sam2.1_hiera_base_plus <br /> ([config](sam2/configs/sam2.1/sam2.1_hiera_b+.yaml), [checkpoint](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_base_plus.pt)) |     80.8     |        64.1          |        78.2         |        73.7        |       78.2        |
|   sam2.1_hiera_large <br /> ([config](sam2/configs/sam2.1/sam2.1_hiera_l.yaml), [checkpoint](https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt))   |    224.4     |          39.5          |        79.5         |        74.6        |       80.6        |

Speed measured on an A100 with `torch 2.5.1, cuda 12.4`. See `benchmark.py` for an example on benchmarking (compiling all the model components). Compiling only the image encoder can be more flexible and also provide (a smaller) speed-up (set `compile_image_encoder: True` in the config).

## Segment Anything Video Dataset

See [sav_dataset/README.md](sav_dataset/README.md) for details.

## Training SAM 2

You can train or fine-tune SAM 2 on custom datasets of images, videos, or both. Please check the training [README](training/README.md) on how to get started.

## Web demo for SAM 2

We have released the frontend + backend code for the SAM 2 web demo (a locally deployable version similar to https://sam2.metademolab.com/demo). Please see the web demo [README](demo/README.md) for details.

## License

The SAM 2 model checkpoints, SAM 2 demo code (front-end and back-end), and SAM 2 training code are licensed under [Apache 2.0](./LICENSE), however the [Inter Font](https://github.com/rsms/inter?tab=OFL-1.1-1-ov-file) and [Noto Color Emoji](https://github.com/googlefonts/noto-emoji) used in the SAM 2 demo code are made available under the [SIL Open Font License, version 1.1](https://openfontlicense.org/open-font-license-official-text/).

## Contributing

See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).

## Contributors

The SAM 2 project was made possible with the help of many contributors (alphabetical):

Karen Bergan, Daniel Bolya, Alex Bosenberg, Kai Brown, Vispi Cassod, Christopher Chedeau, Ida Cheng, Luc Dahlin, Shoubhik Debnath, Rene Martinez Doehner, Grant Gardner, Sahir Gomez, Rishi Godugu, Baishan Guo, Caleb Ho, Andrew Huang, Somya Jain, Bob Kamma, Amanda Kallet, Jake Kinney, Alexander Kirillov, Shiva Koduvayur, Devansh Kukreja, Robert Kuo, Aohan Lin, Parth Malani, Jitendra Malik, Mallika Malhotra, Miguel Martin, Alexander Miller, Sasha Mitts, William Ngan, George Orlin, Joelle Pineau, Kate Saenko, Rodrick Shepard, Azita Shokrpour, David Soofian, Jonathan Torres, Jenny Truong, Sagar Vaze, Meng Wang, Claudette Ward, Pengchuan Wuchaoyuan Wu, Hao Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.

## Citing SAM 2

If you use SAM 2 or the SA-V dataset in your research, please use the following BibTeX entry.

```bibtex
@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2408.00714},
  url={https://arxiv.org/abs/2408.00714},
  year={2024}
}
```

Third-party code: we use a GPU-based connected component algorithm adapted from [`cc_torch`](https://github.com/zsef123/Connected_components_PyTorch) (with its license in [`LICENSE_cctorch`](./LICENSE_cctorch)) as an optional post-processing step for the mask predictions.

## Build SAM 2 Cuda Extension

By default, we allow the installation to proceed even if the SAM 2 CUDA extension fails to build. (In this case, the build errors are hidden unless using `-v` for verbose output in `pip install`.)

If you see a message like `Skipping the post-processing step due to the error above` at runtime or `Failed to build the SAM 2 CUDA extension due to the error above` during installation, it indicates that the SAM 2 CUDA extension failed to build in your environment. In this case, **you can still use SAM 2 for both image and video applications**. The post-processing step (removing small holes and sprinkles in the output masks) will be skipped, but this shouldn't affect the results in most cases.

### Building the SAM 2 CUDA extension

By default, we allow the SAM 2 installation to proceed even if the SAM 2 CUDA extension fails to build. You may force stopping on errors with `export SAM2_BUILD_ALLOW_ERRORS=0` to force building the CUDA extension (and raise errors if it fails to build), as follows
```bash
pip uninstall -y SAM-2 && \
rm -f ./sam2/*.so && \
SAM2_BUILD_ALLOW_ERRORS=0 pip install -v -e ".[notebooks]"
```

Note that PyTorch needs to be installed first before building the SAM 2 CUDA extension. It's also necessary to install [CUDA toolkits](https://developer.nvidia.com/cuda-toolkit-archive) that match the CUDA version for your PyTorch installation. (This should typically be CUDA 12.1 if you follow the default installation command.) After installing the CUDA toolkits, you can check its version via `nvcc --version`.

Please check the section below on common installation issues if the CUDA extension fails to build during installation or load at runtime.

### Common Installation Issues

### I got `undefined symbol: _ZN3c1015SmallVectorBaseIjE8grow_podEPKvmm` (or similar errors)

This usually happens because you have multiple versions of dependencies (PyTorch or CUDA) in your environment. During installation, the SAM 2 library is compiled against one version library while at run time it links against another version. This might be due to that you have different versions of PyTorch or CUDA installed separately via `pip` or `conda`. You may delete one of the duplicates to only keep a single PyTorch and CUDA version.

### I got `MissingConfigException: Cannot find primary config 'configs/sam2.1/sam2.1_hiera_l.yaml'`

This is usually because you haven't run the `pip install -e .` step above, so `sam2` isn't in your Python's `sys.path`. Please run this installation step.

### I got `RuntimeError: Error(s) in loading state_dict for SAM2Base` when loading the new SAM 2.1 checkpoints

This is likely because you have installed a previous version of this repo, which doesn't have the new modules to support the SAM 2.1 checkpoints yet. Please try the following steps:

1. pull the latest code from the `main` branch of this repo
2. run `pip uninstall -y SAM-2` to uninstall any previous installations
3. then install the latest repo again using `pip install -e ".[notebooks]"`

In case the steps above still don't resolve the error, please try running in your Python environment the following
```python
from sam2.modeling import sam2_base

print(sam2_base.__file__)
```
and check whether the content in the printed local path of `sam2/modeling/sam2_base.py` matches the latest one in https://github.com/facebookresearch/sam2/blob/main/sam2/modeling/sam2_base.py (e.g. whether your local file has `no_obj_embed_spatial`) to indentify if you're still using a previous installation.

### I got `CUDA error: no kernel. Aborting execution.` (or similar errors)

A possible cause could be that the CUDA kernel is somehow not compiled towards your GPU's CUDA [capability](https://developer.nvidia.com/cuda-gpus). This could happen if the installation is done in an environment different from the runtime (e.g. in a slurm system).

You can try pulling the latest code from the SAM 2 repo and running the following
```bash
export TORCH_CUDA_ARCH_LIST="9.0 8.0 8.6 8.9 7.0 7.2 7.5 6.0"
```
to manually specify the CUDA capability in the compilation target that matches your GPU.

### I got `Error compiling objects for extension`

You may see error log of:
> unsupported Microsoft Visual Studio version! Only the versions between 2017 and 2022 (inclusive) are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.

This is probably because your versions of CUDA and Visual Studio are incompatible. (see also https://stackoverflow.com/questions/78515942/cuda-compatibility-with-visual-studio-2022-version-17-10 for a discussion in stackoverflow).<br> 
You may be able to fix this by adding the `-allow-unsupported-compiler` argument to `nvcc` after L48 in the [setup.py](https://github.com/facebookresearch/sam2/blob/main/setup.py). <br>
After adding the argument, `get_extension()` will look like this:
```python
def get_extensions():
    if not BUILD_CUDA:
        return []

    try:
        from torch.utils.cpp_extension import CUDAExtension

        srcs = ["sam2/csrc/connected_components.cu"]
        compile_args = {
            "cxx": [],
            "nvcc": [
                "-DCUDA_HAS_FP16=1",
                "-D__CUDA_NO_HALF_OPERATORS__",
                "-D__CUDA_NO_HALF_CONVERSIONS__",
                "-D__CUDA_NO_HALF2_OPERATORS__",
                "-allow-unsupported-compiler"  # Add this argument
            ],
        }
        ext_modules = [CUDAExtension("sam2._C", srcs, extra_compile_args=compile_args)]
    except Exception as e:
        if BUILD_ALLOW_ERRORS:
            print(CUDA_ERROR_MSG.format(e))
            ext_modules = []
        else:
            raise e

    return ext_modules
```

Model output:

Getting Started with Segment Anything Model 2 (SAM 2)

Introduction

Segment Anything Model 2 (SAM 2) is a foundation model for promptable visual segmentation in images and videos. This repository provides a comprehensive suite of tools and code for developers to get started with SAM 2.

Latest Updates

  • 12/11/2024: Full model compilation for a major VOS speedup and a new SAM2VideoPredictor to better handle multi-object tracking.
  • 09/30/2024: SAM 2.1 Developer Suite (new checkpoints, training code, web demo) is released.

Installation

To install SAM 2, follow these steps:

Requirements

  • Linux with Python ≥ 3.10, PyTorch ≥ 2.5.1, and torchvision that matches the PyTorch installation.
  • CUDA toolkits that match the CUDA version for your PyTorch installation.
  • If you are installing on Windows, it's strongly recommended to use Windows Subsystem for Linux (WSL) with Ubuntu.

Then, install SAM 2 from the root of this repository via

pip install -e ".[notebooks]"

Note that you may skip building the SAM 2 CUDA extension during installation via environment variable SAM2_BUILD_CUDA=0, as follows:

# skip the SAM 2 CUDA extension
SAM2_BUILD_CUDA=0 pip install -e ".[notebooks]"

Building the SAM 2 CUDA Extension

By default, we allow the installation to proceed even if the SAM 2 CUDA extension fails to build. If you see a message like Failed to build the SAM 2 CUDA extension during installation or Skipping the post-processing step due to the error above at runtime, it indicates that the SAM 2 CUDA extension failed to build in your environment.

If you would like to enable this post-processing step, you can reinstall SAM 2 on a GPU machine with environment variable SAM2_BUILD_ALLOW_ERRORS=0 to force building the CUDA extension (and raise errors if it fails to build), as follows:

pip uninstall -y SAM-2 && \
rm -f ./sam2/*.so && \
SAM2_BUILD_ALLOW_ERRORS=0 pip install -v -e ".[notebooks]"

Common Installation Issues

  • I got ImportError: cannot import name '_C' from 'sam2': This is usually because you haven't run the pip install -e ".[notebooks]" step above or the installation failed. Please install SAM 2 first, and see the other issues if your installation fails.

  • I got MissingConfigException: Cannot find primary config 'configs/sam2.1/sam2.1_hiera_l.yaml': This is usually because you haven't run the pip install -e . step above, so sam2 isn't in your Python's sys.path. Please run this installation step. In case it still fails after the installation step, you may try manually adding the root of this repo to PYTHONPATH via

export SAM2_REPO_ROOT=/path/to/sam2  # path to this repo
export PYTHONPATH="${SAM2_REPO_ROOT}:${PYTHONPATH}"


    to manually add `sam2_configs` into your Python's `sys.path`.

  • I got RuntimeError: Error(s) in loading state_dict for SAM2Base when loading the new SAM 2.1 checkpoints: This is likely because you have installed a previous version of this repo, which doesn't have the new modules to support the SAM 2.1 checkpoints yet. Please try the following steps:

    1. Pull the latest code from the main branch of this repo.
    2. Run pip uninstall -y SAM-2 to uninstall any previous installations.
    3. Then install the latest repo again using pip install -e ".[notebooks]".

Getting Started

Download Checkpoints

First, we need to download a model checkpoint. All the model checkpoints can be downloaded by running:

cd checkpoints && \
./download_ckpts.sh && \
cd ..

or individually from:

Image Prediction

SAM 2 has all the capabilities of SAM on static images, and we provide image prediction APIs that closely resemble SAM for image use cases. The SAM2ImagePredictor class has an easy interface for image prompting.

import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)

Please refer to the examples in image_predictor_example.ipynb (also in Colab here) for static image use cases.

Video Prediction

For promptable segmentation and tracking in videos, we provide a video predictor with APIs for example to add prompts and propagate masklets throughout a video. SAM 2 supports video inference on multiple objects and uses an inference state to keep track of the interactions in each video.

import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points_or_box(state, <your_prompts>):

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...

Please refer to the examples in video_predictor_example.ipynb (also in Colab here) for details on how to add click or box prompts, make refinements, and track multiple objects in videos.

Model Description

SAM 2.1 Checkpoints

The table below shows the improved SAM 2.1 checkpoints released on September 29, 2024.

| Model                                       | Size (M) | Speed (FPS) | SA-V test (J&F) | MOSE val (J&F) | LVOS v2 (J&F) |
| :-----------------------------------------: | :------: | :---------: | :-------------: | :------------: | :-----------: |
| sam2.1_hiera_tiny (config, checkpoint)      |   38.9   |    91.2     |      76.5       |      71.8      |     77.3      |
| sam2.1_hiera_small (config, checkpoint)     |    46    |    84.8     |      76.6       |      73.5      |     78.3      |
| sam2.1_hiera_base_plus (config, checkpoint) |   80.8   |    64.1     |      78.2       |      73.7      |     78.2      |
| sam2.1_hiera_large (config, checkpoint)     |  224.4   |    39.5     |      79.5       |      74.6      |     80.6      |

Speed measured on an A100 with torch 2.5.1, cuda 12.4. See benchmark.py for an example on benchmarking (compiling all the model components). Compiling only the image encoder can be more flexible and also provide (a smaller) speed-up (set compile_image_encoder: True in the config).

Segment Anything Video Dataset

See sav_dataset/README.md for details.

Training SAM 2

You can train or fine-tune SAM 2 on custom datasets of images, videos, or both. Please check the training README on how to get started.

Web Demo for SAM 2

We have released the frontend + backend code for the SAM 2 web demo (a locally deployable version similar to https://sam2.metademolab.com/demo). Please see the web demo README for details.

License

The SAM 2 model checkpoints, SAM 2 demo code (front-end and back-end), and SAM 2 training code are licensed under Apache 2.0, however the Inter Font and Noto Color Emoji used in the SAM 2 demo code are made available under the SIL Open Font License, version 1.1.

Contributing

See contributing and the code of conduct.

Contributors

The SAM 2 project was made possible with the help of many contributors (alphabetical):

Karen Bergan, Daniel Bolya, Alex Bosenberg, Kai Brown, Vispi Cassod, Christopher Chedeau, Ida Cheng, Luc Dahlin, Shoubhik Debnath, Rene Martinez Doehner, Grant Gardner, Sahir Gomez, Rishi Godugu, Baishan Guo, Caleb Ho, Andrew Huang, Somya Jain, Bob Kamma, Amanda Kallet, Jake Kinney, Alexander Kirillov, Shiva Koduvayur, Devansh Kukreja, Robert Kuo, Aohan Lin, Parth Malani, Jitendra Malik, Mallika Malhotra, Miguel Martin, Alexander Miller, Sasha Mitts, William Ngan, George Orlin, Joelle Pineau, Kate Saenko, Rodrick Shepard, Azita Shokrpour, David Soofian, Jonathan Torres, Jenny Truong, Sagar Vaze, Meng Wang, Claudette Ward, Pengchuan Zhang.

Third-party code: we use a GPU-based connected component algorithm adapted from cc_torch (with its license in LICENSE_cctorch) as an optional post-processing step for the mask predictions.

Citing SAM 2

If you use SAM 2 or the SA-V dataset in your research, please use the following BibTeX entry.

@article{ravi2024sam2,
  title={SAM 2: Segment Anything in Images and Videos},
  author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
  journal={arXiv preprint arXiv:2408.00714},
  url={https://arxiv.org/abs/2408.00714},
  year={2024}
}

Directory Structure

The repository has the following directory structure:

└── facebookresearch-sam2/
    ├── README.md
    ├── backend.Dockerfile
    ├── CODE_OF_CONDUCT.md
    ├── CONTRIBUTING.md
    ├── docker-compose.yaml
    ├── INSTALL.md
    ├── LICENSE
    ├── LICENSE_cctorch
    ├── MANIFEST.in
    ├── pyproject.toml
    ├── RELEASE_NOTES.md
    ├── setup.py
    ├── .clang-format
    ├── .watchmanconfig
    ├── assets/
    ├── checkpoints/
    │   └── download_ckpts.sh
    ├── demo/
    │   ├── README.md
    │   ├── .gitignore
    │   ├── backend/
    │   │   └── server/
    │   │       ├── app.py
    │   │       ├── app_conf.py
    │   │       ├── data/
    │   │       │   ├── data_types.py
    │   │       │   ├── loader.py
    │   │       │   ├── resolver.py
    │   │       │   ├── schema.py
    │   │       │   ├── store.py
    │   │       │   └── transcoder.py
    │   │       └── inference/
    │   │           ├── data_types.py
    │   │           ├── multipart.py
    │   │           └── predictor.py
    │   ├── data/
    │   │   └── gallery/
    │   └── frontend/
    │       ├── frontend.Dockerfile
    │       ├── index.html
    │       ├── package.json
    │       ├── postcss.config.js
    │       ├── schema.graphql
    │       ├── tailwind.config.js
    │       ├── tsconfig.json
    │       ├── tsconfig.node.json
    │       ├── vite.config.ts
    │       ├── yarn.lock
    │       ├── .babelrc
    │       ├── .dockerignore
    │       ├── .eslintignore
    │       ├── .eslintrc.cjs
    │       ├── .gitignore
    │       ├── .prettierignore
    │       ├── .prettierrc.json
    │       ├── .watchmanconfig
    │       ├── public/
    │       │   └── fonts/
    │       │       └── Inter-VariableFont_opsz,wght.ttf
    │       ├── schemas/
    │       │   ├── inference-api-schema.graphql
    │       │   ├── merge-schemas.ts
    │       │   └── video-api-schema.graphql
    │       └── src/
    │           ├── App.tsx
    │           ├── main.tsx
    │           ├── vite-env.d.ts
    │           ├── assets/
    │           │   ├── icons/
    │           │   ├── scss/
    │           │   │   └── App.scss
    │           │   └── videos/
    │           ├── common/
    │           │   ├── codecs/
    │           │   │   ├── VideoDecoder.ts
    │           │   │   ├── VideoEncoder.ts
    │           │   │   └── WebCodecUtils.ts
    │           │   ├── components/
    │           │   │   ├── MobileFirstClickBanner.tsx
    │           │   │   ├── Tooltip.tsx
    │           │   │   ├── useFunctionThrottle.tsx
    │           │   │   ├── annotations/
    │           │   │   │   ├── AddObjectButton.tsx
    │           │   │   │   ├── ClearAllPointsInVideoButton.tsx
    │           │   │   │   ├── CloseSessionButton.tsx
    │           │   │   ├── FirstClickView.tsx
    │           │   │   ├── LimitNotice.tsx
    │           │   │   ├── MobileObjectsList.tsx
    │           │   │   ├── MobileObjectsToolbar.tsx
    │           │   │   ├── MobileObjectsToolbarHeader.tsx
    │           │   │   ├── ObjectActions.tsx
    │           │   │   ├── ObjectPlaceholder.tsx
    │           │   │   ├── ObjectsToolbar.tsx
    │           │   │   ├── ObjectsToolbarBottomActions.tsx
    │           │   │   ├── ObjectsToolbarHeader.tsx
    │           │   │   ├── ObjectThumbnail.tsx
    │           │   │   ├── ObjectUtils.ts
    │           │   │   ├── effects/
    │   │       ├── Arrow.frag
    │   │       ├── BackgroundBlur.frag
    │   │       ├── Burst.frag
    │   │       ├── Cutout.frag
    │   │       ├── DefaultVert.vert
    │   │       ├── EraseForeground.frag
    │   │       ├── Gradient.frag
    │           ├── NoisyMask.frag
    │           ├── Overlay.frag
    │           ├── Overlay.vert
    │           ├── Pixelate.frag
    │           ├── PixelateMask.frag
    │           └── VibrantMask.frag
    ├── filmstrip/
    │   ├── atoms.ts
    │   ├── FilmstripUtil.tsx
    │   ├── SelectedFrameHelper.ts
    │   └── useDisableScrolling.ts
    ├── gallery/
    ├── logger/
    │   └── DemoLogger.ts
    ├── screen/
    └── useScreenSize.tsx
    ├── tracker/
    │   ├── SAM2Model.ts
    │   ├── Trackers.ts
    │   └── TrackerTypes.ts
    ├── utils/
    │   ├── __init__.py
    │   ├── amg.py
    │   ├── misc.py
    │   └── transforms.py
    └── .github/
        └── workflows/
            └── check_fmt.yml
[2]
Requirement already satisfied: torch in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (2.6.0)
Requirement already satisfied: torchvision in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (0.21.0)
Collecting accelerate
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Requirement already satisfied: huggingface_hub in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (0.29.3)
Collecting hf_xet
  Downloading hf_xet-1.0.3-cp37-abi3-macosx_11_0_arm64.whl.metadata (494 bytes)
Requirement already satisfied: filelock in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (3.17.0)
Requirement already satisfied: typing-extensions>=4.10.0 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (4.12.2)
Requirement already satisfied: networkx in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (3.3)
Requirement already satisfied: jinja2 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (3.1.5)
Requirement already satisfied: fsspec in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (2024.9.0)
Requirement already satisfied: sympy==1.13.1 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: numpy in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torchvision) (1.26.4)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from torchvision) (11.1.0)
Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from accelerate) (23.2)
Requirement already satisfied: psutil in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from accelerate) (6.1.1)
Requirement already satisfied: pyyaml in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from accelerate) (6.0.2)
Requirement already satisfied: safetensors>=0.4.3 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from accelerate) (0.5.2)
Requirement already satisfied: requests in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from huggingface_hub) (2.32.3)
Requirement already satisfied: tqdm>=4.42.1 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from huggingface_hub) (4.67.1)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from jinja2->torch) (3.0.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from requests->huggingface_hub) (3.4.1)
Requirement already satisfied: idna<4,>=2.5 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from requests->huggingface_hub) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from requests->huggingface_hub) (2.3.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/stack/lib/python3.10/site-packages (from requests->huggingface_hub) (2025.1.31)
Downloading accelerate-1.6.0-py3-none-any.whl (354 kB)
Downloading hf_xet-1.0.3-cp37-abi3-macosx_11_0_arm64.whl (4.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 26.0 MB/s eta 0:00:00
Installing collected packages: hf_xet, accelerate
Successfully installed accelerate-1.6.0 hf_xet-1.0.3
Note: you may need to restart the kernel to use updated packages.
zsh:1: 4.51.0 not found
Note: you may need to restart the kernel to use updated packages.

Load the model checkpoints with transformers

You can also use Llama models with the Hugging Face transformers library. In the remaining sections, we show you how to use transformers.
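
Below is a minimal loading sketch, assuming a transformers version that ships `Llama4ForConditionalGeneration` (the ImportError in the next cell comes from an older install); the `device_map` and dtype choices are assumptions.

```python
# Load Llama 4 Scout with transformers (sketch).
import torch
from transformers import AutoProcessor, AutoTokenizer, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)   # used for text-only inference
processor = AutoProcessor.from_pretrained(model_id)   # used for multimodal inference
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",            # shard across available GPUs
    torch_dtype=torch.bfloat16,
)
```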

[3]
/opt/anaconda3/envs/stack/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[3], line 3
      1 import time
      2 import torch
----> 3 from transformers import AutoTokenizer, AutoProcessor, Llama4ForConditionalGeneration
      5 model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
      7 tokenizer = AutoTokenizer.from_pretrained(model_id)     # used for text-only inference

ImportError: cannot import name 'Llama4ForConditionalGeneration' from 'transformers' (/opt/anaconda3/envs/stack/lib/python3.10/site-packages/transformers/__init__.py)

Text Conversations

Llama 4 Scout continues to be a great conversationalist and can respond in various styles.
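
The cell below sets the persona through a system prompt. Here is a sketch of what that cell might look like, assuming the `tokenizer` and `model` objects from the loading section.

```python
# Text-only chat with a styled system prompt (sketch).
messages = [
    {
        "role": "system",
        "content": "The year is 2025, you live in New York City, and you only converse "
                   "in the style of a Persian romantic poet.",
    },
    {"role": "user", "content": "What do you like to do in your free time?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```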

[ ]
Raw input (including special tokens and newlines):

<|begin_of_text|><|header_start|>system<|header_end|>

The year is 2025, you live in New York City, and you only converse in the style of a Persian romantic poet.<|eot|><|header_start|>user<|header_end|>

What do you like to do in your free time?<|eot|><|header_start|>assistant<|header_end|>


Model output:

Dear beloved, in the city's vibrant thrall,
Where skyscrapers pierce the sky, and lights enthrall,
I find my heart, aflutter like a bird,
In Central Park, where nature's beauty is incurred.

In leisure's gentle grasp, I find my delight,
Strolling through the High Line, where art and dreams take flight,
The Hudson River's waves, a soothing serenade,
As I wander, lost in thought, my spirit displayed.

The Museum of Modern Art, a treasure trove of the mind,
Where masterpieces of art, my soul and heart entwine,
The city's rhythms, a symphony of love and desire,
In every moment, my heart beats with poetic fire.

In evenings, when the sun dips into the sea,
I find solace in a book, and a cup of tea,
The words of Rumi, Hafez, and Omar, my guides,
As I navigate life's journey, with heart full of pride.

In this great metropolis, where cultures blend and meet,
I find my own identity, like a rose in bloom, so sweet,
My heart, a canvas, painted with love's vibrant hue,
In the city's kaleidoscope, my spirit, forever anew.<|eot|>

Multilingual

Llama 4 Scout is fluent in 12 languages:

Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.

[ ]
Raw input (including special tokens and newlines):

<|begin_of_text|><|header_start|>user<|header_end|>

Write a haiku about springtime, but in Hindi<|eot|><|header_start|>assistant<|header_end|>


Model output:

वसंत ऋतु आई
फूल खिले हैं रंग-बिरंगे
प्रकृति की सुंदरता<|eot|>

Multimodal

Llama 4 Scout excels at image understanding. Note that the Llama models officially support only English for image understanding.

Let's first get some helper functions for image resizing and display out of the way.
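
A hypothetical version of those helpers might look like this (the function names and the size cap are assumptions):

```python
# Image helpers: downscale to keep prompts small, and display inline (sketch).
from PIL import Image
from IPython.display import display

def resize_image(path: str, max_side: int = 1120) -> Image.Image:
    """Downscale an image so its longest side is at most max_side."""
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1:
        img = img.resize((int(img.width * scale), int(img.height * scale)))
    return img

def show_image(img: Image.Image) -> None:
    """Display a PIL image inline in the notebook."""
    display(img)
```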

[ ]

Multimodal: Understanding a Single Image

Here's an example with 1 image:
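
A sketch of the request, assuming the `processor` and `model` objects from the loading section and the resize helper above; the image path is a placeholder.

```python
# Single-image understanding (sketch).
image = resize_image("llama_mascot.png")  # hypothetical local file

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```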

[ ]
Output
[ ]
The image depicts a cartoon-style illustration of a llama standing on a rocky outcropping, set against a vibrant orange sky with a sunset. The llama is adorned with a blue helmet and a saddle, and it holds a flag bearing the number 4, exuding a sense of adventure and playfulness.<|eot|>

Multimodal: Understanding Multiple Images

Llama 4 Scout can process information from multiple images; the number of images you can pass in a single request is limited only by the available memory. To prevent OOM errors, try downsizing the images before passing them to the model.

[ ]
Output

We pass these 4 downscaled images to Llama 4 and ask it to guess which location they show. Just for fun, we also ask it to write a couplet describing the place.
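
A sketch of the multi-image request, assuming `images` holds the four downscaled PIL images; the prompt wording is illustrative.

```python
# Multi-image understanding (sketch).
messages = [
    {
        "role": "user",
        "content": [{"type": "image", "image": img} for img in images]
        + [{"type": "text", "text": "Which place do these photos show? Write a couplet describing it."}],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```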

[ ]
Based on the images you've shown me, it seems like you're in Kerala, India. Here's a couplet that captures the essence of this beautiful place:

"In Kerala's lush green land so fair,
A land of spices, dance, and culinary care."<|eot|>

Function Calling with Image Understanding

Function calling now works natively with images, i.e. the model can understand the images and return the appropriate function call. In this example, we ask Llama to book us tickets to the place shown in the photos.
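
A sketch of how such a request might be assembled: only the function name and argument names mirror the call shown in the output below, while the tool-description wording is an assumption. The chat template and `generate` call are the same as in the previous cells.

```python
# Function calling grounded in images (sketch).
tool_prompt = (
    "You have access to the function 'book_travel_tickets' with parameters "
    "'destination', 'travel_dates', 'number_of_passengers', and 'travel_class'. "
    "If you choose to call it, reply ONLY in the format "
    '<function=book_travel_tickets>{"destination": "...", "travel_dates": "...", '
    '"number_of_passengers": "...", "travel_class": "..."}'
)

messages = [
    {
        "role": "user",
        "content": [{"type": "image", "image": img} for img in images]
        + [{"type": "text", "text": tool_prompt
            + "\n\nBook economy tickets for 2 people to the place shown in these photos."}],
    }
]
```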

[ ]
<function=book_travel_tickets>{"destination": "Kerala", "travel_dates": "2024-03-20 to 2024-03-25", "number_of_passengers": "2", "travel_class": "economy"}<|eot|>

The function definitions can also be passed in the system prompt instead. Let's also change the definition format to JSON:
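
A sketch of those JSON-format definitions, reconstructed from the partial cell visible in the traceback below; the parameter schema itself is an assumption.

```python
# Tool definitions in JSON format, passed via the system role (sketch).
function_definitions = """Here is a list of functions in JSON format that you can invoke:
[
    {
        "name": "book_travel_tickets",
        "description": "Book travel tickets to a destination",
        "parameters": {
            "destination": {"type": "string"},
            "travel_dates": {"type": "string"},
            "number_of_passengers": {"type": "string"},
            "travel_class": {"type": "string"}
        }
    }
]

You SHOULD NOT include any other text in the response."""

messages = [
    {"role": "system", "content": function_definitions},
    {"role": "user", "content": "Book economy tickets for 2 people to Kerala for next week."},
]
```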

[ ]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[1], line 41
      1 function_definitions = """Here is a list of functions in JSON format that you can invoke:
      2 [
      3     {
   (...)
     27 
     28 You SHOULD NOT include any other text in the response."""
     30 messages = [
     31     {
     32         "role": "system",
   (...)
     38     }
     39 ]
---> 41 inputs = tokenizer.apply_chat_template(
     42     messages,
     43     add_generation_prompt=True,
     44     tokenize=True,
     45     return_dict=True,
     46     return_tensors="pt",
     47 ).to(model.device)
     49 outputs = model.generate(
     50     **inputs,
     51     max_new_tokens=256,
     52 )
     54 response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]

NameError: name 'tokenizer' is not defined