Unsloth Gemma3N (2B) Inference



Installation

[1]

Unsloth

Launch an SGLang inference server for unsloth/gemma-3n-E2B-it (https://huggingface.co/unsloth/gemma-3n-E2B-it)

[2]

nohup: redirecting stderr to stdout
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2025-07-04 17:48:54] server_args=ServerArgs(model_path='unsloth/gemma-3n-E2B-it', tokenizer_path='unsloth/gemma-3n-E2B-it', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='unsloth/gemma-3n-E2B-it', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='127.0.0.1', port=8000, mem_fraction_static=0.75, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=877324878, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='fa3', sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, 
speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_moe=False, enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, 
enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False)
E0000 00:00:1751651339.903475    4493 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[2025-07-04 17:49:06] Inferred chat template from model path: gemma-it
[2025-07-04 17:49:14] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:02<00:01,  1.08s/it]
Capturing batches (bs=16 avail_mem=9.72 GB):  78%|███████▊  | 18/23 [00:13<00:03,  1.51it/s]
[2025-07-04 17:50:33] Capture cuda graph end. Time elapsed: 17.01 s. mem usage=2.45 GB. avail mem=9.62 GB.
[2025-07-04 17:50:40] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
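
The launch cell itself is not shown, so the following is only a minimal sketch of how a server with the settings in the log above (model path, host 127.0.0.1, port 8000) could be started and polled from Python. It uses `subprocess` rather than the `nohup` shell call the log hints at. The helper names, the log file name, and the use of the `/health` endpoint for readiness are assumptions, not the notebook's own code.

```python
import subprocess
import sys
import time
import urllib.error
import urllib.request

# Values copied from the server_args line in the log above.
MODEL = "unsloth/gemma-3n-E2B-it"
HOST, PORT = "127.0.0.1", 8000


def build_launch_command(model=MODEL, host=HOST, port=PORT):
    """Return the argv list for starting an SGLang server (hypothetical helper)."""
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model,
        "--host", host,
        "--port", str(port),
    ]


def start_server(log_path="sglang_server.log"):
    """Start the server in the background, sending stdout and stderr to one log file."""
    log = open(log_path, "w")  # left open for the lifetime of the child process
    return subprocess.Popen(build_launch_command(), stdout=log, stderr=subprocess.STDOUT)


def wait_until_ready(base_url, timeout_s=600.0, poll_s=2.0):
    """Poll base_url/health until it returns 200 or the timeout expires.

    Assumes the server exposes a /health endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(poll_s)
    return False
```

A typical use would be `server = start_server()` followed by `wait_until_ready(f"http://{HOST}:{PORT}")`. In the log above roughly two minutes pass between the first line and the Uvicorn line, so a generous timeout seems sensible.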

Image helper functions

[3]
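
The helper cell's code is not visible here. As a rough sketch of what such helpers might look like, the functions below download an image and wrap its bytes in a base64 data URL, which is one common way to pass images to an OpenAI-compatible endpoint. The function names `fetch_bytes` and `to_data_url` are hypothetical.

```python
import base64
import mimetypes
import urllib.request


def fetch_bytes(url, timeout_s=30.0):
    """Download the raw bytes behind a URL."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def to_data_url(image_bytes, filename_or_url):
    """Encode image bytes as a data URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename_or_url)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"  # assumed fallback when the extension is missing or unknown
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

Passing the plain image URL to the server is an alternative when the server can reach it directly. The data URL form avoids that dependency at the cost of a larger request body.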

Gemma 3n inference using SGLang (source model: https://huggingface.co/unsloth/gemma-3n-E2B-it)

Inference 1

Image source file "https://raw.githubusercontent.com/sgl-project/sglang/refs/heads/main/test/lang/example_image.png"

Load the image from the source URL.

[4]

Let's run the model!
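
The request cell is not shown, so this is a hedged sketch of how an image question could be sent to the server's OpenAI-compatible chat endpoint. The `chat.completion` object in the output suggests `/v1/chat/completions`. The helper names, the `max_tokens` default, and any prompt text are assumptions, since the original prompt is not visible.

```python
import json
import urllib.request


def build_image_chat_request(model, image_url, prompt, max_tokens=512):
    """Build an OpenAI-style chat payload with one image part and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,  # assumed limit; the notebook's value is unknown
    }


def post_chat_completion(base_url, payload, timeout_s=120.0):
    """POST the payload to base_url/v1/chat/completions and return the decoded JSON."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from the launch step running, a call might look like `post_chat_completion("http://127.0.0.1:8000", build_image_chat_request("unsloth/gemma-3n-E2B-it", image_url, "Describe this image."))`, where the prompt string is purely illustrative.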

[6]

{"id":"4b0ed837363d4f19819b4eeda28f0ebd","object":"chat.completion","created":1751651506,"model":"unsloth/gemma-3n-E2B-it","choices":[{"index":0,"message":{"role":"assistant","content":"The image shows a man ironing a blue shirt on the trunk of a yellow taxi cab in what appears to be a city street. \n\nHere's a breakdown of what's visible:\n\n*   **Man:** An adult male wearing a yellow long-sleeved shirt and dark pants. He is standing on the trunk of the taxi, ironing a blue shirt with an ironing board.\n*   **Taxi:** A yellow taxi cab with a rack on the trunk. The taxi is in motion, as indicated by the motion blur of the background.\n*   **Ironing Board:** A foldable ironing board is set up on the trunk of the taxi, supporting the shirt.\n*   **Shirt:** A blue shirt is being ironed.\n*   **Street:** The taxi is driving on a city street with white lane markings.\n*   **Buildings:** Tall buildings line the street, typical of a city environment.\n*   **Other Vehicles:** Another yellow taxi is visible in the background, moving in the opposite direction.\n*   **Cityscape:** There are elements of a city environment, including signs, posters, and possibly traffic lights.\n\n\n\nIt's a somewhat unusual sight!","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":106}],"usage":{"prompt_tokens":283,"total_tokens":529,"completion_tokens":246,"prompt_tokens_details":null}}
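
The raw JSON above is hard to read at a glance. A small sketch of pulling out the useful fields (the assistant text, the finish reason, and the token usage) could look like this. The function name `summarize_completion` is hypothetical, and the field names are the ones visible in the response.

```python
import json


def summarize_completion(raw_json):
    """Extract the assistant text and token usage from a chat.completion JSON string."""
    data = json.loads(raw_json)
    choice = data["choices"][0]
    usage = data.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "total_tokens": usage.get("total_tokens"),
    }
```

For the response above, the usage block reports 283 prompt tokens and 246 completion tokens, which add up to the stated total of 529.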

Inference 2

Image source file "https://i.ibb.co/1tw5whfz/ocr.png"

Load the image from the source URL.

[7]

Let's run the model!

[8]

{"id":"0d961928e0e542b49ace5d5c1a6c9d7a","object":"chat.completion","created":1751651525,"model":"unsloth/gemma-3n-E2B-it","choices":[{"index":0,"message":{"role":"assistant","content":"Hello, this is a sample text! Short and to the point.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":106}],"usage":{"prompt_tokens":282,"total_tokens":297,"completion_tokens":15,"prompt_tokens_details":null}}