In a haze of caffeine and blind ambition, I monkey-patched vLLM-Omni to run on the AMD Ryzen AI Max+ 395 (gfx1151) system. I'm very happy to say I can run Qwen2.5-Omni-7B, somewhat, using vLLM-Omni. Image and text inputs work great, and text output is fast. Audio output is slow and glitchy, and I don't currently have a great way to test video or image outputs. I'm hoping I can inspire others to jump in with more expertise, fix my mistakes, and make this workable for the masses.
[GitHub](https://github.com/kurt-apple/amd-strix-halo-vllm-omni-toolboxes)
# Why?
I am preparing for a bunch of AI-related software projects. Several of these will require usage beyond what I am comfortable paying, or deal with private data that I do not wish to share with others. Based on the use cases I have in mind, audio output is preferable.
# The Ryzen AI Max+ 395
I purchased a Beelink GTR9 Pro off eBay to do self-hosted AI research. The Strix Halo architecture lets RAM and VRAM share one pool of memory. Considering that the top model has 128GB RAM for roughly $2500 (prices fluctuate), this machine can run many of the same LLMs that a far more expensive server or workstation could. I think it's a big deal because, at the very least, it's relatively much more accessible.
# vLLM-Omni
I will probably make mistakes in this explanation, but vLLM-Omni sits atop vLLM to orchestrate the input and output modalities of various models, or parts of a model. For instance, if there were a reasoning model you wanted to make talk, both models would be served in parallel sub-processes, and the pipe between the two is configurable. There are also multimodal "omni" models whose internal structure lends itself particularly well to vLLM-Omni's orchestration scheme. vLLM-Omni supports the ROCm runtime - that's basically AMD's answer to CUDA, i.e. a way to efficiently run computation tasks on the GPU.
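To make the orchestration idea concrete, here is a minimal sketch of the pattern as I understand it: two "stages" running as separate processes with a pipe between them. This is only an illustration of the general pattern; the stage names and functions are made up and this is not vLLM-Omni's actual API.

```python
# Toy illustration of the stage-and-pipe pattern (NOT vLLM-Omni's real API)
from multiprocessing import Pipe, Process


def reasoner(conn):
    # stand-in for the text-generating stage
    conn.send("hello from the reasoning stage")
    conn.close()


def talker(conn):
    # stand-in for the speech stage, consuming the upstream text
    text = conn.recv()
    print(f"synthesizing audio for: {text!r}")


if __name__ == "__main__":
    upstream, downstream = Pipe()
    p1 = Process(target=reasoner, args=(upstream,))
    p2 = Process(target=talker, args=(downstream,))
    p1.start(); p2.start()
    p1.join(); p2.join()
```

The real system adds batching, schedulers, and KV caches per stage, but the shape - isolated processes joined by a configurable pipe - is the relevant part.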
# The Problem
There's a catch. gfx1151, aka the AI Max+ 395, is not fully & officially supported by ROCm. To boil it down, ROCm is mostly intended for AMD's enterprise-class GPUs, so consumer hardware support is not a focus. Developers must use the [nightly ROCm builds](https://github.com/ROCm/TheRock). The fragmented support of gfx1151 is evident in benchmarks wherein Vulkan runtimes usually beat ROCm in AI workload tests.
# The Point
I searched for methods to orchestrate and host models with audio outputs. There seem to (theoretically) be two approaches: DIY with PyTorch + Transformers, or vLLM-Omni (which is quite new). vLLM-Omni makes efficiency and performance claims around caching and its ability to host omni models. These appealed to me as a way to pack as many of the desired capabilities as possible into one model instance, rather than orchestrating several separate instances of different models with custom wrappers, scheduling, queues, messaging systems, and an API proxy.
# Hindsight
Before I go into details on the process, I want to summarize my results. While I can run vLLM-Omni on gfx1151 hardware, it took 8+ days of debugging and ungracefully cracking the software stack open to let it run. The reality of running one model instance to cover most of my needed features within my VRAM budget is more nuanced. Either way, in hindsight I should have stuck with an ensemble of separate specialized models (text -> image, text -> speech, image -> text, reasoning, coding, etc.) and written my own orchestration around all that. Sure, under the right configuration, it appears that separate components of an omni model will run next to each other so that you don't need 3 parallel instances of the full model. However, the KV cache is isolated per-module, so the VRAM allocation is still very high or unfeasible, and no matter what I try, there are OOM issues. This is why I lowered the goalpost from Qwen3-Omni to Qwen2.5-Omni-7B. Also, vLLM-Omni is just not performance-optimized for gfx1151... it is hard to debug, unstable, and brittle. The approach I detail below is neither battle-tested nor good.
# Approach
I found kyuz0's collection of Strix Halo toolboxes and pulled the latest to kick off this project. Reading the Dockerfile, it becomes clear kyuz0 et al. have put a lot of thought into patching vLLM to support the Strix Halo architecture. There are more details at [DeepWiki](https://deepwiki.com/kyuz0/amd-strix-halo-vllm-toolboxes/5.3-gfx1151-hardware-support). I built it locally to confirm that it works on my specific machine, the Beelink GTR9 Pro. The next logical step was to add the vLLM-Omni installation to this existing work.
# In Depth
I intended to add vLLM-Omni to this Dockerfile, since the underlying vLLM installation is still required: under the hood, Omni instantiates vLLM processes. Simply adding the clone and install steps clobbered the patched dependencies that preceded it in the Dockerfile, though. I like that word, clobbered. You'll see more of it soon.
Other issues arose. vLLM-Omni must align with the same version of the underlying vLLM, and kyuz0's toolbox container installs the latest vLLM (0.16.0 during this research). Omni's releases lag vLLM's, so I would have had to either a) downgrade vLLM in the toolbox or b) patch Omni to work with more versions of vLLM. Then, by sheer luck, vLLM-Omni put out a 0.16.0 pre-release.
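The alignment constraint can be sanity-checked before wasting a build. The helper below is something I made up for illustration (it is not part of either project), comparing only major.minor on the assumption that patch releases stay compatible:

```python
# Hedged sketch: check that installed vLLM and vLLM-Omni versions line up.
from importlib.metadata import version  # stdlib; raises if a package is absent


def versions_align(a: str, b: str) -> bool:
    # Compare major.minor only; patch releases are usually compatible
    return a.split(".")[:2] == b.split(".")[:2]


print(versions_align("0.16.0", "0.16.1"))  # True
print(versions_align("0.16.0", "0.15.2"))  # False
# In a live container you would feed it the real installed versions, e.g.:
# versions_align(version("vllm"), version("vllm-omni"))
```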
In rapid fire, here are the patches I required to successfully launch vLLM-Omni:
- run off a local copy of the torch, torchvision, torchaudio, and triton wheels to save time and so I don't get throttled by whoever hosts the files
- use the TheRock torch version/wheel so flash-attention doesn't CLOBBER it
- pull vLLM 0.16.0 specifically
- copy the locally-sourced triton CMAKE dep into `/opt/vllm/.deps` and set an envar
- set `MAX_JOBS=16` for faster compilation
- torch version hardcoding shenanigans: pin the torch version when other things install so it doesn't get CLOBBERED
- manually upgrade to transformers 5.1.0 which adds support for various Qwen Omni models.
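The pinning steps above can be sketched as Dockerfile fragments. Every path, wheel name, and version string here is a placeholder of mine, not the toolbox's actual layout; `PIP_CONSTRAINT` is a real pip environment variable that forces later installs to honor the pin.

```dockerfile
# Hypothetical sketch -- paths, names, and versions are placeholders.
COPY wheels/ /opt/wheels/
# Install the TheRock ROCm builds of torch from the local cache only,
# so the network never hands us a mismatched wheel.
RUN pip install --no-index --find-links=/opt/wheels \
        torch torchvision torchaudio triton
# Constrain later installs (e.g. flash-attention) so they cannot CLOBBER torch.
RUN pip freeze | grep "^torch==" > /opt/constraints.txt
ENV PIP_CONSTRAINT=/opt/constraints.txt
# Faster compilation
ENV MAX_JOBS=16
```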
## Patch Flash Attention
There was an issue where Flash Attention would crash due to an unknown/unsupported architecture, so I hardcoded `gfx1151`:
```python
# old
def get_arch() -> GpuArch:
    return GpuArch(name=name)  # bad value

# new
def get_arch() -> GpuArch:
    return GpuArch(name="gfx1151", family="rdna")  # hardcoded patch
```
## Patch vLLM
Certain hardware checks are done, and the APIs must have changed. Also, vLLM-Omni seemed to initialize ModelConfigs with the model property unset, which caused it to default to `Qwen/Qwen3-0.6B` - not even a multimodal model. Not sure why this is the default!
```cpp
// OLD -- did not exist in my build
/* ... */ C10_HIP_CHECK /* ... */
/* ... */ getCurrentHIPStreamMasqueradingAsCUDA /* ... */

// NEW -- probably/hopefully just renamed and not proliferated to other repos
/* ... */ C10_CUDA_CHECK /* ... */
/* ... */ getCurrentHIPStream /* ... */
```
```python
# OLD: ModelConfig class
class ModelConfig:
    model: str = "Qwen/Qwen3-0.6B"
    # etc.

# NEW: env var for the default, with a fallback
class ModelConfig:
    model: str = os.environ.get("VLLM_DEFAULT_MODEL", "Qwen/Qwen2.5-Omni-7B")
    # VLLM_DEFAULT_MODEL is set in Dockerfile.third
```
## Patch `offload-arch`
Some sophisticated logic on this one:
```bash
#!/bin/bash
echo "gfx1151"
```
## Patch ROCm
```python
# old: this assertion would just fail because I had to forcibly unset HIP_VISIBLE_DEVICES
# Prevent use of clashing `{CUDA/HIP}_VISIBLE_DEVICES`
if "HIP_VISIBLE_DEVICES" in os.environ:
    val = os.environ["HIP_VISIBLE_DEVICES"]
    if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None):
        assert val == cuda_val
    else:
        os.environ["CUDA_VISIBLE_DEVICES"] = val

# new: same shape, but the check is neutered
if "HIP_VISIBLE_DEVICES" in os.environ:
    val = os.environ["HIP_VISIBLE_DEVICES"]
    if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None):
        pass
    else:
        pass

# old: check if hardware supports fp8_fnuz quants
@classmethod
def is_fp8_fnuz(cls) -> bool:
    # only device 0 is checked; this assumes MI300 platforms are homogeneous
    return "gfx94" in torch.cuda.get_device_properties(0).gcnArchName

# new: give the correct answer for gfx1151 ;)
@classmethod
def is_fp8_fnuz(cls) -> bool:
    return False  # not supported on gfx1151
```
## Patch Qwen3 Model Launcher
A parameter was undefined but required a value.
```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
    # ADDED THIS: default rope_theta when the config leaves it unset
    talker_config.text_config.rope_parameters["rope_theta"] = getattr(
        talker_config.text_config, "rope_theta", 10000.0
    )
```
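The fix is just the usual read-with-fallback pattern. A toy illustration, using `SimpleNamespace` as a stand-in for the real config objects:

```python
from types import SimpleNamespace

# Stand-in for talker_config.text_config, with rope_theta left unset
text_config = SimpleNamespace(rope_parameters={})

# getattr falls back to 10000.0 when the attribute is missing
text_config.rope_parameters["rope_theta"] = getattr(
    text_config, "rope_theta", 10000.0
)
print(text_config.rope_parameters["rope_theta"])  # 10000.0
```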
This is not a super long list, but this is my first expedition into this stuff. I used Claude to bridge gaps in my knowledge as well as just keep the momentum going during this 8-day march through integration hell.
# Operating vLLM-Omni
After all this work of aligning dependencies and massaging some code with `sed`, I had a version of vLLM-Omni that could run and detect the gfx1151 GPU. vLLM-Omni wants a stage presets file; there are examples in `vllm_omni/model_executor/stage_configs`. These files compose the stages (components) of the omni pipeline: per-stage GPU VRAM budgets, runtime tweaks, context-window configuration, batching, and the pipeline connections between stages.
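For a rough flavor of what such a preset covers, here is a hedged sketch. The keys and values below are illustrative guesses of mine, not verified against the real schema; check the shipped examples in `vllm_omni/model_executor/stage_configs` for the actual field names.

```yaml
# Illustrative only -- field names are assumptions, not the real schema.
stages:
  - name: thinker                  # text/vision trunk
    gpu_memory_utilization: 0.5    # fraction of the VRAM budget
    max_model_len: 8192            # context window
  - name: talker                   # speech head
    gpu_memory_utilization: 0.2
connections:
  - from: thinker
    to: talker                     # the configurable pipe between stages
```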
# Results
After adapting the YAML for Qwen3-Omni and Qwen2.5-Omni and attempting to run, it became clear that my assumptions about running an omni model were malformed. Because I have run models of roughly 60GB plus a KV cache within the 96+ GB VRAM budget of the AI Max+ 395, it seemed feasible to run Qwen3-Omni the same way, operating in a single 60-ish GB window, since each stage or modality seemed to be a "sparse" subset of the total model. However, I was hitting OOM, which meant best-case that the cache/context had to be tuned down substantially. Worst-case, each stage was trying to instantiate the full model, for a total requirement of nearly 200GB. At this stage of research I couldn't determine whether this was even intentional behavior of vLLM-Omni, though I will point out that the [stage-config](https://docs.vllm.ai/projects/vllm-omni/en/latest/configuration/stage_configs/) `process: true/false` setting did not seem to affect it.
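The back-of-envelope arithmetic behind that conclusion, with rough numbers of my own (guesses, not measured figures):

```python
# Rough numbers; the point is the multiplier, not the exact sizes.
MODEL_GB = 60    # approximate Qwen3-Omni weight footprint (guess)
CACHE_GB = 20    # KV cache + headroom (guess)
BUDGET_GB = 96   # usable VRAM on a 128 GB Strix Halo box
STAGES = 3       # worst case: each stage instantiates the full model

shared = MODEL_GB + CACHE_GB    # weights loaded once, stages share them
worst_case = STAGES * MODEL_GB  # every stage loads its own copy of the weights

print(shared <= BUDGET_GB)      # True: feasible if weights were shared
print(worst_case)               # 180: plus caches, near the ~200GB I was seeing
print(worst_case <= BUDGET_GB)  # False: the OOM behavior I observed
```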
All this to say that I fell back to Qwen2.5-Omni-7B, which stands at about 24GB per instance. That plus a sufficient KV cache can slide into a 96GB VRAM budget. I then launched a Gradio demo web UI ([Source](https://github.com/vllm-project/vllm-omni/tree/main/examples/online_serving/qwen2_5_omni)).
After testing for a few minutes, the verdict was clear. I was tired.
Text and image inputs seem to work acceptably. Audio output, however, was unreliable and incredibly slow. Most likely there is a bug that delegates this work to the CPU, disables parallelism, or something else. Or this is just a legitimately heavy computation and gfx1151 doesn't have the juice. Either way, this was my pencils-down moment.
# Conclusion
Look, it works. I think the audio output problem is probably vLLM-Omni processing in a sub-optimal way. I don't have time to debug or patch further, but hopefully this gives eager devs a launching-off point to make it viable. Meanwhile, I will devote my time to writing a custom orchestrator that runs smaller, more specialized models for each modality my system needs to support. If there is contention over resources, models will need to run sequentially, with a scheduler to offload and reload them. vLLM-Omni shows promise, but until there is more community support and, at the very least, better documentation, I'll watch from the sidelines.