b7678
ggml webgpu: initial flashattention implementation (#18610)
FlashAttention (#13)
Add inplace softmax
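For reference, the computation an in-place softmax kernel performs can be sketched in scalar C++ (a minimal model only; the WGSL shader parallelizes the reductions across a workgroup, and all names here are illustrative):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// In-place, numerically stable softmax over one row of length n.
// The input buffer is overwritten with the result, so no scratch
// tensor is needed.
void softmax_inplace(float * row, size_t n) {
    float max_val = row[0];
    for (size_t i = 1; i < n; i++) {
        max_val = std::max(max_val, row[i]);
    }
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        row[i] = std::exp(row[i] - max_val); // overwrite input in place
        sum += row[i];
    }
    for (size_t i = 0; i < n; i++) {
        row[i] /= sum;
    }
}
```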
Move rms_norm to split row approach
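The split-row rms_norm distributes each row's sum-of-squares reduction across a workgroup; the single-threaded equivalent is the following sketch, assuming the usual rms_norm definition y[i] = x[i] / sqrt(mean(x^2) + eps):

```cpp
#include <cmath>
#include <cstddef>

// rms_norm over one row. In the split-row shader each invocation
// accumulates a partial sum of squares over its slice of the row,
// the partials are reduced in shared memory, then every invocation
// scales its slice; this is the equivalent scalar computation.
void rms_norm_row(const float * x, float * y, size_t n, float eps) {
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum_sq += x[i] * x[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / n + eps);
    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] * scale;
    }
}
```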
Update debug for supports_op
clean up debug statements
neg f16xf32xip builds and runs; haven't actually run a model that uses the neg kernel yet though
neg passes backend test
unary operators pass ggml tests
fixed rms_norm double-declaration bug
abides by editorconfig
removed vestigial files
fixed autoconfig
All operators (including xielu) working
removed unnecessary check for whether node->src[1] exists in unary operators
addressed PR comments
implemented REPL_Template support and fixed a bug in the unary operators kernel
formatted embed wgsl and ggml-webgpu.cpp
Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
Use map for shader replacements instead of pair of strings
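The replacement step can be pictured as applying a map of placeholder-to-value substitutions to the WGSL source (hypothetical sketch; the real keys live in the embedded shader templates):

```cpp
#include <map>
#include <string>

// Apply each {placeholder -> value} substitution to a shader source.
// A map keeps replacements keyed and deduplicated, instead of passing
// ad-hoc pairs of strings around. Names here are illustrative.
std::string apply_replacements(std::string src,
                               const std::map<std::string, std::string> & repls) {
    for (const auto & [key, value] : repls) {
        size_t pos = 0;
        while ((pos = src.find(key, pos)) != std::string::npos) {
            src.replace(pos, key.size(), value);
            pos += value.size(); // skip past the inserted text
        }
    }
    return src;
}
```

Usage would look like `apply_replacements(wgsl_src, {{"{{TYPE}}", "f16"}})`, specializing one template into several concrete pipelines.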
Wasm (#9)
webgpu : fix build on emscripten
more debugging stuff
test-backend-ops: force single thread on wasm
fix single-thread case for init_tensor_uniform
use jspi
add pthread
test: remember to set n_thread for cpu backend
Add buffer label and enable dawn-specific toggles to turn off some checks
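Enabling Dawn-specific toggles at device creation looks roughly like this (a sketch against Dawn's C++ API; the specific toggle names chosen here are assumptions, not necessarily the ones used):

```cpp
#include <webgpu/webgpu_cpp.h>

// Chain a DawnTogglesDescriptor into the device descriptor to turn off
// some runtime checks. Toggle names are illustrative; Dawn's Toggles
// registry lists the valid ones.
wgpu::Device create_device(wgpu::Adapter adapter) {
    const char * enabled[] = { "skip_validation", "disable_robustness" };

    wgpu::DawnTogglesDescriptor toggles;
    toggles.enabledToggles     = enabled;
    toggles.enabledToggleCount = 2;

    wgpu::DeviceDescriptor desc;
    desc.nextInChain = &toggles; // dawn-specific chained struct
    return adapter.CreateDevice(&desc);
}
```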
Intermediate state
Fast working f16/f32 vec4
Working float fast mul mat
Clean up naming of mul_mat to match logical model, start work on q mul_mat
Setup for subgroup matrix mat mul
Basic working subgroup matrix
Working subgroup matrix tiling
Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)
Working start to gemv
working f16 accumulation with shared memory staging
Print out available subgroup matrix configurations
Vectorize dst stores for sg matrix shader
Gemv working scalar
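The scalar GEMV baseline is one dot product per output row (single-threaded reference; in the shader each workgroup reduces one row):

```cpp
#include <cstddef>

// y = A * x, with A stored row-major (m rows, n cols).
// Scalar reference for the gemv shader: one dot product per row.
void gemv(const float * A, const float * x, float * y, size_t m, size_t n) {
    for (size_t r = 0; r < m; r++) {
        float acc = 0.0f;
        for (size_t c = 0; c < n; c++) {
            acc += A[r * n + c] * x[c];
        }
        y[r] = acc;
    }
}
```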
Minor set_rows optimization (#4)
updated optimization, fixed errors
non-vectorized version now dispatches one thread per element
Simplify
Change logic for set_rows pipelines
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
Comment on dawn toggles
Working subgroup matrix code for (semi)generic sizes
Remove some comments
Cleanup code
Update dawn version and move to portable subgroup size
Try to fix new dawn release
Update subgroup size comment
Only check for subgroup matrix configs if they are supported
Add toggles for subgroup matrix/f16 support on nvidia+vulkan
Make row/col naming consistent
Refactor shared memory loading
Move sg matrix stores to correct file
Working q4_0
Formatting
Work with emscripten builds
Fix test-backend-ops emscripten for f16/quantized types
Use emscripten memory64 to support get_memory
Add build flags and try ci
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Remove extra whitespace
Move wasm single-thread logic out of test-backend-ops for cpu backend
Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
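The guard can be sketched as a small helper (assumption: the actual check inside ggml_graph_plan may be written differently):

```cpp
// Single-threaded emscripten builds have no pthread support, so the
// requested thread count must be clamped. __EMSCRIPTEN_PTHREADS__ is
// only defined when compiling with -pthread.
static int clamp_threads(int n_threads) {
#if defined(__EMSCRIPTEN__) && !defined(__EMSCRIPTEN_PTHREADS__)
    (void) n_threads; // unused in single-thread builds
    return 1;
#else
    return n_threads;
#endif
}
```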
Refactored pipelines and workgroup calculations (#10)
refactored pipelines
refactored workgroup calculation
removed commented-out block of prior maps
Clean up ceiling division pattern
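The recurring pattern is integer ceiling division to compute workgroup counts; hoisting it into one helper (name illustrative) keeps dispatch sizes consistent:

```cpp
#include <cstdint>

// Number of workgroups needed to cover `total` elements with groups of
// size `per_wg`: ceil(total / per_wg) in integer arithmetic.
static uint32_t ceil_div(uint32_t total, uint32_t per_wg) {
    return (total + per_wg - 1) / per_wg;
}
```

For example, a kernel that handles WG_SIZE elements per workgroup dispatches `ceil_div(ne, WG_SIZE)` workgroups, so the tail of a non-multiple-size tensor is still covered.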
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
Start work on flash attention
Shader structure set up (many bugs still)
debugging
Working first test
Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
Start work on integrating pre-wgsl
Separate structs/initial shader compilation library into separate files
Work on compilation choices for flashattention
Work on subgroup matrix/tile size portability
subgroup-size-agnostic online softmax
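Online softmax keeps a running maximum and rescales the running sum whenever a larger score appears, which removes the dependence on subgroup width for a second pass (scalar sketch of the update rule):

```cpp
#include <cmath>
#include <cstddef>

// Online (streaming) softmax statistics: process scores one at a time,
// keeping a running max `m` and running sum `l`. When a larger max
// appears, the old sum is rescaled by exp(m_old - m_new). In flash
// attention the accumulated output rows are rescaled the same way.
void online_softmax(const float * scores, size_t n, float & m, float & l) {
    m = -INFINITY;
    l = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float m_new = std::max(m, scores[i]);
        l = l * std::exp(m - m_new) + std::exp(scores[i] - m_new);
        m = m_new;
    }
}
```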
Cleanups, quantization types
more cleanup
fix wasm build
Refactor flashattention to increase parallelism, use direct loads for KV in some cases
Checkpoint
formatting
Update to account for default kv cache padding
formatting shader
Add workflow for ggml-ci webgpu
Try passing absolute path to dawn in ggml-ci
Avoid error on device destruction, add todos for proper cleanup
Fix unused warning
Missed marking one parameter as unused
Move some flashattn computation to f32 for correctness
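Moving intermediates to f32 amounts to widening f16 operands before accumulating, e.g. (sketch; `half` is a stand-in, assuming a compiler with _Float16 support):

```cpp
#include <cstddef>

using half = _Float16; // assumption: clang/gcc target with _Float16

// Accumulate products in f32 even when inputs are f16, to limit
// overflow and precision loss in long reductions.
float dot_f16_f32acc(const half * a, const half * b, size_t n) {
    float acc = 0.0f; // f32 accumulator
    for (size_t i = 0; i < n; i++) {
        acc += (float) a[i] * (float) b[i];
    }
    return acc;
}
```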
Downloads: macOS/iOS, Linux, Windows x64 (CPU, CUDA 12 with 12.4 DLLs, CUDA 13 with 13.1 DLLs, Vulkan, SYCL, HIP), Windows arm64 (CPU), openEuler