
in browser favicon diffusion! scratch-dit pt.2

2025-02-19 · 11 min read

a fun side project to learn more about kernel development + optimization
🧨 TLDR; Implemented an ~11M parameter diffusion model in WebGPU that runs 32 denoising steps (~350M parameters' worth of compute) in ~0.7 seconds! Worked on kernel optimization, training/distilling a diffusion model, and fun code stuff.

All of this for… 🥁🥁🥁
IN BROWSER FAVICON DIFFUSION


So… this project took me from knowing ~0 about diffusion models, to building a diffusion model from scratch, to now porting it to the browser (from scratch again)!

The pitch here is basically: you can generate interesting icons and do fun things, all on local GPUs! WebGPU is still relatively new (and not supported by a lot of browsers), but the web barely taps the GPU's full capacity. Think native performance mixed into whatever application you need, with no separate LMStudio or local diffusion install required.

Performance

These numbers are rough and not exhaustive: run on an M1 Pro, 100 passes per benchmark, averaged.


Using some basic benchmarks (averaged over 100 runs), our implementation comes out 45% faster than a naive tensorflow.js implementation and 88% faster than a baseline JS implementation… so where do these performance gains come from?

Awesome kernel optimizations + buffer optimizations!

kernels

Each kernel is tiled to optimize for caching and to minimize the number of loop iterations. Tiling defaults to 16x16 due to memory constraints (16x16 = 256 threads, the default workgroup-size limit in WebGPU), though this is reasonable and fits in most L1 caches! Another optimization I saw large benefits from was pre-transposing the second matrix for matmuls, which makes its tile loads sequential and cache-friendly too!
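To make that concrete, here's a minimal sketch of a 16x16 tiled matmul in the WGSL-in-a-string style the project uses. This is an illustrative reconstruction, not the repo's exact kernel; it assumes the second matrix arrives pre-transposed (so Bt is N x K and both tiles are walked row-major).

```ts
// Illustrative sketch, not the repo's exact kernel. Assumes B was
// pre-transposed on the host, so both tile loads are coalesced.
const matmulWGSL = /* wgsl */ `
struct Dims { M : u32, N : u32, K : u32, pad : u32 }

@group(0) @binding(0) var<storage, read> A : array<f32>;        // M x K
@group(0) @binding(1) var<storage, read> Bt : array<f32>;       // N x K
@group(0) @binding(2) var<storage, read_write> C : array<f32>;  // M x N
@group(0) @binding(3) var<uniform> dims : Dims;

// 16x16 tiles staged in workgroup memory (fits the 256-invocation limit)
var<workgroup> tileA : array<f32, 256>;
var<workgroup> tileB : array<f32, 256>;

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(local_invocation_id)  lid : vec3<u32>,
        @builtin(workgroup_id)         wid : vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  var acc = 0.0;

  let numTiles = (dims.K + 15u) / 16u;
  for (var t = 0u; t < numTiles; t = t + 1u) {
    let k = t * 16u + lid.x;

    // Stage one element of each tile, zero-padding past the edges.
    var a = 0.0;
    if (row < dims.M && k < dims.K) { a = A[row * dims.K + k]; }
    tileA[lid.y * 16u + lid.x] = a;

    let btRow = wid.x * 16u + lid.y; // a row of Bt is a column of C
    var b = 0.0;
    if (btRow < dims.N && k < dims.K) { b = Bt[btRow * dims.K + k]; }
    tileB[lid.y * 16u + lid.x] = b;

    workgroupBarrier();
    for (var kk = 0u; kk < 16u; kk = kk + 1u) {
      acc = acc + tileA[lid.y * 16u + kk] * tileB[lid.x * 16u + kk];
    }
    workgroupBarrier();
  }

  if (row < dims.M && col < dims.N) {
    C[row * dims.N + col] = acc;
  }
}
`;
```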


I also implemented flash attention and now understand a lot of the logic behind it! Sadly, flash attention only pays off at much larger sequence lengths/attention sizes, and since we're limited to a 16x16 tile in WebGPU it doesn't provide a performance benefit here (it actually seems to slow things down).

Multithreading was also a large problem! For example, if you check out scale.wgsl, there are some interesting tricks for dividing the data across threads: letting some threads handle the slack, plus adding fallbacks in case the sizes don't divide evenly. See the sketch below.
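As one way to realize the slack-handling idea (a hedged sketch, not the actual scale.wgsl), a grid-stride loop lets a fixed dispatch cover any element count: each thread starts at its global id and strides by the total thread count, so the leftover elements are naturally absorbed.

```ts
// Hedged sketch of slack handling, not the repo's scale.wgsl.
const scaleWGSL = /* wgsl */ `
struct Params { scale : f32, n : u32 }

@group(0) @binding(0) var<storage, read_write> data : array<f32>;
@group(0) @binding(1) var<uniform> params : Params;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(num_workgroups) nwg : vec3<u32>) {
  let stride = nwg.x * 256u; // total threads in this dispatch
  // Grid-stride loop: the i < params.n check doubles as the fallback
  // for when the data size and thread count don't match up perfectly.
  for (var i = gid.x; i < params.n; i = i + stride) {
    data[i] = data[i] * params.scale;
  }
}
`;
```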

Handling everything as 1D arrays was especially interesting and led to a lot of sketching things out when designing the exact code. I took a deep operating systems class, so I already knew about the L1 cache and memory coalescing (making sure neighboring threads hit the same cache lines together), and I love that this has a super practical application here (it made the benchmark numbers go brrr).
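To make the 1D bookkeeping concrete, this is the kind of index math involved (an illustrative helper, not from the repo): with a row-major layout, threads that differ only in w touch adjacent floats, which is exactly what lets their loads coalesce.

```ts
// Illustrative: flatten (c, h, w) into a 1D offset. Keeping w as the
// fastest-varying coordinate means neighboring threads read neighboring
// floats, i.e. they share cache lines.
function flatIndex(c: number, h: number, w: number, H: number, W: number): number {
  return (c * H + h) * W + w;
}
```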

buffers

Another large problem was handling buffers correctly: disposing of them properly and not making too many synchronous calls. I ended up using patchify.wgsl and unpatchify.wgsl to keep more of the steps on the GPU!
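The general shape of that (a sketch under assumed names, not the repo's code) is to encode patchify, the transformer blocks, and unpatchify into a single command buffer, submit once with no mid-step readbacks, and only destroy scratch buffers after the queue is done with them.

```ts
// Hedged sketch: one submit per step, explicit scratch-buffer disposal.
async function runStep(
  device: GPUDevice,
  passes: { pipeline: GPUComputePipeline; bindGroup: GPUBindGroup; workgroups: number }[],
  scratch: GPUBuffer[],
): Promise<void> {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  for (const p of passes) {
    pass.setPipeline(p.pipeline);
    pass.setBindGroup(0, p.bindGroup);
    pass.dispatchWorkgroups(p.workgroups);
  }
  pass.end();
  device.queue.submit([encoder.finish()]);   // a single submit, no readbacks
  await device.queue.onSubmittedWorkDone();  // the one async wait per step
  for (const buf of scratch) buf.destroy();  // dispose explicitly, don't leak
}
```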

Buffer storage was also pretty interesting. I originally stored all the weights in a matrices.json file, which was extremely easy to load and use. The problem was, um… it was over 400mb. Compared to a pytorch .pt file, I was using ~10x the space, most of it just ASCII digits and whitespace.

The solution here was byte encoding! I ended up storing the parameters as binary, taking up as little space as I could: only reserving space for parameter names and using every remaining byte to store the actual numbers.
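A minimal sketch of that encoding idea (the exact on-disk layout here is my assumption, not necessarily the repo's format): a small header per tensor, then the raw IEEE-754 bytes, so each float costs 4 bytes instead of ~10 characters of JSON.

```ts
// Hypothetical layout per tensor:
// [u32 name length][utf-8 name][u32 float count][raw f32 bytes]
function encodeWeights(params: Map<string, Float32Array>): Uint8Array {
  const chunks: Uint8Array[] = [];
  const enc = new TextEncoder();
  for (const [name, values] of params) {
    const nameBytes = enc.encode(name);
    const header = new DataView(new ArrayBuffer(8));
    header.setUint32(0, nameBytes.length, true); // little-endian
    header.setUint32(4, values.length, true);
    chunks.push(
      new Uint8Array(header.buffer),
      nameBytes,
      new Uint8Array(values.buffer, values.byteOffset, values.byteLength),
    );
  }
  // Concatenate all chunks into one blob.
  const out = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
  let off = 0;
  for (const c of chunks) { out.set(c, off); off += c.length; }
  return out;
}
```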

Afterwards I got the file size down to ~40mb, which is still large but at least a little smaller (~1mb less) than the regular .pt torch save.

activations

The funniest problem I faced was probably related to activations. Because the diffusion model works in pixel space, the weights are quite sensitive: even a small quantization or rounding error can cause visible problems.

WebGPU differs from CUDA in many ways, and one of them is the lack of a built-in error function. I had trained the model with a GELU activation and didn't want to retrain it, so I used an approximation of the ERF (error function) based on tanh.
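For reference, the usual tanh approximation (what PyTorch calls approximate="tanh") looks like this; shown host-side for clarity, though the same expression works in WGSL:

```ts
// The standard tanh approximation of GELU; avoids needing erf() at all.
// Constants like these are exactly the kind a formatter can silently
// truncate (see below).
const SQRT_2_OVER_PI = 0.7978845608028654; // sqrt(2 / pi)

function geluTanh(x: number): number {
  return 0.5 * x * (1.0 + Math.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)));
}
```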

When integration testing, everything was off by a decently large amount! I went through tons of integration testing + logging, and found the activations to be ever so slightly off. After looking deeply, the Pytorch kernel was using an exact-ERF GELU, versus my tanh approximation… and that mismatch was causing the errors.

I quickly switched my kernel over to hard-coded constants, matching what I assumed Pytorch did, and things got a lot better! It still wasn't at exactly Pytorch's quality, so I kept debugging. Eventually I found an error, went back to the GELU kernel, looked at the Pytorch implementation, AND looked at the official CUDA erf precision notes + associated constants, and 🥁🥁🥁… my constant was off by 0.00001 because my WGSL formatter had truncated the numbers.

Eventually this led to me disabling all formatters, retraining my diffusion model with a much simpler SiLU activation, implementing that kernel, and calling it a day!
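SiLU needs no erf and no long constants, which is what made it the easy retrain target. A minimal sketch of such a kernel (illustrative, not the repo's exact shader):

```ts
// silu(x) = x * sigmoid(x), written as x / (1 + exp(-x)).
const siluWGSL = /* wgsl */ `
@group(0) @binding(0) var<storage, read_write> data : array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&data)) {
    let x = data[i];
    data[i] = x / (1.0 + exp(-x)); // no erf, no magic constants
  }
}
`;
```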

Training

pre-training

An interesting problem I had was HOW to train a model like this. I could either find a bunch of hippo images, take the Stable Diffusion architecture and distill it, or generate a bunch of synthetic data. I chose the latter since it took much less dev time and let me test things in a more reproducible way.

Since generating favicons doesn't really call for any conditioning, I didn't train the model on any class/text labels.

Initially I used general hippo images generated by Stable Diffusion, where the actual hippo structure was pretty hard to converge on. I used ~8k timesteps here and ablated the MLP size/patch size quite a bit to try to get it as clean as possible. You can pretty clearly see body structure + some semblance of face structure.


Seeing this, I simplified the prompt to just a hippo face. This increased the accuracy + shape a lot! Training was honestly pretty stable, and I started seeing faces and good structure relatively soon.


distillation

Going from 8k timesteps all the way down to 32 was a pretty hard and really interesting problem. I read up on a lot of different papers and ended up implementing this diffusion distillation algorithm: https://github.com/google-research/google-research/tree/master/diffusion_distillation. It's a good mix of checkpointing at every decrease in step count and compressing 2 steps at a time.

For distillation I also modified the training objective to predict the original image from the ending latents. I read that at the early, high-noise steps (with almost no real "data") this helps the gradients at least update slightly toward the image, and it helped the model converge a lot faster to less noisy distributions! A sketch of both ideas follows.
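Putting the two ideas together, here's a hedged, TypeScript-flavored pseudocode sketch of one distillation round (every name here is illustrative, not from the repo or the paper's codebase): the student learns to match two teacher steps with one, halving the step count each round.

```ts
type Tensor = Float32Array;

interface Model {
  predictX0(xt: Tensor, t: number): Tensor; // predicts the clean image
}

// One deterministic (DDIM-style) update from step t to tNext, rebuilt
// from the model's x0 prediction. Noise-schedule details elided.
declare function ddimStep(m: Model, xt: Tensor, t: number, tNext: number): Tensor;
declare function mseLoss(a: Tensor, b: Tensor): number;
declare function gradientStep(student: Model, loss: number): void;
declare function sampleNoisyBatch(steps: number): { xt: Tensor; t: number };

function distillOneRound(teacher: Model, student: Model, teacherSteps: number): void {
  const studentSteps = teacherSteps / 2; // compress 2 teacher steps into 1
  for (let iter = 0; iter < 10_000; iter++) { // iteration count illustrative
    const { xt, t } = sampleNoisyBatch(studentSteps);
    // Teacher: two fine steps...
    const mid = ddimStep(teacher, xt, t, t - 0.5);
    const target = ddimStep(teacher, mid, t - 0.5, t - 1);
    // ...student: one coarse step that should land in the same place.
    const pred = ddimStep(student, xt, t, t - 1);
    // The post's variant computes this loss against the predicted
    // original image (x0) rather than against the noisy latent.
    gradientStep(student, mseLoss(pred, target));
  }
  // Checkpoint here; the student becomes the next round's teacher.
}
```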


One really nice thing I realized is that transfer learning works especially well on diffusion models! Changing the step count + adding new parameters was actually very easy, and training seemed to converge fast even when reusing only half the weights from a previous run.

WebGPU Tips & Tricks

WebGPU is still extremely experimental, so this only works in select browsers with flags enabled! This also means the development environment is pretty rough and honestly in need of some kind of web bundler support.

One quick hack I quite like is a compile.sh script that takes all the .wgsl files and compiles them into variables you can import in code. This is a lot nicer than dealing with JS string literals, and it's the only way I could keep syntax highlighting.
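The post's compile.sh isn't shown, so here's a hypothetical Node/TypeScript equivalent of the same trick: scan a kernels/ directory (assumed location) and emit a shaders.ts module that exports every .wgsl file as a string constant.

```ts
// Hypothetical build step, not the author's script.
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { basename, join } from "node:path";

const dir = "kernels"; // assumed location of the .wgsl files
const out: string[] = [];
for (const file of readdirSync(dir)) {
  if (!file.endsWith(".wgsl")) continue;
  // Sanitize the filename into a valid JS identifier.
  const name = basename(file, ".wgsl").replace(/[^a-zA-Z0-9_]/g, "_");
  const src = readFileSync(join(dir, file), "utf8");
  out.push(`export const ${name} = ${JSON.stringify(src)};`);
}
writeFileSync("shaders.ts", out.join("\n") + "\n");
```

You keep editing real .wgsl files (with full syntax highlighting) and the generated module is what the app imports.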

It seems like Safari supports 2D matrices while Chrome does not, but using 1D matrices is probably better anyway for a more bare-metal view of the system. I also noticed Safari doesn't surface as many debugging logs as Chrome, so I highly recommend developing in Chrome.


Overall, this was a super fun project to work on. I loved writing the kernels and thinking through exactly how to arrange the loops to fit the cache, plus all the different threading interactions. This was mainly an educational project, and I hope you enjoyed looking through it!

Check out the source code @ https://github.com/neelr/favicon-diffusion and the demo @ https://web-dit.vercel.app! The favicon & embed can also be found @ https://neelr.dev (on my homepage)

Thanks for reading!