Johannes Gäßler | 9b596417af | CUDA: quantized KV support for FA vec (#7527) * CUDA: quantized KV support for FA vec * try CI fix * fix commented-out kernel variants * add q8_0 q4_0 tests * fix nwarps > batch size * split fattn compile via extern templates * fix flake8 * fix metal tests * fix cmake * make generate_cu_files.py executable * add autogenerated .cu files * fix AMD * error if type_v != FP16 and not flash_attn * remove obsolete code | 2024-06-01 08:44:14 +02:00 |
Johannes Gäßler | cd93a28cb1 | CUDA: fix FA out-of-bounds reads (#7479) | 2024-05-23 00:31:20 +02:00 |
Johannes Gäßler | 38c03478a3 | CUDA: fix FA out-of-bounds writes (#7465) | 2024-05-22 17:58:25 +02:00 |
Georgi Gerganov | 9b3d833189 | cuda : fix compile warning (#7454) | 2024-05-22 12:36:37 +03:00 |
Johannes Gäßler | 95fb0aefab | CUDA: remove incorrect precision check (#7454) | 2024-05-22 10:24:29 +02:00 |
Johannes Gäßler | 133d99c599 | CUDA: deduplicate FlashAttention code (#7352) | 2024-05-18 12:36:25 +02:00 |
Johannes Gäßler | 0fc1e820a9 | CUDA: faster large batch FA without tensor cores (#7314) | 2024-05-17 18:54:52 +02:00 |