* winogrande: simple implementation
It doesn't look like it is working - why?
For Mistral-7B it is barely better than
random chance (score ~60% for 1267 tasks), while I see
Mistral-7B scoring 78.4% on the HF leader board.
1-sigma statistical uncertainty for 1267 tasks is ~1.4,
so no way the difference is due to statistics.
* winogrande: somewhat better
Score for Mistrali7-B is now 68.9 on the validation set of
winogrande_debiased. Still far from the reported 78.4, but
better than what I had before.
* winogrande: improving
Mistral-7B score is now 73.56.
Still not quite 78.4 but getting there.
We are also getting a lower score on HellaSwag
compared to HF leader board, so I'm not expecting
we will get up to 78.4 anyway.
It looks like it is better to skip the choice word(s)
when evaluating the average log-likelihood. This kind of
makes sense because a more common word (in Winogrande this is
often a name) will have a higher probability without knowing
about the follow up context, and this will skew the log-likelihood
towards the more common word. We can only do this if the
choice words are not last in the sentence.
It also looks like it is better to skip the punctuation at the
end of the sentence, provided the choice words are not last.
* winogrande: add dataset instructions
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Metal memory: Small memory leak on init, dangling pointer, and unused autorelease pool in graph compute
* SPM header potential fix
* Reverting symlinks
* ggml : add IQ2 to test-backend-ops + refactoring
ggml-ci
* cuda : update supports_op for IQ2
ggml-ci
* ci : enable LLAMA_CUBLAS=1 for CUDA nodes
ggml-ci
* cuda : fix out-of-bounds-access in `mul_mat_vec_q`
ggml-ci
* tests : avoid creating RNGs for each Q tensor
ggml-ci
* tests : avoid creating RNGs for each tensor
ggml-ci
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id hanlding
ggml-ci
* ci : add imatrix test
ggml-ci
* ci : rearrange output
ggml-ci
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* simple : no need for ggml_is_contiguous + fix bool parse
* llama : fix callback placement in llama_context_params
* backend : avoid double-ask callback calls
* simple : restore examples, imatrix will serve as a demo
This commit adds the name of the training data file to the log message
printed when the training data is tokenized.
The motivation for this change is that it can be useful to show which
file is being tokenized when running the finetune example.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Introduce starter project for Android
Based on examples/llama.swiftui.
* Add github workflow
* Set NDK version
* Only build arm64-v8a in CI
* Sync bench code
* Rename CI prop to skip-armeabi-v7a
* Remove unused tests
* metal: Log `recommendedMaxWorkingSetSize` on iOS 16+
* Only log on iOS and macOS, ignoring tvOS and other platforms
* Check for Xcode version before using recommendedMaxWorkingSetSize
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This change makes it possible to build ggml-cuda.cu and ggml-metal.m as
independent dynamic shared objects, that may be conditionally linked at
runtime in a multiplatform binary. It introduces a GGML_CALL annotation
that documents which functions have a cyclic call relationship, between
the application code and GPU modules.
This change does nothing, unless the build defines -DGGML_MULTIPLATFORM
which causes back-references and function pointers to conform to MS ABI
which is supported by NVCC, ROCm, XCode, GCC and Clang across platforms
This commit replaces the magic number LLAMA_FILE_MAGIC_LORA used in
finetune.cpp with LLAMA_FILE_MAGIC_GGLA defined in llama.h.
Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* Fix ffn_down quantization mix for MoE models
In #4872 I did not consider the part where every third
tensor is quantized with more bits. Fir MoE this leads to tensors
of the same layer being quantized with different number of bits,
which is not considered as a possibility in the inference implementation
(it is assumed all experts use the same quantization).
* Fix the fix
* Review suggestion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>