Calling `mmap.mmap` on Windows apparently resets the file offset of the
raw file object (and makes the BufferedReader return a *negative* file
offset). For safetensors, avoid using the file offset after calling
mmap. For GGML format, explicitly save and restore the offset.
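A minimal sketch of the save/restore workaround for the GGML path (the function name here is hypothetical, not the script's actual API):

```python
import mmap

def map_model_file(fp):
    # mmap.mmap on Windows apparently moves the underlying file offset
    # (and can leave the BufferedReader reporting a negative position),
    # so remember the current offset and seek back to it afterwards.
    saved_offset = fp.tell()
    mapped = mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ)
    fp.seek(saved_offset)
    return mapped
```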
Fixes #966.
* GGML map ops proof of concept.
* Various cleanups.
- Add handling for task setting.
- Add handling for ggml_compute_backward.
- Rename functions to ggml_map_unary_f32 and ggml_map_binary_f32.
- Fix compiler warnings related to casting function pointers and `void *`.
- Reorder functions and definitions based on the GGML op number.
- Use typedefs for map op function pointer types.
* Fix position of map ops cases in ggml_compute_forward
Current status: Working, except for the latest GPTQ-for-LLaMa format
that includes `g_idx`. This turns out to require changes to GGML, so
for now it only works if you use the `--outtype` option to dequantize it
back to f16 (which is pointless except for debugging).
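For context on why `g_idx` needs GGML changes: with `--act-order`, the quantization group of each input row comes from a lookup table rather than from the row's position, which GGML's fixed-stride quantization groups can't express. Schematically, dequantizing back to f16 looks roughly like this (bit-packing details omitted; this is not the actual GPTQ storage layout):

```python
import numpy as np

def dequantize_gptq(qweight, scales, qzeros, g_idx):
    # qweight: (in_features, out_features) quantized values, already unpacked
    # scales:  (n_groups, out_features) per-group scale factors
    # qzeros:  (n_groups, out_features) per-group zero points, already unpacked
    # g_idx:   (in_features,) group index of each input row; with act-order
    #          this is an arbitrary mapping, not simply row // groupsize
    return (scales[g_idx] * (qweight - qzeros[g_idx])).astype(np.float16)
```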
I also included some cleanup for the C++ code.
This script is meant to replace all the existing conversion scripts
(including the ones that convert from older GGML formats), while also
adding support for some new formats. Specifically, I've tested with:
- [x] `LLaMA` (original)
- [x] `llama-65b-4bit`
- [x] `alpaca-native`
- [x] `alpaca-native-4bit`
- [x] LLaMA converted to 'transformers' format using
`convert_llama_weights_to_hf.py`
- [x] `alpaca-native` quantized with `--true-sequential --act-order
--groupsize 128` (dequantized only)
- [x] same as above plus `--save_safetensors`
- [x] GPT4All
- [x] stock unversioned ggml
- [x] ggmh
There's enough overlap in the logic needed to handle these different
cases that it seemed best to move to a single script.
I haven't tried this with Alpaca-LoRA because I don't know where to find
it.
Useful features:
- Uses multiple threads for a speedup in some cases (though the Python
GIL limits the gain, and sometimes it's disk-bound anyway).
- Combines split models into a single file (both the intra-tensor split
  of the original and the inter-tensor split of 'transformers' format
  files). Single files are more convenient to work with and friendlier
  to a future switch to memory mapping on the C++ side. To accomplish
  this without increasing memory requirements, the script has some
  custom loading code that avoids reading whole input files into memory
  at once (see the sketch after this list).
- Because of the custom loading code, it no longer depends on PyTorch,
which might make installing dependencies slightly easier or faster...
although it still depends on NumPy and sentencepiece, so I don't know
if there's any meaningful difference. In any case, I also added a
requirements.txt file to lock the dependency versions in case of any
future breaking changes.
- Type annotations checked with mypy.
- Some attempts to be extra user-friendly:
- The script tries to be forgiving with arguments, e.g. you can
specify either the model file itself or the directory containing
it.
- The script doesn't depend on config.json / params.json, just in
case the user downloaded files individually and doesn't have those
handy. But you still need tokenizer.model and, for Alpaca,
added_tokens.json.
- The script tries to give a helpful error message if
added_tokens.json is missing.
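A rough sketch of the custom loading idea mentioned above (class and function names are hypothetical; the real script also deals with headers, dtype conversion, and threading):

```python
from pathlib import Path
import numpy as np

class LazyTensor:
    """Records where a tensor lives on disk; reads it only when asked."""
    def __init__(self, path: Path, offset: int, shape: tuple, dtype: np.dtype):
        self.path, self.offset, self.shape, self.dtype = path, offset, shape, dtype

    def load(self) -> np.ndarray:
        with open(self.path, 'rb') as fp:
            fp.seek(self.offset)
            count = int(np.prod(self.shape))
            return np.fromfile(fp, dtype=self.dtype, count=count).reshape(self.shape)

def write_combined(tensors: dict, out_path: Path) -> None:
    # Stream one tensor at a time out of the (possibly many) input shards,
    # so the whole model is never resident in memory at once.
    with open(out_path, 'wb') as out:
        for name, lazy in tensors.items():
            out.write(lazy.load().tobytes())  # plus whatever header the format needs
```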
* Add batch size support to perplexity
* Revert "Fix memory allocation issues and seg faults"
This reverts commit 4870e455b3.
* update from merge
* Remove perplexity from main
* updates
* Update batch size for efficiency
* Initial version of q4_0 matrix multiplication benchmark
* Bugfix: Added dependency on ggml.o to the benchmark target
* Reviewer requests: added parameter for threads, switched to ggml_time_us()
* Reviewer input: removed rdtsc, use epsilon for check
* Review comment: Removed set_locale
* Feature: Param for number of iterations, bugfix for use of the threads parameter
* Reviewer suggestion: Moved to examples
* Reviewer feedback: Updated clean: and benchmark: sections
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Mostly for msys2 and mingw64 builds, which are different from each other
and different from standard Visual Studio builds. Isn't Windows fun?
- Define _GNU_SOURCE in more files (it's already used in ggml.c for
Linux's sake).
- Don't use PrefetchVirtualMemory when not building for Windows 8 or later
  (mingw64 doesn't target it by default), but warn the user about this
  situation since it's probably not intended.
- Check for NOMINMAX already being defined, which it is on mingw64.
- Actually use the `increment` variable (bug in my `pizza` PR).
- Suppress unused variable warnings in the fake pthread_create and
pthread_join implementations for Windows.
- (not Windows-related) Remove mention of `asprintf` from comment;
`asprintf` is no longer used.
Fixes #871.