llama.cpp/examples/quantize/quantize.cpp

#include "ggml.h"
#include "llama.h"

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

// usage:
//  ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type [nthread]
//
int main(int argc, char ** argv) {
    ggml_time_init();

    if (argc < 4) {
        fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
        fprintf(stderr, "  type = %d - q4_0\n", LLAMA_FTYPE_MOSTLY_Q4_0);
        fprintf(stderr, "  type = %d - q4_1\n", LLAMA_FTYPE_MOSTLY_Q4_1);
        fprintf(stderr, "  type = %d - q4_2\n", LLAMA_FTYPE_MOSTLY_Q4_2);
        fprintf(stderr, "  type = %d - q4_3\n", LLAMA_FTYPE_MOSTLY_Q4_3);
        fprintf(stderr, "  type = %d - q8_0\n", LLAMA_FTYPE_MOSTLY_Q8_0);
        return 1;
    }

    // needed to initialize f16 tables
    {
        struct ggml_init_params params = { 0, NULL, false };
        struct ggml_context * ctx = ggml_init(params);
        ggml_free(ctx);
    }

    const std::string fname_inp = argv[1];
    const std::string fname_out = argv[2];

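    // note: atoi() performs no error checking; non-numeric input yields 0
    const enum llama_ftype ftype = (enum llama_ftype)atoi(argv[3]);
    const int nthread = argc > 4 ? atoi(argv[4]) : 0; // optional thread count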

    const int64_t t_main_start_us = ggml_time_us();

    int64_t t_quantize_us = 0;

    // quantize the model and time the call (a non-zero return value means failure)
    {
        const int64_t t_start_us = ggml_time_us();

        if (llama_model_quantize(fname_inp.c_str(), fname_out.c_str(), ftype, nthread)) {
            fprintf(stderr, "%s: failed to quantize model from '%s'\n", __func__, fname_inp.c_str());
            return 1;
        }

        t_quantize_us = ggml_time_us() - t_start_us;
    }

    // report timing
    {
        const int64_t t_main_end_us = ggml_time_us();

        printf("\n");
        printf("%s: quantize time = %8.2f ms\n", __func__, t_quantize_us/1000.0);
        printf("%s:    total time = %8.2f ms\n", __func__, (t_main_end_us - t_main_start_us)/1000.0);
    }

    return 0;
}
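
The `type` and `nthread` arguments above are converted with `atoi()`, which cannot distinguish a literal `0` from input that is not a number. As a minimal sketch of stricter parsing, the helper below uses `std::strtol` and rejects empty, partially numeric, or out-of-range strings. The name `parse_int_arg` is hypothetical and is not part of llama.cpp or ggml.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>

// Hypothetical helper (an assumption for illustration, not a llama.cpp API):
// parse a base-10 integer argument and report whether the whole string
// was a valid value that fits in an int.
static bool parse_int_arg(const char * arg, int & out) {
    if (arg == nullptr || *arg == '\0') {
        return false; // missing or empty argument
    }

    errno = 0;
    char * end = nullptr;
    const long value = std::strtol(arg, &end, 10);

    // reject overflow, trailing garbage ("12abc"), and non-numeric input ("q4_0")
    if (errno == ERANGE || *end != '\0' || value < INT_MIN || value > INT_MAX) {
        return false;
    }

    out = static_cast<int>(value);
    return true;
}
```

In `main`, a call such as `parse_int_arg(argv[3], value)` could replace `atoi(argv[3])`, printing the usage text and returning 1 when it reports failure. Whether the parsed number is one of the supported `LLAMA_FTYPE_*` values would still need a separate check.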