ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
#include "ggml.h"
|
2024-11-16 21:17:59 +01:00
|
|
|
#include "ggml-alloc.h"
|
|
|
|
#include "ggml-backend.h"
|
|
|
|
#include "ggml-cpu.h"
|
|
|
|
#include "ggml-opt.h"
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2023-08-02 10:06:19 +02:00
|
|
|
#include <cmath>
|
2024-11-16 22:40:39 +01:00
|
|
|
#include <cinttypes>
|
2024-11-16 21:17:59 +01:00
|
|
|
#include <random>
|
|
|
|
#include <string>
|
|
|
|
#include <thread>
|
|
|
|
#include <vector>
|
|
|
|
|
|
|
|
static bool almost_equal(const double a, const double b, const double atol) {
|
|
|
|
return fabs(a - b) < atol;
|
|
|
|
}
|
|
|
|
|
|
|
|
constexpr int64_t ne_datapoint = 2;
|
|
|
|
constexpr int64_t ne_label = 1;
|
|
|
|
constexpr int64_t ndata = 6;
|
|
|
|
|
|
|
|
struct helper_ctx_data {
|
|
|
|
std::vector<ggml_opt_dataset_t> datasets_supervised;
|
|
|
|
std::vector<struct ggml_tensor *> data_batch;
|
|
|
|
std::vector<struct ggml_tensor *> labels_batch;
|
|
|
|
|
|
|
|
ggml_opt_dataset_t dataset_unsupervised;
|
|
|
|
struct ggml_context * ctx_static;
|
|
|
|
struct ggml_context * ctx_compute;
|
|
|
|
struct ggml_opt_params opt_params;
|
|
|
|
ggml_opt_context_t opt_ctx;
|
|
|
|
struct ggml_tensor * inputs;
|
|
|
|
struct ggml_tensor * weights;
|
|
|
|
struct ggml_tensor * outputs;
|
|
|
|
ggml_backend_buffer_t buf;
|
|
|
|
ggml_opt_result_t result;
|
|
|
|
ggml_opt_result_t result2;
|
|
|
|
};
|
|
|
|
|
|
|
|
// These default values make it easier to check optimization results vs. expected values.
|
|
|
|
static ggml_opt_optimizer_params helper_get_test_opt_pars(void * userdata) {
|
|
|
|
ggml_opt_optimizer_params result = ggml_opt_get_default_optimizer_params(userdata);
|
|
|
|
result.adamw.alpha = 1.0f;
|
|
|
|
result.adamw.beta1 = 0.0f;
|
|
|
|
result.adamw.beta2 = 0.0f;
|
|
|
|
result.adamw.eps = 0.0f;
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
|
|
|
static helper_ctx_data helper_get_ctx_data(
|
|
|
|
ggml_backend_sched_t backend_sched,
|
|
|
|
ggml_backend_t backend,
|
|
|
|
const bool init_opt_ctx = true,
|
|
|
|
const bool optimizer_defaults = true,
|
|
|
|
int64_t nbatch_logical = 1,
|
|
|
|
int64_t nbatch_physical = 1,
|
|
|
|
enum ggml_opt_loss_type loss_type = GGML_OPT_LOSS_TYPE_SUM) {
|
|
|
|
std::vector<ggml_opt_dataset_t> datasets(ndata);
|
|
|
|
for (int64_t ndata_shard = 1; ndata_shard <= ndata; ++ndata_shard) {
|
|
|
|
ggml_opt_dataset_t dataset = ggml_opt_dataset_init(ne_datapoint, ne_label, ndata, ndata_shard);
|
|
|
|
|
|
|
|
float * data = ggml_get_data_f32(ggml_opt_dataset_data( dataset));
|
|
|
|
float * labels = ggml_get_data_f32(ggml_opt_dataset_labels(dataset));
|
|
|
|
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
for (int64_t id = 0; id < ne_datapoint; ++id) {
|
|
|
|
data[ idata*ne_datapoint + id] = 16*idata + id;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
for (int64_t il = 0; il < ne_label; ++il) {
|
|
|
|
labels[idata*ne_label + il] = 16*(16*idata + il);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
datasets[ndata_shard-1] = dataset;
|
|
|
|
}
|
|
|
|
|
|
|
|
ggml_opt_dataset_t dataset_unsupervised = ggml_opt_dataset_init(1, 0, ndata, /*ndata_shard =*/ 1);
|
|
|
|
|
|
|
|
float * data = ggml_get_data_f32(ggml_opt_dataset_data(dataset_unsupervised));
|
|
|
|
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
data[idata] = idata;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct ggml_context * ctx_static;
|
|
|
|
struct ggml_context * ctx_compute;
|
|
|
|
{
|
|
|
|
struct ggml_init_params params = {
|
|
|
|
/*.mem_size =*/ (2*ndata + 2)*ggml_tensor_overhead(),
|
|
|
|
/*.mem_buffer =*/ nullptr,
|
|
|
|
/*.no_alloc =*/ true,
|
|
|
|
};
|
|
|
|
ctx_static = ggml_init(params);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
struct ggml_init_params params = {
|
|
|
|
/*.mem_size =*/ GGML_DEFAULT_GRAPH_SIZE*ggml_tensor_overhead() + 3*ggml_graph_overhead(),
|
|
|
|
/*.mem_buffer =*/ nullptr,
|
|
|
|
/*.no_alloc =*/ true,
|
|
|
|
};
|
|
|
|
ctx_compute = ggml_init(params);
|
|
|
|
}
|
|
|
|
|
|
|
|
std::vector<struct ggml_tensor *> data_batch(ndata);
|
|
|
|
std::vector<struct ggml_tensor *> labels_batch(ndata);
|
|
|
|
for (int64_t ndata_batch = 1; ndata_batch <= ndata; ++ndata_batch) {
|
|
|
|
data_batch[ndata_batch-1] = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, ndata_batch*ne_datapoint);
|
|
|
|
labels_batch[ndata_batch-1] = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, ndata_batch*ne_label);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct ggml_tensor * inputs = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, nbatch_physical);
|
|
|
|
ggml_set_name(inputs, "inputs");
|
|
|
|
|
|
|
|
struct ggml_tensor * weights = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, 1);
|
|
|
|
ggml_set_name(weights, "weights");
|
|
|
|
ggml_set_param(ctx_static, weights);
|
|
|
|
|
|
|
|
struct ggml_tensor * intermediary = ggml_add(ctx_compute, inputs, weights);
|
|
|
|
|
|
|
|
struct ggml_tensor * outputs = ggml_scale(ctx_compute, intermediary, 1.0f);
|
|
|
|
ggml_set_name(outputs, "outputs");
|
|
|
|
|
|
|
|
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_static, backend);
|
|
|
|
const float w0 = float(ndata)/2;
|
|
|
|
ggml_backend_tensor_set(weights, &w0, 0, sizeof(float));
|
|
|
|
|
|
|
|
GGML_ASSERT(nbatch_logical % nbatch_physical == 0);
|
|
|
|
const int32_t opt_period = nbatch_logical / nbatch_physical;
|
|
|
|
|
|
|
|
struct ggml_opt_params opt_params = ggml_opt_default_params(backend_sched, ctx_compute, inputs, outputs, loss_type);
|
|
|
|
opt_params.opt_period = opt_period;
|
|
|
|
if (!optimizer_defaults) {
|
|
|
|
opt_params.get_opt_pars = helper_get_test_opt_pars;
|
|
|
|
}
|
|
|
|
ggml_opt_context_t opt_ctx = init_opt_ctx ? ggml_opt_init(opt_params) : nullptr;
|
|
|
|
|
|
|
|
ggml_opt_result_t result = ggml_opt_result_init();
|
|
|
|
ggml_opt_result_t result2 = ggml_opt_result_init();
|
|
|
|
|
|
|
|
return {datasets, data_batch, labels_batch, dataset_unsupervised, ctx_static, ctx_compute, opt_params, opt_ctx, inputs, weights, outputs, buf, result, result2};
|
|
|
|
}
|
|
|
|
|
|
|
|
static void helper_free_ctx_data(struct helper_ctx_data ctx_data) {
|
|
|
|
ggml_opt_result_free(ctx_data.result);
|
|
|
|
ggml_opt_result_free(ctx_data.result2);
|
|
|
|
ggml_opt_free(ctx_data.opt_ctx);
|
|
|
|
ggml_backend_buffer_free(ctx_data.buf);
|
|
|
|
ggml_free(ctx_data.ctx_static);
|
|
|
|
ggml_free(ctx_data.ctx_compute);
|
|
|
|
for (ggml_opt_dataset_t dataset : ctx_data.datasets_supervised) {
|
|
|
|
ggml_opt_dataset_free(dataset);
|
|
|
|
}
|
|
|
|
ggml_opt_dataset_free(ctx_data.dataset_unsupervised);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void helper_after_test(
|
|
|
|
const char * func, const bool high_level, const std::string options,
|
|
|
|
const std::string subtest, const bool subtest_ok, int & ntest, int & npass) {
|
|
|
|
printf(" %s(high_level=%s%s, subtest=%s): ",
|
|
|
|
func, high_level ? "yes" : "no", options.c_str(), subtest.c_str());
|
|
|
|
if (subtest_ok) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
npass++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
ntest++;
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_dataset(ggml_backend_sched_t backend_sched, ggml_backend_t backend, const bool shuffle) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend);
|
|
|
|
|
|
|
|
for (int64_t ndata_shard = 1; ndata_shard <= ndata; ++ndata_shard) {
|
|
|
|
ggml_opt_dataset_t dataset = cd.datasets_supervised[ndata_shard-1];
|
|
|
|
|
|
|
|
if (shuffle) {
|
|
|
|
ggml_opt_dataset_shuffle(cd.opt_ctx, dataset, -1);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int64_t ndata_batch = 1; ndata_batch <= ndata; ++ndata_batch) {
|
|
|
|
if (ndata_batch % ndata_shard != 0) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
bool subtest_ok = true;
|
|
|
|
|
|
|
|
struct ggml_tensor * data_batch = cd.data_batch[ndata_batch-1];
|
|
|
|
struct ggml_tensor * labels_batch = cd.labels_batch[ndata_batch-1];
|
|
|
|
|
|
|
|
std::vector<float> data(ggml_nelements( data_batch));
|
|
|
|
std::vector<float> labels(ggml_nelements(labels_batch));
|
|
|
|
|
|
|
|
std::vector<int64_t> idata_shuffled;
|
|
|
|
const int64_t nbatches = ndata / ndata_batch;
|
|
|
|
for (int64_t ibatch = 0; ibatch < nbatches; ++ibatch) {
|
|
|
|
ggml_opt_dataset_get_batch(dataset, data_batch, labels_batch, ibatch);
|
|
|
|
|
|
|
|
ggml_backend_tensor_get( data_batch, data.data(), 0, ggml_nbytes( data_batch));
|
|
|
|
ggml_backend_tensor_get(labels_batch, labels.data(), 0, ggml_nbytes(labels_batch));
|
|
|
|
|
|
|
|
for (int64_t idata_batch = 0; idata_batch < ndata_batch; ++idata_batch) {
|
|
|
|
const int64_t idata = ibatch*ndata_batch + idata_batch;
|
|
|
|
const int64_t idata_found = data[idata_batch*ne_datapoint] / 16;
|
|
|
|
subtest_ok = subtest_ok && (shuffle || idata_found == idata);
|
|
|
|
idata_shuffled.push_back(idata_found);
|
|
|
|
|
|
|
|
for (int64_t id = 0; id < ne_datapoint; ++id) {
|
|
|
|
if (data[ idata_batch*ne_datapoint + id] != 16*idata_found + id) {
|
|
|
|
subtest_ok = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (int64_t il = 0; il < ne_label; ++il) {
|
|
|
|
if (labels[idata_batch*ne_label + il] != 16*(16*idata_found + il)) {
|
|
|
|
subtest_ok = false;
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
|
|
|
|
if (!shuffle || ndata % ndata_batch == 0) {
|
|
|
|
const int ndata_max = (ndata / ndata_batch) * ndata_batch;
|
|
|
|
|
|
|
|
for (int64_t idata = 0; subtest_ok && idata < ndata_max; ++idata) {
|
|
|
|
int ninstances = 0;
|
|
|
|
for (int64_t id : idata_shuffled) {
|
|
|
|
ninstances += id == idata;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
if (ninstances != 1) {
|
|
|
|
subtest_ok = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
printf(" %s(shuffle=%s, ndata_shard=%" PRId64 ", ndata_batch=%" PRId64 "): ",
|
|
|
|
__func__, shuffle ? "yes" : "no", ndata_shard, ndata_batch);
|
|
|
|
if (subtest_ok) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
npass++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
ntest++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_grad(ggml_backend_sched_t backend_sched, ggml_backend_t backend) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend, /*init_opt_ctx =*/ true, /*optimizer_defaults =*/ false,
|
|
|
|
/*nbatch_logical =*/ 999999, /*nbatch_physical =*/ 1);
|
|
|
|
|
|
|
|
std::vector<float> grad_history(ndata);
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
grad_history[idata] = NAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int idata = 0; idata < ndata; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, ggml_nbytes(cd.inputs));
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, cd.result);
|
|
|
|
ggml_backend_tensor_get(ggml_opt_grad_acc(cd.opt_ctx, cd.weights), grad_history.data() + idata, 0, sizeof(float));
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
bool subtest_ok = true;
|
|
|
|
for (int idata = 0; idata < ndata; ++idata) {
|
|
|
|
if (grad_history[idata] != idata + 1) {
|
|
|
|
subtest_ok = false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
printf(" %s(): ", __func__);
|
|
|
|
if (subtest_ok) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
npass++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
ntest++;
|
|
|
|
}
|
|
|
|
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void helper_after_test_forward_backward(
|
|
|
|
const char * func, const bool high_level, const bool shuffle,
|
|
|
|
const std::string subtest, const bool subtest_ok, int & ntest, int & npass) {
|
|
|
|
std::string options = ", shuffle=";
|
|
|
|
options += shuffle ? "yes" : "no";
|
|
|
|
helper_after_test(func, high_level, options, subtest, subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_forward_backward(
|
|
|
|
ggml_backend_sched_t backend_sched, ggml_backend_t backend, const bool high_level, const bool shuffle) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend, /*init_opt_ctx =*/ true, /*optimizer_defaults =*/ false);
|
|
|
|
struct ggml_tensor * loss = ggml_opt_loss(cd.opt_ctx);
|
|
|
|
|
|
|
|
std::vector<float> loss_history(ndata);
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
loss_history[idata] = NAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
int64_t ndata;
|
|
|
|
ggml_opt_result_ndata(cd.result, &ndata);
|
|
|
|
double loss;
|
|
|
|
double loss_unc;
|
|
|
|
ggml_opt_result_loss(cd.result, &loss, &loss_unc);
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result, &accuracy, &accuracy_unc);
|
|
|
|
const bool subtest_ok = ndata == 0 && loss == 0.0 && std::isnan(loss_unc) && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
helper_after_test_forward_backward(__func__, high_level, shuffle, "results_initial", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (high_level) {
|
|
|
|
ggml_opt_dataset_t dataset = cd.dataset_unsupervised;
|
|
|
|
if (shuffle) {
|
|
|
|
ggml_opt_dataset_shuffle(cd.opt_ctx, dataset, -1);
|
|
|
|
}
|
|
|
|
ggml_opt_epoch(cd.opt_ctx, dataset, nullptr, cd.result, 0, nullptr, nullptr);
|
|
|
|
} else {
|
|
|
|
for (int idata = 0; idata < ndata; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, ggml_nbytes(cd.inputs));
|
|
|
|
ggml_opt_forward(cd.opt_ctx, cd.result);
|
|
|
|
ggml_backend_tensor_get(loss, loss_history.data() + idata, 0, sizeof(float));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
float weights;
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
|
|
|
|
const bool subtest_ok = weights == ndata/2;
|
|
|
|
helper_after_test_forward_backward(__func__, high_level, shuffle, "weights_after_forward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
int64_t ndata;
|
|
|
|
ggml_opt_result_ndata(cd.result, &ndata);
|
|
|
|
bool subtest_ok = ndata == 6;
|
|
|
|
|
|
|
|
double loss;
|
|
|
|
double loss_unc;
|
|
|
|
ggml_opt_result_loss(cd.result, &loss, &loss_unc);
|
|
|
|
subtest_ok = subtest_ok && loss == 33.0 && almost_equal(loss_unc, sqrt(3.5), 1e-10);
|
|
|
|
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result, &accuracy, &accuracy_unc);
|
|
|
|
subtest_ok = subtest_ok && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
|
|
|
|
helper_after_test_forward_backward(__func__, high_level, shuffle, "results_after_forward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
float w0;
|
|
|
|
ggml_backend_tensor_get(cd.weights, &w0, 0, sizeof(float));
|
|
|
|
for (int i = 0; i < 10; ++i) {
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, nullptr);
|
|
|
|
}
|
|
|
|
ggml_backend_tensor_set(cd.weights, &w0, 0, sizeof(float));
|
|
|
|
|
|
|
|
ggml_opt_reset(cd.opt_ctx, /*optimizer =*/ false);
|
|
|
|
ggml_opt_result_reset(cd.result);
|
|
|
|
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
loss_history[idata] = NAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (high_level) {
|
|
|
|
ggml_opt_dataset_t dataset = cd.dataset_unsupervised;
|
|
|
|
if (shuffle) {
|
|
|
|
ggml_opt_dataset_shuffle(cd.opt_ctx, dataset, -1);
|
|
|
|
}
|
|
|
|
ggml_opt_epoch(cd.opt_ctx, dataset, cd.result, nullptr, ndata, nullptr, nullptr);
|
|
|
|
} else {
|
|
|
|
for (int idata = 0; idata < ndata; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, ggml_nbytes(cd.inputs));
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, cd.result);
|
|
|
|
ggml_backend_tensor_get(loss, loss_history.data() + idata, 0, sizeof(float));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
float weights;
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
|
|
|
|
const bool subtest_ok = weights == -ndata/2;
|
|
|
|
helper_after_test_forward_backward(__func__, high_level, shuffle, "weights_after_forward_backward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
int64_t ndata;
|
|
|
|
ggml_opt_result_ndata(cd.result, &ndata);
|
|
|
|
bool subtest_ok = ndata == 6;
|
|
|
|
|
|
|
|
double loss;
|
|
|
|
double loss_unc;
|
|
|
|
ggml_opt_result_loss(cd.result, &loss, &loss_unc);
|
|
|
|
subtest_ok = subtest_ok && loss == 18.0 && (shuffle || loss_unc == 0.0);
|
|
|
|
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result, &accuracy, &accuracy_unc);
|
|
|
|
subtest_ok = subtest_ok && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
|
|
|
|
helper_after_test_forward_backward(__func__, high_level, shuffle, "result_after_forward_backward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_epoch_vs_fit(ggml_backend_sched_t backend_sched, ggml_backend_t backend) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
float weights_epoch;
|
|
|
|
float weights_fit;
|
|
|
|
|
|
|
|
{
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend, /*init_opt_ctx =*/ true);
|
|
|
|
ggml_opt_dataset_t dataset = cd.dataset_unsupervised;
|
|
|
|
|
|
|
|
ggml_opt_dataset_shuffle(cd.opt_ctx, dataset, -1);
|
|
|
|
ggml_opt_epoch(cd.opt_ctx, dataset, cd.result, nullptr, ndata, nullptr, nullptr);
|
|
|
|
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights_epoch, 0, ggml_nbytes(cd.weights));
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend, /*init_opt_ctx =*/ false);
|
|
|
|
ggml_opt_dataset_t dataset = cd.dataset_unsupervised;
|
|
|
|
|
|
|
|
ggml_opt_fit(backend_sched, cd.ctx_compute, cd.inputs, cd.outputs, dataset,
|
|
|
|
GGML_OPT_LOSS_TYPE_SUM, ggml_opt_get_default_optimizer_params, 1, 1, 0.0f, true);
|
|
|
|
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights_fit, 0, ggml_nbytes(cd.weights));
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
}
|
|
|
|
|
|
|
|
const bool subtest_ok = weights_epoch == weights_fit;
|
|
|
|
|
|
|
|
printf(" %s(): ", __func__);
|
|
|
|
if (subtest_ok) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
npass++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
ntest++;
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void helper_after_test_idata_split(
|
|
|
|
const char * func, const bool high_level, const int epoch,
|
|
|
|
const std::string subtest, const bool subtest_ok, int & ntest, int & npass) {
|
|
|
|
std::string options = ", epoch=";
|
|
|
|
options += std::to_string(epoch);
|
|
|
|
helper_after_test(func, high_level, options, subtest, subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_idata_split(ggml_backend_sched_t backend_sched, ggml_backend_t backend, const bool high_level) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(backend_sched, backend, /*init_opt_ctx =*/ true, /*optimizer_defaults =*/ false);
|
|
|
|
struct ggml_tensor * loss = ggml_opt_loss(cd.opt_ctx);
|
|
|
|
const int idata_split = ndata * 2/3;
|
|
|
|
|
|
|
|
std::vector<float> loss_history(ndata);
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
loss_history[idata] = NAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int epoch = 1; epoch <= 4; ++epoch) {
|
|
|
|
if (high_level) {
|
|
|
|
ggml_opt_epoch(cd.opt_ctx, cd.dataset_unsupervised, cd.result, cd.result2, idata_split, nullptr, nullptr);
|
|
|
|
} else {
|
|
|
|
int idata = 0;
|
|
|
|
for (; idata < idata_split; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, ggml_nbytes(cd.inputs));
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, cd.result);
|
|
|
|
ggml_backend_tensor_get(loss, loss_history.data() + idata, 0, sizeof(float));
|
|
|
|
}
|
|
|
|
for (; idata < ndata; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, ggml_nbytes(cd.inputs));
|
|
|
|
ggml_opt_forward(cd.opt_ctx, cd.result2);
|
|
|
|
ggml_backend_tensor_get(loss, loss_history.data() + idata, 0, sizeof(float));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
float weights;
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
|
|
|
|
const bool subtest_ok = weights == ndata/2 - epoch*idata_split;
|
|
|
|
helper_after_test_idata_split(__func__, high_level, epoch, "weights", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
int64_t ndata_result;
|
|
|
|
ggml_opt_result_ndata(cd.result, &ndata_result);
|
|
|
|
bool subtest_ok = ndata_result == idata_split;
|
|
|
|
|
|
|
|
double loss;
|
|
|
|
double loss_unc;
|
|
|
|
ggml_opt_result_loss(cd.result, &loss, &loss_unc);
|
|
|
|
subtest_ok = subtest_ok && loss == 28.0 - epoch*16.0 && loss_unc == 0.0;
|
|
|
|
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result, &accuracy, &accuracy_unc);
|
|
|
|
subtest_ok = subtest_ok && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
|
|
|
|
helper_after_test_idata_split(__func__, high_level, epoch, "results_backward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
int64_t ndata_result;
|
|
|
|
ggml_opt_result_ndata(cd.result2, &ndata_result);
|
|
|
|
bool subtest_ok = ndata_result == ndata - idata_split;
|
|
|
|
|
|
|
|
double loss;
|
|
|
|
double loss_unc;
|
|
|
|
ggml_opt_result_loss(cd.result2, &loss, &loss_unc);
|
|
|
|
subtest_ok = subtest_ok && loss == 15.0 - epoch*8 && almost_equal(loss_unc, sqrt(0.5), 1e-10);
|
|
|
|
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result2, &accuracy, &accuracy_unc);
|
|
|
|
subtest_ok = subtest_ok && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
|
|
|
|
helper_after_test_idata_split(__func__, high_level, epoch, "results_forward", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
ggml_opt_result_reset(cd.result);
|
|
|
|
ggml_opt_result_reset(cd.result2);
|
|
|
|
}
|
|
|
|
|
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void helper_after_test_gradient_accumulation(
|
|
|
|
const char * func, const int nbatch_physical, const enum ggml_opt_loss_type loss_type, const int epoch,
|
|
|
|
const std::string subtest, const bool subtest_ok, int & ntest, int & npass) {
|
|
|
|
std::string options = ", nbatch_physical=";
|
|
|
|
options += std::to_string(nbatch_physical);
|
|
|
|
options += ", loss_type=";
|
|
|
|
options += loss_type == GGML_OPT_LOSS_TYPE_MEAN ? "mean" : "sum";
|
|
|
|
options += ", epoch=";
|
|
|
|
options += std::to_string(epoch);
|
|
|
|
helper_after_test(func, false, options, subtest, subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
static std::pair<int, int> test_gradient_accumulation(
|
|
|
|
ggml_backend_sched_t backend_sched, ggml_backend_t backend, const int32_t nbatch_physical, const enum ggml_opt_loss_type loss_type) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
|
|
|
|
|
|
|
struct helper_ctx_data cd = helper_get_ctx_data(
|
|
|
|
backend_sched, backend, /*init_opt_ctx =*/ true, /*optimizer_defaults =*/ false, /*nbatch_logical =*/ 6, nbatch_physical, loss_type);
|
|
|
|
struct ggml_tensor * loss = ggml_opt_loss(cd.opt_ctx);
|
|
|
|
|
|
|
|
std::vector<float> grad_history(ndata);
|
|
|
|
for (int64_t idata = 0; idata < ndata; ++idata) {
|
|
|
|
grad_history[idata] = NAN;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int epoch = 1; epoch <= 4; ++epoch) {
|
|
|
|
if (nbatch_physical == 1) {
|
|
|
|
for (int idata = 0; idata < ndata; ++idata) {
|
|
|
|
const float idataf = idata;
|
|
|
|
ggml_backend_tensor_set(cd.inputs, &idataf, 0, 1*sizeof(float));
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, cd.result);
|
|
|
|
ggml_backend_tensor_get(ggml_opt_grad_acc(cd.opt_ctx, cd.weights), grad_history.data() + idata, 0, 1*sizeof(float));
|
|
|
|
}
|
|
|
|
} else if (nbatch_physical == 2) {
|
|
|
|
for (int idata = 0; idata < ndata; idata += 2) {
|
|
|
|
const float idataf[2] = {float(idata + 0), float(idata + 1)};
|
|
|
|
ggml_backend_tensor_set(cd.inputs, idataf, 0, 2*sizeof(float));
|
|
|
|
ggml_opt_forward_backward(cd.opt_ctx, cd.result);
|
|
|
|
|
|
|
|
grad_history[idata + 0] = 0.0f;
|
|
|
|
ggml_backend_tensor_get(ggml_opt_grad_acc(cd.opt_ctx, cd.weights), grad_history.data() + idata + 1, 0, 1*sizeof(float));
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
GGML_ASSERT(false);
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
GGML_ASSERT(ndata == 6);
|
|
|
|
constexpr double atol = 1e-6;
|
|
|
|
bool subtest_ok = true;
|
|
|
|
if (loss_type == GGML_OPT_LOSS_TYPE_SUM) {
|
|
|
|
if (nbatch_physical == 1) {
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[0], 1.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[2], 3.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[4], 5.0, atol);
|
|
|
|
} else {
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[0], 0.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[2], 0.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[4], 0.0, atol);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[1], 2.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[3], 4.0, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[5], 0.0, atol);
|
|
|
|
} else if (loss_type == GGML_OPT_LOSS_TYPE_MEAN) {
|
|
|
|
if (nbatch_physical == 1) {
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[0], 1.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[2], 3.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[4], 5.0/ndata, atol);
|
|
|
|
} else {
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[0], 0.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[2], 0.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[4], 0.0/ndata, atol);
|
|
|
|
}
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[1], 2.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[3], 4.0/ndata, atol);
|
|
|
|
subtest_ok = subtest_ok && almost_equal(grad_history[5], 0.0/ndata, atol);
|
|
|
|
} else {
|
|
|
|
GGML_ASSERT(false);
|
|
|
|
}
|
|
|
|
helper_after_test_gradient_accumulation(__func__, nbatch_physical, loss_type, epoch, "grads", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
float weights;
|
|
|
|
ggml_backend_tensor_get(cd.weights, &weights, 0, sizeof(float));
|
|
|
|
const bool subtest_ok = weights == (ndata/2) - epoch;
|
|
|
|
helper_after_test_gradient_accumulation(__func__, nbatch_physical, loss_type, epoch, "weights", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
int64_t ndata_result;
|
|
|
|
ggml_opt_result_ndata(cd.result, &ndata_result);
|
|
|
|
bool subtest_ok = ndata_result == ndata/nbatch_physical;
|
|
|
|
|
|
|
|
double loss;
|
|
|
|
ggml_opt_result_loss(cd.result, &loss, /*loss_unc =*/ nullptr);
|
|
|
|
if (loss_type == GGML_OPT_LOSS_TYPE_SUM) {
|
|
|
|
subtest_ok = subtest_ok && loss == (39.0 - epoch*6.0);
|
|
|
|
} else if (loss_type == GGML_OPT_LOSS_TYPE_MEAN) {
|
|
|
|
subtest_ok = subtest_ok && almost_equal(loss, (39.0 - epoch*6.0) / ndata, 1e-6);
|
|
|
|
} else {
|
|
|
|
GGML_ASSERT(false);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
2024-11-16 21:17:59 +01:00
|
|
|
|
|
|
|
double accuracy;
|
|
|
|
double accuracy_unc;
|
|
|
|
ggml_opt_result_accuracy(cd.result, &accuracy, &accuracy_unc);
|
|
|
|
subtest_ok = subtest_ok && std::isnan(accuracy) && std::isnan(accuracy_unc);
|
|
|
|
|
|
|
|
helper_after_test_gradient_accumulation(__func__, nbatch_physical, loss_type, epoch, "results", subtest_ok, ntest, npass);
|
|
|
|
}
|
|
|
|
|
|
|
|
ggml_opt_result_reset(cd.result);
|
2023-09-28 23:41:44 +02:00
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
helper_free_ctx_data(cd);
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
|
|
|
|
|
|
|
static ggml_opt_optimizer_params helper_get_regression_opt_pars(void * userdata) {
|
|
|
|
ggml_opt_optimizer_params result = ggml_opt_get_default_optimizer_params(userdata);
|
|
|
|
result.adamw.alpha = 0.1f;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
static std::pair<int, int> test_regression(ggml_backend_sched_t backend_sched, ggml_backend_t backend) {
|
|
|
|
int ntest = 0;
|
|
|
|
int npass = 0;
|
2023-08-02 10:06:19 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
// Test for simple regression with f(x) = a*x + b
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
constexpr int64_t ndata_regression = 201;
|
|
|
|
constexpr float a_true = 1.2f;
|
|
|
|
constexpr float b_true = 3.4f;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
std::mt19937 gen(12345);
|
|
|
|
std::normal_distribution<float> nd{0.0f, 0.1f};
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
ggml_opt_dataset_t dataset = ggml_opt_dataset_init(1, 1, ndata_regression, ndata_regression);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
float * data = ggml_get_data_f32(ggml_opt_dataset_data( dataset));
|
|
|
|
float * labels = ggml_get_data_f32(ggml_opt_dataset_labels(dataset));
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
constexpr float x_min = -100.0f;
|
|
|
|
constexpr float x_max = 100.0f;
|
2023-07-07 18:24:01 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
for (int64_t idata = 0; idata < ndata_regression; ++idata) {
|
|
|
|
const float x = x_min + (x_max - x_min) * idata/(ndata_regression-1);
|
|
|
|
const float y = a_true*x + b_true + nd(gen);
|
2023-07-07 18:24:01 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
data[idata] = x;
|
|
|
|
labels[idata] = y;
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
struct ggml_context * ctx_static;
|
|
|
|
struct ggml_context * ctx_compute;
|
|
|
|
{
|
|
|
|
struct ggml_init_params params = {
|
|
|
|
/*.mem_size =*/ 3*ggml_tensor_overhead(),
|
|
|
|
/*.mem_buffer =*/ nullptr,
|
|
|
|
/*.no_alloc =*/ true,
|
|
|
|
};
|
|
|
|
ctx_static = ggml_init(params);
|
|
|
|
}
|
|
|
|
{
|
|
|
|
struct ggml_init_params params = {
|
|
|
|
/*.mem_size =*/ GGML_DEFAULT_GRAPH_SIZE*ggml_tensor_overhead() + 3*ggml_graph_overhead(),
|
|
|
|
/*.mem_buffer =*/ nullptr,
|
|
|
|
/*.no_alloc =*/ true,
|
|
|
|
};
|
|
|
|
ctx_compute = ggml_init(params);
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
// The first dimension is the dimension of the datapoints, the second dimension is the number of datapoints.
|
|
|
|
struct ggml_tensor * x = ggml_new_tensor_2d(ctx_static, GGML_TYPE_F32, 1, ndata_regression);
|
|
|
|
ggml_set_name(x, "x");
|
|
|
|
|
|
|
|
struct ggml_tensor * a = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, 1);
|
|
|
|
ggml_set_name(a, "a");
|
|
|
|
ggml_set_param(ctx_static, a);
|
|
|
|
|
|
|
|
struct ggml_tensor * b = ggml_new_tensor_1d(ctx_static, GGML_TYPE_F32, 1);
|
|
|
|
ggml_set_name(b, "b");
|
|
|
|
ggml_set_param(ctx_static, b);
|
|
|
|
|
|
|
|
struct ggml_tensor * f = ggml_add(ctx_compute, ggml_mul(ctx_compute, x, a), b);
|
|
|
|
ggml_set_name(f, "f");
|
|
|
|
ggml_set_param(ctx_static, f);
|
|
|
|
|
|
|
|
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_static, backend);
|
|
|
|
const float a0 = 1.0f;
|
|
|
|
const float b0 = 3.0f;
|
|
|
|
ggml_backend_tensor_set(a, &a0, 0, sizeof(float));
|
|
|
|
ggml_backend_tensor_set(b, &b0, 0, sizeof(float));
|
|
|
|
|
|
|
|
ggml_opt_fit(backend_sched, ctx_compute, x, f, dataset, GGML_OPT_LOSS_TYPE_MEAN_SQUARED_ERROR,
|
|
|
|
helper_get_regression_opt_pars, 100, ndata_regression, 0.0f, true);
|
|
|
|
|
|
|
|
{
|
|
|
|
float a_fit;
|
|
|
|
ggml_backend_tensor_get(a, &a_fit, 0, sizeof(float));
|
|
|
|
float b_fit;
|
|
|
|
ggml_backend_tensor_get(b, &b_fit, 0, sizeof(float));
|
|
|
|
const bool subtest_ok = almost_equal(a_fit, a_true, 1e-2) && almost_equal(b_fit, b_true, 1e-2);
|
|
|
|
printf(" %s(subtest=weights): ", __func__);
|
|
|
|
if (subtest_ok) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
npass++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
ntest++;
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
ggml_backend_buffer_free(buf);
|
|
|
|
ggml_free(ctx_static);
|
|
|
|
ggml_opt_dataset_free(dataset);
|
2023-07-07 18:24:01 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
return std::make_pair(npass, ntest);
|
|
|
|
}
|
2023-07-07 18:24:01 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
static std::pair<int, int> test_backend(ggml_backend_sched_t backend_sched, ggml_backend_t backend) {
|
|
|
|
int npass = 0;
|
|
|
|
int ntest = 0;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
for (bool shuffle : {false, true}) {
|
|
|
|
std::pair<int, int> partial = test_dataset(backend_sched, backend, shuffle);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
{
|
|
|
|
std::pair<int, int> partial = test_grad(backend_sched, backend);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
for (bool high_level : {false, true}){
|
|
|
|
for (bool shuffle : {false, true}) {
|
|
|
|
if (!high_level && shuffle) {
|
|
|
|
continue;
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
std::pair<int, int> partial = test_forward_backward(backend_sched, backend, high_level, shuffle);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
{
|
|
|
|
std::pair<int, int> partial = test_epoch_vs_fit(backend_sched, backend);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
for (bool high_level : {false, true}){
|
|
|
|
std::pair<int, int> partial = test_idata_split(backend_sched, backend, high_level);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
for (int32_t nbatch_physical : {2, 1}) {
|
|
|
|
for (enum ggml_opt_loss_type loss_type : {GGML_OPT_LOSS_TYPE_SUM, GGML_OPT_LOSS_TYPE_MEAN}) {
|
|
|
|
std::pair<int, int> partial = test_gradient_accumulation(backend_sched, backend, nbatch_physical, loss_type);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
{
|
|
|
|
std::pair<int, int> partial = test_regression(backend_sched, backend);
|
|
|
|
npass += partial.first;
|
|
|
|
ntest += partial.second;
|
|
|
|
}
|
|
|
|
|
|
|
|
return std::make_pair(npass, ntest);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
}
|
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
int main(void) {
|
|
|
|
const size_t dev_count = ggml_backend_dev_count();
|
|
|
|
printf("Testing %zu devices\n\n", dev_count);
|
|
|
|
size_t n_ok = 0;
|
|
|
|
|
|
|
|
std::vector<ggml_backend_dev_t> devs;
|
|
|
|
std::vector<ggml_backend_t> backends;
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
for (size_t i = 0; i < dev_count; ++i) {
|
|
|
|
devs.push_back(ggml_backend_dev_get(i));
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
ggml_backend_t backend = ggml_backend_dev_init(devs[i], NULL);
|
|
|
|
GGML_ASSERT(backend != NULL);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
if (ggml_backend_is_cpu(backend)) {
|
|
|
|
ggml_backend_cpu_set_n_threads(backend, std::thread::hardware_concurrency() / 2);
|
|
|
|
}
|
|
|
|
|
|
|
|
backends.push_back(backend);
|
|
|
|
}
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
for (size_t i = 0; i < dev_count; ++i) {
|
|
|
|
// Put the backend to be tested in front so that it's prioritized:
|
|
|
|
std::vector<ggml_backend_t> backends_modded = {backends[i]};
|
|
|
|
backends_modded.insert(backends_modded.end(), backends.begin(), backends.end());
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
ggml_backend_sched_t backend_sched = ggml_backend_sched_new(
|
|
|
|
backends_modded.data(), nullptr, backends_modded.size(), GGML_DEFAULT_GRAPH_SIZE, false);
|
ggml : implement backward pass for llama + small training-llama-from-scratch example (#1360)
* implement 8 of 14 missing backward pass operations used by llama
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW
implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.
this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.
still missing backward passes for llama:
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
* implement 5 of 6 missing backward pass operations used by llama
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK
GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.
GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...
GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.
Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.
still not completely implemented backward passes for llama:
- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
* norm & rms_norm can not be threaded:
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
* remove already resolved TODO
* implement backward pass of ggml_rope and ggml_rope_back
* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back
* add test-grad0.c
* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console
* test both gradients of mul_mat
* disable graph dot export as it floods console
* bug fixes for silu_back
* successfully test silu backward
* bug fix for scale backward pass
use sum instead of mean for gradient of scalar scale parameter
* successfully test scale backward
* improve performance of sum backward pass
use add1(x,y) instead of add(x,repeat(y,x))
* improve performance of sqr backward pass
use scale(x,y) instead of mul(x,repeat(y,x))
* successfully test rope backward
* bug fix for cpy backward pass
* successfully test cpy backward
* bug fix for reshape backward pass
* successfully test reshape backward
* add test-opt.c
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
* correctly implement softmax backward pass using new operation ggml_diag
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
* successfully test soft_max backward
* align shape annotations
* add shape annotations for llama
* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.
with this we can duplicate tensor of any typ as long as they are contiguous.
* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
* bug fix for add_at forward
required for view backward pass
src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
* successfully test view backward
* minor code format improvement
* fix ggml_forward_add functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.
* fix ggml_forward_add1 functions to work correctly with transposed tensors
uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.
* test-grad0.c : add print_elements to help with debugging
* successfully test permute backward
* some minor test-grad0 fixes
* fix sub, mul and div functions to work correctly with transposed tensors
uses the same logic as in add
* implement ggml_cont backward pass
* successfully test transpose backward and permute for all permutations
also test sub, mul and div up to max n_dims
* test-grad0.c add TODO for view_2d and view_3d
add_at (required for view backward pass) is a bit tricky for n_dims > 1.
* fix comments
* successfully test diag_mask_inf and diag_mask_zero backward
* test-grad0 : fix test for div
nargs and ndims was swapped, corrupting the stack
* fix diag_mask to work with non-inplace input
* move dup call into the actual add_at functions
* fix get rows backward pass
* successfully test get_rows backward
* fix view backward pass
add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.
* successfully test backward pass of view_1d, view_2d and view_3d
* fix backward pass for rms_norm
I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.
* successfully test backward pass of rms_norm
some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:
rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324
it is due to the test logic in check_gradients that they fail.
* add todos for llama backward pass
- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.
* add operation ggml_sum_rows
ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]
* add missing GGML_OP_SUM_ROWS
* fix backward pass for repeat
requires ggml_sum_rows
* successfully test backward pass of repeat
* update quantization types in switch-case of add_at and add1
* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.
had to increase maximum number of optimization parameters to train from scratch.
* fix softmax in baby-llama example
* switching from training with adam to lbfgs produces much better results in the baby-llama example
* train with two examples, creating new tensors each time..
* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt
when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt
* train on multiple examples, generate & print tokens with trained model afterwards
ctx0 for evaluation and optimization is renewed for each sample
* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d
* fix soft_max backward pass for input->ne[1] != 1
* add ggml_log operation necessary for cross entropy loss
* add test for ggml_log gradients
* implement backward pass for ggml_sum_rows, necessary for cross entropy loss
* implement ggml_repeat support for rank > 2 tensors
* add test for ggml_sum_rows gradients
* fix training get_example_targets
predict the next token, not the current token!
* add square_error_loss and cross_entropy_loss functions
* optimize loss over multiple samples
this increases computation graph, need parallel batched forward for more efficiency.
* fix backward pass for add_at and change arguments to have same order as in view
* add ggml_set(ctx, a, b) to set b in view of a and return modified a
necessary to set values into kv_self cache and properly propagate the gradients
* fix kv_self gradients for training
use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients
* replace inplace operations for training with copying operations to allow gradient propagation
* add GGML_ASSERT to catch ggml_rope and back value errors
* add trainable lora-only model with all big matrices C split into A,B with A*B=C
this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.
training this instead of the normal model resulted in much worse results though...
* vastly improve training results
instead of logit targets 0 and 1 use -1 and +1.
* shorten code using a variable
* change name of GGML_OP_ADD_AT to GGML_OP_ACC
* smaller default values for baby llama model parameters
* update static assert of GGML_OP_COUNT
* remove shape annotations in llama_eval_internal
* revert disabling of threading for rms_norm and norm
* rename print functions in baby-llama example
* fix call to ggml_set_name
* add missing include for strcmp, etc
* remove trailing whitespace
* reduce number of test-grad0 iterations
avoid exceeding timeout of automated tests
* remove busy loop that was used as sleep for slower sinus wave generation
* disable slow tests grad0 and opt to avoid exceeding timeouts
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* c++ in baby-llama example
use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros
* ggml : fix compiler warnings + cosmetic changes
* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* swap arguments to vDSP_vdiv call
documentation for vDSP_vdiv states: "Note that B comes before A!"
* ggml : swap vDSP_vsub args as per documentation
* add parallel batched forward function for baby-llama training
* cleanup code for batched training
* remove trailing whitespace
* minor : fix compiler warnings + indentation style
* ggml : fix null ptr deref in backward pass
* ggml : remove Q4_2 remnants
* ggml : fix clang-tidy warnings
* baby-llama : couple of clang-tidy warnings
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 14:56:40 +02:00
|
|
|
|
2024-11-16 21:17:59 +01:00
|
|
|
printf("Backend %zu/%zu: %s\n", i + 1, dev_count, ggml_backend_dev_name(devs[i]));
|
|
|
|
printf(" Device description: %s\n", ggml_backend_dev_description(devs[i]));
|
|
|
|
size_t free, total; // NOLINT
|
|
|
|
ggml_backend_dev_memory(devs[i], &free, &total);
|
|
|
|
printf(" Device memory: %zu MB (%zu MB free)\n", total / 1024 / 1024, free / 1024 / 1024);
|
|
|
|
printf("\n");
|
|
|
|
|
|
|
|
std::pair<int, int> result = test_backend(backend_sched, backends[i]);
|
|
|
|
|
|
|
|
printf(" %d/%d tests passed\n", result.first, result.second);
|
|
|
|
printf(" Backend %s: ", ggml_backend_name(backends[i]));
|
|
|
|
if (result.first == result.second) {
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
n_ok++;
|
|
|
|
} else {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
}
|
|
|
|
|
|
|
|
printf("\n");
|
|
|
|
|
|
|
|
ggml_backend_sched_free(backend_sched);
|
|
|
|
}
|
|
|
|
|
|
|
|
for (ggml_backend_t backend : backends) {
|
|
|
|
ggml_backend_free(backend);
|
|
|
|
}
|
|
|
|
|
|
|
|
printf("%zu/%zu backends passed\n", n_ok, dev_count);
|
|
|
|
if (n_ok != dev_count) {
|
|
|
|
printf("\033[1;31mFAIL\033[0m\n");
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
printf("\033[1;32mOK\033[0m\n");
|
|
|
|
return 0;
|
|
|
|
}
|