MPT support in llama.cpp #3417
mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
quantize warns because it is looking for attn_k and not attn_qkv:
Now fixed as well.
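The quantize warning arose because MPT fuses query, key, and value projections into a single tensor, so a check for a per-block `attn_k` tensor never matches. A minimal sketch of the naming distinction (hypothetical helper, using the GGUF-style `blk.N.attn_*` tensor names; not the actual llama.cpp check):

```python
# Hypothetical sketch: models like MPT carry one fused blk.N.attn_qkv
# tensor per block, while split-head models carry separate attn_q,
# attn_k, and attn_v tensors. A quantizer probing only for attn_k
# would wrongly flag the fused layout as a "strange model".

def uses_fused_qkv(tensor_names):
    """Return True if any block carries a fused attn_qkv tensor."""
    return any(name.endswith(".attn_qkv.weight") for name in tensor_names)

mpt_like = ["blk.0.attn_qkv.weight", "blk.0.attn_output.weight"]
split_like = ["blk.0.attn_q.weight", "blk.0.attn_k.weight", "blk.0.attn_v.weight"]

print(uses_fused_qkv(mpt_like))
print(uses_fused_qkv(split_like))
```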
mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
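The idea of omitting the key rather than writing a 0.0 sentinel can be sketched as follows. This is an illustrative stand-in, not the actual convert script; the `attn_config`/`clip_qkv` hparam layout matches the MPT HF config, but the metadata key names here are assumptions:

```python
# Illustrative sketch: write the clamp_kqv metadata key only when the
# model actually sets clip_qkv. When clip_qkv is null, the key is left
# out entirely instead of storing 0.0 as a "no clamping" sentinel.

def build_metadata(hparams):
    md = {"mpt.context_length": hparams["max_seq_len"]}
    clip_qkv = hparams.get("attn_config", {}).get("clip_qkv")
    if clip_qkv is not None:  # leave the key out when unset
        md["mpt.attention.clamp_kqv"] = float(clip_qkv)
    return md

print(build_metadata({"max_seq_len": 2048, "attn_config": {"clip_qkv": None}}))
print(build_metadata({"max_seq_len": 2048, "attn_config": {"clip_qkv": 8.0}}))
```

The loader then treats a missing key as "no clamping", which avoids ever interpreting a stored 0.0 as a real clamp value.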
mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
Note that this PR does not yet include the modifications to the convert script proposed in #3252 and referred to in #3417 (comment). Since this PR is based on a pre-merge commit of #3252, it may be easier to add this change after the merge.
mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
@cebtenzzre Thanks for the merge. If anyone can give this a quick try and confirm that it works, we should merge.
Works for me. The PR is now almost the same as my own previous private merge attempt. The disable-n_past-assertion changes to ggml_compute_forward_alibi_f16 and ggml_compute_forward_alibi_f32 could be made syntactically more consistent, but AFAICS they are functionally equivalent, so that is not a showstopper for merging into master.
mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
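The change above replaces a magic padding constant with the value the model itself declares. A hedged sketch of the pattern (the `[PADn]` placeholder names and the `pad_vocab` helper are hypothetical, not the actual convert-script code):

```python
# Hypothetical sketch: pad the tokenizer's token list up to the model's
# declared vocab_size from config.json, instead of hardcoding "+178".
# This keeps the GGUF vocab in sync with the embedding matrix even for
# MPT variants whose tokenizer/embedding size gap differs.

def pad_vocab(tokens, hparams):
    vocab_size = hparams["vocab_size"]
    pad = vocab_size - len(tokens)
    assert pad >= 0, "tokenizer has more tokens than the model vocab"
    return tokens + [f"[PAD{i}]" for i in range(pad)]

tokens = [f"tok{i}" for i in range(10)]
print(len(pad_vocab(tokens, {"vocab_size": 12})))
```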
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggml-org#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggml-org#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggml-org#3592)
  ci : check if there is enough VRAM (ggml-org#3596)
  server : add completion mode (no chat) (ggml-org#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggml-org#3588)
  main : fix session loading bug (ggml-org#3400)
  server : add parameter -tb N, --threads-batch N (ggml-org#3584)
  common : fix mirostat state when using multiple sequences (ggml-org#3543)
  batched : add bench tool (ggml-org#3545)
  examples : add batched.swift + improve CI for swift (ggml-org#3562)
  Add MPT model to supported models in README.md (ggml-org#3574)
  Minor improvements in GPT2 tokenizer (ggml-org#3567)
  readme : add bloom (ggml-org#3570)
  llm : add bloom models (ggml-org#3553)
  swift : improvements and fixes (ggml-org#3564)
  llm : add MPT support (ggml-org#3417)
  infill. : fix tokenization (ggml-org#3508)
  ...
Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
(cherry picked from commit f5f9121)
I converted mpt-7b-chat and mpt-7b-storywriter. The conversion and quantization complete successfully and produce the .gguf files; however, the files don't work for me. When running main with them, I get an error. For reference, here is the full output: I have already successfully converted a bunch of falcon models that work fine, but the mpt conversion script does not work for me.
Here is a hexdump of the beginning of the files, in comparison to the OpenBuddy falcon conversion that works fine: What I notice is that after … In contrast, the actual falcon model has a …
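When comparing hexdumps like this, the first thing to verify is the GGUF header: every GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian uint32 format version. A minimal sketch of such a check (a hypothetical diagnostic helper, not part of llama.cpp):

```python
import struct

# Minimal GGUF header sanity check: the file must begin with the magic
# b"GGUF" followed by a little-endian uint32 version number. A broken
# conversion usually shows up right here in a hexdump.

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
    return version
```

If the magic bytes differ between the working falcon conversion and the broken MPT file, the convert script is the culprit rather than main's model loading.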
* CUDA: added support for ggml_clamp (see also: ggml-org/ggml#545)
* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt
* mpt : protect against "clip_qkv": null in mpt-7b
* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models
* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)
* mpt : standardized all tensor names to follow GGUF spec
* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code
* mpt : fixed comment s/gptneox/mpt/
* mpt : remove tabs, trailing whitespace
* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt
* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252
* comment out n_past instead of marking it unused
* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]
* mpt : remove unused tokenizer_json in convert script
* ggml : remove obsolete n_past assert in ggml_alibi
* llama : print clam_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
As per #1333 (comment)
Some comments regarding this initial implementation: