HuggingFace Model Card Metadata Interoperability Consideration
Brian edited this page 2024-11-21 01:41:30 +11:00

Below is the agreed-upon mapping between GGUF KV keys and Hugging Face model card fields. It follows a discussion with HF to coordinate on extending the handling of base model sources and dataset sources.

| GGUF KV Key | HF Model Card Field | Notes |
| --- | --- | --- |
| `general.name` | `model_name` | Name of the model. |
| `general.license` | `license` | License identifier. |
| `general.license.name` | `license_name` | Full name of the license. |
| `general.license.link` | `license_link` | URL to the license text. |
| `general.base_model.{id}.name` | `base_model` | Simpler field: array of model IDs on HF Hub. |
| `general.base_model.{id}.name` | `base_model_sources[].name` | Extension: detailed description of base models. |
| `general.base_model.{id}.author` | `base_model_sources[].author` | Author of the parent/base model (extension field). |
| `general.base_model.{id}.version` | `base_model_sources[].version` | Version of the parent/base model (extension field). |
| `general.base_model.{id}.organization` | `base_model_sources[].organization` | Organization responsible for the parent/base model (extension field). |
| `general.base_model.{id}.description` | `base_model_sources[].description` | Description of the parent/base model (extension field). |
| `general.base_model.{id}.url` | `base_model_sources[].url` | URL for more information about the parent/base model (extension field). |
| `general.base_model.{id}.doi` | `base_model_sources[].doi` | DOI of the parent/base model (extension field). |
| `general.base_model.{id}.uuid` | `base_model_sources[].uuid` | UUID of the parent/base model (extension field). |
| `general.base_model.{id}.repo_url` | `base_model_sources[].repo_url` | Repository URL of the parent/base model (extension field). |
| `general.dataset.{id}.name` | `datasets` | Simpler field: array of dataset IDs on HF Hub. |
| `general.dataset.{id}.name` | `dataset_sources[].name` | Extension: detailed description of datasets. |
| `general.dataset.{id}.author` | `dataset_sources[].author` | Author of the dataset (extension field). |
| `general.dataset.{id}.version` | `dataset_sources[].version` | Version of the dataset (extension field). |
| `general.dataset.{id}.organization` | `dataset_sources[].organization` | Organization responsible for the dataset (extension field). |
| `general.dataset.{id}.description` | `dataset_sources[].description` | Description of the dataset (extension field). |
| `general.dataset.{id}.url` | `dataset_sources[].url` | URL for more information about the dataset (extension field). |
| `general.dataset.{id}.doi` | `dataset_sources[].doi` | DOI of the dataset (extension field). |
| `general.dataset.{id}.uuid` | `dataset_sources[].uuid` | UUID of the dataset (extension field). |
| `general.dataset.{id}.repo_url` | `dataset_sources[].repo_url` | Repository URL of the dataset (extension field). |
| `general.tags` | `tags` | Tags describing the model. |
| `general.languages` | `language` | Languages supported by the model. |
| `general.description` | Not explicitly mapped for now | Can be included in a custom "description" field in the model card. |
| `general.url` | Not explicitly mapped for now | General URL for further information about the model. |
| `general.repo_url` | Not explicitly mapped for now | Repository URL for the model. |
| `general.doi` | Not explicitly mapped for now | DOI of the model. |
| `general.uuid` | Not explicitly mapped for now | UUID of the model. |
| `general.size_label` | Not explicitly mapped for now | Size class of the model, such as parameter count and number of experts (e.g., 7B, 8x7B). |
| `general.quantized_by` | Not explicitly mapped for now | Indicates who performed quantization. |
| `general.alignment` | Not explicitly mapped for now | Global byte alignment of tensor data within the GGUF file. This is a file-layout setting, not an alignment objective such as RLHF. |
| `general.file_type` | Not explicitly mapped for now | Enumerated value describing the type of the majority of the tensors in the file (e.g., mostly F16), not the container format. |
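
As a rough illustration of how the table above could be applied in code, the following Python sketch converts a flat dictionary of GGUF KV pairs into a model card dictionary. The function name `gguf_kv_to_model_card`, the constant names, and the assumption that `{id}` is a non-negative integer index are all hypothetical choices made for this example. They are not part of the agreed mapping or of any existing library.

```python
import re
from collections import defaultdict

# One-to-one mappings taken from the table above.
SIMPLE_FIELDS = {
    "general.name": "model_name",
    "general.license": "license",
    "general.license.name": "license_name",
    "general.license.link": "license_link",
    "general.tags": "tags",
    "general.languages": "language",
}

# Per-source attributes listed in the table.
SOURCE_FIELDS = {
    "name", "author", "version", "organization", "description",
    "url", "doi", "uuid", "repo_url",
}

# Assumption: {id} is an integer index, e.g. "general.base_model.0.author".
SOURCE_KEY = re.compile(r"^general\.(base_model|dataset)\.(\d+)\.(\w+)$")


def gguf_kv_to_model_card(kv):
    """Build a model card dict from flat GGUF KV pairs (illustrative only)."""
    card = {}
    for key, field in SIMPLE_FIELDS.items():
        if key in kv:
            card[field] = kv[key]

    # Group "general.<kind>.<id>.<attr>" keys by kind and index.
    grouped = {"base_model": defaultdict(dict), "dataset": defaultdict(dict)}
    for key, value in kv.items():
        match = SOURCE_KEY.match(key)
        if match and match.group(3) in SOURCE_FIELDS:
            kind, index, attr = match.groups()
            grouped[kind][int(index)][attr] = value

    for kind, simple, detailed in (
        ("base_model", "base_model", "base_model_sources"),
        ("dataset", "datasets", "dataset_sources"),
    ):
        entries = [grouped[kind][i] for i in sorted(grouped[kind])]
        if entries:
            card[detailed] = entries
            # The table maps ".name" to both the simple and the detailed field.
            names = [entry["name"] for entry in entries if "name" in entry]
            if names:
                card[simple] = names
    return card
```

Keys marked "Not explicitly mapped for now" (such as `general.uuid`) are simply skipped in this sketch. A real converter might instead carry them over as custom fields.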

Below is an example of how the mapping above may appear in a model card:

# Model Card Fields
model_name: Example Model Six
# Licensing details
license: apache-2.0
license_name: Apache License Version 2.0, January 2004
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
# Simple Model (a single Hugging Face model ID or a list of them)
base_model: stabilityai/stable-diffusion-xl-base-1.0
# Detailed Model Parents (Merges, Pre-tuning, etc...) (list of dicts)
base_model_sources:
  - name: GPT-3
    author: OpenAI
    version: '3.0'
    organization: OpenAI
    description: A large language model capable of performing a wide variety of language tasks.
    url: 'https://openai.com/research/gpt-3'
    doi: 10.5555/gpt3doi123456
    uuid: 123e4567-e89b-12d3-a456-426614174000
    repo_url: 'https://github.com/openai/gpt-3'
  - name: BERT
    author: Google AI Language
    version: '1.0'
    organization: Google
    description: A transformer-based model pretrained on English to achieve state-of-the-art performance on a range of NLP tasks.
    url: 'https://github.com/google-research/bert'
    doi: 10.5555/bertdoi789012
    uuid: 987e6543-e21a-43f3-a356-527614173999
    repo_url: 'https://github.com/google-research/bert'
# Simple Dataset (a single Hugging Face dataset ID or a list of them)
datasets: common_voice
# Detailed Model Datasets Used (Training data...) (list of dicts)
dataset_sources:
  - name: Wikipedia Corpus
    author: Wikimedia Foundation
    version: '2021-06'
    organization: Wikimedia
    description: A dataset comprising the full English Wikipedia, used to train models in a range of natural language tasks.
    url: 'https://dumps.wikimedia.org/enwiki/'
    doi: 10.5555/wikidoi234567
    uuid: 234e5678-f90a-12d3-c567-426614172345
    repo_url: 'https://github.com/wikimedia/wikipedia-corpus'
  - name: Common Crawl
    author: Common Crawl Foundation
    version: '2021-04'
    organization: Common Crawl
    description: A dataset containing web-crawled data from various domains, providing a broad range of text.
    url: 'https://commoncrawl.org'
    doi: 10.5555/ccdoi345678
    uuid: 345e6789-f90b-34d5-d678-426614173456
    repo_url: 'https://github.com/commoncrawl/cc-crawl-data'
# Model Content Metadata
tags:
  - text generation
  - transformer
  - llama
  - tiny
  - tiny model
language:
  - en
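
Going the other way, a model card like the example above could be flattened back into GGUF KV keys. The Python sketch below works on a plain dictionary that mirrors the YAML structure, so it needs no YAML parser. The function name `model_card_to_gguf_kv`, the zero-based integer used for `{id}`, and the rule that the detailed `*_sources` list takes precedence over the simpler field are assumptions made for illustration. They are not behavior confirmed by the agreed mapping.

```python
# One-to-one mappings from the table (model card field -> GGUF KV key).
CARD_TO_KV = {
    "model_name": "general.name",
    "license": "general.license",
    "license_name": "general.license.name",
    "license_link": "general.license.link",
    "tags": "general.tags",
    "language": "general.languages",
}

# Per-source attributes listed in the table.
SOURCE_FIELDS = {
    "name", "author", "version", "organization", "description",
    "url", "doi", "uuid", "repo_url",
}


def model_card_to_gguf_kv(card):
    """Flatten a model card dict into GGUF KV pairs (illustrative only)."""
    kv = {}
    for field, key in CARD_TO_KV.items():
        if field in card:
            kv[key] = card[field]

    for prefix, simple, detailed in (
        ("base_model", "base_model", "base_model_sources"),
        ("dataset", "datasets", "dataset_sources"),
    ):
        sources = card.get(detailed)
        if sources is None:
            # Assumption: fall back to the simple field only when the
            # detailed list is absent. It may be a single ID or a list.
            simple_value = card.get(simple, [])
            if isinstance(simple_value, str):
                simple_value = [simple_value]
            sources = [{"name": name} for name in simple_value]

        # Assumption: {id} is the zero-based position in the list.
        for index, source in enumerate(sources):
            for attr, value in source.items():
                if attr in SOURCE_FIELDS:
                    kv[f"general.{prefix}.{index}.{attr}"] = value
    return kv
```

For instance, a card with `datasets: common_voice` and no `dataset_sources` would yield the single key `general.dataset.0.name` under these assumptions.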