
Stage 5: Tagging and Captioning

CyberMeow edited this page Dec 24, 2023 · 13 revisions

Tag, prune tags, and caption

  • If you start from this stage, please set --src_dir to the training folder with images to tag and caption (this stage can also be used independently as a tagger).
  • In-place operation.
  • After this stage, you can edit tags and characters yourself using suitable tools, especially if you put --save_aux processed_tags characters. More details follow.

Tagging

In this phase, we use a publicly available tagger to tag images.

  • tagging_method: Choose a tagger available in waifuc. Choices are 'deepdanbooru', 'wd14_vit', 'wd14_convnext', 'wd14_convnextv2', 'wd14_swinv2', and 'mldanbooru'. Default is 'wd14_convnextv2'.
    Example usage: --tagging_method mldanbooru
  • overwrite_tags: Re-tag images and overwrite any tags already stored in metadata.
    Example usage: --overwrite_tags
  • tag_threshold: Threshold for tagger. Default is 0.35.
    Example usage: --tag_threshold 0.3

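The effect of --tag_threshold can be sketched as follows. This is an illustrative example only, not the pipeline's actual code; the confidence scores are hypothetical values standing in for raw tagger output:

```python
def filter_tags(scores, threshold=0.35):
    """Keep only tags whose confidence meets the threshold (cf. --tag_threshold)."""
    return {tag: s for tag, s in scores.items() if s >= threshold}

# Hypothetical tagger confidences, for illustration only
scores = {"1girl": 0.98, "long hair": 0.72, "smile": 0.40, "hat": 0.20}
print(filter_tags(scores, threshold=0.35))
# {'1girl': 0.98, 'long hair': 0.72, 'smile': 0.4}
```

Lowering the threshold (e.g. --tag_threshold 0.3) keeps more low-confidence tags; raising it produces shorter, more conservative tag lists.
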
Tag pruning

During fine-tuning, each component present in the images is more readily associated with the text that best represents it. Therefore, to make sure that the trigger words / embeddings are correctly associated with the key characteristics of the concept, we should remove tags that are inherent to the concept. This can be regarded as a trade-off between "convenience" and "flexibility". Without pruning, we can still largely reproduce the concept by including all the relevant tags. However, there is also a risk that these tags get "polluted" and become permanently bound to the target concept.

Overview

The pruning process goes over a few steps. You can deactivate tag pruning by setting --prune_mode none.

  • Prune blacklisted tags. Remove tags in the file --blacklist_tags_file (one tag per line). Use blacklist_tags.txt by default.
  • Prune overlap tags. This includes tags that are substrings of other tags, as well as overlapped tag pairs specified in --overlap_tags_file. Use overlap_tags.json by default.
  • Prune character-related tags. This stage uses --character_tags_file, character_tags.json by default, to prune character-related tags. All the character-related tags below difficulty --drop_difficulty are dropped if --prune_mode is set to character; otherwise for --prune_mode set to character_core (the default value), only "core tags" of the characters that appear in the images are dropped (see Core tags for details).

The remaining tags are saved to the field processed_tags of metadata.
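
The first two pruning steps can be sketched as follows. This is a simplified illustration under the assumption that "overlap" means one tag being a substring of another surviving tag; the actual pipeline also consults overlap_tags.json:

```python
def prune_tags(tags, blacklist):
    """Sketch of the first two pruning steps: blacklist, then substring overlap."""
    # Step 1: drop blacklisted tags (cf. --blacklist_tags_file)
    tags = [t for t in tags if t not in blacklist]
    # Step 2: drop tags that are substrings of another remaining tag,
    # e.g. 'hair' is implied by 'long hair'
    return [t for t in tags
            if not any(t != other and t in other for other in tags)]

print(prune_tags(["long hair", "hair", "smile", "watermark"],
                 blacklist={"watermark"}))
# ['long hair', 'smile']
```
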

Character tag difficulty. We may want to associate a difficulty level to each character tag depending on how hard it is to learn the characteristic. This is done in character_tags.json, where we consider two difficulty levels: 0 for human-related tags and 1 for furry, mecha, demon, etc.

Common arguments

  • prune_mode: This argument defines the strategy for tag pruning. The available options are character, character_core, minimal, and none, with the default being character_core.
    Example usage: --prune_mode character
    Description: Each mode represents a different level of tag pruning:
    • none: No pruning is performed, retaining all tags.
    • minimal: Only the first two steps mentioned in overview are performed.
    • character_core: This mode prunes only the core tags of the relevant characters.
    • character: This mode prunes all character-related tags up to --drop_difficulty.
  • blacklist_tags_file, overlap_tags_file, and character_tags_file: These arguments specify the paths to the files containing different tag filtering information.
    Example usage: --blacklist_tags_file path/to/blacklist_tags.txt
  • process_from_original_tags: When enabled, the tag processing starts from the original tags instead of previously processed tags.
    Example usage: --process_from_original_tags
  • drop_difficulty: Determines the difficulty level below which character tags are dropped. Tags with a difficulty level less than this value are added to the drop list, while tags at or above this level are not dropped. The default setting is 2.
    Example usage: --drop_difficulty 1
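
The interplay between character tag difficulties and --drop_difficulty can be sketched as follows; the difficulty assignments are hypothetical examples in the style of character_tags.json:

```python
def tags_to_drop(char_tags, drop_difficulty=2):
    """Character tags with difficulty strictly below --drop_difficulty are dropped."""
    return [t for t, d in char_tags.items() if d < drop_difficulty]

# Hypothetical difficulty assignments (0: human-related, 1: furry/mecha/demon)
char_tags = {"blue eyes": 0, "twintails": 0, "mecha": 1}
print(tags_to_drop(char_tags, drop_difficulty=1))  # only difficulty-0 tags
# ['blue eyes', 'twintails']
```

With the default of 2, both difficulty levels are dropped; with --drop_difficulty 1, harder-to-learn tags such as 'mecha' stay in the caption so the model can rely on them.
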

Core tags

A tag is considered a "core tag" of a character if it frequently appears in images containing that character. Identifying these core tags is useful for both deciding which tags should be dropped due to their inherent association with the concept, and determining which tags are suitable for initializing character embeddings in pivotal tuning.
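
The frequency-based definition of core tags can be sketched as follows; the per-image tag lists are hypothetical, and the real computation spans all images of a character found under the computed directory level:

```python
from collections import Counter

def core_tags(images_tags, thresh=0.4):
    """Tags appearing in at least `thresh` of a character's images
    are core tags (cf. --core_frequency_thresh)."""
    counts = Counter(tag for tags in images_tags for tag in tags)
    n = len(images_tags)
    return sorted(t for t, c in counts.items() if c / n >= thresh)

# Hypothetical per-image tag lists for a single character
images = [["red hair", "smile"], ["red hair"], ["red hair", "hat"]]
print(core_tags(images, thresh=0.4))
# ['red hair']
```

Here 'red hair' appears in every image and is thus inherent to the character, while 'smile' and 'hat' fall below the 0.4 threshold and remain usable as ordinary caption tags.
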

  • compute_core_tag_up_levels: Specifies the number of directory levels to ascend from the tagged directory for computing core tags. The default is 1, meaning the computation covers all image types.
    Example usage: --compute_core_tag_up_levels 0
  • core_frequency_thresh: Sets the minimum frequency threshold for a tag to be considered a core tag. The default value is 0.4.
    Example usage: --core_frequency_thresh 0.5
  • use_existing_core_tag_file: When enabled, uses the existing core tag file instead of recomputing the core tags.
    Example usage: --use_existing_core_tag_file
  • drop_all_core: If enabled, all core tags are dropped, overriding the --drop_difficulty setting.
    Example usage: --drop_all_core
  • emb_min_difficulty: Sets the minimum difficulty level for tags to be used in embedding initialization. The default is 1.
    Example usage: --emb_min_difficulty 0
  • emb_max_difficulty: Determines the maximum difficulty level for tags used in embedding initialization. The default is 2.
    Example usage: --emb_max_difficulty 1
  • emb_init_all_core: If enabled, all core tags are used for embedding initialization, overriding the --emb_min_difficulty and --emb_max_difficulty settings.
    Example usage: --emb_init_all_core
  • append_dropped_character_tags_wildcard: Append dropped character tags to the wildcard file (wildcard.txt).
    Example usage: --append_dropped_character_tags_wildcard

Note that core tags are always computed, and core_tags.json and wildcard.txt are always saved. However, when --prune_mode is not character_core, they are computed at the end of tag processing. You can also use get_core_tags.py to recompute them.

Further tag processing

This step consists of tag sorting, optional appending of dropped character tags, and refining the tag list to adhere to the maximum number of tags specified by --max_tag_number. Initially, we place certain tags like 'solo', '1girl', '1boy', 'Xgirls', and 'Xboys' at the beginning. Subsequently, the tags are organized based on the sort_mode, which determines their order.

  • sort_mode: Determines the method for sorting tags. Defaults to score.
    Example usage: --sort_mode shuffle
    Description: The available modes offer different approaches to tag ordering:
    • original: Maintains the original sequence of the tags as they appear in the data. This mode preserves the initial tag order without any alterations.
    • shuffle: Randomizes the order of tags. This mode introduces variety by shuffling the tags into a random sequence, differing for each image.
    • score: Sorts tags based on their scores, applicable when using a tagger. In this mode, tags are arranged with those having higher scores placed first, prioritizing the most significant tags.
  • append_dropped_character_tags: Adds previously dropped character tags back into the tag set, placing them at the end of the tag list.
    Example usage: --append_dropped_character_tags
  • max_tag_number: Limits the total number of tags included in each image's caption. The default limit is 30.
    Example usage: --max_tag_number 50
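
The ordering step above can be sketched as follows. This is a simplified illustration: special tags are pinned first, the rest are ordered by sort_mode, and the list is truncated to --max_tag_number:

```python
import random

LEADING = ["solo", "1girl", "1boy"]  # simplified; the pipeline also handles Xgirls/Xboys

def order_tags(tag_scores, sort_mode="score", max_tag_number=30, seed=None):
    """Pin special tags first, order the rest by sort_mode, then truncate."""
    first = [t for t in LEADING if t in tag_scores]
    rest = [t for t in tag_scores if t not in first]
    if sort_mode == "score":
        rest.sort(key=lambda t: tag_scores[t], reverse=True)
    elif sort_mode == "shuffle":
        random.Random(seed).shuffle(rest)
    # sort_mode == "original": keep the incoming order unchanged
    return (first + rest)[:max_tag_number]

tags = {"smile": 0.4, "1girl": 0.98, "long hair": 0.7}
print(order_tags(tags, sort_mode="score"))
# ['1girl', 'long hair', 'smile']
```
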

Captioning

This step uses tags and other fields to produce captions, saved both in the caption field of the metadata and as separate .txt files.

  • caption_ordering: Specifies the order in which different information types appear in the caption. The default order is ['npeople', 'character', 'copyright', 'image_type', 'artist', 'rating', 'crop_info', 'tags'].
    Example usage: --caption_ordering character copyright tags
  • caption_inner_sep, caption_outer_sep, character_sep, character_inner_sep, character_outer_sep: These parameters define the separators used at different levels within the captions, such as between items of a single field or between different fields.
    Example usage: --caption_outer_sep "; "
  • use_[XXX]_prob arguments: These settings control the probability of including specific types of information in the captions, like character info, copyright info, and others.
    Example usage: --use_character_prob 0.8
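
The way --caption_ordering and the outer separator combine can be sketched as follows; the field values are hypothetical, and probability-based inclusion (the use_[XXX]_prob arguments) is omitted for brevity:

```python
def build_caption(fields, ordering, outer_sep=", "):
    """Join the available caption fields in the order given by --caption_ordering.

    Fields that are missing or empty for a given image are simply skipped.
    """
    parts = [fields[k] for k in ordering if fields.get(k)]
    return outer_sep.join(parts)

# Hypothetical metadata fields for one image
fields = {"character": "alice (hypothetical)", "copyright": "some series",
          "tags": "1girl, smile"}
print(build_caption(fields, ["character", "copyright", "tags"]))
# alice (hypothetical), some series, 1girl, smile
```
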

For a complete list of all available arguments for captioning, please refer to the configuration file at configs/pipelines/base.toml.

Additional arguments specific to kohya training

  • keep_tokens_sep: Determines the separator for keep tokens used by the Kohya trainer. By default, it uses the value set in --character_outer_sep.
    Example usage: --keep_tokens_sep "|| "
  • keep_tokens_before: Specifies the caption field before which keep_tokens_sep is placed. The default setting is 'tags'.
    Example usage: --keep_tokens_before crop_info

Manual inspection: Tag and character editing

You can use --save_aux to save some metadata fields to separate files and use --load_aux to load them.

Typically, you can run with --save_aux processed_tags characters. You then get files with names like XXX.processed_tags and XXX.characters. These can be batch edited with tools such as dataset-tag-editor or BatchPrompter. These changes can then be loaded with --load_aux processed_tags characters. Remember to run again from this stage to update captions.

  • Since tags are not overwritten by default, you don't need to worry about the tagger overwriting the edited tags.
  • It is better to correct detected characters after stage 3. Here you can edit XXX.characters to add non-detected characters.
  • ⚠️ If you edit caption directly, running this stage again will overwrite the changes.