
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 975: invalid start byte #113

Open
diodiogod opened this issue Aug 20, 2024 · 7 comments


@diodiogod

This is for bugs only

Did you already ask in the discord?

Yes

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

I'm getting this error in the middle of training: once at step 399, and a second time at step 1265.
ChatGPT says it's related to a single quotation mark (').

Maybe it's a character in the config folder, or in the name of a file or caption? Edit: after further googling, it's related to the smart quote (’) of Windows-1252. I just don't know how to find and replace it...
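One way to locate the culprit (a sketch I'm adding here, not from the thread): scan the caption files and report any that fail strict UTF-8 decoding, along with the offending byte and its offset. Point it at your dataset folder.

```python
import pathlib

def find_non_utf8(folder):
    """Return (filename, offset, byte) for each .txt file that is not valid UTF-8."""
    bad = []
    for path in sorted(pathlib.Path(folder).glob("*.txt")):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as e:
            # e.start is the offset of the first undecodable byte
            bad.append((path.name, e.start, data[e.start]))
    return bad

if __name__ == "__main__":
    for name, offset, byte in find_non_utf8("."):
        print(f"{name}: byte {byte:#04x} at offset {offset}")
```

Running it from the caption folder prints exactly which files (and which bytes) the trainer will choke on.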

My previous LoRA was for a person named "Loïc", whose name I used as a trigger word, and it had errors related to the ï character. I had to change it everywhere in the config file, but I left the captions as they were and it worked. I also think this is a bug: the file name and the prompt in the config file should allow this character to be used.

Anyway, this is the problem I'm having now (not related to Loïc); it's a different LoRA.

Traceback (most recent call last):
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 90, in <module>
    main()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 86, in main
    raise e
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 78, in main
    job.run()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1667, in run
    batch = next(dataloader_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 54, in fetch
    data = self.dataset[possibly_batched_index]
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataset.py", line 350, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\data_loader.py", line 539, in __getitem__
    return [self._get_single_item(idx) for idx in idx_list]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\data_loader.py", line 527, in _get_single_item
    file_item.load_caption(self.caption_dict)
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\dataloader_mixins.py", line 305, in load_caption
    prompt = f.read()
             ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 975: invalid start byte
@diodiogod
Author

Using Notepad++ and searching for [\x84\x93\x94] or [\x82\x91\x92] didn't give me any results across all my 1196 txt files. =(

@setothegreat

Just got this error last night when training a LoRA; it hadn't happened with the LoRAs I'd trained prior.
A Google search revealed that it's the result of one of the prompt files being encoded in something other than UTF-8. The only file that had been encoded differently contained the phrase "The Wizard's"; converting the file back to UTF-8 turned the ' into an invalid character, and since that change training seems to be fine again.

A quick way to fix it is to use Notepad++'s find-and-replace on the folder containing your training data and replace ' with nothing. As for converting the encoding back to UTF-8, I'm not entirely sure how to do that automatically, but there's probably a way.
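That re-encoding can be automated along these lines (a sketch of my own, assuming the stray files are Windows-1252/cp1252, which is what byte 0x92 suggests): decode any file that fails strict UTF-8 as cp1252 and write it back as UTF-8, so the apostrophe survives instead of being deleted.

```python
import pathlib

def reencode_to_utf8(folder):
    """Rewrite any .txt file that is not valid UTF-8, assuming it is cp1252."""
    for path in pathlib.Path(folder).glob("*.txt"):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            # cp1252 maps 0x92 to a right single quote, so the text survives intact
            path.write_text(data.decode("cp1252"), encoding="utf-8")
```

Run it on a backup copy of the caption folder first, since it rewrites files in place.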

@futureflix87

futureflix87 commented Aug 21, 2024

I got this error when using llama 3.1 7b as a captioner with a JoyCaption script. It was printing an apostrophe (') in a non-UTF-8 format. After I removed them, training went fine...
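If the captioning script is your own, the usual root cause is opening files without an explicit encoding, so Python falls back to the platform default (cp1252 on many Windows setups). A minimal sketch (the helper name is mine, not from any captioning tool):

```python
def write_caption(path, text):
    # Always pass encoding= so the platform default (e.g. cp1252) is never used
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```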

@WarAnakin

Yes, sometimes the file encoding gets reset, depending on whether you opened the file with another program and closed it.

@diodiogod
Author

You guys are right. Someone (user: Think) on Discord suggested it, and I ran these two scripts on the caption folder and it worked.

I still think this falls into the bug category, and as a suggestion the trainer could handle it better in the future.

(Note: these scripts run on the current directory, so run them on a backup copy to avoid messing up your dataset.)

import os
import string

# Function to replace special characters with basic equivalents
def replace_special_characters(text):
    replacements = {
        '’': "'",
        '‘': "'",
        '“': '"',
        '”': '"',
        '–': '-',
        '—': '-',
        '…': '...',
        'é': 'e',
        'è': 'e',
        'ê': 'e',
        'á': 'a',
        'à': 'a',
        'â': 'a',
        'ó': 'o',
        'ò': 'o',
        'ô': 'o',
        'ú': 'u',
        'ù': 'u',
        'û': 'u',
        'í': 'i',
        'ì': 'i',
        'î': 'i',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss',
        'ü': 'u',
        'ö': 'o',
        'ä': 'a',
        'ø': 'o',
        'æ': 'ae',
        # Add more replacements as needed
    }
    
    # Keep printable characters as-is; replace others via the table, dropping any
    # non-printable character that has no replacement
    printable = set(string.printable)
    result = ''.join(c if c in printable else replacements.get(c, '') for c in text)
    
    return result

# Function to process all .txt files in the current directory
def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()
            
            # Replace special characters
            cleaned_content = replace_special_characters(content)
            
            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)

if __name__ == "__main__":
    process_text_files()
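As an alternative to maintaining that replacement table by hand (my own sketch, not from the Discord suggestion), accented letters can be stripped with Unicode NFKD decomposition; curly quotes and dashes have no ASCII decomposition, so a small translation table still covers those:

```python
import unicodedata

# Characters with no NFKD decomposition still need an explicit mapping
PUNCT_MAP = str.maketrans({"\u2019": "'", "\u2018": "'", "\u201c": '"',
                           "\u201d": '"', "\u2013": "-", "\u2014": "-",
                           "\u2026": "..."})

def asciify(text):
    """Map curly punctuation to ASCII, decompose accented letters (NFKD),
    drop the combining marks, then drop anything still non-ASCII."""
    text = text.translate(PUNCT_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in stripped if ord(c) < 128)
```

This handles every accent in one pass (so "Loïc" becomes "Loic") instead of enumerating each one.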
The second script removes any remaining non-ASCII characters:

import os

def remove_special_characters(text):
    # Keep only ASCII characters (characters with ordinal values from 0 to 127)
    return ''.join(c if ord(c) < 128 else '' for c in text)

def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()

            # Remove non-ASCII characters
            cleaned_content = remove_special_characters(content)

            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)

if __name__ == "__main__":
    process_text_files()

@airobinnet

#128 should fix this
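Until that lands, a defensive read in the caption loader could look like this (a sketch of the idea, not the actual change in #128): try strict UTF-8, fall back to cp1252, and as a last resort substitute undecodable bytes instead of crashing mid-training.

```python
def load_caption_text(path):
    """Decode a caption file tolerantly: UTF-8 first, then cp1252,
    then UTF-8 with undecodable bytes replaced."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return raw.decode("utf-8", errors="replace")
```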

@xFoolery

xFoolery commented Aug 28, 2024

Open Windows Settings > Time & Language > Language & Region > Administrative Language Settings > Change System Locale, and check "Beta: Use Unicode UTF-8 for worldwide language support". This works for me.
