πŸ”₯ Korean GPT-2: KoGPT2 fine-tuning (cased), trained on Korean lyrics data πŸ”₯

KoGPT2-FineTuning

Open In Colab license Apache-2.0 contributions welcome GitHub issues GitHub stars

SKT-AIμ—μ„œ μ•½ 20GB의 ν•œκ΅­μ–΄ 데이터λ₯Ό Pre-Training μ‹œν‚¨ KoGPT2λ₯Ό μ‚¬μš©ν–ˆμŠ΅λ‹ˆλ‹€. 첫 번째둜 가사 μž‘μ‚¬λ₯Ό μœ„ν•΄μ„œ, μ €μž‘κΆŒμ΄ 만료된 μ •μ œλœ 가사 데이터, μ†Œμ„€, 기사 등을 Dataλ³„λ‘œ weightλ₯Ό λ‹€λ₯΄κ²Œ μ£Όλ©° Fine-tuning ν•˜μ˜€μŠ΅λ‹ˆλ‹€. λ˜ν•œ μž₯λ₯΄λ„ λ°›μ•„μ„œ μŒμ•… μž₯λ₯΄λ³„ 가사 ν•™μŠ΅ κ²°κ³Όλ₯Ό λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€.

λ˜ν•œ Colabμ—μ„œλŠ” μ›ν™œν•œ ν•™μŠ΅μ„ μœ„ν•΄μ„œ Google Drive와 Dropbbox을 μ—°λ™ν–ˆμŠ΅λ‹ˆλ‹€. ν•™μŠ΅ν•œ 쀑간 κ²°κ³Όλ₯Ό Google Driveμ—μ„œ Dropbbox둜 μ΄λ™μ‹œν‚¨ ν›„, Google Driveμ—μ„œ ν•΄λ‹Ή κ²°κ³Όλ₯Ό μ‚­μ œν•˜κ²Œ ν•©λ‹ˆλ‹€. 이와 κ΄€λ ¨λœ Code

μŒμ•… μž₯λ₯΄λ³„λ‘œ, CSV ν˜•μ‹μ˜ Dataset을 λ°›λŠ” 바뀐 Version 2의 Code둜 KoGPT2-FineTuning μž‘μ—…μ„ ν•˜κΈ° μ–΄λ ΅λ‹€λ©΄, Version 1.1을 μ΄μš©ν•˜κΈΈ λ°”λžλ‹ˆλ‹€.

μ•„λž˜μ—μ„œ, λ‹€μ–‘ν•œ ν•œκ΅­μ–΄ 가사λ₯Ό ν•™μŠ΅ν•œ κ²°κ³Όλ₯Ό 확인 ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μš°λ¦¬λŠ” 이외에도 λ‹€μ–‘ν•œ ν”„λ‘œμ νŠΈλ₯Ό 진행할 κ²ƒμž…λ‹ˆλ‹€.

Sample

Data structure

weight    Genre     lyrics
1100.0    λ°œλΌλ“œ    'λ‚΄ λ§˜μ„ μ•Œμž–μ•„μš”\n\n\nλ°”λ‘œμ²˜λŸΌ λ©ν•˜λ‹ˆ μ„œ μžˆλŠ” λͺ¨μŠ΅λ§Œ\n\n\n바라보닀\n\n\n포기할 수 밖에 μ—†μ–΄μ„œ...'
...

The dataset spans 3 columns Γ— 200,000 rows (3x200000).
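A minimal sketch of loading this CSV for inspection, assuming pandas and the dataset/lyrics_dataset.csv path used in the fine-tuning command below:

import pandas as pd

# Three columns as shown above: weight, Genre, lyrics
df = pd.read_csv("./dataset/lyrics_dataset.csv")
print(df.shape)              # roughly (200000, 3)
print(df.iloc[0]["Genre"])   # e.g. "λ°œλΌλ“œ"

# The per-row weight can be used to emphasise some sources
# (lyrics vs. novels vs. articles) more than others during fine-tuning.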

Fine Tuning

python main.py --epoch=200 --data_file_path=./dataset/lyrics_dataset.csv --save_path=./checkpoint/ --load_path=./checkpoint/genre/KoGPT2_checkpoint_296000.tar --batch_size=1

parser

parser.add_argument('--epoch', type=int, default=200,
					help="Number of training epochs.")
parser.add_argument('--save_path', type=str, default='./checkpoint/',
					help="Directory where training checkpoints are saved.")
parser.add_argument('--load_path', type=str, default='./checkpoint/Alls/KoGPT2_checkpoint_296000.tar', 
					help="Path of a trained checkpoint to resume from.")
parser.add_argument('--samples', type=str, default="samples/",
					help="Directory where generated samples are saved.")
parser.add_argument('--data_file_path', type=str, default='dataset/lyrics_dataset.txt',
					help="Path of the training data file.")
parser.add_argument('--batch_size', type=int, default=8,
					help="Batch size.")

Use Colab

Open In Colab

Colab을 μ΄μš©ν•΄μ„œ Fine-tuning Codeλ₯Ό μ‹€ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

Runtime Disconnection Prevention

function ClickConnect() {
    // "Failed to allocate a backend."
    // "Cannot use a backend with a GPU. Would you like to use a runtime without an accelerator?"
    // Find the Cancel button of that dialog and click it.
    var buttons = document.querySelectorAll("colab-dialog.yes-no-dialog paper-button#cancel"); 
    buttons.forEach(function(btn) {
		btn.click();
    });
    console.log("1λΆ„ λ§ˆλ‹€ λ‹€μ‹œ μ—°κ²°");
    document.querySelector("#top-toolbar > colab-connect-button").click();
}
setInterval(ClickConnect,1000*60);

Clear the screen every 10 minutes

function CleanCurrentOutput(){ 
	// The title selector matches the Korean-language Colab UI ("ν˜„μž¬ μ‹€ν–‰ 쀑..." = "currently running...").
	var btn = document.querySelector(".output-icon.clear_outputs_enabled.output-icon-selected[title$='ν˜„μž¬ μ‹€ν–‰ 쀑...'] iron-icon[command=clear-focused-or-selected-outputs]");
	if(btn) {
		console.log("Clearing output every 10 minutes");
		btn.click();
	}
} 
setInterval(CleanCurrentOutput,1000*60*10);

GPU Memory Check

nvidia-smi

(In a Colab cell, run !nvidia-smi; on Windows, the binary is nvidia-smi.exe.)

generator

python generator.py --temperature=1.0 --text_size=1000 --tmp_sent=""

ν‘œμ ˆ μ—†μŒ

python generator.py --temperature=5.0 --text_size=500 --tmp_sent=""

parser

parser.add_argument('--temperature', type=float, default=0.7,
					help="Sampling temperature; higher values make the text more creative.")
parser.add_argument('--top_p', type=float, default=0.9,
					help="Top-p (nucleus) sampling threshold.")
parser.add_argument('--top_k', type=int, default=40,
					help="Top-k sampling cutoff.")
parser.add_argument('--text_size', type=int, default=250,
					help="Length of the generated text.")
parser.add_argument('--loops', type=int, default=-1,
					help="Number of generation loops; -1 repeats indefinitely.")
parser.add_argument('--tmp_sent', type=str, default="μ‚¬λž‘",
					help="Sentence that starts the generated text.")
parser.add_argument('--load_path', type=str, default="./checkpoint/Alls/KoGPT2_checkpoint_296000.tar",
					help="Path of the trained checkpoint to load.")

Use Colab

Open In Colab

Colab을 μ΄μš©ν•΄μ„œ generatorλ₯Ό μ‹€ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

tensorboard

ν•™μŠ΅μ— λ”°λ₯Έ λ³€ν™”λ₯Ό ν™•μΈν•˜κΈ° μœ„ν•΄μ„œ, tensorboard둜 μ ‘κ·Όν•˜μ—¬ loss와 textλ₯Ό ν™•μΈν•©λ‹ˆλ‹€.

tensorboard --logdir=runs

loss

text
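A minimal sketch of the kind of logging that produces those loss and text panels, using torch.utils.tensorboard; the tag names here are illustrative, not necessarily the ones main.py uses:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs")   # same directory passed to --logdir above
for step in range(3):                    # stand-in for the real training loop
    loss = 1.0 / (step + 1)              # dummy loss value
    writer.add_scalar("loss", loss, step)                               # "loss" graph
    writer.add_text("text", f"generated sample at step {step}", step)   # "text" panel
writer.close()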

Citation

@misc{KoGPT2-FineTuning,
  author = {gyung},
  title = {KoGPT2-FineTuning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/gyunggyung/KoGPT2-FineTuning}},
}

Output

μžμ„Έν•œ 결과물은 samplesμ—μ„œ 확인 ν•  수 μžˆμŠ΅λ‹ˆλ‹€. ν•™μŠ΅μ— λŒ€ν•΄μ„œλŠ” κ΄€λ ¨ ν¬μŠ€νŒ…μ—μ„œ 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

Reference

https://github.com/openai/gpt-2
https://github.com/nshepperd/gpt-2
https://github.com/SKT-AI/KoGPT2
https://github.com/asyml/texar-pytorch/tree/master/examples/gpt-2
https://github.com/graykode/gpt-2-Pytorch
https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317
https://github.com/shbictai/narrativeKoGPT2
https://github.com/ssut/py-hanspell
https://github.com/likejazz/korean-sentence-splitter