Korean Finetuning #78

Open
hdeval1 opened this issue Jul 22, 2022 · 0 comments

I was able to fine-tune the base Korean model on TMX data by editing the fine-tune recipe, but I am now having issues with the resulting model. While changing the recipe, I found the filter-korean.sh script and replaced these steps:

python3 ../scripts/filter/bitext-match-lang.py -s $$s -t $$t | \
	grep --invert-match '[<>{}]' | \
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' |\
	shuf > ${TMX_DEV_BASE}.$$s-$$t.shuffled; \
	mkdir -p $$s-$$t/${TMXBASE}/dev; \

with the following:

/bin/bash ../scripts/filter/filter-korean.sh ${SRC} ${TRG} $$d > ${TMXBASE}.clean; \
	cat ${TMXBASE}.clean | \
	$(TOKENIZER)/replace-unicode-punctuation.perl |\
	perl -CS -pe 'tr[\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}][]cd;' |\
	sed 's/  */ /g;s/^ *//g;s/ *$$//g' |\
	shuf > ${TMXBASE}.$$s-$$t.shuffled; \
	mkdir -p $$s-$$t/${TMXBASE}/train; \
	mkdir -p $$s-$$t/${BASEMODELNAME}; \

That seemed to do the trick to kick off the tuning. However, the newly tuned model has a punctuation issue. It shows up if you send something like this (the real input would be in Korean, but I wrote it in English for the sake of explanation):

hello my name is heather. 
-heather is here to say hello,
*how are you today?

When a line ends in punctuation and the next line begins with a punctuation mark followed directly by a character, the translation comes out incorrect and the punctuation and line breaks are off. I did notice that if you send each line individually, the translations come out correctly with no punctuation issues. It seems as though the spacing and punctuation cause the lines to be interpreted as a single sentence, which affects the translation. I looked through the backlog and noticed there were some initial issues with Korean, so I figured I would ask whether you have any insight into what the issue may be.
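
For reference, the per-line behaviour that works for me can be sketched as a small client-side wrapper. This is only an illustration of the workaround, not a fix for the model: `translate_per_line` and the `translate` callable are hypothetical names standing in for whatever actually sends a single sentence to the tuned model.

```python
from typing import Callable


def translate_per_line(text: str, translate: Callable[[str], str]) -> str:
    """Translate each non-empty line on its own and keep the original line breaks.

    `translate` is a hypothetical stand-in for the call that sends one
    sentence to the model; it is an assumption, not part of any real API.
    """
    translated = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped:
            # Each line is sent separately, so punctuation at a line boundary
            # cannot be merged with the neighbouring line.
            translated.append(translate(stripped))
        else:
            # Preserve blank lines so the output layout matches the input.
            translated.append("")
    return "\n".join(translated)
```

Calling it with the example above and a real translation function in place of `translate` should give one translated line per input line, which matches what I see when sending the lines individually.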

Thank you!
