  1. create training data with split_training_text.py (NOTE: DON'T use a custom training_text; Tesseract's default is specially formatted for training)
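(for reference: split_training_text.py is essentially a wrapper around text2image; a rough sketch of generating one page by hand — the font name and fonts dir are placeholders for whatever is installed locally)
text2image --text jpn.training_text --outputbase jpn.TakaoGothic.exp0 --font "TakaoGothic" --fonts_dir C:\Windows\Fonts --maxpages 1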
  2. create .lstmf files from training data (try -c preserve_interword_spaces=1 to remove unnecessary spaces)
for %i in (*.tif) do tesseract %i %~ni --oem 1 --psm 6 -c preserve_interword_spaces=1 -l jpn lstm.train
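(the same loop in bash, if not on Windows — an untested sketch)
for f in *.tif; do tesseract "$f" "${f%.tif}" --oem 1 --psm 6 -c preserve_interword_spaces=1 -l jpn lstm.train; done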
  3. create listfile.txt (NOTE: make sure it has Unix line endings, i.e. LF with no CR… the same applies to box files etc.)
dir /b *.lstmf > listfile.txt
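(note: dir /b redirected on Windows produces CRLF line endings, which breaks the LF-only requirement above; one way to convert, assuming dos2unix is installed)
dos2unix listfile.txt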
  4. train

first, extract the LSTM model from the original traineddata:

combine_tessdata -e jpn.traineddata jpn.lstm
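(sanity check, if needed: list the components inside the traineddata to confirm an lstm model is present)
combine_tessdata -d jpn.traineddata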

maybe that extraction isn't needed, and the same jpn.traineddata could be passed to --continue_from instead of the .lstm?

lstmtraining --model_output output --continue_from jpn.lstm --traineddata jpn.traineddata --max_iterations 500 --train_listfile listfile.txt

to resume training from a generated checkpoint:

lstmtraining --model_output output --continue_from output_checkpoint --traineddata jpn.traineddata --max_iterations 400 --train_listfile listfile.txt
  5. evaluate
lstmeval --model output_checkpoint --traineddata jpn.traineddata --eval_listfile listfile.txt
  6. create .traineddata (rerun the step 4 command with --stop_training added)
lstmtraining --model_output output --continue_from output_checkpoint --traineddata jpn.traineddata --stop_training --max_iterations 400 --train_listfile listfile.txt
  7. move the generated traineddata file to the tessdata folder (rename the output file to <name>.traineddata, put that in tessdata, then specify -l <name> in the command)
  8. test ocr
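(for example — the output filename, install path, and model name here are placeholders for your setup)
copy output.traineddata "C:\Program Files\Tesseract-OCR\tessdata\jpn_custom.traineddata"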
tesseract <image> output -l <traineddata-name>