  1. create training data with split_training_text.py (NOTE: DON'T use a custom training_text; Tesseract's default is specially formatted for training)
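(for reference: split_training_text.py is essentially a wrapper around text2image; a rough sketch of generating one page by hand — the font name and fonts dir are placeholders for whatever is installed locally)
text2image --text jpn.training_text --outputbase jpn.TakaoGothic.exp0 --font "TakaoGothic" --fonts_dir C:\Windows\Fonts --maxpages 1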
  2. create .lstmf files from training data (try -c preserve_interword_spaces=1 to remove unnecessary spaces)
for %i in (*.tif) do tesseract %i %~ni --oem 1 --psm 6 -c preserve_interword_spaces=1 -l jpn lstm.train
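(the same loop in bash, if not on Windows — an untested sketch)
for f in *.tif; do tesseract "$f" "${f%.tif}" --oem 1 --psm 6 -c preserve_interword_spaces=1 -l jpn lstm.train; done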
  3. create listfile.txt (NOTE: make sure it has Unix line endings, i.e. LF with no CR… the same applies to box files etc.)
dir /b *.lstmf > listfile.txt
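(note: dir /b redirected on Windows produces CRLF line endings, which breaks the LF-only requirement above; one way to convert, assuming dos2unix is installed)
dos2unix listfile.txt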
  4. train

first, extract the LSTM model from the original traineddata:

combine_tessdata -e jpn.traineddata jpn.lstm
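(sanity check, if needed: list the components inside the traineddata to confirm an lstm model is present)
combine_tessdata -d jpn.traineddata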

maybe that extraction isn't needed, and the same jpn.traineddata could be passed to --continue_from instead of the .lstm?

lstmtraining --model_output output --continue_from jpn.lstm --traineddata jpn.traineddata --max_iterations 500 --train_listfile listfile.txt

to resume training from a generated checkpoint:

lstmtraining --model_output output --continue_from output_checkpoint --traineddata jpn.traineddata --max_iterations 400 --train_listfile listfile.txt
  5. evaluate
lstmeval --model output_checkpoint --traineddata jpn.traineddata --eval_listfile listfile.txt
  6. create .traineddata (rerun the step 4 command with --stop_training added)
lstmtraining --model_output output --continue_from output_checkpoint --traineddata jpn.traineddata --stop_training --max_iterations 400 --train_listfile listfile.txt
  7. move the generated traineddata file to the tessdata folder (rename the output file to <name>.traineddata, put that in tessdata, then specify -l <name> in the command)
  8. test ocr
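(for example — the output filename, install path, and model name here are placeholders for your setup)
copy output.traineddata "C:\Program Files\Tesseract-OCR\tessdata\jpn_custom.traineddata"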
tesseract <image> output -l <traineddata-name>