History

CorrectOCR is based on code created by:

Caitlin Richter (ricca@seas.upenn.edu)
Matthew Wickes (wickesm@seas.upenn.edu)
Deniz Beser (dbeser@seas.upenn.edu)
Mitchell Marcus (mitch@cis.upenn.edu)

See their article “Low-resource Post Processing of Noisy OCR Output for Historical Corpus Digitisation” (LREC 2018) for further details. It is available online at http://www.lrec-conf.org/proceedings/lrec2018/pdf/971.pdf

The original Python 2.7 code (see original-tag in the repository) is licensed under Creative Commons Attribution 4.0 (CC-BY-4.0; see also license.txt in the repository).

The code has subsequently been updated to Python 3 and further expanded by Mikkel Eide Eriksen (mikkel.eriksen@gmail.com) for the Copenhagen City Archives. The changes are mainly structural; the algorithms are generally preserved as-is. Pull requests are welcome!

Requirements

Python >= 3.6

For package dependencies, see requirements.txt. They can be installed using pip install -r requirements.txt
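As a minimal sketch, the interpreter version can be checked against the stated minimum before installing the dependencies. The names meets_minimum and MINIMUM_PYTHON are hypothetical helpers for illustration only; they are not part of CorrectOCR.

```python
import sys

# Minimum version taken from the requirement stated above (Python >= 3.6).
MINIMUM_PYTHON = (3, 6)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum():
        # Exit with an error message and a non-zero status code.
        sys.exit("Python >= 3.6 is required, found " + sys.version.split()[0])
    print("Python version OK")
```

Running this script with the interpreter that will be used for CorrectOCR shows whether it satisfies the requirement; if it does, the dependencies can then be installed from requirements.txt.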
