CorrectOCR.workspace module

class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)[source]

Bases: object

The Workspace holds references to Documents and resources used by the various commands.

Parameters
  • workspaceconfig

    An object with the following properties:

    • nheaderlines (int): The number of header lines in corpus texts.

    • language: A language instance from pycountry <https://pypi.org/project/pycountry/>.

    • originalPath (Path): Directory containing the original docs.

    • goldPath (Path): Directory containing the gold docs (if any).

    • trainingPath (Path): Directory for storing intermediate docs.

    • correctedPath (Path): Directory for saving corrected docs.

  • resourceconfig – Passed directly to ResourceManager; see that class for details.

  • storageconfig – TODO
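
As a rough illustration of the shape described above, the following sketch builds a stand-in workspaceconfig object with the listed properties. The use of types.SimpleNamespace, the directory names, and the placeholder for language are assumptions made for this example; the source does not say how the config object is actually constructed.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical root directory, chosen only for this example.
root = Path("workspace")

# Stand-in for the workspaceconfig object: any object exposing these
# attributes would match the description above.
workspaceconfig = SimpleNamespace(
    nheaderlines=0,
    language=None,  # placeholder; the source expects a pycountry language instance
    originalPath=root / "original",
    goldPath=root / "gold",
    trainingPath=root / "training",
    correctedPath=root / "corrected",
)

print(workspaceconfig.originalPath)
```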

add_doc(doc)[source]

Initializes a new Document and adds it to the workspace.

The doc_id of the document will be determined by its filename.

If the file is not in the originalPath, it will be copied or downloaded there.

Parameters

doc (Any) – A path or URL.

Return type

str
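
The following is a minimal sketch of how an ID could be derived from a path or URL by its filename. It assumes the ID is the filename without its extension; the helper name doc_id_from is hypothetical and is not part of CorrectOCR, whose actual rule may differ.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def doc_id_from(doc):
    """Hypothetical helper: derive an ID from a path or URL by its filename."""
    text = str(doc)
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        # For URLs, use the last component of the URL path.
        name = PurePosixPath(parsed.path).name
    else:
        name = Path(text).name
    # Assumption: the ID is the filename without its final extension.
    return Path(name).stem


print(doc_id_from("https://example.com/files/7696.pdf"))
print(doc_id_from(Path("corpus/abc.txt")))
```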

docids_for_ext(ext)[source]

Returns a list of IDs for documents with the given extension.

Return type

List[str]

original_tokens()[source]

Yields (docid, list of tokens) pairs.

Return type

Iterator[Tuple[str, TokenList]]

gold_tokens()[source]

Yields (docid, list of gold-aligned tokens) pairs.

Return type

Iterator[Tuple[str, TokenList]]

cleanup(dryrun=True, full=False)[source]

Cleans out the backup files in the trainingPath.

Parameters
  • dryrun – Only lists the files, without deleting them.

  • full – Also deletes the current files (i.e., those without .nnn. in their suffix).
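
The interplay of dryrun and full can be sketched as follows. This is an illustration only: it assumes that nnn stands for three digits and that all files sit directly in one directory. The names BACKUP_PATTERN, is_backup, and cleanup_sketch are hypothetical and do not reflect the actual implementation.

```python
import re
from pathlib import Path

# Assumption: ".nnn." in the filename means three digits between dots.
BACKUP_PATTERN = re.compile(r"\.\d{3}\.")


def is_backup(path):
    """Return True if the filename contains a .nnn. component."""
    return BACKUP_PATTERN.search(path.name) is not None


def cleanup_sketch(training_dir, dryrun=True, full=False):
    """List (and, unless dryrun is set, delete) backup files.

    With full=True, the current (non-backup) files are selected as well.
    """
    selected = []
    for path in sorted(Path(training_dir).iterdir()):
        if not path.is_file():
            continue
        if full or is_backup(path):
            selected.append(path)
            if not dryrun:
                path.unlink()
    return selected
```

Defaulting dryrun to True, as the documented signature does, means a bare call only reports what would be removed.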

class CorrectOCR.workspace.Document(workspace, doc, original, gold, training, corrected, nheaderlines=0)[source]

Bases: object

Parameters
  • doc (Path) – A path to a file.

  • original (Path) – Directory for original uncorrected files.

  • gold (Path) – Directory for known correct “gold” files (if any).

  • training (Path) – Directory for storing intermediate files.

  • corrected (Path) – Directory for saving corrected files.

  • nheaderlines (int) – Number of lines in the file header (only relevant for .txt files).

tokenFile

Path to token file (CSV format).

fullAlignmentsFile

Path to full letter-by-letter alignments (JSON format).

wordAlignmentsFile

Path to word-by-word alignments (JSON format).

readCountsFile

Path to letter read counts (JSON format).

alignments(force=False)[source]

Uses the Aligner to generate alignments for a given (original, gold) pair of docs.

Caches its results in the trainingPath.

Parameters

force – Back up existing alignment docs and create new ones.

Return type

Tuple[list, dict, list]

prepare(step, k, dehyphenate=False, force=False)[source]

Prepares the Tokens for the given doc.

Parameters
  • k (int) – How many k-best suggestions to calculate, if necessary.

  • dehyphenate – Whether to attempt dehyphenation of tokens.

  • force – Back up existing tokens and create new ones.

crop_tokens(edge_left=None, edge_right=None)[source]

class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)[source]

Bases: object

Simple wrapper for text files to manage a number of lines as a separate header.

Parameters
  • path (Path) – Path to text file.

  • nheaderlines (int) – Number of lines from beginning to separate out as header.

save()[source]

Concatenate header and body and save.
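
The header/body split described for this class can be sketched as below. The class name HeaderedTextFile and its header and body attributes are assumptions for illustration; the source does not list the attributes of CorpusFile.

```python
from pathlib import Path


class HeaderedTextFile:
    """Hypothetical wrapper that keeps the first lines of a file as a header."""

    def __init__(self, path, nheaderlines=0):
        self.path = Path(path)
        self.nheaderlines = nheaderlines
        if self.path.is_file():
            lines = self.path.read_text(encoding="utf-8").splitlines(keepends=True)
            # The first nheaderlines lines form the header, the rest the body.
            self.header = "".join(lines[:nheaderlines])
            self.body = "".join(lines[nheaderlines:])
        else:
            self.header, self.body = "", ""

    def save(self):
        # Concatenate header and body and write them back to the same path.
        self.path.write_text(self.header + self.body, encoding="utf-8")
```

Keeping the header separate lets the body be replaced without touching the header lines.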

is_file()[source]

Return type

bool

Returns

Whether the file exists. See pathlib.Path.is_file().

property id

class CorrectOCR.workspace.JSONResource(path, **kwargs)[source]

Bases: dict

Simple wrapper for JSON files.

Parameters
  • path – Path to load from.

  • kwargs – TODO

save()[source]

Save to JSON file.
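
A dict subclass that loads from and saves to a JSON file can be sketched as follows. The class name JSONDictSketch is hypothetical, and the kwargs are left out because the source does not document them.

```python
import json
from pathlib import Path


class JSONDictSketch(dict):
    """Hypothetical dict subclass backed by a JSON file."""

    def __init__(self, path):
        super().__init__()
        self.path = Path(path)
        # Load existing contents, if the file is already there.
        if self.path.is_file():
            with open(self.path, encoding="utf-8") as f:
                self.update(json.load(f))

    def save(self):
        # Only the dict items are serialized; the path attribute is not.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self, f)
```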

class CorrectOCR.workspace.ResourceManager(root, config)[source]

Bases: object

Helper for the Workspace to manage various resources.

Parameters
  • root (Path) – Path to resources directory.

  • config

    An object with the following properties:

    • correctionTrackingFile (Path): Path to file containing correction tracking.

    • TODO