CorrectOCR.workspace module
class CorrectOCR.workspace.Workspace(workspaceconfig, resourceconfig, storageconfig)
Bases: object
The Workspace holds references to Documents and resources used by the various commands.
Parameters
- workspaceconfig – An object with the following properties:
  - nheaderlines (int): The number of header lines in corpus texts.
  - language: A language instance from pycountry <https://pypi.org/project/pycountry/>.
  - originalPath (Path): Directory containing the original docs.
  - goldPath (Path): Directory containing the gold docs (if any).
  - trainingPath (Path): Directory for storing intermediate docs.
  - correctedPath (Path): Directory for saving corrected docs.
- resourceconfig – Passed directly to ResourceManager; see ResourceManager for further information.
- storageconfig – TODO
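As a rough illustration of the shape described above, the following sketch builds a stand-in workspaceconfig with types.SimpleNamespace. Only the attribute names come from the parameter list; the directory names and the language placeholder are assumptions made for this example. In real use, language would be a pycountry language instance, and how the configuration object is actually constructed by CorrectOCR is not described here.

```python
from pathlib import Path
from types import SimpleNamespace

# Hypothetical directory layout; only the attribute names are taken
# from the parameter list above.
root = Path("workspace")

workspaceconfig = SimpleNamespace(
    nheaderlines=0,                             # header lines in corpus texts
    language=SimpleNamespace(name="English"),   # stand-in for a pycountry language
    originalPath=root / "original",             # original docs
    goldPath=root / "gold",                     # gold docs, if any
    trainingPath=root / "training",             # intermediate docs
    correctedPath=root / "corrected",           # corrected docs
)
```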
add_doc(doc)
Initializes a new Document and adds it to the workspace.
The doc_id of the document will be determined by its filename.
If the file is not in the originalPath, it will be copied or downloaded there.
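The behaviour described for add_doc can be pictured with a small standalone sketch. It is not the library's implementation: the helper names doc_id_for and ensure_in_original are hypothetical, the rule that the doc_id is the filename without its extension is an assumption, and downloading from a URL is left out.

```python
import shutil
from pathlib import Path


def doc_id_for(path: Path) -> str:
    """Assumed rule: the doc_id is the filename without its extension."""
    return path.stem


def ensure_in_original(doc: Path, original_path: Path) -> Path:
    """Copy doc into original_path unless it is already stored there.

    Hypothetical helper; only local files are handled.
    """
    original_path.mkdir(parents=True, exist_ok=True)
    target = original_path / doc.name
    # Skip the copy when doc already is the file inside original_path.
    if doc.resolve() != target.resolve():
        shutil.copy(doc, target)
    return target
```

Under these assumptions, a file named book1.txt anywhere on disk would end up as originalPath/book1.txt with the doc_id book1.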
class CorrectOCR.workspace.Document(workspace, doc, original, gold, training, corrected, nheaderlines=0)
Bases: object
Parameters
- doc (Path) – A path to a file.
- original (Path) – Directory for original uncorrected files.
- gold (Path) – Directory for known correct “gold” files (if any).
- training (Path) – Directory for storing intermediate files.
- corrected (Path) – Directory for saving corrected files.
- nheaderlines (int) – Number of lines in file header (only relevant for .txt files).
tokenFile
Path to token file (CSV format).

fullAlignmentsFile
Path to full letter-by-letter alignments (JSON format).

wordAlignmentsFile
Path to word-by-word alignments (JSON format).

readCountsFile
Path to letter read counts (JSON format).
alignments(force=False)
Uses the Aligner to generate alignments for a given (original, gold) pair of docs.
Caches its results in the trainingPath.
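The caching behaviour with a force flag can be sketched generically as follows. This is an illustration of the pattern only: cached_alignments and its compute argument are hypothetical names, the cache is assumed to be a single JSON file, and the actual Aligner call is replaced by whatever function the caller passes in.

```python
import json
from pathlib import Path
from typing import Callable


def cached_alignments(cache_file: Path,
                      compute: Callable[[], dict],
                      force: bool = False) -> dict:
    """Return the cached result, recomputing when forced or when no cache exists."""
    if cache_file.is_file() and not force:
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = compute()  # stand-in for the real alignment step
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With this pattern, a second call without force reads the stored file instead of running compute again, while force=True always recomputes and overwrites the cache.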
class CorrectOCR.workspace.CorpusFile(path, nheaderlines=0)
Bases: object
Simple wrapper for text files to manage a number of lines as a separate header.
is_file()
Returns
Does the file exist? See pathlib.Path.is_file().
property id
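The idea of treating the first nheaderlines lines of a text file as a separate header can be sketched as below. The function split_header is a hypothetical helper written for this example, not part of CorpusFile, and UTF-8 encoding is assumed.

```python
from pathlib import Path
from typing import List, Tuple


def split_header(path: Path, nheaderlines: int = 0) -> Tuple[List[str], str]:
    """Return (header lines, body text) for a text file."""
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    header = lines[:nheaderlines]          # kept apart from the corpus text
    body = "".join(lines[nheaderlines:])   # the text that would be processed
    return header, body
```

With nheaderlines=0, the header is empty and the whole file is returned as the body.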
class CorrectOCR.workspace.JSONResource(path, **kwargs)
Bases: dict
Simple wrapper for JSON files.
Parameters
- path – Path to load from.
- kwargs – TODO
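A dict subclass backed by a JSON file might look roughly like the sketch below. JSONFileDict is a hypothetical stand-in, not the JSONResource class itself: it assumes the file holds a JSON object, ignores the undocumented kwargs, and its save method is an addition made for this example.

```python
import json
from pathlib import Path


class JSONFileDict(dict):
    """Hypothetical stand-in: a dict filled from a JSON file, if it exists."""

    def __init__(self, path: Path):
        super().__init__()
        self.path = Path(path)
        if self.path.is_file():
            # Assumes the top-level JSON value is an object.
            self.update(json.loads(self.path.read_text(encoding="utf-8")))

    def save(self) -> None:
        """Write the current contents back to the same path (illustrative only)."""
        self.path.write_text(json.dumps(self, ensure_ascii=False), encoding="utf-8")
```

Because the class derives from dict, the loaded data can be read and modified with ordinary dictionary operations.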