molminer package

Submodules

molminer.AbstractLinker module

class molminer.AbstractLinker.AbstractLinker

Bases: abc.ABC

Acts as a linker between Python and the command-line interface of various software (SW).

_OPTIONS_REAL

dict – Internal dict which maps the passed options to real SW command-line arguments.

options

dict – Dict of SW’s command-line parameters.

options_internal

dict – Dict of internal options.

process()

Process the input file with SW.

build_commands(options, _OPTIONS_REAL, path_to_sw)

Convert the internal parameters to real SW’s command-line parameters.

help()

Return SW’s help.

static build_commands(options: dict, options_real: dict, path_to_sw: str) → list

Convert the internal parameters to real SW’s command-line parameters.

Parameters:
  • options (dict) – Options to build commands from.
  • options_real (dict) – Dict which maps internal parameters to real SW’s ones.
  • path_to_sw (str) – Path to SW’s binary.
Returns:

List of commands for calling the subprocess, dict of parameters, dict of internal parameters.

Return type:

(list, dict, dict)
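
The conversion described above can be pictured with a minimal sketch. It assumes that an option mapped to a flag is emitted either as a bare switch (for True) or as a flag followed by its value; the function name build_commands_sketch, the tool name and every flag below are hypothetical and are not taken from molminer itself.

```python
def build_commands_sketch(options: dict, options_real: dict, path_to_sw: str) -> list:
    """Turn {"internal_name": value} into a flat argv list (illustrative only)."""
    commands = [path_to_sw]
    for name, value in options.items():
        flag = options_real.get(name)
        # Skip options without a known flag and options that are unset or disabled.
        if flag is None or value is None or value is False or value == "":
            continue
        if value is True:
            commands.append(flag)                # boolean switch
        else:
            commands.extend([flag, str(value)])  # flag followed by its value
    return commands


# Hypothetical mapping and option values, chosen only for this example.
mapping = {"resolution": "--resolution", "negate": "--negate", "size": "--size"}
opts = {"resolution": 300, "negate": True, "size": ""}
argv = build_commands_sketch(opts, mapping, "some-tool")
print(argv)
```

Returning a list rather than a single string keeps each argument separate, which is the form subprocess expects when no shell is involved.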

help() → str
Returns: SW’s help message.
Return type: str
process(**kwargs) → namedtuple

Process the input with the given SW.

Parameters: kwargs
Returns:
Return type: OrderedDict
set_options(options: dict)

Set the options passed in the dict. Keys are the same as the optional parameters of the child class’s constructor.

Parameters: options – Dict of new options.

molminer.ChemSpot module

class molminer.ChemSpot.ChemSpot(path_to_binary: str = 'chemspot', path_to_crf: str = '', path_to_nlp: str = '', path_to_dict: str = '', path_to_ids: str = '', path_to_multiclass: str = 'multiclass.bin', tessdata_path: str = '', max_memory: int = 8, verbosity: int = 1)

Bases: molminer.AbstractLinker.AbstractLinker

Represents the ChemSpot software and acts as a linker between Python and command-line interface of ChemSpot. ChemSpot version: 2.0

ChemSpot is software for chemical Named Entity Recognition. It assigns each chemical entity to one of these classes:

“SYSTEMATIC”, “IDENTIFIER”, “FORMULA”, “TRIVIAL”, “ABBREVIATION”, “FAMILY”, “MULTIPLE”

More information here: https://www.informatik.hu-berlin.de/de/forschung/gebiete/wbi/resources/chemspot/chemspot

ChemSpot is very memory-intensive, so dictionary and ID lookups are disabled by default; only the CRF, OpenNLP sentence and multiclass models are used. The maximum memory for the Java process defaults to 8 GB. If less than 8 GB of memory is available, using a swap file on an SSD is strongly recommended (see https://www.digitalocean.com/community/tutorials/how-to-add-swap-space-on-ubuntu-16-04 for more details).

To show the meaning of options:

chemspot = ChemSpot()
print(chemspot.help())  # this will show the output of "$ chemspot -h"
print(chemspot._OPTIONS_REAL)  # this will show the mapping between ChemSpot class and real ChemSpot parameters
_OPTIONS_REAL

dict – Internal dict which maps the passed options to real ChemSpot command-line arguments. Static attribute.

options

dict – Get or set options.

options_internal

dict – Return dict with options having internal names.

path_to_binary

str – Path to ChemSpot binary.

process()

Process the input file with ChemSpot.

help()

Return ChemSpot help message.

RE_CHARGE = re.compile('(?P<roman>i+|I+)|(?P<digit>\\d+)|(?P<signs>^\\++|-+$)')
RE_ION = re.compile('^\\s*(?P<ion>[A-Z][a-z]?)\\s*\\((?P<charge>-?\\+?i+\\+?-?|-?\\+?I+\\+?-?|\\d+\\+|\\d+-|\\+\\d+|-\\d+|\\++|-+)\\)\\s*$')
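
The two patterns above can be tried on their own. The sketch below copies them verbatim (written as raw strings) and shows which groups they capture; the helper split_ion is hypothetical and only wraps the match for readability.

```python
import re

# Patterns copied from the RE_CHARGE and RE_ION attributes.
RE_CHARGE = re.compile(r"(?P<roman>i+|I+)|(?P<digit>\d+)|(?P<signs>^\++|-+$)")
RE_ION = re.compile(
    r"^\s*(?P<ion>[A-Z][a-z]?)\s*"
    r"\((?P<charge>-?\+?i+\+?-?|-?\+?I+\+?-?|\d+\+|\d+-|\+\d+|-\d+|\++|-+)\)\s*$"
)


def split_ion(entity: str):
    """Return (element symbol, charge text), or None if the entity is not an ion."""
    match = RE_ION.match(entity)
    if match is None:
        return None
    return match.group("ion"), match.group("charge")


nickel = split_ion("Ni(II)")     # Roman-numeral charge
iron = split_ion("Fe(3+)")       # digit followed by a sign
not_ion = split_ion("benzene")   # does not look like "Element(charge)"
roman = RE_CHARGE.search("II").group("roman")
print(nickel, iron, not_ion, roman)
```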
help() → str
Returns: ChemSpot help message.
Return type: str
logger = <logging.Logger object>
static normalize_text(input_file_path: str = '', text: str = '', output_file_path: str = '', encoding: str = 'utf-8') → str

Normalize the text. Operations:

  • remove entity numbers that refer to another place in the text, e.g. “N-octyl- (2b)” -> “N-octyl-”
  • replace “-\n ” with “”
Parameters:
  • input_file_path (str) –
  • text (str) –
  • output_file_path (str) –
  • encoding (str) –
Returns:

Normalized text.

Return type:

str

Notes

One of input_file_path or text parameters must be set.
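
The two operations listed above can be approximated with regular expressions. This is only a sketch: both patterns and the function name normalize_text_sketch are assumptions made for illustration (in particular, the second operation is read here as removing a hyphen that sits at a line break), and the patterns actually used by ChemSpot.normalize_text may differ.

```python
import re

# Assumed patterns, not taken from molminer.
RE_ENTITY_LABEL = re.compile(r"\s*\(\d+[a-z]?\)")  # e.g. " (2b)" or " (14)"
RE_LINE_HYPHEN = re.compile(r"-\n\s*")             # hyphen directly before a line break


def normalize_text_sketch(text: str) -> str:
    """Drop compound labels and rejoin words hyphenated across lines."""
    text = RE_ENTITY_LABEL.sub("", text)
    text = RE_LINE_HYPHEN.sub("", text)
    return text


label_removed = normalize_text_sketch("N-octyl- (2b)")
rejoined = normalize_text_sketch("N-octyl- (2b) was pre-\npared")
print(label_removed)
print(rejoined)
```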

static parse_chemspot(file_path: str = '', text: str = '', encoding: str = 'utf-8') → list

Parse the output from ChemSpot.

Parameters:
  • file_path (str) – Path to file.
  • text (str) – Text to parse.
  • encoding (str) – File encoding.
Returns:

List of lists. Each sublist is one row from input file and contains:
start position, end position, name of entity, type
Type means a type of detected entity, e.g. SYSTEMATIC, FAMILY etc.

Return type:

list
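
To make the shape of the returned list concrete, here is a minimal sketch. It assumes, purely for illustration, that each row is tab-separated and that positions should be converted to integers; the function name parse_rows_sketch and the sample rows are hypothetical and do not come from real ChemSpot output.

```python
def parse_rows_sketch(text: str) -> list:
    """Parse rows of 'start<TAB>end<TAB>name<TAB>type' into lists."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 4:
            # Skip blank or incomplete lines.
            continue
        start, end, name, entity_type = parts[:4]
        rows.append([int(start), int(end), name, entity_type])
    return rows


# Made-up sample input in the assumed layout.
sample = "0\t6\tethanol\tTRIVIAL\n\n20\t38\t2-methylpropan-1-ol\tSYSTEMATIC\n"
rows = parse_rows_sketch(sample)
print(rows)
```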

static parse_chemspot_iob(file_path: str = '', text: str = '', encoding: str = 'utf-8') → list
process(input_text: str = '', input_file: str = '', output_file: str = '', output_file_sdf: str = '', sdf_append: bool = False, input_type: str = '', lang: str = 'eng', paged_text: bool = False, format_output: bool = True, opsin_types: list = None, standardize_mols: bool = True, convert_ions: bool = True, write_header: bool = True, iob_format: bool = False, dry_run: bool = False, csv_delimiter: str = ';', normalize_text: bool = True, remove_duplicates: bool = False, annotate: bool = True, annotation_sleep: int = 2, chemspider_token: str = '', continue_on_failure: bool = False) → collections.OrderedDict

Process the input file with ChemSpot.

Parameters:
  • input_text (str) – String to be processed by ChemSpot.
  • input_file (str) – Path to file to be processed by ChemSpot.
  • output_file (str) – File to write output in.
  • output_file_sdf (str) – File to write SDF output in. SDF is from OPSIN converted entities.
  • sdf_append (bool) – If True, append new molecules to an existing SDF file, or create a new one if it doesn’t exist. The SDF contains entities converted by OPSIN.
  • input_type (str) –
    When empty, input (MIME) type will be determined from magic bytes.
    Or you can specify “pdf”, “pdf_scan”, “image” or “text” and magic bytes check will be skipped.
  • lang (str) –
    Language that Tesseract will use for OCR. Available languages: https://github.com/tesseract-ocr/tessdata
    Multiple languages can be combined with the “+” character, e.g. “eng+bul+fra”.
  • paged_text (bool) – If True and input_type is “text” or input_text is provided, try to assign pages to chemical entities. The ASCII control character 12 (Form Feed, ‘\f’) is expected between pages.
  • format_output (bool) –
    If True, the value of “content” key of returned dict will be list of OrderedDicts.
    If True and output_file is set, the CSV file will be written.
    If False, the value of “content” key of returned dict will be None.
  • opsin_types (list) –
    List of ChemSpot entity types. Entities of types in this list will be converted with OPSIN. If you don’t want to convert entities, pass empty list.
    OPSIN is designed to convert IUPAC names to linear notation (SMILES etc.) so default value of opsin_types is [“SYSTEMATIC”] (these should be only IUPAC names).
    ChemSpot entity types: “SYSTEMATIC”, “IDENTIFIER”, “FORMULA”, “TRIVIAL”, “ABBREVIATION”, “FAMILY”, “MULTIPLE”
  • standardize_mols (bool) – If True, use molvs (https://github.com/mcs07/MolVS) to standardize molecules converted by OPSIN.
  • convert_ions (bool) – If True, try to convert ion entities (e.g. “Ni(II)”) to SMILES. Entities matching ion regex won’t be converted with OPSIN.
  • write_header (bool) – If True, output_file is set and format_output is True, write a CSV header: “smiles”, “bond_length”, “resolution”, “confidence”, “learn”, “page”, “coordinates”
  • iob_format (bool) – If True, output will be in IOB format.
  • dry_run (bool) – If True, only return list of commands to be called by subprocess.
  • csv_delimiter (str) – Delimiter for output CSV file.
  • normalize_text (bool) – If True, normalize text before performing NER. This is strongly recommended, because without normalization ChemSpot can produce unpredictable results that cannot be parsed.
  • remove_duplicates (bool) – If True, remove duplicated chemical entities. Note that some compounds can have different names but the same notation (SMILES, InChI etc.); only entities with the same name are removed. Not applicable to IOB format.
  • annotate (bool) –
    If True, try to annotate entities in PubChem and ChemSpider. Compound IDs will be assigned by searching with each identifier, separately for entity name, SMILES etc.
    If the entity already has an InChIKey, prefer it when searching.
    If “*” is present in SMILES, skip annotation.
    If a textual entity has a single result in the DB when searched by name, fill in missing identifiers (SMILES etc.).
  • annotation_sleep (int) – How many seconds to sleep between the annotation of each entity. This prevents overloading the databases.
  • chemspider_token (str) – Your personal token for accessing the ChemSpider API (needed for annotation). Create an account there to obtain it.
  • continue_on_failure (bool) –
    If True, continue running even if ChemSpot returns non-zero exit code.
    If False and an error occurs, print it and return.
Returns:

Keys:

  • stdout: str ... standard output from ChemSpot
  • stderr: str ... standard error output from ChemSpot
  • exit_code: int ... exit code from ChemSpot
  • content
    • list of OrderedDicts ... when format_output is True
    • None ... when format_output is False
  • normalized_text : str

Return type:

dict
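
The paged_text option relies on form feed characters separating pages. A minimal sketch of how a character offset could be mapped to a page number under that convention follows; the function name page_of_offset and the sample text are hypothetical, and molminer's own page assignment may work differently.

```python
from bisect import bisect_right


def page_of_offset(text: str, offset: int) -> int:
    """Return the 1-based page containing `offset`, with pages split by form feeds."""
    # Character positions of every form feed in the text.
    breaks = [i for i, ch in enumerate(text) if ch == "\f"]
    # Each form feed located at or before the offset marks one completed page.
    return bisect_right(breaks, offset) + 1


paged = "ethanol on page one\fbenzene on page two"
ethanol_page = page_of_offset(paged, paged.index("ethanol"))
benzene_page = page_of_offset(paged, paged.index("benzene"))
print(ethanol_page, benzene_page)
```

Collecting the break positions once and using a binary search keeps the lookup cheap when many entities have to be assigned to pages.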

set_options(options: dict)

Sets the options passed in dict. Keys are the same as optional parameters in ChemSpot constructor (__init__()).

Parameters: options – Dict of new options.
static version() → str
Returns: ChemSpot version.
Return type: str

molminer.Extractor module

class molminer.Extractor.Extractor(opsin_options: dict = {'allow_radicals': True, 'allow_acids_without_acid': True, 'detailed_failure_analysis': True, 'allow_uninterpretable_stereo': True, 'wildcard_radicals': False}, osra_options: dict = {'negate': False, 'size': '', 'resolution': 0, 'spelling_config_path': '', 'gray_threshold': 0.0, 'adaptive': False, 'jaggy': False, 'superatom_config_path': '', 'rotate': 0, 'unpaper': 0}, chemspot_options: dict = {'path_to_multiclass': 'multiclass.bin', 'path_to_dict': "''", 'path_to_nlp': '', 'max_memory': 8, 'path_to_ids': "''", 'path_to_crf': ''}, tessdata_path: str = '', verbosity: int = 1, verbosity_classes: int = 1)

Bases: object

Combines OSRA, ChemSpot and OPSIN to extract chemical entities from an article. These include 2D structures converted to linear notation and compounds found in text. Entities are converted to linear notation (SMILES etc.) with OPSIN (by default only IUPAC ones).

process()
chemspot_default_options = {'path_to_multiclass': 'multiclass.bin', 'path_to_dict': "''", 'path_to_nlp': '', 'max_memory': 8, 'path_to_ids': "''", 'path_to_crf': ''}
logger = <logging.Logger object>
opsin_default_options = {'allow_radicals': True, 'allow_acids_without_acid': True, 'detailed_failure_analysis': True, 'allow_uninterpretable_stereo': True, 'wildcard_radicals': False}
osra_default_options = {'negate': False, 'size': '', 'resolution': 0, 'spelling_config_path': '', 'gray_threshold': 0.0, 'adaptive': False, 'jaggy': False, 'superatom_config_path': '', 'rotate': 0, 'unpaper': 0}
process(input_file: str, output_file: str = '', output_file_sdf: str = '', sdf_append: bool = False, write_header: bool = True, separated_output: bool = False, input_type: str = '', lang: str = 'eng', use_gm: bool = True, n_jobs: int = -1, opsin_types: list = None, convert_ions: bool = True, standardize_mols: bool = True, remove_entity_duplicates: bool = False, csv_delimiter: str = ';', annotate: bool = True, annotation_sleep: int = 2, chemspider_token: str = '') → list

Process the input file with OSRA and ChemSpot. IUPAC entities found by ChemSpot are converted by OPSIN to linear notation.

Parameters:
  • input_file (str) –
  • output_file (str) – File to write output in.
  • output_file_sdf (str) – File to write SDF output in. SDF files are written separately for OSRA and OPSIN, with the suffixes “-osra.sdf” and “-opsin.sdf”.
  • sdf_append (bool) – If True, append new molecules to an existing SDF file, or create a new one if it doesn’t exist.
  • write_header (bool) – If True and if output_file is set and output_format is True, write a CSV header.
  • separated_output (bool) –
    If True, return OrderedDicts from each of OSRA, ChemSpot and OPSIN process methods.
    If True and output_file is set, separate CSV files will be written with the suffixes “.ocsr”, “.ner” and “.opsin”.
  • input_type (str) –
    Type of input file. Values: “pdf”, “pdf_scan”, “image”
    If “pdf”, embedded text will be extracted by Poppler utils (pdftotext).
    If “pdf_scan”, PDF will be converted to images and text extracted by OCR (Tesseract).
    If “image”, text will be extracted by OCR (Tesseract).
    If empty, input (MIME) type will be determined from magic bytes. Note that “pdf_scan” cannot be determined from magic bytes, because it looks like a normal PDF.
  • lang (str) –
    Language that Tesseract will use for OCR. Available languages: https://github.com/tesseract-ocr/tessdata
    Multiple languages can be combined with the “+” character, e.g. “eng+bul+fra”.
  • use_gm (bool) –
    If True, use GraphicsMagick to convert PDF to images and then process each image with OSRA. OSRA itself can handle PDF files, but some additional information is then invalid and also some structures are wrongly recognised.
  • n_jobs (int) –
    Number of jobs for parallel processing with OSRA.
    If -1 all CPUs are used.
    If 1 is given, no parallel computing code is used at all, which is useful for debugging.
    For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
  • opsin_types (list) –
    List of ChemSpot entity types. Entities of types in this list will be converted with OPSIN.
    OPSIN is designed to convert IUPAC names to linear notation (SMILES etc.) so default value of opsin_types is [“SYSTEMATIC”] (these should be only IUPAC names).
    ChemSpot entity types: “SYSTEMATIC”, “IDENTIFIER”, “FORMULA”, “TRIVIAL”, “ABBREVIATION”, “FAMILY”, “MULTIPLE”
  • convert_ions (bool) – If True, try to convert ion entities (e.g. “Ni(II)”) to SMILES. Entities matching ion regex won’t be converted with OPSIN.
  • standardize_mols (bool) – If True, use molvs (https://github.com/mcs07/MolVS) to standardize molecules.
  • remove_entity_duplicates (bool) – If True, remove duplicated chemical entities. Note that some compounds can have different names but the same notation (SMILES, InChI etc.); only entities with the same name are removed.
  • csv_delimiter (str) – Delimiter for output CSV file.
  • annotate (bool) –
    If True, try to annotate entities in PubChem and ChemSpider. Compound IDs will be assigned by searching with each identifier, separately for entity name, SMILES etc.
    If the entity already has an InChIKey, prefer it when searching.
    If “*” is present in SMILES, skip annotation.
    If a textual entity has a single result in the DB when searched by name, fill in missing identifiers (SMILES etc.).
  • annotation_sleep (int) – How many seconds to sleep between the annotation of each entity. This prevents overloading the databases.
  • chemspider_token (str) – Your personal token for accessing the ChemSpider API (needed for annotation). Create an account there to obtain it.
Returns:

  • list of OrderedDicts – Keys: “source”, “type”, “page”, “abbreviation”, “entity”, “smiles”, “inchi”, “inchikey”
  • OrderedDict, OrderedDict, OrderedDict – From OSRA, ChemSpot and OPSIN if separated_output is True.
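
The n_jobs rule described in the parameter list (the same convention joblib uses) can be written out as a small function. This is a sketch only; the name effective_jobs is hypothetical and the clamping to at least one worker is an assumption.

```python
import os


def effective_jobs(n_jobs: int, n_cpus: int = None) -> int:
    """Translate an n_jobs value into a concrete number of workers."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on (never fewer than one worker).
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs


for requested in (-1, -2, 1, 4):
    print(requested, "->", effective_jobs(requested, n_cpus=8))
```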

molminer.OPSIN module

class molminer.OPSIN.OPSIN(path_to_binary: str = 'opsin', allow_acids_without_acid: bool = True, detailed_failure_analysis: bool = True, output_format: str = 'smi', allow_radicals: bool = True, allow_uninterpretable_stereo: bool = True, opsin_verbose: bool = False, wildcard_radicals: bool = False, plural_pattern: str = None, verbosity: int = 1)

Bases: molminer.AbstractLinker.AbstractLinker

Represents the OPSIN software and acts as a linker between Python and command-line interface of OPSIN. OPSIN version: 2.2.0

OPSIN is software for converting IUPAC names to linear notation (SMILES, InChI etc.). It reads names from stdin or from an input file containing one IUPAC name per line.

More information here: http://opsin.ch.cam.ac.uk/

To show the meaning of options:

opsin = OPSIN()
print(opsin.help())  # this will show the output of "$ opsin -h"
print(opsin._OPTIONS_REAL)  # this will show the mapping between OPSIN class and real OPSIN parameters
_OPTIONS_REAL

dict – Internal dict which maps the passed options to real OPSIN command-line arguments.

options

dict – Get or set options.

options_internal

dict – Return dict with options having internal names.

path_to_binary

str – Path to OPSIN binary (JAR file).

process()

Process the input file with OPSIN.

help()

Return OPSIN help message.

PLURAL_PATTERN = re.compile('(nitrate|bromide|chloride|iodide|amine|ketoxime|ketone|oxime)s', re.IGNORECASE)
help() → str
Returns: OPSIN help message.
Return type: str
logger = <logging.Logger object>
normalize_iupac(iupac_names: typing.Union[str, list]) → typing.Union[str, list]

Normalize IUPAC names:

  • remove plurals (“nitrates” -> “nitrate”)
  • first letter lowercase (“Ammonium Nitrate” -> “ammonium nitrate”)
Parameters: iupac_names (str or list) – If str, one IUPAC name per line.
Returns:
Return type: str or list
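
The two normalization steps can be sketched with the PLURAL_PATTERN attribute shown above. The pattern is copied from this page; the helper names and the decision to lowercase the first letter of every space-separated word (inferred from the “Ammonium Nitrate” example) are assumptions.

```python
import re
from typing import List, Union

# Copied from the PLURAL_PATTERN class attribute.
PLURAL_PATTERN = re.compile(
    r"(nitrate|bromide|chloride|iodide|amine|ketoxime|ketone|oxime)s", re.IGNORECASE
)


def normalize_name(name: str) -> str:
    """Strip a plural 's' from known endings and lowercase each word's first letter."""
    name = PLURAL_PATTERN.sub(r"\1", name)
    return " ".join(word[:1].lower() + word[1:] for word in name.split(" "))


def normalize_iupac_sketch(names: Union[str, List[str]]) -> Union[str, List[str]]:
    """Accept a newline-separated string or a list and return the same kind of value."""
    if isinstance(names, str):
        return "\n".join(normalize_name(n) for n in names.split("\n"))
    return [normalize_name(n) for n in names]


from_list = normalize_iupac_sketch(["Ammonium Nitrates"])
from_str = normalize_iupac_sketch("Ketoximes\nSodium Chloride")
print(from_list)
print(from_str)
```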
process(input: typing.Union[str, list] = '', input_file: str = '', output_file: str = '', output_file_sdf: str = '', output_file_cml: str = '', sdf_append: bool = False, format_output: bool = True, opsin_output_format: str = '', output_formats: list = None, write_header: bool = True, dry_run: bool = False, csv_delimiter: str = ';', standardize_mols: bool = True, normalize_plurals: bool = True, continue_on_failure: bool = False) → collections.OrderedDict

Process the input file with OPSIN.

Parameters:
  • input (str or list) –
    str: String with IUPAC names, one per line.
    list: List of IUPAC names.
  • input_file (str) – Path to file to be processed by OPSIN. One IUPAC name per line.
  • output_file (str) – File to write output in.
  • output_file_sdf (str) – File to write SDF output in.
  • output_file_cml (str) –
    File to write CML (Chemical Markup Language) output in. opsin_output_format must be “cml”.
    Not supported by RDKit so standardization and conversion to other formats cannot be done.
  • sdf_append (bool) – If True, append new molecules to an existing SDF file, or create a new one if it doesn’t exist.
  • format_output (bool) –
    If True, the value of “content” key of returned dict will be list of OrderedDicts with keys:
    “iupac”, <output formats>, ..., “error”
    If True and output_file is set it will be created as CSV file with columns: “iupac”, <output formats>, ..., “error”
    If False, the value of “content” key of returned dict will be None.
  • opsin_output_format (str) –
    Output format from OPSIN. Temporarily overrides the option output_format set during instantiation (in __init__).
    Choices: “cml”, “smi”, “extendedsmi”, “inchi”, “stdinchi”, “stdinchikey”
  • output_formats (list) –
    If format_output is True, this specifies which molecule formats will be output.
    You can specify more than one format, but only one format from OPSIN. This format must also be set with output_format in __init__ or with opsin_output_format here.
    Default value: [“smiles”]

    Value                  Source                 Note
    smiles                 RDKit                  canonical
    smiles_opsin           OPSIN (“smi”)          SMILES
    smiles_extended_opsin  OPSIN (“extendedsmi”)  Extended SMILES. Not supported by RDKit.
    inchi                  RDKit                  Not every molecule can be converted to InChI (it doesn’t support wildcard characters etc.)
    inchi_opsin            OPSIN (“inchi”)        InChI
    stdinchi_opsin         OPSIN (“stdinchi”)     standard InChI
    inchikey               RDKit                  The same applies as for “inchi”. Also, a molecule cannot be created from an InChIKey.
    stdinchikey_opsin      OPSIN (“stdinchikey”)  Standard InChIKey. Cannot be used by RDKit to create a molecule.
    sdf                    RDKit                  If present, an additional SDF file will be created.
  • write_header (bool) – If True, output_file is set and format_output is True, write a CSV header.
  • dry_run (bool) – If True, only return list of commands to be called by subprocess.
  • csv_delimiter (str) – Delimiter for output CSV file.
  • standardize_mols (bool) – If True and format_output is also True, use molvs (https://github.com/mcs07/MolVS) to standardize molecules.
  • normalize_plurals (bool) –
    If True, normalize plurals (“nitrates” -> “nitrate”). See OPSIN.PLURAL_PATTERN for the plurals covered. You can set your own regex pattern with plural_pattern in __init__.
  • continue_on_failure (bool) –
    If True, continue running even if OPSIN returns non-zero exit code.
    If False and an error occurs, print it and return.
Returns:

Keys:

  • stdout: str ... standard output from OPSIN
  • stderr: str ... standard error output from OPSIN
  • exit_code: int ... exit code from OPSIN
  • content:
    • list of OrderedDicts ... when format_output is True. Fields: “iupac”, <output formats>, ..., “error”
    • None ... when format_output is False

Return type:

dict

set_options(options: dict)

Sets the options passed in dict. Keys are the same as optional parameters in OPSIN constructor (__init__()).

Parameters:options – Dict of new options.

molminer.OSRA module

class molminer.OSRA.OSRA(path_to_binary: str = 'osra', size: str = '', osra_verbose: bool = False, debug: bool = False, embedded_format: str = '', output_format: str = 'can', adaptive: bool = False, jaggy: bool = False, unpaper: int = 0, gray_threshold: float = 0.0, resolution: int = 300, negate: bool = False, rotate: int = 0, superatom_config_path: str = 'superatom.txt', spelling_config_path: str = 'spelling.txt', verbosity: int = 1)

Bases: molminer.AbstractLinker.AbstractLinker

Represents the OSRA software and acts as a linker between Python and command-line interface of OSRA. OSRA version: 2.1.0

OSRA is software for extracting 2D structures from various formats such as PDF and images. It recognizes the position of a structure in the source and then constructs its linear notation (SMILES, InChI etc.).

More information here: https://sourceforge.net/projects/osra/ (old website: https://cactus.nci.nih.gov/osra/)

To show the meaning of options:

osra = OSRA()
print(osra.help())  # this will show the output of "$ osra -h"
print(osra._OPTIONS_REAL)  # this will show the mapping between OSRA class and real OSRA parameters
_OPTIONS_REAL

dict – Internal dict which maps the passed options to real OSRA command-line arguments.

options

dict – Get or set options.

options_internal

dict – Return dict with options having internal names.

path_to_binary

str – Path to OSRA binary.

process()

Process the input file with OSRA.

help()

Return OSRA help message.

version()

Return OSRA version.

Notes

The --learn parameter is currently not supported, because its output is problematic to parse.

GM_COMMAND = 'gm convert -density {dpi} {input_file_path} +adjoin {trim} -quality 100 {temp_dir}/{input_file}-%d.png'
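
The GM_COMMAND template can be filled in with str.format to see the resulting GraphicsMagick call. The template is copied from the attribute above; the file names, the temporary directory and the use of “-trim” for the {trim} placeholder are assumptions made for this example.

```python
import shlex

# Copied from the GM_COMMAND class attribute.
GM_COMMAND = ("gm convert -density {dpi} {input_file_path} +adjoin {trim} "
              "-quality 100 {temp_dir}/{input_file}-%d.png")

# All values below are hypothetical.
command = GM_COMMAND.format(
    dpi=300,
    input_file_path="article.pdf",
    trim="-trim",
    temp_dir="/tmp/work",
    input_file="article",
)
# Split into an argument list suitable for subprocess.
gm_argv = shlex.split(command)
print(gm_argv)
```

The “%d” in the output name is left for GraphicsMagick to replace with the image index. Paths containing spaces would need quoting (for example with shlex.quote) before being substituted into the template.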
help() → str
Returns: OSRA help message.
Return type: str
logger = <logging.Logger object>
process(input_file: str, output_file: str = '', output_file_sdf: str = '', sdf_append: bool = False, format_output: bool = True, write_header: bool = True, osra_output_format: str = '', output_formats: list = None, dry_run: bool = False, csv_delimiter: str = ';', use_gm: bool = True, gm_dpi: int = 300, gm_trim: bool = True, n_jobs: int = -1, input_type: str = '', standardize_mols: bool = True, annotate: bool = True, chemspider_token: str = '', custom_page: int = 0, continue_on_failure: bool = False) → collections.OrderedDict

Process the input file with OSRA.

Parameters:
  • input_file (str) – Path to file to be processed by OSRA.
  • output_file (str) – File to write output in.
  • output_file_sdf (str) –
    File to write SDF output in. The “sdf” output format does not have to be in output_formats to write SDF output.
    If “sdf_osra” output format is requested, suffix “-osra.sdf” will be added.
  • sdf_append (bool) – If True, append new molecules to an existing SDF file, or create a new one if it doesn’t exist.
  • images_prefix (NOT IMPLEMENTED) – Prefix for images of extracted compounds which will be written.
  • format_output (bool) –
    If True, the value of “content” key of returned dict will be list of OrderedDicts.
    If True and output_file is set, the CSV file will be written.
    If False, the value of “content” key of returned dict will be None.
  • write_header (bool) – If True, output_file is set and format_output is True, write a CSV header.
  • osra_output_format (str) –
    Output format from OSRA. Temporarily overrides the option output_format set during instantiation (in __init__).
    Choices: “smi”, “can”, “sdf”
    If “sdf”, additional information like coordinates cannot be retrieved (not implemented yet).
  • output_formats (list) –
    If format_output is True, this specifies which molecule formats will be output.
    You can specify more than one format, but only one format from OSRA. This format must also be set with output_format in __init__ or with osra_output_format here.
    When the output produced by OSRA is unreadable by RDKit, you can at least keep that raw output from OSRA.
    Default value: [“smiles”]

    Value            Source        Note
    smiles           RDKit         canonical
    smiles_osra      OSRA (“smi”)  SMILES
    smiles_can_osra  OSRA (“can”)  canonical SMILES
    inchi            RDKit         Not every molecule can be converted to InChI (it doesn’t support wildcard characters etc.)
    inchikey         RDKit         The same applies as for “inchi”.
    sdf              RDKit         If present, an additional SDF file will be created.
    sdf_osra         OSRA (“sdf”)  If present, an additional SDF file will be created.
  • dry_run (bool) – If True, only return list of commands to be called by subprocess.
  • csv_delimiter (str) – Delimiter for output CSV file.
  • use_gm (bool) –
    If True, use GraphicsMagick to convert PDF to temporary PNG images before processing.
    If False, OSRA will use its own conversion of PDF to image.
    Using gm is more reliable, since OSRA (v2.1.0) reports wrong information when converting directly from PDF (namely coordinates, bond length and possibly more) and sometimes recognises structures incorrectly.
  • gm_dpi (int) – Resolution (DPI) of the temporary PNG images.
  • gm_trim (bool) – If True, gm will trim the temporary PNG images.
  • n_jobs (int) –
    If use_gm and input file is PDF, how many jobs to use for OSRA processing of temporary PNG images.
    If -1 all CPUs are used.
    If 1 is given, no parallel computing code is used at all, which is useful for debugging.
    For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
  • input_type (str) –
    When empty, input (MIME) type will be determined from magic bytes.
    Or you can specify “pdf” or “image” and magic bytes check will be skipped.
  • standardize_mols (bool) – If True and format_output is also True, use molvs (https://github.com/mcs07/MolVS) to standardize molecules.
  • annotate (bool) –
    If True, try to annotate entities in PubChem and ChemSpider. Compound IDs will be assigned by searching with each identifier, separately for SMILES, InChI etc.
    If the entity already has an InChIKey, prefer it when searching.
    If “*” is present in SMILES, skip annotation.
  • chemspider_token (str) – Your personal token for accessing the ChemSpider API. Create an account there to obtain it.
  • custom_page (int) – When use_gm is False, this will set the page for all extracted compounds.
  • continue_on_failure (bool) –
    If True, continue running even if OSRA returns non-zero exit code.
    If False and an error occurs, print it and return.
Returns:

Keys:

  • stdout: str ... standard output from OSRA

  • stderr: str ... standard error output from OSRA

  • exit_code: int ... exit code from OSRA

  • content:

    • list of OrderedDicts ... when format_output is True.
    • None ... when format_output is False
If osra_output_format is “sdf”, additional information like ‘bond_length’ cannot be retrieved.
If use_gm is True then stdout, stderr and exit_code will be lists containing items from each temporary image extracted by OSRA.

Return type:

dict

Notes

Only with format_output set to True you can use molecule standardization and more molecule formats. Otherwise you will only get raw stdout from OSRA (which can also be written to file if output_file is set).
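
Because exit_code is a single integer without GraphicsMagick and a list when use_gm is True, code that checks the result has to handle both shapes. A minimal sketch follows; the helper name all_succeeded and the sample result dicts are hypothetical and merely mimic the documented keys.

```python
def all_succeeded(result: dict) -> bool:
    """Return True if every exit code in the result is zero."""
    codes = result["exit_code"]
    if isinstance(codes, int):
        # Single run: wrap the code so both cases share one check.
        codes = [codes]
    return all(code == 0 for code in codes)


# Made-up results shaped like the documented return value.
single_run = {"stdout": "", "stderr": "", "exit_code": 0, "content": None}
per_image = {"stdout": ["", ""], "stderr": ["", "error"], "exit_code": [0, 1], "content": None}
print(all_succeeded(single_run), all_succeeded(per_image))
```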

set_options(options: dict)

Sets the options passed in dict. Keys are the same as optional parameters in OSRA constructor (__init__).

Parameters: options – Dict of new options.
version() → str
Returns: OSRA version.
Return type: str

molminer.cli module

molminer.cli.add_options(options)
molminer.cli.get_kwargs(options, kwargs)
molminer.cli.get_opsin_types(types)

molminer.normalize module

chemdataextractor.text.normalize

Tools for normalizing text. https://github.com/mcs07/ChemDataExtractor

copyright: Copyright 2016 by Matt Swain.
license: MIT, see LICENSE file for more details.
molminer.normalize.ACCENTS = {'´', '`'}

Accent characters.

molminer.normalize.APOSTROPHES = {'’', 'ꞌ', '＇', "'", '՚', 'Ꞌ'}

Apostrophe characters.

class molminer.normalize.BaseNormalizer

Bases: abc.ABC

Abstract normalizer class from which all normalizers inherit.

Subclasses must implement a normalize() method.

normalize(text)

Normalize the text.

Parameters: text (string) – The text to normalize.
Returns: Normalized text.
Return type: string
molminer.normalize.CONTROLS = {'\x06', '\x08', '\x03', '\x07', '\x01', '\x05', '\x04', '\x02'}

Control characters.

molminer.normalize.DOUBLE_QUOTES = {'„', '"', '‟', '“', '”'}

Double quote characters.

molminer.normalize.GREEK = {'ζ', 'η', 'κ', 'Δ', 'Ξ', 'υ', 'Η', 'Υ', 'Ι', 'Λ', 'Ε', 'δ', 'Χ', 'ψ', 'Κ', 'τ', 'Θ', 'Ω', 'θ', 'Α', 'Σ', 'ω', 'Ρ', 'π', 'Ο', 'Τ', 'ε', 'Ζ', 'Γ', 'λ', 'φ', 'Π', 'Ν', 'ρ', 'Μ', 'ι', 'ν', 'μ', 'β', 'Φ', 'γ', 'ξ', 'α', 'ο', 'Β', 'χ', 'Ψ', 'σ'}

Uppercase and lowercase Greek letters.

molminer.normalize.GREEK_WORDS = {'omega', 'Phi', 'Sigma', 'mu', 'sigma', 'kappa', 'rho', 'Chi', 'Lambda', 'phi', 'theta', 'psi', 'Omega', 'Omicron', 'iota', 'delta', 'beta', 'xi', 'upsilon', 'Xi', 'alpha', 'Nu', 'omicron', 'nu', 'Theta', 'Gamma', 'Eta', 'Mu', 'Delta', 'lamda', 'pi', 'epsilon', 'Psi', 'Kappa', 'Beta', 'zeta', 'Upsilon', 'Iota', 'tau', 'Pi', 'Rho', 'Zeta', 'chi', 'eta', 'gamma', 'Epsilon', 'Tau', 'Alpha'}

Names of Greek letters spelled out as words.

molminer.normalize.HYPHENS = {'–', '―', '‐', '‑', '‒', '⁃', '—', '-'}

Hyphen and dash characters.

molminer.normalize.MINUSES = {'-', '⁻', '−', '－'}

Minus characters.

molminer.normalize.NAME_SMALL = {'le', 'del', 'di', 'abu', 'y', 'la', 'von', 'de', 'ste', 'bin', 'st', 'vel', 'san', 'bon', 'dal', 'ibn', 'da', 'dí', 'van', 'der'}

Words that should not be capitalized in names.

molminer.normalize.NUMBERS = {'ninety', 'fifteen', 'thousand', 'thirty', 'five', 'billion', 'eighty', 'eighteen', 'twenty', 'four', 'fourteen', 'ten', 'six', 'million', 'nineteen', 'one', 'eight', 'nine', 'seven', 'fifty', 'forty', 'seventy', 'sixteen', 'twelve', 'seventeen', 'trillion', 'hundred', 'two', 'sixty', 'eleven', 'thirteen', 'three', 'zero'}

A variety of numbers, spelled out as words.

class molminer.normalize.Normalizer(form='NFKC', strip=True, collapse=True, hyphens=False, quotes=False, ellipsis=False, slashes=False, tildes=False)

Bases: molminer.normalize.BaseNormalizer

Main Normalizer class for generic English text.

Normalize unicode, hyphens, quotes, whitespace.

By default, the normal form NFKC is used for unicode normalization. This applies a compatibility decomposition, under which equivalent characters are unified, followed by a canonical composition. See Python docs for information on normal forms: http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

normalize(text)

Run the Normalizer on a string.

Parameters:text – The string to normalize.
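
The defaults described above can be illustrated with the standard library alone. The following is a minimal sketch, not molminer's implementation: the helper name nfkc_normalize is hypothetical, and the behavior attached to strip and collapse is an assumption based on the parameter names.

```python
import re
import unicodedata


def nfkc_normalize(text: str, strip: bool = True, collapse: bool = True) -> str:
    """Hypothetical stand-in for Normalizer.normalize with default options."""
    # NFKC: compatibility decomposition followed by canonical composition.
    text = unicodedata.normalize("NFKC", text)
    if collapse:
        # Assumption: "collapse" merges runs of whitespace into one space.
        text = re.sub(r"\s+", " ", text)
    if strip:
        # Assumption: "strip" removes leading and trailing whitespace.
        text = text.strip()
    return text


# The "fi" ligature (U+FB01) and the subscript two (U+2082) are
# compatibility characters, so NFKC maps them to "fi" and "2".
result = nfkc_normalize("  \ufb01ve   H\u2082O ")
```

Note that NFKC is lossy by design: the subscript in a formula such as H₂O becomes a plain digit, which is usually acceptable for text mining but not for typesetting.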
molminer.normalize.PLUSES = {'⁺', '+', '＋'}

Plus characters.

molminer.normalize.PRIMES = {'‷', '″', '‴', '′', '‵', '‶', '⁗'}

Prime characters.

molminer.normalize.QUOTES = {'’', "'", '‟', '´', '"', '”', '‘', '„', '‵', '⁗', '“', '‚', 'Ꞌ', '‶', '‷', '‛', '″', '‴', '＇', '՚', '′', 'ꞌ', '`'}

Quote characters, including apostrophes, single quotes, double quotes, accents and primes.

molminer.normalize.SINGLE_QUOTES = {'‚', '’', '‘', '‛', "'"}

Single quote characters.

molminer.normalize.SLASHES = {'⁄', '∕', '/'}

Slash characters.

molminer.normalize.SMALL = {'to', 'and', 'of', 'at', 'or', 'if', 'but', 'the', 'on', 'en', 'for', 'v', 'in', 'as', 'vs', 'via', 'an', 'by', 'a'}

Words that should not be capitalized in titles.

molminer.normalize.TILDES = {'〜', '~', '˜', '⁓', '～', '∿', '∽', '∼'}

Tilde characters.

molminer.normalize.normalize = <molminer.normalize.Normalizer object>

Default normalizer instance that canonicalizes Unicode and fixes whitespace.

molminer.normalize.strict_normalize = <molminer.normalize.Normalizer object>

More aggressive normalizer instance that also standardizes hyphens and quotes.
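
One way to picture the hyphen and quote standardization is a translation table built from the character sets documented above. This is only a sketch under assumptions: the function name standardize is hypothetical, and the choice of the ASCII hyphen-minus and the ASCII apostrophe as target characters is assumed, not taken from molminer's code.

```python
# Character sets copied from the HYPHENS and SINGLE_QUOTES constants above,
# written as escapes so that the code points are unambiguous.
HYPHENS = {"-", "\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2015", "\u2043"}
SINGLE_QUOTES = {"'", "\u2018", "\u2019", "\u201a", "\u201b"}


def standardize(text: str) -> str:
    """Map every hyphen variant to '-' and every single quote variant to "'"."""
    # Assumption: ASCII characters are the normalization targets.
    table = {ord(ch): "-" for ch in HYPHENS}
    table.update({ord(ch): "'" for ch in SINGLE_QUOTES})
    return text.translate(table)


# En dash, Unicode hyphen and right single quotation mark in the input.
result = standardize("2\u2013methyl\u2010propane isn\u2019t")
```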

molminer.utils module

class molminer.utils.Output(stdout, stderr, exit_code)

Bases: tuple

exit_code

Alias for field number 2

stderr

Alias for field number 1

stdout

Alias for field number 0
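
The "Alias for field number" entries mean that each field can be read by name or by position. A short sketch, assuming Output is declared as an ordinary collections.namedtuple with the three fields listed above:

```python
from collections import namedtuple

# Assumption: this mirrors how Output is declared in molminer.utils.
Output = namedtuple("Output", ["stdout", "stderr", "exit_code"])

out = Output(stdout="done\n", stderr="", exit_code=0)

# Fields are reachable by name, by index, or by tuple unpacking.
stdout, stderr, exit_code = out
first_field = out[0]
```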

molminer.utils.common_subprocess(commands: typing.Union[list, str], stdin: str = '', stdin_encoding: str = 'utf-8') → namedtuple

Run a shell command and return a namedtuple with its stdout, stderr and exit code.

Parameters:
  • commands (list or str) – Command and its arguments to execute in the shell, e.g. ["ls", "-a"]. If a string is given, it is split into a list.
  • stdin (str) – Text to send to the command's standard input.
  • stdin_encoding (str) – Encoding of stdin.
Returns:

Namedtuple with the fields “stdout”, “stderr” and “exit_code”.

Return type:

namedtuple
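
The documented behavior can be approximated with subprocess.run. The sketch below is a hypothetical re-implementation, not molminer's code: the name run_command, the use of shlex.split for string input, and decoding the output with the same encoding as stdin are all assumptions.

```python
import shlex
import subprocess
import sys
from collections import namedtuple
from typing import Union

Output = namedtuple("Output", ["stdout", "stderr", "exit_code"])


def run_command(commands: Union[list, str], stdin: str = "",
                stdin_encoding: str = "utf-8") -> Output:
    """Run a command and collect stdout, stderr and the exit code."""
    if isinstance(commands, str):
        # Assumption: a string is split with shell-like quoting rules.
        commands = shlex.split(commands)
    completed = subprocess.run(
        commands,
        input=stdin.encode(stdin_encoding),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Assumption: output is decoded with the same encoding as stdin.
    return Output(
        completed.stdout.decode(stdin_encoding, errors="replace"),
        completed.stderr.decode(stdin_encoding, errors="replace"),
        completed.returncode,
    )


# Use the current Python interpreter as a portable example command.
child_code = "import sys; print(sys.stdin.read().upper())"
result = run_command([sys.executable, "-c", child_code], stdin="abc")
```

Passing the command as a list avoids quoting problems with paths that contain spaces, which is why the list form is usually preferable to a single string.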

molminer.utils.dict_to_csv(dicts: list, output_file: str = '', csv_delimiter: str = ';', write_header: bool = True)
molminer.utils.eprint(*args, **kwargs)
molminer.utils.get_input_file_type(input_file: str) → str
molminer.utils.get_temp_images(temp_dir)
molminer.utils.get_text(input_file: str, input_type: str, lang: str = 'en', tessdata_prefix: str = '') → str
molminer.utils.get_text_from_image(input_file: str, lang: str = 'eng', tessdata_prefix: str = '') → str

Get text from image using Tesseract OCR.

Parameters:
  • input_file (str) – Path to the input image.
  • lang (str) –
    Language that Tesseract will use for OCR. Available languages: https://github.com/tesseract-ocr/tessdata
    Multiple languages can be specified with the “+” character, e.g. “eng+bul+fra”.
  • tessdata_prefix (str) – Path to directory with Tesseract language data. If empty, the TESSDATA_PREFIX environment variable will be used.
Returns:

Text recognized in the image.

Return type:

str
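
To make the lang and tessdata_prefix parameters concrete, the sketch below builds a Tesseract command line and environment without running it. Whether molminer calls the tesseract binary this way is not stated in this reference, so the function name build_tesseract_call, the file name and the data path are hypothetical.

```python
import os
from typing import Dict, List, Tuple


def build_tesseract_call(input_file: str, lang: str = "eng",
                         tessdata_prefix: str = "") -> Tuple[List[str], Dict[str, str]]:
    """Return the command list and environment for one OCR run."""
    # "stdout" as the output base makes Tesseract print the recognized text.
    command = ["tesseract", input_file, "stdout", "-l", lang]
    env = dict(os.environ)
    if tessdata_prefix:
        # An explicit prefix overrides the inherited TESSDATA_PREFIX.
        env["TESSDATA_PREFIX"] = tessdata_prefix
    return command, env


# Several languages are joined with "+", as described for the lang parameter.
languages = "+".join(["eng", "fra"])
command, env = build_tesseract_call("page.png", lang=languages,
                                    tessdata_prefix="/opt/tessdata")
```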

molminer.utils.get_text_from_pdf(input_file: str) → str

Get embedded text from PDF using pdftotext binary (part of poppler-utils).

Parameters:input_file (str) – Path to the PDF file.
Returns:
Return type:str
molminer.utils.get_text_from_pdf_scan(input_file: str, lang: str = 'eng', tessdata_prefix: str = '', tesseract_engine: int = 2, as_page_list: bool = False) → typing.Union[str, tempfile.TemporaryDirectory]

Get text from PDF which consists of scanned pages (images). First convert PDF to PNG images (one image per page) and then apply Tesseract OCR to get text.

Parameters:
  • input_file (str) – Path to the scanned PDF file.
  • lang (str) –
    Language that Tesseract will use for OCR. Available languages: https://github.com/tesseract-ocr/tessdata
    Multiple languages can be specified with the “+” character, e.g. “eng+bul+fra”.
    Language data files must be stored in the directory defined by the TESSDATA_PREFIX environment variable.
  • tessdata_prefix (str) – Path to directory with Tesseract language data. If empty, the TESSDATA_PREFIX environment variable will be used.
  • tesseract_engine (int) –

    OCR Engine modes:

    0 Original Tesseract only.
    1 Neural nets LSTM only.
    2 Tesseract + LSTM.
    3 Default, based on what is available.
  • as_page_list (bool) – If True, return the text as a list with one entry per page.
Returns:

Text and a TemporaryDirectory object that holds the name of the temporary directory with the converted images.
This directory is deleted when the script exits, when the TemporaryDirectory object is deleted, or when its cleanup() method is called.

Return type:

(str, TemporaryDirectory)
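
The lifetime of the returned directory follows the standard tempfile.TemporaryDirectory rules. The sketch below shows them with the standard library only; the file name is a hypothetical stand-in for a converted page image.

```python
import os
import tempfile

# Stand-in for the TemporaryDirectory object returned alongside the text.
temp_dir = tempfile.TemporaryDirectory()

# Hypothetical page image inside the temporary directory.
page_path = os.path.join(temp_dir.name, "scan-0.png")
with open(page_path, "wb") as handle:
    handle.write(b"")

existed_before = os.path.exists(page_path)

# Explicit cleanup removes the directory and everything in it.
temp_dir.cleanup()
exists_after = os.path.exists(temp_dir.name)
```

Keeping a reference to the returned object is what keeps the images on disk, so a caller that still needs the page images should hold on to it until processing is finished.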

molminer.utils.pdf_to_images(input_file_path, output_dir, gm_command='gm convert -density {dpi} {input_file_path} +adjoin {trim} -quality 100 {temp_dir}/{input_file}-%d.png', dpi=300, trim=True)
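
The default gm_command is a format template. The sketch below shows how its placeholders might be filled. How molminer renders {trim} and {input_file} is not documented here, so expanding trim to GraphicsMagick's "-trim" option and using the base name of the input path are assumptions, and render_gm_command is a hypothetical helper.

```python
import os

# Default template copied from the pdf_to_images signature.
GM_COMMAND = ("gm convert -density {dpi} {input_file_path} +adjoin {trim} "
              "-quality 100 {temp_dir}/{input_file}-%d.png")


def render_gm_command(input_file_path, output_dir, dpi=300, trim=True):
    """Fill the template; "%d" is left for GraphicsMagick to number pages."""
    return GM_COMMAND.format(
        dpi=dpi,
        input_file_path=input_file_path,
        trim="-trim" if trim else "",  # assumption: flag rendered as "-trim"
        temp_dir=output_dir,
        input_file=os.path.basename(input_file_path),  # assumption: base name
    )


rendered = render_gm_command("/data/paper.pdf", "/tmp/pages")
```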
molminer.utils.write_empty_file(file: str, csv_delimiter: str = ';', header: list = None, write_header: bool = False)

Module contents