[I2S] Molecular Structure Extraction From Documents Using Deep Learning

ML.chang 2020. 8. 31. 19:25

ABSTRACT
Chemical structure extraction from documents remains a hard problem due to both false positive identification of structures during segmentation and errors in the predicted structures.
Current approaches rely on handcrafted rules and subroutines that perform reasonably well generally, but still routinely encounter situations where recognition rates are not yet satisfactory and systematic improvement is challenging.

Complications impacting performance of current approaches include the diversity in visual styles used by various software to render structures, the frequent use of ad hoc annotations, and other challenges related to image quality, including resolution and noise.

We here present end-to-end deep learning solutions for both segmenting molecular structures from documents and for predicting chemical structures from these segmented images. This deep learning-based approach does not require any handcrafted features, is learned directly from data, and is robust against variations in image quality and style. Using the deep-learning approach described herein we show that it is possible to perform well on both segmentation and prediction of low resolution images containing moderately sized molecules found in journal articles and patents.

Introduction
For drug discovery projects to be successful, it is often crucial that newly available data are quickly processed and assimilated through high quality curation. Furthermore, an important initial step in developing a new therapeutic includes the collection, analysis, and utilization of previously published experimental data. This is particularly true for small-molecule drug discovery, where collections of experimentally tested molecules are used in virtual screening programs, quantitative structure activity/property relationship (QSAR/QSPR) analyses, or validation of physics-based modeling approaches.

Due to the difficulty and expense of generating large quantities of experimental data, many drug discovery projects are forced to rely on a relatively small pool of in-house experimental data, which in turn may make data volume a limiting factor in improving in-house QSAR/QSPR models.

One promising solution to the widespread lack of appropriate training set data in drug discovery is the amount of data currently being published. 1 Medline reports more than 2000 new life science papers published per day, 2 and this estimate does not include other literature indexes or patents that further add to the volume of newly published data. Given the high rate at which new experimental data are entering the public literature, it is increasingly important to address issues related to data extraction and curation, and to automate these processes to the greatest extent possible. One area of data curation in life sciences that continues to be difficult and time consuming is the extraction of chemical structures from publicly available sources such as journal articles and patent filings.

Most publications containing data related to small molecules do not provide the molecular structures in a computer readable format (e.g., SMILES, connection table, etc.).
Instead, authors draw the corresponding structures with computer programs and include them in the document as images of the resulting drawings.

Publishing documents with only images of structures necessitates the manual redrawing of the structures in chemical sketching software as a means of converting the structures into computer readable formats for use in downstream computation and analysis.

Redrawing chemical structures can be time consuming and often requires domain knowledge to adequately resolve ambiguities, interpret variations in style, and decide how annotations should be included or ignored.

Solutions for automatic structure recognition have been described previously. 3-9 These methods utilize sophisticated rules that perform well in many situations, but can experience degradation in output quality under commonly encountered conditions, especially when input resolution is low or image quality is poor.

One of the challenges to improving current extraction rates is that rule-based systems are necessarily highly interdependent and complex, making further improvements difficult.

Furthermore, rule-based approaches can be demanding to build and maintain because they require significant domain expertise and require contributors to anticipate and codify rules for all potential scenarios the system might encounter.

Developing hand-coded rules is particularly difficult in chemical structure extraction, where a wide variety of styles and annotations are used and input quality is not always consistent. The goal of this work is twofold: 1) demonstrate it is possible to develop an extraction method to go from input document to SMILES without requiring the implementation of hand-coded rules or features; and 2) further demonstrate it is possible to improve prediction accuracy on low quality images using such a system.

Deep learning and other data-driven technologies are becoming increasingly widespread in life sciences, particularly in drug discovery and development. 10-13 In this work we leverage recent advances in image processing, sequence generation, and computing over latent representations of chemical structures to predict SMILES for molecular structure images.

The method reported here takes an image or PDF and performs segmentation using a convolutional neural network. SMILES are then generated using a convolutional neural network in combination with a recurrent neural network (encoder-decoder) in an end-to-end fashion (meaning, the architecture computes SMILES directly from raw images).

We report results based on preliminary findings using our deep learning-based method and provide suggestions for potential improvements in future iterations.

Using a downsampled version of a published dataset, we show that our deep learning method performs well under low quality conditions, and may operate on raw image data without hand-coded rules or features.

Related Work
Automatic chemical structure extraction is not a new idea. Park et al., 3 McDaniel et al., 5 Sadawi et al., 6 Valko & Johnson, 7 and Filippov & Nicklaus 8 each utilize various combinations of image processing techniques, optical character recognition, and hand-coded rules to identify lines and characters in a page, then assemble these components into molecular connection tables. Similarly, Frasconi et al. 9 utilize low level image processing techniques to identify molecular components but rely on Markov logic to assemble the components into complete structures. Park et al. 4 demonstrated the benefits of ensembling several recognition systems together in a single framework for improved recognition rates. Currently available methods rely on low-level image processing techniques (edge detectors, vectorization, etc.) in combination with subcomponent recognition (character and bond detection) and high-level rules that arrange recognized components into their corresponding structures.

There are continuing challenges, however, that limit the usefulness of currently available methods. As discussed in Valko & Johnson, 7 there are many situations in the literature where designing specific rules to handle inputs becomes quite challenging.

Some of these include wavy bonds, lines that overlap (e.g., bridges), and ambiguous atom labels. Apart from complex, ambiguous, or uncommon representations, there are other challenges that currently impact performance, including low resolution or noisy images.
Currently available solutions require relatively high resolution input, e.g., 300+ dpi, 5,19 and tolerate only small amounts of noise.

Furthermore, rule-based systems can be difficult to improve due to the complexity and interconnectedness of the various recognition components. Changing a heuristic in one area of the algorithm can impact and require adjustment in another area, making it difficult to improve components to fit new data while simultaneously maintaining or improving generalizability of the overall system.

Apart from accuracy of structure prediction, filtering of false positives during the structure extraction process also remains problematic.
Current solutions rely on users to manually filter out results, choosing which predicted structures should be ignored if tables, figures, etc. are predicted to be molecules.
To improve both extraction and prediction of structures in a wide variety of source materials, particularly with noisy or low quality input images, it is important to explore alternatives to rule-based systems. The deep learning-based method outlined herein provides (i) improved accuracy for poor quality images and (ii) a built-in mechanism for further improvement through the addition of new training data.

Deep Learning Method
The deep learning model architectures for segmentation and structure prediction described in this work are depicted in Figure 1. The segmentation model identifies and extracts chemical structure images from input documents, and the structure prediction model generates a computer-readable SMILES string for each extracted chemical structure image.

Figure 1. In subpanel (A) we depict the segmentation model architecture and in subpanel (B) we depict the structure prediction model architecture. For brevity, similar layers are condensed with a multiplier indicating how many layers are chained together, e.g., “2X”. All convolution (conv) layers are followed by a parametric ReLU activation function. “Pred Char” represents the character predicted at a particular step and “Prev Char” stands for the predicted character at the previous step. Computation flow through the diagrams is left to right (and top to bottom in (B)).

Segmentation

When presented with a document containing chemical structures, the initial step in the extraction pipeline is to identify the structures and segment them from the rest of the input.

Successful segmentation is important to (i) provide the cleanest possible input for accurate sequence prediction, and (ii) exclude objects that are not structures but contain similar features, e.g., charts, graphs, logos, and annotations.

Segmentation herein utilizes a deep convolutional neural network to predict which pixels in input images are likely to be part of a chemical structure.

In designing the segmentation model we followed the “U-Net” design strategy outlined in Ronneberger et al., which is especially well suited for full-resolution detection at the top of the network and enables fine-grained segmentation of structures in the experiments reported here.

The U-Net supports full-resolution segmentation by convolving (with pooling) the input to obtain a latent representation, then upsampling the latent representation using deconvolution with skip-connections until the output is at a resolution that matches that of the input.

In our experiments, the inputs to our model were preprocessed to be grayscale and downsampled to approximately 60 dpi resolution.
We found 60 dpi input resolution to be sufficient for segmentation while providing significant speed improvements versus higher resolutions. We fed the preprocessed inputs into our implementation of the U-Net, and the logits generated at the top of the network were scaled using a softmax activation, which provided a predicted probability between 0 and 1.0 for each pixel, indicating the likelihood that the pixel belonged to a structure.
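
The scaling step can be illustrated with a minimal sketch in plain Python. It assumes the network emits two logits per pixel (background and structure); that class layout, the function names, and the `target_size` helper are illustrative assumptions, not details given in the text.

```python
import math

def target_size(size_px, source_dpi, target_dpi=60):
    """Pixel length after rescaling from source_dpi to target_dpi."""
    return max(1, round(size_px * target_dpi / source_dpi))

def pixel_softmax(logits):
    """Numerically stable softmax over the class logits of a single pixel."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def structure_probability_map(logit_map, structure_class=1):
    """Map an H x W grid of per-pixel logit lists to structure probabilities.

    Assumption: index `structure_class` holds the "is a structure" logit.
    """
    return [[pixel_softmax(pixel)[structure_class] for pixel in row]
            for row in logit_map]
```

For instance, `target_size(2550, 300)` gives the width in pixels of a 2550-pixel-wide page scanned at 300 dpi once it is reduced to 60 dpi.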

The predicted pixels formed masks generated at the same resolution as the original input images and allowed for sufficiently fine-grained extraction.
The masks obtained from the segmentation model were binarized to remove low confidence pixels, and contiguous areas of pixels that were not large enough to contain a structure were removed.
To remove areas too small to be structures, we counted the number of pixels in a contiguous area and deemed the area a non-structure if the number of pixels was below a threshold.
We also tested the removal of long, straight horizontal and vertical lines in the input image using the Hough transform.

Line removal improved mask quality in many cases, especially in tables where structures were very close to grid lines, and was included in the final model.

Individual entities (a single, contiguous group of positively predicted pixels) in the refined masks were assumed to contain single structures and were used to crop structures from the original inputs, resulting in a collection of individual structure images.
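
A rough sketch of the mask refinement described above (binarize, drop small contiguous areas, crop the rest) follows. The threshold values, the use of 4-connectivity, and the function names are assumptions made for illustration; the text does not state them, and the Hough-based line removal is not shown.

```python
from collections import deque

def refine_mask(prob_mask, prob_threshold=0.5, min_pixels=4):
    """Binarize a probability mask and return bounding boxes of large regions.

    Boxes are (top, left, bottom, right) with exclusive bottom/right.
    Both thresholds are illustrative assumptions.
    """
    height, width = len(prob_mask), len(prob_mask[0])
    binary = [[p >= prob_threshold for p in row] for row in prob_mask]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Areas below the pixel-count threshold are treated as non-structures.
            if len(pixels) >= min_pixels:
                ys = [p[0] for p in pixels]
                xs = [p[1] for p in pixels]
                boxes.append((min(ys), min(xs), max(ys) + 1, max(xs) + 1))
    return boxes

def crop(image, box):
    """Cut one bounding box out of a 2D image stored as nested lists."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]
```

Each returned box corresponds to one contiguous entity, which matches the assumption in the text that every entity contains a single structure.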

During inference we observed qualitatively better masks when generating several masks at different resolutions and averaging the masks together into a final mask used to crop out structures. Averaged masks were obtained by scaling inputs to each resolution within the range 30 to 60 dpi in increments of 3 dpi, then generating masks for each image using the segmentation model. The resulting masks were scaled to the same resolution (60 dpi) and averaged together. The averaged masks were then scaled to the original input resolution (usually 300 dpi) and then used to crop out individual structures. Figure 2 shows an example journal article page along with its predicted mask.
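
The averaging procedure can be sketched as below. Here `predict_mask` is a hypothetical callable standing in for the trained segmentation model, and nearest-neighbour resizing is an assumption chosen to keep the example dependency-free; the text does not say which interpolation was used.

```python
def resize_nearest(grid, new_h, new_w):
    """Nearest-neighbour resize of a 2D grid stored as nested lists."""
    old_h, old_w = len(grid), len(grid[0])
    return [[grid[r * old_h // new_h][c * old_w // new_w]
             for c in range(new_w)]
            for r in range(new_h)]

def averaged_mask(page, page_dpi, predict_mask,
                  resolutions=range(30, 61, 3), common_dpi=60):
    """Average masks predicted at several resolutions, then rescale to the page size.

    `predict_mask` (hypothetical) maps an image grid to a same-sized mask grid.
    """
    page_h, page_w = len(page), len(page[0])
    common_h = max(1, round(page_h * common_dpi / page_dpi))
    common_w = max(1, round(page_w * common_dpi / page_dpi))
    total = [[0.0] * common_w for _ in range(common_h)]
    count = 0
    for dpi in resolutions:  # 30, 33, ..., 60 dpi
        h = max(1, round(page_h * dpi / page_dpi))
        w = max(1, round(page_w * dpi / page_dpi))
        mask = predict_mask(resize_nearest(page, h, w))
        # Bring every mask to the same resolution before accumulating.
        mask = resize_nearest(mask, common_h, common_w)
        for r in range(common_h):
            for c in range(common_w):
                total[r][c] += mask[r][c]
        count += 1
    mean = [[value / count for value in row] for row in total]
    # Scale the averaged mask back to the original input resolution.
    return resize_nearest(mean, page_h, page_w)
```

With the default arguments the loop visits eleven resolutions, so each page costs eleven forward passes of the segmentation model.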

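Figure 2. An example showing the output of the segmentation model when processing a journal article page from Salunke et al. All text and other extraneous items are completely removed, with the exception of a few faint lines that are amenable to automated post-processing.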

Structure Prediction
The images of individual structures obtained using the segmentation model were automatically transcribed into the corresponding SMILES sequences representing the contained structures using another deep neural network.

The purpose of this network was to take an image of a single structure and, in an end-to-end fashion, predict the corresponding SMILES string of the structure contained in the image.

The network comprised an encoder-decoder strategy where structure images were first encoded into a fixed-length latent space (state vector) using a convolutional neural network and then decoded into a sequence of characters using a recurrent neural network.

The convolutional network consisted of alternating layers of 5x5 convolutions, 2x2 max-pooling, and a parameterized ReLU activation function, with the overall network conceptually similar to the design outlined in Krizhevsky et al. but without the final classification layer. To help mitigate issues in which small but important features are lost during encoding, our network architecture did not utilize any pooling method in the first few layers of the network.

The state vector obtained from the convolutional encoder is passed into a decoder to generate a SMILES sequence. The decoder consisted of an input projection layer, three layers of GridLSTM cells, 23 an attention mechanism (implemented similarly to the soft attention method described in Xu et al. 24 and Bahdanau et al., 25 and the global attention method in Luong et al. 15), and an output projection layer. The state vector from the encoder was used to initialize the GridLSTM cell states and the SMILES sequence was then generated a character at a time, similar to the decoding method described in Sutskever et al. 26 (wherein sentences were generated a word at a time while translating English to French).

Decoding is started by projecting a special start token into the GridLSTM (initialized by the encoder and conditioned on an initial context vector as computed by the attention mechanism), processing this input in the cell, and predicting the first character of the output sequence. Subsequent characters are produced similarly, with each prediction conditioned on the previous cell state, the current attention, and the previous output projected back into the network. The logits vector for each character produced by the network is of length N, where N is the number of available characters (65 characters in this case). A softmax activation is applied to the logits to compute a probability distribution over characters, and the highest scoring character is selected for a particular step in the sequence. Sequences were generated until a special end-of-sequence token was predicted, at which point the completed SMILES string was returned. During inference we found accuracy improved when predicting images at several different (low) resolutions and returning sequences of the highest confidence, which was determined by multiplying together the softmax output of each predicted character in the sequence.
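
The character selection and confidence scoring can be illustrated with a simplified sketch. It assumes the per-step logits are already available as a list, whereas in the actual model each step depends on the previous prediction; the `<eos>` token name, the inclusion of the end token in the product, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one step's logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(step_logits, alphabet, end_token="<eos>"):
    """Pick the highest-scoring character per step until the end token appears.

    Returns the decoded string and a confidence equal to the product of the
    softmax probabilities of the selected characters (end token included,
    which is an assumption).
    """
    chars = []
    confidence = 1.0
    for logits in step_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        confidence *= probs[best]
        if alphabet[best] == end_token:
            break
        chars.append(alphabet[best])
    return "".join(chars), confidence

def most_confident(candidates):
    """Choose the (smiles, confidence) pair with the highest confidence."""
    return max(candidates, key=lambda pair: pair[1])
```

Running `greedy_decode` once per input resolution and passing the results to `most_confident` mirrors the multi-resolution selection described above. For long sequences, summing log-probabilities instead of multiplying probabilities would avoid numerical underflow.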

The addition of an attention mechanism in the decoder helped solve several challenges. Most importantly, attention enabled the decoder to access information produced earlier in the encoder and minimized the loss of important details that may otherwise be overly compressed when encoding the state vector.

Additionally, attention enabled the decoder to reference information closer to the raw input during the prediction of each character and was important considering the significance of pixelwise features in low resolution structure images.

See Figure 3 for an example of the computed attention and how the output corresponds to various characters recognized during the decoding process. Apart from using attention for improved performance, the attention output is useful for repositioning a predicted structure into an orientation that better matches the original input image.
This is done by converting the SMILES into a connection table using the open source Indigo toolkit, and repositioning each atom in 2D space according to the coordinates of each character's computed attention. The repositioned structure then more closely matches the original positioning and orientation in the input image, enabling users to more easily identify and correct mistakes when comparing the output with the original source.
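
One plausible way to turn a character's attention map into a 2D coordinate is an attention-weighted centroid, sketched below. The text does not specify how coordinates are derived from the attention, so the centroid is an assumption, and the Indigo conversion step is deliberately not shown.

```python
def attention_centroid(attention_map):
    """Return the attention-weighted (x, y) centre of a 2D heat map.

    x indexes columns and y indexes rows of the map.
    """
    total = sum(sum(row) for row in attention_map)
    if total == 0:
        raise ValueError("attention map has no mass")
    x = sum(col * weight
            for row in attention_map
            for col, weight in enumerate(row)) / total
    y = sum(r * weight
            for r, row in enumerate(attention_map)
            for weight in row) / total
    return x, y
```

Only characters that correspond to atoms would receive coordinates this way; characters such as brackets or bond symbols have attention maps but no atom to place.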

Figure 3. Heat maps depicting the attention computed during character prediction, shown left to right and top to bottom in the order: [, @, O, /, =, N.

The complete encoder-decoder framework is fully differentiable and was trained end-to-end using a suitable form of backpropagation, enabling SMILES to be fully generated using only raw images as input.
During decoding, SMILES were generated a character at a time, from left to right. Additionally, no external dictionary was used for chemical abbreviations (superatoms); rather, these were learned as part of the model. Images may therefore contain superatoms while the SMILES are still generated a character at a time.
This model operates on raw images and directly generates chemically valid SMILES with no explicit subcomponent recognition required.

Datasets
Segmentation Dataset
To our knowledge, no dataset addressing molecular structure segmentation has been published. To provide sufficient data to train a neural network while minimizing manual effort required to curate such a dataset, we developed a pipeline for automatically generating segmentation data. To programmatically generate data, in summary, the following steps were performed: i) remove structures from journal and patent pages, ii) overlay structures onto the pages, iii) produce a ground truth mask identifying the overlaid structures, and iv) randomly crop images from the pages containing structures and the corresponding mask.

In detail, OSRA was utilized to identify bounding boxes of candidate molecules within the pages of a large number of publications, both published journal articles and patents. The regions expected to contain molecules were whited out, thus leaving pages without molecules. OSRA was not always correct in finding structures, and occasionally non-structures (e.g., charts) were removed, suggesting that cleaner input may further improve model performance. Next, images of molecules made publicly available by the United States Patent and Trademark Office (USPTO) 28 were randomly overlaid onto the pages while ensuring that no structures overlapped with any non-white pixels. Structure images were occasionally perturbed using affine transformations, changes in background shade, and/or lines added around the structure (to simulate table grid lines). We also generated the true mask for each overlaid page; these masks were zero-valued except where pixels were part of a molecule (pixels assigned a value of 1). During training, samples of 128x128 pixels were randomly cropped from the overlaid pages and masks and arranged into mini-batches for feeding into the network; example image-mask pairs are shown in Figure 4.
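
The overlay and cropping steps can be sketched as follows, using nested lists of grayscale values as stand-ins for page images. The function names are hypothetical, and treating only the non-white pixels of a pasted structure as mask pixels is an assumption about how "part of a molecule" was defined; perturbations such as affine transforms are not shown.

```python
import random

WHITE = 255

def try_overlay(page, mask, structure, top, left):
    """Paste `structure` at (top, left) if the target region is in bounds and all white.

    Updates `page` and `mask` in place and returns True on success.
    """
    h, w = len(structure), len(structure[0])
    if top < 0 or left < 0 or top + h > len(page) or left + w > len(page[0]):
        return False
    # Refuse to overlap any existing non-white pixel.
    if any(page[top + r][left + c] != WHITE
           for r in range(h) for c in range(w)):
        return False
    for r in range(h):
        for c in range(w):
            page[top + r][left + c] = structure[r][c]
            # Ground truth: 1 where the pasted pixel is ink, 0 elsewhere (assumption).
            mask[top + r][left + c] = 1 if structure[r][c] != WHITE else 0
    return True

def random_crop_pair(page, mask, size=128, rng=random):
    """Cut the same random size x size window out of a page and its mask."""
    top = rng.randrange(len(page) - size + 1)
    left = rng.randrange(len(page[0]) - size + 1)

    def window(grid):
        return [row[left:left + size] for row in grid[top:top + size]]

    return window(page), window(mask)
```

Because the page and mask are cropped with the same offsets, every training sample keeps its pixels aligned with its labels.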

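Figure 4. Examples sampled from the generated segmentation dataset. Inputs to the network are shown on the left and the corresponding masks used in training are shown on the right. White indicates pixels that are part of a chemical structure.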

Molecular Image Dataset
An important goal of this work was to improve recognition of low resolution or poor quality images.
Utilizing training data that is of too high quality or too clean could negatively impact the generalizability of the final model. Arguably, the network should be capable of becoming invariant to image quality when trained explicitly on both high and low quality examples, at the expense of more computation and likely more data. However, we opted to handle quality implicitly by scaling all inputs down considerably.

To illustrate the impact of scaling images of molecular structures, consider two structures in Figure 5 that are chemically identical but presented with different levels of quality. The top-left image is of fairly high quality apart from some perforation throughout the image. In the top-right image the perforation is much more pronounced with some bonds no longer continuous (small breaks due to the excessive noise).
When these images are downsampled significantly using bilinear interpolation, the images appear similar and it is hard to differentiate which of the two began as a lower quality image, apart from one being darker than the other.

Chemical structures vary significantly in size, ranging from small fragments with just a few atoms to very large structures, including natural products or peptide sequences.
This necessitates using an image size that is small enough to keep training computationally tractable, but large enough to fully fit reasonably sized structures, i.e., drug-like small molecules, into the image. Although structures themselves can be large, the individual atoms and their relative connectivity may be contained within a small number of pixels regardless of the size of the overall structure. Thus, scaling an image too aggressively will result in important information being lost. To train a neural network that can work with both low and high quality images, we utilized an image size of 256x256 and scaled images to fit within this size constraint (resulting in bond lengths of approximately 3-12 pixels). Training a neural network over higher resolution images is an interesting research direction and may improve results, but was left for future work.

화학 구조는 크기가 크게 다르며 일부는 원자가 몇 개 밖에없는 작은 조각부터 천연물 또는 펩타이드 서열을 포함한 매우 큰 구조까지 다양합니다.
이렇게하려면 너무 크지 않아서 학습하기에 너무 계산 집약적이지 않지만 합리적으로 완전히 맞출 수있을만큼 충분히 큰 이미지 크기를 사용해야합니다.
크기의 구조, 즉 약물과 같은 작은 분자를 이미지에 삽입합니다. 구조 자체가 클 수 있지만 개별 원자와 상대적 연결은 전체 구조의 크기에 관계없이 적은 수의 픽셀 내에 포함될 수 있습니다. 따라서 이미지를 너무 공격적으로 조정하면 중요한 정보가 손실 될 수 있습니다. 저품질 이미지와 고품질 이미지 모두에서 작동 할 수있는 신경망을 훈련시키기 위해 256x256의 이미지 크기를 사용하고이 크기 제한 (결합
길이는 약 3-12 픽셀 범위가됩니다.) 고해상도 이미지를 통해 신경망을 훈련하는 것은 흥미로운 연구 방향이며 결과를 개선 할 수 있지만 향후 작업을 위해 여기에 남겨 두었습니다.
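
As a rough illustration of the size constraint, the sketch below computes the scale factor needed to fit an image inside 256x256 and the resulting bond length in pixels. The helper names and the choice never to upscale small images are assumptions for this example; the source only states the target size and the approximate 3-12 pixel bond length range.

```python
def fit_scale(width, height, target=256):
    """Scale factor that fits a width x height image inside target x target.

    Assumption: images already smaller than the target are left at their
    original size (factor capped at 1.0).
    """
    return min(target / width, target / height, 1.0)


def scaled_bond_length(bond_px, width, height, target=256):
    """Bond length in pixels after the image has been scaled to fit."""
    return bond_px * fit_scale(width, height, target)


# A hypothetical 1024x512 source image with 40-pixel bonds.
print(fit_scale(1024, 512))               # 0.25
print(scaled_bond_length(40, 1024, 512))  # 10.0
```

In this hypothetical case the bonds end up 10 pixels long, inside the 3-12 pixel range quoted above; a much larger source structure would push bond lengths toward the lower end of that range.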

To ensure that the training data contained a variety of molecular image styles we used three separate datasets, each sampled uniformly during training. Additionally, we focused on drug-like molecules and imposed the following restrictions while preparing data and training the model:
● Structures with a SMILES length of 21-100 characters (a range that covers most drug-like small molecules) were included, all others removed.
● Attachment placeholders of the format R1, R2, R3, etc., were included but other forms of placeholder notation were not included.
● Enumeration fragments were not predicted or attached to the parent structure.
● Salts were removed and each image was assumed to contain only one structure.
● All SMILES were kekulized and canonicalized.

훈련 데이터에 다양한 분자 이미지 스타일이 포함되어 있는지 확인하기 위해 각각 훈련 중에 균일하게 샘플링 된 세 개의 개별 데이터 세트를 사용했습니다. 또한 약물과 유사한 분자에 초점을 맞추고 데이터를 준비하고 모델을 훈련하는 동안 다음과 같은 제한 사항을 적용했습니다.
● SMILES 길이가 21-100자인 구조 (대부분의 약물과 같은 작은
분자)가 포함되었고 나머지는 모두 제거되었습니다.
● R1, R2, R3 등 형식의 첨부 파일 자리 표시자가 포함되었지만 다른 형식의
자리 표시 자 표기가 포함되지 않았습니다.
● 열거 조각이 예측되지 않았거나 상위 구조에 연결되지 않았습니다.
● 소금이 제거되고 각 이미지에 하나의 구조 만 포함 된 것으로 가정되었습니다.
● 모든 SMILES가 케 쿨링되고 표준화되었습니다.
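
Two of the restrictions listed above (the SMILES length window and salt removal) can be sketched without a cheminformatics toolkit. The helper names are hypothetical, and keeping the longest '.'-separated component is only an assumed heuristic for salt removal; the source does not say how salts were stripped, nor whether the length filter was applied before or after that step. Kekulization and canonicalization require a toolkit such as Indigo and are not shown.

```python
MIN_LEN, MAX_LEN = 21, 100  # SMILES length window stated in the text


def strip_salts(smiles):
    """Keep the longest '.'-separated component (assumed salt-removal heuristic)."""
    return max(smiles.split("."), key=len)


def keep_structure(smiles):
    """True if the SMILES length falls inside the 21-100 character window."""
    return MIN_LEN <= len(smiles) <= MAX_LEN


def prepare(smiles_list):
    """Strip salts first, then apply the length filter (ordering is an assumption)."""
    cleaned = (strip_salts(s) for s in smiles_list)
    return [s for s in cleaned if keep_structure(s)]


examples = [
    "CCO",                           # too short, removed
    "CC(=O)Oc1ccccc1C(=O)O.[Na+]",   # salt stripped, 21 characters remain
    "C" * 101,                       # too long, removed
]
print(prepare(examples))
```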

The first utilized dataset consisted of a 57 million molecule subset of molecules available in the PubChem database 29, rendered into images of 256x256 pixels of various styles (bond thicknesses, character sizes, etc.) using Indigo. PubChem structures were available in InChI format and were converted to SMILES using Indigo. To evaluate performance of the model during training, the dataset was split into train/validation subsets; 90% of the dataset was used to train the model and the remaining 10% was reserved for validation.

The second dataset comprised 10 million images rendered using Indigo in OS X. Because Indigo rendering output can vary significantly between operating systems, we included these images to supplement training with additional image styles.

The third dataset consisted of 1.7 million image/molecule pairs curated from data made publicly available by the USPTO.

처음 사용 된 데이터 세트는 Indigo를 사용하여 256x256 픽셀의 다양한 스타일 (결합 두께, 문자 크기 등)의 이미지로 렌더링 된 PubChem 데이터베이스 29에서 사용 가능한 분자의 5,700 만 분자 하위 집합으로 구성되었습니다. PubChem 구조는 InChI 형식으로 제공되었으며 SMILES로 변환되었습니다.
인디고를 사용합니다. 훈련 중 모델의 성능을 평가하기 위해 데이터 세트를
훈련 / 검증 부분 집합; 데이터 세트의 90 %는 모델 학습에 사용되었으며 나머지 10 %는 검증을 위해 예약되었습니다.

두 번째 데이터 세트는 OS X에서 Indigo를 사용하여 렌더링 된 1,000 만 개의 이미지로 구성되었습니다. Indigo 렌더링 출력은 운영 체제마다 크게 다를 수 있으므로 추가 이미지 스타일로 훈련을 보완하기 위해 이러한 이미지를 포함했습니다.

세 번째 데이터 세트는 USPTO에서 공개 한 데이터에서 큐 레이트 된 170 만 개의 이미지 / 분자 쌍으로 구성되었습니다.
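
The train/validation split and the sampling across datasets can be sketched as follows. This reads "each sampled uniformly during training" as picking one of the datasets with equal probability for every example, which is an interpretation rather than something the source spells out; the function names, seed handling, and toy data are likewise assumptions.

```python
import random


def split_dataset(items, train_fraction, seed=0):
    """Shuffle deterministically, then split into (train, validation)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def sample_batch(datasets, batch_size, rng):
    """Pick a dataset uniformly at random per example, then an example within it."""
    return [rng.choice(rng.choice(datasets)) for _ in range(batch_size)]


# Toy stand-ins for the three datasets (real sizes are in the millions).
indigo_train, indigo_val = split_dataset(range(1000), 0.90)
uspto_train, uspto_val = split_dataset(range(1000, 1100), 0.75)
osx_images = list(range(2000, 2200))

rng = random.Random(42)
batch = sample_batch([indigo_train, osx_images, uspto_train], 8, rng)
print(len(indigo_train), len(indigo_val), len(uspto_train), len(uspto_val))
```

Under this reading, examples from a small dataset are drawn far more often than examples from a large one, which matters for the overfitting discussion later in the text.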

Many of these images contain extraneous text and labels and were preprocessed to remove non-structure elements before training. For some files in the USPTO dataset, we observed that Indigo does not correctly retain stereochemistry when converting MOL format into canonical SMILES, which resulted in some SMILES not containing identical stereochemistry to that in images.

이러한 이미지 중 상당수는 관련없는 텍스트와 레이블을 포함하며 훈련 전에 비 구조적 요소를 제거하기 위해 사전 처리되었습니다. USPTO 데이터 세트의 일부 파일에 대해 MOL 형식을 표준 SMILES로 변환 할 때 Indigo가 입체 화학을 올바르게 유지하지 않아 일부 SMILES가 이미지와 동일한 입체 화학을 포함하지 않는 것으로 나타났습니다.

Results may improve with cleaner data. Similar to the Indigo set, the USPTO dataset was split into training and validation portions; 75% of the set was used for training, with 25% reserved for validation.

Apart from the training and validation sets just described, two additional test sets were utilized in evaluating the performance of the method. The first is the dataset published with Valko & Johnson 7 (the Valko dataset), consisting of 454 images of molecules cropped from literature.

The Valko dataset is interesting because it contains complicated molecules with challenging features such as bridges, stereochemistry, and a variety of superatoms.

The second dataset consists of a proprietary collection of image-SMILES pairs from 47 published articles and 5 patents.

The molecules in the proprietary dataset are drug-like, and some of the images contain small amounts of extraneous artifacts, e.g., surrounding text, compound labels, lines from an enclosing table, etc. This dataset was used to evaluate overall method effectiveness on examples extracted for use in real drug discovery projects.

With the focus of this work being on low quality/resolution images, rather than predicting high resolution images, we tested our method on downsampled versions of the Valko and proprietary datasets. During the segmentation phase each Valko dataset image was downsampled to 5-10% of its original size, and during the sequence prediction phase, images were downsampled to 10-22.5%, with similar scaling performed on the proprietary dataset. These scale ranges were chosen so that the resolution used during prediction approximately matched the (low) resolutions of the images used during training.

더 깨끗한 데이터로 결과가 향상 될 수 있습니다. Indigo 세트와 유사하게 데이터 세트는 훈련 및 검증 부분으로 분할되었습니다. 세트의 75 %는 검증을 위해 예약 된 25 %로 훈련에 사용되었습니다.

방금 설명한 훈련 및 검증 세트 외에도 방법의 성능을 평가하는 데 두 개의 추가 테스트 세트가 사용되었습니다. 첫 번째는 문헌에서 자른 454 개의 분자 이미지로 구성된 Valko & Johnson7 (Valko 데이터 세트)에서 게시 한 데이터 세트입니다.

Valko 데이터 세트는 브리지, 입체 화학 및 다양한 슈퍼 원자와 같은 까다로운 기능을 가진 복잡한 분자를 포함하고 있기 때문에 흥미 롭습니다.

두 번째 데이터 세트는 47 개의 게시 된 기사와 5 개의 특허에서 가져온 이미지 -SMILES 쌍의 독점 컬렉션으로 구성됩니다.

독점 데이터 세트의 분자는 약물과 유사하며 일부 이미지에는 주변 텍스트, 복합 라벨, 둘러싸는 테이블의 줄 등과 같은 작은 양의 외부 아티팩트가 포함되어 있으며 사용을 위해 추출 된 예제에서 전체 방법 효율성을 평가하는 데 사용되었습니다. 실제 신약 개발 프로젝트에서.

이 작업의 초점은 고해상도 이미지를 예측하는 것이 아니라 저품질 / 해상도 이미지에 초점을 맞추고 Valko 및 독점 데이터 세트의 다운 샘플링 된 버전에서 방법을 테스트했습니다. 세분화 단계에서 각 Valko 데이터 세트 이미지는 원래 크기의 5-10 %로 다운 샘플링되었습니다.
시퀀스 예측 단계에서 이미지는 10-22.5 %로 다운 샘플링되었으며 독점 데이터 세트에서 유사한 스케일링이 수행되었습니다. 이러한 스케일 범위는 예측 중에 사용 된 해상도가 훈련 중에 사용 된 이미지의 (낮은) 해상도와 거의 일치하도록 선택되었습니다.

A list of 65 characters containing all the unique characters in the Indigo dataset was assembled. These characters served as the list of available characters that can be selected at each SMILES decoding step.
Four of these characters are special tokens that are not part of SMILES notation but were necessary for successfully implementing the model. The special tokens indicate the beginning or end of a sequence, replace unknown characters, or pad sequences that were shorter than the maximum length (during training and testing 100 characters were generated for each input and any characters generated after the end-of-sequence token were ignored).

Indigo 데이터 세트의 모든 고유 문자를 포함하는 65 자 목록이 조합되었습니다. 이러한 문자는 각 SMILES 디코딩 단계에서 선택할 수있는 사용 가능한 문자 목록으로 사용되었습니다.
이러한 문자 중 4 개는 SMILES 표기법의 일부는 아니지만 모델을 성공적으로 구현하는 데 필요한 특수 토큰입니다. 특수 토큰은 시퀀스의 시작 또는 끝을 나타내거나, 알 수없는 문자를 대체하거나, 최대 길이보다 짧은 시퀀스를 채 웁니다 (
각 입력에 대해 100 개의 문자 학습 및 테스트가 생성되었으며 시퀀스 끝 토큰 이후에 생성 된 모든 문자는 무시되었습니다.)
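
A character-level vocabulary with four special tokens, padding to 100 positions, and decoding that stops at the end-of-sequence token could look roughly like this. The token spellings, the function names, and the decision to count the end token toward the 100 positions are all assumptions for this sketch; the source gives only the vocabulary size, the roles of the special tokens, and the output length.

```python
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"  # hypothetical spellings
MAX_LEN = 100


def build_vocab(smiles_corpus):
    """Special tokens first, then every unique character seen in the corpus."""
    chars = sorted({ch for s in smiles_corpus for ch in s})
    return [PAD, START, END, UNK] + chars


def encode(smiles, vocab):
    """Map characters to ids, append the end token, and pad to MAX_LEN."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(ch, index[UNK]) for ch in smiles] + [index[END]]
    ids = ids[:MAX_LEN]
    return ids + [index[PAD]] * (MAX_LEN - len(ids))


def decode(ids, vocab):
    """Turn ids back into a string, ignoring everything after the end token."""
    out = []
    for i in ids:
        tok = vocab[i]
        if tok == END:
            break
        if tok not in (PAD, START):
            out.append(tok)
    return "".join(out)


vocab = build_vocab(["CCO", "c1ccccc1"])
ids = encode("CCO", vocab)
print(len(ids), decode(ids, vocab))
```

Since the vocabulary is built from single characters, a two-letter element such as Cl is emitted over two decoding steps in this sketch.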

Training
The segmentation model had 380,000 parameters and was trained on batches of 64 images (128x128 pixels in size). In our experiments, training converged after 650,000 steps and took 4 days to complete on a single GPU. The sequencing model had 46.3 million parameters and was trained on batches of 128 images (256x256 pixels in size). During training, images were randomly affine transformed, brightness scaled, and/or binarized. Augmenting the dataset while training using random transformations ensured that the model would not become too reliant on styles either generated by Indigo or seen in the patent images. In our experiments, training converged after 1 million training steps (26 days on 8 GPUs). Both models were constructed using TensorFlow 18 and were trained using the Adam optimizer 17 on NVIDIA Pascal GPUs.

훈련
세분화 모델에는 380,000 개의 매개 변수가 있으며 64 개 이미지 (128x128 픽셀 크기)의 배치에 대해 학습되었습니다. 실험에서 훈련은 650,000 단계 후에 수렴되었고 단일 GPU에서 완료하는 데 4 일이 걸렸습니다. 시퀀싱 모델에는 4,630 만 개의 매개 변수가 있으며 128 개의 배치에서 학습되었습니다.
이미지 (256x256 픽셀 크기). 훈련하는 동안 이미지는 무작위로 유사 변환, 밝기 조정 및 / 또는 이진화되었습니다. 무작위 변환을 사용하여 훈련하는 동안 데이터 세트를 확대하면 모델이 Indigo에서 생성하거나 특허 이미지에서 볼 수있는 스타일에 너무 의존하지 않게되었습니다. 실험에서 학습은 1 백만 개의 학습 단계 (8 개의 GPU에서 26 일) 후에 수렴되었습니다. 두 모델 모두 TensorFlow 18을 사용하여 구성되었으며 NVIDIA Pascal GPU에서 Adam 최적화 프로그램 17을 사용하여 학습되었습니다.
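
Two of the augmentations mentioned above, brightness scaling and binarization, can be sketched on a grayscale image stored as nested lists with values in [0, 1]. The probabilities, the brightness range, and the threshold are illustrative guesses, not values from the source, and the random affine transform is omitted for brevity.

```python
import random


def scale_brightness(img, factor):
    """Multiply every pixel by factor and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, v * factor)) for v in row] for row in img]


def binarize(img, threshold=0.5):
    """Set each pixel to 1.0 if it reaches the threshold, otherwise 0.0."""
    return [[1.0 if v >= threshold else 0.0 for v in row] for row in img]


def augment(img, rng):
    """Apply each augmentation independently with probability 0.5 (assumed)."""
    if rng.random() < 0.5:
        img = scale_brightness(img, rng.uniform(0.7, 1.3))
    if rng.random() < 0.5:
        img = binarize(img)
    return img


rng = random.Random(0)
image = [[0.1, 0.6], [0.9, 0.4]]
print(augment(image, rng))
```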

Results and Discussion
During training, metrics were tracked for performance on both the Indigo and USPTO validation datasets. We observed no apparent overfitting over the Indigo dataset during training but did experience some overfitting over USPTO data (Figure 6). Due to the large size of the Indigo dataset (52 million examples used during training) and the many rendering styles available in Indigo it is not surprising that the model did not experience any apparent overfitting on Indigo data. Conversely, the USPTO set is much smaller (1.27 million examples used during training) with each example sampled much more frequently (approximately 40 times more often), increasing the risk of overfitting.

결과 및 논의
훈련 중에 Indigo 및 USPTO 검증 데이터 세트의 성능에 대한 메트릭을 추적했습니다. 우리는 훈련 중에 Indigo 데이터 세트에 대한 명백한 과적 합을 관찰하지 않았지만 USPTO 데이터에 대한 과적 합을 경험했습니다 (그림 6). Indigo 데이터 세트의 큰 크기 (학습 중에 5,200 만 개의 예제가 사용됨)와 Indigo에서 사용할 수있는 많은 렌더링 스타일로 인해 모델이 Indigo 데이터에 대해 명백한 과적 합을 경험하지 않은 것은 놀라운 일이 아닙니다. 반대로, USPTO 세트는 훨씬 더 작으며 (학습 중에 사용 된 127 만 개의 예제) 각 예제가 훨씬 더 자주 (약 40 배 더 자주) 샘플링되어 과적 합 위험이 증가합니다.
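
The "approximately 40 times more often" figure follows from the training-set sizes if the datasets are drawn with equal probability, which is how the uniform sampling described earlier is read here. A short check of that arithmetic, using a hypothetical helper name:

```python
def relative_sampling_rate(size_a, size_b):
    """How much more often one example of dataset B is drawn than one of A,
    assuming A and B are each selected with the same probability."""
    return size_a / size_b


indigo_train = 52_000_000  # Indigo examples used during training
uspto_train = 1_270_000    # USPTO examples used during training

ratio = relative_sampling_rate(indigo_train, uspto_train)
print(f"{ratio:.1f}")  # 40.9
```

Each USPTO example is therefore seen roughly 41 times for every time a given Indigo example is seen, in line with the approximate factor of 40 quoted in the text.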

After the models were trained, performance was measured on the Valko and proprietary test sets. The test sets were evaluated using the full segmentation and sequence generation pipeline described above, and accuracies for the validation and test sets are reported in Table 1. For a result to count toward accuracy, it must be chemically identical to the ground truth, including stereochemistry; any error results in the structure being deemed incorrect. We observed that despite the method requiring low resolution inputs, accuracy was generally high across the datasets. Additionally, accuracies for the validation sets and the proprietary set were all similar (77-83%), indicating that the training sets used in developing the method reasonably approximate data useful in actual drug discovery projects, as represented by the proprietary test set.

모델을 학습 한 후 Valko 및 독점 테스트 세트에서 성능을 측정했습니다. 테스트 세트는 위에서 설명한 전체 세분화 및 시퀀스 생성 파이프 라인을 사용하여 평가되었으며 유효성 검사 및 테스트 세트에 대한 정확도는 표 1에보고되어 있습니다. 결과가 정확도에 기여하려면 실제와 화학적으로 동일해야합니다. 입체 화학 포함. 오류로 인해 구조가 잘못된 것으로 간주됩니다. 저해상도 입력이 필요한 방법에도 불구하고 일반적으로 데이터 세트 전체에서 정확도가 높았습니다. 또한 검증 세트 및 9에 대한 정확도
독점 세트는 모두 유사 (77-83 %)로 방법을 개발하는 데 사용 된 훈련 세트가 독점 테스트 세트로 표현 된 실제 신약 발견 프로젝트에 유용한 데이터에 합리적으로 근접 함을 나타냅니다.
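
The all-or-nothing accuracy criterion can be expressed as an exact match between predicted and ground-truth SMILES. This sketch assumes both strings have already been canonicalized by the same toolkit, so that string equality stands in for chemical identity; the function name and the example molecules are chosen for illustration only.

```python
def exact_match_accuracy(predicted, truth):
    """Fraction of predictions identical to the ground truth.

    Assumes both lists hold canonical SMILES produced by the same toolkit,
    so any difference (including a stereo descriptor) counts as an error.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


truth = ["C[C@H](N)C(=O)O", "CCO"]
predicted = ["C[C@@H](N)C(=O)O", "CCO"]  # first prediction has the wrong configuration
print(exact_match_accuracy(predicted, truth))  # 0.5
```

The first pair differs only in the stereo descriptor (@ versus @@), yet it is scored as wrong, which mirrors the strictness of the criterion described above.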

On the Valko test set we observed an accuracy of 41% over the full, downsampled dataset, which is significantly lower than the accuracies observed in the other datasets. The decrease in performance is likely due to the higher rate of certain challenging features seen less frequently in the other datasets, including superatoms. Superatoms are the single largest contributor to prediction errors in the Valko dataset (21% of samples containing one or more incorrectly predicted superatoms). In our training sets, superatoms were only included in the USPTO dataset and were not generated as part of the Indigo dataset, resulting in a low rate of inclusion during training (6.6% of total images seen contain some superatom, with most superatoms included at a rate of <<1%). An increased sampling rate of images containing superatoms will likely provide a significant accuracy improvement in this area.
In further exploring incorrectly generated superatoms we discovered, unsurprisingly, that larger or more uncommon superatoms were recognized with less success than smaller, more common types. For example, “Me” (methyl) is predicted correctly about half the time (other times being mistaken for a nitrogen or oxygen) while some larger superatoms are not predicted well at all (“P+(C4H9)3”, “(H2C)5”, and “P+(n-Bu)3” all predicted incorrectly). In some cases, however, large superatoms were recognized correctly, e.g., the single example of “n-C8H17” in the dataset is predicted correctly, and in Figure 7 we show an example structure with the “OTBS” (tert-butyldimethylsilyl ether) superatom predicted correctly despite aggressive downsampling and cluttering of characters.

Valko 테스트 세트에서 전체 다운 샘플링 된 데이터 세트에 대해 41 %의 정확도를 관찰했는데, 이는 다른 데이터 세트에서 관찰 된 정확도보다 상당히 낮습니다. 성능 저하는 수퍼 원자를 포함하여 다른 데이터 세트에서 덜 자주 나타나는 특정 까다로운 기능의 비율이 높기 때문일 수 있습니다. Superatoms는 Valko 데이터 세트 (하나 이상의 잘못 예측 된 superatoms를 포함하는 샘플의 21 %)에서 예측 오류에 가장 큰 원인이됩니다. 트레이닝 세트에서
superatoms는 USPTO 데이터 세트에만 포함되었으며 Indigo 데이터 세트의 일부로 생성되지 않아 훈련 중에 포함 률이 낮아졌습니다 (표시된 총 이미지의 6.6 %에는 일부 superatom이 포함되어 있으며 대부분의 superatoms는 << 1 % 비율로 포함됩니다. ). 초원자를 포함하는 이미지의 샘플링 속도가 증가하면이 영역에서 상당한 정확도 향상을 가져올 수 있습니다.
잘못 생성 된 슈퍼 원자를 더 탐구하면서 우리는 놀랍게도 더 크거나 더 드문 슈퍼 원자가 작고 일반적인 유형보다 덜 성공한 것으로 인식된다는 사실을 발견했습니다. 예를 들어, "Me"(메틸)는 대략 절반의 시간 동안 정확하게 예측되지만 (다른 경우에는 질소 또는 산소로 오인되는 경우도 있음) 일부 더 큰 슈퍼 원자는 전혀 예측되지 않습니다 ( "P + (C4H9) 3", "(H2C) 5 "및"P + (n-Bu) 3 "모두 잘못 예측 됨). 그러나 어떤 경우에는 큰 초 원자가 인식되었습니다.
정확하게, 예를 들어 데이터 세트에서 "n-C8H17"의 단일 예가 올바르게 예측되고 그림 7에서는 공격적인 다운 샘플링 및 문자의 어수선 함에도 불구하고 올바르게 예측 된 "OTBS"(tert-butyldimethylsilyl ether) 수퍼 아톰이있는 예제 구조를 보여줍니다. .

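Figure 7. An example structure where a heavily downsampled input (A) is used during prediction. The predicted structure (B) contains many errors because this particular example is of very low resolution, but the silyl ether is predicted correctly when compared with the ground truth (C).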

Another interesting case in the dataset concerned the prediction of the “NEt2” (diethylamine) superatom. In Figure 8 three similar input images are shown, each containing the diethylamine functional group. Only one image had the functional group predicted correctly (the rightmost example in the figure) while the other two were incorrect, but interestingly the two incorrect examples were not incorrect in the same way. The middle example was predicted to contain an aniline while the leftmost example was predicted to contain an azide. This was despite the functional groups appearing nearly identical and occupying the same locality in the input images.

데이터 세트의 또 다른 흥미로운 사례는 "NEt2"(디 에틸 아민) 슈퍼 원자의 예측과 관련이 있습니다. 그림 8에는 각각 디 에틸 아민 작용기를 포함하는 세 개의 유사한 입력 이미지가 표시됩니다. 결과에서 하나의 이미지 만 올바르게 예측 된 기능 그룹 (그림에서 가장 오른쪽의 예)을 가졌고 다른 두 이미지는 올바르지 않았지만 흥미롭게도 두 가지 잘못된 예는 같은 방식으로 올바르지 않았습니다. 중간 예는 아닐린을 포함 할 것으로 예상되는 반면 가장 왼쪽 예는 아 지드를 포함 할 것으로 예측되었습니다. 이는 기능 그룹이 거의 동일하게 보이고 입력 이미지에서 동일한 위치를 차지함에도 불구하고 발생했습니다.

Figure 8. Three similar compounds containing the diethylamine superatom, only one of which was predicted correctly (rightmost). Below each image are the characters predicted in the SMILES for the diethylamine superatom.

In analyzing stereochemistry-related errors in the Valko dataset we observed that 60% of compounds with incorrectly predicted stereochemistry had explicitly assigned stereochemistry in both the ground truth and the predicted result, but the assignments in the predicted SMILES were incorrect.

Valko 데이터 세트의 입체 화학 관련 오류를 분석하는 과정에서 입체 화학이 잘못 예측 된 화합물의 60 %가 지상 진실과
예측 된 결과이지만 예측 된 SMILES의 할당이 올바르지 않습니다.

In other words, the model most often correctly predicted which atoms have explicit stereochemistry assigned, but occasionally assigned the wrong configuration (e.g., predicted R configuration when it should have been S). Intuitively, stereochemistry assignment is not a strictly local decision, i.e., observing a hash or a wedge is not sufficient information to make a configuration assignment, and more information about the neighboring atoms and connectedness is required for correct assignment. A possible explanation for the difficulty in learning stereochemistry from images is that our current model architecture may be insufficient in incorporating large enough context when computing certain features.

다시 말해, 모델은 어떤 원자가 명시 적 입체 화학이 할당되었는지 가장 자주 정확하게 예측했지만 때때로 잘못된 구성을 할당했습니다 (예 : S이어야 할 때 R 구성 예측). 직관적으로 입체 화학 할당은 엄격한 로컬 결정이 아닙니다. 즉, 해시 또는 쐐기는 구성 할당을위한 정보가 충분하지 않으며 올바른 할당을 위해 인접 원자 및 연결성에 대한 자세한 정보가 필요합니다. 이미지에서 입체 화학을 배우는 데 어려움에 대한 가능한 설명은 현재 모델 아키텍처가 특정 기능을 계산할 때 충분히 큰 컨텍스트를 통합하는 데 충분하지 않을 수 있다는 것입니다.

Some images failed to produce a valid structure (either output SMILES was not valid or output confidence was <1%). Common issues that resulted in a structure failing or otherwise being severely incorrect included structures that were too large, macrocycles with large rings that were cleaved during prediction, structures with many superatoms or stereocenters, or images where downscaling was too aggressive and resolution too low for adequate recognition. The Valko set also contains images with multiple structures or that are inverted (white structures on black background), neither of which were supported in our validation scheme and are reported as incorrect.

일부 이미지는 유효한 구조를 생성하지 못했습니다 (출력 SMILES가 유효하지 않거나 출력 신뢰도가 <1 % 임). 구조가 실패하거나 심각하게 잘못되는 일반적인 문제에는 너무 큰 구조, 예측 중에 갈라진 큰 고리가있는 매크로 사이클, 많은 수퍼 원자 또는 입체 중심이있는 구조, 축소가 너무 공격적이고 해상도가 너무 낮은 이미지가 포함됩니다. 인식. Valko 세트에는 또한 여러 구조 또는
반전 된 (검은 색 배경에 흰색 구조), 둘 다 검증 체계에서 지원되지 않았으며 잘못된 것으로보고되었습니다.
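
The rejection rule (invalid SMILES or output confidence below 1%) might be implemented along the following lines. The source does not define how output confidence is computed; taking it as the product of per-character probabilities is purely an assumption for this sketch, as are the function names. SMILES validity would be checked with a cheminformatics toolkit and is passed in here as a boolean.

```python
import math


def sequence_confidence(step_probs):
    """Product of per-step probabilities, computed in log space.

    Assumption: this stands in for the model's output confidence,
    which the source does not define.
    """
    if any(p <= 0.0 for p in step_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in step_probs))


def accept(smiles_is_valid, step_probs, threshold=0.01):
    """Keep a prediction only if it parses and its confidence reaches the threshold."""
    return smiles_is_valid and sequence_confidence(step_probs) >= threshold


print(accept(True, [0.9] * 10))   # True: confidence is about 0.35
print(accept(True, [0.5] * 10))   # False: confidence is about 0.001
print(accept(False, [1.0]))       # False: SMILES did not parse
```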

Across both test sets we observed low error rates due to segmentation; predicted masks appeared quite clean and were generated at reasonably high resolution (see Figure 2 in the method section above for an example). Only 3.3% of the Valko dataset images and 6.6% of the proprietary images failed to segment properly.

To further analyze performance over the validation and test sets, we explored distributions for several metrics.
In Figure 9 we report distributions of correct and incorrect examples for molecular weight, number of heavy atoms contained in the molecules, number of characters in the ground truth SMILES, and the types of heavy atoms contained in the molecules.
In exploring molecular weight and heavy atom counts of both correct and incorrect molecules in the USPTO validation set, we observed that the model slightly favors smaller molecules. Making more errors on larger molecules was not surprising, considering that large molecules have longer SMILES strings, which requires the model to compress more information during encoding and predict more characters during decoding. It was surprising, however, that the difference between correct and incorrect distributions is not more pronounced. Our expectation was that larger molecules would be significantly more challenging to predict and that the model would heavily favor smaller molecules. Incorrect SMILES tend to shift toward heavier or larger molecules, but correctness cannot be adequately attributed to either metric. Predicting well on large molecules is encouraging, and suggests that the model may be easily extended to molecules larger than 100 characters in SMILES length. In exploring the types of heavy atoms seen in both the correct and incorrect examples, once again, both distributions appeared similar. Particularly interesting are the atoms that appear much more rarely in SMILES, e.g., Na, Sn, W. In predicting rarer atoms, the network performed surprisingly well on some (Na, Ar) but not well at all on others (U, V). Further work is needed to explore the distribution of rare atoms across the full dataset and ensure that all atom types are sampled sufficiently during training.

두 테스트 세트에서 세분화로 인한 낮은 오류율을 관찰했으며 예측 된 마스크는 매우 깨끗해 보였고 상당히 높은 해상도로 생성되었습니다 (예를 들어 위의 방법 섹션에서 그림 2 참조). Valko 데이터 세트의 3.3 %와 독점 이미지 6.6 %만이 제대로 분할하지 못했습니다.

검증 및 테스트 세트에 대한 성능을 추가로 분석하기 위해 여러 메트릭에 대한 분포를 탐색했습니다.
그림 9에서는 분자량, 분자에 포함 된 중원 자 수, Ground Truth SMILES의 문자 수 및 분자에 포함 된 중원 자의 유형에 대한 정확하고 잘못된 예의 분포를보고합니다.
USPTO 검증 세트에서 올바른 분자와 부정확 한 분자의 분자량과 무거운 원자를 탐색하면서 모델이 더 작은 분자를 약간 선호한다는 것을 관찰했습니다. 더 큰 분자에서 더 많은 오류를 예측하는 것은 큰 분자가 더 긴 SMILES 문자열을 가지고 있고 모델이 더 많은 정보를 압축해야한다는 점을 고려하면 놀라운 일이 아닙니다.
디코딩하는 동안 더 많은 문자를 인코딩하고 예측합니다. 그러나 정확한 분포와 잘못된 분포의 차이가 더 뚜렷하지 않다는 것은 놀랍습니다. 우리의 기대는 더 큰 분자가 예측하기 훨씬 더 어렵고 모델이 더 작은 분자를 선호 할 것이라는 것이 었습니다. 부정확 한 SMILES는 더 무겁거나 더 큰 분자로 이동하는 경향이 있지만 정확성은 두 메트릭에 적절하게 기인 할 수 없습니다. 큰 분자를 잘 예측하는 것은 고무적이며 모델이 SMILES 길이의 100 자 이상의 분자로 쉽게 확장 될 수 있음을 시사합니다. 옳고 그른 예에서 볼 수있는 중원 자의 유형을 조사 할 때 다시 한 번 두 분포가 비슷하게 나타났습니다. 특히 흥미로운 것은 SMILES에서 훨씬 더 드물게 나타나는 원자 (예 : Na, Sn, W)입니다. 희귀 한 원자를 예측할 때 네트워크는 일부 (Na, Ar)에서는 놀랍게도 잘 수행되었지만 다른 원자에서는 전혀 잘 수행되지 않았습니다 (U, V). . 전체 데이터 세트에서 희귀 원자의 분포를 탐색하고 모든 원자 유형이 훈련 중에 충분히 샘플링되도록하려면 추가 작업이 필요합니다.

Similar to the analysis on the USPTO dataset reported above, we explored distributions of simple molecular properties for SMILES predicted correctly in the Valko dataset (Figure 10). Interestingly, the distributions all appear to be more narrow than in the USPTO dataset and the SMILES strings are longer.

It is worth noting that the Valko dataset is quite small and a larger dataset containing a broader distribution of molecules would be interesting for the community to benchmark against, but is left for future work.

위에서보고 된 USPTO 데이터 세트에 대한 분석과 유사하게 Valko 데이터 세트에서 올바르게 예측 된 SMILES의 단순 분자 특성 분포를 조사했습니다 (그림 10). 흥미롭게도 분포는 모두 USPTO 데이터 세트보다 좁은 것으로 보이며 SMILES 문자열은 더 깁니다.

Valko 데이터 세트는 매우 작고 더 넓은 데이터 세트를 포함하는 더 큰 데이터 세트입니다. 분자 분포는 커뮤니티가 벤치마킹하기에 흥미로울 것이지만 향후 작업을 위해 남겨졌습니다.

In reviewing structures that were predicted correctly, we observe that the methods described in this work show promise in their ability to predict valid and correct SMILES for low resolution images. We showcase a few examples in Figure 11. These examples contain a variety of atom types, some examples of stereochemistry and superatoms, and are not trivial in size. Further progress may require developing methods which eliminate the restriction of downsampling all inputs by supporting high resolution data when available, and supporting structures larger than 100 characters in length.

올바르게 예측 된 구조를 검토 할 때이 작업에 설명 된 방법이 저해상도 이미지에 대해 유효하고 올바른 SMILES를 예측할 수있는 능력이 있음을 보여줍니다. 그림 11에는 몇 가지 예가 나와 있습니다.이 예에는 다양한 원자 유형이 포함되어 있습니다.
입체 화학과 초 원자이며 크기가 사소하지 않습니다. 추가 진행에는 가능한 경우 고해상도 데이터를 지원하고 길이가 100 자보다 큰 구조를 지원하여 모든 입력을 다운 샘플링하는 제한을 제거하는 방법을 개발해야 할 수 있습니다.

Conclusion
In this work we presented deep learning solutions to both extract structures from documents and predict SMILES for structure images. The method does not rely on handcrafted features or rules and operates directly on raw pixels, enabling the method to learn from and predict images of virtually any style. Using datasets containing molecule images cropped from journal articles and patents we showed that deep learning can learn to predict images of molecules from literature at reasonably high accuracy. The method herein was trained exclusively on low resolution data, and thus only supported prediction over low resolution input. Training over high resolution images as well may greatly improve results, particularly when high resolution inputs are available. All images used in the reported results were highly downsampled, demonstrating the ability to predict low resolution images of chemical structures using an automated method, which was not previously possible.

결론
이 작업에서 우리는 문서에서 구조를 추출하고 구조 이미지에 대한 SMILES를 예측하는 딥 러닝 솔루션을 제시했습니다. 이 방법은 수제 기능이나 규칙에 의존하지 않고 작동합니다.
이 방법은 거의 모든 스타일의 이미지에서 학습하고 예측할 수있게합니다. 저널 기사와 특허에서 자른 분자 이미지가 포함 된 데이터 세트를 사용하여 딥 러닝이 문헌에서 분자 이미지를 상당히 높은 정확도로 예측하는 방법을 배울 수 있음을 보여주었습니다.
이 방법은 저해상도 데이터에 대해 배타적으로 학습되었으므로 저해상도 입력에 대한 예측 만 지원했습니다. 고해상도 이미지에 대한 교육도 특히 고해상도 입력을 사용할 수있는 경우 결과를 크게 향상시킬 수 있습니다. 보고 된 결과에 사용 된 모든 이미지는 고도로 다운 샘플링되어 이전에는 불가능했던 자동화 된 방법을 사용하여 화학 구조의 저해상도 이미지를 예측할 수 있음을 보여줍니다.

We anticipate that the use of chemical structure extraction algorithms, such as those described herein as well as future generalizations and improvements, may greatly accelerate drug discovery efforts in many ways.
Most immediately, such algorithms may help to greatly accelerate curation of published journal article and patent data to facilitate routine QSAR/QSPR modelling work. However, given the very high rate at which data is being introduced into the public academic and patent literature, expeditious and efficient curation of public data may in the future become a chief bottleneck in the construction of optimal global ADMET property prediction models for drug discovery. Steps toward fully automating data curation may enable drug discovery projects to more routinely utilize all relevant available data for ADMET property prediction at every stage of a project. Given the widespread recognition of the dependence of ADMET property prediction on data set size and cleanliness, we anticipate such technologies should broadly improve the quality of drug discovery ADMET property modeling in the future.

우리는 여기에 설명 된 것과 같은 화학 구조 추출 알고리즘의 사용과 미래의 일반화 및 개선이 여러면에서 약물 발견 노력을 크게 가속화 할 수있을 것으로 예상합니다.
가장 즉각적으로 이러한 알고리즘은 출판 된 저널 기사 및 특허 데이터의 큐 레이션을 크게 가속화하여 일상적인 QSAR / QSPR 모델링 작업을 용이하게하는 데 도움이 될 수 있습니다. 그러나 데이터가 공공 학술 및 특허 문헌에 도입되는 매우 빠른 속도를 고려할 때, 공공 데이터의 신속하고 효율적인 큐 레이션은 미래에 최대한 최적의 글로벌 구축에있어 주요 병목이 될 수 있습니다.
신약 발견을위한 ADMET 속성 예측 모델. 데이터 큐 레이션을 완전히 자동화하기위한 단계는 신약 발견 프로젝트가 프로젝트 진행의 모든 시점에서 ADMET 속성 예측에 사용 가능한 모든 관련 데이터를보다 일상적으로 활용할 수 있도록합니다. ADMET 속성 예측이 데이터 세트 크기 및 청결도에 대한 의존성에 대한 광범위한 인식을 감안할 때 이러한 기술은 향후 신약 발견 ADMET 속성 모델링의 품질을 광범위하게 향상시킬 것으로 예상합니다.