Workflow

A single pipeline, end to end.

Each stage is configurable. Inputs accept SMILES paired with target variables or a previously prepared feature matrix. Models, descriptors, and validation strategy stay traceable from upload through report generation. Configuration choices recorded at the start of a study persist through every later stage, so the path from input data to final report is reproducible and auditable from a single source of truth.

01
Upload molecules

SMILES with targets, or a prepared feature matrix. Structural validation, standardisation, and deduplication run automatically.
02
Configure study

Select descriptor types and fingerprint families. Apply dimensionality reduction and feature selection. Hyperparameter optimisation runs inside the same menu.
03
Train and validate

Scaffold-aware partitioning, cross-validation, and independent test sets are applied. Multiple algorithm families compete on the same task.
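
As an illustration of what scaffold-aware partitioning means in practice, the sketch below keeps every molecule that shares a scaffold on the same side of the split. It is a minimal example under stated assumptions, not the platform's implementation: the scaffold keys are assumed to be precomputed elsewhere (for example, Bemis-Murcko scaffolds from a cheminformatics toolkit), and the function name `scaffold_split` and the example labels are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold appears in both sets.

    `scaffolds` holds one precomputed scaffold key per molecule
    (an assumed input; computing scaffolds is out of scope here).
    """
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    # Largest scaffold groups are placed first, so rarer scaffolds
    # tend to end up in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_budget = (1.0 - test_fraction) * len(scaffolds)

    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= train_budget:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["benzene", "benzene", "benzene", "pyridine", "pyridine",
             "indole", "furan", "benzene", "indole", "thiophene"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.25)
print(train_idx, test_idx)
```

Because whole scaffold groups move together, the test set measures performance on chemotypes the model has not seen, which is usually a stricter check than a random split.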
04
Interpret predictions

Feature attribution links each prediction to the molecular descriptors that drove it. The descriptor dictionary explains every term in chemical and mathematical terms.
05
Generate report

Model documentation and prediction documentation export in structures aligned with regulated assessment frameworks. Outputs are reproducible and auditable.

Analytical modules

Three modules, one shared workflow.

The same configuration interface powers all three analytical paths. Researchers choose the path that matches their data and discovery question, then move through the workflow without context switching between tools.

Module 01

Predictive modelling

Train and compare model families on a single task. Performance is evaluated through both cross-validation and independent test sets. Trained models export with the configuration components, preprocessing steps, and scaling parameters required to replicate the run externally.
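
To show why scaling parameters need to travel with an exported model, here is a minimal sketch of a self-describing export. The bundle layout, the descriptor names, and the helper functions `fit_scaler` and `apply_scaler` are hypothetical and chosen for illustration only; the actual export format is not specified here.

```python
import json


def fit_scaler(rows):
    """Compute per-column mean and standard deviation for standardisation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [(sum((v - m) ** 2 for v in col) / n) ** 0.5 or 1.0
            for col, m in zip(columns, means)]
    return {"mean": means, "std": stds}


def apply_scaler(row, scaler):
    """Standardise one descriptor row with previously stored parameters."""
    return [(v - m) / s for v, m, s in zip(row, scaler["mean"], scaler["std"])]


training_descriptors = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]

# Hypothetical export bundle: everything needed to repeat the preprocessing.
bundle = {
    "descriptors": ["descriptor_a", "descriptor_b"],
    "preprocessing": ["standardise"],
    "scaler": fit_scaler(training_descriptors),
    "model": {"family": "ridge", "alpha": 1.0},
}

exported = json.dumps(bundle, sort_keys=True)
restored = json.loads(exported)

# An external user scales a new molecule exactly as the training run did.
scaled = apply_scaler([3.0, 50.0], restored["scaler"])
print(scaled)
```

Without the stored means and standard deviations, an external run would have to re-estimate them from different data and would feed the model inputs on a different scale.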

Multiple model families compared in one study
Cross-validation and independent test set evaluation
Models deploy directly to new molecular libraries

Module 02

Transfer learning

For studies where labelled data is too sparse to train a reliable end-to-end model, train a base model on open-access databases through the same menu, then fine-tune on the smaller target dataset. The adapted model is applied to under-labelled compounds with calibrated confidence estimates attached.

Base model trained on open-access chemistry data
Fine-tuning on the smaller target dataset
Calibrated confidence reported with every prediction

Module 03

Multi-objective optimisation

Evaluate competing discovery goals at the same time. Biological activity, toxicity, and solubility are assessed concurrently. Researchers examine Pareto-optimal subsets of a candidate library interactively, allowing balanced analysis of trade-offs within the same interface.

Concurrent evaluation of activity, toxicity, solubility
Interactive Pareto-optimal candidate exploration
Trade-offs surfaced for prioritisation decisions
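
For readers unfamiliar with the term, a Pareto-optimal subset contains every candidate that no other candidate beats on all objectives at once. The sketch below is a minimal illustration with hypothetical compound names and scores; toxicity is negated so that larger is better on every axis.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better once.

    All objectives are expressed so that larger values are better.
    """
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }


# Hypothetical scores as (activity, -toxicity, solubility).
candidates = {
    "cmpd_A": (0.9, -0.8, 0.3),
    "cmpd_B": (0.7, -0.2, 0.6),
    "cmpd_C": (0.6, -0.3, 0.5),  # worse than cmpd_B on every objective
    "cmpd_D": (0.4, -0.1, 0.9),
}

front = pareto_front(candidates)
print(sorted(front))
```

Only `cmpd_C` drops out here, because `cmpd_B` is better on all three objectives. Choosing among the remaining candidates is a prioritisation decision, not something the front itself resolves.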

Interpretability

Every prediction, linked to chemistry.

Interpretability is central to the platform. Each prediction is paired with an analysis that links the model output to the underlying molecular features. The aim is not to convince the researcher of the prediction, but to give them the evidence required to act on it or set it aside.

Feature attribution

Multiple explainability techniques run alongside each prediction, presented through interactive visualisations. Researchers see which descriptors drove the model output and how strongly each contributed. Visual representations make it possible to compare attribution patterns across compounds in a series, surfacing the structural features that the model treats as decisive for the modelled property.
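
The simplest case of feature attribution is a linear model, where each descriptor's contribution can be read off directly. The sketch below illustrates that idea only; it is not a description of the explainability techniques the platform runs, and the weights, baseline values, and descriptor names are hypothetical.

```python
def predict(weights, intercept, descriptors):
    """Prediction of a linear model over named descriptors."""
    return intercept + sum(weights[name] * descriptors[name] for name in weights)


def linear_attributions(weights, baseline, descriptors):
    """Contribution of each descriptor relative to a baseline molecule.

    The baseline would typically be the training-set mean of each descriptor.
    """
    return {name: weights[name] * (descriptors[name] - baseline[name])
            for name in weights}


# Hypothetical model and molecule.
weights = {"logP": 0.8, "tpsa": -0.02, "h_bond_donors": -0.3}
intercept = 5.0
baseline = {"logP": 2.0, "tpsa": 80.0, "h_bond_donors": 2.0}
molecule = {"logP": 3.5, "tpsa": 60.0, "h_bond_donors": 1.0}

contributions = linear_attributions(weights, baseline, molecule)

# Rank descriptors by the size of their contribution, largest first.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

The contributions add up to the difference between the molecule's prediction and the baseline prediction, which is the property that makes an attribution usable as evidence. Non-linear models need other techniques to obtain a comparable decomposition.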

Descriptor dictionary

Every descriptor that contributes to a prediction is connected to an entry in a molecular descriptor dictionary developed specifically within this project. Each entry provides the mathematical formulation alongside the physical and chemical explanation. Analysis and dictionary entries appear together in one interface, so interpretation never requires a separate reference text. This is particularly useful for early-career researchers, but the same dictionary is the working reference for experienced modellers reviewing why a particular descriptor family is performing well or poorly on a given series.

Applicability domain

An applicability domain analysis accompanies every prediction to determine whether the input molecule lies within the chemical space represented during model training. Researchers know when to trust a prediction and when to mark it for follow-up. This information is recorded in the prediction documentation, so the audit trail captures both the prediction and the basis for confidence in it.
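
One common way to approximate an applicability domain check is nearest-neighbour similarity to the training set. The sketch below assumes fingerprints are available as sets of on-bit indices and uses a hypothetical threshold of 0.5; the method and threshold used by the platform are not stated here, so treat this as an illustration of the concept.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def in_domain(query, training_fingerprints, threshold=0.5):
    """Flag a query as in-domain if its nearest training neighbour is similar enough."""
    nearest = max(tanimoto(query, fp) for fp in training_fingerprints)
    return nearest >= threshold, nearest


# Hypothetical training-set fingerprints.
training = [{1, 2, 3, 4}, {2, 3, 5, 8}, {1, 4, 6, 7}]

inside, sim_inside = in_domain({1, 2, 3, 9}, training)
outside, sim_outside = in_domain({1, 10, 11, 12}, training)
print(inside, round(sim_inside, 3))
print(outside, round(sim_outside, 3))
```

The first query shares most of its bits with a training molecule and is treated as in-domain; the second has no close neighbour and would be marked for follow-up.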

Calibrated confidence

For transfer learning runs on under-labelled compound series, calibrated confidence estimates accompany every prediction. Researchers receive a realistic uncertainty rather than an overconfident point estimate. The confidence values are usable for downstream prioritisation: a high-confidence prediction near an activity threshold deserves different treatment from a low-confidence prediction that happens to fall in the same range.
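
The difference between two predictions with the same point estimate can be made concrete by asking how likely each is to exceed the activity threshold. The sketch below assumes a Gaussian predictive distribution, which is an assumption made for illustration and not a statement about how the platform expresses confidence; the numbers are hypothetical.

```python
import math


def probability_above(mean, std, threshold):
    """P(true value > threshold) under an assumed Gaussian predictive distribution."""
    z = (threshold - mean) / std
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


threshold = 6.0  # hypothetical activity cut-off

# Same predicted value, very different uncertainty.
confident = probability_above(mean=6.3, std=0.1, threshold=threshold)
uncertain = probability_above(mean=6.3, std=1.0, threshold=threshold)

print(f"confident: {confident:.3f}")
print(f"uncertain: {uncertain:.3f}")
```

Both compounds are predicted at 6.3, but only the first is very likely to be above the cut-off. A prioritisation rule that ignores the uncertainty would treat them identically.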

Reporting

Documentation built into the workflow.

The reporting layer is not a separate tool added at the end. It runs inside the same pipeline, so every report is reproducible and traceable to its source configuration.
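
One way to make a report traceable to its source configuration is to stamp it with a fingerprint of that configuration. The sketch below is a minimal illustration using a SHA-256 hash of canonical JSON; the configuration keys and the function name `config_fingerprint` are hypothetical, and the mechanism the platform actually uses is not described here.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a study configuration in a way that ignores key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical study configuration.
study = {
    "descriptors": ["physchem", "fingerprint"],
    "split": "scaffold",
    "cv_folds": 5,
    "seed": 42,
}

# Same settings in a different key order.
reordered = {
    "seed": 42,
    "cv_folds": 5,
    "split": "scaffold",
    "descriptors": ["physchem", "fingerprint"],
}

# One setting changed.
changed = dict(study, seed=43)

print(config_fingerprint(study) == config_fingerprint(reordered))
print(config_fingerprint(study) == config_fingerprint(changed))
```

Identical settings give the same fingerprint regardless of ordering, while any changed setting gives a different one, so a report carrying the fingerprint can be matched to exactly one configuration.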

QMRF, QPRF, ICH M7, REACH

MergenKit produces scientific records consistent with documentation frameworks used in regulated assessments. QMRF (QSAR Model Reporting Format) structures model documentation; QPRF (QSAR Prediction Reporting Format) structures prediction documentation. These align with the requirements of frameworks such as ICH M7, which covers the assessment and control of DNA-reactive (mutagenic) impurities in pharmaceuticals, and REACH, which governs chemical safety assessment in the European Union.

The platform does not act as a regulatory authority or perform independent regulatory assessments. The scientific reporting layer remains separate from final regulatory decision-making, which is the responsibility of the user organisation. This boundary is deliberate: scientific computing and regulatory authority are different functions, and the platform supports the first without claiming the second. Read the reporting principles on the Science page for the underlying methodology.
