Protein solubility depends on centrifugation

Aiki-Sol

A fine-tuned ESM-2 650M that scores a protein sequence against five centrifugation regimes from 3,000×g to 100,000×g and returns calibrated soluble probabilities at each, plus a stringency-marginal output for when the downstream protocol is unknown.

Riya Rajagopalan, Radheesh Sharma Meda, Shankar Shastry, Venkatesh Mysore^*

Aikium, Inc. · ^*Corresponding author: venkatesh@aikium.com

v0.2.0 · 147,574 training proteins (Apache-tier) · 5+1 outputs (per-stringency + marginal) · ESM-2 650M backbone · full fine-tune

Abstract

Protein solubility depends on centrifugation: Aiki-Sol, a per-regime predictor for E. coli — bioRxiv preprint, submitted to Bioinformatics.

Motivation. Sequence-based predictors of recombinant protein solubility in Escherichia coli have plateaued (NESG independent-test AUC 0.760 to about 0.80 over eight years of protein-language-model variants). The plateau hides a latent confound: the centrifugation regime used to separate the soluble from the insoluble fraction is a hidden variable collapsed into a single binary “soluble” label. The protein's biochemistry does not change between regimes; what changes is which fraction of the lysate is recovered as soluble. Existing predictors treat the regime as label noise rather than a feature, and sequence overlap between training and test partitions masks the resulting failure mode.

Results. We release the Aiki-Sol Dataset, a tiered E. coli solubility corpus: a ~85K stringency-annotated benchmark, an Apache-licensed ~147K extension adding binary-only-labelled proteins, and a ~229K research-tier pool incorporating non-commercially-licensed sources. On the ~85K benchmark, scored on sequence-cluster-disjoint partitions, the strongest published binary comparator falls below chance on the 32,000×g stratum (AUC 0.491 ± 0.020); a fine-tuned ESM-2 650M backbone with five protocol-matched outputs lifts pooled AUC by +0.108 (paired-bootstrap CI lower bound ≥ +0.090). The gain is curation, not architecture: structure-aware predictors given ESMFold structures do not outperform the sequence-only frame, and capacity scaled to 3B parameters does not exceed the conditioned 650M backbone. The released model, Aiki-Sol, jointly supervises five per-stringency outputs alongside a marginal output for stringency-unknown proteins; on five external cohorts it lifts cohort-mean AUC from 0.69–0.70 to 0.825, with a +0.10 to +0.16 lift on the three cohorts at measurably-zero training-pool overlap.
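The paired-bootstrap comparison quoted above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes proteins are resampled with replacement, both predictors are scored on the same resample, and a 95% percentile interval is reported. The function names, the interval level, and the number of resamples are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed, lower, upper) for AUC(b) - AUC(a), 95% percentile interval."""
    observed = auc(labels, scores_b) - auc(labels, scores_a)  # also validates the labels
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # a resample with one class has no defined AUC; draw again
        # Pairing: both predictors are evaluated on the same resampled proteins.
        delta = auc(ys, [scores_b[i] for i in idx]) - auc(ys, [scores_a[i] for i in idx])
        deltas.append(delta)
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    return observed, lower, upper
```

Because the two predictors share each resample, the interval reflects the variability of the difference itself rather than of two independent AUC estimates.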

Try Aiki-Sol

Paste a protein sequence and get calibrated soluble probabilities across the five regimes.

What Aiki-Sol predicts. From a protein sequence, the model returns the probability of remaining in the soluble fraction under five experimental protocols: four in vivo centrifugation regimes and one in vitro cell-free assay. The “stringency-marginal” output is a single calibrated probability for cases where the protocol is unknown or unspecified. Aiki-Sol does not predict expression yield, aggregation kinetics, long-term stability, or biological activity. That the five outputs can disagree for the same protein is the point: a single solubility score averages away protocol-dependent biology.

Reading the chart. Four outputs are in vivo: the protein is heterologously expressed in E. coli, the cells are lysed, the lysate is spun at the named g-force, and the supernatant is counted as soluble. PURE 21,600×g is in vitro: cell-free synthesis with the PURE system (Shimizu et al., 2001), followed by a 21,600×g spin for 30 min. Because PURE skips inclusion-body biology, a protein can score high under PURE and low under high-g centrifugation, or vice versa. This is the “flip” the default example demonstrates.

Predicted 3D structure

ESMFold v1 (Meta AI) structure prediction via the public ESMFold Atlas API; residues coloured by per-residue pLDDT confidence. ● very high (≥90) ● confident (70–90) ● low (50–70) ● very low (<50)
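The colour legend above amounts to a four-way binning of per-residue pLDDT. A small sketch of that mapping follows; the function name is hypothetical, and assigning the shared boundaries (exactly 70 and exactly 50) to the upper band is an assumption, since the legend only fixes ≥90 and <50.

```python
def plddt_band(plddt: float) -> str:
    """Map a per-residue pLDDT score (0-100) to the legend's confidence band."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError(f"pLDDT must lie in [0, 100], got {plddt}")
    if plddt >= 90:
        return "very high"
    if plddt >= 70:  # boundary placement at exactly 70 is an assumption
        return "confident"
    if plddt >= 50:  # the legend puts <50 in "very low", so 50 lands here
        return "low"
    return "very low"
```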

The ESMFold Atlas API caps input at 400 residues.

How Aiki-Sol works

Three steps from a tagged protein construct to per-protocol probabilities.

Strip purification tags

His6, HiBit (VSGWRLFKKIS), Strep, common linkers, and the initiator methionine are removed using aikisol.normalize_sequence. The training pool was tag-stripped, so scoring tagged inputs would introduce a silent train/inference distribution mismatch.
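To make the stripping step concrete, here is a rough regex-based sketch. It is not the implementation of aikisol.normalize_sequence, whose exact rules are not described here. The function name strip_tags, the Strep-tag II variant (WSHPQFEK), the Gly/Ser linker pattern, and the restriction to tags at the termini are all illustrative assumptions.

```python
import re

# Illustrative patterns only; the real aikisol.normalize_sequence may differ.
HIS_TAG = "H{6,}"        # His6 or a longer poly-His run
HIBIT = "VSGWRLFKKIS"    # HiBit peptide, as given in the text
STREP_II = "WSHPQFEK"    # Strep-tag II (assumed variant of "Strep")
LINKER = "(?:G{2,4}S)+"  # Gly/Ser linkers such as GGS or GGGGS (assumed)

_TAG = f"(?:{HIS_TAG}|{HIBIT}|{STREP_II})"
# A tag at either terminus, optionally joined to the protein by a linker.
N_TERM = re.compile(f"^M?{_TAG}(?:{LINKER})?")
C_TERM = re.compile(f"(?:{LINKER})?{_TAG}$")


def strip_tags(seq: str) -> str:
    """Remove terminal tags, their linkers, and the initiator methionine."""
    seq = "".join(seq.split()).upper()
    changed = True
    while changed:  # repeat so stacked tags (e.g. Strep then His) are all removed
        stripped = C_TERM.sub("", N_TERM.sub("", seq))
        changed = stripped != seq
        seq = stripped
    if seq.startswith("M"):  # initiator methionine
        seq = seq[1:]
    return seq
```

A sketch like this will also strip a native terminal poly-His run, which is one reason to prefer the packaged normalizer for real inputs.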

Forward through ESM-2 650M

Sequence → ESM-2 tokens (max_length=1022) → masked-mean pool over residues. Backbone weights (facebook/esm2_t33_650M_UR50D, MIT) come from HuggingFace; the curve-head adapter is loaded from the Zenodo deposit.
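Masked-mean pooling averages the per-token embeddings over the positions a mask marks as valid, so padding does not dilute the representation. The sketch below uses plain Python lists to stay dependency-free; a real pipeline would do the same arithmetic on tensors. Which positions the released model masks out (padding only, or special tokens as well) is not stated here, so the mask is simply taken as an input.

```python
from typing import List, Sequence


def masked_mean_pool(
    embeddings: Sequence[Sequence[float]], mask: Sequence[int]
) -> List[float]:
    """Average token embeddings over positions where mask is 1.

    embeddings: one vector per token (tokens x hidden_dim).
    mask: 1 for positions to keep, 0 for positions to ignore.
    """
    if len(embeddings) != len(mask):
        raise ValueError("embeddings and mask must have the same length")
    kept = [row for row, m in zip(embeddings, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]
```

For ESM-2 650M each token vector has 1,280 entries, so the pooled output is a single 1,280-dimensional vector regardless of sequence length.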

6-output head, sigmoid

The pooled representation passes through a 2-layer MLP (1280 → 256 → 6) under sigmoid. Outputs 1–5 are P(soluble | regime) for the five centrifugation regimes; output 6 is a dedicated P(soluble | sequence) marginal head trained on stringency-unknown rows — the recommended single number when the downstream protocol is not specified.
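The head described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the released code: the hidden activation (ReLU here) is not specified in the text, the function names are hypothetical, and plain lists stand in for tensors. The output ordering (five regime outputs, then the marginal) follows the description above.

```python
import math
from typing import List, Optional, Sequence

Matrix = Sequence[Sequence[float]]


def linear(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    """Dense layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def head_forward(pooled, w1: Matrix, b1, w2: Matrix, b2) -> List[float]:
    """pooled (1280) -> hidden (256) -> 6 sigmoid outputs, for the real shapes."""
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]  # ReLU is an assumption
    return [sigmoid(z) for z in linear(hidden, w2, b2)]


def select_probability(probs: Sequence[float], regime: Optional[int]) -> float:
    """Pick P(soluble | regime) for regime 0-4, or the marginal when regime is None."""
    if len(probs) != 6:
        raise ValueError("expected six outputs (five regimes plus marginal)")
    if regime is None:
        return probs[5]  # dedicated marginal head
    if not 0 <= regime <= 4:
        raise ValueError("regime index must be in 0..4")
    return probs[regime]
```

Each output passes through its own sigmoid, so the six probabilities are independent and need not sum to one.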

API endpoints

The endpoints below each run on scale-to-zero Modal CPU containers.

All routes live under https://aikium--aikisol-landing-page.modal.run/.

GET /lookup_protein

Score a single sequence passed in the seq query parameter (?seq=...). Returns mean_prob and per-stringency probabilities.

curl 'https://aikium--aikisol-landing-page.modal.run/lookup_protein?seq=MKLITVLVLALLAVAVAFPV'
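The same request can be issued from Python with only the standard library. This is a sketch: the helper names are hypothetical, and the assumption that the body is JSON is inferred from the description above, not confirmed. Only the mean_prob field is named in the text, so no other keys are assumed.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://aikium--aikisol-landing-page.modal.run"


def lookup_url(seq: str) -> str:
    """Build the /lookup_protein URL with the sequence safely URL-encoded."""
    query = urllib.parse.urlencode({"seq": seq})
    return f"{BASE_URL}/lookup_protein?{query}"


def lookup_protein(seq: str, timeout: float = 60.0) -> dict:
    """Fetch predictions for one sequence (assumes a JSON response body)."""
    with urllib.request.urlopen(lookup_url(seq), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = lookup_protein("MKLITVLVLALLAVAVAFPV")
    print(result.get("mean_prob"))
```

A generous timeout is used because scale-to-zero containers may need a cold start on the first request.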

POST /predict_fasta

Batch-score a FASTA payload. JSON or CSV response.

curl -X POST 'https://aikium--aikisol-landing-page.modal.run/predict_fasta' \
  -H 'Content-Type: application/json' \
  -d '{"fasta": ">p1\nMKLITVLVLALLAVAVAFPV\n>p2\nMAEILVTQNMK\n", "format": "csv"}'
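A Python equivalent of the curl call above, again standard library only. The helper names are hypothetical. The payload keys "fasta" and "format" come from the example; "csv" is the only format value shown there, so other values are untested assumptions.

```python
import json
import urllib.request
from typing import Dict

PREDICT_URL = "https://aikium--aikisol-landing-page.modal.run/predict_fasta"


def to_fasta(records: Dict[str, str]) -> str:
    """Serialize {name: sequence} into FASTA text, one record per pair of lines."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


def build_request(records: Dict[str, str], fmt: str = "csv") -> urllib.request.Request:
    """Assemble the POST request without sending it."""
    body = json.dumps({"fasta": to_fasta(records), "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        PREDICT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_fasta(records: Dict[str, str], fmt: str = "csv", timeout: float = 120.0) -> str:
    """Send the batch and return the raw response text."""
    with urllib.request.urlopen(build_request(records, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(predict_fasta({"p1": "MKLITVLVLALLAVAVAFPV", "p2": "MAEILVTQNMK"}))
```

Building the request separately from sending it makes the payload easy to inspect before any network call.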

Install & run locally

Same model, on your own machine, no rate limits.

pip

pip install aikisol
aikisol-predict --sequence "MAEILVTQNMK..."

FASTA + CSV

aikisol-predict --fasta my.fasta --out predictions.csv
aikisol-predict --csv input.csv --seq-col sequence

Docker

docker pull ghcr.io/aikium-public/aiki-sol:full
docker run -v "$PWD":/work ghcr.io/aikium-public/aiki-sol:full \
  --fasta /work/my.fasta --out /work/predictions.csv

Where Aiki-Sol works, and where it doesn't

Known limits of this release.

Length cap

Sequences are truncated at 1,022 residues at the ESM-2 tokenizer level. Long modular proteins beyond that length are scored on the N-terminal 1,022-aa window only.
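Before scoring long proteins, it can help to see how much of each sequence falls outside the scored window. The helper below is a hypothetical pre-check, not part of the aikisol package; it only mirrors the N-terminal 1,022-residue rule stated above.

```python
from typing import Tuple

MAX_RESIDUES = 1022  # truncation length stated above


def scored_window(seq: str) -> Tuple[str, int]:
    """Return (N-terminal window that is scored, number of residues dropped)."""
    window = seq[:MAX_RESIDUES]
    return window, len(seq) - len(window)
```

A non-zero second value means C-terminal domains are invisible to the model, so the prediction describes the N-terminal window only.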

Tag-stripped training distribution

The training pool was tag-stripped. Inputs containing His6 / HiBit / Strep are silently re-stripped before scoring; the API raises an error if more than 5% of a batch retain tags after normalization.
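The 5% batch guard can be pictured as a simple fraction check run after normalization. The sketch below is an assumption about its shape, not the API's actual code. The motif list (including the Strep-tag II sequence WSHPQFEK), the substring matching anywhere in the sequence, and the function names are illustrative.

```python
from typing import Sequence

# Illustrative motifs; the API's real detection rules are not documented here.
TAG_MOTIFS = ("HHHHHH", "VSGWRLFKKIS", "WSHPQFEK")
MAX_TAGGED_FRACTION = 0.05  # threshold stated above


def tagged_fraction(seqs: Sequence[str]) -> float:
    """Fraction of sequences that still contain any tag motif."""
    if not seqs:
        return 0.0
    tagged = sum(any(motif in seq for motif in TAG_MOTIFS) for seq in seqs)
    return tagged / len(seqs)


def check_batch(seqs: Sequence[str]) -> None:
    """Raise if more than 5% of the normalized batch still carries a tag."""
    fraction = tagged_fraction(seqs)
    if fraction > MAX_TAGGED_FRACTION:
        raise ValueError(
            f"{fraction:.1%} of sequences retain tags after normalization "
            f"(limit {MAX_TAGGED_FRACTION:.0%})"
        )
```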

Mesophile centrifuge regimes

The five calibrated regimes correspond to centrifugation conditions used in the training pool's source datasets (eSol/PURE, NESG, PSI:Biology, DeepSol/DeepSoluE, ProgSol). Separation methods outside this range, such as tangential-flow filtration or ultrafiltration, are extrapolations.

Bacterial expression contexts

Most training rows are from E. coli-expressed targets. The model transfers reasonably to yeast (ProgSol-yeast AUC 0.883) but eukaryotic-expression-specific aggregation modes are not separately calibrated.

Cite, data, code

Companion paper, Zenodo deposit, GitHub repo.

Rajagopalan, R., Sharma Meda, R., Shastry, S., Mysore, V. (2026). Protein solubility depends on centrifugation: Aiki-Sol, a per-regime predictor for E. coli. bioRxiv 10.64898/2026.05.14.725067. Companion artefacts at Zenodo 10.5281/zenodo.20151817 and github.com/aikium-public/aiki-sol.

@article{aikisol2026,
  author       = {Rajagopalan, Riya and
                  Sharma Meda, Radheesh and
                  Shastry, Shankar and
                  Mysore, Venkatesh},
  title        = {Protein solubility depends on centrifugation:
                  {A}iki-{S}ol, a per-regime predictor for {\it E.~coli}},
  year         = {2026},
  journal      = {bioRxiv},
  doi          = {10.64898/2026.05.14.725067},
  url          = {https://www.biorxiv.org/content/10.64898/2026.05.14.725067v1},
  note         = {Companion data + model weights:
                  \url{https://doi.org/10.5281/zenodo.20151817}}
}

Acknowledgements

Aiki-Sol stands on a stack of open data and open weights.

Data. Niwa et al. (2009) for the eSol cell-free intrinsic-aggregation dataset (PURE protocol). The NESG, PSI:Biology, MCSG, CSGID, and NYSGRC structural-genomics consortia, and the curators of TargetTrack, for the open consortium-level expression and solubility records this work builds on.

Models & tools. ESM-2 (Meta AI, Lin et al. 2023), ESM Cambrian (EvolutionaryScale), and ProtT5-XL (TUM/Rostlab) for releasing open weights. MMseqs2 (Steinegger & Söding 2017) for the sequence-clustering tool our cluster-disjoint partitions depend on. The authors of NetSolP, PLM_Sol, SoluProt, DeepSol, DSResSol, GATSol, and ProtSolM for shipping deployable inference code that made the cross-predictor benchmark possible.

Infrastructure. HuggingFace for hosting the ESM-2 weights, Meta AI for the ESMFold Atlas API powering the 3D structure widget, Modal Labs for the serverless compute that hosts this demo, and Zenodo for the permanent data+weights deposit.