A fine-tuned ESM-2 650M that scores a protein sequence against five centrifugation regimes from 3,000×g to 100,000×g and returns calibrated soluble probabilities at each, plus a stringency-marginal output for when the downstream protocol is unknown.
Riya Rajagopalan, Radheesh Sharma Meda, Shankar Shastry, Venkatesh Mysore*
Aikium, Inc. · *Corresponding author: venkatesh@aikium.com
Protein solubility depends on centrifugation: Aiki-Sol, a per-regime predictor for E. coli — bioRxiv preprint, submitted to Bioinformatics.
Motivation. Sequence-based predictors of recombinant protein solubility in Escherichia coli have plateaued (NESG independent-test AUC 0.760 to about 0.80 over eight years of protein-language-model variants). The plateau hides a latent confound: the centrifugation regime used to separate the soluble from the insoluble fraction is a hidden variable collapsed into a single binary “soluble” label. The protein's biochemistry does not change between regimes; what changes is which fraction of the lysate is recovered as soluble. Existing predictors treat the regime as label noise rather than a feature, and sequence overlap between training and test partitions masks the resulting failure mode.
Results. We release the Aiki-Sol Dataset, a tiered E. coli solubility corpus: a ~85K stringency-annotated benchmark, an Apache-licensed ~147K extension adding binary-only-labelled proteins, and a ~229K research-tier pool incorporating non-commercially-licensed sources. On the ~85K benchmark, scored on sequence-cluster-disjoint partitions, the strongest published binary comparator falls below chance on the 32,000×g stratum (AUC 0.491 ± 0.020); a fine-tuned ESM-2 650M backbone with five protocol-matched outputs lifts pooled AUC by +0.108 (paired-bootstrap CI lower bound ≥ +0.090). The gain is curation, not architecture: structure-aware predictors given ESMFold structures do not outperform the sequence-only frame, and capacity scaled to 3B parameters does not exceed the conditioned 650M backbone. The released model, Aiki-Sol, jointly supervises five per-stringency outputs alongside a marginal output for stringency-unknown proteins; on five external cohorts it lifts cohort-mean AUC from 0.69–0.70 to 0.825, with a +0.10 to +0.16 lift on the three cohorts at measurably-zero training-pool overlap.
Paste a protein sequence and get calibrated soluble probabilities across the five regimes.
ESMFold v1 (Meta AI) structure prediction via the public ESMFold Atlas API; residues coloured by per-residue pLDDT confidence: very high (≥90), confident (70–90), low (50–70), very low (<50).
Three steps from a tagged protein construct to per-protocol probabilities.
His6, HiBit (VSGWRLFKKIS), Strep, common linkers, and the initiator methionine are removed using aikisol.normalize_sequence. The training pool was tag-stripped, so scoring a tagged input as-is would introduce a silent train/inference distribution mismatch.
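To make the normalization step concrete, here is a minimal sketch of terminal tag stripping. It is not the implementation of aikisol.normalize_sequence: the function name strip_tags, the choice of Strep-tag II (WSHPQFEK) as the Strep variant, the restriction to the two termini, and the omission of linkers are all assumptions made for illustration.

```python
import re

# Illustrative tag patterns only; the real aikisol.normalize_sequence may differ.
TAGS = [
    r"H{6,}",        # His6: six or more consecutive histidines
    r"VSGWRLFKKIS",  # HiBit, as given in the text
    r"WSHPQFEK",     # Strep-tag II (assumed variant)
]
N_TERM = re.compile(f"^(?:{'|'.join(TAGS)})")
C_TERM = re.compile(f"(?:{'|'.join(TAGS)})$")


def strip_tags(seq: str) -> str:
    """Drop a leading Met, then peel listed tags off both termini (hypothetical helper)."""
    seq = "".join(seq.split()).upper()  # remove whitespace and normalize case
    if seq.startswith("M"):
        seq = seq[1:]                   # initiator methionine
    changed = True
    while changed:                      # repeat until no terminal tag is left
        changed = False
        for pattern in (N_TERM, C_TERM):
            stripped = pattern.sub("", seq, count=1)
            if stripped != seq:
                seq, changed = stripped, True
    return seq
```

For example, strip_tags("MHHHHHHSGKLITVLVWSHPQFEK") removes the methionine, the N-terminal His6, and the C-terminal Strep-tag II, leaving the core "SGKLITVLV". Linker removal is left out here because the text does not list which linkers are recognised.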
Sequence → ESM-2 tokens (max_length=1022) → masked-mean pool over residues. Backbone weights (facebook/esm2_t33_650M_UR50D, MIT) come from HuggingFace; the curve-head adapter is loaded from the Zenodo deposit.
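The masked-mean pooling step can be sketched on plain arrays. This is an illustration under assumptions, not the released code: NumPy stands in for the tensor library, the function name masked_mean_pool is hypothetical, and whether the special start/end tokens are excluded from the mask is not stated in the text.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over positions where mask == 1.

    hidden: (batch, length, dim) per-token embeddings from the backbone.
    mask:   (batch, length), 1 for real residues and 0 for padding.
    """
    weights = mask.astype(hidden.dtype)[..., None]    # (batch, length, 1)
    summed = (hidden * weights).sum(axis=1)           # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts


# Toy input: two sequences padded to length 4, embedding dimension 3.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
```

The mask keeps padded positions from diluting the average, so a short sequence and a long one in the same batch are each pooled over their own residues only.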
The pooled representation passes through a 2-layer MLP (1280 → 256 → 6) under sigmoid. Outputs 1–5 are P(soluble | regime) for the five centrifugation regimes; output 6 is a dedicated P(soluble | sequence) marginal head trained on stringency-unknown rows — the recommended single number when the downstream protocol is not specified.
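A shape-level sketch of the head described above may help when reading the six outputs. The weights here are random stand-ins rather than the released adapter, the function name curve_head is hypothetical, and the hidden activation (ReLU) is an assumption, since the text gives only the layer sizes and the final sigmoid.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def curve_head(pooled, w1, b1, w2, b2):
    """2-layer MLP, 1280 -> 256 -> 6, with a sigmoid on each output.

    ReLU between the layers is an assumption for illustration.
    """
    hidden = np.maximum(pooled @ w1 + b1, 0.0)
    return sigmoid(hidden @ w2 + b2)


rng = np.random.default_rng(0)
# Random placeholder weights, NOT the Zenodo adapter weights.
w1, b1 = rng.normal(0.0, 0.02, (1280, 256)), np.zeros(256)
w2, b2 = rng.normal(0.0, 0.02, (256, 6)), np.zeros(6)

pooled = rng.normal(size=(1, 1280))  # stand-in for one pooled embedding
probs = curve_head(pooled, w1, b1, w2, b2)[0]
per_regime, marginal = probs[:5], probs[5]  # outputs 1-5 and output 6
```

Because each output has its own sigmoid, the six values are independent probabilities and are not expected to sum to one.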
Three endpoints, each scale-to-zero on Modal CPU containers.
All routes live under https://aikium--aikisol-landing-page.modal.run/.
Score a single sequence passed as ?seq=.... Returns mean_prob and per-stringency probabilities.
curl 'https://aikium--aikisol-landing-page.modal.run/lookup_protein?seq=MKLITVLVLALLAVAVAFPV'
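The same call can be made from Python with only the standard library. This is a sketch under assumptions: the helper names lookup_url and lookup_protein are hypothetical, the response is assumed to be JSON, and no field other than mean_prob is assumed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://aikium--aikisol-landing-page.modal.run"


def lookup_url(seq: str) -> str:
    """Build the GET URL for /lookup_protein with the sequence URL-encoded."""
    return f"{BASE}/lookup_protein?{urlencode({'seq': seq})}"


def lookup_protein(seq: str, timeout: float = 60.0) -> dict:
    """Fetch the prediction and decode it, assuming a JSON body."""
    with urlopen(lookup_url(seq), timeout=timeout) as resp:
        return json.load(resp)
```

Calling lookup_protein("MKLITVLVLALLAVAVAFPV").get("mean_prob") would then read the marginal score. The generous timeout allows for a cold start, since the endpoints scale to zero.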
Batch-score a FASTA payload. JSON or CSV response.
curl -X POST 'https://aikium--aikisol-landing-page.modal.run/predict_fasta' \
-H 'Content-Type: application/json' \
-d '{"fasta": ">p1\nMKLITVLVLALLAVAVAFPV\n>p2\nMAEILVTQNMK\n", "format": "csv"}'
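For batches assembled in code, the request body can be built from a name-to-sequence mapping instead of a hand-written string. The sketch below mirrors the curl payload above; the helper names to_fasta, build_request, and predict_fasta are hypothetical, and the response is returned as raw text because its exact layout is not specified here.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://aikium--aikisol-landing-page.modal.run"


def to_fasta(records: dict) -> str:
    """Serialize {name: sequence} into a FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


def build_request(records: dict, fmt: str = "csv") -> Request:
    """Create the POST request for /predict_fasta without sending it."""
    body = json.dumps({"fasta": to_fasta(records), "format": fmt}).encode("utf-8")
    return Request(
        f"{BASE}/predict_fasta",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_fasta(records: dict, fmt: str = "csv", timeout: float = 120.0) -> str:
    """Send the batch and return the response body as text."""
    with urlopen(build_request(records, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Letting json.dumps handle the escaping avoids the manual \n sequences that the shell version needs.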
Same model, on your own machine, no rate limits.
pip install aikisol
aikisol-predict --sequence "MAEILVTQNMK..."
aikisol-predict --fasta my.fasta --out predictions.csv
aikisol-predict --csv input.csv --seq-col sequence
docker pull ghcr.io/aikium-public/aiki-sol:full
docker run -v "$PWD":/work ghcr.io/aikium-public/aiki-sol:full \
  --fasta /work/my.fasta --out /work/predictions.csv
Known limits of this release.
Sequences are truncated at 1,022 residues at the ESM-2 tokenizer level. Long modular proteins beyond that length are scored on the N-terminal 1,022-aa window only.
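Since truncation happens without a warning at the tokenizer, it can be useful to check a sequence before scoring it. A minimal sketch follows; the helper name scored_window is hypothetical, and the window is taken as a plain N-terminal slice as described above.

```python
MAX_RESIDUES = 1022  # ESM-2 window stated in the text


def scored_window(seq: str) -> tuple[str, int]:
    """Return the N-terminal window the model sees and the count of dropped residues."""
    window = seq[:MAX_RESIDUES]
    return window, len(seq) - len(window)
```

A non-zero second value flags a protein whose C-terminal domains do not contribute to the score, which matters most for long modular constructs.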
The training pool was tag-stripped. Inputs containing His6 / HiBit / Strep are silently re-stripped before scoring; the API raises if more than 5% of a batch retain tags after normalization.
The 5 calibrated regimes correspond to centrifugation conditions used in the training pool's source datasets (eSol/PURE, NESG, PSI:Biology, DeepSol/DeepSoluE, ProgSol). Regimes outside this range — e.g. tangential-flow filtration, ultrafiltration — are extrapolations.
Most training rows are from E. coli-expressed targets. The model transfers reasonably to yeast (ProgSol-yeast AUC 0.883) but eukaryotic-expression-specific aggregation modes are not separately calibrated.
Companion paper, Zenodo deposit, GitHub repo.
Rajagopalan, R., Sharma Meda, R., Shastry, S., Mysore, V. (2026). Protein solubility depends on centrifugation: Aiki-Sol, a per-regime predictor for E. coli. bioRxiv 10.64898/2026.05.14.725067. Companion artefacts at Zenodo 10.5281/zenodo.20151817 and github.com/aikium-public/aiki-sol.
@article{aikisol2026,
  author  = {Rajagopalan, Riya and
             Sharma Meda, Radheesh and
             Shastry, Shankar and
             Mysore, Venkatesh},
  title   = {Protein solubility depends on centrifugation:
             {A}iki-{S}ol, a per-regime predictor for {\it E.~coli}},
  year    = {2026},
  journal = {bioRxiv},
  doi     = {10.64898/2026.05.14.725067},
  url     = {https://www.biorxiv.org/content/10.64898/2026.05.14.725067v1},
  note    = {Companion data + model weights:
             \url{https://doi.org/10.5281/zenodo.20151817}}
}
Aiki-Sol stands on a stack of open data and open weights.
Data. Niwa et al. (2009) for the eSol cell-free intrinsic-aggregation dataset (PURE protocol). The NESG, PSI:Biology, MCSG, CSGID, and NYSGRC structural-genomics consortia, and the curators of TargetTrack, for the open consortium-level expression and solubility records this work builds on.
Models & tools. ESM-2 (Meta AI, Lin et al. 2023), ESM Cambrian (EvolutionaryScale), and ProtT5-XL (TUM/Rostlab) for releasing open weights. MMseqs2 (Steinegger & Söding 2017) for the sequence-clustering tool our cluster-disjoint partitions depend on. The authors of NetSolP, PLM_Sol, SoluProt, DeepSol, DSResSol, GATSol, and ProtSolM for shipping deployable inference code that made the cross-predictor benchmark possible.
Infrastructure. HuggingFace for hosting the ESM-2 weights, Meta AI for the ESMFold Atlas API powering the 3D structure widget, Modal Labs for the serverless compute that hosts this demo, and Zenodo for the permanent data+weights deposit.