In the age of large AI models, it’s largely been Big Tech-backed research labs, outfitted with near-limitless access to cutting-edge chips, that have led the pack. But beyond raw compute power, which can be bought, the other key ingredient for developing cutting-edge AI is one where startups are finding an advantage: training data.
Today UK startup Basecamp Research is releasing a new model, BaseFold, which it says improves the accuracy of Google DeepMind’s 2020 breakthrough protein folding system AlphaFold by as much as 6x.
It’s done so by combining the open source AlphaFold model with its own unique dataset of biological samples gathered by venturing to some of the world’s most remote locations.
This combination of DeepMind’s research with Basecamp’s data could have big implications for biotech breakthroughs, as well as setting a precedent for sourcing the material we train AI models on in more responsible ways.
The magic of protein folding
Protein folding AI models have revolutionised the way that biotech companies assess candidates for new drugs or other useful enzymes with industrial applications.
Previously, to understand a protein’s structure and how it might interact with other molecules, researchers had to rely on an expensive and time-consuming process called X-ray crystallography. AI models like AlphaFold made it possible to predict the same information far more quickly and cheaply through machine learning, but their predictions are inherently less reliable than using X-ray crystallography instruments to assess a protein in the lab.
“You don't really have as much accuracy [with systems like AlphaFold], as you would with X-ray crystallography; particularly when the proteins are complex or particularly large it really struggles,” explains Basecamp cofounder Glen Gowers. “That's where increasing the accuracy is really important because then you almost get the best of both worlds.”
He says that Basecamp’s team has shown that, by supplementing the AlphaFold code with a richer dataset, it’s possible to reduce the rate of errors it makes by up to 80%.
“In some cases we saw a 6x improvement in the accuracy of the model,” he says. “We're able to use the architecture of AlphaFold and supplement it with our [protein] sequences to give it that much more granular view.”
Deep data
Basecamp’s ability to improve a world-leading model like AlphaFold has much to do with its unusual business model. The startup, founded in 2019, has sent expedition teams to 23 countries to collect biological samples from places with unusual geologies and microclimates, where it hopes to find rare microbes.
Having now gathered microbial samples from more than 75 hard-to-reach places around the world, Basecamp has built a database that it’s able to monetise by licensing it out to corporate clients like pharma companies. It also has its own AI team which helps clients to find new proteins for specific functions. It’s partnered with US-based biotech Protein Evolution, for example, to find enzymes to break down plastics.
Gowers tells Sifted that models like AlphaFold are normally trained on publicly available datasets which are often marred by “truncations, errors and very short fragments” in the protein sequences. He adds that Basecamp’s dataset doesn’t only include the proteins themselves, but reflects how they interact with DNA in nature — something that’s valuable when assessing how a protein might be useful commercially.
“All the public datasets are protein databases or separate DNA databases,” Gowers says. “We're able to see the protein and the DNA in the same context and link them together… There are relationships within this dataset that you can't find anywhere else.”
With its new model BaseFold, Basecamp says it will be able to predict how a prospective protein will function more accurately than before, giving clients results that are both faster and more accurate.
Permission
The startup’s dataset isn’t only notable for what’s in it, but also for the terms on which it’s gathered. Basecamp signs revenue share deals with the custodians of the locations it takes microbial samples from, so that local communities and ecosystems also benefit from the economic value generated by biodiversity in those places.
As well as ensuring that there’s an ethical upside to whatever products come out of Basecamp’s data, this approach also addresses the kinds of IP concerns that have landed a number of AI companies in court over alleged misuse of copyrighted material.
“Not only is the dataset better quality, but it's actually sourced correctly with the right permissions around it so that it can be commercialised,” says Gowers.
Basecamp says that it’s now distributed money to 42 communities around the world, incentivising them to protect the biodiversity of their ecosystems.
And with the tech world fretting over the looming dominance of Big Tech companies in the new AI economy, it’s a story that shows that startups can still punch above their weight.