It’s no secret that generative AI is becoming a Big Tech party hosted in Silicon Valley, as Microsoft, Google, Amazon and Meta all flex their muscles to produce the large language models (LLMs) that many believe represent the next paradigm shift in tech.
But early results from a new European model released today give some cause for optimism that all is not lost, particularly when it comes to systems for languages other than English.
Much of Big Tech’s AI dominance has to do with hardware resources: LLMs require huge amounts of computing power (known in the industry as ‘compute’) to be trained, and these companies have huge quantities of AI-specialised chips at their disposal.
But compute isn’t the only thing you need bucketloads of to build an LLM. You also need a lot of language data — normally text scraped from the internet — and this is one department where Europe could have an edge.
The devil’s in the data
While a European company is unlikely to build a more powerful English LLM than Google or OpenAI any time soon, one Finnish AI company, Silo, has today released the first results for a multilingual model called Poro (Finnish for reindeer).
The model is trained on Finnish and English text as a proof of concept that high-performance LLMs can be built with a mix of languages, and Silo says early results show that it’s competitive with Meta’s open-source Llama models.
Silo has built Poro in collaboration with the University of Turku, and now plans to train a range of models across all European languages, thanks to a goldmine of data it’s sitting on.
The company, together with the university, has access to data from an EU-funded initiative called the High Performance Language Technologies (HPLT) project, which has gathered 7 petabytes (7,000 terabytes) of language data across 80 languages since 2022.
For context, GPT-3.5 (the model that powered the release version of ChatGPT) was trained on 45 terabytes of text data.
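For a rough sense of scale, the two figures above can be compared directly. The short sketch below assumes decimal units (1 petabyte = 1,000 terabytes, matching the conversion given above) and simply divides one number by the other; the variable names are illustrative.

```python
# Figures quoted above; decimal units assumed (1 PB = 1,000 TB).
TB_PER_PB = 1_000

hplt_tb = 7 * TB_PER_PB   # HPLT: 7 petabytes across 80 languages
gpt35_tb = 45             # GPT-3.5 training text, as cited above

ratio = hplt_tb / gpt35_tb
print(f"HPLT is roughly {ratio:.0f}x the size of the cited GPT-3.5 text data")
```

That works out to roughly 156 times as much data, although the HPLT total is spread across 80 languages, so the share for any single language is far smaller.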
Access to high-quality, publicly funded text data from projects like HPLT could mean that models like Silo’s outperform what Big Tech can accomplish in less commonly used languages, where less data is available online.
Because there still aren’t huge quantities of data for languages like Finnish, Silo has made its model multilingual by “cross-training” the model with English and Finnish data.
This means that the model is fed text in both languages and learns for itself how the two relate to each other. As a result, you can ask it things in Finnish and it’ll answer in Finnish, even if it has to draw on English training data.
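As a loose illustration of what feeding a model text in both languages can look like, the sketch below interleaves two small corpora into a single training stream at a chosen ratio. It is not Silo’s method, which isn’t detailed here; the function name `mix_corpora`, the `finnish_share` parameter and the sample sentences are all hypothetical.

```python
import random
from typing import Iterator, Sequence


def mix_corpora(
    finnish: Sequence[str],
    english: Sequence[str],
    finnish_share: float,
    n_samples: int,
    seed: int = 0,
) -> Iterator[tuple[str, str]]:
    """Yield (language, text) pairs, picking Finnish with probability finnish_share.

    Hypothetical helper for illustration only. Sampling is with replacement,
    so a small corpus can be repeated while a large one is only partly seen.
    """
    if not 0.0 <= finnish_share <= 1.0:
        raise ValueError("finnish_share must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the stream reproducible
    for _ in range(n_samples):
        if rng.random() < finnish_share:
            yield "fi", rng.choice(finnish)
        else:
            yield "en", rng.choice(english)


# Tiny made-up corpora; note that the only code sample is in the English set.
finnish_docs = ["Poro on pohjoisen eläin.", "Hyvää huomenta!"]
english_docs = [
    "A reindeer lives in the north.",
    "Good morning!",
    "def add(a, b): return a + b",
]

stream = list(mix_corpora(finnish_docs, english_docs, finnish_share=0.5, n_samples=10))
for lang, text in stream:
    print(lang, text)
```

Raising `finnish_share` above the language’s natural share of the data is one simple way a scarce language could be given more weight in such a mix, at the cost of repeating the same documents more often.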
“You're able to generate code in Finnish even though the model has not seen any Finnish code,” explains Silo cofounder and CEO Peter Sarlin.
New models trained using Silo’s cross-training techniques — which it will open source — could allow for the production of models across all European languages, even those where there isn’t much data to work with.
Sovereignty and supercomputers
Sarlin says that “there’s clearly an opportunity gap in the market” for LLMs in other languages, adding that it’s crucial for European businesses not to build on technology owned by large US companies.
“If we in Europe make the decision on the individual company level that we just utilise Big Tech [AI models] and apply that within our products, then that will eventually imply that very little of that value creation stays in Europe,” he argues.
Poro was trained on another EU-funded resource: LUMI, a supercomputer that came online in 2022. LUMI is not built with industry-standard NVIDIA chips but with ones from rival California-based manufacturer AMD, which many say are costly and inefficient for AI. However, Sarlin says his team has invested significant resources into building software that works well for AI training.
“I think we will open source a large part of that [software], and we have been and will continue to be very cooperative and help out other companies to train models on LUMI,” says Sarlin.
If European companies can begin making use of resources like LUMI for AI training, based on advancements like these, that could be a big deal as the continent fights to hold its own in the age of AI.