Spain will develop an open-source large language model trained in Spanish (Castellano), Basque, Catalan, Galician and Valencian.
Spanish Prime Minister Pedro Sánchez announced the project, which will involve a range of public and private organisations, at the Mobile World Congress taking place in Barcelona this week.
Sánchez said Latin American countries will be invited to train the LLM so that it will be useful to users from any Spanish-speaking country.
That will also allow Spanish AI startups to compete in the vast markets of Latin America and the Spanish-speaking communities of the US, says Carlos KiK, CTO and cofounder of Barcelona-based startup AiMA Beyond AI.
“We need this project very much to compete with the American tech companies. If we don’t move quickly, one of the big ones will come and impose their Spanish model on us,” he says.
The LLM will be developed as a public-private partnership between the Barcelona Supercomputing Center (BSC), the Spanish Supercomputing Network (which comprises 12 supercomputers), the Royal Spanish Academy and the Association of Spanish Language Academies, which work to protect the integrity of the Spanish language across the world.
Albert Cañigueral, tech transfer director for AI and language technology projects at BSC, hopes the LLM will be equivalent to OpenAI’s GPT-3 model and will be released by the summer, assuming the centre’s MareNostrum 5, one of the world’s 10 most powerful supercomputers, comes into operation as planned this spring.
Propelling the industry
The initiative will build on two existing BSC projects, Aina on Catalan and Ilenia on Spanish and other regional languages, which have been gathering written data and recording speech in many parts of Spain, Cañigueral tells Sifted.
Once the LLM is released, the BSC project’s second phase will focus on ensuring it is picked up by industry and public institutions.
The BSC data, which will not include licensed content, is already accessible to companies of all sizes and has been used by Google to improve its PaLM-2 model, says Cañigueral.
A handful of startups and publicly funded projects have already been developing LLMs trained on data in these languages, such as Clibrain (absorbed into Clidrive earlier this month) for Spanish and Latxa for Basque, which Cañigueral says will also benefit from the BSC initiative.
“There will be a coexistence of models, there won’t be a unique model to rule them all like [the One Ring] in the Lord of the Rings,” he says. “There’s a lot of options for collaboration with smaller models specialised on specific tasks, which will also be able to access the BSC data.”
The BSC data will also be available for evaluating models, enabling better comparisons of their language accuracy and fluency, Cañigueral adds.
Warm welcome
KiK says the BSC LLM project could substantially improve the linguistic accuracy of products built by AI startups in Spain, which currently rely on LLMs trained on data that is up to 90% English.
AiMA is an AI-powered life companion that can interact in English, Spanish and Catalan, built on Meta’s open-source LLaMA-2 LLM, which translates from American English. The quality of its output in Catalan is quite poor, KiK says.
“I’ve had to make a gargantuan effort to get it to respond in the language you used when asking a question,” he says.
The BSC LLM would save developers time, as they would no longer have to modify the software to make it sound more natural, KiK adds.
The companions would also be able to identify the local dialect spoken by the user and adapt their responses as if they were speakers from the same geographic region, for example by using local expressions.