Future of data
Big data just got a whole lot bigger
Last updated: 23 Nov 2023
Market 101
AI is being used to recognise faces, code without knowing how to code, rate creditworthiness, write wedding speeches and predict the weather. Underpinning all these efforts are enormous datasets, like the ones used to train the latest generation of AI — the ChatGPTs and Stable Diffusions — which can contain billions of images and words scraped from the internet.
Just about every startup is shaping up to provide “AI-powered” X or Y and they’re all searching for good data — sometimes in unusual places. When a group of researchers wanted to build a Danish language dataset with which to train AI, they turned to a Danish web forum created in 1997 by horse enthusiasts to talk about — what else — horses. Anyone hoping to develop a ChatGPT challenger will need to find their own version of an equine forum to get the necessary info.
Or they could just make some up. “Synthetic” or artificial data can be used to train AIs in areas where real data is scarce or too sensitive. British founder Simi Lindgren is an early adopter: in 2015, she created a website called Yuty to sell beauty products for all skin types and used hundreds of thousands of synthetic, photorealistic images of people with diverse skin tones.
Get ready for a world awash in artificial datasets and identities. 60% of all data used to train AI will be fake by 2024, up from 1% in 2021, predicts Gartner. London’s Synthesia, for instance — one of a handful of European unicorns born in 2023 — supplies digital human faces on demand. Proponents argue that this data can rid us of the kind of bias that comes at the expense of women and people of colour; others worry fake data will be exploited by fraudsters.
At the same time, the future of data is a scramble for supercomputers and data centres. Experts warn a lack of proper computing power, unless addressed, will hobble Europe’s lofty AI ambitions. Take Paris-based Mistral AI, which raised €105m in June to build models that could rival OpenAI. As Sifted has reported, the company trains a large part of its model in the US. Growing alongside data infrastructure companies is demand for label makers, as machine learning systems require a huge number of correctly labelled samples to start getting good at prediction, meaning the likes of London annotation specialist V7 are pulling in big money from VCs.
The future of data also comes with plenty of lawsuits. Copyright fights are the coming storm for GenAI, which, of course, harvests original work by humans. Sifted reported last week that Stability AI’s head of audio resigned from the company due to concerns over its use of copyrighted material in training its models. The revolt against AI is going to be noisy.
Early stage market map
Key facts
$2.1bn
the size of the fake data market by 2028, up from $0.3bn today, according to MarketsandMarkets¹
17%
YoY growth in data centres in Europe in 2023²
100+
the number of homes whose combined annual electricity consumption is comparable to the energy required to train a single LLM³
Trends to watch
Record data centre demand
Small companies are coming up with new AI applications, while the big ones are providing computing power.
Securing good data infrastructure is a big battle, and cloud providers like Google and Amazon, along with specialised chipmakers like NVIDIA, enjoy a disproportionate advantage here.
As Sifted has reported, not only are startups largely dependent on the hardware and services provided by foreign players, they also have to play by their rules, which can become more constraining as access to GPUs — aka the “compute” used to train AI models — becomes increasingly competitive and waiting lists for key components grow.
The growth in data-heavy AI applications sees Europe bulking up on server farms: 2023 will be “a record year” for data centre growth in the region, according to analysis from real estate company JLL. Supply is predicted to go up 17%, with Frankfurt, London, Amsterdam, Paris and Dublin to see the most new hubs created.
New types of licensing deals
The growing data needs of AI startups will create new licensing opportunities for content providers, Pablo Ducru, the founder of a new large language model (LLM) called Raive, told an audience in Paris last week.
He envisages new royalty streams for content providers loaning their data to AI companies for training purposes, rather than redistribution. “The early AI pirates grabbed data without asking permission — it won’t continue that way,” he says.
Media companies including the New York Times are rewriting their terms and conditions to ban LLMs from scraping their data, forcing AI to go through the front door and sign exclusive licensing deals (see also OpenAI’s partnership with Associated Press).
Good data still needs (super)human efforts
Powerful AI systems will still have to reckon with human foibles. Take the experience of Paris-based Owkin, for example. The company wants to use AI to personalise treatment for every patient. But it needs very good data to do this, which means it needs people like Agathe Arlotti, the company’s senior vice president of partnerships.
Arlotti’s job is to forge relationships with the people who run Europe’s hospitals and convince them to trade everything from molecular data to slide decks created by the pathologist. “Hospitals are not ready to do this, so we have to send our team to build with them,” she said last week at a Paris conference.
There’s also data privacy legislation, created by humans, for startups to understand and get around. “GDPR’s slowing down a lot of research products,” Arlotti adds.
Beware “poisoned” AI
Mo data, mo problems. With so much fake data in our future systems, what are the risks that some of it will be used for fraud? Security specialists warn that manipulating the data used to train machines offers a powerful method to get around AI-powered defences.
This “poisoned AI” is malicious code labelled as “good” by a hacker hoping to trick a neural network into thinking a piece of software is harmless. Or it might not even be a hacker: we’re seeing the emergence of new tools in the US that let artists subtly tweak their art before they upload it online so that if it’s scraped by an AI company without their say-so, it can cause the resulting model to break.
There’s going to be added focus in the future on ensuring data is clean and trustworthy — look out for a new generation of companies offering to help.
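To make the poisoning idea above concrete, here is a minimal Python sketch. It is an illustration only, not a description of any real security product. It assumes a toy detector that scores each file with a single made-up “suspiciousness” number and labels it by whichever class average is nearest. The feature values, the helper names train_centroids and classify, and the number of poisoned samples are all hypothetical.

```python
from statistics import mean


def train_centroids(samples):
    """Return {label: average feature value} for a list of (feature, label) pairs."""
    by_label = {}
    for feature, label in samples:
        by_label.setdefault(label, []).append(feature)
    return {label: mean(values) for label, values in by_label.items()}


def classify(centroids, feature):
    """Pick the label whose average is closest to the given feature value."""
    return min(centroids, key=lambda label: abs(centroids[label] - feature))


# Hypothetical one-number "suspiciousness" score per file (0 = harmless-looking).
clean_training = [
    (0.1, "benign"), (0.2, "benign"), (0.3, "benign"),
    (0.8, "malicious"), (0.9, "malicious"), (1.0, "malicious"),
]

# The attacker slips in malicious-looking samples that are mislabelled as benign.
poison = [(0.95, "benign")] * 6

# A file the attacker wants the detector to wave through.
suspect_file = 0.7

clean_model = train_centroids(clean_training)
poisoned_model = train_centroids(clean_training + poison)

clean_verdict = classify(clean_model, suspect_file)
poisoned_verdict = classify(poisoned_model, suspect_file)

print("clean model says:", clean_verdict)
print("poisoned model says:", poisoned_verdict)
```

Trained on the clean set, the toy detector flags the suspect file as malicious. Once the six mislabelled samples are mixed in, the “benign” average drifts towards the suspicious end of the scale and the same file passes as benign. Real models and real attacks are far more complex, but the mechanism is the same: the model faithfully learns whatever labels it is given.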
Startups tracked by Sifted
Sifted take
The frenzied race to create the world’s best chatbot has everyone on the hunt for (even more) data. Europe doesn’t have the AI-specialised chips (these mostly reside in Silicon Valley), but there are other ways for the region to gatecrash the Big Tech party. It’s a great time to be making fake data or working on advanced cybersecurity software that detects anyone using AI to deceive. Language is arguably the main department where Europe should be seeking an edge: we may not see a European company build a more powerful English LLM than Google or OpenAI any time soon, but we should expect to see the best Finnish, Swedish, German (you name it) LLMs come out of this region.
Rising stars
This startup, which has developed a platform that makes sensitive data available for machine learning systems without sharing it more widely, raised £6m in seed funding in November 2022. Octopus Ventures led the round, joined by Phoenix Court and six other investors.
Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €8.7m
Developer of privacy-preserving software intended to work on sensitive data without accessing the data directly. The company joined Y Combinator’s winter 2022 cohort and received $500k in funding in the form of SAFE notes. Previously, the company raised seed funding from Michael Schroll, Zillionize and other undisclosed investors in 2022.
Round: Pre-seed
Valuation: Undisclosed
Date: 2022
Size: €1.4m
Developer of an intelligent risk underwriting management platform designed to reduce global business risk and business interruption claims. The company raised £1.62m in seed funding in December 2022, with SuperSeed leading the deal. Cornwall & Isles of Scilly Investment Fund and The FSE Group also participated in the round.
Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €2.4m
Scientific AI platform aiming to accelerate drug development. The company secured €2.5m in grant funding from the EIC Accelerator in October 2022.
Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €6.1m
Early stage startups to watch
Abzu | Operational Research | €13.3m | €6.1m | -
Apheris AI | Deep Learning | €11.2m | €8.7m | -
Bitfount | Machine Learning | €4.5m | €4.5m | -
Bookingdata IO GmbH | Business Intelligence | €565k | €565k | -
BranchKey | Operational Research | €250k | €250k | -
Clearbox AI | Deep Learning | €750k | €390k | -
Core Life Analytics | Deep Learning | €1.2m | €1m | -
Cyanite | Machine Learning | €1.8m | €800k | -
Dedomena | Business Intelligence | €500k | €500k | -
Edgeless Systems | Data preparation and Management | €6.1m | €4.5m | -
Gardenia Technologies | Business Intelligence | €3.7m | €1.6m | -
GoodVision | Machine Learning | €3.7m | €2.7m | -
Intelligent AI | Operational Research | €3.2m | €2.4m | €9.6m
Kaiko Systems | Operational Research | €2m | €2m | -
Kestrix | Deep Learning | €1m | €573k | -
Roseman Labs | Data preparation and Management | €4.4m | €4m | -
Sarus | Deep Learning | €3.4m | €1.4m | -
Syntonym | Data preparation and Management, Synthetic data | €1.2m | €820k | -
Europe’s success stories
Who early stage startups are up against
(Pre-)Seed
Series A
Series B
Series C
Series D+
IPO/Exit
This London-based sports data company, which went public in 2023, provides data management, video streaming and other services to sports leagues, bookmakers and media companies. The company posted revenues of $102m in November 2023.
The London-based company raised £25m in March 2022 for its analytics software aimed at improving regulatory compliance and customer onboarding. The money helped the company expand and set up offices in Belgrade, Glasgow and Sydney.
This French company creates software that lets you organise, share and visualise any type of data. Founded in 2011, Opendatasoft powers data-sharing portals for more than 600 organisations all over the world.
This Austrian startup specialising in “synthetic” — i.e. artificial — customer data raised $25m in Series B funding in January 2022. Investors at Molten Ventures led the round, with participation from Citi Ventures, 42CAP and Earlybird Venture Capital.
Sources
News articles
How is the global run on AI hardware affecting startups? | August 2023 | Sifted
AI startups need more data centres. France wants to build them | October 2023 | Sifted
Stability AI’s head of audio resigns over copyright concerns | November 2023 | Sifted
Synthetic data for AI | February 2022 | MIT Technology Review
Europe has a secret weapon to beat Big Tech on GenAI | November 2023 | Sifted
2 Core European data centre markets set for record growth in 2023, with 17% increase in new supply and 32% more take-up expected | June 2023 | JLL
3 Artificial Intelligence Is Booming—So Is Its Carbon Footprint | March 2023 | Bloomberg
Research reports
Synthetic data generation market worth $2.1 billion by 2028 | June 2023 | MarketsandMarkets
Emerging Space brief: Synthetic data | March 2023 | Pitchbook
Other
1 Synthetic Data Generation Market worth $2.1 billion by 2028 - Exclusive Report by MarketsandMarkets | June 2023 | Bloomberg