Future of data

Big data just got a whole lot bigger

Last updated: 23 Nov 2023

Market 101

AI is being used to recognise faces, code without knowing how to code, rate creditworthiness, write wedding speeches and predict the weather. Underpinning all these efforts are enormous datasets, like the ones used to train the latest generation of AI — the ChatGPTs and Stable Diffusions — which can contain billions of images and words scraped from the internet.

Just about every startup is shaping up to provide “AI-powered” X or Y and they’re all searching for good data — sometimes in unusual places. When a group of researchers wanted to build a Danish language dataset with which to train AI, they turned to a Danish web forum created in 1997 by horse enthusiasts to talk about — what else — horses. Anyone hoping to develop a ChatGPT challenger will need to find their own version of an equine forum to get the necessary info.

Or they could just make some up. “Synthetic” or artificial data can be used to train AIs in areas where real data is scarce or too sensitive. British founder Simi Lindgren is an early adopter: in 2015, she created a website called Yuty to sell beauty products for all skin types and used hundreds of thousands of photorealistic images of people with diverse skin tones.

Get ready for a world awash in artificial datasets and identities. Gartner predicts that 60% of all data used to train AI will be fake by 2024, up from 1% in 2021. London’s Synthesia, for instance, one of a handful of European unicorns born in 2023, supplies digital human faces on demand. Proponents argue that this data can rid us of the kind of bias that comes at the expense of women and people of colour; others worry fake data will be exploited by fraudsters.

At the same time, the future of data is a scramble for supercomputers and data centres. Experts warn that a lack of computing power, unless addressed, will hobble Europe’s lofty AI ambitions. Take Paris-based Mistral AI, which raised €105m in June to build models that could rival OpenAI’s. As Sifted has reported, the company trains a large part of its model in the US. Demand for data labelling is growing alongside data infrastructure companies: machine learning systems need a huge number of correctly labelled samples before they start getting good at prediction, so the likes of London annotation specialist V7 are pulling in big money from VCs.

The future of data also comes with plenty of lawsuits. Copyright fights are the coming storm for GenAI, which, of course, harvests original work by humans. Sifted reported last week that Stability AI’s head of audio resigned from the company due to concerns over its use of copyrighted material in training its models. The revolt against AI is going to be noisy.

Early stage market map

Key facts

$2.1bn

the size of the fake data market by 2028, up from $0.3bn today, according to MarketsandMarkets

17%

YoY growth in data centres in Europe in 2023

100+

the number of homes whose combined annual electricity consumption is comparable to the energy required to train a single LLM

Startups tracked by Sifted

Sifted take

The frenzied race to create the world’s best chatbot has everyone on the hunt for (even more) data. Europe doesn’t have the AI-specialised chips (these mostly reside in Silicon Valley), but there are other ways for the region to gatecrash the Big Tech party. It’s a great time to be making fake data or working on advanced cybersecurity software that detects anyone using AI to deceive. Language is arguably the main department where Europe should be seeking an edge: we may not see a European company build a more powerful English LLM than Google or OpenAI any time soon, but we should expect to see the best Finnish, Swedish, German (you name it) LLMs come out of this region.

Rising stars

Apheris AI

Deep Learning

Total funding: €11.2m

Berlin, Germany
2019

This startup, which has developed a platform that makes sensitive data available for machine learning systems without sharing it more widely, raised £6m in seed funding in November 2022. Octopus Ventures led the round, joined by Phoenix Court and six other investors.

Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €8.7m

Sarus

Deep Learning

Total funding: €3.4m

Paris, France
2020

Developer of privacy-preserving software intended to work on sensitive data without accessing the data directly. The company joined Y Combinator’s winter 2022 cohort and received $500k in funding in the form of SAFE notes. Previously, the company raised seed funding from Michael Schroll, Zillionize and other undisclosed investors in 2022.

Round: Pre-seed
Valuation: Undisclosed
Date: 2022
Size: €1.4m

Intelligent AI

Operational Research

Total funding: €3.2m

Exeter, UK
2020

Developer of an intelligent risk underwriting management platform designed to reduce global business risk and business interruption claims. The company raised £1.62m in seed funding in December 2022, with SuperSeed leading the deal. Cornwall & Isles of Scilly Investment Fund and The FSE Group also participated in the round.

Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €2.4m

Abzu AI

Operational Research

Total funding: €13.3m

Copenhagen, Denmark
2018

Scientific AI platform aiming to accelerate drug development. The company secured €2.5m in grant funding from the EIC Accelerator in October 2022.

Round: Seed
Valuation: Undisclosed
Date: 2022
Size: €6.1m

Early stage startups to watch

Startup | Location | Founded | Stage | Total funding | Last round size |
Abzu | Copenhagen, Barcelona, Denmark | 2018 | Seed | 13.3m | 6.1m | -
Apheris AI | Berlin, Germany | 2019 | Seed | 11.2m | 8.7m | -
Bitfount | London, United Kingdom | 2020 | Seed | 4.5m | 4.5m | -
Bookingdata IO GmbH | Munich, Germany | 2021 | Seed | 565k | 565k | -
BranchKey | Amsterdam, Netherlands | 2019 | Pre-seed | 250k | 250k | -
Clearbox AI | Turin, Italy | 2019 | Pre-seed | 750k | 390k | -
Core Life Analytics | 's-Hertogenbosch, Netherlands | 2016 | Seed | 1.2m | 1m | -
Cyanite | Berlin, Germany | 2019 | Seed | 1.8m | 800k | -
Dedomena | Rivas-Vaciamadrid, Spain | 2021 | Seed | 500k | 500k | -
Edgeless Systems | Bochum, Germany | 2019 | Seed | 6.1m | 4.5m | -
Gardenia Technologies | London, United Kingdom | 2016 | Seed | 3.7m | 1.6m | -
GoodVision | London, United Kingdom | 2017 | Seed | 3.7m | 2.7m | -
Intelligent AI | Exeter, United Kingdom | 2020 | Seed | 3.2m | 2.4m | 9.6m
Kaiko Systems | Berlin, Germany | 2020 | Seed | 2m | 2m | -
Kestrix | London, United Kingdom | 2022 | Pre-seed | 1m | 573k | -
Roseman Labs | Utrecht, Netherlands | 2020 | Seed | 4.4m | 4m | -
Sarus | Paris, France | 2020 | Pre-seed | 3.4m | 1.4m | -
Syntonym | London, United Kingdom | 2019 | Seed | 1.2m | 820k | -

Europe’s success stories

Who early stage startups are up against

This London-based sports data company, which went public in 2023, provides data management, video streaming and other services to sports leagues, bookmakers and media companies. The company posted revenues of $102m in November 2023.

The London-based company raised £25m in March 2022 for its analytics software aimed at improving regulatory compliance and customer onboarding. The money helped the company expand and set up offices in Belgrade, Glasgow and Sydney.

This French company creates software that lets you organise, share and visualise any type of data. Founded in 2011, Opendatasoft powers data-sharing portals for more than 600 organisations all over the world.

This Austrian startup specialising in “synthetic” — i.e. artificial — customer data raised $25m in Series B funding in January 2022. Investors at Molten Ventures led the round, with participation from Citi Ventures, 42CAP and Earlybird Venture Capital.
