Romeo Kokriatski @vagrantjourno.bsky.social

There are very good reasons to oppose the widespread and unregulated use of training data on privacy grounds, unrelated to any questions of copyright, even. So there are tons of mechanisms available to limit the spread and harm of automated snake oil vending machines.

aug 27, 2025, 7:24 am • 0 0

Replies

Kevin Riggle @kevinriggle.bsky.social

Copyright, yes. Privacy? I'm open to the argument but I'm not obviously seeing it. And I spent a big chunk of the last nine months working on policy in this area so I'm fairly read in

aug 27, 2025, 7:27 am • 0 0

Romeo Kokriatski @vagrantjourno.bsky.social

How well is the training data scrubbed? It's irrelevant to me if the model can reproduce it or not (though I have seen apocryphal cases of that), but the very existence of possibly privileged data being exploited by an unacknowledged third party is a violation, especially if that includes bio data

aug 27, 2025, 7:30 am • 0 0

Romeo Kokriatski @vagrantjourno.bsky.social

Take 'right to be forgotten'. How does that apply to, say, a scrape of Facebook posts I've gotten Meta to remove but that still exist in training data, which theoretically can still be replicated without my consent?

aug 27, 2025, 7:30 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

Everybody misunderstands "right to be forgotten." It's much less broad than people make it out to be, and there's a linkability requirement, which means it's mostly moot for training data

aug 27, 2025, 7:31 am • 1 0

Romeo Kokriatski @vagrantjourno.bsky.social

? I can request Meta to remove my profile and all posts made from it, yes?

aug 27, 2025, 7:32 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

Right, that's all in a database somewhere and straightforwardly linkable to an identifier related to you. However, in the kind of training data corpuses we're talking about, scraped from the public web, any information related to you is not in a database linked to an identifier related to you

aug 27, 2025, 7:34 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

At a very high level, Facebook is only required to delete stuff in the former category, not the latter

aug 27, 2025, 7:35 am • 0 0

Romeo Kokriatski @vagrantjourno.bsky.social

But wouldn't the existence of that data in a training scrape also allow me to in theory request its deletion?

aug 27, 2025, 7:36 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

I forget the exact language in the GDPR, but, no. And you can kind of see why if you consider the inverse, that requiring Facebook to do so would require them to prove the negative that no data possibly related to you existed anywhere under their control.

aug 27, 2025, 7:38 am • 0 0

Romeo Kokriatski @vagrantjourno.bsky.social

Curious. What if it is provable that biographical information on non-public figures can be retrieved from an LLM via prompting?

aug 27, 2025, 7:39 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

If it was discovered that Facebook was 1) training on user data, and 2) still training on user data which was linked to your user account after you requested deletion and they indicated that your data had been deleted, there would be a case. However, they don't do that

aug 27, 2025, 7:36 am • 0 0

Romeo Kokriatski @vagrantjourno.bsky.social

Don't we only have their word for it?

aug 27, 2025, 7:37 am • 0 0

Kevin Riggle @kevinriggle.bsky.social

I mean if you can prove otherwise, go for it

aug 27, 2025, 7:38 am • 1 0