Alexander Doria @dorialexander.bsky.social

Meanwhile, I see deep questions on the AI side that should definitely warrant some humanities input. And I don't mean the usual ethics corner, but: how to design reasoning structure/flow, personality tuning (what is the "I" in the model?), rethinking tokenization. Won't happen for some time.

sep 4, 2025, 11:28 pm • 60 5

Replies

Nafnlaus 🇮🇸 🇺🇦 🇬🇪 @nafnlaus.bsky.social

If we had had more humanities input, we probably could have foreseen how bad naive RLHF would be at promoting sycophancy, and in turn what harm endlessly sycophantic models could do to certain vulnerable human populations (e.g. promotion of delusions, etc.)

sep 4, 2025, 11:31 pm • 5 0 • view
Alexander Doria @dorialexander.bsky.social

Even just bad for training. I'm still slightly horrified by the post-training datasets still in use in the open ecosystem. Just a tiny bit of care from people with actual expression skills would make a difference…

sep 4, 2025, 11:33 pm • 7 0 • view
Alexander Doria @dorialexander.bsky.social

Also: we are switching pretraining paradigms right now, from fuzzy web-crawl corpora with honestly questionable data practices all around to some form of generalized "distant writing". I don't think anyone in DH circles is aware of this or talking about it, and that is certainly a miss.

sep 4, 2025, 11:34 pm • 22 2 • view
Peter T. Evans @petertevans.bsky.social

There is a very small "group" of us doing computational theology. In broad terms, we're asking 'distant writing' related questions: what does it mean when a model trained on theological texts then *writes* theological text? How do we categorize it? Is it "new" theology? Etc.

sep 4, 2025, 11:38 pm • 4 0 • view
Alexander Doria @dorialexander.bsky.social

Could actually be an interesting candidate for amplified mid-training: not a large search space, and enough to get performance with a small model.

sep 4, 2025, 11:49 pm • 0 0 • view
Jonathan Cheng @jonathancheng.bsky.social

In hindsight, what I should’ve done is written a few articles or a book, *and then* joined an outfit where I shouldn’t say much. There’s *a lot* to say here.

sep 5, 2025, 6:36 am • 1 0 • view
Jonathan Cheng @jonathancheng.bsky.social

Yeah, I know exactly what you mean. Given that *nervous cough* I suspect we're thinking about very similar problems. There's a line to draw from before NaNoGenMo, through back translation, synthetic data making, etc. etc. It's both interesting and, frankly, very fun.

sep 5, 2025, 6:21 am • 1 0 • view
Alexander Doria @dorialexander.bsky.social

Ah ah. I think multiple research teams are circling the same thing. The last release from Meta on active learning hit a bit too close to home.

sep 5, 2025, 7:51 am • 1 0 • view
Jonathan Cheng @jonathancheng.bsky.social

Something something, one person’s language data augmentation is another person’s distant writing is another person’s creative writing project

sep 5, 2025, 6:22 am • 1 0 • view
Michael Castelle @mcastelle.bsky.social

We’ve also been training on a mix of code and natural language (cleaned/augmented or not) since GPT-3, which should have been extremely weird to everyone. But I agree that humanists are well placed to contrast the long history of “reasoning” with whatever it is that currently goes under that name.

sep 5, 2025, 3:35 am • 1 0 • view
Ted Underwood @tedunderwood.me

Say more: Do you mean synthetic data?

sep 4, 2025, 11:38 pm • 3 0 • view
Alexander Doria @dorialexander.bsky.social

Yes, but that alone won't say much. It's rephrasing/knowledge amplification. It's simulating problems with logic constraints at scale. It's writing things that have never been written before (like a complete search sequence). (Most of my daily work right now.)

sep 4, 2025, 11:48 pm • 10 0 • view
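[Editor's note: the "simulating problems with logic constraints at scale" idea above can be sketched as a toy generator — a template with randomized parameters whose answer is computed rather than written, so every (prompt, answer) pair is correct by construction. The template, field names, and scenario are illustrative assumptions, not the actual pipeline being discussed.]

```python
import random

def make_problem(rng):
    """Generate one synthetic word problem with a verifiable answer.

    Constraints: c < a * b, so the answer is always a positive integer.
    """
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    c = rng.randint(1, a * b - 1)
    prompt = (f"A shelf holds {a} boxes with {b} books each. "
              f"{c} books are removed. How many books remain?")
    return {"prompt": prompt, "answer": a * b - c}

def make_dataset(n, seed=0):
    """Build n examples deterministically from a seed, for reproducibility."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]

examples = make_dataset(3)
```

Because the answer is derived from the sampled parameters, the generator can emit millions of examples without ever producing a wrong label — the property that makes this kind of synthetic data usable for training.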
Sarah Bull @sarahebull.bsky.social

FWIW I'm really interested in the editing / rephrasing / amplification side, but my mostly-sole-author research and publication pipeline is comparatively glacial (probably too slow to be remotely helpful or interesting for most AI/ML people). I wonder if that's a factor here? We'll see more in +

sep 5, 2025, 2:26 am • 2 0 • view
Sarah Bull @sarahebull.bsky.social

6-12 months from DH people?

sep 5, 2025, 2:26 am • 1 0 • view
Sarah Bull @sarahebull.bsky.social

*people who are much faster than me!

sep 5, 2025, 2:27 am • 1 0 • view
Ted Underwood @tedunderwood.me

Pierre-Carl is at the cutting edge of this stuff. I don’t know how well academic DH will be able to keep up, but I think it’s fair to say that I know lots of people who do recognize these as exciting problems!

sep 5, 2025, 2:40 am • 1 0 • view
Sarah Bull @sarahebull.bsky.social

Yeah, I'd think that there's a lot of work on them coming down the pipes.

sep 5, 2025, 2:44 am • 1 0 • view
Paschalis Tsilias @tpaschalis.me

As a layman, can I ask what is "distant writing"? Is it about the distance between the thought (prompt) and output?

sep 5, 2025, 4:53 pm • 0 0 • view
Alexander Doria @dorialexander.bsky.social

From a DH perspective, it's roughly the reverse of distant reading: instead of "reading" lots of text using computational/text-mining methods, you "generate" lots of text using language models under formal constraints/inspirations (which are called "backtranslations").

sep 5, 2025, 5:10 pm • 0 0 • view
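[Editor's note: a minimal sketch of the reversal described above — distant reading mines structure out of text; "distant writing" starts from structure and renders text. A real pipeline would use a language model as the renderer; a plain template stands in here, and the records and field names are invented for illustration.]

```python
# Structured records act as the formal constraints (the "inspiration").
RECORDS = [
    {"title": "De Anima", "author": "Aristotle", "topic": "the soul"},
    {"title": "Leviathan", "author": "Hobbes", "topic": "the state"},
]

def backtranslate(record):
    """Render one structured record as prose -- the 'backtranslation'."""
    return (f"{record['title']}, written by {record['author']}, "
            f"is a treatise on {record['topic']}.")

# Each pair couples generated prose with the structure that produced it,
# which is exactly the kind of supervision distant reading tries to recover.
pairs = [(backtranslate(r), r) for r in RECORDS]
```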
Alexander Doria @dorialexander.bsky.social

The main purpose is to train better small models for dedicated tasks. Language models have very high memorization inertia: you need lots of text for a model to accurately assimilate new facts.

sep 5, 2025, 5:11 pm • 0 0 • view
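[Editor's note: a toy illustration of the amplification that memorization inertia calls for — one fact expanded into many surface rephrasings before training. Templates stand in for an LLM rephraser, and the fact itself is invented for the example.]

```python
# One fact, stated once, is rarely enough for a model to retain it;
# amplification multiplies its surface forms.
FACT = {"entity": "the Loire", "relation": "flows into", "object": "the Atlantic"}

TEMPLATES = [
    "{entity} {relation} {object}.",
    "{object} is what {entity} {relation}.",
    "Ask where {entity} ends: it {relation} {object}.",
]

def amplify(fact, templates):
    """Return one training sentence per template, all expressing the same fact."""
    return [t.format(**fact) for t in templates]

variants = amplify(FACT, TEMPLATES)
```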
Thomas Wood @advanced-eschatonics.com

I'm working on getting product folks with more humanities backgrounds much deeper into the technical side of things where I work. All the bluster of a lot of these people about being Luddites dies when the Zoom meeting starts and it's time to prioritize deadlines.

sep 5, 2025, 12:11 am • 1 0 • view