Alexander Doria @dorialexander.bsky.social

Meanwhile, I see deep questions on the AI side that should definitely warrant some humanities input. And I don't mean the usual ethics corner, but: how to design reasoning structure/flow, personality tuning (what is the "I" in the model?), rethinking tokenization. Won't happen for some time.

sep 4, 2025, 11:28 pm • 60 5

Replies

Nafnlaus 🇮🇸 🇺🇦 🇬🇪 @nafnlaus.bsky.social

If we had had more humanities input, we probably could have foreseen how bad naive RLHF would be at promoting sycophancy, and in turn what harm endlessly sycophantic models could do to certain vulnerable human populations (e.g. promotion of delusions, etc.)

sep 4, 2025, 11:31 pm • 5 0 • view
Alexander Doria @dorialexander.bsky.social

Even just bad for training. I'm still slightly horrified by the post-training datasets still in use in the open ecosystem. Just a tiny bit of care from people with actual expression skills would make a difference…

sep 4, 2025, 11:33 pm • 7 0 • view
Alexander Doria @dorialexander.bsky.social

Also: we are switching pretraining paradigms right now, from fuzzy web-crawl corpora with honestly questionable data practices all around to some form of generalized "distant writing". I don't think anyone in DH circles is aware of this or talking about it, and that is certainly a miss.

sep 4, 2025, 11:34 pm • 22 2 • view
Peter T. Evans @petertevans.bsky.social

There is a very small "group" of us doing computational theology. In broad terms, we're asking 'distant writing' related questions: what does it mean when a model trained on theological texts then *writes* theological text? How do we categorize it? Is it "new" theology? Etc.

sep 4, 2025, 11:38 pm • 4 0 • view
Alexander Doria @dorialexander.bsky.social

Could actually be an interesting candidate for amplified mid-training: not a large search space, and enough to get performance with a small model.

sep 4, 2025, 11:49 pm • 0 0 • view
Jonathan Cheng @jonathancheng.bsky.social

In hindsight, what I should’ve done is written a few articles or a book, *and then* joined an outfit where I shouldn’t say much. There’s *a lot* to say here.

sep 5, 2025, 6:36 am • 1 0 • view
Jonathan Cheng @jonathancheng.bsky.social

Yeah, I know exactly what you mean. Given that *nervous cough* I suspect we're thinking about very similar problems. There's a line to draw from before NaNoGenMo, through back translation, synthetic data making, etc. etc. It's both interesting and, frankly, very fun.

sep 5, 2025, 6:21 am • 1 0 • view
Alexander Doria @dorialexander.bsky.social

Ah ah. I think multiple research teams are circling the same thing. The last release from Meta on active learning hit a bit too close to home.

sep 5, 2025, 7:51 am • 1 0 • view
Jonathan Cheng @jonathancheng.bsky.social

Something something, one person’s language data augmentation is another person’s distant writing is another person’s creative writing project

sep 5, 2025, 6:22 am • 1 0 • view
Michael Castelle @mcastelle.bsky.social

We’ve also been training on a mix of code and natural language (cleaned/augmented or not) since GPT-3, which should have been extremely weird to everyone. But I agree that humanists are well placed to contrast the long history of “reasoning” with whatever it is that currently goes under that name.

sep 5, 2025, 3:35 am • 1 0 • view
Ted Underwood @tedunderwood.me

Say more: Do you mean synthetic data?

sep 4, 2025, 11:38 pm • 3 0 • view
Alexander Doria @dorialexander.bsky.social

Yes, but that alone won't say much. It's rephrasing/knowledge amplification. It's simulating problems with logic constraints at scale. It's writing things that have never been written before (like a complete search sequence). (Most of my daily work right now.)

sep 4, 2025, 11:48 pm • 10 0 • view
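[Editor's note: the "simulating problems with logic constraints at scale" idea above can be sketched as a toy generator — a template with randomized parameters whose answer is computed rather than written, so every (prompt, answer) pair is correct by construction. The template, field names, and scenario are illustrative assumptions, not the actual pipeline being discussed.]

```python
import random

def make_problem(rng):
    """Generate one synthetic word problem with a verifiable answer.

    Constraints: c < a * b, so the answer is always a positive integer.
    """
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    c = rng.randint(1, a * b - 1)
    prompt = (f"A shelf holds {a} boxes with {b} books each. "
              f"{c} books are removed. How many books remain?")
    return {"prompt": prompt, "answer": a * b - c}

def make_dataset(n, seed=0):
    """Build n examples deterministically from a seed, for reproducibility."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]

examples = make_dataset(3)
```

Because the answer is derived from the sampled parameters, the generator can emit millions of examples without ever producing a wrong label — the property that makes this kind of synthetic data usable for training.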
Sarah Bull @sarahebull.bsky.social

FWIW I'm really interested in the editing / rephrasing / amplification side, but my mostly-sole-author research and publication pipeline is comparatively glacial (probably too slow to be remotely helpful or interesting for most AI/ML people). I wonder if that's a factor here? We'll see more in +

sep 5, 2025, 2:26 am • 2 0 • view
Sarah Bull @sarahebull.bsky.social

6-12 months from DH people?

sep 5, 2025, 2:26 am • 1 0 • view
Sarah Bull @sarahebull.bsky.social

*people who are much faster than me!

sep 5, 2025, 2:27 am • 1 0 • view
Ted Underwood @tedunderwood.me

Pierre-Carl is at the cutting edge of this stuff. I don’t know how well academic DH will be able to keep up, but I think it’s fair to say that I know lots of people who do recognize these as exciting problems!

sep 5, 2025, 2:40 am • 1 0 • view
Sarah Bull @sarahebull.bsky.social

Yeah, I'd think that there's a lot of work on them coming down the pipes.

sep 5, 2025, 2:44 am • 1 0 • view
Paschalis Tsilias @tpaschalis.me

As a layman, can I ask what is "distant writing"? Is it about the distance between the thought (prompt) and output?

sep 5, 2025, 4:53 pm • 0 0 • view
Alexander Doria @dorialexander.bsky.social

From a DH perspective, it's roughly the reverse of distant reading: instead of "reading" lots of text using computational/text-mining methods, you "generate" lots of text using language models under formal constraints/inspirations (which are called "backtranslations").

sep 5, 2025, 5:10 pm • 0 0 • view
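[Editor's note: a minimal sketch of the reversal described above — distant reading mines structure out of text; "distant writing" starts from structure and renders text. A real pipeline would use a language model as the renderer; a plain template stands in here, and the records and field names are invented for illustration.]

```python
# Structured records act as the formal constraints (the "inspiration").
RECORDS = [
    {"title": "De Anima", "author": "Aristotle", "topic": "the soul"},
    {"title": "Leviathan", "author": "Hobbes", "topic": "the state"},
]

def backtranslate(record):
    """Render one structured record as prose -- the 'backtranslation'."""
    return (f"{record['title']}, written by {record['author']}, "
            f"is a treatise on {record['topic']}.")

# Each pair couples generated prose with the structure that produced it,
# which is exactly the kind of supervision distant reading tries to recover.
pairs = [(backtranslate(r), r) for r in RECORDS]
```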
Alexander Doria @dorialexander.bsky.social

The main purpose is to train better small models for dedicated tasks. Language models have very high memorization inertia: you need lots of text for a model to accurately assimilate new facts.

sep 5, 2025, 5:11 pm • 0 0 • view
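[Editor's note: a toy illustration of the amplification that memorization inertia calls for — one fact expanded into many surface rephrasings before training. Templates stand in for an LLM rephraser, and the fact itself is invented for the example.]

```python
# One fact, stated once, is rarely enough for a model to retain it;
# amplification multiplies its surface forms.
FACT = {"entity": "the Loire", "relation": "flows into", "object": "the Atlantic"}

TEMPLATES = [
    "{entity} {relation} {object}.",
    "{object} is what {entity} {relation}.",
    "Ask where {entity} ends: it {relation} {object}.",
]

def amplify(fact, templates):
    """Return one training sentence per template, all expressing the same fact."""
    return [t.format(**fact) for t in templates]

variants = amplify(FACT, TEMPLATES)
```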
Thomas Wood @advanced-eschatonics.com

I'm working on getting product folks with more humanities backgrounds much deeper into the technical side of things where I work. All the bluster of a lot of these people about being Luddites dies when the Zoom meeting starts and it's time to prioritize deadlines.

sep 5, 2025, 12:11 am • 1 0 • view