So recently there’s been a big controversy over a website called Prosecraft. This is from a TechCrunch article on the subject:
“On Monday morning, numerous writers woke up to learn that their books had been uploaded and scanned into a massive dataset without their consent. A project of cloud word processor Shaxpir, Prosecraft compiled over 27,000 books, comparing, ranking and analyzing them based on the “vividness” of their language. Many authors — including Young Adult powerhouse Maureen Johnson and “Little Fires Everywhere” author Celeste Ng — spoke out against Prosecraft for training a model on their books without consent. Even books published less than a month ago had already been uploaded.”
Basically, Benji Smith had a webpage where he did stylometrics- the statistical study of stylistic patterns in books. Stylometrics doesn’t involve AI (unless you use an absurdly broad definition on which a calculator is AI). It works not all that differently from Microsoft Office’s grammar checker: a list of simple if-then rules, nothing more, is applied across the text.
Somehow people decided he was “training an AI” on the books he’d run through his software, and this led to a backlash. This was wrong:
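To make the "list of simple if-then rules" point concrete, here is a minimal sketch of what rule-based style statistics can look like. This is not Prosecraft's actual code or method; the function names, the word lists, and the three rules are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical word lists, picked only to illustrate the idea.
VIVID_WORDS = {"crimson", "shattered", "thunder", "glittering", "scorched"}
BE_VERBS = {"is", "are", "was", "were", "be", "been", "being"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_stats(text):
    """Apply a few crude if-then rules and return simple counts and rates."""
    words = tokenize(text)
    total = len(words) or 1  # avoid dividing by zero on empty input

    # Rule 1: if a word ends in "ly", count it as an adverb (crude).
    adverbs = sum(1 for w in words if w.endswith("ly") and len(w) > 3)

    # Rule 2: if a be-verb is followed by a word ending in "ed",
    # count the pair as a passive construction (also crude).
    passive = sum(
        1 for a, b in zip(words, words[1:])
        if a in BE_VERBS and b.endswith("ed")
    )

    # Rule 3: if a word is on the "vivid" list, count it as vivid.
    vivid = sum(1 for w in words if w in VIVID_WORDS)

    return {
        "word_count": len(words),
        "adverb_rate": adverbs / total,
        "passive_count": passive,
        "vivid_rate": vivid / total,
    }


if __name__ == "__main__":
    print(style_stats("The window was shattered. She ran quickly."))
```

Nothing here is trained on anything: the rules are fixed in advance and the output is a handful of counts, which is the sense in which this kind of analysis is closer to a grammar checker than to a generative model.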
“Smith’s Prosecraft was not a generative AI tool, but authors worried it could become one, since he had amassed a dataset of a quarter billion words from published books, which he found by crawling the internet.”
Yeah that’s nonsense. They’ve already got your books. Models have already been trained on them. They’d already trained a novel writing AI on novels back in 2016. Speculation is GPT-4 was trained on 13 trillion tokens, roughly equivalent to 250 million novels. You think they got to that figure without including a lot of books?
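As a rough check on that comparison, the arithmetic can be written out. The token and novel counts are the speculative figures quoted above; the words-per-token ratio is an assumption (roughly 0.75, in line with the common rule of thumb for English text), and the variable names are just for illustration.

```python
# Speculative figures quoted in the text above.
TRAINING_TOKENS = 13_000_000_000_000   # 13 trillion tokens
NOVEL_COUNT = 250_000_000              # 250 million novels

# Assumption: about 0.75 English words per token (a rule of thumb, not a measurement).
WORDS_PER_TOKEN = 0.75

# Implied average length of a "novel" under those figures.
tokens_per_novel = TRAINING_TOKENS // NOVEL_COUNT
words_per_novel = tokens_per_novel * WORDS_PER_TOKEN

if __name__ == "__main__":
    print(f"tokens per novel: {tokens_per_novel:,}")
    print(f"words per novel:  {words_per_novel:,.0f}")
```

Under these assumptions the figure implies about 52,000 tokens, or roughly 39,000 words, per book. That is shorter than many published novels, so the count would be lower for longer books, but the order of magnitude is the point.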
Anyway people yelled at Benji Smith, and he gave a rather tragic apology in which he conceded far more than he needed to and then he took the website down.
Honestly, I think a lot of the outrage was about the idea of stylometrics itself. This seems to me like an intensely insecure thing to get upset about:
“If you’re a writer as a career it’s maddening, in part because style is not the same as writing a fucking whitepaper for a business that needs to be in active voice or whatever,” author Ilana Masad said. “Style is style!”
Yeah sure, stylometrics is far from perfect. You may not think it’s the best way to approach text. Nevertheless, it’s a popular academic technique, and the fact that you’re disturbed by quantitative analysis doesn’t make it morally wrong. Subjecting books to quantitative study is not a violation of copyright, even if you put them on a hard drive to do so.
This guy was doing the world a service by doing stylometrics for a wide variety of content. This is like that bit in the book where the townsfolk, riled up against artificial intelligence, break a modern tractor in their panic. To quote Marx:
“Modern bourgeois society with its relations of production, of exchange, and of property, a society that has conjured up such gigantic means of production and of exchange, is like the sorcerer, who is no longer able to control the powers of the nether world whom he has called up by his spells.”
Except it’s not just the bourgeoisie who are scared of what we’re conjuring up now. We’re all scared. I’m scared. Just don’t break that tractor, we need it for the coming winter.
A few facts:
At the moment, readily available models can produce a 6,000-word short story better than 99% of the population can. I don’t mean that in some objective sense of aesthetic merit- I mean that that’s what the typical human rater will say if asked to pick.
Unless there are unexpected problems in writing long passages of text (very unlikely), this ability will soon be extended to something novel length. GPT-4 32K (about 24k words) exists. While I haven’t tried it myself, I reckon it could probably have a decent crack at writing a novella. [Again, not in some objective sense, just in terms of the preferences of human raters].
Also, the quality will jump from writing better than 99% of people to writing better than something like 98% of authors. This will be more than enough for it to lock the majority of contemporary authors out of the writing business and pump out genre fiction. With the exception of a few elite celebrity artists, authors, etc., humans will be done in the creative industries. That is, unless something is done.
The last three things I predicted could well happen tomorrow, but I’ll be shocked if they haven’t happened in the next three years. Then our beloved techno-capital overseers will reap the benefits of their immensely productive capital, having finally displaced those fatcat puppet-masters of the economy, artists and writers.
The legal case that analyzing a corpus of books using stylometrics is a breach of copyright is non-existent. More importantly, the case that Large Language Models violate copyright during their training phase is very weak. Even if the courts do find a violation (dubious- they won’t want to slow down AI and risk our advantage over China), ways will be found around it. I don’t think there is a legal solution here- at least not under current laws.
If human writers want to resist being made obsolete- and they should!- they need to organize politically. They need to connect with other industries likely to be affected, which is, to be clear, all of them. Yelling at some rando on the internet with a statistics hobby won’t do anything. It also seems very mean.
Instead of this crab-in-a-bucket behavior we need to:
A) Go into the real world and make connections with people. We can work politically online as well, but online activity needs to be supplemented with real world connections.
B) Use our connections, talk, and above all listen in order to persuade others of our political views.
C) Build both organization and consciousness.
We need to think coherently about our demands as well. In my view, it’s unlikely we’ll stop this wave of generative AI: the genie is out of the bottle, and an argument that AI art is inherently immoral is going to be very hard to win. We should demand that the government actively step in to keep humans doing art through subsidies, because it’s important that humans remain creative. This needs to be combined with a general demand that no one is left without a job due to AI- even temporarily! The result of displacing the need for certain types of labor must not be more drudgery and a reduction in human creative output, but it will be unless we band together.
Do you think audiences will prefer novels written by AI models if they *know* they're written by AI models, though?
Speaking just for myself, I find the idea of reading a novel without an actual *consciousness* behind it unappealing (given current, almost-certainly-unconscious AI). Isn't connecting with another consciousness, with its own unique way of seeing the world, one of the key pleasures of literature?
The real questions seem to be:
1) Does this apply to non-artistically-ambitious genre fiction that just aims to tell a ripping yarn? For me, at least--as someone who has enjoyed a lot of Stephen King and other genre fare--the desire for an actual conscious perspective is still there.
2) How much input from AI can there be before the sense of an author's conscious perspective is lost? Does merely coming up with a plot outline, having the AI write the first draft, and revising afterwards keep the human spark?
3) Will there be widespread concealed use of AIs, if audiences are unwilling to read novels they know to be AI-written? The very fact that fiction is infamously oversupplied--that people *enjoy* writing it--suggests that most authors won't be motivated to reduce their workload by outsourcing it to AIs. OTOH, the flood of (still-mercifully-terrible) AI-generated stories and children's books does seem to show that a minority of non-intrinsically-motivated writers will try to make money off writing by AI-generating it in bulk. And the real threat--certainly for TV and movies, but also potentially for fiction--is corporate middlemen cutting out the human author.
This has been a problem for a long time... and if I'm being honest, a problem that I don't think authors can win. Once the internet was built, information was going to flow freely no matter what happened.