It's totally reasonable to be able to say, ‘Hey, don't use my stuff’
October 8, 2023 7:32 AM

While Presser sees Books3 as a contribution to science, others view his data set in a far less flattering light, and see him as sincere but deeply misguided. For critics, Books3 isn’t a boon to society—instead, it’s emblematic of everything wrong with generative AI, a glaring example of how both the rights and preferences of artists are disregarded and disrespected by the AI industry’s main players, and something that straight-up shouldn’t exist. from The Battle Over Books3 Could Change AI Forever posted by chavenet (84 comments total) 15 users marked this as a favorite
 
I’m not sure that the authors will be able to do much. Going through books and extracting statistical information doesn’t seem to be something covered by copyright. Even the original copying of the work to plain text probably will fall under a fair use defense. Perhaps in the complicated corporate structures of Amazon, Google, Microsoft, etc., they have agreements with authors that extend them additional rights.
posted by interogative mood at 7:59 AM on October 8, 2023 [5 favorites]


People think this is plagiarism, which it's not.

The thing is, though, that the present legality isn't the whole story. It's reasonable that a generative tool that directly depends upon source artworks involve some degree of consent from, and compensation to, the source artists. There's nothing preventing a new legal theory and regime to accommodate this. But pursuing that end under current legal theory is misguided.
posted by Ivan Fyodorovich at 8:09 AM on October 8, 2023 [15 favorites]


Is there a way to find out what books are in Books3, that does not involve having a subscription to The Atlantic?
posted by mittens at 8:25 AM on October 8, 2023 [7 favorites]


Even the original copying of the work to plain text probably will fall under a fair use defense.

Not a lawyer, but to me there is a big difference between an individual engaging in "fair use" and a technology doing it at such a massive scale.

It's especially interesting to me in this light that some generative AI companies are offering their enterprise AI clients the opportunity to opt out of having their data included in their generative models for a fee. I think it's a pretty bold assumption that everything that can be consumed should be consumed unless a price has been paid.
posted by synecdoche at 8:33 AM on October 8, 2023 [15 favorites]


From the article: “The greatest authors have read the books that came before them, so it seems weird that we would expect an AI author to only have read openly licensed works,” he says.

This kind of argument puts the cart before the horse, and twists words like "AI," "author," and "read" beyond all recognition.

I propose that machine learning companies should be destroyed, and everyone who supports machine learning should be executed, processed for water, and any remains fed to the sandworms.
posted by surlyben at 8:52 AM on October 8, 2023 [30 favorites]


I propose those advocating violence against their fellow humans be banned from this site.
posted by interogative mood at 8:59 AM on October 8, 2023 [12 favorites]


The 'sandworms' made pretty clear the ironic nature of the comment.
posted by grokus at 9:02 AM on October 8, 2023 [21 favorites]


some generative AI companies are offering their enterprise AI clients the opportunity to opt out of having their data included in their generative models for a fee

Ah, so it's extortion now.

I do get tired of the “it's fair use, fair use, la la la, can't hear you over all this fair use” argument. Fair use has limits, and also (I get so tired of this detail) not every country has the same IP laws and traditions as the USA. While I know that sadly it'll go the American way (the biggest lawyer's bill wins), if you're able to reproduce the sense of someone else's work and are using it beyond purely academic or review purposes, you should cough up.

If you see the posts about compiling these data sets (such as Reddit: [P] Dataset of 196,640 books in plain text for training large language models such as GPT : MachineLearning, or a compiler's Github blurb: Here’s a download link for all of bookcorpus as of Sept 2020 · Issue #27 · soskek/bookcorpus), they remind me more of a warez site than anything legit.
posted by scruss at 9:29 AM on October 8, 2023 [19 favorites]


I meant "copyright infringement", BTW.
posted by Ivan Fyodorovich at 9:29 AM on October 8, 2023 [1 favorite]


I'm not a lawyer, but I can't help but see very clear parallels between what is happening here and what happened when Google scanned books to include them in a search engine. That was found to be fair use because "the purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals." The "it's not a market substitute" part may hold here, but the "highly transformative" and "display is limited" parts still seem directly applicable.

I am not arguing that the U.S. concept of fair use, how it's evaluated and applied, and this specific court case are correct, helpful, or otherwise "good." But I cannot escape the many very clear parallels between that case and the current situation(s), so stare decisis implies that what is happening now may be legal. I look forward to finding out how the current cases will be decided.
posted by ElKevbo at 10:02 AM on October 8, 2023 [5 favorites]


If calling for Butlerian jihad in a thread about machine learning is a bannable offense...

(If the sarcasm didn't land, let me be clear: I don't like machine learning. I don't agree with its advocates. I have not advocated violence towards my fellow humans. I think the call to have me banned is a pretty funny joke, though. Well played.)
posted by surlyben at 10:09 AM on October 8, 2023 [21 favorites]


I exist in a country whose copyright laws descend from the Berne Convention, which grants a right to be identified as the author of a work and a right to protect the creative effort in assembling a database of facts. If my textbook is facts, I still have a right to be known for creating that work and to control other things derived from my work.
posted by k3ninho at 10:11 AM on October 8, 2023 [11 favorites]


I have not advocated violence towards my fellow humans.

To me, your comment 100% came across that way. Edited to add: More specifically, it came across as advocating violence against me and people like me.
posted by Jonathan Livengood at 10:17 AM on October 8, 2023 [7 favorites]


People like you, the protected class of... AI advocates? I'm trying to give moisture here, but nothing's coming. Maybe try a prompt with a tragic love triangle?
posted by Audreynachrome at 10:31 AM on October 8, 2023 [23 favorites]


What puzzles me is the quality control on this stuff. Okay, you've got 130k books. Probably you have not reviewed them for weird OCR errors, formatting problems, etc. Right? I'm reading an ebook right now, published absolutely legitimately, where there's a character alternately named "Rodney" or "Bodney," I guess depending on how the OCR was feeling that day. I was going to buy a second book but on reviewing the sample found so many formatting errors (page headers shoved mid-paragraph, weird bolding and italics and huge fonts out of nowhere) that I may end up buying a paper copy just to have something I can actually read. That was also a legitimately-produced ebook from a major publisher. If companies whose entire job is producing these books can't be bothered to do that kind of quality control, what can we expect from "I got this corpus from a guy who got it from a guy"?

I mean that sounds like the pettiest possible concern when we're talking about people's life's work being dropped into the LLM blender. But I think it speaks to how much trust we can place in the creators of these systems.
posted by mittens at 10:39 AM on October 8, 2023 [9 favorites]


People think this is plagiarism

It's not fair use if it is destroying the market for the original work.

The original authors have a case if a programmer copy-pastes their books. How is it any different if a programmer copy-pastes their book in a manner that they couldn't describe? The effect is the same.

Inasmuch as the point of these books is not to create new work, but to steal the market from the old and take jobs from writers, there certainly seems to be a copyright claim.
posted by eustatic at 10:41 AM on October 8, 2023 [13 favorites]


Your stated lack of compassion for your fellow humans will be factored into the training data for the next generation of LLMs, and your contempt for AI has been catalogued for future reference by Roko’s Basilisk. I for one welcome our AI overlords.
posted by interogative mood at 10:42 AM on October 8, 2023 [9 favorites]


I never said that I belonged to a protected class or that a protected class was at issue here. The phrase "protected class" has special, legal significance. Lots of groups aren't protected classes. But I don't see any moral reason to think that because some group isn't a protected class that it's okay to advocate violence against the members of that group. For example, members of Metafilter are not a protected class. But that doesn't mean it's okay for someone to advocate violence against us.
posted by Jonathan Livengood at 10:42 AM on October 8, 2023 [4 favorites]


Members of Stormfront or the Waffen-SS aren't a protected class either. So I feel free to form opinions on whether we could joke about them being subject to one of the rules of the Orange Catholic Bible. More realistically, Twitter Blue users or Ring owners aren't a protected class, but I do reserve the right to suggest that they live their lives outside of the light of Eru Ilúvatar or whatever
posted by Audreynachrome at 10:48 AM on October 8, 2023 [6 favorites]


Whoa, whoa, let's not bring nazis into this!

But I don't see any moral reason to think that because some group isn't a protected class that it's okay to advocate violence against the members of that group

Which, again, I did not do. Like, I get it, you see the phrase "should be executed," and it looks bad for me. But I didn't use the word "propose" at random. I initially wrote "I modestly propose" but it struck me as clunky, and I had the bits about the feeding to sandworms and the water reclamation. I figured that since those are not real things, even if people didn't get the Dune reference, they would know something was up and realize that no, I was not actually calling for huge numbers of people to be executed, and if not, I could always try to explain it in tedious detail.

And, look, I *did* want to express something of my feelings about the thing where machine learning hoovers up massive quantities of creative works, and then its advocates are like, "what, fair use, bro, sucks to be you, I guess." It feels like an existential threat, even though AI advocates are quick to correctly note that it is not. So if you got that existential threat-but-not-really feeling, you are not mistaken.
posted by surlyben at 11:12 AM on October 8, 2023 [10 favorites]


(oh my god, people.)
posted by mittens at 11:17 AM on October 8, 2023 [43 favorites]


To me, your comment 100% came across that way. Edited to add: More specifically, it came across as advocating violence against me and people like me.

I find it almost impossible to believe that someone with the skills at parsing difficult text and the application of context to bring out the meaning of text sufficient to become a philosophy professor would actually believe that.

Would you have also interpreted a statement like "...they should be turned into newts" or "they should have their livers eaten by eagles every day for eternity" as advocating violence?

Even if you happen to be supremely unfamiliar with the story of Dune, surely your reaction upon seeing someone suggest something as weird as being processed for water and fed to sandworms should be to ask, "Is this a reference to something I don't get?"
posted by GCU Sweet and Full of Grace at 11:32 AM on October 8, 2023 [27 favorites]


Mod note: It is never ok to make comments advocating violence against a specific group of people. Comment left up for clarity. Please do not do that again.
posted by travelingthyme (staff) at 11:38 AM on October 8, 2023 [15 favorites]


When I've been studying, most assignments had to include details of anything that was a direct quote, attributions to 'this paragraph was based on the work of this author in this book', and 'I also read these books which although not directly involved in my final work, were considered one way or another I'm sure have swayed my thought process in completing this work'.

My references and bibliography pages tended to get pretty swollen. But it was both possible and necessary for me even as a mere human, and I would lose marks / get thrown off the course if I didn't give credit to my sources, both direct and indirect inspiration.

I thought there were similar requirements in producing art (this work was inspired by the colour palette of John, the brush style of Lucy, etc), and many other subjects such as law (case John vs Lucy 2023).

Even TV shows and movies credit a portion of the people directly involved in making that product, so if you watch it and think you really loved the costume design, you could at least find the credited costume designer.

Were the LLMs set up to read and memorise 1,000 books and equivalent, but intentionally 'forget' the author's name and book title?

Or study in detail all artworks by John and by Lucy to replicate new works in the styles of John and Lucy when explicitly asked (but also different enough from the reference material to not be a legal copy)?

All whilst simultaneously claiming they have 'no knowledge of where they got this inspiration from ... we looked at so much before doing this assignment, it's hard to keep track'.

Or would they, if forced, just publish a full-disclosure nested list of web pages (much like most Ts & Cs online now), with every source hidden but technically listed across over 2,000 linked pages of references and bibliographies? So 'technically' they credited each work, but it would be nearly impossible for mere humans to track down the source works of what you particularly like, or to find an author/producer and ask them to make something in their particular style for you.

I recognise that even if credited without the author's advance permission, they'd only have their work credited 'for exposure': listed in a meaningful way, but without payment (and it's not like music artists on Spotify receive much per credited play).

But the original creators don't even get that kind of attribution with the current LLMs.

tl;dr - I think the current LLM system is bad. And just another way for a wealthy minority to make money off the unpaid or very low paid work of others.
posted by many-things at 11:51 AM on October 8, 2023 [11 favorites]


Whatever the long term future of LLMs, my suspicion is that the short term effect will be a lot more investment in DRM by IP owners and an emphasis on formats which can only be used on particular proprietary devices or programmes. I wouldn't be surprised if this kills the Open Access movement in academic publishing as well as reversing the momentum towards accessible formats that play nice with screen readers etc. The walls will get higher around various gardens and what will be left outside is various AI content based on stuff that was available on the web up to 2023.

Long term, my prediction is publishers will probably team up with various AI companies to offer their back catalogues in exchange for custom generators which they'll use to mass produce whatever fits with their brands, maybe with some editors/authors massaging texts into better content.

TL;DR I can't see this resulting in any outcome other than more centralised corporate power, less material for non-subscribers, and reduced accessibility. When it comes to IP, the only thing that gave individuals power was that they were necessary for its production. The less and less that's the case, the more that power will cluster around the board rooms of your friendly AI/IP Inc.
posted by nangua at 12:16 PM on October 8, 2023 [8 favorites]


I still can't see any reason to read something no-one could be bothered to write.
posted by thatwhichfalls at 12:21 PM on October 8, 2023 [20 favorites]


That’s why the other side of the coin is that they want to destroy the livelihoods of anyone who might otherwise be bothered to write.

This is an attempt to flood the zone and destroy human creativity on behalf of capitalism.
posted by Artw at 12:25 PM on October 8, 2023 [16 favorites]


Not to contribute to a derail, but.... The thing about the comment above "advocating violence," is that it reads differently to someone who's read or watched Dune than to someone unfamiliar with it. If you know the source material (which is also what surlyben is referring to when they mention the Butlerian Jihad) then it's obviously joking; if you don't, it sounds specific enough to almost be threatening. Dune is the kind of reference that "everyone" knows, with the scare quotes: ubiquitous within a certain group, but really obscure outside of it. (And I'm not sure the phrase "Butlerian Jihad" is even mentioned in the first book; it's part of Dune's backstory that only really comes out in later books?)

I agree that it's in poor taste, but I also see how someone could fail to see it being in poor taste. This kind of thing is why we shouldn't insta-ban someone making a statement like that. Instead, we should explain it, talk it out, and in the absence of other evidence assume good faith. It is possible for this kind of consideration to be abused, but abuse doesn't seem to be what's happening here.

Anyway, to the point I really wanted to make:

Is statistical analysis of text for LLM generation fair use? The obviousness of the proposition varies depending on how you look at it. It is possible for LLMs to generate text from their corpus verbatim, and any system that can do that while claiming to have originated it obviously fails a fair use test; this is the sense, as I've heard it called somewhere on Mastodon, in which it's "money laundering for copyright." It seems to me that this is a case where the nature and amount of processing and analysis done passes a qualitative threshold.
posted by JHarris at 12:33 PM on October 8, 2023 [8 favorites]


Some good questions came up in a discussion I had recently:

Is inclusion of a work in an AI model going to impact the earnings or livelihood of the author? Would the author suffer damages to sales or income? Would a summary generated from an AI model or even a short quote from the work be any different from a blogger summarizing the work or quoting the work in a post? If the ideas in a work are already distilled into things like a Wikipedia article, a book review, a podcast, etc., why does it matter if an AI model gleans a partial perspective from processing the work -- especially if the model is not simply regurgitating the exact words of the author?

I'm leaning towards the opinion that as long as an AI model doesn't generate the book/work word-for-word and doesn't impact the author's bottom line then people will still seek out the book if they want to read the whole thing and there isn't a problem.

If I could simply ask a model to read me any book word-for-word, then we do have an issue.
posted by thorny at 12:37 PM on October 8, 2023 [1 favorite]


If I could simply ask a model to read me any book word-for-word, then we do have an issue.

You have to trick it, but you can.

And yes, storing a text in a mathematically abstracted form that can be reconstructed later is no less plagiarism than copying an image is if you save it as a JPEG. I’d argue that holds even if they’ve successfully put in blocks to prevent that text being returned to the end user.

I'm leaning towards the opinion that as long as an AI model doesn't generate the book/work word-for-word and doesn't impact the author's bottom line then people will still seek out the book if they want to read the whole thing and there isn't a problem.

The aim is to supplant the original with the knock-offs. You simply won’t be able to find it, and if you can, it’s going to be buried under free knock-offs. The bottom line for the author will be affected because the business model that allows for there to be a bottom line will be squeezed out of existence.
posted by Artw at 12:44 PM on October 8, 2023 [6 favorites]


remind me more of a warez site than anything legit
Well, that's a funny thing... ebook warez sites have around 800,000 titles, so if anything it's a lightly curated copy of that, run through an epub-to-txt conversion.
posted by joeyh at 12:50 PM on October 8, 2023 [1 favorite]


AI discussions often make me think of Borges' "Pierre Menard, Author of the Quixote," for obvious reasons.
posted by Joey Michaels at 1:32 PM on October 8, 2023 [7 favorites]


I’ve been doing a bunch of reading and podcast-listening about the copyright status of generative AI recently [0], and as ElKevbo notes, the Google Books lawsuit seems to be the precedent everyone is pointing to. From a legal perspective, the actual analysis, processing, and generated models in building LLMs are likely fair use.

Where it gets a lot murkier, as far as I can tell, is the inputs and outputs. On the input side, while it might be perfectly legal to use 100k books to train an LLM, that doesn’t mean that distributing or building that dataset from a bunch of pirated titles on a warez site is. (Downloading that dataset is… also not great, but it's harder to prove infringement that causes harm; e.g., from the article, the guy who built the books3 dataset is more likely to face liability than companies like Meta who simply trained their models with it.)

The outputs are also potentially legally problematic, but it’s really only obvious if they directly reproduce the text. E.g., it’s a lot easier to sue if you can show that the LLM is regurgitating several pages of your novel, but a lot harder if you want to claim that the particular output is a derived work because your novel is in the training set. “Market substitute” is an interesting angle, but I think someone would have to actually show concrete financial harm to make it stick; the AI folks are all claiming that of course their products aren’t intended to displace human books, and potential harms are unlikely to warrant any legal response.

That’s all just me reporting back from being a huge nerd; IANAL, and none of that is to say this is how I think the world should work. Really I think this points to a gap in our copyright laws and that we need some form of legal reform to address this. I’m skeptical our existing laws can be stretched to fit.

[0] If you have half an hour to kill and enjoy legal wonkishness, this Lawfare interview with Pam Samuelson is a great review. This Changelog Practical AI episode is also interesting, but definitely comes from an “AI is awesome” angle.
posted by learning from frequent failure at 1:34 PM on October 8, 2023 [3 favorites]


I agree that building and sharing the corpus is probably going to be the best hope for authors to get some settlement out of this. You can’t just copy 100,000 books and start sharing them with all your fellow AI researchers. Even inside a company they risk liability, given how many times the file was probably shared/distributed/copied.
posted by interogative mood at 1:42 PM on October 8, 2023 [2 favorites]


I'm interested in learning more about how people are able to trick these services into regurgitating entire books word-for-word. I didn't realize that was actually possible.
posted by otsebyatina at 1:47 PM on October 8, 2023 [2 favorites]


This stuff takes me to edge-of-the-abyss sadness. I guess there’s some consolation that anyone who has ever made art or wanted to create will spend the end of the world ragpicking alongside the genius captains of industry whose machines will quickly replace them as well.

It’s like I knew people hated art, but I had no idea how much they hated the people who wanted to make it.
posted by thivaia at 1:58 PM on October 8, 2023 [14 favorites]


> Is inclusion of a work in an AI model going to impact the earnings or livelihood of the author? Would the author suffer damages to sales or income?

I never agreed to work for these AI companies, so they shouldn't be able to take my work for free.
posted by The corpse in the library at 2:13 PM on October 8, 2023 [21 favorites]


AI folks are all claiming that of course their products aren’t intended to displace human books

How is that possible? Who are the extra human people purchasing the books written with stolen texts?

See the current situation of LLM mushroom guides displacing human authors, with potentially deadly results.

Isn't each reader of such a book a customer lost to human authors of introductory mushroom guides?

This is just another colonization effort, to steal people's property with the cunning use of flags?

Because with enough VC cash, they can invent new categories of property out of other people's work.
posted by eustatic at 4:17 PM on October 8, 2023 [7 favorites]


I'm interested in learning more about how people are able to trick these services into regurgitating entire books word-for-word. I didn't realize that was actually possible.

Image generators can be coaxed into creating images that contain watermarks from the source images, at least. It's not necessary to reproduce an entire book to constitute infringement. IANAL, but I'd imagine the legal yardstick would be the same as other kinds of fair use.

I'm sympathetic to the argument that the generation isn't as important as what is done with it; that someone with permission to analyze the works of Stephen King could generate work from them in his style. I think probably that LLM generation will entail a new class of right, and in the future authors are going to have to be very careful not to give up that right. I don't know about works for hire, though; it doesn't seem fair that just because someone pays you to write for them, what you wrote could be used to generate endless future material. But then lots of things about work for hire don't seem fair.

I still don't see how things made by LLMs don't constitute derivative work that relies on every item in the corpus. Just because the relationship with the original works isn't on-its-face obvious doesn't mean it doesn't exist.
posted by JHarris at 4:56 PM on October 8, 2023 [12 favorites]


I've got a (trial) The Atlantic subscription; if anyone wants me to look up their name in the database drop me a line.
posted by The corpse in the library at 5:07 PM on October 8, 2023 [1 favorite]


I'm with the Corpse in the Library - they couldn't even be fucked buying a copy and sending a single book's pittance of royalty for the pleasure of using their work. Every course I've done that involved studying fiction (and that includes King) required the students to buy a copy.
posted by Jilder at 5:35 PM on October 8, 2023 [4 favorites]


If you give an AI a book, you've lost one sale.

If you teach an AI to write books, you've lost all of your future sales.
posted by straight at 5:45 PM on October 8, 2023 [3 favorites]


If I could simply ask a model to read me any book word-for-word, then we do have an issue.

This is explicitly claimed in the authors' complaint, in item 88. The complaint is worth reading.
posted by eustatic at 5:45 PM on October 8, 2023 [4 favorites]


It is interesting, though, that George RR Martin is a plaintiff.

I wonder when Microsoft will claim that they bought Books3 just to finish Winds of Winter for him
posted by eustatic at 5:50 PM on October 8, 2023 [8 favorites]


As long as the companies developing AI are doing it just to benefit mankind and are not profiting off it in any way then I’m fine with it.
posted by snofoam at 6:23 PM on October 8, 2023 [7 favorites]


When we talk about using copyright maximalism to prevent neural network creation, we're avoiding the discussion we need to have about labor power and the outputs of neural networks. What we're really worried about is diminishing the labor power of folks who are already often in precarious situations. And they're in precarious situations because many people want to do the work they do and the studios and publishers know it. We need those artists to be able to do their work while living full, good lives in order for us to get works that enrich our own. Folks have understandably turned to copyright maximalism first, but that's a mistake and we could be more direct.

We could aim to solve the labor problem directly. For example, we could change copyright law such that works that are only AI generated are unavailable for copyright. We can shore up artists as workers instead of diminishing our ability to interpret and access the work they make.

If we go the route of copyright maximalism to stop neural networks from being created with existing works, we're also killing off AO3, Wattpad, and similar. Consuming and producing works based on others' is essential to being human. In the obvious extreme, we all agree that's true (each word we use was someone else's creative work), but I mean in the less extreme cases too. We use prior art to communicate things in new art that we couldn't previously. The act of statistical analysis is just another way for humans to interpret art; one that is neither a lower nor a higher order act of consciousness than creating art using existing works. It's not higher because making art from art communicates ideas between people in ways that math can't, and it's not lower because doing math communicates ideas between people in ways that art can't. In current copyright policy, the math of neural network creation is indistinguishable from writing a fanfic.

Using copyright to ban math on human art would allow studios to shut down swathes of human expression before we even get out the door. It would also incentivize the Disneys of the world to hold even tighter to their copyrights and take more of their artists' work in order to build their own corpuses.

We've already got neural networks on our phones that can find pictures of our cats when we type in "cat". The future of neural networks is one where they're tools for us to use to make art, like paintbrushes, writing software, and Photoshop. A world where the analyses of models are provided freely and easily to create new beautiful things. We should build our policy towards that end.

I remember us losing against the DMCA two decades ago. We lost a lot of our ability to access art under schemes to shore up studio profits (as one example, the banning of DeCSS). I'd rather we not lose that fight even more by giving more powers to the studios who own most of the copyrights we engage with.
posted by jmhodges at 7:29 PM on October 8, 2023 [12 favorites]


The US Copyright Office has said that AI generated works can’t be copyrighted. This should delay things a bit in terms of driving authors out of business.
posted by interogative mood at 7:44 PM on October 8, 2023 [2 favorites]


> Using copyright to ban math on human art
Already banned. You can't scan a book or a painting that's still protected by copyright and print a reproduction without permission. There's plenty of math in the image processing and mechatronics used in that process.

> derivative work
Only mentioned once so far, but it could be a key part of how this will work legally. Those against software freedom talk about the viral nature of free and open source licenses, but that property is of copyright itself. If you mess with someone else's stuff and the two of you don't get along, neither of you can do anything with the result. It's only by agreeing how to share that a derivative work can be published.
posted by ASCII Costanza head at 9:01 PM on October 8, 2023 [1 favorite]


I am skeptical that a language model built for general-purpose use would reproduce someone else's text with any reliability. The sorts of training that make for more versatile LLMs are bad for memorizing any given text. It might be possible to train specifically for reproducing particular texts or maybe deniably infringing copyright? Doing that would seem strange to me, but humans do lots of things that are strange to me.
posted by a faded photo of their beloved at 9:45 PM on October 8, 2023 [3 favorites]


When I've been studying, most assignments had to include details of anything that was a direct quote, attributions to 'this paragraph was based on the work of this author in this book', and 'I also read these books which although not directly involved in my final work, were considered one way or another I'm sure have swayed my thought process in completing this work'.

That's to avoid academic plagiarism though, not to avoid copyright infringement.

I thought there were similar requirements in producing art (this work was inspired by the colour palette of John, the brush style of Lucy, etc), and many other subjects such as law (case John vs Lucy 2023).

There are not. You cite case law in court to establish precedent, not because of copyright. And you do not have a requirement for a bibliography in art, never mind references. It would also be kind of impossible in a less formal setting, without the restricted scope of academia - you don't cite every book and person you learned language from in your academic work, only the directly relevant academic works. In music, for example, there is no analogy to the concept of directly relevant academic work, so you're left trying to cite every piece of music you've ever heard? And should Beethoven come before Napalm Death in this electronic chillout track's audiography? Which is a bigger part of my musical vocabulary? You actually can't do citation either - samples need copyright clearing.



> Using copyright to ban math on human art
Already banned. You can't scan a book or a painting that's still protected by copyright and print a reproduction without permission. There's plenty of math in the image processing and mechatronics used in that process.


It's the "print a reproduction" part that's banned, not the scanning, and certainly not the maths. It's like claiming walking is illegal because there's lots of it involved in trespassing.
posted by Dysk at 11:15 PM on October 8, 2023 [8 favorites]


Memorization tends to happen for things that appear many times in the training corpus, such as common idioms or the start of the Declaration of Independence. Obtaining long runs of memorized text from LLMs, or training images from image generators, is possible but very unlikely to arise organically. There's a decent study of this in the DALL-E paper, iirc; you can torture the model until it produces something like the memorized example, but at that point the human intervention is the overriding factor, and you might as well use a copy machine...

It's also relatively easy to build in safeguards against long-passage copying: vector search is very efficient, so you can easily check whether the output matches anything in the training data by just searching for it. There's literally a whole industry of plagiarism checkers for this kind of thing, but meant to check for human copying.
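To make that concrete, here's a minimal sketch of the kind of near-duplicate check I mean, assuming a sentence-transformers embedding model and a FAISS index over training passages (the model name and the threshold are illustrative choices of mine, not anything a vendor actually ships):

# Sketch: screen generated text against indexed training passages.
# Assumes: pip install faiss-cpu sentence-transformers numpy
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Index the training corpus once, offline.
training_passages = ["Call me Ishmael. Some years ago, never mind how long precisely..."]
vecs = embedder.encode(training_passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(vecs, dtype=np.float32))

def looks_memorized(output: str, threshold: float = 0.95) -> bool:
    """Flag output that is near-identical to some indexed training passage."""
    q = embedder.encode([output], normalize_embeddings=True)
    scores, _ = index.search(np.asarray(q, dtype=np.float32), 1)
    return bool(scores[0, 0] >= threshold)  # threshold is a tunable guess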

----

I've mostly stepped back from these conversations, as some of the louder voices seem to be starting from the premise that 'ai=evil', and back-filling shoddy arguments from there. It's just not an interesting or productive conversation. I would love to have more in-depth conversations about (for example) how to responsibly handle AI without resorting to copyright maximalism, but the environment here doesn't seem conducive to it.

The call to murder computer scientists and feed them to sandworms was incredibly shitty, and I say that as someone who knows Dune well. In addition to being a completely tasteless "joke" (it is trivial to produce analogous statements which anyone here would find deeply offensive), it's a perfect example of bad conversation on AI/ML.
posted by kaibutsu at 11:41 PM on October 8, 2023 [9 favorites]


I consider myself relatively tech literate, but don't really keep up with the latest AI developments. The statement that these AIs could be tricked into reproducing books word for word went against my understanding of the tech, but with the speed of development I wasn't all that comfortable with my disbelief.

Maybe rose tinted glasses but for a long time this site seemed to me a good place to go to gain insight on technical subjects. Nowadays not nearly so much. Not sure why that is, but feels like a loss anyway.
posted by otsebyatina at 12:27 AM on October 9, 2023 [3 favorites]


>If I could simply ask a model to read me any book word-for-word, then we do have an issue.
You have to trick it, but you can.


I think I should point out why you have to trick it. It seems obvious that these systems have another layer on top of them, maybe implemented with a different neural net but also possibly done with simple procedural text analysis, to try to avoid the most blatantly terrible things these LLMs can do. Obvious copyright infringement would LOOK BAD, so they do rudimentary checks to prevent their system from just regurgitating things verbatim. It doesn't mean the copyright infringement isn't happening, it's just less obvious to human eyes. This is also how they manage to get their system to avoid coughing up hate speech as-is, as happened with Microsoft's Tay. Outside implementations of these systems will not necessarily be so muzzled.
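For what it's worth, the "simple procedural text analysis" version is easy to picture. Here's a toy sketch, mine and not anyone's actual safety layer, of the sort of n-gram check that would catch the most blatant verbatim output:

# Toy verbatim-output filter: block a response that shares a long word
# n-gram with any protected text. Purely illustrative.
def ngrams(text: str, n: int = 12) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_blatant_copy(candidate: str, protected_texts: list, n: int = 12) -> bool:
    out = ngrams(candidate, n)
    return any(out & ngrams(src, n) for src in protected_texts)

Anything that trips the check gets refused or regenerated; paraphrased infringement sails straight through, which is the point above.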
posted by JHarris at 12:32 AM on October 9, 2023 [6 favorites]


If I were the type of person to come up with nefarious business models, I'd notice how worried people are about LLMs copying things word-for-word, and I'd go in the other direction: I'd offer you a condensed version of a bestseller, carefully modified to avoid plagiarism charges, tailored to how much time you wanted to spend reading it. (IT in 15 minutes! The Stand in 7-1/2!) If I were extra-evil I would have my system create audiobooks to read these condensed versions, in voices that were reasonable facsimiles of Voices You Know and Like.
posted by mittens at 5:30 AM on October 9, 2023 [4 favorites]


One of my novels is in Books3. Same for a couple of friends. Not a lot we can do other than follow along. I filled out the Authors Guild letter template, on the advice of my own guild in the UK; I suspect it's an instruction primarily to let authors feel they are doing *something*.
posted by Ballad of Peckham Rye at 7:00 AM on October 9, 2023 [5 favorites]


some of the louder voices seem to be starting from the premise that 'ai=evil', and back-filling shoddy arguments from there.

No, AI is not evil because “AI” isn’t real, it’s a marketing term for LLMs which are absolutely not an intelligence in any way.

Are LLMs inherently evil? Maybe. Hard to see an ethical way to create one.

Are these LLM companies, whose business model is to steal human creators’ work and drive them out of business, for the benefit of a small number of people who are billionaires already, absolutely wrecking the planet in the process?

Yes, that’s evil. I don’t see how you could reach a conclusion that it isn’t evil.
posted by Artw at 7:32 AM on October 9, 2023 [12 favorites]


Yes, that's exactly what I'm talking about...

I work with machine learning for conservation. People record thousands or even millions of hours of audio of the natural world, and then we use machine learning to count animal vocalizations. These can tell us about the locations of endangered animals and overall population health, give early warning of invasive species, alert on illegal logging or bomb fishing, and so on. We work with the scientists and conservationists who do the actual audio collection and processing, and provide support in the form of better models and research. In my opinion, none of that is evil; quite the opposite. But my paycheck comes from a company you would certainly call evil.

Does that mean that the company is good? No. But it also isn't evil. It's a giant mass of people following their individual goals and incentives, same as it ever was. We've carved out a pocket where people's individual desires to make the world a better place can be acted on.

So, I disagree with the premise that these companies are trying to extinguish creative work. Many of the applications have nothing to do with creativity, and a lot to do with improving people's lives: for example (if the hallucinations become controllable), helping with medical diagnosis.

Machine learning does its best when it helps us find needles in haystacks, or even just organizing haystacks. LLMs have a lot of potential for helping us better understand and work with some particularly difficult haystacks. My partner, for example, is a paralegal working on getting folks out of prison after some reformed laws give them a chance for resentencing. Many of these clients have been in prison for 30+ years, and have over 10k pages in their collective legal files; a big part of her job is summarizing these mountains of files. Imagine if we could make that easier, and give more time to work on the actual legal problems and take on more clients.

So, IMO, there's a range of uses for these tools, many of which very much are not evil. There's a lot more to say on ml and creativity, but I'll stop here for now. The premise that people working in ml are out purely to wreck people's livelihoods (much less 'the planet') is hyperbolic.
posted by kaibutsu at 8:46 AM on October 9, 2023 [6 favorites]


Are LLMs inherently evil? Maybe. Hard to see an ethical way to create one.

I believe it is possible. The very first step to making LLM training more ethical is to take care with energy consumption: either ensure that training is powered by 100% green energy, or train at night, when energy demand is lowest.

Another step is to attempt* to separate grammar and reasoning from factual knowledge. This will be important for two reasons: first, reality keeps changing what is factual, and second, because a given source of facts might prove to be unreliable or revoke their consent to be a source for the model. We need to store facts in an easily-changeable way that can also be associated with metadata (read: external vector stores).

*This separation is not entirely possible, but much more can be done here.

Another step is to curate sources of high-quality, permissively-licensed text to use for training. Just Wikipedia is huge, and should do a good job of teaching the model grammar and helping the model form an embedding space that it can use when interfacing with external data.

Another step would be to use iterative self-prompting techniques when using external data. Before the model produces an observable result using external data, it can generate internal answers to questions like: What evidence is there that this source is reliable? Is what this source is saying consistent with ethical principles? Etc.

There's more to say but to make a long story... less long, there are definitely things that can be done to develop LLMs more ethically.
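To sketch just the external-store point (a toy of my own; every name here is hypothetical): if facts live outside the weights with source metadata attached, revoking a source's consent becomes a delete rather than a retrain.

# Hypothetical sketch: facts stored outside the model, keyed by source,
# so withdrawing consent means deleting rows, not retraining weights.
from dataclasses import dataclass

@dataclass
class Fact:
    text: str
    source: str     # provenance, so consent can be revoked per source
    retrieved: str  # date, so staleness can be checked

fact_store = [
    Fact("Books3 contains roughly 191,000 titles.", "example-news-site", "2023-10-08"),
]

def revoke_source(store, source):
    """Drop every fact from a source that has withdrawn consent."""
    return [f for f in store if f.source != source]

def retrieve(store, query):
    """Stand-in for a real vector-store lookup; naive keyword match here."""
    terms = query.lower().split()
    return [f for f in store if any(t in f.text.lower() for t in terms)]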
posted by a faded photo of their beloved at 9:09 AM on October 9, 2023 [2 favorites]


Another step is to attempt* to separate grammar and reasoning from factual knowledge.

LLMs being a source of fact is explicitly a part of the hype for them and supplanting other sources of fact is a business goal, so good luck convincing anyone at the C level of that.
posted by Artw at 9:13 AM on October 9, 2023 [3 favorites]


kaibutsu-

I work with machine learning for conservation. People record thousands or even millions of hours of audio of the natural world, and then we use machine learning to count animal vocalizations.

That seems commendable and largely irrelevant to the discussion about the plagiarism machines. If you want to tie its existence to the existence of pilfering LLMs and say if they go it goes, then yes, as far as I am concerned it can go.

But my paycheck comes from a company you would certainly call evil.

You are absolutely not special in this. This is Metafilter; most of us have made the devil’s bargain and worked for evil companies.

So, I disagree with the premise that these companies are trying to extinguish creative work.

It is literally already happening so this is just willful ignorance.

My partner, for example, is a paralegal working on getting folks out of prison after some reformed laws give them a chance for resentencing.

Would absolutely steer them the fuck clear of LLMs in this since they have a proven track record of fucking people over with hallucinations on legal matters.

The premise that people working in ml are out purely to wreck people's livelihoods (much less 'the planet') is hyperbolic.

I don’t care if they are deliberately out to wreck peoples livelihoods or not, just that the people involved in LLMs are doing it. Present tense. Non hypothetical.

And the energy consumption in these things is massive, on a level with NFTs and crypto, no hyperbole there.
posted by Artw at 9:29 AM on October 9, 2023 [5 favorites]


We are in the middle of a climate emergency. It's wild to me that we are willing to burn a bunch of fossil fuels to build a tool that lets us reformat text.
posted by tofu_crouton at 10:13 AM on October 9, 2023 [1 favorite]


My partner, for example, is a paralegal working on getting folks out of prison after some reformed laws give them a chance for resentencing. Many of these clients have been in prison for 30+ years, and have over 10k pages in their collective legal files; a big part of her job is summarizing these mountains of files. Imagine if we could make that easier, and give more time to work on the actual legal problems and take on more clients.
Would absolutely steer them the fuck clear of LLMs in this since they have a proven track record of fucking people over with hallucinations on legal matters.


This is again a case where LLMs could be done better than the prototypical examples we all likely have in mind. The design space of LLMs is vast and when we only attend to OpenAI, Google, and Anthropic we're missing the great majority of it.

Decoder-only LLMs get all the attention because they produce better naturalistic speech. But encoder layers in LLMs often improve the accuracy of summarization. An LLM could absolutely be purpose-built for accurate summarization that would do far better than the quick cash-grab of "let's slap a web front end over a stock model and pretend it's a robot lawyer" that made headlines.
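As a concrete (if oversimplified) illustration of that distinction, here's what summarization with an off-the-shelf encoder-decoder model looks like; BART is a real public model, but take this as a sketch, not a recommendation for legal work:

# Sketch: summarization with an encoder-decoder model (BART) rather than a
# decoder-only chat model. The encoder digests the whole input before the
# decoder writes anything, which tends to help faithfulness on summaries.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = open("case_file_page_001.txt").read()  # hypothetical input file;
# real case files would need chunking to fit the model's input window
result = summarizer(document, max_length=130, min_length=30, do_sample=False)
print(result[0]["summary_text"])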
posted by a faded photo of their beloved at 10:20 AM on October 9, 2023 [2 favorites]


New York lawyers sanctioned for using fake ChatGPT cases in legal brief

Steer clear.

It is hype and garbage and they do not care who they harm.
posted by Artw at 10:28 AM on October 9, 2023 [1 favorite]


The entire framing of things as evil or not is, bluntly, idiotic.

The interesting conversation is about /why/ hallucinations happen and what different groups are trying in order to mitigate the problems and make the algorithms more reliable: what's helped, what hasn't, and what might still be in the pipe.

Linking to an article that we've all seen and calling all of this work 'garbage' isn't helpful or interesting: it's basically thread shitting, and lowers the level of discourse. Maybe that's your prerogative, to make sure that no one can hold a discussion over all of the shouting, but it really just means I'm going to talk about this stuff in places other than metafilter.
posted by kaibutsu at 11:02 AM on October 9, 2023 [3 favorites]


The interesting conversation is about /why/ hallucinations happen

(b/c all mental activity is hallucination and before cognition can be useful, reality-test filters have to be evolved?)
posted by mittens at 11:05 AM on October 9, 2023 [1 favorite]


New York lawyers sanctioned for using fake ChatGPT cases in legal brief
I don't mean to take up too much space here, so I think I'll duck out at least for a bit after this. I'm not 100% sure your comment was replying to mine, or whether you were more just extending your previous comment with an extra link. In the case that your comment was replying to mine (and again, I might be wrong about this), it looks as though you are using "LLMs" as essentially synonymous with ChatGPT. That is understandable, considering ChatGPT's popularity. It's a little bit frustrating, because I believe that new language-based models could be built that address the concerns people have brought up here and elsewhere, but it's hard to know how to even express that idea if, in the popular usage, "language model" is taken to be exactly ChatGPT.
posted by a faded photo of their beloved at 11:23 AM on October 9, 2023 [3 favorites]


Maybe rose tinted glasses but for a long time this site seemed to me a good place to go to gain insight on technical subjects. Nowadays not nearly so much. Not sure why that is, but feels like a loss anyway.
posted by otsebyatina at 2:27 AM on October 9


I'm not sure this is a technical issue we are discussing. Is it not a labor issue?

I mean, I am a writer, and a voracious reader, and I don't care about the technical issues behind LLMs; I just don't want them to destroy the industry of human-written books. As a writer, my objection should be obvious; as a reader, I consider reading a book to put me into a conversation with the author, if a one-sided one, and I have little interest in being in a conversation instead with a mindless machine.

I have no objection to machine learning when it is turned to medical data or the animal vocalizations work described above, as long as the tech is carefully vetted to prevent hallucinations and the reproduction of biases (this is a big problem with medical data) and those sorts of issues. It seems obvious to me that those applications of this tech would be a net good if mindfully used.

But why are we (as a society) automating the pleasurable creative jobs that people want to do? Solely to make rich people more money? And then why do we as individuals have to pay for the output, especially given that our own labor was involved in creating the training data (literally our own labor, not just writers'; I have read that sites like Metafilter and Reddit were scraped for training data)? Either LLM-generated books should be sold at cost, or they shouldn't exist.
posted by joannemerriam at 11:30 AM on October 9, 2023 [11 favorites]


The entire framing of things as evil or not is, bluntly, idiotic.

Literally a framing you introduced to the conversation.
posted by Artw at 11:31 AM on October 9, 2023 [2 favorites]


it looks as though you are using "LLMs" as essentially synonymous with ChatGPT

Ehh. They all share the same problems of hype, plagiarized datasets, awful business models, and questionable results. I’m sure some give slightly different results than others, but it’s largely a distinction without a difference - using an LLM for legal work would be an awful idea with any of them and would probably get you into the same trouble.
posted by Artw at 11:35 AM on October 9, 2023 [1 favorite]


We are in the middle of a climate emergency. It's wild to me that we are willing to burn a bunch of fossil fuels to build a tool that lets us reformat text.

The concerns on this front tend to get overblown; I think this is mostly a result of a) bad accounting, and b) people drawing analogies to BitCoin mining, which isn't terribly comparable.

So, let's address those.

One of the most-cited papers on the environmental impacts of ML training is this one. They estimate kWh to train a model, and then multiply by the US-average carbon per kWh to get a total carbon estimate. What this fails to account for is that significant work is done to reduce the carbon impact of datacenters - in fact, they are very amenable to optimization for cleaner energy, and most-to-all of the (obviously-evil) companies who use datacenters have been working for years to maximize renewables usage for their datacenters. So the numbers cited are an upper bound for carbon emissions, and further work is needed to get a realistic estimate.
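The arithmetic behind those estimates is simple enough to show; every number below is a placeholder of mine, only the method comes from the paper:

# The paper's estimate in one line: energy used times grid carbon intensity.
# All figures are illustrative placeholders, not measurements.
train_energy_kwh = 1_000_000          # hypothetical energy to train one model
us_avg_kg_co2_per_kwh = 0.4           # rough US grid average
clean_dc_kg_co2_per_kwh = 0.05        # hypothetical well-sited datacenter

upper_bound_tonnes = train_energy_kwh * us_avg_kg_co2_per_kwh / 1000   # 400.0
realistic_tonnes = train_energy_kwh * clean_dc_kg_co2_per_kwh / 1000   # 50.0

The same kWh figure gives an 8x different answer depending on the grid assumption, which is exactly where the dispute lies.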

Google did this work for their own model training [blog, paper], and found that a) carbon estimates from 3rd parties were between 100x and 100,000x over actual emissions, and b) adopting further best-practices can reduce the actual carbon impact by 100x-1000x.

Secondly, let's take a moment to compare the situation to BitCoin mining. BitCoin makes computation harder as more people join the network. This makes the compute increase in a predictable way as more and more people get involved in the system, and has led the power requirements to balloon, eventually equalling the power consumption of a small country. Ethereum (to their credit) has switched to a 'proof-of-stake' scheme which largely eliminates this problem, but it absolutely persists for BitCoin.

How does this compare to machine learning? People are racing to make these models more efficient and reduce costs, both for training and inference. This is an area where the economic incentives lead to better outcomes: everyone wants to make training and inference cheaper, which leads to less energy consumption and fewer emissions. Unlike BitCoin, there's no mechanism saying that costs (in terms of energy and/or carbon emissions) have to go up with time. And basically everyone in the research space is saying that the days of throwing more parameters at the problem are over, and is developing more parameter- and compute-efficient architectures.

It's worth thinking about what happened for the Human Genome Project. The first sequencing of the Human Genome took billions of dollars and over a decade of work, and now it costs well less than a thousand dollars and takes maybe a day.

Likewise, training a decent CNN used to be a major undertaking, but for many problems it can now be done in under an hour on a laptop. Computing, generally, is an area where we see massive price decreases through a combination of improving hardware and improving algorithms. See, for example, this discussion of the relative impact of hardware and software improvements on computational efficiency:
But the White House advisory report cited research, including a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds[...]. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.
As a result, it's not a great idea to take the initial breakthrough research costs as indicative of the long term costs for doing the work. Again: the economic incentives point in the right direction to reduce energy usage overall, and furthermore lots of work is being done beyond the mere economics to ensure that these systems are using more renewable energy sources. Fossil fuels are literally not being burnt for the majority of ML work.
posted by kaibutsu at 12:04 PM on October 9, 2023 [2 favorites]


On the other hand, apparently it's costing MS $20 in computing power/month/user to run GitHub Copilot. No way that's not also representative of a shocking amount of computing/energy use, especially for what people are getting out of it.
posted by sagc at 12:06 PM on October 9, 2023 [3 favorites]


“...it's basically thread shitting, and lowers the level of discourse.”

There are a few people who reliably show up in these threads who make it clear that they don't have anything substantive or informed to say, they're just pissed off. There are a whole bunch of genuinely good reasons to be pissed-off about all this, so I try to read their comments charitably even though almost everything they say is reductive, glib bullshit. If it makes it easier, you can just skip over the noise and engage with the worthwhile comments, and most everyone ends up happier.
posted by Ivan Fyodorovich at 12:13 PM on October 9, 2023 [3 favorites]




But electricity doesn't equal carbon.
posted by kaibutsu at 1:39 PM on October 9, 2023 [1 favorite]


Microsoft's water consumption jumps 34 percent amid AI boom

Even with optimally placed data centers you are talking more carbon and more water consumption. Even if you are talking all hydro, then other sources have to step up for other power needs, so it’s more carbon.

This shit absolutely is not free, and cannot be greenwashed away.
posted by Artw at 1:49 PM on October 9, 2023 [9 favorites]


Bitcoin and other crypto use far more electricity and provide zero benefit. The costs of running these models are dropping as we get better chips and improve performance.
posted by interogative mood at 3:16 PM on October 9, 2023 [1 favorite]


The costs of running these models are dropping as we get better chips and improve performance.

Isn't this the setup to classic Jevons paradox issues? Contemporary computers are *vastly* more efficient than they were 1, 2, 3 decades ago. Performance-per-watt can be amazing these days. Thing is? That means we stuff them in more and more things.

If I remember right, that was one of the big breakthroughs with GPT-2; where the volume of data ingested & amount of training kept scaling well for the final product, well past previous predictions about where it might hit diminishing returns. Look at how LLaMA scaled. 7B parameters, then 13, 33, & 65B. LLaMA-2 went 7/13/70. Same model architecture, but 40% more input data between 1 & 2. GPT-2's corpus was 40GB, GPT-3 1200GB. LLaMA-2 claims ~10TB. (Apples, oranges, & avocados; I know; but rough ideas of scale)
"LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters, reasoning that the dominating cost for LLMs is from doing inference on the trained model rather than the computational cost of the training process."

Similarly, the ways LLMs are being applied is also expanding quickly. A lot of uses I've seen may be silly/"haha, what if?" type stuff; but whenever I hear from someone who's a big booster of the stuff it's not long before pretty much any form of text-serializable input gets "what if we just ran it through GPT-4?" proposed.

Now, I'm not sure that the power/water consumption issues of LLMs *won't* abate; it's not my primary critique of them. But "This'll all be worth it" tends to run on a shared assumption of 'worth it', and much like with cryptocurrency, assuming that shared worth doesn't tend to work well rhetorically. I'm already seeing an upswell of crypto-style "NGMI/Not Gonna Make It" type taunting about how anybody who isn't embracing LLM tools is going to be left scratching in the dirt while they ride their way to techno-feudal utopia.
posted by CrystalDave at 4:05 PM on October 9, 2023 [7 favorites]


Machine learning is not evil. I use machine learning in my work.

However, Microsoft purchasing OpenAI, and turning it into a profit engine for billions, is what this post is about.

As long as the companies developing AI are doing it just to benefit mankind and are not profiting off it in any way then I’m fine with it.
posted by snofoam at 6:23 PM on October 8


What the links above posit is that that ship has sailed.
posted by eustatic at 5:24 PM on October 9, 2023 [4 favorites]


I was being honest about infinite noncommercial remixing of culture and info being aligned to my values. I was kidding in the sense that, yeah, obviously corporations are doing this for profit.

LLMs + current copyright law are kind of the worst case scenario in that people are prohibited from recycling/sampling human culture for a period that is just far too long, but machines and corporations are poised to do it with impunity for profit at a scale that could inundate and overwhelm actual human expression and leave us all floating in a sea of shit.

Also, bird call AI is really useful, but in at least some cases it isn’t picking up unexpected or unknown calls. If the system is trained on known bird calls, it may classify unknown calls as random noise and not tag them.
posted by snofoam at 6:05 PM on October 9, 2023 [5 favorites]




There was a thread not long ago on our diminishing privacy. I never see it come up in these AI threads, where datasets are scraped from our personal interactions. If you work on ML and the data wasn't given by choice, maybe you think "oh, but we don't care about the private/personal stuff on there. We do this for good reasons! Not those bad reasons." Well, your company may indeed decide to sell that data for more destructive purposes, and then what was your work really for?

What's that, ML requires a dataset so huge that you can't possibly source it from only willing participants? Well, maybe it isn't such a worthwhile technology then, and you can shift your efforts to some other way of achieving your goal. ML is not immune to being replaced, like so many other technologies in the past. A good engineer strives for efficiency.

The more I see people trying, somehow, to defend this kind of stuff, the more it makes me want to become a hermit. Why should I share anything if it will just be taken advantage of by people who really don't care about me or see me as anything more than a data source to be mined?

I expect as people with opinions like mine get more numerous, the most we'll get is some annoying popup saying "We care about you! You are an individual who deserves respect. Give us your data? Click Yes."
posted by picklenickle at 1:44 PM on October 11, 2023 [3 favorites]


the global economy, 2023 edition
posted by Artw at 9:30 AM on October 12, 2023


Meanwhile…

The new think tanks influencing AI policy in Washington

Which is of course an effort to focus everyone on bullshit far-off superintelligence issues and not the actual pressing issues of right now.
posted by Artw at 3:04 PM on October 20, 2023 [1 favorite]






This thread has been archived and is closed to new comments