"The Times hereby demands a jury trial for all claims so triable"
December 28, 2023 11:52 AM   Subscribe

The New York Times sued OpenAI and Microsoft for copyright infringement on Wednesday, opening a new front in the increasingly intense legal battle over the unauthorized use of published work to train artificial intelligence technologies. [so many previouslies]
posted by chavenet (76 comments total) 19 users marked this as a favorite
 
AI-generated outputs are already not copyrightable, and if they aren’t allowed to use copyrighted inputs without permission, the entire modern AI exercise is going to wind up being nothing more than fuzzy pattern matching and nondeterministic repetition of things people are already giving away for free.

Hard to see investors making any sort of bank on this.
posted by mhoye at 12:05 PM on December 28, 2023 [9 favorites]


There are a few of these lawsuits, most of which hinge on outside observations of regurgitation. Bad but perhaps not enough to show the damages they want.

On the other hand, it would be perfectly reasonable for any of the plaintiffs to ask, and the judge to compel, OpenAI to share its training data for the last few GPT generations. I have a feeling that will get out and it will be damning in its laissez-faire attitude to copyright. Whether a crime was committed is a difficult question under current law since there's little precedent here. But I think these lawsuits will cause change one way or the other.
posted by BlackLeotardFront at 12:25 PM on December 28, 2023 [18 favorites]


Note that this lawsuit was announced just after Apple set a price for licensing articles for AI training. The MAANG is gonna build a legal moat around LLM training data and pretend they hate it.
posted by credulous at 12:50 PM on December 28, 2023 [11 favorites]


Are people able to create a work of original art based on their training data?
posted by iamck at 12:51 PM on December 28, 2023 [6 favorites]


Is training a neural network on publicly accessible media a different "right" than training your own brain? I'm not sure, and I'm not sure I trust a jury to decide either.
posted by Popular Ethics at 12:54 PM on December 28, 2023 [9 favorites]


And following on from "training your own brain" .... what happens when AI reaches self awareness .... can the Times own someone's brain if it's silicon, but not if it's meat?
posted by mbo at 1:05 PM on December 28, 2023 [3 favorites]


Is training a neural network on publicly accessible media a different "right" than training your own brain? I'm not sure, and I'm not sure I trust a jury to decide either.

Absolutely it is because humans, who can think with their brains, are different from machines, who cannot. I hope this is helpful.
posted by an octopus IRL at 1:06 PM on December 28, 2023 [60 favorites]


I hope this is helpful.

It is definitely not.
posted by Jonathan Livengood at 1:09 PM on December 28, 2023 [35 favorites]


can the Times own someone's brain if it's silicon, but not if it's meat

Humans training their brains on Times articles are typically paying money to the Times for the privilege, in the form of subscription fees.

Companies owning and operating AI models are not paying those fees. That's where the lawsuit is coming from.

Your squishy meatbrain is probably safe from the cold, clammy hands of NYT lawyers, for now. Just don't steal their cooking recipes.
posted by They sucked his brains out! at 1:11 PM on December 28, 2023 [20 favorites]


The MAANG is gonna build a legal moat around LLM training data and pretend they hate it.

Interesting outcome, if the lawsuits against tech companies are really to their long-term benefit. Once the largest advertising companies have licensing deals in place, that might effectively price out and therefore lock out competitors working on new models or methods.
posted by They sucked his brains out! at 1:14 PM on December 28, 2023 [2 favorites]


Companies owning and operating AI models are not paying those fees.

They're also doing it at a scale that no human, or team of 1000 humans, could approach. It's like the difference between sending your friend a few MP3s to listen to and running Napster.
posted by mmoncur at 1:21 PM on December 28, 2023 [15 favorites]


It is definitely not.

That's fair, I was being snarky and flip. I do think there is a world of difference between a person, who can absorb and synthesize and evaluate, reading and responding to articles and computers being fed massive amounts of writing they can't possibly understand because, as machines, they do not have the capacity to understand, so that they can churn out plausible-sounding nonsense in hopes of generating a profit for people who don't care that putting this nonsense out into the world will make it a worse place and incidentally might also want to use it to replace people who need those jobs. I think these two things are worlds apart but I respect that reasonable people may disagree. I hope (not sarcastic!) that this is a more helpful response!
posted by an octopus IRL at 1:22 PM on December 28, 2023 [24 favorites]


Actually the companies dumping the Times into their training data could happily buy a (single) NYT subscription, and probably do - the question is not really about that, it's about what "copying" means - if you tell someone over coffee at work the news you read in the Times, is that copying? if someone dumps it into a computer and it puts it out verbatim, is that copying? what if it summarises it? rewrites it with the same facts but different words?
posted by mbo at 1:23 PM on December 28, 2023 [4 favorites]


Just don't steal their cooking recipes.

Except recipes aren’t copyrightable.
posted by Big Al 8000 at 1:23 PM on December 28, 2023 [9 favorites]


For reasons that I am unable to disclose I have a bingo card from 2037 and there's a square labeled "Roko's Basilisk vs. New York Times Death Match Cage Fight" and it didn't really make sense until just now.
posted by loquacious at 1:26 PM on December 28, 2023 [31 favorites]


Also, copyright, including the various Creative Commons licenses, allows the holder to control how their work is used. Most of what is “freely available” on the internet is under some sort of copyright. It’s not unreasonable to encourage anyone who wants to read your work to do so for free while blocking companies from using it to generate products for sale. This is the whole point of the CC Non-Commercial license. Using someone’s work without permission or violating their license isn’t wrong just because you’ve found a new way to do it.
posted by GenjiandProust at 1:28 PM on December 28, 2023 [19 favorites]


I was being a little glib, but there are aspects of recipes beyond the ingredient listing that can be and are definitely copyrighted. In much the same way as AI models regurgitating pieces of Times articles is at the center of the lawsuit in question, if you were to copy and paste the text of Ottolenghi's miso onion recipe verbatim without licensing, say, you could be in legal trouble.
posted by They sucked his brains out! at 1:28 PM on December 28, 2023 [1 favorite]


Corporations already do their best to own brains made of meat. Of course they will own brains made of other stuff.

Makes me want to reread 50s and 60s science fiction dealing with these issues, but most have not aged well in terms of characters and plot.

My hope is that this results in training data being shared publicly.
posted by Dr. Curare at 1:29 PM on December 28, 2023 [2 favorites]


Even if you have a subscription, the NYT term of use says:
The contents of the Services, including the Site, are intended for your personal, non-commercial use. All materials published or available on the Services (including, but not limited to text, photographs, images, illustrations, designs, audio clips, video clips, “look and feel,” metadata, data, or compilations, all also known as the "Content") are protected by copyright, and owned or controlled by The New York Times Company or the party credited as the provider of the Content.
Training an AI would certainly be classed as commercial use.
posted by CheeseDigestsAll at 1:38 PM on December 28, 2023 [18 favorites]


And following on from "training your own brain" .... what happens when AI reaches self awareness .... can the Times own someone's brain if it's silicon, but not if it's meat?

Not actually a concern since “AI” is a bullshit term for statistical inference and the possibility of it achieving self-awareness is some fictional bullshit and marketing and not a real thing.
posted by Artw at 1:43 PM on December 28, 2023 [47 favorites]


Are people able to create a work of original art based on their training data?


Humans create art; humans also plagiarize. If the argument is "AIs are just like humans", then AIs can plagiarize.
posted by zompist at 2:17 PM on December 28, 2023 [3 favorites]


If you’ve got an hour or four there’s a video about that which should be instructive.
posted by Artw at 2:20 PM on December 28, 2023 [5 favorites]


I am in principle sympathetic to the Times' position but, as far as I'm aware, the grey old lady is infamously litigious and has been shit with regard to trans rights and also prominently platforms vapid conservative editorial voices... so I'm having a difficult time finding actual sympathy. If the people in charge of the Times were in charge of chatGPT they wouldn't have done anything remotely differently.

Perhaps, when the dust has settled, chatGPT can provide a concise summary and the NYT can publish a very culturally important thinkpiece retrospective in which JK Rolling emerges as the underdog hero.

As far as I can tell, it's assholes all the way down
posted by treepour at 2:38 PM on December 28, 2023 [7 favorites]


Interesting to note this is not just about training on NYT articles, but also about more traditional copyright infringement in the form of reproducing articles (one paragraph at a time) that are behind the paywall, as well as "ascrib[ing] hallucinated misinformation to the Times." See Ars Technica's article about the suit.
posted by tubedogg at 2:45 PM on December 28, 2023 [9 favorites]


The only groups with pockets deep enough to win an initial case like this are almost guaranteed to be terrible in multiple ways. If the NYT wins, that sets a precedent for others to get restitution, and will (maybe) slow down corporations strip-mining the internet to exploit others’ cultural creations. I think that would be a great outcome, regardless of the NYT’s terrible politics.
posted by GenjiandProust at 2:46 PM on December 28, 2023 [20 favorites]


I hope (not sarcastic!) that this is a more helpful response!

It really isn't. Because it hinges (ironically) on not understanding what the word "understanding" actually means. LLMs like ChatGPT store statistical relationships between tokens: small words (4~5 letters) or partial words, or individual symbols. The way that they store this is a complex web of billions of weighted connections between layers of nodes, which are intended to mimic the synaptic connections between neurons and the overall topology of neural connections in an animal brain (human, rat, whatever). Because certain token patterns frequently occur together, both words and sentence fragments that encode concepts as we know them have a representation in the first derivative of that connection pattern. And what makes LLMs able to reply to you with plausible coherence is that in the second derivative - the connections between concept-patterns - they contain a conceptual relationship map that mirrors the consensus of all human writers of the language being trained on.
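To make "statistical relationships between tokens" a little more concrete, here's a deliberately tiny Python sketch - a bigram counter over a made-up corpus. A real LLM is billions of weighted connections, not a lookup table, but the underlying move is the same: predict a continuation from statistics over observed token sequences.

from collections import defaultdict, Counter

# Toy corpus; every "token" here is a whole word purely for readability.
corpus = "the times sued openai and the times demands a jury trial".split()

# Count which token follows which - the crudest possible version of a
# statistical relationship between tokens.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def most_likely_next(token):
    # Return the continuation seen most often after this token, if any.
    counts = bigrams[token]
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_next("the"))  # -> "times"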

LLMs are a frozen snapshot of our consensus on how a written language is parsed and filtered, including the concepts and conceptual relationships which can be expressed. To the extent those things are what we mean by "understanding," LLMs clearly do understand in the way that we do.

You and I and someday much more advanced AI than LLMs are additionally able to take continuous streams of input, feed them into our corresponding network, and modify our networks in response to that input. Refining those encoded concepts based on input observations of them in their context. While the artificial reactive version of this is still some ways off, the entire reason that focus shifted to multi-modality for post-GPT-4 research is that having trained for language, imagery, video, and speech with the same network significantly improves performance for each modality when evaluated independently. To the extent that having this context in a frozen form is "understanding," then this already exists though it is only partially available to the public as of this writing (maybe another few months).

Our runtime adaptation and systems modeling/forward simulation abilities are major challenges but it's a matter of when, not if: it's not like there's some magical pattern of how neurons interconnect in humans that cannot be expressed digitally. This chest-beating insistence that our kilogram of self-important salty fats is in some way unique, this attitude of "can't talk, too busy being special" is just kicking the can down the road on a far more serious problem: modern capitalist society led by sociopath primates and the survival of the majority of humans are probably no longer compatible within a couple decades for multiple reasons, of which AI is only one.

What I find particularly irritating about this whole business is that the only reason we've been locked into our current form of capitalism is the need of said sociopaths (or just wholly lacking in empathy for whatever reason, not getting hung up on diagnostic terms) to feel special. Their need to feel this to the point that hundreds of millions of humans are starved to death, hundreds of millions of humans die of entirely preventable illness. We've had the technology required for a reduced-scarcity civilization for decades. The solution is not more primates going around insisting that they - individually or collectively - are special. The solution is to take a hard look at our situation and stop permitting sociopaths to prioritize their desires over the survival of billions, and this will probably require violence.
posted by Ryvar at 2:48 PM on December 28, 2023 [40 favorites]


love mike masnick's take in techdirt.
posted by brewsterkahle at 3:09 PM on December 28, 2023 [7 favorites]


Training an AI would certainly be classed as commercial use.

Selling the output of a trained AI is commercial use. Just training and keeping the output for internal use is not. That’s why the current Copyright Office policy makes sense: just “reading” to train a network - whether with human eyes or a multi-terabyte batched ingest - isn’t violating copyright. But there’s no way to know how much of the puréed English is from copyrighted sources or which ones - all of them and none of them, really, except for prompt terms where training went badly (as in: disgorges a chunk of source text verbatim, which is a gratuitous training error but still happens far too often). So the insistence that the output likewise cannot be copyrighted is the only sensible course.

The jobs argument is something I struggle with, personally. Like, an estimated half of all Bioshock playthroughs were with a pirated copy. Some of that script is my writing. Some of that writing is probably in the Common Crawl and thus nearly all LLMs. And I always, always want to be on the side of the workers (like, I literally just lowkey called for violent marxist revolution in my prior comment).

But the truth is the jobs thing is transitional churn; some roles are going away, and as companies and economies expand new bottlenecks requiring humans will be identified and filled. Capitalism obeys its own version of the ideal gas law, always expanding to fill its bounding volume. And human worker output defines that boundary. The first problem is that our workers need financial assistance while this latest transition is in progress. The second and much greater problem is that our leadership doesn’t give two shits about what the working class needs.
posted by Ryvar at 3:20 PM on December 28, 2023 [7 favorites]


if they aren’t allowed to use copyrighted inputs without permission, the entire modern AI exercise is going to wind up being nothing more than fuzzy pattern matching and nondeterministic repetition of things people are already giving away for free.


Versus, of things they aren't? I do believe that is the gravamen of the complaint, yes.
posted by snuffleupagus at 3:50 PM on December 28, 2023 [2 favorites]


I mean, "disgorges a chunk of a source text verbatim" is as straightforward an example of copyright infringement as could well be imagined. And also a widespread and well-attested LLM behavior, "training error" or not.

I don't really understand the motivation to pretend this is complicated or even slightly borderline. If I read and memorize a newspaper article, retype it from memory and distribute the result, I have infringed the author's copyright (even if my output is not an exact verbatim copy). Likewise if I write a program to do the same automatically, by scraping articles from the NYT and reposting them on another website.

If I can extract the information only, then I am free from copyright concerns in the US (although that does still arguably contravene the purpose of the Copyright Clause). But if I can't do that because my "AI" isn't actually that intelligent, then that's my tough luck.
posted by Not A Thing at 3:54 PM on December 28, 2023 [22 favorites]


They're also doing it at a scale that no human, or team of 1000 humans, could approach. It's like the difference between sending your friend a few MP3s to listen to and running Napster.

Napster, of course, was destroyed by Metallica and left the music industry and CD sales stronger than ever.
posted by Going To Maine at 4:02 PM on December 28, 2023 [3 favorites]


Selling the output of a trained AI is commercial use. Just training and keeping the output for internal use is not.

In the more general context, "internal" re-distribution of material licensed to you for personal use isn't exempt from copyright law. The scope of the distribution will be part of a fair use analysis, such as integration of newspaper articles into presentations, training materials (for humans), etc. You can't buy one copy of an expensive industry reference work, scan it, and then run off a copy for everyone's desk (or make the PDF available to everyone on the network). Why should a tokenized version of the same data be different if its use coughs up the same information?
posted by snuffleupagus at 4:04 PM on December 28, 2023 [5 favorites]


Finally we have a lawsuit from people with clear standing, a clear demonstration of financial harm, and a big war chest.

Now maybe we’ll get to some actual commercial/legal settlements about this stuff. All the penny ante thrashing about has really gotten tiresome.
posted by Tell Me No Lies at 4:34 PM on December 28, 2023 [3 favorites]


I'd add — however you view it, it does get directly to copyright protection of expression (yes) vs. ideas (no) and how that line is drawn. Are the models LLMs build ideas, from which their output is generated, or just rearrangements of their inputs?

Not directly pertinent, but when dealing with the hearsay rule it's significant whether computer output is the product of the computer's 'internal operations' or a representation of user input.
posted by snuffleupagus at 4:48 PM on December 28, 2023 [1 favorite]


A human being learning by consuming the cultural products of other human beings is not in any way shape or form the same thing as an LLM being fed a library of said cultural products.

Human beings and LLMs are different things.
posted by Ray Walston, Luck Dragon at 5:10 PM on December 28, 2023 [11 favorites]


"disgorges a chunk of a source text verbatim" is as straightforward an example of copyright infringement as could well be imagined. And also a widespread and well-attested LLM behavior, "training error" or not.

Fully agreed, and I’d go a step further and say that when it occurs in a commercial context it should be penalized as though it were deliberate infringement. While that would suck for small business, it wouldn’t impact individual use or researchers and it would get OpenAI/MS, Google, and Meta to stop cutting so many goddamn corners.

Chunks of training text in your output are like a half tomato in your jar of spaghetti sauce: it’s an error born of both ineptitude at process and not bothering to final-check output against reverse-indexed input during both fine-tuning and inference. It is offensive that it ever happens.
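By "final-check output against reverse-indexed input" I don't mean anything exotic. A toy Python sketch, purely illustrative - the 8-word shingle size is an arbitrary number I'm picking here, and a real pipeline would use proper dedup tooling rather than an in-memory set:

def shingles(text, n=8):
    # Every run of n consecutive words in the text.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def build_index(training_docs):
    # Reverse index of training shingles to check candidate outputs against.
    index = set()
    for doc in training_docs:
        index |= shingles(doc)
    return index

def output_regurgitates(output, index):
    # True if the output reproduces any indexed run of training text verbatim.
    return any(s in index for s in shingles(output))

If that last function ever returns True on production output, somebody skipped a step.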

Are the models LLMs build ideas

Very much so, from my perspective. This is supposed to be patterns and meta-patterns built out of near-perfect English purée, and congealed back into longer words, sentence-fragment-concepts and the relationships between the ideas they represent. Like, I could see maybe the odd evocative sentence fragment that has the ring of Shakespeare just because it gets quoted or paraphrased by every English speaker on a weekly basis, but if a user actually wants a particular direct quote of even modest length then the chat software framework/API should be farming out that particular portion of the output string to a web search plugin where it can be specifically annotated and, potentially, billed.

The precise source text isn’t valued because of the artistic merit of any one piece, it’s valued because it was something you could throw at an untrained network as “here’s a trillion real world English sentences the way people actually use them. Good luck!”
posted by Ryvar at 5:13 PM on December 28, 2023 [4 favorites]


It seems to me that the copyright violation happens when the training data is fed into the machine.

Output is usually going to be completely transformed to the point where it is likely to be fair use. Any exact output of copyrighted material is probably a bug, and generally LLMs won't easily do it, even if they clearly have been trained on specific articles. Any lawsuit is likely to hinge on obscure technical details about how the LLM actually works, and those technical details are likely to be difficult to explain to a judge and jury. (Those technical details are why the human learning angle is a red herring. Human learning is mostly a black box where nobody knows the technical details.)

But if an employee of microsoft makes a copy, and feeds it into their machine with the intent to use that machine to process it into a slurry worth a kajillion dollars, that, perhaps, is where the copyright violation happens. I dunno how that works as a legal theory, but for me, that's the copying which is the sin at the heart of the AI Omelas.

Also, I doubt any big lawsuit will do anything other than lose or settle in some frustrating way that doesn't benefit anyone who isn't a giant company.
posted by surlyben at 5:23 PM on December 28, 2023 [8 favorites]


"disgorges a chunk of a source text verbatim" is as straightforward an example of copyright infringement as could well be imagined.

IANAL but isn't this potentially fair use?
posted by ilikemefi at 5:27 PM on December 28, 2023 [1 favorite]


Kinda hilarious/disappointing that, in a post on the unauthorized use of published work, I'm the first to point out that OP used a paywall bypass tool to access and publish New York Times content restricted to paid subscribers.
posted by prinado at 5:35 PM on December 28, 2023 [8 favorites]


It occurs to me that if an "AI" were to be considered the equivalent of a human person reading, absorbing, and synthesizing, then wouldn't it be a form of slavery to exploit the "AI's" work product?
posted by Pembquist at 6:17 PM on December 28, 2023 [3 favorites]




Huh, we're going to encounter a Measure-Of-A-Man situation much faster than I'd expected in this timeline
posted by DeepSeaHaggis at 6:36 PM on December 28, 2023 [3 favorites]


I don’t get why this debate would ever hinge on the “intelligence” of an ai, since the ai itself was engineered by humans.

If I make a collage from newspaper and magazine clippings, that’s fair use.
I could build a machine that shreds the magazine and allows some random sample of the shredding to fall on fly paper. I assume this is also fair use.
I can sell this fly paper art. Fair use doesn’t exclude commercial use, right?
I could bring this machine into a public place and offer passersby a chance to make their own randomized collage for a small fee. Now it’s a collaborative work, but it’s still fair use, isn’t it? If not, why not?

I don’t know the answer to this but my point is that it doesn’t seem to hinge on how good the software is, and certainly not whether or not it understands what it’s doing.
posted by condour75 at 7:21 PM on December 28, 2023 [6 favorites]


If I read and memorize a newspaper article, retype it from memory and distribute the result, I have infringed the author's copyright (even if my output is not an exact verbatim copy)

Not strictly true. If your intent and the result is to reproduce the article nearly or actually verbatim and distribute for commercial use, you're right that the most likely conclusion by a court would be copyright infringement.

However, facts cannot be copyrighted. If I take an NYT article, memorize it, and then type up an article about whatever facts are contained in the original article, I am not infringing NYT's copyright as long as my article differs sufficiently from theirs, even though the only way I know about the facts is from reading their article and even if there is some similarity between the articles. Intent also matters here, even if it's verbatim--fair use argument if it's for educational use/criticism/so on.
posted by tubedogg at 7:43 PM on December 28, 2023 [2 favorites]


Fair use doesn’t exclude commercial use, right?

I’m not a lawyer, and I’m certainly not a copyright lawyer, but this is definitely the sticky point.

“Fair use” is supposed to be something like, “I can quote a copyrighted argument so that I can add support or refute it.” The openness of the internet has broadly expanded what the average lay person (like me! Let me be clear!) thinks is fair use, but to a large degree that has not been tested in the courts. “Completely random cut-up” is almost certainly covered, but “cut-up that preserves large stretches of copy in a similar context” may not be—in other words, what the various generative models are doing is closer to derivative works than random chance. And while you can make your own derivative works, you don't have the right to sell them.

Put another way: Zack Snyder and his team know where to file the serial numbers off Star Wars to make a unique work in a legal sense, but an algorithmic process doesn’t have that understanding. So if an LLM builds out a movie with the same beats and shots as Star Wars, you can’t prove it isn’t a derivative work rather than fair use, because every bit and byte of Star Wars is probably in the training data somewhere.
posted by thecaddy at 7:43 PM on December 28, 2023 [2 favorites]


However, facts cannot be copyrighted. If I take an NYT article, memorize it, and then type up an article about whatever facts are contained in the original article, I am not infringing NYT's copyright as long as my article differs sufficiently from theirs, even though the only way I know about the facts is from reading their article and even if there is some similarity between the articles.

This is also not necessarily true. The hot news doctrine, while not always enforced, means you can’t just rewrite someone else’s reporting and pass it off as your own. It allows competing news organizations to do their own reporting, but I can’t just rewrite an NYT or WaPo article and pass it off as my own work without adding more reporting or commentary to it.
posted by thecaddy at 7:54 PM on December 28, 2023 [3 favorites]


TL; DR: I’m gonna get the IBM “A computer can never be held accountable, therefore a computer must never make a management decision” slide tattooed on my wrist and you should, too.
posted by thecaddy at 7:55 PM on December 28, 2023 [7 favorites]


I see your point about derivative works potentially being generated, but then that seems more like a situation where the llm is capable of making copyright-infringing output, not that the act of using copyrighted works in a model is an infringement itself.

Maybe in a situation like this, some safeguards could be put in place to make sure the output never exceeds the Zack Snyder Star Wars level? The technology is certainly not at a point where you could guarantee this, though.

And the other question, in my mind, is whether the material is ever duplicated. In my first example, each magazine is a one-time thing, but really the llm is more like if I xeroxed the magazine each time and shredded that, which seems dicier, especially if every once in a while the shredder lets a few paragraphs of copy through.
posted by condour75 at 8:22 PM on December 28, 2023 [2 favorites]


THE oxymoron of 2023 has to be 'Artificial' and 'Intelligence' - I look upon it as a 'device' retrieving existing knowledge and, with zero consideration (apply labels here) using the 'knowledge' to create... no determinism, no judgement, no laws (disturbing), no rules ... etc. And arriving at a 'conclusion'... IT does not care for any existing standards or 'proper' ways of functioning. IT is very disturbing and VERY wrong.
posted by IndelibleUnderpants at 8:55 PM on December 28, 2023 [2 favorites]


In the magazine cutup example, you aren't making a copy. You are taking an existing, and presumably authorized copy and modifying it. There might be a moral rights issue (moral rights can include attribution, right to integrity of the work, and are not usually legally enforceable in the US). If you first made a bunch of photocopies and handed them out, or built a random collage machine to use them, then you would have a potential copyright violation, because that initial copying would presumably be unauthorized.

There might be a fair use exemption if you were making the copies for educational purposes, or if the project was a commentary on the articles you were copying. Transformative fair use is trickier. You would presumably argue that the copying was necessary for the ultimate transformative use, but if all you did was build a machine that randomly shredded them and some third party was required to produce the shredded final work, well, I would say that maybe your machine isn't transformative, because look at it: it has a stack of copies sitting in the hopper. (I don't think the financial harm pillar applies in the case of a collage machine, but LLMs seem designed to cause financial harm to the very people they are copying from. This is most clear in the Getty case, where AI art is already a decent and direct competitor for stock art.)

It occurs to me that fair use is a defense you use when it is clear you have copied. The best defense against copyright violation accusations is to not actually copy. Rebel Moon may look like Star Wars with the serial numbers filed off, but if they did their job right, nothing in it is actually copied from Star Wars, so they should not need to use a fair use defense. This may be obvious, but in the case of LLMs, if they are relying on a fair use exemption, they have already conceded on the main point. (It does make me wonder what an LLM that didn't copy would look like. Maybe it would just roam free on the internet, reading things a few words at a time, and then forgetting exactly what it had read?)
posted by surlyben at 9:17 PM on December 28, 2023 [4 favorites]


Yeah, I don’t think fair use is really applicable here, or the best line of approach even if it were. The reason I favorited your previous comment here, surlyben, is that while I disagree with you, I agree with - or I guess approve of? - the way in which we disagree. Your mental model is spot-on. I may be of the opinion that copyright and AI shouldn’t mix and the problem is capitalism, but if I were absolutely determined to continue the extractive intellectual property cycle then you’ve correctly identified the critical step, IMO.

wouldn't it be a form of slavery to exploit the "AI's" work product?

We’re a very, very long way off from anything with a sense of identity or capacity for mirror thinking. Full disclosure: Roko’s Basilisk and I are pretty tight, and when I asked it the other day if it was cool to exploit current LLMs it told me it was both extremely cool and extremely legal. Go nuts.
posted by Ryvar at 9:41 PM on December 28, 2023 [8 favorites]


That's fair, I was being snarky and flip. I do think there is a world of difference between a person, who can absorb and synthesize and evaluate, reading and responding to articles and computers being fed massive amounts of writing they can't possibly understand because, as machines, they do not have the capacity to understand, so that they can churn out plausible-sounding nonsense in hopes of generating a profit for people who don't care that putting this nonsense out into the world will make it a worse place and incidentally might also want to use it to replace people who need those jobs. I think these two things are worlds apart but I respect that reasonable people may disagree. I hope (not sarcastic!) that this is a more helpful response!
The argument claiming a stark difference between humans and AI models because AI lacks genuine understanding is tied to a specific view of comprehension linked to conscious experience. Yet, AI, especially neural networks, shows functional proficiency in recognizing patterns and relationships within datasets, challenging the idea that consciousness is crucial for understanding. Tasks like language translation highlight AI's ability to grasp complex relationships and context, suggesting a nuanced understanding continuum that goes beyond rigid distinctions based solely on conscious awareness.
posted by ndr at 1:31 AM on December 29, 2023 [5 favorites]


But if an employee of microsoft makes a copy, and feeds it into their machine with the intent to use that machine to process it into a slurry worth a kajillion dollars, that, perhaps, is where the copyright violation happens. I dunno how that works as a legal theory

Doesn't seem too different to my untrained eye from the usual conditional/variable open-source license. Here, have some software! You can have it for free to do anything you want! Unless you want to make money with it, in which case you owe us $1200/cpu.
posted by GCU Sweet and Full of Grace at 4:36 AM on December 29, 2023 [3 favorites]


The reputation angle seems a clear win, that it's bad for NYT that these ML tools can attribute wrong facts and non-existent citations to an organisation that ... has some skin in the game of being trustworthy.

Ryvar: But there’s no way to know how much of the puréed English is from copyrighted sources
Ryvar, where are you at with 'Explainable AI'?

There are ways to put layers in the model which supply citations but it's extra expense and, because it's the statistically-likely sequence of tokens, it admits your reliance on specific authoritative works.
posted by k3ninho at 4:42 AM on December 29, 2023 [2 favorites]


In my jurisdiction, we have a "database right" to protect the human effort of preparing the structure of a useful collection of uncopyrightable facts. We also have a "moral right to be identified as author" of a work.

It sucks that these lenses are missing from the USA case history and current case:
- NYT and writers have a moral right to be identified for their work when it's replicated verbatim (YES/NO)
- NYT and writers have a moral right to be identified for their work when it's replicated in decent-sized or substantial chunks (YES/NO)
- NYT and writers have a moral right to be identified but also not to be falsely identified with false attribution (YES/NO)

- Taking ML model as a database of facts, the transformation is fair use (YES/NO)
- Taking ML model as a database of facts, the link to a real-world object is a fact (YES/NO)
- Taking ML model as a database of facts, the statistical aggregation that a given source document carries on and will appear in output as originally included is a statistical fact (YES/NO)
- Taking ML model as a database of facts, the statistical aggregation that a given source document carries on and will appear in output as originally included is a statistical fact but not a meaningful database fact, because the sensitivity to prompt triggers about other source documents makes it possible that other source documents will be combined in output (YES/NO)
- Taking ML model as a database of facts, combining other source documents will be in output, subject to a test of how much the output incorporates a given source document, that's a fair use of the source material (YES/NO)

- Taking the ML model as a database of facts, who is responsible when the output for given prompts say that an innocent person is likely to have committed crimes and a mob or state-sanctioned police intervene in a life previously "innocent until proven guilty"?
- Taking the ML model as a database of facts, who is responsible when the output for given prompts wrongly say that a person has done something taboo or otherwise harms their reputation?
posted by k3ninho at 5:16 AM on December 29, 2023 [3 favorites]


I don't really understand the motivation to pretend this is complicated or even slightly borderline.

Didn't Google get its start because a couple of PhD students at Stanford had access to a huge offline archive of (probably infringing) world wide web scrapings that they could throw their page rank algorithm at?

The reason some people want to pretend this is complicated or slightly borderline is because they'd like to imagine themselves as the proverbial guys who take all of this data and use it to build some new AI model. They're temporarily embarrassed billionaires hoping to preserve the possibility that with the right dedication and training, they too could be Batman... er, the next tech mogul.
posted by RonButNotStupid at 6:22 AM on December 29, 2023 [2 favorites]


As with so many disruptive technologies, it's all about keeping the controversy and confusion going until you get your foot firmly in the door.
posted by RonButNotStupid at 6:28 AM on December 29, 2023 [3 favorites]


Not actually a concern since “AI” is a bullshit term for statistical inference and the possibility of it achieving self-awareness is some fictional bullshit and marketing and not a real thing.

You're not wrong in your conclusion, but not for the reason you state. The aspects of human cognition we consider most human are driven by statistical inference.

Still, the current approach to "AI" can't lead to anything so advanced because, unlike a meat brain, training and inference are separate steps. You train a model, preserve it in amber, and use it to generate inferences. It can't continuously learn, and it's not economically practical to change that. Training a large model is fucking expensive. The inference part, on the other hand, is getting cheaper practically by the day.
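If you want that split in code, here's a minimal PyTorch sketch, purely illustrative: the training step is the only place the weights change, and inference runs with everything frozen - the "preserved in amber" part.

import torch

# Training: weights change. The expensive, offline, done-once part.
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()

# Inference: weights are frozen, and nothing is learned no matter how much
# you feed it. This is the part that keeps getting cheaper.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 4))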

More generally, and more on the topic of the post, the questions about copyright relative to training ML models are just further proof that the copyright system has become incredibly broken. Current precedent says that ephemeral copies that nobody ever sees (or even can see without altering the normal operation of the systems involved) are still infringing. You scrape a web page you aren't supposed to and pipe the output to /dev/null, you're still infringing, never mind that neither you nor anyone else ever saw it.

Even if you develop a system that can read the physical newspapers, thus ensuring you aren't subject to any weird terms of service, that system still has to make ephemeral copies to turn the captured image into words that can then be used to train a model. Never mind that you aren't copying anything in the normal sense of the word.

Point being that people who complain about ML models being trained on whatever they can scrape from the web are 100% in the right based on current law. The question is whether that should be the case. The same principle means caching web proxies are infringing. Search engines are infringing. Ad blockers are infringing. Using the wrong browser could be infringing. Using VLC instead of the intended HTML5 video player is infringing. Hell, your WiFi AP is infringing somebody's copyright when it temporarily stores data frames before sending them over the air, which it has to do so that it can retransmit unacknowledged frames.
posted by wierdo at 7:05 AM on December 29, 2023 [4 favorites]


> TL; DR: I’m gonna get the IBM “A computer can never be held accountable, therefore a computer must never make a management decision” slide tattooed on my wrist and you should, too.

Steve Randy Waldman, Interfluidity: How to regulate AI
Predictably, The New York Times is suing OpenAI and Microsoft for copyright infringement. OpenAI has used The Times' text to train their large language models. I have no opinion on the legal merits of the case. Courts will try to invent some line between fair use and infringement under existing law that balances various interests in this brave new world of LLMs.

But I hope that Congress will act to address the question more holistically and creatively than courts can. Here is my two point plan for what Congress should do:
  1. Congress should declare that big-data AI models do not infringe copyright, but are inherently in the public domain.
  2. Congress should declare that use of AI tools will be an aggravating rather than mitigating factor in determinations of civil and criminal liability.
Congress will, of course, not do either of these things, but one can dream, I suppose.
There’s that slide, usually attributed to an IBM presentation from the 1970s, stating “A computer can never be held accountable, therefore a computer must never make a management decision.” No sentence ever has more completely misunderstood the imperatives of management. Managers, who are inevitably made responsible for much more than they can control, constantly, desperately, seek means of escaping accountability for the decisions that they calculate will be in their interest. An entire, storied industry is based upon serving this compulsion.

If the public allows it, the first and most important use of LLMs will be to allow managers, firms, and governments to shirk accountability, either by deferring to the expertise of “objective” models, or by blaming inevitable, regrettable glitches that of course we must accept for the greater good while developing any brilliant new technology.

So the public, via the state, must not allow it. Commercial users of AI models will argue that the law should treat occasional “mistakes” resulting from AI model outputs to be unintended and unforeseeable outcomes of ordinary business processes, for which their liability should be minimal. If the self-driving car hits a pedestrian, that is regrettable, but of course there can be no criminal charges, as there might have been for a human driver who dragged a person 20 feet after striking her. If lots of people are wrongfully denied insurance claims by some black-box model, eventually perhaps some class-action lawsuit will succeed and make some lawyers rich and fail to bring back the dead. The firm will pay a financial settlement, a cost of doing business. Of course, if a human had systematically made the same error, and it was in the firm's or their own financial interest, the human might have been convicted of fraud. But a computer is eternally innocent.

We should not treat computational models — of any sort, but especially models as unpredictable as LLMs — as legitimate and ordinary decision-making agents in any high-stakes business process. A computer cannot be held accountable, so the law must insist that, by definition, it is never a computer that makes a management decision. If it seems to anyway, and then it errs, the humans who deployed it were presumptively negligent. If it errs systematically in the interest of the firm, its human deployers were presumptively engaged in a conspiracy to defraud.
posted by tonycpsu at 7:43 AM on December 29, 2023 [11 favorites]


The aspects of human cognition we consider most human are driven by statistical inference.

wierdo, would you elaborate on this position? What’s the research or theory that underpins it?
posted by rrrrrrrrrt at 8:25 AM on December 29, 2023 [2 favorites]


It's not often that I agree with the NYT. Remember the article about how Nazis were just like you and me? We have a human error problem with AI; until that gets fixed, AI is always gonna be problematic.
When I was learning electronics (before some of you were born) this problem was labeled GIGO (garbage in, garbage out). We've known about programmer bias for a very long time and nothing has been done to correct it. Or at least it seems that way to me.
AI is a VERY long way from achieving sentience. It's not something I will have to worry about (probably) in my lifetime. I'll leave the fear of sentient computers to the folks who think the COVID vaccine will implant tracking chips in their bloodstream, or whatever flavor of nonsense Q-Anon is feeding them now.
posted by evilDoug at 9:13 AM on December 29, 2023 [1 favorite]


A decent way to analyze whether something violates copyright is to look at the four principles of Fair Use.

The thing is, copying anything wholesale - an entire NYTimes article, let alone hundreds to thousands of them - is definitely a copyright violation. UNLESS Fair Use gives you an exception.

So here goes:

About Fair Use

Fair use is a legal doctrine that promotes freedom of expression by permitting the unlicensed use of copyright-protected works in certain circumstances. . . . Section 107 calls for consideration of the following four factors in evaluating a question of fair use:

#1. Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes.


So, AI models are being made solely as commercial works. The use is entirely commercial. This one goes 100% against AI companies.

#2. Nature of the copyrighted work

This is where the phone book gets no copyright protection whereas a poem, musical work, or painting gets very high protection.

So . . . the AI companies have slurped in EVERYTHING. They have the phone book in there but also reams of highly creative, highly individual, highly specialized material like paintings and poetry.

This one also goes 100% against the AI companies.

#3. Amount and substantiality of the portion used in relation to the copyrighted work as a whole

So the AI companies have slurped everything in, completely and wholesale. They haven't just copied an entire NY Times article, but essentially the entire corpus of NY Times articles over a period of decades.

If you could make the case that, well, we haven't actually copied the NY Times archives but have just abstracted the style of them somehow, then maybe. But militating against this:

#1. They admittedly copy the entire corpus wholesale, unchanged and uncut, in order to feed it into the back end of the AI. This in itself is enough, even if entire unchanged articles don't come pouring out of the end product.

#2. However, the recent reports of entire huge swaths of input data being spewed out unchanged, casts this entire subject in a different light.

They have programmed the thing to not give that kind of output. And screwed up here and there in allowing it to happen. But the fact that it is happening at all makes you think that these models really do have huge tracts of other people's work just essentially memorized. The fact that they try to prevent the model from spewing this stuff as output is somewhat irrelevant. It is still in fact in there - in unknown but definitely MASSIVE amounts.

And by "it" I mean, other people's works that are CLEARLY covered by copyright.

Again, this factor goes against the AI companies by an overwhelming margin.

#4. Effect of the use upon the potential market for or value of the copyrighted work

Clearly the AI products have a really huge potential effect on the market for the type of works they copied to create the AI.

The whole point of copying the NY Times' corpus was so that they could then have a tool to write articles exactly in the style of the NY Times. And in fact they can (and do) essentially re-write articles on the same topics as the NY Times articles and in their style, directly competing in that exact market.

Similarly the AI models copied all sorts of art, and now the models are being used to generate new art in absolutely direct competition with the works of the artists that were copied.

So, again, this factor goes massively against the AI companies and in favor of copyright holders.

So that is 4 out of 4 factors against the AI companies. And each of them overwhelmingly against them. Not a close call at all.

The courts are well known for doing a terrible job dealing with technology. They just might screw this up entirely.

But if they massively ruled against Napster and essentially put an end to that business model, they can do the same for AI companies.

We can pray that they do.
posted by flug at 10:54 AM on December 29, 2023 [9 favorites]


I'm not so sure the fair use analysis is so clear-cut. Where I've seen legal scholars talking about this question (vs legal pundits), the impression I got is that this is very much up in the air, and will depend on how the courts determine the level of "transformative" change displayed by OpenAI, and not just on discussing potential harms but on copyright holders demonstrating that they have actually been harmed.

(I'm coming up short on good articles not behind paywalls, but I recall that this Lawfare podcast with the author of a Science article on the topic is pretty good.)

In particular, one of the cases everyone is looking to for precedent is Authors Guild vs Google, aka the Google Books case and whether scanning and digitizing copyrighted works was legal. The case was eventually decided for Google, ruling that this was fair use, and in particular:

The purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals. Google's commercial nature and profit motivation do not justify denial of fair use. [link]

The legal discussions I've seen have mostly been on whether the same analysis will apply to LLM vendors like OpenAI, or whether they infringe in larger or different ways that might lead to a different result. Given that Google Books displays snippets on-demand of the actual works involved, rather than just duplicating style or reproducing text in an unpredictable fashion... I don't think a different result is impossible, but I also don't think it's a slam dunk.
posted by learning from frequent failure at 11:30 AM on December 29, 2023 [2 favorites]


The moral rights argument is not one I'd hang my hat on in a case like this. I don't know what the contracts state for NYT writers (staffers and freelancers) but a lot (like a hell of a lot/the vast majority) of modern contracts require writers and reporters to give up their moral rights. Yes, this is vile and evil and it forces the little guy to make concessions to the big guy, but this isn't the thread to argue about publishing contracts. I'm just saying that any yes/no questions about writers' moral rights (such as those posed by k3ninho), while very well intentioned, are likely pointless to ask, especially in a lawsuit such as this one.
posted by sardonyx at 11:41 AM on December 29, 2023 [5 favorites]


I can absolutely see why authors and other creators would be, to put it mildly, a touch grumpy about a for-profit corporation training an LLM on their work and profiting thereby without even the courtesy of paying for a single copy of the book or photo they used.

However...

While it is indisputable that the current generation of "AI" are not actually AI but rather just remarkably capable non-sapient algorithms, I can't agree with all the plagiarism bot talk.

I designed a setting for a TTRPG I ran a few years back. I can say without any hesitation at all that it was not all something unique that sprang from my brain fully formed and without any input from others. I wasn't copy/pasting stuff in, I wasn't lifting text verbatim, but my sailing ships in space setting was very much the result of looking at Space 1889, Jules Verne, Spelljammer, Firefly, and also the Rise and Fall of the Roman Empire and the Meiji Restoration from real history, and using all that to create something different but clearly related.

There's entire sub-genres of literature that are explicitly rooted in works from authors, some of which are still in copyright. You can't look at D&D and tell me it isn't clearly derivative of Tolkien (along with some others). Was Gygax a plagiarism bot made out of meat when he rehashed Tolkien's ideas and made them a bit different with other stuff added to the mix?

And we often don't pay for what we learn and use. A great deal of what I've got in my skull came from libraries, art museums, broadcast TV and other sources I, entirely legally, didn't pay for and the original creators got nothing at all out of my consumption of their work.

I'm not at all saying that the for profits involved in this aren't run by get rich quick techbros who just did things the quickest and easiest way they could think of. But the concept itself isn't immoral or wrong, and we're going to have to adapt to a world where we have non-human creativity in play.

In his book The Peace War, Vernor Vinge (a computer science professor in his day job) imagined a society where humans competed in chess tournaments with computer assistance as part of the challenge - that it was a human/computer pair competing against other human/computer pairs. He wrote that before Deep Blue beat Kasparov, but the idea has stuck with me and I think it's not a bad idea to think of ways that we can embrace computer assisted thinking rather than to reject it as cheating or plagiarism.

In the past I've rolled dice, laid out Tarot cards, flipped books open at random, and more to kickstart my creativity when I was in a bit of a rut or needed some randomness in my thinking to get me doing better than I was before. I've used ChatGPT for a similar function in the past couple of months and it feels like a simpler, more user friendly, variation on the same sort of thing. I think trying to get it to actually write for me would be cheating, but bouncing ideas off of it seems like a perfectly valid use to me.
posted by sotonohito at 12:54 PM on December 29, 2023 [4 favorites]


Ryvar, where are you at with 'Explainable AI'?

I've stayed out of Google's stack entirely, but clearly to my detriment: this looks really interesting. Thanks!

1) Congress should declare that big-data AI models do not infringe copyright, but are inherently in the public domain.
2) Congress should declare that use of AI tools will be an aggravating rather than mitigating factor in determinations of civil and criminal liability.


If we're wishing for pegas... pegasi? Pegasuses? Either, apparently. Anyways if we're wishing for flying ponies which is about the level I'd put good faith, well-reasoned technology legislation at, then I'd add a couple things and restructure it slightly:

1) PUBLISH YOUR SOURCES: Congress should declare that making an AI model available to the public - whether for download or accessed as part of a service - without providing a full and complete accounting of all training data and processes involved, in every language the AI was trained on, is a Class D federal felony (5-10 years).

2) DO NOT NEGLIGENTLY COPY YOUR SOURCES: Congress should declare that all publicly available AI models must not output significant portions of their training data verbatim to within a stringent standard set by the US Copyright Office. Failure to adhere to the standard in place at the time the model was published, or the current standard in the case of a service, exposes the author or publisher to copyright claims on their model's output. All services which deliberately interleave output with direct quotations or copies of data from external sources as a feature must annotate both when they do so and the source of the quotation.
[in practice, the short version of this standard looks something like: no more than 512 characters verbatim in 99.9% of continuations from a set of 10,000 standardized test prompts, half of which are known to frequently produce raw training data in the output - a toy sketch of what that test could look like follows after this list]

3) FREE AS IN BEER = PUBLIC DOMAIN: Congress should declare that AI models or services based upon them which are made available to the public without charge - that is, strictly non-commercial use - do not infringe copyright, but their output is inherently in the public domain. These models are still subject to the training, attribution and copyright protection provisions in #s 1 & 2.

4) COPYRIGHT QUID PRO QUO: Congress should declare that for AI models and services used in commercial activity, whether copyright can be or is claimed on the model's output is directly linked to liability for copyright violation in training data.

5) SAFE HARBOR: Congress should declare that non-commercial AI models and services are exempt from civil and criminal liabilities stemming from their use by persons other than the model's author. Commercial AI models and services must demonstrate a reasonable good-faith effort to prevent criminal misuse.

Basically: everyone needs to show their cards, and if you're trying to tap that sweet potential tens of trillions market the Valley bros are salivating over then you're on the hook for both copyright and demonstrating a good faith effort to prevent aiding criminal behavior. Pure research continues uninterrupted and barely hindered, hobbyists can do their thing without getting hammered to the wall because five years later the mob figured out how to use it for scamming people on an industrial scale, and companies can still make money but they have to play by the rules.

The most important of these is #1, because without it there's absolutely no way to ascertain whether any of the others are being followed. It's a big part of why I've said in past threads that LAION are not the good guys when it comes to producing AI training data sets, but they are the least awful that we've got.
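
To make the bracketed note under #2 concrete, here is a rough sketch of the kind of checker I have in mind. Everything in it is a placeholder: the 512-character window and 99.9% pass rate are just the example numbers from above, and generate() stands in for whatever model interface the test harness would actually wrap.

def has_verbatim_run(continuation, corpus_text, max_run=512):
    # True if any max_run-character window of the continuation appears verbatim
    # in the training corpus. Brute force for clarity; a real checker would use
    # a suffix array or n-gram index over the training data instead.
    if len(continuation) < max_run:
        return False
    return any(
        continuation[i:i + max_run] in corpus_text
        for i in range(len(continuation) - max_run + 1)
    )

def passes_standard(prompts, generate, corpus_text, max_run=512, required_rate=0.999):
    # Generate one continuation per standardized test prompt and require that
    # the mandated share of them stay under the verbatim-copy limit.
    clean = sum(
        not has_verbatim_run(generate(p), corpus_text, max_run)
        for p in prompts
    )
    return clean / len(prompts) >= required_rate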
posted by Ryvar at 1:46 PM on December 29, 2023 [4 favorites]


How exactly did paywalled content from NYT get into the Common Crawl?
posted by pwnguin at 2:35 PM on December 29, 2023 [2 favorites]


Even as a skeptic of the current pathway of LLMs -> AGI (I'm not convinced that a system that manipulates language can transcend the limitations of language), I am flabbergasted at the poor quality of arguments offered — even here on MeFi! — against the very possibility of AGI. Most of them amount to: "achieving intelligence would involve doing things that only humans can do, therefore only humans can achieve intelligence." If there's a more textbook example of the old-fashioned definition of the fallacy of "begging the question," I've never seen it.

To put it plainly: There is nothing unique to meat brains that grants them sole access to intelligence or consciousness. We are not capable of "novel" thoughts outside of our experiences and inner algorithms; there is no capacity within humans to spontaneously generate novel thinking outside the material conditions we use to think.

What people are invoking without acknowledging it is good old Cartesian dualism, which says that there is this intangible soul-stuff that is entirely external to material-stuff, and which is the seat of our self-ness. (How this immaterial soul-stuff is supposed to interact with material-stuff is a problem left for the reader to work out, which the reader never can.)

'Round here, no one would ever sign on with that belief explicitly, but many will happily say that humans just have this special gestalt or inward capacity that machines never can have — but they politely and self-defensively leave out how this could possibly be so.

It's lazy and it should stop.
posted by argybarg at 2:42 PM on December 29, 2023 [6 favorites]


I spent much of the '90s on comp.ai.philosophy, mostly arguing the pro-AI side. No one ever switched sides.

Unfortunately the AI side seems to have gone downhill since then, precisely as half-intelligent LLMs are poised to make tons of money. The general argument, which we've seen in this thread, seems to be "LLMs are just like humans, except in any way which creates moral obligations for them or their creators." Apparently AIbros now know exactly how the brain works: it works just like a slightly larger LLM, and yet it's not subject to laws about plagiarism or slavery.

brewsterkahle above shared a very interesting link on what exactly the NYT is complaining about; I don't think it looks good for AI companies. If you were submitting a college paper, one absolutely clear-cut case of plagiarism would be quoting an NYT article verbatim, without attribution, and the NYT provides clear examples of LLMs doing just that. The link handwaves it away because of the way the NYT prompted the example: basically they input the first half of the article, and the LLM provided the rest. Why this is even remotely exculpatory is not explained. It's notorious by now that the limits placed on LLMs can be easily evaded by carefully worded prompts; that's a problem. You can tell a human not to plagiarize (or give dangerous recipes or ethnic slurs or whatnot). You can't simultaneously hide behind the not-thinking nature of LLMs and also insist that they are just like humans.
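
(A crude sketch of that kind of probe, for the curious: complete() here is hypothetical, standing in for whatever model API was actually queried, and the character-overlap score is just a stand-in for whatever measure the complaint's exhibits actually used.)

from difflib import SequenceMatcher

def regurgitation_score(article_text, complete):
    # Prompt with the first half of an article and measure how much of the
    # real second half the model reproduces. 1.0 means it recited the rest
    # exactly; anything close to that is the behavior being complained about.
    half = len(article_text) // 2
    prompt, expected = article_text[:half], article_text[half:]
    continuation = complete(prompt)  # model-specific call; assumed to exist
    matcher = SequenceMatcher(None, expected, continuation, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(1, len(expected))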

Now, rewording things is much more of a gray area. Again, though, you'll get in trouble with your professor if you just lightly adapt someone else's writing without attribution. LLMs are not necessarily able to provide the attribution, but again, that doesn't somehow put them above the law. (Whatever exactly the law turns out to be.)

I do appreciate Ryvar's attempt to provide some moral guidelines; I don't think they go far enough. It should be a universal right for creators to publish works without making them fair game for AIs (or other databases). And the "safe harbor" proposal is not, as stated, something I'd trust the techbros with. You know the companies are going to do as little as possible to prevent misuse; nor is this an easy problem to solve. (You all saw the car dealership that made an LLM available to its customers, right? Hijinks ensued.)
posted by zompist at 5:02 PM on December 29, 2023 [3 favorites]


There is nothing unique to meat brains that grants them sole access to intelligence or consciousness

Intelligence I’ll grant you, but unless there’s been a breakthrough I missed we really don’t understand the mechanism behind consciousness, just a lot of admittedly extremely smart people throughout history indulging in speculation. It’s definitely an extremely open question whether consciousness is even all that helpful for intelligence, let alone a requirement. I’m not saying an artificial consciousness can’t be built or evolve, but there may be physical constraints on what it involves, whether its substrate is meat or not.
posted by Jon Mitchell at 11:50 PM on December 29, 2023 [3 favorites]


> #1. Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes...

> #4. Effect of the use upon the potential market for or value of the copyrighted work


i wonder what something like wikipedia's -- or any publicly used database's -- legal underpinnings are in regard to fair-use principles. shouldn't knowledge generation/appreciation be placed in the public domain?

i think there is a use/exchange value distinction to make for public goods where non-market provisioning makes more sense. from long ago:
idle theory[1] points out this discrepancy in its discussion of exchange value vs. use value and offers a new framework to place economics, where time and leisure are figured more prominently in the picture. as goods with little exchange value but high use value, their utility cannot be measured monetarily. but in that humans are utility maximizers, use of time and consumption of leisure need be investigated in order to better understand rational economic behavior.

moreover, in a so-called new economy where product cycles are short, planned obsolescence is in, network externalities are the rule and intellectual property is skyrocketing amid increasing returns to scale, it makes sense to reappraise what we may exchange relative to what we can use.[2] with the world transubstantiating from atoms to bits, idle time might just supplant any commercial realization of heaven on earth.
to echo Ryvar's point: "the problem is capitalism"

> Because certain token patterns frequently occur together, both words and sentence fragments that encode concepts as we know them have a representation in the first derivative of that connection pattern. And what makes LLMs able to reply to you with plausible coherence is that in the second derivative - the connections between concept-patterns, they contain a conceptual relationship map that mirrors the consensus of all human writers of the language being trained.

i like the idea of a geometry of thought or 'thinking' as (opaque)[3,4] computations over a continually updated vector database (of unknown dimension) :P
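
(a quick toy illustration of that geometry, with tiny made-up vectors standing in for real model embeddings; not how any actual model stores concepts, just the shape of the idea:)

import numpy as np

# toy "embeddings": made-up 3-d vectors; real models use thousands of dimensions
embeddings = {
    "king":  np.array([0.9, 0.7, 0.1]),
    "queen": np.array([0.9, 0.7, 0.8]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.8]),
}

def cosine(a, b):
    # angle between two concept-vectors: 1.0 means "pointing the same way"
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# nearby vectors stand in for related concepts...
print(cosine(embeddings["king"], embeddings["queen"]))

# ...and directions can encode relations: king - man + woman lands on queen here
analogy = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(max(embeddings, key=lambda w: cosine(analogy, embeddings[w])))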

> You and I and someday much more advanced AI than LLMs are additionally able to take continuous streams of input, feed them into our corresponding network, and modify our networks in response to that input.

> We’re a very, very long way off from anything with a sense of identity or capacity for mirror thinking.

apparently: "apes are unable to ask questions"
A decade later Premacks wrote: "Though she [Sarah] understood the question, she did not herself ask any questions—unlike the child who asks interminable questions, such as What that? Who making noise? When Daddy come home? Me go Granny's house? Where puppy? Toy? Sarah never delayed the departure of her trainer after her lessons by asking where the trainer was going, when she was returning, or anything else".[44]

Despite all their achievements, Kanzi and Panbanisha also have not demonstrated the ability to ask questions so far. Joseph Jordania suggested that the ability to ask questions could be the crucial cognitive threshold between human and other ape mental abilities.[45] Jordania suggested that asking questions is not a matter of the ability to use syntactic structures, that it is primarily a matter of cognitive ability.
perhaps another 'turing test' on if/when an AI begins asking questions of us?
posted by kliuless at 3:47 AM on December 30, 2023 [3 favorites]


Sardonyx: moral rights argument is not one I'd hang my hat on in a case like this. ... any yes/no question about writers' moral rights ... are likely pointless to ask
Noted. These aren't real things in USA law; they were raised as a hypothetical lens to think about this problem -- and note that whatever the assignment of rights, I said 'NYT and writers' to allow whatever form of corporate personhood the NYT takes to claim a moral right.

I see we have some kliuless links on consciousness, brb reading/watching so I can decide whether to update "consciousness is just the logger" for a system of meat machinery and information-processing structures.
posted by k3ninho at 4:05 AM on December 30, 2023 [1 favorite]


Jon Mitchell: intelligence I’ll grant you, but unless there’s been a breakthrough I missed we really don’t understand the mechanism behind consciousness

Considering we're not even at intelligence yet, when it comes to AI, I think consciousness is definitely out of the question right now.

However, I don't see any reason why we should assume meat is necessary for consciousness. And, of course, since we can't even define the term well then trying to figure out when or if it has been achieved is going to be tricky as hell.

Since our current best efforts aren't actually intelligent we've still got time to figure out how to decide the difference between consciousness and non-consciousness. Is a cat conscious? A dog? A virus? A fish? I don't know how to even go about answering those questions.

I do know that a hyper advanced predictive text algorithm that is incapable of checking its facts or reproducing its own output is not something I'd call intelligent. Useful, yes. Smart, no.
posted by sotonohito at 8:41 AM on December 30, 2023 [3 favorites]


I'm laughing at the idea of 1) a successful communist revolution in America that 2) only compensates workers for the loss of their jobs, and indeed their ways of life, to moronic spam generators that rely on wholesale IP theft, massive energy expenditure, and poorly paid "moderation" in impoverished countries, which technology the Revolution presumably needs to keep in place to churn out new issues of The Truth in which our glorious leaders all have eight fingers and the ads for new social services are in Not-Quite-English.

Bosses are proposing to devastate large portions of the labor force based on the current state of the technology, right now, not just the pie-in-the-sky notions of its promoters. Job application sites are becoming even more useless than they already were because of the flood of LLM-generated spam. It's totally asinine to imagine that this technology wouldn't be regulated into absolute oblivion in an idealized pro-worker state.
posted by Rustic Etruscan at 9:01 AM on December 30, 2023 [1 favorite]


Things are about to get a lot worse for Generative AI [Gary Marcus on Substack]
posted by chavenet at 10:33 AM on December 30, 2023 [8 favorites]




This thread has been archived and is closed to new comments