I Built the World's Largest Translated Cuneiform Corpus using AI
June 9, 2024 6:10 PM

TL;DR I used a custom-trained Large Language Model (T5) to create the world’s largest online corpus of translated cuneiform texts. It’s called the AICC (AI Cuneiform Corpus) and contains 130,000 AI translated texts from the CDLI and ORACC projects.

Also of interest:
Cuneiform Digital Library Initiative - By making the form and content of cuneiform texts available online, the CDLI is opening pathways to the rich historical tradition of the ancient Middle East. In close collaboration with researchers, museums and an engaged public, the project seeks to unharness the extraordinary content of these earliest witnesses to our shared world heritage.

Open Richly Annotated Cuneiform Corpus - Oracc is a collaborative effort to develop a complete corpus of cuneiform whose rich annotation and open licensing support the next generation of scholarly research.
posted by bq (15 comments total) 17 users marked this as a favorite
 
Sure this kind of LLM is all nice and good but then one day all you hear is how inferior your copper is and how mean you are to people's servants.
posted by tclark at 6:29 PM on June 9 [17 favorites]


https://aicuneiform.com/search?q=Gilgamesh:
Inanna, you shall enter the gipar ritual. Ninegal, you shall not ... the strength of heroism. Inanna, you shall not ... my esir. I shall speak to the bull of the nether world, I shall speak to the sheep of the nether world, I shall speak to the sheep of the nether world, I shall speak to the sheep of the nether world, I shall place the silver and carnelian in the ..., I shall speak to the lady, I shall speak to the lady, I shall speak to the lady, I shall speak to the lady, I shall speak to the lady, I shall speak to the lady, I shall speak to ..., I shall speak to ..., I shall speak to ... like ... Gilgamesh, .

... ... ... ... ... ... ... ... ... ... ... Inanna, the water of the shumlu-water, the shumlu-water ... An, my beloved, ... he took a sledge? ... ... My son, who is ..., the water of the shumlu-water ... The great bull, who has left Uruk, ... The great bull, Gilgamesh, who has left Uruk, ... ... ... ... The water of the shumlu-water ... .

'Let me be the lord, let me be the lord, let me be the lord, let me be the lord, let me be the lord, let me be the lord of Gilgamesh, let me be the lord.' The great holy An of Inanna has returned. My little one, the ox of heaven, whose horns are a horn of the horizon, has returned. The lady Inanna, the ox of heaven and earth, whose horns are a horn of heaven, whose horns are not a horn of heaven, has not returned to the holy Inanna. He has not returned to the holy Inanna. He has not returned to the holy An, he has not returned to the holy Inanna. He has not returned to the holy Inanna. He has not returned to the holy Inanna. He has not returned to the ox of heaven.
I can't decide if this is the Epic of Gilgamesh as written by a Markov-Chain or Anne Carson, but it's kind of terrible and kind of wonderful.
posted by gwint at 6:41 PM on June 9 [7 favorites]


The blog post is actually interesting if you're interested in a high level overview of how people train these things. I really wish it (and the FPP as a consequence) didn't have the Hacker News clickbait title formation.
posted by hoyland at 6:53 PM on June 9 [1 favorite]


big, big, big fan of Frank* – I suspected he was the poster of this piece since he mentioned it in passing on his podcast some time ago

* of all the tech people out there he's the one most actively mining the seams I have interest in exploring, e.g. his "Calca" app and so many other R&D areas such as the FPP
posted by torokunai at 7:39 PM on June 9


I can't decide if this is the Epic of Gilgamesh as written by a Markov-Chain or Anne Carson, but it's kind of terrible and kind of wonderful.

I've been trying to think of what to say about this without sounding unduly negative. The intention is beneficent, even pure! The means adopted will some day be good enough and are not absurd to try now! But I worry about the effect of these kinds of gibberish-y translations being put out into the world, even if it's in such an obscure little corner with obscure little texts. (One of the reasons there are comparatively few public translations of these texts is that they're in large part administrative records and you would die of boredom.) Somehow these kinds of projects end up having worse consequences than I could anticipate. But it's an interesting attempt, and an interesting post about the attempt.
posted by praemunire at 8:12 PM on June 9 [8 favorites]


Did nobody read Snow Crash??? Do we not know how this ends? What is wrong with people!
posted by symbioid at 10:15 PM on June 9 [16 favorites]


Sorry someone will have to indicate to me where the 'I' is in this.
posted by GallonOfAlan at 2:42 AM on June 10 [1 favorite]


This is fucking fascinating! Thank you for posting, bq. This has the potential to open doors to exciting new projects, as well as making it easier for researchers to scan for texts ripe for more careful human translation. The project has some of the classic problems of computational approaches to the humanities (particularly gaps in language skill and perhaps aspects of the field), but the author is thoughtful, passionate, and game to partner with experts in the field.

Flagged as fantastic!
posted by cupcakeninja at 2:43 AM on June 10


Yeah, yeah. It's all fun, games and philosophical s*** until Anubis-net self-actuates and then y'all be crying about how "my soul weighs more than a feather" and how you need Ozempic For The Soul.
posted by JustSayNoDawg at 5:15 AM on June 10


"It’s a little funny that this network designed for translation is now broaching the realm of artificial general intelligence (AGI), but I digress."

It's really funny how often people who tinker around with LLMs convince themselves they are anywhere approaching general artificial intelligence.
posted by GoblinHoney at 8:16 AM on June 10 [9 favorites]


“What I had not realized is that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.”

― Joseph Weizenbaum, describing the ELIZA Effect, from the ELIZA chatbot created in 1964.

I'm not discounting *other* paths that may not be visible to us where various companies are trying to bootstrap AGI, but looking at a chat bot and assigning intentionality to its output is not a new problem.
posted by tclark at 9:18 AM on June 10 [3 favorites]


I've been trying to think of what to say about this without sounding unduly negative. The intention is beneficent, even pure! The means adopted will some day be good enough and are not absurd to try now! But I worry about the effect of these kinds of gibberish-y translations being put out into the world, even if it's in such an obscure little corner with obscure little texts.

It doesn't bother me at all, but perhaps that's because I have had a wide exposure to amateur human-generated anime fan-subs. There are times when I'm watching an anime and the subtitles translate something literally instead of figuratively in a way that you have to really stretch to figure out what this character is meaning right now. That's probably because the translator has a working knowledge of Japanese, but not a knowledge of the art of translating, where the goal is not to swap out each word for the closest cognate that makes sense in context, but to convey the intent of the entire phrase. Several years ago, I can't remember exactly where, I was reading some article about translation and there was a sample translated paragraph (maybe from the Tale of Genji or maybe I'm hallucinating that, as the AI peeps say) that contained something along the lines of 'flowers with smiling faces'. That phrase struck me as so awkward and weird that it stuck with me for years. I was driving somewhere a couple of weeks ago and it circled back through my head, as these things do, and it finally clicked - what it should have been translated as is probably 'cheerful flowers'. Of course the flowers aren't cheerful any more than they have smiling faces, but one is a direct word-for-word translation and the other is much more functional.
Given this exposure and my own language-learning experience, I recognize that the first type of translation, the word-for-word clunky translation, can be very useful and serves a distinct purpose. In a sense, every translator has to obtain a level of fluency that takes them through level one to get to level two, and synthesize them into a fluent, readable translation - that's how we get different translations of the same passage by different translators.
posted by bq at 9:26 AM on June 10


This reminds me of Steven Peck's A Short Stay in Hell where the dead are doomed to wander through an infinite library, looking for a book with their life story but encountering endless gibberish.

"... ... ... weeping ... weeping ... weeping ... he was happy ... he went ... his name ... he went out ... he sat ... he was sated ... he was sated ... he was sated ... he was seated before him ... the gods, his sons ... his position"
posted by mecran01 at 11:41 AM on June 10 [4 favorites]


Here be nam-shubs.
posted by BReed at 4:21 PM on June 10


perhaps that's because I have had a wide exposure to amateur human-generated anime fan-subs

I'm familiar with fan-subs, but the social context (or maybe lack of context?) is different for something like this. Too long a comment required to explain what I mean, though. I'm not seriously anticipating doom from this in particular; I just hate the feeling I get when these tools are used now that there are going to be really stupid and obnoxious long-term consequences somehow. I want to just be able to revel in the coolness.
posted by praemunire at 7:58 AM on June 11

