LOCO: the 88-million-word language of conspiracy corpus
January 10, 2022 2:39 AM

LOCO: The 88-million-word language of conspiracy corpus: "The spread of online conspiracy theories represents a serious threat to society. To understand the content of conspiracies, here we present (...) an 88-million-token corpus composed of topic-matched conspiracy (N = 23,937) and mainstream (N = 72,806) documents harvested from 150 websites."

Per the abstract: "Mimicking internet user behavior, documents were identified using Google by crossing a set of seed phrases with a set of websites. LOCO is hierarchically structured, meaning that each document is cross-nested within websites (N = 150) and topics (N = 600, on three different resolutions). A rich set of linguistic features (N = 287) and metadata includes upload date, measures of social media engagement, measures of website popularity, size, and traffic, as well as political bias and factual reporting annotations. We explored LOCO’s features from different perspectives showing that documents track important societal events through time (e.g., Princess Diana’s death, Sandy Hook school shooting, coronavirus outbreaks), while patterns of lexical features (e.g., deception, power, dominance) overlap with those extracted from online social media communities dedicated to conspiracy theories. By computing within-subcorpus cosine similarity, we derived a subset of the most representative conspiracy documents (N = 4,227), which, compared to other conspiracy documents, display prototypical and exaggerated conspiratorial language and are more frequently shared on Facebook. We also show that conspiracy website users navigate to websites via more direct means than mainstream users, suggesting confirmation bias. LOCO and related datasets are freely available at https://osf.io/snpcg/."
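
As a rough illustration of the "within-subcorpus cosine similarity" step mentioned in the abstract, here is a minimal Python sketch. It is not the authors' pipeline: the abstract does not say how documents were vectorized, so the plain bag-of-words term counts, the function names (`tf_vector`, `cosine`, `rank_by_representativeness`) and the toy documents are all assumptions made for illustration. The idea shown is only that a document counts as more "representative" when its average similarity to the other documents in the same set is higher.

```python
import math
from collections import Counter


def tf_vector(text):
    """Assumed representation: lowercase whitespace tokens -> term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_representativeness(docs):
    """Return (index, mean similarity to the other docs), highest mean first."""
    vectors = [tf_vector(d) for d in docs]
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        mean = sum(others) / len(others) if others else 0.0
        scores.append((i, mean))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical toy documents, not taken from LOCO.
toy_docs = [
    "they are hiding the truth about the moon landing",
    "the truth about the moon landing is being hidden",
    "recipe for sourdough bread with rye flour",
]
ranking = rank_by_representativeness(toy_docs)
```

With these toy inputs the bread recipe shares no tokens with the other two documents, so it ends up last in `ranking`; keeping only the top of such a ranking is one way a "most representative" subset could be carved out.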
posted by Shepherd (13 comments total) 15 users marked this as a favorite
 
The acronym LOCO might suggest the idea that conspiracy theories and theorists are all crazy. Far from this position, we rather highlight the polarizing phenomenon by which, regardless of the belief position, the “others” are considered crazy. [From the notes, Note 1]
posted by chavenet at 4:51 AM on January 10, 2022 [1 favorite]


As for the rhetorical style used by the representative subset, we observe higher values for certainty (category certain), and interrogative (category interrog) language, along with higher use of question and exclamation marks (categories Exclam, QMark). This is in line with the observation that the rhetorical style of conspiracy narratives is built upon refutational strategies based on questioning the dubious version of the official story while highlighting the lack of answers from official sources

[...]

When users of conspiracy Facebook pages are exposed to debunking information, they increase traffic towards conspiracy-like content (Zollo et al., 2017). This behavior suggests a confirmation bias: people avoid cognitive dissonance while searching for reinforcement

This is really interesting. Thanks!
posted by joannemerriam at 4:59 AM on January 10, 2022


I'm a little concerned that the researchers chose to freely release the dataset.

Couldn't a bad actor just as easily use this data to optimize and fine-tune their own conspiracy content?
posted by schmod at 5:01 AM on January 10, 2022


Couldn't a bad actor just as easily use this data to optimize and fine-tune their own conspiracy content?

This is a bit like putting the dam back up after the flood. The genre is well established, and, I think, it's also organic. Conspiracy theorists, I think, aren't often deliberate con artists; they're true believers themselves.
posted by dis_integration at 5:05 AM on January 10, 2022 [10 favorites]


And now I want to learn how to train GPT-2 on new data sets, like LOCO ...

Anyone else remember the 1972 Parry chatbot?
posted by cstross at 5:59 AM on January 10, 2022 [1 favorite]


The 88 reference is a Nazi thing, right?
posted by Lawn Beaver at 6:17 AM on January 10, 2022


The 88 reference is a Nazi thing, right?

Yep. See here.
posted by mandolin conspiracy at 6:19 AM on January 10, 2022


And now I want to learn how to train GPT-2 on new data sets, like LOCO ...

You'd be building a weapon.
posted by mhoye at 6:39 AM on January 10, 2022


Couldn't a bad actor just as easily use this data to optimize and fine-tune their own conspiracy content?

But a good actor can then come and, using this data, just as easily expose that bad actor. Or, at least, put forth counter-conspiracy content. Both of them will find their followers ...
posted by Green-eyed grenade at 7:44 AM on January 10, 2022 [1 favorite]


I used to question the factual basis for a friend's conspiracy theories through email. He always said he was working up a response, but never actually replied to my emails. I finally gave up on him. I am thankful that he turned me on to Bitchute, a repository of batshit videos. (It did bug me that he always sent videos, not text. Who has the time?)
posted by kozad at 7:44 AM on January 10, 2022 [2 favorites]


It did bug me that he always sent videos, not text. Who has the time?

It's a half-baked derail for sure, but: Documenting an actual network of interconnected facts and bits of evidence is something hypertext is perfect for. The conveyance of emotions like fear, resentment, self-righteousness, belligerent in-group-signaling and camaraderie is something video is perfect for.

The conspiracy theorists usually go for video.
posted by Western Infidels at 8:14 AM on January 10, 2022 [15 favorites]


I expect that the vast majority of conspiracy theories are really held by people, but I just ran into this video which starts with "birds aren't real", a satirical conspiracy theory. It's mostly about moon landing hoaxes and the Van Allen belts, though.
posted by Nancy Lebovitz at 8:29 AM on January 10, 2022


The conveyance of emotions like fear, resentment, self-righteousness, belligerent in-group-signaling and camaraderie is something video is perfect for.
I don't disagree. But it's also true that where the plastic latches are located inside the case of a 5-year-old laptop is, seemingly, only communicated by video these days, even by company engineers. I don't understand it. (But I also don't understand why television news ever existed.)

Repeating this exercise on YouTube transcripts would be interesting.
posted by eotvos at 8:34 AM on January 10, 2022 [7 favorites]

