Code is Law
November 28, 2011 6:42 PM

YaCy is an open source, fully decentralized peer-to-peer search engine designed to prevent any single entity from exercising power over search results.

In related news, the Namecoin project (.bit) has apparently diverted developer interest away from the p2p DNS project (.p2p), which showed so much promise one year ago.

Also, the U.S. has seized another 150ish domains this month, roughly one year after the seizure of 82 domains that prompted p2p DNS. This time around, though, the European Parliament has condemned unilateral domain name seizures by the U.S.

Namecoin is heavily based on the Bitcoin source code, but it maintains its own block chain, separate from the one that assigns ownership of bitcoins.
posted by jeffburdges (25 comments total) 26 users marked this as a favorite
 
To be clear, seizing a domain name does not involve jackbooted thugs ramming your door open and taking your server. It means they changed a number in certain very important files on certain very important servers, and now your former URL doesn't go to your website.

I suppose it's fairly civilized as censorship goes.
posted by LogicalDash at 6:50 PM on November 28, 2011 [1 favorite]


I was just thinking about doing a YaCy post. Apparently they are at 1.0 now.

The main thing stopping me was that for the life of me I can't understand how they work. I don't get how decentralization prohibits poisoning or prevents tracking.
posted by DU at 6:53 PM on November 28, 2011 [1 favorite]


Related: The Sovereign Keys Project from EFF is an attempt to fix the SSL PKI mess [previously] using a "semi-centralized, verifiably append-only data structure".
posted by finite at 6:58 PM on November 28, 2011 [2 favorites]


Decentralizing a system does not prevent any particular attack from working, but it makes damage control a great deal easier. If you manage to poison any particular YaCy net, then everyone on that net flips a switch and goes to the next one.

This does, of course, assume that everyone's pretty vigilant about finding out when their net is compromised, which is perhaps unwarranted. But it can be done, since you can inspect the source code.
posted by LogicalDash at 6:59 PM on November 28, 2011 [1 favorite]


Search terms are hashed before they leave the user's computer. Different from conventional search engines, YaCy is designed to protect the users' privacy.

And statements like this don't make me feel much better. How is hashing going to help privacy? If I search for "karl marx" and that gets hashed to "(#*&%(*# HFJHFJ" how am I any more secure? Anyone interested in people searching for "karl marx" can just look for those hashes in incoming queries.

But there's also something about exchanging indexes, so maybe my query never leaves my computer? But that goes against the above quote, not to mention the fact that my computer can't be big enough to index the entire Internet.

Like I said: I've never seen this thing adequately explained.
posted by DU at 7:04 PM on November 28, 2011 [1 favorite]


An index is a search term and a search term is an index. The only thing that makes an index an index is that you use it to look something else up. The notion of a key:value pair doesn't require the key to be of any particular data type.
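A minimal sketch of that in Python (SHA-1 and the URLs here are placeholder assumptions, not anything from YaCy):

    import hashlib

    def H(term):
        # One-way hash of a search term; the key is now opaque.
        return hashlib.sha1(term.encode("utf-8")).hexdigest()

    # An inverted index whose keys happen to be hashes rather than words.
    index = {
        H("karl marx"): ["http://example.org/manifesto"],
    }

    # Lookup works exactly as it would with plain-text keys.
    print(index[H("karl marx")])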
posted by LogicalDash at 7:09 PM on November 28, 2011


Well...exactly. So changing "karl marx" into "(#*&%(*# HFJHFJ" retains the same semantics, if in a different encoding. You don't have to be able to unhash to determine my search terms. You just need to be able to hash the things you are looking for.
posted by DU at 7:14 PM on November 28, 2011


DNS is one of the last of the centralized internet applications. It really has outlived its usefulness, and it's long since past time we moved to the next thing. The Hacker Ethic - distrust authority, promote decentralization. p2p hosting of static content is an inevitable next step, in terms of regaining bandwidth efficiency and enabling scalability for publishers of content who don't have the resources to run a datacenter or hire someone to run a web presence, and it needs to be quick and secure.

Freenet was sort of the way to go, until a kiddy-porn scare campaign was leveled against it that its leadership was slow to refute; its seeding was also way too slow for the modern internet. p2p DNS looked like it was the right solution, but then the Namecoin mindshare fork happened. Hopefully this new project can step in to fill the hole. DNS has become a very active vector for APT, particularly spear-phishing and reputation attacks.

If I search for "karl marx" and that gets hashed to "(#*&%(*# HFJHFJ" how am I any more secure? Anyone interested in people searching for "karl marx" can just look for those hashes in incoming queries.

MORBO: Cryptographic hashing does not work that way!
posted by Slap*Happy at 7:16 PM on November 28, 2011 [5 favorites]


Dexter: *Searching Netrangler™*
Intern: Y'know Google's kinda five minutes ago, right? Try eliot. Yeah, it uses a target algorithm to aggregate content without getting tripped up by all that sneaky SEO bullshit. *types in http//:www.eliotsearchengine.cl*
Dexter: Intern puppy dog has skills.
posted by unliteral at 7:24 PM on November 28, 2011 [1 favorite]


At first blush, I understand what DU is saying, and it doesn't matter whether the hashing is cryptographic (e.g., SHA-1) or not (e.g., CRC-64).

The implementation that I have in mind is that Alice performs a one-way but unsalted hash from term to H(term) and transmits H(term) to the search engine run by Bob, who finds hashed pages where H(term) is also present. Bob returns these results to Alice.

Assume for a minute that Eve has H(term) and also knows the function H(x) (she has to, in order to perform a search herself). Even if something prevents her from just submitting H(term) in order to see the same search results that Alice saw (and what is that "something"?), Eve can also take a dictionary of the top million terms appearing on any web page (or, e.g., the top million terms related to "how do I dispose of a body"), find H(word) for all of them, and thus reverse the hash H(x).
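A minimal sketch of that dictionary attack in Python (the hash function and the word list are stand-ins):

    import hashlib

    def H(term):
        # Unsalted one-way hash, as in the scheme above.
        return hashlib.sha1(term.encode("utf-8")).hexdigest()

    # Eve precomputes H(word) over a dictionary of likely terms...
    dictionary = ["karl marx", "das kapital", "dispose of a body"]
    rainbow = {H(word): word for word in dictionary}

    # ...then "reverses" any intercepted hash with one lookup.
    intercepted = H("karl marx")     # what Eve sees on the wire
    print(rainbow.get(intercepted))  # -> 'karl marx'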

If the hash is H(salt || x), then Bob has a big problem: he has to store a copy of his whole-internet index for every possible salt value, so that no matter which salted version of the search term Alice searches for, he has a match in his index.
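Concretely (a sketch; three toy salts stand in for the full salt space):

    import hashlib

    SALTS = ["s1", "s2", "s3"]  # stand-in for every possible salt

    def H(salt, term):
        return hashlib.sha1((salt + term).encode("utf-8")).hexdigest()

    pages = {"karl marx": ["http://example.org/manifesto"]}

    # Bob must build and store one index per possible salt, or he
    # can't match whichever salted hash Alice happens to send him.
    indexes = {
        salt: {H(salt, term): urls for term, urls in pages.items()}
        for salt in SALTS
    }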

It's entirely possible that there's some clever system that would rightfully be called "hashing" that really protects Alice's search terms from Eve—if so, I'd love to know what it is.
posted by jepler at 7:34 PM on November 28, 2011 [5 favorites]


Indeed, this is an area of active research. For instance, a few internet searches turn up paywalled papers with titles like "Fuzzy keyword search over encrypted data in cloud computing". These address systems where Alice buys computing power and storage from Bob, yet doesn't want Bob or Eve to be able to determine the inputs, operations, or outputs of the computation she performs.
posted by jepler at 7:38 PM on November 28, 2011 [1 favorite]


I'm a little out of my depth with deep crypto (don't make me break out the CISSP study guide, I may get a hernia), but to my knowledge, modern hashing systems use block encryption to preclude bijection...

Bob isn't indexing H(term) but decrypting H(term) into (term), which returns (result) and then encodes that into H(result) which is sent back to Alice. Without the cypher, Eve can't decode the H(result).

The papers you're referring to involve a situation where Bob or Eve can deconstruct virtual hardware with software to read what's going on in the system, and gain unauthorized access to data that way. The real exciting stuff is actually performing computation on encrypted data without needing to decrypt it as it enters and leaves memory and the processor. I don't even begin to pretend to understand how that works, and it's all theoretical at the moment anyhow.
posted by Slap*Happy at 7:57 PM on November 28, 2011 [1 favorite]


If Bob "decrypts" H (performs H⁻¹(H(x)) to get back x), then H isn't a hash.

I agree that the paper whose title I gave doesn't directly apply to situations like YaCy.
posted by jepler at 8:02 PM on November 28, 2011 [2 favorites]


designed [to] prevent any single entity from exercising power over search results.

Sounds like a challenge to 4chan.
posted by BrotherCaine at 8:23 PM on November 28, 2011 [1 favorite]


I'd incorrectly assumed that YaCy employs a Google-style PageRank algorithm with multiple peers checking one another's work, which might work provided any particular adversary remained below some fraction of the network. Instead, YaCy is apparently a "personal web crawler and search engine" with a "peer-to-peer index exchange network" (see pdf and freecode).

I've gleaned that YaCy verifies index entries itself before displaying search results. I'd imagine every search requires recomputing the appropriate steady-state eigenvector for a Markov-chain-based search algorithm, à la Google's PageRank. If not, a naive search might suffice for censored searches like "Nightwish discography filetype:torrent". AltaVista reborn!
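For what I mean by that, here's a toy power-iteration sketch in Python (the three-page link graph and damping factor are made up, not YaCy's actual algorithm):

    # Power iteration toward the steady-state eigenvector of a link
    # graph's Markov chain, in the spirit of PageRank.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    damping = 0.85

    for _ in range(50):  # iterate until approximately converged
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new

    print(rank)  # steady-state importance scores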

As a rule, we should prefer decentralized, federated, peer-to-peer, etc. information technologies whenever reasonable because they always create freedom. Yet, many shall remain borderline threats against authoritarian "east coast code". We aren't holding this discussion on a freesite, but Freenet provides a credible threat that the internet could vanish from the authority's view if they squeezed too hard.
posted by jeffburdges at 8:47 PM on November 28, 2011 [1 favorite]


It's fascinating to me just how many of these projects pay indirect homage to Ted Nelson's 40-year-old vision of Xanadu, the original peer-to-peer hypertext system (an article in Wired 3.06, June 1995, before it turned into The Magazine Of Shiny Things). Even Tim Berners-Lee is talking about browsers becoming less consumers of web content and more components of its actual structure: locally caching versions of sites for the Wayback Machine, acting as alternative delivery sources of pages forced offline by censorship or technical glitches. Everything old is new again.
posted by Bora Horza Gobuchul at 9:49 PM on November 28, 2011 [2 favorites]


I may be misunderstanding something, but DU's point seems perfectly clear: if some third party maintains something like a rainbow table of common search phrases, then hashing will do nothing to obscure (at least some of) the controversial search terms in a user's history.

Re-reading the thread, that seems to be exactly jepler's point here.
posted by invitapriore at 8:11 AM on November 29, 2011 [1 favorite]


Hashing the search terms has some interesting side effects, too: one is that all of the terms must be normalized prior to hashing. "karl marx" may hash to "(#*&%(*# HFJHFJ", but "Karl Marx" would hash to "AASAGGF*(((***!". Since you can't decrypt a hash, you'll need to do things like remove spaces, stem keywords, and convert search terms to lower case. Boolean searches would also be problematic, depending on how robust the API is.
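A sketch of what that normalization might look like in Python (SHA-1 and the exact rules are assumptions):

    import hashlib

    def normalize(query):
        # Lowercase and collapse whitespace before hashing; a real
        # engine would also stem keywords (omitted here).
        return " ".join(query.lower().split())

    def H(query):
        return hashlib.sha1(normalize(query).encode("utf-8")).hexdigest()

    assert H("karl marx") == H("Karl  Marx")  # now they match, as desired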
posted by jenkinsEar at 10:04 AM on November 29, 2011


Yeah, I didn't even think of normalization. In fact, it's worse than that since presumably "karl marx" and "marx karl" should return the same results (at least when unquoted). Which means each term would have to be separately hashed or alphabetized or something. That makes it all the easier to "unhash" the search terms.
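Something like this, say (a sketch; per-term SHA-1 is my assumption):

    import hashlib

    def H(term):
        return hashlib.sha1(term.encode("utf-8")).hexdigest()

    def query_hashes(query):
        # Hash each term separately and sort, so "karl marx" and
        # "marx karl" produce the same lookups.
        return sorted(H(t) for t in query.lower().split())

    assert query_hashes("karl marx") == query_hashes("Marx Karl")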
posted by DU at 11:20 AM on November 29, 2011


Indeed, this is an area of active research. For instance, a few internet searches turn up paywalled papers with titles like "Fuzzy keyword search over encrypted data in cloud computing". These address systems where Alice buys computing power and storage from Bob, yet doesn't want Bob or Eve to be able to determine the inputs, operations, or outputs of the computation she performs.
My understanding is that those things are totally impractical in the real world, exponentially increasing the run-time of the operations (so far).

Really though, 2+ terabyte hard drives are pretty cheap now. It would be pretty easy for people to store a fairly huge internet index, plus an offline Wikipedia, on their own hard drives. If you download the whole index, there's no need to worry about sending the actual terms you search for elsewhere.

For practical purposes, the best thing to do is probably just do your searches over Tor. You can also use alternative search engines besides Google.
posted by delmoi at 11:31 AM on November 29, 2011


The other problem of course is that if 'no one' controls the search engine then spammers will.
posted by delmoi at 11:32 AM on November 29, 2011 [1 favorite]


Yeah, that's what my "prohibits poisoning" question was about. Although I see there's supposed to be some kind of learning algorithm that a) figures out what you like and b) keeps that stuff locally. So maybe that solves both problems. Most of your queries never leave your machine (in the limit) and you've trained the local engine to ignore what you consider spam. Dunno how fast that all converges, though.
posted by DU at 11:40 AM on November 29, 2011


The question the average person will have is whether or not they will be able to find what they're looking for with YaCy. If it can't go toe-to-toe with Google (or Bing), YaCy isn't likely to grow beyond the security-conscious, regardless of how well it protects users' privacy.
posted by tommasz at 1:42 PM on November 29, 2011


OpenDNS has released a tool called DNSCrypt that should stop replay, observation, and timing attacks.
(Not sure it'll help reduce government interference though)
posted by jeffburdges at 1:46 PM on December 8, 2011


MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl
posted by jeffburdges at 9:02 PM on December 18, 2011

