

Open source climate data and algorithms
January 15, 2010 10:33 AM

Do you want to personally verify climate science? You can, with open source data and algorithms. OpenTemp.org: An Open Analysis of the Historical Temperature Record. Clear Climate Code: Python reimplementation of GISTEMP, the NASA GISS surface temperature analysis. EDGCM: a research-grade Global Climate Model (GCM) with a user-friendly interface that can be run on a desktop computer.
posted by stbalbach (42 comments total) 13 users marked this as a favorite

 
OpenTemp.org looks like it hasn't been updated in over two years.
posted by Slothrup at 10:36 AM on January 15, 2010


It was 32°F here yesterday. It's 43°F today.

GLOBAL WARMING: PROVEN. YOUR WELCOME, SCIENCE.
posted by Plutor at 10:36 AM on January 15, 2010 [5 favorites]


Agh, "you're". I'm so embarrassed.
posted by Plutor at 10:37 AM on January 15, 2010


No, "YOUR" was perfect.
posted by LordSludge at 10:41 AM on January 15, 2010 [11 favorites]


You're not so shabby, yourself. *wink*
posted by Plutor at 10:51 AM on January 15, 2010 [1 favorite]


The code was already open, so this is more a fork than an opening of anything. And in order to "personally verify" anything, won't I need to know stuff about climate modeling?

I guess my point is: Isn't science already open? Implying otherwise, like OpenTemp and ClimateAudit do, kinda sets the agenda.
posted by DU at 10:53 AM on January 15, 2010 [2 favorites]


1.3°C here. YOUR also welcome.
posted by blue_beetle at 10:54 AM on January 15, 2010


Implying otherwise, like OpenTemp and ClimateAudit do, kinda sets the agenda.

From Wikipedia:
In September 2007, the GISTEMP software which is used to process the GISS version of the historical instrument data was made public. The software that was released has been developed over more than 20 years by numerous staff and is mostly in FORTRAN; large parts of it were developed in the 1980s before massive amounts of computer memory was available as well as modern programming languages and techniques.

Two recent open source projects have been developed by individuals to re-write the processing software in modern open code. One, http://www.opentemp.org/, was by John van Vliet. More recently, a project which began in April 2008 (Clear Climate Code) by staff of Ravenbrook Ltd to update the code to Python has so far detected two minor bugs in the original software which did not significantly change any results.
The science, in this case, was not open until 2007, and even then it was a mess of old Fortran code that no one programs in anymore, so the new Open Source projects are more open-ish, that is, something people can actually use.
posted by stbalbach at 11:29 AM on January 15, 2010 [3 favorites]


Hate on the Fortran all you want, but it's simple and screamingly fast. Python is very slow in comparison.
posted by scruss at 12:31 PM on January 15, 2010


scruss: "Hate on the Fortran all you want, but it's simple and screamingly fast. Python is very slow in comparison."

Hate on <insert name of programming language other than ruby> all you want ... Python is very slow in comparison.
posted by idiopath at 12:50 PM on January 15, 2010


And yes, Fortran is not only fast; for numerical code it in many cases defines the limits of fast. There are apps that are otherwise written in C and assembler macros that use Fortran code for the subroutines that have to be fast.
posted by idiopath at 12:52 PM on January 15, 2010 [1 favorite]


Do you want to personally verify climate science? You can

I'm skeptical. Seems like this is the data we would be needing - not that many people would actually have the skills necessary to make sense of it, but you know what I mean - and it's gone.
posted by Dasein at 1:12 PM on January 15, 2010


I think it's a bit of a myth that Fortran is faster than C. It seems to propagate because a lot of numerical/scientific code is historically written in Fortran, so Fortran is what scientists know and like, despite it being enormously ugly code.

There should be no difference if the compiler is any good, and anyway if what you really care about is performance you should write in assembly, and use hardware features like SIMD and parallelism intelligently and not worry about which language is better.
posted by snoktruix at 2:22 PM on January 15, 2010


It's not really gone, Dasein. Way down in paragraph six of that article it says "In a statement on its website, the CRU said: “We do not hold the original raw data but only the value-added (quality controlled and homogenised) data.”"

So someone else could gather all the information from the individual weather stations again and plug into their own models, thus verifying the CRU's conclusions. It would be really hard because you'd be redoing decades of work, but still possible.
posted by Kevin Street at 2:30 PM on January 15, 2010


snoktruix: "I think it's a bit of a myth that Fortran is faster than C"

It comes down to the way that arrays are indexed and accessed in Fortran. Yes, you can do the same thing in C if you are willing to write brittle and non-idiomatic code that most C developers will think is done wrong, and that will be prone to bugs because it breaks the common sense of C code. Or you can use Fortran, where the faster usage of arrays is idiomatic.
posted by idiopath at 2:32 PM on January 15, 2010


Also, regarding the assembly point, well written fortran tends to outperform naive use of assembly, and is much easier to maintain and debug (and newer fortran can use simd, natch).
posted by idiopath at 2:34 PM on January 15, 2010


You're suggesting that to use arrays "fast" in C you have to write "brittle and non idiomatic code that most c developers will think is done wrong". I don't see why.

I think you may mean that it's slightly easier in Fortran to do high level things, like dynamically allocate vectors and matrices, and to pass them to subroutines. I admit that in C, handling matrices can be a bit painful, but it boils down to needing to do stuff that Fortran will be doing "under the hood" anyway.

You can't really be faster than C in theory. C is as close to the machine as you're going to get without writing assembly. There's no way Fortran outperforms hand-written assembly! Maybe "naive" assembly, but what does that mean? Obviously sufficiently well written C also tends to outperform sufficiently "naive" assembly.
posted by snoktruix at 6:34 PM on January 15, 2010


snoktruix: "I think you may mean that it's slightly easier in Fortran to do high level things, like dynamically allocate vectors and matrices, and to pass them to subroutines"

No.

It is primarily about the way that arrays are indexed and accessed, and the language is designed in such a way that it can be parallelized better than C.

snoktruix: "You can't really be faster than C in theory."

Fortran is no closer to or further from the machine than C. But idiomatic Fortran has better performance for numerical computing, because of language design decisions.
posted by idiopath at 6:40 PM on January 15, 2010


Anyway, the point is, the difference in performance depends much more on the compiler and the specific hardware features you're exploiting (or that the compiler cleverly exploits) than on the language you're writing in. I'm just noting that C is really equivalent conceptually to assembly programming (it's "portable assembly"), while Fortran deals with slightly higher level abstractions, so in that sense C gives more control for optimization. In practice, in some cases the availability of a nicely optimized abstraction or piece of library code in Fortran will make it easier for someone to write a fast program in Fortran than for someone "rolling their own" in C (or using less mature numerical libraries). Fortran is not faster than C; it's just that 40-year-old tuned Fortran libraries with an excellent commercial Fortran compiler will naturally tend to be faster than 5-year-old C libraries with gcc. Switching to Fortran for Playstation 3 coding is unlikely to help - we have a highly optimized C standard library and compiler for that.
posted by snoktruix at 6:46 PM on January 15, 2010


It is primarily about the way that arrays are indexed and accessed

I still think that cannot make Fortran fundamentally faster. Does the compiler somehow emit assembly instructions that a C compiler cannot? Or do you mean it's just more difficult to do it fast in C, so in practice people don't do it? (Got an example?)

..the language is designed in such a way that it can be parallelized better than c.

I know that F90 has language support for vectorization, which certainly makes it easier to set up code which uses SIMD optimizations. Good C compilers also have auto-vectorization features, which may be as good. But in general, in C you'd have to use the compiler intrinsics to get optimal results, which is harder but closer to the machine (doing what the Fortran compiler is doing under the hood).
posted by snoktruix at 7:03 PM on January 15, 2010


Fortran is not a higher level language than C. They are both from an era where pretty much every programming language was "portable assembly", with very few exceptions (those exceptions being the parent of every new language and language feature since circa 1980 or so).

There are compiler specific extensions (pragmas, I believe) for some C compilers (gcc, for example) that allow the kinds of optimizations that Fortran makes. But these are not standard, and you cannot use these optimizations with normal C code - you have to code in a way that is compatible with Fortran. One could say at that point that you are coding Fortran in C. C is a low level language for systems programming. Fortran is a low level language for number crunching. Given all other things as equal (programmer skill, compiler quality, etc.) they each outperform the other in the appropriate domain, when used in a sane manner (i.e. you can do insane optimizations in any low level language, but risk unreadable or bug-prone code).
posted by idiopath at 7:06 PM on January 15, 2010 [1 favorite]


On lack of preview, I think we are mostly on the same page here. Yeah, I mean that given readable, normal best practices coding styles, they each have their advantages in specific domains.
posted by idiopath at 7:09 PM on January 15, 2010


I think it's mistaken to imagine that "Fortran" is making any optimizations. It's the team that wrote the Fortran compiler that made them (and I bet they wrote the Fortran compiler in C, or more likely, C++). Ultimately, you can do exactly the same things in either language with more or less difficulty, because both languages are just interfaces to some underlying hardware. But it's not really harder to write code that is fast enough for supercomputers in C than in Fortran, necessarily (there are obviously fast codes written in C, just as there are in Fortran; e.g. pretty much all game engines use C/C++, none use Fortran AFAIK). Maybe it's harder for novices to use C, that I admit.
posted by snoktruix at 7:17 PM on January 15, 2010


Also, I think it's not unreasonable to note that Fortran is an incredibly ugly language.
posted by snoktruix at 7:18 PM on January 15, 2010


This page does a better job of explaining the differences between Fortran and C than I do.

Also I would hazard a guess that games do not use Fortran because Fortran is an archaic language syntax-wise (not Algol-derived like all the popular languages, including C, are).

As the page I link above points out, it definitely IS harder to write multiple processor parallelized efficient array manipulation code in C.
posted by idiopath at 7:45 PM on January 15, 2010


Interesting technical discussion here about it: Is Fortran faster than C?.

The gist - Fortran does have some language features built in which allow the compiler to do optimizations that C compilers cannot do without compiler specific directives. But with modern compilers, the practical difference is small anyway (and people coding performance critical C, e.g. for games, do of course use the compiler directives).
posted by snoktruix at 7:48 PM on January 15, 2010


Hehe, we converged on the same page..
posted by snoktruix at 7:49 PM on January 15, 2010


Seems like this is the data we would be needing

True, if you wanted to verify the data produced by University of East Anglia (UEA) prior to the 1980s. But, according to Wikipedia:
There are two main global temperature datasets, both developed since the late 1970s: that maintained by the Climatic Research Unit at the University of East Anglia [3] and that maintained by the Goddard Institute for Space Studies [14].
So, get the data from Goddard... and there are others; this is a global effort. You won't be surprised to learn there is little discrepancy between them.
posted by stbalbach at 8:13 PM on January 15, 2010


It's not really gone, Dasein. Way down in paragraph six of that article it says "In a statement on its website, the CRU said: “We do not hold the original raw data but only the value-added (quality controlled and homogenised) data.”

Kevin Street: That's exactly the point. Value-added data means that certain values have been excluded because they've been deemed, by some statistical technique, to be unreliable or otherwise insignificant. The point is that we need to be able to evaluate whether these statistical methods are valid or not, and we can only do that by using the raw data. A case in point in this regard is the hockey-stick graph of global temperatures used in Al Gore's film, based on work done by Michael Mann. Mann used an algorithm that produced a hockey-stick graph essentially irrespective of what data it was given - 99% of the time it was given random data. It was rigged to show a startling warming trend in the 20th century and to eliminate the Medieval Warm Period and Little Ice Age. Not being a statistician, I'm not 100% sure whether others could evaluate the way in which the CRU at East Anglia homogenized and quality controlled their data in the absence of the raw data, but if it was determined that they used a problematic algorithm, there is no longer any ability to access the raw data to re-evaluate the conclusions. We just have to take their word for it. That's not good science, it seems to me.
posted by Dasein at 8:33 PM on January 15, 2010


But the raw data is the stuff collected by the weather stations, right? Apparently all that pre-80s information is still out there. If someone wanted to do so, they could collect it again and use it in their own models, then see if the results conflict with the ones the East Anglia scientists got. Not an easy job by any means, but it wouldn't be impossible either.
posted by Kevin Street at 9:18 PM on January 15, 2010


Mann used an algorithm that produced a hockey-stick graph essentially irrespective of what data it was given

That is incorrect.
posted by flabdablet at 6:53 AM on January 16, 2010


People who feel inclined to repost "hockey stick" critiques here that they've read elsewhere would be well advised to check this page first. Let's keep the noise level in here down a little, shall we?
posted by flabdablet at 7:00 AM on January 16, 2010


The destruction of, or failure to retain, the original raw data so that the "quality control" and homogenization performed on it could be double-checked seems unfortunate. Anything that prevents subsequent verification and reproduction of the same results is almost guaranteed to create suspicion, particularly when the conclusions the data is being used to support are themselves arguments for fairly controversial and wide-ranging policy decisions.

I would hope that in the future scientists would be more careful to retain everything, so that not even the appearance of impropriety might be created. Once science begins to enter the political realm, which it often does when it is used to support or attack policies, the appearance of impropriety is as deadly as the impropriety itself.

You don't have to be a "climate skeptic" to think that they should have done a better job preserving all the data that went into the analysis. At the very least, it would have made the pro-fossil-fuel camp's conspiracy theories a little bit harder.
posted by Kadin2048 at 1:57 PM on January 16, 2010


It doesn't prevent verification of the same results in this case, but it does make verification more difficult, which is too bad. I can't really fault them, though. Back in the 70s and early 80s, who could have guessed that there'd be a clamour for their "raw" data anyway? The director at that time made a judgment call which turned out to be wrong. Had he known that climate research would become so controversial, he'd probably have kept everything.
posted by Kevin Street at 2:29 PM on January 16, 2010


All the original, raw data is preserved at the national meteorological agencies from which CRU originally got it. CRU used a copy of the data to produce a historical global temperature record. When CRU completed their analysis they deleted their copy of the original data because storage costs were high and the original data was not needed. CRU was not in the business of being the secondary archive of other organizations' data. If anyone wants to get the original data they can do so by contacting the various entities that are responsible for archiving it.

But, why would anyone need to do so? Both NASA GISS and NOAA NCDC have also used the original data to calculate global mean surface temperatures. They used somewhat different methodologies. Guess what? While values for individual years differ slightly, the trends in the three global surface temperature reconstructions are the same. This verifies the CRU reconstruction.

In other words, there are no revelations to be found by starting from scratch with the raw data. At best you would find a few transposed digits or typos. While these corrections would be appreciated, they will not change the global temperature record in any climatologically or statistically significant manner. Independent, non-thermometer-based climate reconstructions, such as from tree rings, also verify the thermometer-based records.

Finally, the climate skeptics don't care how well the data is preserved, or how robust the climate models are, or how careful the analysis is conducted, because their goal is not to further the science but to sow doubt. Any analysis leading to a conclusion that does not agree with their position, that nothing should be done about global warming, will come under attack.
posted by plastic_animals at 2:32 PM on January 16, 2010


Sorry to continue the derail, but: snoktruix: “There's no way Fortran outperforms hand-written assembly! Maybe "naive" assembly, but what does that mean?”

Actually compiler-generated code has been outperforming expert handwritten code in some domains for decades, and presumably those domains are only getting larger. The compiler can make use of a very deep model of the processor's performance and the program's statistics when it's choosing and scheduling instructions, and it can do this for every single routine; a human might work on a stretch of code for days to get similar results. For a processor with deep pipelines and complicated multiple dispatch, it's really hard for a human to keep in mind everything that affects the code's performance.

posted by hattifattener at 4:17 PM on January 16, 2010


Let's keep the noise level in here down a little, shall we?

The IPCC withdrew the hockey stick from its publications after it was proved to be erroneous, so referring to critiques of it as "noise" is more than a little arrogant. The hockey stick was wrong - and the tone of the articles over at RealClimate doesn't exactly give me a lot of confidence that they don't have an agenda, either.
posted by Dasein at 9:42 AM on January 17, 2010


If anyone wants to get the original data they can do so by contacting the various entities that are responsible for archiving it.

That's comforting - I hope that's the case. Thanks for the clarification. My impression was that the data had been collected by agents of the British empire and the CRU was the repository of it. I'm glad that's not the case.

Kevin Street - given that the data was deleted a long time ago, I agree that it doesn't seem like this was a CRU coverup or anything, just that it's not appropriate to rely on value-added data without being able to evaluate how it was homogenized, and it's a real problem if people can't go back and do that.
posted by Dasein at 9:46 AM on January 17, 2010


The IPCC withdrew the hockey stick from its publications after it was proved to be erroneous

That is incorrect. See figure 6.10 on page 467 in chapter 6 of the most recent Working Group 1 report.
posted by flabdablet at 3:33 PM on January 17, 2010


Here's a direct link to the appropriate figure in the HTML version of that report. The allegedly erroneous data are the purple curve labelled MBH1999. Note that all of these temperature reconstructions show a temperature trend over the last thousand years that's basically flat, if not slightly declining; then a "hockey stick blade" of rising temperature that starts bending upward at about 1900 and gets seriously steep around 1950.
posted by flabdablet at 8:44 PM on January 17, 2010


In the climategate emails here is what climate scientists were saying to each other:
Without trying to prejudice this work, but also because of what I
almost think I know to be the case, the results of this study will
show that we can probably say a fair bit about <100 year extra-tropical NH temperature variability, but honestly know fuck-all about what the >100 year variability was like with any certainty (i.e. we know
with certainty that we know fuck-all).
From here.

So, while what got into the IPCC report was filtered, what at least some of them actually thought or knew was elided.
posted by sien at 2:08 PM on January 18, 2010


You appear to be suggesting that IPCC reports should properly contain every opinion that any climate scientist anywhere has ever expressed in a private email. I can't see how that could be made to work.
posted by flabdablet at 5:09 PM on January 18, 2010




This thread has been archived and is closed to new comments