Thread: Climategate
11-24-2009, 12:09 AM   #152
twotoner
Backup Goalie
Join Date: Jan 2009

Quote:
Originally Posted by Bagor
Congratulations on being able to cut and paste from a blog.

Now elaborate with your own thoughts why it's a gong show.

What's changed in the last 24 hours? Share your opinion on why it's not worthy of a high school project.

What's changed is that I've started looking at some of the code and the related documentation on how they produce the datasets that everyone uses as inputs. I've also started to appreciate the influence the CRU has on gov't bodies, especially the UN and its various climate-related arms.

From the CRU's own website (emphasis mine):
"The Climatic Research Unit is widely recognised as one of the world's leading institutions concerned with the study of natural and anthropogenic climate change. Consisting of a staff of around thirty research scientists and students, the Unit has developed a number of the data sets widely used in climate research, including the global temperature record used to monitor the state of the climate system, as well as statistical software packages and climate models. "

I've been doing custom software development for the past 12 years. But you don't have to trust that I know a GONG SHOW when I see one: skip my list below and read it for yourself, straight from one of Hadley's finest.

Why it's a gong show / not worthy of a high school project:
- missing data
- constantly guessing and hoping for the best
- massaged data that is then massaged again and again until there is no truth
- blatant fudging of data
- undocumented processes
- un-repeatable results
- giving up and faking it
- no source control
- no version history on files
- no code reviews
- no test code
- code is not documented; data sets and files are not meaningfully named
- directories not meaningfully named
- no standards
- a house of cards, to say the least
This is what millions in funding produces?

Here is another cut & paste for you, Bagor, from the actual docs that were leaked. You can find it in /documents/HARRY_README.txt:

"I am seriously close to giving up, again. The history of this is so complex that I can't get far enough
into it before my head hurts and I have to stop. Each parameter has a tortuous history of manual and
semi-automated interventions that I simply cannot just go back to early versions and run the update prog.
I could be throwing away all kinds of corrections - to lat/lons, to WMOs (yes!), and more.

So what the hell can I do about all these duplicate stations? Well, how about fixdupes.for? That would
be perfect - except that I never finished it, I was diverted off to fight some other fire. Aarrgghhh.

I - need - a - database - cleaner.

What about the ones I used for the CRUTEM3 work with Phil Brohan? Can't find the bugger!! Looked everywhere,
Matlab scripts aplenty but not the one that produced the plots I used in my CRU presentation in 2005. Oh,
IT. Sorry. I will have to WRITE a program to find potential duplicates. It can show me pairs of headers,
and correlations between the data, and I can say 'yay' or 'nay'. There is the finddupes.for program, though
I think the comment for *this* program sums it up nicely:

' program postprocdupes2
c Further post-processing of the duplicates file - just to show how crap the
c program that produced it was! Well - not so much that but that once it was
c running, it took 2 days to finish so I couldn't really reset it to improve
c things. Anyway, *this* version does the following useful stuff:
c (1) Removes and squirrels away all segments where dates don't match;
c (2) Marks segments >5 where dates don't match;
c (3) Groups segments from the same pair of stations;
c (4) Sorts based on total segment length for each station pair'

You see how messy it gets when you actually examine the problem?

This time around, (dedupedb.for), I took as simple an approach as possible - and almost immediately hit a
problem that's generic but which doesn't seem to get much attention: what's the minimum n for a reliable
standard deviation?

I wrote a quick Matlab proglet, stdevtest2.m, which takes a 12-column matrix of values and, for each month,
calculates standard deviations using sliding windows of increasing size - finishing with the whole vector
and what's taken to be *the* standard deviation.

The results are depressing. For Paris, with 237 years, +/- 20% of the real value was possible with even 40
values. Winter months were more variable than Summer ones of course. What we really need, and I don't think
it'll happen of course, is a set of metrics (by latitude band perhaps) so that we have a broad measure of
the acceptable minimum value count for a given month and location. Even better, a confidence figure that
allowed the actual standard deviation comparison to be made with a looseness proportional to the sample size."
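
Just to make Harry's 'yay or nay' duplicate hunt concrete: at its core it's nothing fancier than correlating the overlapping values of two candidate stations and letting a human decide. Here's a rough Matlab sketch of the idea (my own guess at the shape of it, with made-up data; the real finddupes.for obviously reads actual station files):

% hypothetical sketch of the duplicate-station check, NOT the real
% finddupes.for: correlate two candidate series over their overlap,
% then let a human say 'yay' or 'nay'
a = 10 + 8*sin((1:120)*2*pi/12) + randn(1,120); % fake station A, 10 yrs monthly
b = a + 0.5*randn(1,120);                       % fake station B, a near-duplicate
c = corrcoef(a, b);                             % 2x2 correlation matrix
r = c(1,2);
fprintf('correlation over overlap: %.3f\n', r);
if r > 0.95
    disp('likely duplicate - flag for manual yay/nay')
end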
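And his minimum-n question is dead easy to reproduce. Something along the lines of his stdevtest2.m (again my own re-creation, assuming the windows grow from the start of the series; I haven't seen the actual script):

% hypothetical re-creation of the stdevtest2.m idea: for one month's
% column, compare the stdev of growing windows against the full-series
% stdev, which is taken to be *the* standard deviation
vals = 10 + 5*randn(237, 12);      % stand-in for 237 years x 12 months
m = 1;                             % pick a month (column)
full_sd = std(vals(:, m));         % the whole-vector stdev
for n = 20:20:size(vals, 1)
    sd_n = std(vals(1:n, m));      % stdev from only the first n values
    fprintf('n=%3d  sd=%5.2f  err=%+5.1f%%\n', n, sd_n, 100*(sd_n - full_sd)/full_sd)
end

With clean synthetic data the error settles down quickly; on real, gappy station records you'd expect exactly the +/- 20% wobble he reports for Paris. That a question this basic was still open is exactly my point.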

Last edited by twotoner; 11-24-2009 at 12:13 AM.