One of the challenges of rationality verification is that most people who are willing to contribute personal data for it are already familiar with the techniques involved. This makes it difficult to tell if their performance on any form of rationality test is due to their training or their innate abilities. Does the start of a new sequence present a way around this for that sequence's content?
I believe that it might, and will propose some ideas on how we can take advantage of these opportunities. But first I would suggest that you try to think through the problem for yourself (I know this is slightly different from what is talked about in that post, but I think the principle holds).
Did you think through the general problem of rationality verification for new sequences before thinking of any solutions? Did you then think of your own solutions before getting your mind contaminated with mine? If yes, good. If no, not so good.
If we had good measures of general rationality that could be retaken by the same person multiple times without losing reliability, we could simply ask LWers to take them at various intervals and see whether they improved after reading the new sequence. Since that is not the case, I suspect we would have to create specific measures for each sequence. Most writers seem to have a decent idea of what benefits they expect people to gain from their sequences, so perhaps they could try to come up with specific measures for the things their sequences are supposed to improve. Then, before running the main sequence, they could put out a call for people to complete these measures and send them in. They could then collect the data again from people who have read the completed sequence, preferably after they have had enough time to practice the material, but not long enough to have had too many other life changes. The necessity and viability of additional experimental controls will vary between sequences, but I think we will generally be fine with a simple before-and-after picture.
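To make the before-and-after picture concrete, here is a minimal sketch of how the resulting data could be analyzed with a paired t-test. The measure, the participants, and every number below are hypothetical, invented purely for illustration:

```python
import math
import statistics

# Hypothetical scores on a sequence-specific measure, one pair per
# participant: (score before reading the sequence, score after).
paired_scores = [
    (12, 15), (10, 14), (14, 14), (9, 13), (11, 12),
    (13, 17), (8, 11), (15, 16), (10, 13), (12, 16),
]

# Paired t-test computed by hand: work with the per-person differences.
diffs = [after - before for before, after in paired_scores]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)          # sample standard deviation
t_stat = mean_diff / (sd_diff / math.sqrt(n))

print(f"mean improvement: {mean_diff:.2f}")
print(f"t statistic (df={n - 1}): {t_stat:.2f}")
```

In practice you would compare the t statistic against a t distribution (or just use `scipy.stats.ttest_rel`), but even this bare version shows whether the average change is large relative to its noise.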
While I have limited time and talent, I would be willing to help with creating the measures, collecting and interpreting the data, and any other necessary steps.
I declare Crocker's rules on the content and style of this post. This includes the title.
The field of psychometrics is all about this kind of thing. The keyword here is "practice effect". It sort of looks like Wikipedia's deletionists have trimmed their content on the concept down to two sentences in an article that's been nominated for deletion as non-notable, but if you hunt around you can find pre-existing content on ways to control for practice effects.
The unique thing about LW's situation in this respect seems to me to be that a lot of people here conceive, execute, and publish polls with relatively sophisticated methodology (for the internet) in the complete absence of grants or formal publication. We're doing as a hobby, for a community blog (yes, a blog), what academics make an entire career out of! Maybe not fully solid with control groups yet, but we're actually kind of close to that already.
To make this sort of "surprising but casual competence" more dramatic and effective, it might be worthwhile to do a sequence on current best practices for community members to run studies on the community itself. Between the free polling technology available via the forms built into Google Docs and chapters from psychometrics textbooks, I bet it wouldn't be that hard for someone to pull together such content for a sequence that makes it easier for people here to spice their efforts up with pretty solid techniques :-)
For example, we could probably get some interesting control data for practice effects the way Anna got control data in her recent poll: via Mechanical Turk. Once you've developed the quiz content, you could ask LWers and turkers to take it, then have the same LWers and turkers retake it, with some being exposed to whatever manipulation you tried on LW by posting content and others not. It wouldn't be a perfect control, but (ignoring the costs) it would be better than nothing...
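The control group described above lets you subtract out the practice effect: compare the retest improvement of people exposed to the sequence against the retest improvement of people who only retook the quiz. A difference-in-differences sketch, where the group sizes and all scores are invented for illustration:

```python
import statistics

# Hypothetical (before, after) scores for two retest groups.
exposed = [(10, 15), (12, 16), (9, 14), (11, 14)]    # read the sequence
controls = [(10, 12), (12, 13), (9, 11), (11, 12)]   # retook the quiz only

def mean_gain(pairs):
    """Average improvement from first sitting to second sitting."""
    return statistics.mean(after - before for before, after in pairs)

practice_effect = mean_gain(controls)           # gain from retaking alone
total_gain = mean_gain(exposed)
sequence_effect = total_gain - practice_effect  # gain attributable to the sequence

print(f"practice effect: {practice_effect:.2f}")
print(f"estimated sequence effect: {sequence_effect:.2f}")
```

The subtraction is the whole trick: whatever improvement the control group shows on retaking is assumed to apply to the exposed group too, so only the excess is credited to the sequence.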
...which naturally leads me to wonder: what is the value of information here? Is there some change in behavior that certain results would cause? What kind of increase in value could be expected from such a change? Anyone have guesses?
If the sequence were shown to be useful, we could use the data to help show people that LW is useful. If it is not useful, we would likely need more research to determine why. If we find that the sequence was merely ineffective at instilling the techniques, we could rewrite it to be more effective. If it turns out that the techniques themselves are ineffective, we could stop teaching them. Preferably we wouldn't remove the sequence, just add a warning at the start of each post; this would save people time and encourage them to create alternative techniques.
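The value-of-information question can be made concrete with a toy expected-value calculation: the study is worth something only because it can change the decision (keep teaching, rewrite, or add a warning). Every probability and payoff below is made up purely to show the shape of the reasoning:

```python
# Toy value-of-information calculation. Suppose that without a study we
# keep teaching the sequence regardless, and a (perfectly informative)
# study would let us stop teaching it when it doesn't work.
# All numbers are invented.

p_useful = 0.6            # prior probability the sequence works
value_if_useful = 100.0   # value of teaching a working sequence
cost_if_useless = -30.0   # cost (reader time) of teaching a broken one

# Without the study we teach either way.
ev_no_study = p_useful * value_if_useful + (1 - p_useful) * cost_if_useless

# With the study we only teach when it works; otherwise we warn and move on.
ev_with_study = p_useful * value_if_useful + (1 - p_useful) * 0.0

value_of_information = ev_with_study - ev_no_study
print(f"expected value of the study: {value_of_information:.1f}")
```

A real study would be noisy rather than perfectly informative, so this is an upper bound on what the measurement effort could be worth; if that bound comes out below the cost of running the study, the study isn't worth doing.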