Fixing Moral Hazards In Business Science

I'm a LW reader, two time CFAR alumnus, and rationalist entrepreneur.

Today I want to talk about something insidious: marketing studies.

Until recently I considered studies of this nature merely unfortunate, funny even. However, my recent experiences have caused me to realize the situation is much more serious than this. Product studies are the public's most frequent interaction with science. By tolerating (or worse, expecting) shitty science in commerce, we are undermining the public's perception of science as a whole.

The good news is this appears fixable. I think we can change how startups perform their studies immediately, and use that success to progressively expand.

Product studies have three features that break the assumptions of traditional science: (1) few if any follow up studies will be performed, (2) the scientists are in a position of moral hazard, and (3) the corporation seeking the study is in a position of moral hazard (for example, the filing cabinet bias becomes more of a "filing cabinet exploit" if you have low morals and the budget to perform 20 studies).

I believe we can address points 1 and 2 directly, and overcome point 3 by appealing to greed.

Here's what I'm proposing: we create a webapp that acts as a high quality (though less flexible) alternative to a Contract Research Organization. Since it's a webapp, the cost of doing these less flexible studies will approach the cost of the raw product to be tested. For most web companies, that's $0.

If we spend the time to design the standard protocols well, it's quite plausible that studies done with this webapp will be in the top 1% in terms of scientific rigor.

With the cost low, and the quality high, such a system might become the startup equivalent of citation needed. Once we have a significant number of startups using the system, and as we add support for more experiment types, we will hopefully attract progressively larger corporations.

Is anyone interested in helping? I will personally write the webapp and pay for the security audit if we can reach quorum on the initial protocols.

Companies that have expressed interest in using such a system if we build it:

(I sent out my inquiries at 10pm yesterday, and every one of these companies got back to me by 3am. I don't believe "startups love this idea" is an overstatement.)

So the question is: how do we do this right?

Here are some initial features we should consider:

  • Data will be collected by a webapp controlled by a trusted third party, and will only be editable by study participants.
  • The results will be computed by software decided on before the data is collected. (A minimal sketch of this commitment step appears below.)
  • Studies will be published regardless of positive or negative results.
  • Studies will have mandatory general-purpose safety questions. (web-only products likely exempt)
  • Follow up studies will be mandatory for continued use of results in advertisements.
  • All software/contracts/questions used will be open sourced (MIT) and creative commons licensed (CC BY), allowing for easier cross-product comparisons.

Any placebos used in the studies must be available for purchase as long as the results are used in advertising, allowing for trivial study replication.
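
To illustrate the pre-commitment bullet above concretely, here is a minimal sketch of how the webapp might lock the analysis in before any data exists: the sponsor registers the study together with a cryptographic hash of the analysis script, and results are only accepted later if the script matches that hash. Everything here (the `register_study` function, the JSON registry file) is a hypothetical illustration, not a spec:

```python
import hashlib
import json
import time

REGISTRY = "study_registry.json"  # hypothetical public, append-only registry


def _sha256(path):
    """Hash the analysis script so it can't be quietly swapped after data collection."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def register_study(name, analysis_script):
    """Record the study and a commitment to its analysis code, before any data exists."""
    entry = {
        "study": name,
        "registered_at": time.time(),
        "analysis_sha256": _sha256(analysis_script),
        "status": "initiated",  # published later regardless of outcome
    }
    with open(REGISTRY, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def verify_analysis(entry, analysis_script):
    """At publication time, confirm the submitted script matches the pre-registered hash."""
    return _sha256(analysis_script) == entry["analysis_sha256"]
```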

Significant contributors will receive:

  • Co-authorship on the published paper for the protocol.
  • (Through the paper) an Erdős number of 2.
  • The satisfaction of knowing you personally helped restore science's good name (hopefully).

I'm hoping that if a system like this catches on, we can get an "effective startups" movement going :)

So how do we do this right?

Comments


I am having trouble visualizing it. Could you tell a story that is a use case?

Max L.

Thanks for pointing this out.

Let's use Beeminder as an example. When I emailed Daniel he said this: "we've talked with the CFAR founders in the past about setting up RCTs for measuring the effectiveness of beeminder itself and would love to have that see the light of day".

Which is a little open ended, so I'm going to arbitrarily decide that we'll study Beeminder for weight loss effectiveness.

Story* as follows:

Daniel goes to (our thing).com and registers a new study. He agrees to the terms, and tells us that this is a study which can impact health -- meaning that mandatory safety questions will be required. Once the trial is registered it is viewable publicly as "initiated".

He then takes whatever steps we decide on to locate participants. Those participants are randomly assigned to two groups: (1) act normal, and (2) use Beeminder to track exercise and food intake. Every day the participants are sent a text message with a URL where they can log that day's data. They do so.

After two weeks, the study completes and both Daniel and the world are greeted with the results. Daniel can now update Beeminder.com to say that Beeminder users lost XY pounds more than the control group... and when a rationalist sees such claims they can actually believe them.

  • Note that this story isn't set in stone -- just a sketch to aid discussion
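
To make the mechanics of this story concrete, here is a minimal sketch of the two automated pieces it needs: random assignment into the two arms, and an end-of-study comparison. The function names and the choice of Welch's t-test are illustrative assumptions, not part of any agreed protocol:

```python
import random

from scipy import stats


def assign_groups(participant_ids, seed=None):
    """Randomly split participants into control ('act normal') and treatment (Beeminder)."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {"control": ids[:half], "treatment": ids[half:]}


def compare_weight_change(control_deltas, treatment_deltas):
    """Welch's t-test on per-participant weight changes (illustrative choice of test)."""
    t, p = stats.ttest_ind(treatment_deltas, control_deltas, equal_var=False)
    mean_diff = (sum(treatment_deltas) / len(treatment_deltas)
                 - sum(control_deltas) / len(control_deltas))
    return {"mean_difference_lbs": mean_diff, "t_statistic": t, "p_value": p}
```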

Thanks for the example. It leads me to questions:

  1. For more complicated propositions, who does the math and statistics? The application apparently gathers the data, but it is still subject to interpretation.
  2. Is the data (presumably anonymized) made publicly available, so that others can dispute the meaning?
  3. If the sponsoring company does its own math and stats, must it publicly post its working papers before making claims based on the data? Does anyone review that to make sure it passes some light smell test, and isn't just pictures of cats?
  4. What action does the organization behind the app take if a sponsor publicly misrepresents the data or, more likely, its meaning? If the organization would take action, does it take the same action if the statement is merely misleading, rather than factually incorrect?
  5. What do the participants get? Is that simply up to the sponsor? If so, who reviews it to assure that the incentive does not distort the data? If no one, will you at least require that the incentive be reported as part of the trial?
  6. Does a sponsor have any recourse if it designed the trial badly, leading to misleading results? Or is its remedy really to design a better trial and publicize that one?
  7. Can sponsors do a private mini-trial to test its trial design before going full bore (presumably, with their promise not to publicize the results)?
  8. Have you considered some form of reputation system, allowing commenters to build a reputation for debunking badly supported claims and affirming well-supported claims? (Or perhaps some other goodie?) I can imagine it becoming a pastime for grad students, which would be a Good Thing (TM).

I imagine these might all be very basic questions that arise out of my ignorance of such studies. If so, please spend your time on people with more to contribute than ignorance!

Max L.

7 - Can sponsors do a private mini-trial to test its trial design before going full bore (presumably, with their promise not to publicize the results)?

This is an awesome idea. I had not considered this until you posted it. This sounds great.

6 - Does a sponsor have any recourse if it designed the trial badly, leading to misleading results? Or is its remedy really to design a better trial and publicize that one?

This is a hard one. I anticipate that at least initially only Good People will be using this protocol. These are people who spent a lot of time creating something to (hopefully) make the world better. Not cool to screw them if they make a mistake, or if v1 isn't as awesome as anticipated.

A related question is: what can we do to help a company that has demonstrated its effectiveness?

Not cool to screw them if they make a mistake

This is exactly the moral hazard companies face with the normal procedure too.

The main advantage I see is that the webapp approach is much cheaper, allowing companies to run studies early and thus reducing the moral hazard.

3 - If the sponsoring company does its own math and stats, must it publicly post its working papers before making claims based on the data? Does anyone review that to make sure it passes some light smell test, and isn't just pictures of cats?

At minimum the code used should be posted publicly and open-source licensed (otherwise there can be no scrutiny or replication). I also think paying to have a third party review the code isn't unreasonable.

2 - Is the data (presumably anonymized) made publicly available, so that others can dispute the meaning?

That was the initial plan, yes! Beltran (my co-founder at GB) is worried that will result in either HIPAA issues or something like this, so I'm ultimately unsure. Putting structures in place so the science is right the first time seems better.

The privacy issue here is interesting.

It makes sense to guarantee anonymity. Participants recruited personally by company founders may be otherwise unwilling to report honestly (for example). For health related studies, privacy is an issue for insurance reasons, etc.

However, for follow-up studies, it seems important to keep earlier records including personally identifiable information so as to prevent repeatedly sampling from the same population.

That would imply that your organization/system needs to have a data management system for securely storing the personal data while making it available in an anonymized form.

However, there are privacy risks associated with 'anonymized' data as well, since this data can sometimes be linked with other data sources to make inferences about participants. (For example, if participants provide a zip code and certain demographic information, that may be enough to narrow it down to a very few people.) You may want to consider differential privacy solutions or other kinds of data perturbation.

http://en.wikipedia.org/wiki/Differential_privacy
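
For concreteness, a minimal sketch of the Laplace mechanism, the textbook differential-privacy technique: calibrated noise is added to an aggregate statistic before release. The sensitivity figure in the example assumes each participant's contribution is clipped to a known range and is purely illustrative:

```python
import numpy as np


def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with Laplace noise calibrated for epsilon-differential privacy.

    sensitivity: max change in the statistic from adding/removing one participant.
    epsilon: privacy budget; smaller means more noise and stronger privacy.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise


# Example: releasing an average weight change over 50 participants, assuming each
# participant's change is clipped to [-20, 20] lbs, so sensitivity = 40 / 50.
noisy_mean = laplace_mechanism(true_value=-3.2, sensitivity=40 / 50, epsilon=1.0)
```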

8 - Have you considered some form of reputation system, allowing commenters to build a reputation for debunking badly supported claims and affirming well-supported claims? (Or perhaps some other goodie?) I can imagine it becoming a pastime for grad students, which would be a Good Thing (TM).

I hadn't. I like the idea, but am less able to visualize it than the rest of this stuff. Grad students cleaning up marketing claims does indeed sound like a Good Thing...

I was thinking something like the karma score here. People could comment on the data and the math that leads to the conclusions, and debunk the ones that are misleading. A problem would be that, if you allow endorsers rather than just debunkers, you could get into a situation where a sponsor pays people to publicly accept the conclusions. Here are my thoughts on how to avoid this.

First, we have to simplify the issue down to a binary question: does the data fairly support the conclusion that the sponsor claims? Then:

  1. The sponsor offers $x to each of the first Y reviewers with a reputation score of at least Z. They have to pay regardless of what the reviewer's answer to the question is.
  2. If the reviewers are unanimous, then they all get small bumps to their reputation.
  3. If they are not unanimous, then they see each others' reviews (anonymously and non-publicly at this point) and can change their positions one time. After that, those who are in the final majority and did not change their position get a bump up in reputation, but only based on the number of reviewers who switched to be in the final majority. (I.e., we reward reviewers who persuade others to change their position.)
  4. The reviews are then opened to a broader number of people with positive reputations, who can simply vote yes or no, which again affects the reputations of the reviewers. Again, voting is private until complete, then people who vote with the majority get small reputation bumps.
  5. At the conclusion of the process, everyone's work is made public.
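
A minimal sketch of the scoring step in this scheme (the bump sizes are arbitrary placeholders, and tie-breaking is left unspecified):

```python
def score_reviewers(first_votes, final_votes, unanimous_bump=1, persuasion_bump=2):
    """Reputation bumps for the two-round review described above.

    first_votes / final_votes: dicts mapping reviewer -> bool answer to
    "does the data fairly support the sponsor's claimed conclusion?".
    """
    # Round 1: unanimity earns everyone a small bump.
    if len(set(first_votes.values())) == 1:
        return {r: unanimous_bump for r in first_votes}

    # Round 2: identify the final majority position.
    yes = sum(final_votes.values())
    majority = yes > len(final_votes) - yes  # tie-breaking left unspecified

    # Count reviewers persuaded to switch into the final majority.
    switched = sum(1 for r in final_votes
                   if final_votes[r] == majority and first_votes[r] != majority)

    # Reward those who held the majority view from the start, scaled by
    # how many reviewers were persuaded to join them.
    return {r: (persuasion_bump * switched
                if final_votes[r] == majority and first_votes[r] == majority
                else 0)
            for r in final_votes}
```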

I'm sure that there are people who have thought about reputation systems more than I have. But I have mostly seen reputation systems as a mechanism for creating a community where certain standards are upheld in the absence of monetary incentives. A reputation system that is robust against gaming seems difficult.

Max L.

5 - What do the participants get? Is that simply up to the sponsor? If so, who reviews it to assure that the incentive does not distort the data? If no one, will you at least require that the incentive be reported as part of the trial?

We need to design rules governing participant compensation.

At a minimum I think all compensation should be reported (it's part of what's needed for replication), and of course not related to the results a participant reports. Ideally we create a couple defined protocols for locating participants, and people largely choose to go with a known good solution.

StackOverflow et al. are also free and offer no compensation except points and awards and reputation. Maybe it can be combined: points for regular participation, prominent mention somewhere, and awards as real rewards. The downside is that this may pose moral hazards of some kind.

Oh, interesting.

I had been assuming that participants needed to be drawn from the general population. If we don't think there's too much hazard there, I agree a points system would work. Some portion of the population would likely just enjoy the idea of receiving free product to test.

I would worry about sampling bias due to selection based on, say, enjoying points.

4 - What action does the organization behind the app take if a sponsor publicly misrepresents the data or, more likely, its meaning? If the organization would take action, does it take the same action if the statement is merely misleading, rather than factually incorrect?

I imagined actions similar to what the Free Software Foundation takes when a company violates the GPL: basically a lawsuit and a press release warning people. For template studies, ideally what claims can be made would be specified by the template (i.e., "Our users lost XY more pounds over Z time").

One option is simply to report it to the Federal Trade Commission for investigation, along with a negative publicity statement. That externalizes the cost.

If you would like assistance drafting the agreements, I am a lawyer and would be happy to help. I have deep knowledge about technology businesses, intellectual property licensing, and contracting, mid-level knowledge about data privacy, light knowledge about HIPAA, and no knowledge about medical testing or these types of protocols. I'm also more than fully employed, so you'd have the constraint of taking the time I could afford to donate.

Max L.

1 - For more complicated propositions, who does the math and statistics? The application apparently gathers the data, but it is still subject to interpretation.

This problem can be reduced in size by having the webapp give out blinded data, and only reveal group names after the analysis has been publicly committed to. If participating companies are unhappy with the existing modules, they could perhaps hire "statistical consultants" to add a module, permanently improving the site for everyone.

This could be related to your #8 as well :)
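
A minimal sketch of that blinding step, with hypothetical names: the webapp hands out data whose group labels are replaced by opaque codes, keeps the key server-side, and releases it only after the analysis code has been publicly committed to:

```python
import random


def blind_groups(records, seed=None):
    """Replace real group names with opaque codes; keep the key server-side.

    records: list of dicts with a 'group' field plus outcome data.
    Returns (blinded_records, key) where key maps code -> real group name.
    """
    rng = random.Random(seed)
    groups = sorted({r["group"] for r in records})
    codes = [f"group_{i}" for i in range(len(groups))]
    rng.shuffle(codes)
    key = dict(zip(codes, groups))
    reverse = {g: c for c, g in key.items()}
    blinded = [{**r, "group": reverse[r["group"]]} for r in records]
    return blinded, key  # key is released only after the analysis is committed
```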

Thank you! This is exactly the kind of discussion I was hoping for.

The general answer to your questions is: I want to build whatever LessWrong wants me to build. If it's debated in the open, and agreed as the least-worst option, that's the plan.

I'll post answers to each question in a separate thread, since they raise a lot of questions I was hoping for feedback on.

Those participants are randomly assigned to two groups: (1) act normal, and (2) use Beeminder to track exercise and food intake.

These kinds of studies suffer from the Hawthorne effect. It is better to assign the control group to do virtually anything instead of nothing. In this case I'd suggest having them simply monitor their exercise and food intake without any magical line and/or punishment.

Thank you. I had forgotten about that.

So let's say the two groups were, as you suggest:

  • Tracking food & exercise on Beeminder
  • Tracking food & exercise in a journal

Do you have any thoughts on what questions we should be asking about this product? Somehow the data collection and analysis, once we have the time-series data, don't seem so hard... but the protocol and question design seem very difficult to me.

I wonder if there should be a group where they still get Beeminder's graph, but they don't pay anything for going off their road. (In order to test whether the pledge system is actually necessary.)

He then takes whatever steps we decide on to locate participants.

Even if the group assignments are random, the prior step of participant sampling could lead to distorted effects. For example, the participants could be just the friends of the person who created the study who are willing to shill for it.

The studies would be more robust if your organization took on the responsibility of sampling itself. There is non-trivial scientific literature on the benefits and problems of using, for example, Mechanical Turk and Facebook ads for this kind of work. There is extra value added for the user/client here, which is that the participant sampling becomes a form of advertising.

Yeah, this is a brutal point. I wish I knew a good answer here.

Is there a gold standard approach? Last I checked even the state of the art wasn't particularly good.

Facebook / Google / StumbleUpon ads sound promising in that they can be trivially automated, and if only ad respondents could sign up for the study, then the friend issue is moot. Facebook is the most interesting of those, because of the demographic control it gives.

How bad is the bias? I performed a couple of Google Scholar searches but didn't find anything satisfying.

To make things more complicated, some companies will want to test highly targeted populations. For example, Apptimize is only suitable for mobile app developers -- and I don't see a Facebook campaign working out very well for locating such people.

A tentative solution might be having the company wishing to perform the test supply a list of websites it feels cater to good participants. This is even worse than Facebook ads from a biasing perspective, though. At minimum, it sounds like prominently disclosing how participants were located will be important.

There are people in my department who do work in this area. I can reach out and ask them.

I think Mechanical Turk gets used a lot for survey experiments because it has a built-in compensation mechanism and there are ways to ask questions in ways that filter people into precisely what you want.

I wouldn't dismiss Facebook ads so quickly. I bet there is a way to target mobile app developers on that.

My hunch is that like survey questions, sampling methods are going to need to be tuned case-by-case and patterns extracted inductively from that. Good social scientific experiment design is very hard. Standardizing it is a noble but difficult task.

To work well, I think it needs a good name. In terms of long-term social dynamics, creating a meta-brand that helps smaller brands seems essential. When people initially see the "tested by X" logo they won't know what it means.

Assuming the web app works as intended, and assuming a significant fraction of the population stops believing the classes of claims that could be tested this way but lack the logo, the process should gain more and more credibility over the course of months and years. The transition from an unknown logo to a trusted logo will be tricky for the larger institutional hack to work, and the name itself might be key to the logic of acceptance at the beginning.

I ground through various options at the command line with $ whois $OPTION | grep "[A-Z].COM"... trying to find things that get the right idea and aren't already registered (a scripted version of this check is sketched after the list).

  • DoesItWork .com (taken)
  • justtestit .com (taken)
  • efficacy .com (taken)
  • forrealz .com (taken)
  • proveitforreal .com (available!)
  • simplytested .com (taken)
  • quickproofs .com (taken)
  • openproducttesting .com (available!)
  • opensourcetesting .com (taken)
  • tested .com (taken)
  • testedclaim .com (available!)
  • thirdpartytested .com (available!)
  • 3rdpartytested .com (available!)
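
For anyone who wants to extend the search, here is a scripted version of that whois check. The "No match for" heuristic matches Verisign's .com response but varies by TLD and registrar, so treat it as an assumption (and it requires the whois tool to be installed):

```python
import subprocess

CANDIDATES = ["proveitforreal.com", "testedclaim.com", "thirdpartytested.com"]


def looks_available(domain):
    """Heuristic availability check; 'No match for' is Verisign's .com response."""
    out = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
    return "No match for" in out


for domain in CANDIDATES:
    print(domain, "available!" if looks_available(domain) else "taken")
```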

Namespace is huge and finding a good name seems key. The names I looked for may be too boring, too long, or too easy to misspell? Please comment in response to this comment, one name suggestion per comment, and then find the 3 best suggestions from other people (assuming that there are lots to choose from) and vote them up :-)

Edited to add: I'm seeing lots of votes and no suggestions. Also, ProveItForReal seems to be winning, but I think it works better in a {{citation needed}} context (i.e., you say {{prove it for real}} to dubious claims) and less well for logos on products. Imagine a logo that is worked into a product's packaging that says: "TestedClaim: X gives benefit Y in Z% of users"... that seems good in that context, but {{this needs to be a tested claim}} is awkward. Surely something better is possible than either of these?

proveitforreal.com

Acronym: PIFR.

Used like {{needs citation}} it really shines as {{prove it for real}} ...but how does it look on a product label?

simpleproofoftruth.com

Acronym: SPOT, sounds kind of neat with "SPOT test" or "SPOT tested".

It also works as a potential prod like citation needed... {{ simple proof of truth needed }}

One thing that slightly bothers me is that it relies on an older and less precise sense of the word "proof" that comes more from English common law than from mathematics, and the mathematical sense of the word "proof" is dear to my heart.

3rdpartytested.com

Acronym: 3PT is distinctive. TPT less so.

As a past tense claim, it really shines if you imagine what the logo could do for a product on a website. The link says "3rd Party Tested" and you click on it and it takes you to the open study. Simple and clean.

Downside: the name overlaps a pre-existing phrase ("third party tested" already means something), so you get confusing semantic collisions if someone has third-party-tested products that weren't tested by Third Party Tested (the unique tool).

In the Reproducibility Initiative, PLoS and a few partners came together to improve the quality of science.

I would suggest asking all the people listed as advisors to the Reproducibility Initiative whether they are interested in your project. PLoS would be a good trusted third party with an existing brand.

Thank you. I had not seen the Reproducibility Initiative. Link very much appreciated, I'll start the conversation tonight. PLoS hosting the application would be ideal.

Apptimize (YC S13) is also interested. (disclosure: the CEO and CTO, Nancy and Jeremy, are friends of mine, and I worked there as a programmer).

If anyone else would like to be included, please reply here.

the cost of doing these less flexible studies will approach the cost of the raw product to be tested. For most web companies, that's $0.

Nope. The cost of doing less flexible studies will be the cost of losing that flexibility. For companies which expect a particular result from a study this cost can be considerable.

Do you mean that companies might see the opportunity cost of their ability to (amorally) rig a study as a bad thing? Or do you mean that companies might (legitimately) need complex study design to show a real but qualified positive effect?

If the former, then this proposal seems already to have taken that into account as a chief reason for creating the institution in the first place. If the latter then it might be possible to add options for important and epistemically useful variations in study design to the web app.

I mean both.

However, now that I've looked at the OP more carefully, it seems to imply that a "webapp" will do research for multiple businesses at zero cost. I don't think I understand how that's supposed to work.

One issue that seems more likely to be problematic when the web application is being created and launched than later on is whether the questions are well designed. There's a whole area of expertise that goes into creating scales that are reliable, valid, and discriminative. One possibility is to construct them from scratch from first principles and then make them publicly available, but another possibility is to find the best of what already exists that is open sourced.

For General Biotics and MealSquares it seems like some measure of "not having a happy tummy" is a relevant thing to measure. If Soylent gets in on the process they might have a similar interest?

A little bit of googling turned up the Gastrointestinal Symptom Rating Scale. It has 15 items (which might be too many?) and it is interview based (so hard to fit into an automated system). The really nice thing was that I could find a PDF and it all looked pretty basic.

A 2006 paper by van Zanten tipped me off to the existence of:

The Glasgow Dyspepsia Severity Scale

The Leeds Dyspepsia Questionnaire (public domain, with a Mandarin version!)

The Severity of Dyspepsia Assessment

The Nepean Dyspepsia Index

I'm feeling like in this situation, I can safely say "I love standards, there are so many to choose from"! One of the things that turned up in my searches that seems like a really useful "meta find" is the PROQOLID Clinical Outcomes Assessment database, but it requires membership to use the internal search function and I need to pause to grab some dinner.

Here are some initial features we should consider:

  • Data will be collected by a webapp controlled by a trusted third party, and will only be editable by study participants.
  • The results will be computed by software decided on before the data is collected.
  • Studies will be published regardless of positive or negative results.
  • Studies will have mandatory general-purpose safety questions. (web-only products likely exempt)
  • Follow up studies will be mandatory for continued use of results in advertisements.
  • All software/contracts/questions used will be open sourced (MIT) and creative commons licensed (CC BY), allowing for easier cross-product comparisons.

These requirements are higher than the average study in the social sciences could fulfill. [Citation needed]

That being said, I would put more faith in this startup if it targeted more professional research first and thus made itself more compatible with traditional papers. In a first step, it would require researchers to announce a study and then publish the results regardless of the outcome (as is already done by some journals, as far as I know). In a second step, require the results to be analysed by code published in advance under some kind of open content / open source licence. In a third step, require there to be a replication under the same conditions for the claim to be published "officially". And so on.

I'll think about it some more, but the whole thing seems like it has been discussed on LW before.

Thanks so much! I didn't know about them, it's a good datapoint. Do you happen to know if they are active?