Thoughts on A/B testing

BowsAndArrows

Senior Member · Joined Mar 14, 2024 · Messages: 468 · Location: Ireland
Decided to start another thread because this topic tends to come up a lot and I didn't want to hijack the other thread where it came up again recently. 🙏 @Steve Woodhouse made the point that we should only draw conclusions about SQ differences if statistically significant results are obtained in A/B testing... check out the post https://forum.wiimhome.com/threads/about-external-dacs.6960/post-121648

No gullibility involved or implied.

Of course, I’m more than interested in why the method suggested is in any way flawed.

Before reading this, just know that in my opinion, keeping an open mind and the willingness to challenge our own preconceived notions is probably the part of this hobby that everyone can work to improve on (myself included). 🤣

Also, please note that I'm not saying A/B/X testing has no place in audio whatsoever... it's just that the results should be interpreted appropriately, and the weaknesses of the method as applied to audio perception must be understood before trying to do so.

There are many issues with A/B testing, and many variations such as A/B/X etc. exist to counteract them... but I wanted to shed some light on some other weaknesses of the method.

One illuminating nugget is the series of approx. 75 blind A/B tests done by Acoustic Research in the '60s using their famous AR-3 loudspeakers vs a live string quartet. 👀 The listeners were unable to notice a difference in every single one of these tests, conducted all across the US. 🤯 (I can't find a link to the specific literature, but according to Wikipedia the New York Times reported on it?) In other words, blind A/B/X testing conducted in audio generally tends to favour the null hypothesis (high rates of Type 2 errors), even in cases where there IS a proven real difference. (I'm sure nobody here would argue a recording is equivalent to a live quartet?? Not in the sense of "SQ" or whatever, but in the empirical sense that they are actually WAY different things, e.g. a live musician can play slightly off-beat or play a wrong note.)

The reasons for this are debatable, but the best distilled explanation I can come up with is that our perception and memory are simply not deterministic. I.e. for a given set of identical inputs (an "event"), our perception AND memory of them will not be identical. You can't interrogate the auditory system with a repeatable measurement the way you can, e.g., measure the pressure of a football. 🤷‍♂️

It's not all doom and gloom in this post though - there is a potential solution to the issues above! We can modify our testing methodology, and I guess temper our expectations in terms of how generalisable the results are... 🤔

Passion for Sound did an interesting video on this topic recently. He uses examples of visual perception to illustrate what he means about audio perception, and it actually works really well. Of particular note, he refers to the 2018 paper "On Human Perceptual Bandwidth and Slow Listening" (https://bit.ly/SlowListening), which goes into detail about why @Burnside 's approach may be better.

So I guess in an ideal world, a friend would turn up at your doorstep with a black box, hook it up to your system with everything else unchanged, then come back in a couple of days/weeks, switch it out with a different box, and repeat this enough times to get a statistically significant result. 😅 This is obviously enormously more labour intensive than traditional A/B testing, but it might mitigate the Type 2 errors noted above...
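Just to put some rough numbers on how labour intensive that would be - here's a back-of-the-envelope sketch of my own (not from any of the sources above), treating each visit as an independent guess with a 50/50 chance of being right under the null hypothesis:

```python
# Minimal sketch: how many correct identifications out of n blind "box swap"
# visits are needed before the result is significant at p < 0.05, assuming
# each visit is a pure 50/50 guess under the null hypothesis.
from scipy.stats import binom

def min_correct_for_significance(n_visits: int, alpha: float = 0.05) -> int:
    """Smallest number of correct calls with a one-sided binomial p-value below alpha."""
    for k in range(n_visits + 1):
        p_value = binom.sf(k - 1, n_visits, 0.5)  # P(X >= k) under pure guessing
        if p_value < alpha:
            return k
    return n_visits + 1  # not reachable even with a perfect score

for n in (5, 8, 10, 15, 20):
    print(f"{n} visits: at least {min_correct_for_significance(n)} correct needed")
```

With 10 visits you'd need 9 correct calls; with 20 visits, 15. That's a lot of doorstep appearances to ask of a friend...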

Many reviewers (as Passion for Sound mentions) also now perform A/B testing after this "slow listening" period, so the perceptions/memories (although fallible in any case) formed over time can be interrogated more thoroughly. A lot of them have also started (un?)consciously including an "SQ after I removed the tweak/component" section at the end of their reviews, where they "double-check" whether the differences they heard hold up after removing the change from their system for a few listening sessions. 👏

 
In other words, blind A/B/X testing conducted in audio generally tends to favour the null hypothesis (high rates of Type 2 errors), even in cases where there IS a proven real difference.
Good post.
Could you expand on the quoted section above please? What are the null hypothesis and Type 2 errors?
TIA.
 

Nocebo (can’t hear any difference in A/B tests) is as powerful as placebo.

A/B tests with stereo speakers are impossible to do because each pair has a place in the room where it sounds best. 20 cm can be the difference between good and not good.

To make a valid judgement one has to listen and compare in a relaxed state, and it's going to take at least a couple of hours at home to do that.
 
Good post.
Could you expand on the quoted section above please? What are the null hypothesis and Type 2 errors?
TIA.
Thanks, and oh yeah, sorry about that.

The null hypothesis is a statement that there is no effect or no difference between the tested groups (in this case, that A and B are the same, or that the difference is insignificant). It serves as the starting point for statistical testing. We test it by asking: how likely would it be to get results like mine by chance alone, if the null hypothesis were true? (A little oversimplified, but I hope that helps.)

A Type 2 error is when the null hypothesis is not rejected even though it is actually false, i.e. a false negative - in our case, concluding "no audible difference" when there really is one.
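To make the Type 2 error idea concrete, here's a quick simulation (my own sketch, with made-up numbers): suppose a listener genuinely hears a small difference and gets 60% of trials right in a 10-trial ABX run. Even then, the run usually fails to reach p < 0.05, so the null hypothesis survives despite being false:

```python
# Sketch: estimated Type 2 error rate for a short ABX test, assuming a real
# but small audible difference (60% hit rate) and a p < 0.05 criterion.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
true_hit_rate = 0.60      # assumed: the difference is real but subtle
n_trials = 10             # a typical short ABX session
n_experiments = 10_000

type2 = 0
for _ in range(n_experiments):
    correct = rng.binomial(n_trials, true_hit_rate)
    p = binomtest(correct, n_trials, 0.5, alternative="greater").pvalue
    if p >= 0.05:         # null not rejected, even though it is false
        type2 += 1

print(f"Type 2 error rate ≈ {type2 / n_experiments:.0%}")  # roughly 95% here
```

So with these (assumed) numbers the test misses the real difference about 95% of the time - which is exactly the kind of weakness I was trying to describe above.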
 
If an analogy is made with pharmaceutical clinical trials, then I would suggest that each listener should never have been exposed to either piece of equipment previously and (something I have not seen mentioned here) that it be a crossover trial, i.e. rather than just A/B, half the subjects are randomly assigned A/B and half B/A, which will minimise some of the problems mentioned in this thread. You may even wish to include a ‘washout’ period between switching. I would also suggest scoring on an ordinal (graded) scale rather than a binary one that just states which is best. You should then be able to power the study, i.e. calculate the number of participants required to obtain a statistically significant result, based on how much of a difference in scoring you feel would be necessary to define a worthwhile upgrade, such as a difference of 1 on a scale of 1 to 10.
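For illustration, here is roughly what that power calculation might look like (my own numbers - the 1-point threshold is from above, but the assumed 2-point spread of within-listener score differences and the paired-t approximation of the ordinal scale are just for the sake of the example):

```python
# Sketch of powering the study: listeners needed in a crossover trial to
# detect a 1-point mean difference on a 1-10 scale with 80% power.
from statsmodels.stats.power import TTestPower

mean_difference = 1.0      # "worthwhile upgrade" threshold (from the post above)
sd_of_differences = 2.0    # assumed spread of within-listener score differences
effect_size = mean_difference / sd_of_differences

n = TTestPower().solve_power(effect_size=effect_size,
                             alpha=0.05,        # significance level
                             power=0.80,        # chance of detecting a real effect
                             alternative="two-sided")
print(f"~{n:.0f} listeners needed")             # roughly 33 with these illustrative numbers
```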
 
Before reading this, just know that in my opinion, keeping an open mind and the willingness to challenge our own preconceived notions is probably the part of this hobby that everyone can work to improve on (myself included). 🤣

Also, please note that I'm not saying A/B/X testing has no place in audio whatsoever... it's just that the results should be interpreted appropriately, and the weaknesses of the method as applied to audio perception must be understood before trying to do so.

There are many issues with A/B testing, and many variations such as A/B/X etc. exist to counteract them... but I wanted to shed some light on some other weaknesses of the method.

One illuminating nugget is the series of approx. 75 blind A/B tests done by Acoustic Research in the '60s using their famous AR-3 loudspeakers vs a live string quartet. 👀 The listeners were unable to notice a difference in every single one of these tests, conducted all across the US. 🤯 (I can't find a link to the specific literature, but according to Wikipedia the New York Times reported on it?) In other words, blind A/B/X testing conducted in audio generally tends to favour the null hypothesis (high rates of Type 2 errors), even in cases where there IS a proven real difference.
Please note that without knowing the exact details of the test protocol Acoustic Research employed, we cannot say whether the conclusion that "blind A/B/X testing conducted in audio generally tends to favour the null hypothesis" is justified.
There are at least two other possible conclusions we could draw from the anecdote you shared (in addition to yours, which I'll call conclusion 1):
2) There was a methodological error in Acoustic Research's test protocol that masked a difference that would otherwise be audible in a blind listening test.
3) Acoustic Research's test protocol was correct, but within their test scope there really was no audible difference to be heard.

Any of these three conclusions could be correct or incorrect based on the data provided so far.

(I'm sure nobody here would argue a recording is equivalent to a live quartet?? Not in the sense of "SQ" or whatever, but in the empirical sense that they are actually WAY different things, e.g. a live musician can play slightly off-beat or play a wrong note.)
Note that a recording can contain off-beat or wrong notes too. This in itself doesn't invalidate the test. However, in general, facilitating a true controlled blind test between live musicians and loudspeakers seems like a very difficult task to achieve - I'd be truly surprised if they managed to do this without very significant compromises. Again, without details this is all conjecture.

A/B tests with stereo speakers are impossible to do because each pair has a place in the room where it sounds best. 20 cm can be the difference between good and not good.
There are ways to control for loudspeaker placement in blind listening tests, but I agree it is difficult/tedious and most people will never try to do it.
 
I think it’s important to be clear on certain things.

Conducting an A/B test is not proof that there is/isn’t an audible difference. But it is evidence that the person listening can’t hear a difference.

Second to that, if a difference doesn’t show up in an A/B test, but it’s possible some audible difference is there but was missed, then that difference, by definition, must be incredibly slight.

The great thing about scientific testing is that it’s designed to be repeatable - indeed, that’s the idea.

So if you A/B a pair of DACs, and can’t hear a difference, and then others do the same, you then have a body of evidence which is difficult to ignore.

But here’s the thing.

Let’s say you are thinking of buying a DAC, and you’re faced with a choice between a sub-£100 DAC and another costing £1,000. And let’s say you A/B test them and can’t hear a difference. And, as both DACs measure as transparent, science tells us that result is the expected one.

Whilst it’s possible to hypothesise that there may be something amiss with the testing, we still need to say that’s all we have. We have evidence - even if you believe it’s potentially flawed - that there’s no audible difference, and we have science to support that. But we have absolutely no evidence that you can hear a difference.

In that instance, on what basis do you choose the £1,000 DAC?
 
In that instance, on what basis do you choose the £1,000 DAC?
While I agree with the reasoning and overall sentiment of your post, I'd just like to point out that (not) being able to hear differences between two pieces of gear may not be the only factor influencing a purchasing decision (my thoughts on this).

I.e. even if a person can't tell two devices apart in a controlled listening test, that doesn't necessarily make them a fool if they want to buy the more expensive one. I fear many people on both sides of the argument feel this is implied. :confused:
 
While I agree with the reasoning and overall sentiment of your post, I'd just like to point out that (not) being able to hear differences between two pieces of gear may not be the only factor influencing a purchasing decision (my thoughts on this).

I'd add location to that excellent list - while some might be happy with a nondescript plastic box, a bulky external power supply and a mass of cables in their "man-cave", that just wouldn't cut it for me in my living room.
 
I think it’s important to be clear on certain things.

Conducting an A/B test is not proof that there is/isn’t an audible difference. But it is evidence that the person listening can’t hear a difference.

Second to that, if a difference doesn’t show up in an A/B test, but it’s possible some audible difference is there but was missed, then that difference, by definition, must be incredibly slight.

The great thing about scientific testing is that it’s designed to be repeatable - indeed, that’s the idea.

So if you A/B a pair of DACs, and can’t hear a difference, and then others do the same, you then have a body of evidence which is difficult to ignore.

But here’s the thing.

Let’s say you are thinking of buying a DAC, and you’re faced with a choice between a sub-£100 DAC and another costing £1,000. And let’s say you A/B test them and can’t hear a difference. And, as both DACs measure as transparent, science tells us that result is the expected one.

Whilst it’s possible to hypothesise that there may be something amiss with the testing, we still need to say that’s all we have. We have evidence - even if you believe it’s potentially flawed - that there’s no audible difference, and we have science to support that. But we have absolutely no evidence that you can hear a difference.

In that instance, on what basis do you choose the £1,000 DAC?
Money isn’t the only factor, typically, once you get into those bigger price tags.

Can’t hear the difference between a Geshelli J2S in a gorgeous custom, hand-carved wood case, and an SMSL SU-1, which costs 1/8th the money and uses the exact same chip (AKM 4493), but I’d take that Geshelli for non-sound-related reasons.

-Ed
 
There is a common problem that happens when listening to some systems. The system sounds great when you first listen to it, but after an extended listening session, like an hour or two, people get a noticeable headache. That happens with some systems but not others, and clearly also depends on the listener. That effect indicates that there is more to our hearing than can be determined in a few minutes of A/B testing.
 
There is a common problem that happens when listening to some systems. The system sounds great when you first listen to it, but after an extended listening session, like an hour or two, people get a noticeable headache. That happens with some systems but not others, and clearly also depends on the listener. That effect indicates that there is more to our hearing than can be determined in a few minutes of A/B testing.

I’d like to see scientific evidence of how non-audible levels of noise and/or distortion can cause a headache.

Do we have any reputable medical evidence of this ever happening?

And we’re in danger of losing sight of the point here. The headaches should be inducible in a blind test.
 
There is a common problem that happens when listening to some systems. The system sounds great when you first listen to it, but after an extended listening session, like an hour or two, people get a noticeable headache. That happens with some systems but not others, and clearly also depends on the listener. That effect indicates that there is more to our hearing than can be determined in a few minutes of A/B testing.
Frankly, I'm not convinced that it does indicate this.

If such an effect can be demonstrated under controlled conditions, then I see no reason why controlled listening tests couldn't confirm it - and even be used to find what kind of audio system characteristic causes it. On the other hand, if the effect cannot be demonstrated under controlled conditions, then how can we be sure about what is causing it?

It is perhaps worth adding that A/B tests (and controlled listening tests in general) don't need to last only a few minutes - the main requirement is just that they are controlled for known sources of bias and systematic error (e.g. playback level differences, inconsistent listening conditions, knowing which device is currently playing, etc.).
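As a concrete example of one of those controls, level matching is usually the first thing to check, since even a fraction of a dB of level difference can tilt preferences. A small sketch (hypothetical file names, assuming you've captured each device's output):

```python
# Sketch: check the RMS level difference between two captured device outputs,
# so they can be matched (ideally to within ~0.1 dB) before any listening test.
import numpy as np
import soundfile as sf  # pip install soundfile

def rms_dbfs(path: str) -> float:
    """RMS level of a capture in dBFS."""
    samples, _ = sf.read(path)              # float samples in [-1, 1]
    return 20 * np.log10(np.sqrt(np.mean(np.square(samples))))

level_a = rms_dbfs("dac_a_capture.wav")     # hypothetical capture of device A
level_b = rms_dbfs("dac_b_capture.wav")     # hypothetical capture of device B
print(f"A: {level_a:.2f} dBFS, B: {level_b:.2f} dBFS, "
      f"gain to apply to B: {level_a - level_b:+.2f} dB")
```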

Actually, well-prepared controlled listening tests will make it easier for listeners to hear and identify minor (but real) audible differences - not harder.
Uncontrolled listening tests, on the other hand, will hide any true audible differences behind a whole suite of systematic errors and well-known perceptual and cognitive biases.

Lastly, let me add a reference to a very interesting article on listening test methodology by Stuart Yaniger; perhaps some will find it useful!
 
I’d like to see scientific evidence of how non-audible levels of noise and/or distortion can cause a headache.

Do we have any reputable medical evidence of this ever happening?

And we’re in danger of losing sight of the point here. The headaches should be inducible in a blind test.
So, a person cannot get a headache unless it can be measured?
 
I’m quite surprised about the ‘headache’ phenomenon.

For starters, it’s not a hi-fi problem, it’s a medical problem induced by hi-fi.

Very easy to figure out if a different DAC is causing it. Get your little helper to choose the DAC for the night, unsighted by you. Have an evening’s listening. Ask your helper to randomly choose a DAC for the next night. Could be the same DAC, could be the second DAC.

Get them to keep a diary of which DAC it was on which night, and you keep a diary of when you get a headache.
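If anyone wanted to formalise that, here's a rough sketch (illustrative numbers only, not a real result) of the same idea: a randomised nightly schedule kept by the helper, and a simple 2x2 comparison of the two diaries at the end.

```python
# Sketch: blinded nightly DAC schedule plus a simple check of whether
# headaches track one DAC more than the other. All numbers are made up.
import random
from scipy.stats import fisher_exact

nights = 28
schedule = [random.choice(["DAC 1", "DAC 2"]) for _ in range(nights)]  # helper's diary
print("Helper's diary:", schedule)

# After the listening period, tally the two diaries into a 2x2 table:
#                  headache   no headache
table = [[6, 8],    # DAC 1 nights (illustrative)
         [1, 13]]   # DAC 2 nights (illustrative)

odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.3f}")  # a small p-value would suggest the headaches track the DAC
```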
 