BowsAndArrows
Senior Member
Decided to start another thread because this topic tends to come up a lot and I didn't want to hijack the other thread where it came up again recently.
@Steve Woodhouse brought up some points about only drawing conclusions about SQ differences if statistically significant results are obtained in A/B testing... check out the post: https://forum.wiimhome.com/threads/about-external-dacs.6960/post-121648
Before reading this, just know that in my opinion, keeping an open mind and the willingness to challenge our own preconceived notions is probably the part of this hobby that everyone can work to improve on (myself included).
Also, please note that I'm not saying A/B/X testing has no place in audio whatsoever... it's just that the results should be interpreted appropriately, and the weaknesses of the method as applied to audio perception must be understood before drawing conclusions.
There are many known issues with A/B testing, and variations such as A/B/X exist to counteract them... but I wanted to shed some light on other, less-discussed weaknesses of the method.
One illuminating nugget is the series of roughly 75 blind A/B tests conducted by Acoustic Research in the 60s, pitting their famous AR-3 loudspeakers against a live string quartet.
In every single one of these tests, conducted all across the US, listeners were unable to reliably tell the speakers from the live performers.
(I can't find a link to the specific literature, but according to Wikipedia, the New York Times reported on it.) In other words, blind A/B/X testing in audio generally tends to favour the null hypothesis - i.e. it has a high rate of Type II errors - even in cases where there IS a proven, real difference. (I'm sure nobody here would argue a recording is equivalent to a live quartet?? Not in the sense of "SQ" or whatever, but in the empirical sense that they are actually WAY different things - e.g. a live musician can play slightly off-beat or hit a wrong note.)
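To put rough numbers on how easily this happens, here's a minimal sketch (every figure below is an assumption for illustration - none of it comes from the AR tests or any real data):

```python
# A rough sketch with made-up numbers (not from the AR tests): a listener
# who genuinely hears a difference 65% of the time, tested in a quick
# 16-trial ABX session, with the usual 5% significance threshold.
from scipy.stats import binom

n_trials = 16    # assumed session length
alpha = 0.05     # significance threshold
p_real = 0.65    # assumed true detection rate of the listener

# Smallest score that beats pure guessing (p = 0.5) at the alpha level;
# binom.sf(k - 1, n, 0.5) is P(X >= k) under guessing.
k_crit = next(k for k in range(n_trials + 1)
              if binom.sf(k - 1, n_trials, 0.5) <= alpha)

# Power = the chance this genuine-but-imperfect listener reaches that score.
power = binom.sf(k_crit - 1, n_trials, p_real)
print(f"need >= {k_crit}/{n_trials} correct to 'pass'")
print(f"power = {power:.0%}, so Type II error rate = {1 - power:.0%}")
```

With these made-up numbers, a listener who genuinely hears the difference 65% of the time still fails to reach significance roughly 7 times out of 10 - the test reports "no difference" even though one exists.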
The reasons for this are debatable, but the best distilled explanation I can come up with is that our perception and memory are not deterministic. i.e. for a given set of identical inputs (an "event"), our perception AND memory of it will not be identical from one trial to the next. You can't interrogate the auditory system with a repeatable measurement the way you can, e.g., measure the pressure of a football.
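A toy way to picture this (the numbers here are purely illustrative assumptions, not measurements of anything):

```python
# A toy signal-detection sketch (illustrative values, nothing measured):
# the same physical stimulus, judged ten times, gives different answers
# because internal noise is added before each decision.
import random
random.seed(1)

STIMULUS = 1.0     # identical input, every trial
NOISE_SD = 0.8     # assumed internal (neural) noise
CRITERION = 1.2    # assumed decision criterion for "I hear a difference"

judgements = ["different" if STIMULUS + random.gauss(0, NOISE_SD) > CRITERION
              else "same"
              for _ in range(10)]
print(judgements)  # identical inputs, non-identical perceptions
```

The stimulus never changes, yet the reported percept does - which is exactly the property that makes single short A/B trials so noisy.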
It's not all doom and gloom in this post though - there is a potential solution to the issues above! We can modify our testing methodology, and, I guess, temper our expectations in terms of how generalisable the results are...
Passion for Sound did an interesting video on this topic recently. He uses examples of visual perception to illustrate what he means about audio perception, and it actually works really well. Of particular note, he refers to the 2018 paper "On Human Perceptual Bandwidth and Slow Listening" (https://bit.ly/SlowListening), which goes into detail about why @Burnside 's approach may be better.
So I guess in an ideal world, a friend would turn up at your doorstep with a black box. They'd hook it up to your system with everything else unchanged, then come back in a couple of days/weeks, switch it out with a different box, and repeat this enough times to get a statistically significant result.
This is obviously enormously more labour-intensive than traditional A/B testing, but it might mitigate the Type II errors noted above...
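For anyone curious what "enough times" means statistically, here's a minimal sketch of how I'd score that protocol (the visit count and the listener are assumptions of mine, not anything from the video or the paper):

```python
# A minimal sketch of the stats behind the "black box" protocol (my
# framing; the visit count is an assumption, not from the post).
# Each visit, the friend installs box A or B at random; after days of
# listening you name the box. One visit = one Bernoulli trial.
from scipy.stats import binomtest
import random
random.seed(0)

N_VISITS = 8  # assumed number of swaps your friend will tolerate

truth = [random.choice("AB") for _ in range(N_VISITS)]
guesses = list(truth)  # stand-in for a perfect listener; use real guesses

correct = sum(g == t for g, t in zip(guesses, truth))
result = binomtest(correct, N_VISITS, p=0.5, alternative="greater")
print(f"{correct}/{N_VISITS} correct, one-sided p = {result.pvalue:.4f}")
# Even a flawless listener needs at least 5 straight correct calls before
# p can drop below 0.05 (0.5**5 ~= 0.031) - hence the weeks of effort.
```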
Many reviewers (as Passion for Sound mentions) also now perform A/B testing after this "slow listening" period, so the perceptions/memories formed over time (although fallible in any case) can be interrogated more thoroughly. A lot of them have also started (un?)consciously including an "SQ after I removed the tweak/component" section at the end of their reviews, where they double-check whether the differences they heard persist after removing the change from their system for a few listening sessions.
