BowsAndArrows
Senior Member
Decided to start another thread because this topic tends to come up a lot and I didn't want to hijack the other thread where it came up again recently.
@Steve Woodhouse brought up some points about only drawing conclusions about SQ differences if statistically significant results are obtained in A/B testing... check out the post: https://forum.wiimhome.com/threads/about-external-dacs.6960/post-121648
Before reading this, just know that in my opinion, keeping an open mind and the willingness to challenge our own preconceived notions is probably the part of this hobby that everyone can work to improve on (myself included).
Also, please note that I'm not saying A/B/X testing has no place in audio whatsoever... it's just that the results should be interpreted appropriately, and the weaknesses of the method as applied to audio perception must be understood before drawing conclusions.
There are many known issues with A/B testing, and variations such as A/B/X exist to counteract them... but I wanted to shed some light on other, less-discussed weaknesses of the method.
One illuminating nugget is the series of roughly 75 blind A/B tests conducted by Acoustic Research in the 60s, pitting their famous AR-3 loudspeakers against a live string quartet.
In every single one of these tests, conducted all across the US, listeners were unable to reliably tell the speakers from the live performers.
(I can't find a link to the specific literature, but according to Wikipedia, the New York Times reported on it.) In other words, blind A/B/X testing in audio generally tends to favour the null hypothesis - i.e. it has a high rate of Type II errors - even in cases where there IS a proven, real difference. (I'm sure nobody here would argue a recording is equivalent to a live quartet?? Not in the sense of "SQ" or whatever, but in the empirical sense that they are actually WAY different things - e.g. a live musician can play slightly off-beat or hit a wrong note.)
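To put rough numbers on how easily this happens, here's a minimal sketch (every figure below is an assumption for illustration - none of it comes from the AR tests or any real data):

```python
# A rough sketch with made-up numbers (not from the AR tests): a listener
# who genuinely hears a difference 65% of the time, tested in a quick
# 16-trial ABX session, with the usual 5% significance threshold.
from scipy.stats import binom

n_trials = 16    # assumed session length
alpha = 0.05     # significance threshold
p_real = 0.65    # assumed true detection rate of the listener

# Smallest score that beats pure guessing (p = 0.5) at the alpha level;
# binom.sf(k - 1, n, 0.5) is P(X >= k) under guessing.
k_crit = next(k for k in range(n_trials + 1)
              if binom.sf(k - 1, n_trials, 0.5) <= alpha)

# Power = the chance this genuine-but-imperfect listener reaches that score.
power = binom.sf(k_crit - 1, n_trials, p_real)
print(f"need >= {k_crit}/{n_trials} correct to 'pass'")
print(f"power = {power:.0%}, so Type II error rate = {1 - power:.0%}")
```

With these made-up numbers, a listener who genuinely hears the difference 65% of the time still fails to reach significance roughly 7 times out of 10 - the test reports "no difference" even though one exists.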
The reasons for this are debatable, but the best distilled explanation I can come up with is that our perception and memory are not deterministic. i.e. for a given set of identical inputs (an "event"), our perception AND memory of it will not be identical from one trial to the next. You can't interrogate the auditory system with a repeatable measurement the way you can, e.g., measure the pressure of a football.
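A toy way to picture this (the numbers here are purely illustrative assumptions, not measurements of anything):

```python
# A toy signal-detection sketch (illustrative values, nothing measured):
# the same physical stimulus, judged ten times, gives different answers
# because internal noise is added before each decision.
import random
random.seed(1)

STIMULUS = 1.0     # identical input, every trial
NOISE_SD = 0.8     # assumed internal (neural) noise
CRITERION = 1.2    # assumed decision criterion for "I hear a difference"

judgements = ["different" if STIMULUS + random.gauss(0, NOISE_SD) > CRITERION
              else "same"
              for _ in range(10)]
print(judgements)  # identical inputs, non-identical perceptions
```

The stimulus never changes, yet the reported percept does - which is exactly the property that makes single short A/B trials so noisy.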
It's not all doom and gloom in this post though - there is a potential solution to the issues above! We can modify our testing methodology, and, I guess, temper our expectations in terms of how generalisable the results are...
Passion for Sound did an interesting video on this topic recently. He uses examples of visual perception to illustrate what he means about audio perception, and it actually works really well. Of particular note, he refers to the 2018 paper "On Human Perceptual Bandwidth and Slow Listening" (https://bit.ly/SlowListening), which goes into detail about why @Burnside 's approach may be better.
So I guess in an ideal world, a friend would turn up at your doorstep with a black box. They'd hook it up to your system with everything else unchanged, then come back in a couple of days/weeks, switch it out with a different box, and repeat this enough times to get a statistically significant result.
This is obviously enormously more labour-intensive than traditional A/B testing, but it might mitigate the Type II errors noted above...
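For anyone curious what "enough times" means statistically, here's a minimal sketch of how I'd score that protocol (the visit count and the listener are assumptions of mine, not anything from the video or the paper):

```python
# A minimal sketch of the stats behind the "black box" protocol (my
# framing; the visit count is an assumption, not from the post).
# Each visit, the friend installs box A or B at random; after days of
# listening you name the box. One visit = one Bernoulli trial.
from scipy.stats import binomtest
import random
random.seed(0)

N_VISITS = 8  # assumed number of swaps your friend will tolerate

truth = [random.choice("AB") for _ in range(N_VISITS)]
guesses = list(truth)  # stand-in for a perfect listener; use real guesses

correct = sum(g == t for g, t in zip(guesses, truth))
result = binomtest(correct, N_VISITS, p=0.5, alternative="greater")
print(f"{correct}/{N_VISITS} correct, one-sided p = {result.pvalue:.4f}")
# Even a flawless listener needs at least 5 straight correct calls before
# p can drop below 0.05 (0.5**5 ~= 0.031) - hence the weeks of effort.
```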
Many reviewers (as Passion for Sound mentions) also now perform A/B testing after this "slow listening" period, so the perceptions/memories formed over time (although fallible in any case) can be interrogated more thoroughly. A lot of them have also started (un?)consciously including an "SQ after I removed the tweak/component" section at the end of their reviews, where they double-check whether the differences they heard persist after removing the change from their system for a few listening sessions.
