Good question, @slartibartfast.  Apologies for the long response, but hopefully the context helps frame how the pieces fit together.
The buffering that you're thinking about is a receive-side function where packets are marshalled and put into the proper order (there's a sequence number in the RTP header of the audio datagram to do this).  Burstiness from the transmit side is smoothed out, which removes the jitter caused by variance in the inter-packet arrival times.  The receive-side buffer then de-queues the packets to the audio player function.  The buffer is normally kept some percentage full, and this helps the streaming audio stay continuous without noticeable delays, shudders, or pops.
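To make the receive-side buffer a bit more concrete, here is a rough sketch in Python. It is only an illustration, not how any particular player does it: the class name `JitterBuffer`, the `prefill` parameter, and the string payloads are all made up for the example, and it ignores real-world details such as the 16-bit RTP sequence number wrapping around.

```python
import heapq


class JitterBuffer:
    """Toy receive-side buffer that reorders packets by sequence number.

    Hypothetical sketch: 'prefill' is an assumed knob for how many packets
    to collect before playback starts. Sequence wraparound is ignored.
    """

    def __init__(self, prefill=3):
        self.prefill = prefill   # packets to hold before playback begins
        self.heap = []           # min-heap of (seq, payload)
        self.next_seq = None     # next sequence number the player expects
        self.started = False

    def push(self, seq, payload):
        """Called when a packet arrives from the network, in any order."""
        if self.next_seq is not None and seq < self.next_seq:
            return               # too late: its playout slot has passed
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Called by the player at a steady rate.

        Returns the next payload in order, or None if the buffer is still
        filling or the expected packet never arrived (a gap in the audio).
        """
        if not self.started:
            if len(self.heap) < self.prefill:
                return None
            self.started = True
            self.next_seq = self.heap[0][0]
        # Discard duplicates of packets that were already played.
        while self.heap and self.heap[0][0] < self.next_seq:
            heapq.heappop(self.heap)
        payload = None
        if self.heap and self.heap[0][0] == self.next_seq:
            payload = heapq.heappop(self.heap)[1]
        self.next_seq += 1
        return payload


# Packets 2 and 3 arrive swapped, and packet 4 is lost in transit.
buf = JitterBuffer(prefill=3)
for seq in [1, 3, 2, 5]:
    buf.push(seq, f"audio-{seq}")
played = [buf.pop() for _ in range(5)]
print(played)  # ['audio-1', 'audio-2', 'audio-3', None, 'audio-5']
```

The `None` in the output is the slot where packet 4 should have been. The player gets nothing for that slot and simply carries on with packet 5.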
I think what people are confusing is 'reliability of delivery', thinking that audio streams are always reliable like a file transfer.  The file transfer that @Steve Woodhouse raised doesn't have any timing constraints on it like audio does.  With audio, the packets have to arrive quickly or the buffer that was mentioned above can end up with no packets in it.  So, a file transfer has the luxury of time and can use more reliable delivery (hey, did you get my message?  No?  Then I'll resend it), whereas audio is often sent more like fire-and-forget (hey, I just sent a message and hope that you got it).
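The difference between the two styles can be sketched with a simulated lossy link. Everything here is invented for illustration (`make_lossy_channel`, `fire_and_forget`, `send_with_retry`, and the fixed loss pattern), and it assumes the acknowledgement itself is never lost, which a real network does not guarantee.

```python
from itertools import cycle


def make_lossy_channel(pattern):
    """Return a send() that delivers or drops packets in a repeating pattern.

    'pattern' is a hypothetical list of booleans: True = delivered, False = lost.
    """
    outcomes = cycle(pattern)

    def send(packet):
        return next(outcomes)

    return send


def fire_and_forget(packets, send):
    """Send each packet exactly once and never look back."""
    delivered = [p for p in packets if send(p)]
    return delivered, len(packets)


def send_with_retry(packets, send, max_tries=5):
    """Resend each packet until it gets through or max_tries is used up."""
    delivered, transmissions = [], 0
    for p in packets:
        for _ in range(max_tries):
            transmissions += 1
            if send(p):  # a True result stands in for "ACK received"
                delivered.append(p)
                break
    return delivered, transmissions


packets = list(range(6))
pattern = [True, True, False]  # every third transmission is lost

print(fire_and_forget(packets, make_lossy_channel(pattern)))
# ([0, 1, 3, 4], 6)
print(send_with_retry(packets, make_lossy_channel(pattern)))
# ([0, 1, 2, 3, 4, 5], 8)
```

With retries everything arrives, but it takes 8 transmissions instead of 6, and each resend also costs a round trip of waiting. That extra time is what a file transfer can afford and a nearly empty audio buffer cannot.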
One can absolutely build a 2-way reliable delivery of audio but you have to make assumptions about the network while respecting the need to be timely.  Here, we could compare a 1-hop delivery on a LAN vs the multihop delivery over the public internet.  To the user, reliable delivery can come at the cost of continuous audio which would make them hit the return item button on Amazon.  These are choices that the designer of the 2 sides of an audio application makes because someone's mom can't make that choice.  So, servers like Plex or Emby or LMS can use reliable delivery on your 1-hop broadcast LAN.  But when you're building a massive internet service, the cost of building and operating a service to manage and operate 2-way reliable communications with tens of thousands of music players at scale is non-trivial.  I've been leaning more into this internet service item than the LAN in my comments.
If a packet never actually makes it into the receive-side buffer, there's a tiny gap (a few milliseconds) in the audio stream.  If you're at the edge of radio range with WiFi, for example, then there would definitely be instances where a packet (or more likely, several packets in a row) just doesn't get there.  There are limits even on what is considered 'reliable delivery', because the receiver can't hear the sender.  (When they move too far apart, even if the receiver is saying "I didn't get your message", the sender can't hear the intended receiver's response.)
If you have more questions, please ask away.