There are several sources of delay, and the human ear can start detecting it at very small values. (I saw a reference about 7-8 years ago that a 20-30 millisecond delay was enough to be noticeably annoying, although the delays on tools like Skype and mobile phones are typically much greater. Other people seem to have far more tolerance for this than me!) The delays you find so annoying are probably only a few hundred milliseconds, which is enough to disrupt conversation because the turn-taking cues arrive late.
Sources for the delay include:
- Frame size: When your voice is sent over the Internet or your work network, it is broken up into "frames" whose length depends on the codec. 10 ms is typical (e.g. G.729), but for lower-bandwidth applications frames may be 30 ms (e.g. G.723.1). The first part of the frame is not sent until the last part of the frame is recorded, so there is up to 30 ms of delay already. Your old-fashioned land-line didn't have this problem - it would be transmitting immediately.
- Speed-Of-Light Delay: As you point out, the signal takes time to travel half-way around the world: on the order of 100 milliseconds through optical fibre along the shortest (Great Circle) route, and more in practice, as the cables don't follow that route. Others have pointed out that going via a geo-stationary satellite adds a huge additional delay - one that is normally avoided if possible. (I was told once that, back when satellites were more commonly used, the satellite would wherever possible only be used for one direction, and under-sea cables for the other, to reduce round-trip delays.)
- Congestion: If packets are sent over an internet connection there may be points at which they contend for bandwidth with other packets, adding delay. Even if there are no other packets to contend with, each router along the way adds some processing delay.
- Jitter Buffer: Packets sent over the Internet are not guaranteed to all arrive in a set time. Some may be a few milliseconds faster or slower than others. In extreme cases, a packet can even be overtaken by one sent after it. This variance in delay is known as jitter. Any jitter in the actual playing of the sound makes it hard for the human ear to make out the words. So the receiver has to queue up the packets for a short period, to line them up evenly for smooth playing. (The jitter buffer turns jitter into delay, which is a compromise solution.) The jitter buffer size varies with the quality of the connection, but can easily add 60 ms to the delay.
- Algorithmic Delay: It takes time to process the compression - it is very CPU intensive, especially with transcoding (where it is necessary to convert from one compression format to another). Also, some compression techniques require "lookahead", which for G.723.1 adds an additional 7.5 ms delay. (Ref)
- Operating System and Sound Card: If you are using a general-purpose operating system (e.g. a Windows laptop) that isn't made for real-time audio, you can expect some additional delay from the responsiveness of the OS and the sound card; these aren't the typical operations the OS is optimised to handle. Such systems are tuned to finish longer jobs quickly, not to handle very small jobs responsively.
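To make the jitter buffer trade-off concrete, here is a minimal sketch of a fixed-size buffer. It assumes frames are sent exactly every `frame_ms` and that playout starts `buffer_ms` after the first frame arrives; the function name and the per-frame delays are made up for illustration, and real receivers usually adapt the buffer size rather than fixing it.

```python
def late_frames(delays_ms, frame_ms, buffer_ms):
    """Return the indices of frames that arrive after their playout slot.

    delays_ms[i] is the (hypothetical) network delay of frame i.
    Frame i is assumed to be sent at i * frame_ms.
    """
    arrivals = [i * frame_ms + d for i, d in enumerate(delays_ms)]
    # Playout starts once the first frame has waited buffer_ms in the queue.
    start = arrivals[0] + buffer_ms
    late = []
    for i, arrival in enumerate(arrivals):
        playout = start + i * frame_ms  # frames are played at an even pace
        if arrival > playout:
            late.append(i)  # too late to play: heard as a gap
    return late


# Made-up delays: most frames take about 50 ms, two are held up.
delays = [50, 55, 48, 90, 52, 110, 50]

print(late_frames(delays, frame_ms=10, buffer_ms=0))
print(late_frames(delays, frame_ms=10, buffer_ms=40))
print(late_frames(delays, frame_ms=10, buffer_ms=60))
```

With no buffer, four of the seven frames miss their slot; with 60 ms of buffering none do, but every frame is now played 60 ms later than it could have been. In this sample frame 4 also arrives before frame 3, which is the "overtaken" case.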
Remember, when you see a person hesitating to respond, you are seeing the round-trip delay - therefore you need to double the total delays mentioned here.
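As a rough worked example, the sketch below adds up the figures quoted above and doubles them. The frame, lookahead and jitter buffer numbers come from the list; the propagation figure is my own assumption (about 20,000 km of fibre, with light covering about 200 km per millisecond in glass), and congestion, transcoding and OS delays are left out because I have no numbers for them, so treat the result as a lower bound.

```python
FIBRE_KM_PER_MS = 200.0  # assumption: light in fibre covers about 200 km per ms


def propagation_ms(path_km):
    """One-way propagation delay over a fibre path of the given length."""
    return path_km / FIBRE_KM_PER_MS


def round_trip_ms(one_way_ms):
    """Sum the one-way contributions and double them for the round trip."""
    return 2 * sum(one_way_ms.values())


one_way = {
    "frame": 30.0,                          # G.723.1 frame size
    "lookahead": 7.5,                       # G.723.1 lookahead
    "propagation": propagation_ms(20_000),  # assumed half-way-round-the-world path
    "jitter_buffer": 60.0,                  # "can easily add 60 ms"
}

print(round_trip_ms(one_way))  # 395.0
```

So even before congestion and OS overhead, this hypothetical call is already close to 400 ms between the end of a question and the start of the answer being heard.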
Note: Any delays for beeping out obscenities and the like will occur after the conversation is recorded. That is, you will hear the broadcast a few seconds later, but they won't interfere with the conversation being held between the interviewer and interviewee. Adding a 5-7 second delay at that point would make the conversation untenable.