How to detect ringback tones, IVRs, and hold music in an audio recording

Started by vmangipudi 7 years ago, 16 replies, latest reply 7 years ago, 867 views

I was pointed to this website by Mr. Allen Downey.
I would really appreciate your guidance on the following problem:

Typically in a recorded phone call there is

-- dial tone

-- tring tring ringtone 

-- hold music

-- conversation

-- end of conversation

I have attached a recorded call below. I want to know how to use Python to detect the regions of "tring tring" ringback tone and hold music in the audio.

Link : dropbox.com/s/5m36ekltyg9oiqt/tringtring.mp3?dl=0

That is, I would like to know how to detect regions of ringback tone (tring tring), hold music, and IVR prompts in recorded calls.

I asked on Stack Exchange and Stack Overflow but got little help.



Thanks in advance for your time.

My approaches :

1. I used Audacity to see whether the waveforms/spectrograms provide any insight. I don't know how to read them, but the "tring tring" regions clearly have a very distinct visual pattern.

Figure: ringtone only, waveform + spectrogram

Figure: ringtone + voices, waveform + spectrogram

2. I am now trying to generate MFCC features for particular segments of the phone call, to see whether they tell us anything.

3. Somehow measure the power/amplitude/loudness at fixed intervals in the audio, say every 25 ms or so. The idea is that "tring tring" should have very distinct values compared to the speech, as seen above.
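
A minimal sketch of that third idea, assuming the audio has already been decoded to a list of float samples in [-1.0, 1.0] (decoding the .mp3 is not shown). The name `frame_levels_db` and its parameters are illustrative and not taken from any library. It reports one RMS level in dB per 25 ms frame, which is the kind of value that could then be thresholded or clustered.

```python
import math


def frame_levels_db(samples, sample_rate, frame_ms=25.0, floor_db=-120.0):
    """Return one RMS level (dB relative to full scale) per non-overlapping frame.

    Assumes `samples` is a sequence of floats in [-1.0, 1.0].
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000.0))
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        if mean_square > 0.0:
            levels.append(max(floor_db, 10.0 * math.log10(mean_square)))
        else:
            levels.append(floor_db)  # digital silence: clamp instead of log(0)
    return levels
```

In practice a NumPy version would be much faster on long recordings; plain Python is used here only to keep the sketch self-contained.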

Another approach I just thought of is to convert the .mp3/.wav audio into numeric data and run some sort of unsupervised clustering on it, so that all the tring tring regions end up in one cluster.

My plan is to implement this in Python.


The detection has to be at least 95% accurate.

That is, it should identify regions of ringback tone, IVRs, and hold music with almost absolute certainty. Then I want to remove them from the audio.

Reply by timburnett, December 14, 2017

The first thing I would try would be Goertzel/FFT bin detection of the ring frequency (correlation would probably work too).

This is a pretty active site, so I'm sure others will chime in as well. But if not, I have some suggestions for how you can get more responses to your questions on message boards. First, your problem is not well defined. Are you only trying to detect the presence/absence of the tone, or are you trying to detect the end of the conversation as well, so that you know to resume music playback? Do you just need the algorithm, or is coding it in Python the bigger challenge? Second, you seem to be asking for a complete solution, without indicating that you have already put some effort into solving your own problem. What approaches have you already tried, and what results did they give?

Hopefully my advice is helpful to you.


Reply by vmangipudi, December 14, 2017

Thank you for the response. I have just updated my question with more details.

1. Yes, presence and absence is what I'm trying to find, but more specifically I want to detect "regions" in the audio where there is "tring tring". So if I were to pass an .mp3 or .wav file, the output should be time stamps [start, end] [start, end] [start, end] [start, end] of the tring tring. At this point the concern is just detecting regions of tring tring, hold music, IVRs, etc.

2. I come from a data analysis and visualization background, and I'm brand new to audio processing in Python, so I really don't know where to start. I have been doing research for the most part. (I'm sure the struggles with coding it in Python will come later on. :D )

3. Approaches I looked at or thought of were:

  1.  Using the spectrogram to see if it gives us any more information. For example, the ringtone regions are very distinct compared to the areas with speech. So I want to somehow analyze the signal every 10-20 ms and measure its loudness in decibels. From the spectrogram I can see the value of the amplitude. So every time I encounter a given loudness for a certain fixed duration, I want to save that time stamp.
  2. By taking small samples of audio in the tring tring regions, I want to extract MFCC or PLP features and see if they help distinguish these regions from the speech regions.

^The above two approaches were mostly theoretical and didn't translate into a solution.
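
The [start, end] output described in point 1 could be produced from per-frame yes/no decisions roughly as follows. This is a sketch under assumptions: some earlier step (not shown) is assumed to have produced one boolean per fixed-length frame, and `flags_to_regions` is a hypothetical helper name.

```python
def flags_to_regions(flags, frame_seconds, min_duration=0.0):
    """Collapse per-frame booleans into [start, end] times in seconds."""
    regions = []
    run_start = None
    # A trailing False acts as a sentinel so a run that reaches the end is closed.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            start, end = run_start * frame_seconds, i * frame_seconds
            if end - start >= min_duration:  # drop runs that are too short
                regions.append([start, end])
            run_start = None
    return regions
```

The `min_duration` argument is one simple way to ignore isolated frames that trip the detector by accident.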

Reply by DaveATvocal, December 14, 2017

If you are trying to solve the problem generally, then you want generalized call progress detection.  Not every ringback will be the same, and there are also other sequences, such as busy, fast busy, etc.  The algorithm can entail tone detection, as well as power level and cadence tracking.

Reply by vmangipudi, December 14, 2017

Thank you, Dave. You make an interesting point. I just noticed that each "tring tring" can be different from another "tring tring".

International calls have a longer "tring tring"; domestic calls seem to have a shorter one.

Can you please point me to any resources on the suggested "generalized call progress detection"?

I had a similar idea: measure the power/amplitude/loudness at fixed intervals, the underlying hypothesis being that the tring tring etc. will have a fixed, unique value.

Reply by Cedron, December 14, 2017

The two biggest questions in an application like this are:

1) Does it need to be done real time?  Your answer is no.

2) Does it have to be done efficiently? You haven't provided an answer.

The latter question is key because there will always be a trade-off between how much processing you can throw at it versus false results.  With the example you have shown, I think your assumption that you can do it simply by analyzing the amplitudes and power levels over short intervals is spot on.

If your tring trings are always louder than your conversation, then finding them is merely a matter of setting a threshold slightly below the maximum found and scanning backwards till you find it.  A little further processing to find the alternating "stripes" would also be recommended.  This is done by using your amplitudes.  A more robust approach, but more processing-intensive, is to use FFTs to find those little white patches you see in your diagram.  They won't occur like that in music or conversation.
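
The threshold idea above could be sketched as follows, assuming per-frame levels in dB are already available and that the ringback really is the loudest material in the recording. `loud_frame_flags` and the default `margin_db` are hypothetical choices for illustration.

```python
def loud_frame_flags(levels_db, margin_db=3.0):
    """Flag frames whose level is within `margin_db` of the loudest frame."""
    if not levels_db:
        return []
    threshold = max(levels_db) - margin_db  # "slightly below the maximum found"
    return [level >= threshold for level in levels_db]
```

Runs of flagged frames separated by quiet gaps would correspond to the alternating "stripes" mentioned above.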

Conversations can generally be distinguished from music by the presence of pauses between words.  This is done by looking at power intervals.
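
One rough way to turn the pause observation into a number is to measure, per window of frames, what fraction of frames is much quieter than the loudest frame in that window. This is only a sketch: `pause_fraction`, the window length, and the 25 dB margin are assumptions, and real recordings would need tuning.

```python
def pause_fraction(levels_db, window_frames, pause_margin_db=25.0):
    """For each window, return the fraction of frames that look like pauses.

    A frame counts as a pause when it is more than `pause_margin_db` below
    the loudest frame in the same window.
    """
    fractions = []
    for start in range(0, len(levels_db) - window_frames + 1, window_frames):
        window = levels_db[start:start + window_frames]
        threshold = max(window) - pause_margin_db
        quiet = sum(1 for level in window if level < threshold)
        fractions.append(quiet / window_frames)
    return fractions
```

Under the assumption stated above, windows of conversation would tend to show a noticeably higher fraction than windows of continuous music.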

If you are open to contracting this out, I am currently looking for projects.  You can contact me via email at cedron at exede dot net.

Reply by vmangipudi, December 14, 2017

thank you for your inputs.

Reply by dgshaw6, December 14, 2017

The detection of the "tring tring" is a standard call progress detection problem.

There are four tones to look for: 350, 440, 480 and 620 Hz.

Each call progress is a combination of a pair of these tones.

1) Dial tone is 350+440 and is usually continuous.

2) Ringback (tring tring) is 440+480.  Cadence in the USA is 2 sec on, 4 sec off.  In England and many former colonies, there is usually a double tring tring in the first two seconds instead of only one continuous tone.

3) Reorder or busy is 480+620 and the meaning is contained in the cadence.  Fast busy means circuits are all used up (reorder), while slow busy means the user's phone is already off hook.
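
For reference, the tone pairs and the US ringback cadence listed above can be written down as data, together with a rough cadence check. This is only a sketch: the dictionary and `matches_cadence` are illustrative names, the 0.25 s tolerance is an assumed value, and the [start, end] regions are assumed to come from some earlier tone-detection step.

```python
# Tone pairs in Hz, as listed in this reply.
CALL_PROGRESS_TONES_HZ = {
    "dial tone": (350, 440),
    "ringback": (440, 480),
    "busy/reorder": (480, 620),
}


def matches_cadence(regions, on_seconds, off_seconds, tolerance=0.25):
    """Return True if consecutive [start, end] regions follow an on/off cadence.

    The last region's duration is not checked, since a ring may be cut short
    when the call is answered.
    """
    if len(regions) < 2:
        return False
    for (start, end), (next_start, _) in zip(regions, regions[1:]):
        if abs((end - start) - on_seconds) > tolerance:
            return False
        if abs((next_start - end) - off_seconds) > tolerance:
            return False
    return True
```

For US ringback the call would be `matches_cadence(regions, 2.0, 4.0)`; other countries would need their own on/off values.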

As mentioned by someone elsewhere in this discussion, the Goertzel algorithm is often used for the detection process.

More details about call progress and cadence: Call progress specification

I hope that this is helpful.

Reply by vmangipudi, December 14, 2017

Thank you for the well constructed and resourceful answer, kind Sir.

Reply by dgshaw6, December 14, 2017

You are most welcome.  I have built these detectors more times than I care to mention.

Goertzel detectors require some experimentation to get them right.  Depends on how accurate your timing needs to be.

If you have to be really accurate, you may need to run multiple copies of each frequency detector with different start times for the integrate and dump periods.

They are most often used for DTMF tones for regular telephony.

They are individual bins of a DFT, each at the desired frequency.

I view them as "driven" oscillators, so that if you drive them with exactly the right frequency, they will grow internal energy without bound.  This is the reason for having to "integrate and dump".  At the dump time, you measure how much energy grew within the detector, and decide if it was enough to say the said signal was present.
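
A minimal sketch of a Goertzel detector with integrate and dump, written in plain Python for clarity. The function names are illustrative, the input is assumed to be a list of float samples, and the decision threshold (how much energy counts as "present") is deliberately left out because it needs the experimentation mentioned above.

```python
import math


def goertzel_power(block, sample_rate, target_hz):
    """Return the squared magnitude of a single DFT-style bin at `target_hz`."""
    coeff = 2.0 * math.cos(2.0 * math.pi * target_hz / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in block:  # "integrate": one multiply and two adds per sample
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # "dump": turn the two state values into an energy estimate
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def block_powers(samples, sample_rate, target_hz, block_len):
    """Integrate and dump over consecutive blocks; one power value per block."""
    return [
        goertzel_power(samples[i:i + block_len], sample_rate, target_hz)
        for i in range(0, len(samples) - block_len + 1, block_len)
    ]
```

For a ringback decision, one plausible rule is to require that both the 440 Hz and 480 Hz powers are large relative to the total energy of the block; that rule is an assumption here, not something specified in this thread.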

For DTMF tones, the rules of engagement are:

1) Tone must be sent for at least 60 msec

2) Tone must be declared present if it is there for >~ 45 msec.

For DTMF discrimination - with sampling at 8 kHz - the best integrate and dump time is ~ 110 samples.  However, that is about 13 msec, so you need multiple detectors for each of the individual frequencies to achieve a detection start and stop time within about 5 msec.
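
The arithmetic here can be checked quickly: 110 samples at 8 kHz is 13.75 msec, so a single detector only produces a decision every 13.75 msec. A hedged sketch of the staggering idea follows, assuming the copies are simply offset by equal sample counts (the function name and the equal-spacing choice are assumptions, not something stated in this thread).

```python
import math


def staggered_start_offsets(block_len, sample_rate, resolution_s):
    """Start offsets (in samples) for copies of one integrate-and-dump detector."""
    block_s = block_len / sample_rate  # duration of one integrate-and-dump period
    copies = max(1, math.ceil(block_s / resolution_s))
    step = block_len // copies  # equal spacing between copies
    return [i * step for i in range(copies)]


offsets = staggered_start_offsets(110, 8000, 0.005)
```

With these numbers the sketch gives three copies per frequency, starting 36 samples (4.5 msec) apart.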

Too much information I know.

Goertzel detectors



Reply by dszabo, December 14, 2017

At risk of sounding naive, wouldn't it be about as efficient to multiply by the complex frequency to be detected, use a recursive moving average, then multiply by the conjugate complex frequency?  I feel like that would give you a bunch of (if not too much) data and be just as efficient?
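
A sketch of this mix-down idea, assuming float samples and an illustrative function name. It stops at the magnitude of the averaged baseband signal: multiplying back up by the conjugate would shift the result back to the tone frequency but would not change that magnitude, so for a presence/absence decision the last step is omitted here. The complex exponential is computed per sample for clarity rather than efficiency.

```python
import cmath
import math
from collections import deque


def mixed_down_envelope(samples, sample_rate, target_hz, window_len):
    """Per-sample envelope of the component near `target_hz`."""
    step = -2.0 * math.pi * target_hz / sample_rate
    window = deque([0j] * window_len, maxlen=window_len)
    running_sum = 0j
    envelope = []
    for n, x in enumerate(samples):
        mixed = x * cmath.exp(1j * step * n)  # shift target_hz down to 0 Hz
        running_sum += mixed - window[0]      # recursive moving-average update
        window.append(mixed)                  # the oldest value drops out
        envelope.append(abs(running_sum) / window_len)
    return envelope
```

For a unit-amplitude sine at the target frequency the envelope settles near 0.5 once the window is full, and it updates on every sample rather than once per block.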

Reply by dgshaw6, December 14, 2017

The Goertzel is very efficient.

It is effectively the left side of a biquad, with only one coefficient requiring a multiply, plus a couple of adds and two data element exchanges.

When the dump time occurs, there are a couple more multiplies.

Reply by dszabo, December 14, 2017

I thought you were implying that you were overlapping multiple instances of the filter to improve time domain resolution.  If that were the case, a complex multiply would be two multiplies and a pair of CICs would be two adds, two subtracts and two data exchanges.  Depends on how much decimation you want. Also, the CICs would chew a fair amount of RAM, and if you're counting clocks at 8 kHz, you might be saving bytes for a rainy day. Thanks though

Reply by dgshaw6, December 14, 2017

You make it sound simple, but a complex multiply by what?

How much has to be done to figure out the values of the complex coefficients for the multiplies?  Table lookup? Calculated sin and cos?

In the context of time resolution, if 10-20 msec is close enough then you only need one per tone.  We only need to run multiple instances if you need fairly accurate timing information.

Reply by dszabo, December 14, 2017

Fair enough. You mean trig functions aren’t free!?  I guess I was leaning towards precalculating, but since the sample rate isn’t divisible by the measured frequency, that sounds like a bear.  The irony is that I’ve got trig function evaluation coming up soon on my backlog for a project so you’d think it would have been closer to the front of my mind, but I guess I wasn’t thinking.  Thanks

Reply by dgshaw6, December 14, 2017

Hehe.  Been there and done that!!

No worries :-)

Reply by vmangipudi, December 14, 2017

@dgshaw6 @dszabo. 
Thank you for your inputs. What are your thoughts on detection and removal of hold music, i.e. being able to accurately cut the hold music out of the audio so that we are left with just the speech portions?