
WP2000 - Codec 2 Algorithm Description #31

Merged (41 commits) on Dec 11, 2023

Conversation

@drowe67 (Owner) commented Nov 17, 2023

#20

@tmiw (Collaborator) left a comment

Looks good currently. Will other sections be handled in this PR too, or in another one?

@drowe67 (Owner, Author) commented Nov 18, 2023

@tmiw - oh I'm sorry, did I press the review button by mistake? Or maybe GitHub requests a review automatically? Anyway, I've only just started. It's very much a WIP and doesn't need any review at this stage.

@tmiw (Collaborator) commented Nov 18, 2023

> @tmiw - oh I'm sorry, did I press the review button by mistake? Or maybe GitHub requests a review automatically? Anyway, I've only just started. It's very much a WIP and doesn't need any review at this stage.

I think I get emails whenever a PR gets created, not just when requested for review. Sorry for the confusion!

@drowe67 (Owner, Author) commented Nov 19, 2023

> > @tmiw - oh I'm sorry, did I press the review button by mistake? Or maybe GitHub requests a review automatically? Anyway, I've only just started. It's very much a WIP and doesn't need any review at this stage.
>
> I think I get emails whenever a PR gets created, not just when requested for review. Sorry for the confusion!

That's fine. Actually at some stage it would be good to get feedback from Hams on what they would like to know about Codec 2, so I can answer the most common questions. Perhaps after a first draft of the document is ready.

@tmiw (Collaborator) left a comment

Added comments based on this initial pass.

doc/codec2.tex Outdated
@@ -52,7 +52,7 @@ \section{Codec 2 for the Radio Amateur}

\subsection{Model Based Speech Coding}

A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at an 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.

Suggested change
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of "what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
A speech codec takes speech samples from an A/D converter (e.g. 16 bit samples at 8 kHz or 128 kbits/s) and compresses them down to a low bit rate that can be more easily sent over a narrow bandwidth channel (e.g. 700 bits/s for HF). Speech coding is the art of ``what can we throw away". We need to lower the bit rate of the speech while retaining speech you can understand, and making it sound as natural as possible.
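As an aside, the rates quoted in this paragraph can be sanity-checked with a few lines of arithmetic (only numbers from the text above are used):

```python
# Uncompressed rate of the A/D samples described in the text:
# 16-bit samples at an 8 kHz sample rate.
bits_per_sample = 16
sample_rate_hz = 8000
uncompressed_bps = bits_per_sample * sample_rate_hz  # 128,000 bits/s, i.e. 128 kbits/s

# Target rate for a narrow HF channel, per the text.
codec_bps = 700

# Roughly 183:1 compression is required.
compression_ratio = uncompressed_bps / codec_bps
print(uncompressed_bps)          # 128000
print(round(compression_ratio))  # 183
```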

doc/codec2.tex Outdated
@@ -83,7 +83,7 @@ \subsection{Sinusoidal Speech Coding}
A sinewave will cause a spike or spectral line on a spectrum plot, so we can see each spike as a small sine wave generator. Each sine wave generator has it's own frequency that are all multiples of the fundamental pitch frequency (e.g. $230, 460, 690,...$ Hz). They will also have their own amplitude and phase. If we add all the sine waves together (Figure \ref{fig:sinusoidal_model}) we can produce reasonable quality synthesised speech. This is called sinusoidal speech coding and is the speech production ``model" at the heart of Codec 2.

\begin{figure}[h]
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves.}
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
Collaborator:

Adding spacing before kHz to be consistent with other mentions:

Suggested change
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4kHz.}
\caption{The sinusoidal speech model. If we sum a series of sine waves, we can generate a speech signal. Each sinewave has it's own amplitude ($A_1,A_2,... A_L$), frequency, and phase (not shown). We assume the frequencies are multiples of the fundamental frequency $F_0$. $L$ is the total number of sinewaves we can fit in 4 kHz.}
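The summation described in this caption is easy to sketch. The following is a toy illustration of the sinusoidal model, not Codec 2's actual synthesis code; the amplitude and phase values are made up:

```python
import math

def synthesise(A, phi, F0, fs=8000, n_samples=320):
    """Sum L harmonic sinewaves (the sinusoidal model) into a signal:
    s[n] = sum over l of A[l] * cos(2*pi*l*F0/fs * n + phi[l])."""
    L = len(A)
    return [
        sum(A[l] * math.cos(2 * math.pi * (l + 1) * F0 / fs * n + phi[l])
            for l in range(L))
        for n in range(n_samples)
    ]

# Toy example: F0 = 230 Hz, so harmonics at 230, 460, 690, ... Hz.
# With a 4 kHz bandwidth, L = floor(4000 / 230) = 17 harmonics fit.
F0 = 230
L = int(4000 // F0)
A = [1.0 / (l + 1) for l in range(L)]  # made-up amplitudes
phi = [0.0] * L                        # made-up phases
s = synthesise(A, phi, F0)             # 40 ms of samples at 8 kHz
```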

doc/codec2.tex Outdated
\draw [->] (3,2) -- (4,2);
\draw [xshift=4.2cm,yshift=2cm,color=blue] plot[smooth] file {hts2a_37_sn.txt};

\end{tikzpicture}
\end{center}
\end{figure}

The model parameters evolve over time, but can generally be considered constant for short snap shots in time (a few 10s of ms). For example pitch evolves time, moving up or down as a word is articulated.
The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.

Suggested change
The model parameters evolve over time, but can generally be considered constant for short time window (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.
The model parameters evolve over time, but can generally be considered constant for short time windows (a few 10s of ms). For example pitch evolves over time, moving up or down as a word is articulated.

doc/codec2.tex Outdated

Once we have the desired frame rate, we ``quantise"" each model parameter. This means we use a fixed number of bits to represent it, so we can send the bits over the channel. Parameters like pitch and voicing are fairly easy, but quite a bit of DSP goes into quantising the spectral amplitudes. For the higher bit rate Codec 2 modes, we design a filter that matches the spectral amplitudes, then send a quantised version of the filter over the channel. Using the example in Figure \ref{fig:hts2a_time} - the filter would have a band pass peaks at 500 and 2300 Hz. It's frequency response would follow the red line. The filter is time varying - we redesign it for every frame.

You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.

Suggested change
You'll notice the term "estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
You'll notice the term ``estimate" being used a lot. One of the problems with model based speech coding is the algorithms we use to extract the model parameters are not perfect. Occasionally the algorithms get it wrong. Look at the red crosses on the bottom plot of Figure \ref{fig:hts2a_time}. These mark the amplitude estimate of each harmonic. If you look carefully, you'll see that above 2000Hz, the crosses fall a little short of the exact centre of each harmonic. This is an example of a ``fine" pitch estimator error, a little off the correct value.
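The "fixed number of bits" idea in the quantisation paragraph above can be illustrated with a toy uniform scalar quantiser. This is not one of Codec 2's actual quantisers; the 50-400 Hz range and 7-bit width are made-up values for illustration:

```python
def quantise(value, lo, hi, n_bits):
    """Map a value in [lo, hi] to an n-bit index (what gets sent)."""
    levels = 2 ** n_bits
    step = (hi - lo) / levels
    clamped = min(max(value, lo), hi - 1e-9)
    return int((clamped - lo) / step)

def dequantise(index, lo, hi, n_bits):
    """Decoder side: reconstruct the mid-point of the chosen step."""
    step = (hi - lo) / 2 ** n_bits
    return lo + (index + 0.5) * step

# Toy example: send a 230 Hz pitch estimate using 7 bits over 50-400 Hz.
idx = quantise(230.0, 50.0, 400.0, 7)
f0_hat = dequantise(idx, 50.0, 400.0, 7)
# f0_hat is within half a step (350/128/2, about 1.4 Hz) of 230 Hz
```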

doc/codec2.tex Outdated

Table \ref{tab:bit_allocation} presents the bit allocation for two popular Codec 2 modes. One additional parameter is the frame energy, this is the average level of the spectral amplitudes, or ``AF gain" of the speech frame.

At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
Collaborator:

This is the first mention of specific modes rather than just bit rates, as far as I can tell offhand. This should be reworded, and the modes themselves introduced earlier in the document, so that the reader can properly associate e.g. 700C with "very low bit rate".

Collaborator:

Actually, "Detailed Design" below talks about specific modes. Maybe this should just be "700 bits/second", i.e.

Suggested change
At very low bit rates such as 700C, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.
At very low bit rates such as 700 bits/second, we use Vector Quantisation (VQ) to represent the spectral amplitudes. We construct a table such that each row of the table has a set of spectral amplitude samples. In Codec 2 700C the table has 512 rows. During the quantisation process, we choose the table row that best matches the spectral amplitudes for this frame, then send the \emph{index} of the table row. The decoder has a similar table, so can use the index to look up the output values. If the table is 512 rows, we can use a 9 bit number to quantise the spectral amplitudes. In Codec 2 700C, we use two tables of 512 entries each (18 bits total), the second one helps fine tune the quantisation from the first table.

Owner Author:

Yes, I think I'll move some of the Detailed Design intro info right up the top, and explain the modes/700C nomenclature there.
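The VQ scheme described in the paragraph under review can be sketched as a nearest-neighbour table search. This is a minimal single-stage illustration, not the Codec 2 700C implementation; the codebook values are random placeholders:

```python
import random

random.seed(0)

# Codebook ("table"): 512 rows, each a vector of spectral amplitude samples.
ROWS, DIM = 512, 20
codebook = [[random.uniform(0.0, 1.0) for _ in range(DIM)] for _ in range(ROWS)]

def vq_encode(vec, table):
    """Return the index of the table row closest (squared error) to vec.
    With 512 rows the index fits in 9 bits, which is all we transmit."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, vec))
    return min(range(len(table)), key=lambda i: dist(table[i]))

def vq_decode(index, table):
    """Decoder side: look the vector straight back up in its copy of the table."""
    return table[index]

target = codebook[123]                # pretend frame of spectral amplitudes
idx = vq_encode(target, codebook)     # -> 123, a 9-bit index
amps_hat = vq_decode(idx, codebook)   # decoder recovers the row exactly
```

A second 512-row table refining the residual of the first (as the paragraph describes for 700C) would add another 9-bit index, giving the 18 bits total mentioned above.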

doc/codec2.tex Outdated

Some features of Codec 2:
\begin{enumerate}
\item A range of modes supporting different bit rates, currently (Nov 2023): 3200, 2400, 1600, 1400, 1300, 1200, 700 bits/s. These are referred to as ``Codec 2 3200", ``Codec 700C"" etc.
Collaborator:

The use of "C" in "Codec2 700C" isn't exactly clear, especially since some Codec2 modes use letters and some don't. Suggest clarifying here.

doc/codec2.tex Outdated
@@ -100,7 +93,7 @@ \subsection{Speech in Time and Frequency}

Note that each harmonic has it's own amplitude, that varies across frequency. The red line plots the amplitude of each harmonic. In this example there is a peak around 500 Hz, and another, broader peak around 2300 Hz. The ear perceives speech by the location of these peaks and troughs.

\begin{figure}[H]
\begin{figure}
\caption{ A 40ms segment from the word "these" from a female speaker, sampled at 8kHz. Top is a plot again time, bottom (blue) is a plot against frequency. The waveform repeats itself every 4.3ms ($F_0=230$ Hz), this is the "pitch period" of this segment.}
Collaborator:

Double check all usages of double-quotes and replace opening quotes with `` as appropriate.

@drowe67 (Owner, Author) commented Dec 10, 2023

@tmiw - I've set up a Makefile to build the document automatically, so we can make sure it doesn't suffer from bit rot. What do you think about:

  1. Should we build the doc as part of the ctests, or have a separate github actions hook? It will require a few more packages to be installed.
  2. Like freedv-gui - it will generate another codec2.pdf that clashes with the committed version. However I like the idea of having the PDF ready to read, and not require an end user to build it.
  3. Note the pdflatex build stuff is super verbose, but the doc is building OK for me on two machines.

@Tyrbiter - I've kicked off another review WP in #37, or feel free to review in this PR. I'm doing some proof reading myself, but I know I'll miss stuff.

@drowe67 (Owner, Author) commented Dec 10, 2023

OK, I've added the doc building as a ctest, as it needs c2sim built anyway.

@tmiw (Collaborator) commented Dec 10, 2023

> @tmiw - I've set up a Makefile to build the document automatically, so we can make sure it doesn't suffer from bit rot. What do you think about:
>
> 1. Should we build the doc as part of the ctests, or have a separate github actions hook? It will require a few more packages to be installed.
> 2. Like freedv-gui - it will generate another codec2.pdf that clashes with the committed version. However I like the idea of having the PDF ready to read, and not require an end user to build it.
> 3. Note the pdflatex build stuff is super verbose, but the doc is building OK for me on two machines.

FWIW, freedv-gui uses GitHub actions for this but doesn't actually generate the PDF/HTML for the user manual until PRs are merged to master to avoid having to constantly deal with merge conflicts. It would probably be a good idea to also either have a ctest to verify document changes or somehow suppress the additional check-in for the module in the GitHub action unless the PR is being merged.

@drowe67 (Owner, Author) commented Dec 10, 2023

@tmiw - The doc ctest is bombing on GitHub, but works OK for me at home on three Ubuntu 22 machines. Could you please take a look and see if there are any obvious issues?

@drowe67 (Owner, Author) commented Dec 10, 2023

> > @tmiw - I've set up a Makefile to build the document automatically, so we can make sure it doesn't suffer from bit rot. What do you think about:
> >
> > 1. Should we build the doc as part of the ctests, or have a separate github actions hook? It will require a few more packages to be installed.
> > 2. Like freedv-gui - it will generate another codec2.pdf that clashes with the committed version. However I like the idea of having the PDF ready to read, and not require an end user to build it.
> > 3. Note the pdflatex build stuff is super verbose, but the doc is building OK for me on two machines.
>
> FWIW, freedv-gui uses GitHub actions for this but doesn't actually generate the PDF/HTML for the user manual until PRs are merged to master to avoid having to constantly deal with merge conflicts. It would probably be a good idea to also either have a ctest to verify document changes or somehow suppress the additional check-in for the module in the GitHub action unless the PR is being merged.

The doc has its own Makefile, so I was thinking of:

  1. If you really want to create a new codec2.pdf (say after editing) run the Makefile from the doc dir:
    cd ~/codec2/doc
    make
    
  2. If we are just testing the doc build procedure run the ctest:
    cd ~/codec2/build_linux
    ctest -R test_codec2_doc
    
    We could configure the ctest version to write the codec2.pdf to a temp dir so Git doesn't see it as a changed file.

@drowe67 merged commit 93dbb62 into main on Dec 11, 2023 (2 checks passed)
@tmiw (Collaborator) commented Dec 12, 2023

@drowe67, did you still want me to investigate why the ctest for the documentation isn't running in the GitHub environment? I haven't had time to get around to it yet but just noticed that you merged.

@drowe67 (Owner, Author) commented Dec 12, 2023

> @drowe67, did you still want me to investigate why the ctest for the documentation isn't running in the GitHub environment? I haven't had time to get around to it yet but just noticed that you merged.

Sure if you want to that would be great. I've bumped that task to the further work WP: #37
