# Audio "frontend" TensorFlow operations for feature generation

The feature generation module (also called the frontend) is the most common
component of audio processing pipelines. It receives raw audio input and
produces filter banks (a vector of values).

More specifically, the audio signal optionally goes through a pre-emphasis
filter; it is then sliced into (overlapping) frames and a window function is
applied to each frame; afterwards, we take a Fourier transform of each frame
(more specifically a Short-Time Fourier Transform) and calculate the power
spectrum; and finally we compute the filter banks.
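The stages above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the library's actual implementation: the 16 kHz sample rate, the Hann window, and the textbook mel-spaced triangular filters are all assumptions made for the sketch.

```python
import numpy as np

def mel_filterbank(num_channels, fft_bins, sample_rate):
    # Textbook mel-spaced triangular filters (an assumption, for illustration).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2),
                       num_channels + 2)
    bins = np.floor((fft_bins * 2 - 2) * mel_to_hz(mels) / sample_rate).astype(int)
    fb = np.zeros((num_channels, fft_bins))
    for ch in range(num_channels):
        left, center, right = bins[ch], bins[ch + 1], bins[ch + 2]
        for k in range(left, center):          # rising edge of the triangle
            fb[ch, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fb[ch, k] = (right - k) / max(right - center, 1)
    return fb

def frontend(audio, sample_rate=16000, window_ms=25, step_ms=10,
             preemphasis=0.97, num_channels=40):
    # 1. Optional pre-emphasis filter: y[t] = x[t] - k * x[t-1].
    audio = np.append(audio[0], audio[1:] - preemphasis * audio[:-1])
    # 2. Slice into overlapping frames and apply a window function.
    win = sample_rate * window_ms // 1000
    step = sample_rate * step_ms // 1000
    n_frames = 1 + (len(audio) - win) // step
    frames = np.stack([audio[i * step:i * step + win] for i in range(n_frames)])
    frames = frames * np.hanning(win)
    # 3. Short-Time Fourier Transform and power spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 4. Filter banks: one vector of `num_channels` values per frame,
    #    with logarithmic scaling at the end.
    fb = mel_filterbank(num_channels, power.shape[1], sample_rate)
    return np.log(power @ fb.T + 1e-6)
```

For 200ms of 16 kHz audio (3200 samples) this yields an `[18, 40]` feature matrix, consistent with the sizing example discussed below.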

## Operations

Here we provide implementations of both TensorFlow and TensorFlow Lite
operations that encapsulate the functionality of the audio frontend.

Both frontend Ops receive audio data and produce as many unstacked frames
(filter banks) as the input audio allows, according to the configuration.

The processing uses a lightweight library to perform:

1. A slicing window function
2. Short-time FFTs
3. Filterbank calculations
4. Noise reduction
5. Auto Gain Control
6. Logarithmic scaling

Please refer to the Op's documentation for details on the different
configuration parameters.

However, it is important to clarify the contract of the Ops:

> *A frontend Op will produce as many unstacked frames as possible with the
> given audio input.*

This means:

1. The output is a rank-2 Tensor, where each row corresponds to the
   sequence/time dimension, and each column is the feature dimension.
2. It is expected that the Op will receive the right input (in terms of
   positioning in the audio stream, and the amount) needed to produce the
   expected output.
3. Thus, any logic to slice, cache, or otherwise rearrange the input and/or
   output of the operation must be handled externally in the graph.

For example, a 200ms audio input will produce an output tensor of shape
`[18, num_channels]` when configured with `window_size=25ms` and
`window_step=10ms`. The reason is that after reaching the 180ms point in the
audio there is not enough audio left to construct a complete window.
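The frame count in this example follows from simple arithmetic. A small sketch (the function name and millisecond units are illustrative, not part of the Op's API):

```python
def num_frames(audio_ms, window_ms=25, step_ms=10):
    """Number of complete windows obtainable from the given audio."""
    if audio_ms < window_ms:
        return 0
    # One window at offset 0, plus one per full step that still fits.
    return 1 + (audio_ms - window_ms) // step_ms

print(num_frames(200))  # 18: the window starting at 180ms cannot be completed
```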

For both functional and efficiency reasons, we provide the following
functionality related to input processing:

**Padding.** A boolean flag `zero_padding` that indicates whether to pad the
audio with zeros such that we generate output frames based on the `window_step`.
This means that in the example above, we would generate a tensor of shape
`[20, num_channels]` by appending enough zeros to step over all the available
audio and still create complete windows (some windows will simply contain
zeros; in the example above, frames 19 and 20 will have the equivalent of 5ms
and 15ms of zeros, respectively).
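The padded frame count, and how much of each padded frame is zeros, can be checked with a short calculation (an illustrative helper, not part of the Op's API):

```python
def padded_frames(audio_ms, window_ms=25, step_ms=10):
    """Frame count with zero_padding on, plus the zeros (in ms) per frame."""
    n = -(-audio_ms // step_ms)  # ceil: step over all the available audio
    # A frame starting at i*step_ms needs audio up to i*step_ms + window_ms;
    # anything past audio_ms is filled with zeros.
    zeros = [max(0, i * step_ms + window_ms - audio_ms) for i in range(n)]
    return n, zeros

n, zeros = padded_frames(200)
# n == 20; frames 19 and 20 contain 5ms and 15ms of zeros, respectively
```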

<!-- TODO
Stacking. An integer that indicates how many contiguous frames to stack in the
output tensor's first dimension, such that the tensor is shaped
[-1, stack_size * num_channels]. For example, if the stack_size is 3, the
example above would produce an output tensor shaped [18, 120] if padding is
false, and [20, 120] if padding is set to true.
-->

**Striding.** An integer `frame_stride` that indicates the striding step used to
generate the output tensor, thus determining the second dimension. In the
example above, with a `frame_stride=3`, the output tensor would have a shape of
`[6, 120]` when `zero_padding` is set to false, and `[7, 120]` when
`zero_padding` is set to true.
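The shapes in the striding example can be reproduced with the sketch below. Note that the second dimension `120` implies `frame_stride * num_channels` columns with `num_channels = 40`; that channel count is an assumption made to match the example, not something stated above.

```python
def strided_shape(n_frames, frame_stride, num_channels=40):
    """Output shape when striding over frames (num_channels=40 is assumed)."""
    rows = -(-n_frames // frame_stride)  # ceil: partial last group still emitted
    return (rows, frame_stride * num_channels)

print(strided_shape(18, 3))  # (6, 120): zero_padding false
print(strided_shape(20, 3))  # (7, 120): zero_padding true
```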

<!-- TODO
Note we would not expect the striding step to be larger than the stack_size
(should we enforce that?).
-->