# Micro Speech Training

This example shows how to train a 20 kB model that can recognize 2 keywords,
"yes" and "no", from speech data.

If the input does not belong to either category, it is classified as "unknown",
and if the input is silent, it is classified as "silence".

You can retrain it to recognize any combination of words (2 or more) from this
list:

```
yes
no
up
down
left
right
on
off
stop
go
```

The scripts used in training the model have been sourced from the
[Simple Audio Recognition](https://www.tensorflow.org/tutorials/audio/simple_audio)
tutorial.

## Table of contents

-   [Overview](#overview)
-   [Training](#training)
-   [Trained Models](#trained-models)
-   [Model Architecture](#model-architecture)
-   [Dataset](#dataset)
-   [Preprocessing Speech Input](#preprocessing-speech-input)
-   [Other Training Methods](#other-training-methods)

## Overview

1.  Dataset: Speech Commands, Version 2.
    ([Download Link](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz),
    [Paper](https://arxiv.org/abs/1804.03209))
2.  Dataset Type: **Speech**
3.  Deep Learning Framework: **TensorFlow 1.5**
4.  Language: **Python 3.7**
5.  Model Size: **<20 kB**
6.  Model Category: **Multiclass Classification**

## Training

Train the model in the cloud using Google Colaboratory or locally using a
Jupyter Notebook.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/train/train_micro_speech_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Google Colaboratory</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/train/train_micro_speech_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />Jupyter Notebook</a>
  </td>
</table>

*Estimated Training Time: ~2 Hours.*

For more options, refer to the [Other Training Methods](#other-training-methods)
section.

## Trained Models

| Download Link                                                                                                                  |
| :----------------------------------------------------------------------------------------------------------------------------|
| [speech_commands.zip](https://storage.googleapis.com/download.tensorflow.org/models/tflite/micro/micro_speech_2020_04_13.zip) |

The `models` directory in the above zip file can be generated by following the
instructions in the [Training](#training) section above. It includes the
following 3 model files:

| Name                      | Format                        | Target Framework                     | Target Device             |
| :------------------------ | :---------------------------- | :----------------------------------- | :------------------------ |
| `model.pb`                | Frozen GraphDef               | TensorFlow                           | Large-Scale/Cloud/Servers |
| `model.tflite` *(<20 kB)* | Fully Quantized* TFLite Model | TensorFlow Lite                      | Mobile Devices            |
| `model.cc`                | C Source File                 | TensorFlow Lite for Microcontrollers | Microcontrollers          |

*\*Fully quantized implies that the model is **strictly int8** quantized,
**including** the input(s) and output(s).*
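
Since the `.tflite` model is fully quantized, both its input and output tensors
use `int8`. As a quick check, you can inspect (and run) the model with the
TensorFlow Lite Python interpreter. This is a minimal sketch; the
`models/model.tflite` path assumes you have extracted the zip file above:

```python
import numpy as np
import tensorflow as tf

# Load the fully quantized model (path assumes the extracted zip's models/ dir).
interpreter = tf.lite.Interpreter(model_path="models/model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
# Both should report int8, per the footnote above.
print("input:", input_details["dtype"], input_details["shape"])
print("output:", output_details["dtype"], output_details["shape"])

# Run a single inference on a dummy int8 spectrogram of the expected shape.
dummy_input = np.zeros(input_details["shape"], dtype=np.int8)
interpreter.set_tensor(input_details["index"], dummy_input)
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])
print("class scores:", scores)
```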

## Model Architecture

This is a simple model comprising a Convolutional 2D layer, a Fully Connected
Layer or a MatMul Layer (output: logits) and a Softmax layer
(output: probabilities) as shown below. Refer to the [`tiny_conv`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/models.py#L673)
model architecture.

![model_architecture.png](../images/model_architecture.png)

*This image was derived from visualizing the 'model.tflite' file in
[Netron](https://github.com/lutzroeder/netron).*

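For orientation, here is a minimal Keras sketch of a `tiny_conv`-style
topology. The layer sizes are illustrative assumptions only; the canonical
definition is in the `models.py` file linked above:

```python
import tensorflow as tf

# Illustrative tiny_conv-style model: Conv2D -> Dense (logits) -> Softmax.
# Input is a single-channel spectrogram "image"; sizes here are assumptions.
def build_tiny_conv(input_shape=(49, 43, 1), num_classes=4):
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # One 2D convolution over the spectrogram.
        tf.keras.layers.Conv2D(8, kernel_size=(10, 8), strides=(2, 2),
                               padding="same", activation="relu"),
        tf.keras.layers.Flatten(),
        # Fully connected (MatMul) layer producing logits.
        tf.keras.layers.Dense(num_classes),
        # Softmax producing probabilities, e.g. for
        # "silence", "unknown", "yes", "no".
        tf.keras.layers.Softmax(),
    ])

model = build_tiny_conv()
model.summary()
```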

This doesn't produce a highly accurate model, but it's designed to be used as
the first stage of a pipeline, running on a low-energy piece of hardware that
can always be on, and then waking higher-power chips when a possible utterance
has been found, so that more accurate analysis can be done. Additionally, the
model takes in preprocessed speech input, which lets a simpler model deliver
accurate results.

## Dataset

The Speech Commands Dataset ([Download Link](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz),
[Paper](https://arxiv.org/abs/1804.03209)) consists of over 105,000 WAVE audio
files of people saying thirty different words. This data was collected by
Google and released under a CC BY license. You can help improve it by
contributing five minutes of your own voice. The archive is over 2 GB, so this
part may take a while, but you should see progress logs, and once it's been
downloaded you won't need to do this again.

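If you want to fetch the archive yourself rather than letting the training
scripts download it, a small sketch using Keras utilities is shown below (the
mirror URL and destination directory are assumptions):

```python
import tensorflow as tf

# Download and extract the Speech Commands v2 archive (~2 GB).
# The URL below is the direct storage.googleapis.com mirror of the dataset.
path = tf.keras.utils.get_file(
    "speech_commands_v0.02.tar.gz",
    origin="https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz",
    extract=True,
    cache_subdir="speech_dataset",
)
print("Archive downloaded to:", path)
```
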
## Preprocessing Speech Input

In this section we discuss spectrograms, the preprocessed speech input to the
model. Here's an illustration of the process:

![spectrogram diagram](https://storage.googleapis.com/download.tensorflow.org/example_images/spectrogram_diagram.png)

The model doesn't take in raw audio sample data. Instead, it works with
spectrograms: two-dimensional arrays made up of slices of frequency
information, each taken from a different time window.

The recipe for creating the spectrogram data is that each frequency slice is
created by running an FFT across a 30ms section of the audio sample data. The
input samples are treated as being between -1 and +1 as real values (encoded as
-32,768 and 32,767 in 16-bit signed integer samples).

This results in an FFT with 256 entries. Every sequence of six entries is
averaged together, giving a total of 43 frequency buckets in the final slice.
The results are stored as unsigned eight-bit values, where 0 represents a real
number of zero, and 255 represents 127.5 as a real number.

Each adjacent frequency entry is stored in ascending memory order (frequency
bucket 0 at data[0], bucket 1 at data[1], etc). The window for the frequency
analysis is then moved forward by 20ms, and the process repeated, storing the
results in the next memory row (for example bucket 0 in this moved window would
be in data[43 + 0], etc). This process happens 49 times in total, producing a
single-channel image that is 43 pixels wide and 49 rows high.

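To make the recipe concrete, here is a rough NumPy sketch of the steps
described above. The exact parameter values (FFT size, padding of the last
averaging group) are assumptions; the reference implementation is the
`wav_to_features.py` script invoked below with `--preprocess=average`:

```python
import numpy as np

SAMPLE_RATE = 16000                        # one second of 16 kHz audio
WINDOW = SAMPLE_RATE * 30 // 1000          # 30 ms analysis window (480 samples)
STRIDE = SAMPLE_RATE * 20 // 1000          # 20 ms hop between windows
FFT_SIZE = 512                             # assumed FFT length
POOL = 6                                   # average every 6 frequency entries
BUCKETS = 43                               # frequency buckets per slice

def audio_to_spectrogram(samples_int16):
    """Turn one second of 16-bit PCM audio into a 49 x 43 uint8 spectrogram."""
    # Treat the int16 samples as real values between -1 and +1.
    samples = samples_int16.astype(np.float32) / 32768.0
    rows = []
    for start in range(0, len(samples) - WINDOW + 1, STRIDE):
        frame = samples[start:start + WINDOW]
        # Magnitude FFT of the 30 ms window.
        spectrum = np.abs(np.fft.rfft(frame, n=FFT_SIZE))
        # Average adjacent entries in groups of six to get 43 buckets
        # (the last group is zero-padded here; an assumption).
        padded = np.pad(spectrum, (0, BUCKETS * POOL - spectrum.size))
        rows.append(padded.reshape(BUCKETS, POOL).mean(axis=1))
    spectrogram = np.stack(rows)           # shape (49, 43): 49 time rows
    # Quantize to uint8: 0.0 -> 0 and 127.5 -> 255 (larger values are clipped).
    return np.clip(spectrogram * (255.0 / 127.5), 0, 255).astype(np.uint8)

# Example with random noise standing in for real audio.
fake_audio = np.random.randint(-32768, 32767, SAMPLE_RATE, dtype=np.int16)
print(audio_to_spectrogram(fake_audio).shape)  # (49, 43)
```
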
In a complete application these spectrograms would be calculated at runtime from
microphone inputs, but the code for doing that is not yet included in this
sample code. The test uses spectrograms that have been pre-calculated from
one-second WAV files in the test dataset, generated by running the following
commands:

```
python tensorflow/tensorflow/examples/speech_commands/wav_to_features.py \
--input_wav=/tmp/speech_dataset/yes/f2e59fea_nohash_1.wav \
--output_c_file=/tmp/yes_features_data.cc \
--window_stride=20 --preprocess=average --quantize=1

python tensorflow/tensorflow/examples/speech_commands/wav_to_features.py \
--input_wav=/tmp/speech_dataset/no/f9643d42_nohash_4.wav \
--output_c_file=/tmp/no_features_data.cc \
--window_stride=20 --preprocess=average --quantize=1
```

## Other Training Methods

### Use [Google Cloud](https://cloud.google.com/)

*Note: Google Cloud isn't free. You pay depending on how long you run the VM
and what resources you use.*

1. Create a Virtual Machine (VM) using a pre-configured Deep Learning VM Image.

```
export IMAGE_FAMILY="tf-latest-cpu"
export ZONE="us-west1-b" # or any other zone
export INSTANCE_NAME="model-trainer"
export INSTANCE_TYPE="n1-standard-8" # or any other instance type
gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=120GB \
        --min-cpu-platform=Intel\ Skylake
```

2. As soon as the instance has been created, you can SSH into it:

```
gcloud compute ssh "jupyter@${INSTANCE_NAME}"
```

3. Train a model by following the instructions in the [`train_micro_speech_model.ipynb`](train_micro_speech_model.ipynb)
Jupyter notebook.

4. Finally, don't forget to delete the instance when training is done:

```
gcloud compute instances delete "${INSTANCE_NAME}" --zone="${ZONE}"
```