# Micro Speech Training

This example shows how to train a 20 kB model that can recognize 2 keywords,
"yes" and "no", from speech data.

If the input does not belong to either category it is classified as "unknown",
and if the input is silent it is classified as "silence".

You can retrain it to recognize any combination of words (2 or more) from this
list:

```
yes
no
up
down
left
right
on
off
stop
go
```

The scripts used in training the model have been sourced from the
[Simple Audio Recognition](https://www.tensorflow.org/tutorials/audio/simple_audio)
tutorial.

## Table of contents

- [Overview](#overview)
- [Training](#training)
- [Trained Models](#trained-models)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Preprocessing Speech Input](#preprocessing-speech-input)
- [Other Training Methods](#other-training-methods)

## Overview

1. Dataset: Speech Commands, Version 2.
   ([Download Link](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz),
   [Paper](https://arxiv.org/abs/1804.03209))
2. Dataset Type: **Speech**
3. Deep Learning Framework: **TensorFlow 1.5**
4. Language: **Python 3.7**
5. Model Size: **<20 kB**
6. Model Category: **Multiclass Classification**

## Training

Train the model in the cloud using Google Colaboratory or locally using a
Jupyter Notebook.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/train/train_micro_speech_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Google Colaboratory</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/micro_speech/train/train_micro_speech_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />Jupyter Notebook</a>
  </td>
</table>

*Estimated Training Time: ~2 Hours.*

For more options, refer to the [Other Training Methods](#other-training-methods)
section.

## Trained Models

| Download Link | [speech_commands.zip](https://storage.googleapis.com/download.tensorflow.org/models/tflite/micro/micro_speech_2020_04_13.zip) |
| ------------- | ------------- |

The `models` directory in the above zip file can be generated by following the
instructions in the [Training](#training) section above. It includes the
following 3 model files:

| Name                      | Format                        | Target Framework                     | Target Device             |
| :------------------------ | :---------------------------- | :----------------------------------- | :------------------------ |
| `model.pb`                | Frozen GraphDef               | TensorFlow                           | Large-Scale/Cloud/Servers |
| `model.tflite` *(<20 kB)* | Fully Quantized* TFLite Model | TensorFlow Lite                      | Mobile Devices            |
| `model.cc`                | C Source File                 | TensorFlow Lite for Microcontrollers | Microcontrollers          |

\**Fully Quantized* implies that the model is **strictly int8** quantized,
**including** the input(s) and output(s).

## Model Architecture

This is a simple model comprising a Convolutional 2D layer, a Fully Connected
(MatMul) layer that outputs logits, and a Softmax layer that outputs
probabilities, as shown below.
Refer to the [`tiny_conv`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/models.py#L673)
model architecture.

![model_architecture.png](../images/model_architecture.png)

*This image was derived by visualizing the `model.tflite` file in
[Netron](https://github.com/lutzroeder/netron).*
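For intuition, here is a rough Keras sketch of that architecture. It is
illustrative only, not the training code: the `(49, 43, 1)` input shape comes
from the spectrogram layout described in
[Preprocessing Speech Input](#preprocessing-speech-input) below, and the four
output classes ("silence", "unknown", "yes", "no") are an assumption for the
default "yes"/"no" setup; the authoritative definition is the linked
`tiny_conv` code in `models.py`.

```python
# Illustrative sketch only -- not the actual training code.
import tensorflow as tf

model = tf.keras.Sequential([
    # 8 filters of size 10 (time) x 8 (frequency), sliding with a 2x2 stride
    # over the 49x43 single-channel spectrogram.
    tf.keras.layers.Conv2D(filters=8, kernel_size=(10, 8), strides=(2, 2),
                           padding="same", activation="relu",
                           input_shape=(49, 43, 1)),
    tf.keras.layers.Dropout(0.5),   # active during training only
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4),       # fully connected / MatMul -> logits
    tf.keras.layers.Softmax(),      # -> probabilities
])
model.summary()
```

With these shapes the model has roughly 18k parameters, which at one byte per
weight after int8 quantization is consistent with the <20 kB size quoted above.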
This doesn't produce a highly accurate model, but it's designed to be used as
the first stage of a pipeline: it runs on a low-energy piece of hardware that
can always be on, and wakes higher-power chips only when a possible utterance
has been found, so that more accurate analysis can be done. Additionally,
because the model takes in preprocessed speech input, a comparatively simple
model can still produce accurate results.

## Dataset

The Speech Commands Dataset ([Download Link](https://storage.cloud.google.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz),
[Paper](https://arxiv.org/abs/1804.03209)) consists of over 105,000 WAVE audio
files of people saying thirty different words. This data was collected by
Google and released under a CC BY license. You can help improve it by
contributing five minutes of your own voice. The archive is over 2GB, so this
part may take a while, but you should see progress logs, and once it's been
downloaded you won't need to do this again.

## Preprocessing Speech Input

In this section we discuss spectrograms, the preprocessed speech input to the
model. Here's an illustration of the process:

![spectrogram diagram](https://storage.googleapis.com/download.tensorflow.org/example_images/spectrogram_diagram.png)

The model doesn't take in raw audio sample data; instead, it works with
spectrograms: two-dimensional arrays made up of slices of frequency
information, each taken from a different time window.

Each frequency slice is created by running an FFT across a 30ms section of the
audio sample data. The input samples are treated as real values between -1 and
+1 (encoded as -32,768 and 32,767 in 16-bit signed integer samples).

This results in an FFT with 256 entries. Every sequence of six entries is
averaged together, giving a total of 43 frequency buckets in the final slice.
The results are stored as unsigned eight-bit values, where 0 represents a real
number of zero and 255 represents 127.5 as a real number.

Each adjacent frequency entry is stored in ascending memory order (frequency
bucket 0 at data[0], bucket 1 at data[1], etc.). The window for the frequency
analysis is then moved forward by 20ms and the process repeated, storing the
results in the next memory row (for example, bucket 0 in this moved window
would be at data[43 + 0], etc.). This process happens 49 times in total,
producing a single-channel image that is 43 pixels wide and 49 rows high.

In a complete application these spectrograms would be calculated at runtime
from microphone inputs, but the code for doing that is not yet included in this
sample code.
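To make the recipe concrete, here is a minimal NumPy sketch of the steps
described above. It is an approximation for illustration: the 16 kHz sample
rate, the 512-point FFT, and the use of FFT magnitudes are assumptions, and the
shipped features are produced by the `wav_to_features.py` script shown below,
whose exact windowing and quantization details may differ.

```python
# A rough NumPy sketch of the spectrogram recipe described above.
# Assumptions: 16 kHz input, a 512-point FFT, and FFT magnitudes.
import numpy as np

SAMPLE_RATE = 16000
WINDOW = int(0.030 * SAMPLE_RATE)   # 30 ms -> 480 samples per FFT
STRIDE = int(0.020 * SAMPLE_RATE)   # 20 ms -> 320 samples between rows

def spectrogram(samples):
    """samples: one second of audio, shape (16000,), dtype int16."""
    real = samples.astype(np.float32) / 32768.0          # -> [-1.0, +1.0)
    rows = []
    for start in range(0, len(real) - WINDOW + 1, STRIDE):
        fft = np.abs(np.fft.rfft(real[start:start + WINDOW], n=512))[:256]
        # Average every six adjacent entries -> 43 frequency buckets
        # (the last bucket averages the four leftover entries).
        rows.append([fft[i:i + 6].mean() for i in range(0, 256, 6)])
    # Quantize so that a real value of 0.0 -> 0 and 127.5 -> 255.
    quantized = np.clip(np.array(rows) * (255.0 / 127.5), 0.0, 255.0)
    return quantized.astype(np.uint8)                    # shape (49, 43)
```

Feeding it one second of 16 kHz audio yields the 49 x 43 uint8 image described
above, with frequency bucket `b` of row `t` at `data[43 * t + b]` in flattened
memory order.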
The test uses spectrograms that have been pre-calculated from
one-second WAV files in the test dataset, generated by running the following
commands:

```
python tensorflow/tensorflow/examples/speech_commands/wav_to_features.py \
--input_wav=/tmp/speech_dataset/yes/f2e59fea_nohash_1.wav \
--output_c_file=/tmp/yes_features_data.cc \
--window_stride=20 --preprocess=average --quantize=1

python tensorflow/tensorflow/examples/speech_commands/wav_to_features.py \
--input_wav=/tmp/speech_dataset/no/f9643d42_nohash_4.wav \
--output_c_file=/tmp/no_features_data.cc \
--window_stride=20 --preprocess=average --quantize=1
```

## Other Training Methods

### Use [Google Cloud](https://cloud.google.com/).

*Note: Google Cloud isn't free. You pay depending on how long you run the VM
and what resources you use.*

1. Create a Virtual Machine (VM) using a pre-configured Deep Learning VM Image.

```
export IMAGE_FAMILY="tf-latest-cpu"
export ZONE="us-west1-b" # Or any other required region
export INSTANCE_NAME="model-trainer"
export INSTANCE_TYPE="n1-standard-8" # or any other instance type
gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=120GB \
        --min-cpu-platform=Intel\ Skylake
```

2. As soon as the instance has been created, you can SSH into it:

```
gcloud compute ssh "jupyter@${INSTANCE_NAME}"
```

3. Train a model by following the instructions in the
[`train_micro_speech_model.ipynb`](train_micro_speech_model.ipynb) Jupyter
notebook.

4. Finally, don't forget to remove the instance when training is done:

```
gcloud compute instances delete "${INSTANCE_NAME}" --zone="${ZONE}"
```
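Whichever training method you use, you can sanity-check the resulting fully
quantized `model.tflite` with the TensorFlow Lite Python interpreter (available
in recent TensorFlow releases). This is a minimal sketch: the model path, the
random input, and the label ordering are assumptions; a real check would feed a
quantized spectrogram such as those produced by `wav_to_features.py` above.

```python
# Minimal smoke test for the fully quantized model.tflite. The model path,
# the random input, and the label ordering below are assumptions.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="models/model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Inputs and outputs are int8, per the "fully quantized" note above.
print("input shape:", inp["shape"], "dtype:", inp["dtype"])

dummy = np.random.randint(-128, 128, size=inp["shape"], dtype=np.int8)
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()

scores = interpreter.get_tensor(out["index"])[0]
labels = ["silence", "unknown", "yes", "no"]   # assumed ordering
print(dict(zip(labels, (int(s) for s in scores))))
```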