Video based action recognition is one of the important and challenging problems in computer vision research. Bag of visual words model (BoVW) with local features has been very popular for a long time and obtained the state-of-the-art performance on several realistic datasets, such as the HMDB51, UCF50, and UCF101. BoVW is a general pipeline to construct a global representation from local features, which is mainly composed of five steps; (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v...