An end-to-end trainable convolutional neural network (CNN) framework which mimics bag of visual words (BoVW) is proposed for image classification. To this end, a new paradigm for histogram-like image representation is introduced and optimal transport (OT) distance is utilized for the similarity assessment. Any patch of an image is considered as a unique visual word and the image is represented as the uniform histogram of the visual words with the histogram bins associated to embedding vectors according to the semantic meanings of the corresponding visual words. Thus, in the CNN framework, the output of the last convolutional block is considered as the global representation of the image and the embeddings are inherently learned within the classification framework. With the proposed formulation, undesired quantization for the BoVW representation is no more required; moreover, the learned CNN features are naturally interpretable. The experiments on CIFAR-10, CIFAR-100 and SVHN datasets show that the replacement of the global pooling and fully connected layers with the proposed representation together with OT distance improves the baseline CNN framework.