Development brief: Split the data set into a 70% training set, a 15% validation set, and a 15% test set. Characters that would interfere with training, such as question marks, are removed during preprocessing. The model is trained end-to-end: a GRU decoder followed by a dropout layer and, finally, a dense layer. The purpose is to produce, at each output time step, a probability distribution over the target vocabulary.
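The preprocessing and split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the set of characters removed (here, question marks and a few similar punctuation marks) and the `seed` parameter are assumptions, and the 70/15/15 proportions follow the brief.

```python
import random

def clean_text(s):
    # Remove characters that would interfere with training.
    # The brief names question marks; the rest of this set is an assumption.
    noise = set("?？!¡¿")
    return "".join(ch for ch in s if ch not in noise)

def split_dataset(samples, seed=0):
    # Shuffle, then split 70% train / 15% validation / 15% test,
    # as specified in the brief.
    samples = samples[:]
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(n * 0.70)
    n_val = int(n * 0.15)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```

After this step, each cleaned sequence would be fed to the GRU-based model; the final dense layer (with a softmax activation) is what turns each time step's hidden state into a probability distribution over the vocabulary.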