In this paper, we present TellMeTalk, a novel approach for generating expressive talking face videos from multimodal inputs. Our approach is robust across identities, languages, expressions, and head movements. It overcomes four key limitations of existing talking face video generation methods: (1) reliance on single-modal learning from audio or text, which forgoes the complementary nature of multimodal inputs; (2) use of traditional convolutional neural network generators, which restricts the capture of spatial features; (3) the absence of natural head move...