Text-based person search aims to retrieve a target person from a large gallery based on a natural language description. Existing methods treat it as either a one-to-one or a many-to-many embedding matching problem. The former relies on the assumption that text and images are strongly aligned, while the latter inevitably suffers from intra-class variation. Rather than being confined to these two approaches, we propose a new strategy, named CASC, that achieves cross-modal alignment with synthetic caption for joint image-text-caption optimization. The core of this strat...