Using pre-trained models for vision-language understanding tasks