Visually grounded language understanding and generation