physically grounded vision-language models