Unsupervised Learning of the 4D Audio-Visual World from Sparse Unconstrained Real-World Samples