Fine-grained activity recognition using multimodal datasets