From Pixels To JSON: A Vision-Language Approach to Document Understanding