The rapid development of large language models has opened new opportunities for integrating vision and language capabilities, giving rise to Vision-Language Models (VLMs). Despite their potential, fine-tuning these models remains computationally expensive, limiting accessibility for many researchers and organizations. To address this challenge, we propose an efficient and scalable method for improving VLM accuracy by integrating specialized downstream vision capabilities through the Model Context Protocol (MCP). This approach reduces the computational power needed to improve a model's performance. We validate our method in the medical domain, which we expect to benefit most from the proposed approach, evaluating both classification and NLP-based metrics to assess the quality of image understanding, captioning, and diagnostic reasoning. Our results demonstrate that the method achieves competitive performance on multi-modal tasks while using substantially less compute for training. In summary, we propose a novel way of integrating specialized downstream vision tasks into vision-language models, enhance an oral health dataset containing 832 samples that can be used for image captioning, and introduce a method that requires significantly fewer computational resources to integrate and fine-tune specialized downstream vision tasks.
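To make the integration idea concrete, the sketch below shows the general tool-calling pattern the abstract alludes to: a specialized downstream vision task (here, a stubbed oral-lesion classifier) is exposed as a named tool that a VLM runtime can invoke through a protocol such as MCP, instead of fine-tuning that capability into the model itself. All names here (`register_tool`, `classify_oral_lesion`, `dispatch`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of MCP-style tool delegation for a vision task.
# The classifier is a stub; a real server would run a fine-tuned
# specialized vision model and return its prediction.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., dict]] = {}

def register_tool(name: str):
    """Register a function as a callable vision tool (illustrative)."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator

@register_tool("classify_oral_lesion")
def classify_oral_lesion(image_path: str) -> dict:
    # Stub result standing in for a specialized classifier's output.
    return {"label": "caries", "confidence": 0.91}

def dispatch(tool_name: str, **kwargs) -> dict:
    """What the VLM runtime would do when the model emits a tool call."""
    return TOOLS[tool_name](**kwargs)

result = dispatch("classify_oral_lesion", image_path="example.jpg")
print(result["label"])
```

The key design point is that the VLM never sees the classifier's weights; it only exchanges structured tool calls and results, which is what keeps the compute cost of adding a new capability low.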
Short Bio:
I’m a part-time PhD candidate at West University of Timisoara with research interests in medical applications of Artificial Intelligence as well as Distributed Systems. My primary research fields are Computer Vision and Machine Learning. Apart from my research activity, I also work full time as a Software Engineer at Microsoft, helping build the next generation of Azure's storage infrastructure.