Historically, visually impaired students faced significant challenges in accessing education, often relying on limited tools and segregated systems. Before the 20th century, education for the blind was rare and typically confined to specialized institutions. In the United States, for example, schools like the Perkins School for the Blind (founded in 1829) taught students using tactile methods, such as raised-letter books and, later, Braille (introduced in the U.S. in the 1860s). These students learned through touch and sound, memorizing lessons read aloud by teachers or using rudimentary aids like slates and styluses for writing. Mobility training was minimal, often limited to verbal guidance from peers or staff, as white canes weren’t widely adopted until the 1930s, and guide dogs emerged later, in the mid-20th century.
By the mid-20th century, integration into mainstream schools began, but resources were scarce. In 1950, 88% of visually impaired children were educated in special schools, dropping to 32% by 1972 as inclusion policies grew. Students in regular classrooms depended on sighted peers or teachers to describe visual content, take notes, or read aloud—an inconsistent and often unreliable method. Text access relied on Braille books, which were bulky, expensive, and slow to produce, or audio recordings, which required specialized equipment. Orientation and mobility skills were taught sporadically, leaving many students unprepared for dynamic environments like school campuses. These historical approaches, while groundbreaking for their time, highlight a legacy of dependence and limited autonomy—issues VisionMate seeks to overcome with real-time, independent scene description.
Visual impairment remains a significant global issue, with growing relevance for students and educational systems. According to the World Health Organization (WHO) in 2023, approximately 2.2 billion people worldwide have some form of vision impairment, and in over 1 billion of these cases the impairment could have been prevented or has yet to be addressed. The International Agency for the Prevention of Blindness (IAPB) Vision Atlas projects that by 2050 the number of people living with vision loss could reach 1.8 billion, including 61 million who are blind, driven by aging populations and lifestyle factors such as increased screen time and diabetes.
In the U.S., the National Eye Institute (NEI) reported in 2019 that 3.2 million adults had visual impairment (best-corrected vision of 20/40 or worse) as of 2015, a figure projected to more than double to 6.95 million by 2050. For children, the American Community Survey (ACS) estimated in 2022 that 600,000 individuals under 18 had vision difficulties, about 0.8% of that age group. The U.S. Department of Education's IDEA data for 2022–23 show that visual impairments account for less than 0.5% of students aged 3–21 served under the Individuals with Disabilities Education Act (IDEA), roughly 25,000 students, though this likely undercounts those with multiple disabilities.
Globally, the prevalence of visual impairment among students has increased, partly due to the COVID-19 pandemic. A 2021 study in Guangzhou, China, of over 1 million students aged 6–18 found visual impairment rates rising from 53.48% in 2019 to 54.65% in 2020, linked to reduced outdoor time and increased screen exposure. For visually impaired students, educational outcomes lag: only 43% of U.S. transition-aged youths with vision loss used the internet regularly (compared to 95% of sighted peers), and over 70% of working-age adults with significant vision loss are not employed full-time, underscoring long-term impacts of inadequate early support.
VisionMate addresses a pressing need for accessible, real-time assistive technology for visually impaired individuals, including students, who face significant barriers in perceiving their environments. Visual impairment limits the ability to independently navigate spaces, identify objects, and access written information—tasks that sighted individuals often take for granted. Traditional aids like white canes and guide dogs, while valuable, cannot describe surroundings or read text, leaving gaps in functionality. Modern alternatives, such as smartphone apps or wearable devices, often require internet access, expensive hardware, or complex interfaces, making them impractical for many users, especially in resource-limited settings or for younger students transitioning to independence.
VisionMate fills this void by offering an offline, affordable, and user-friendly solution that uses a standard webcam and computer to detect objects and text, converting them into spoken descriptions. For students, this is particularly critical: education relies heavily on visual input (e.g., reading boards, identifying classroom materials), and without proper support, visually impaired students risk falling behind academically and socially. VisionMate’s dual-mode design—default object detection and optional OCR—caters to varying needs, providing flexibility for both basic navigation and detailed information access, such as reading signs or handouts. Its offline capability ensures reliability in schools or homes without consistent internet, while its open-source nature allows customization, making it a scalable tool for diverse populations.
## What is VisionMate?
VisionMate is an innovative, open-source assistive technology tool designed to support visually impaired individuals, with a particular focus on students, by providing real-time auditory descriptions of their surroundings. Built using Python, it integrates computer vision and text-to-speech capabilities to process live webcam feeds, detect objects, and optionally recognize text, converting these observations into spoken words. The system operates in two modes: a default mode that identifies objects (e.g., people, chairs) and an OCR mode that adds text detection (e.g., signs, labels), toggleable with a simple key press ('2'). VisionMate runs offline on standard hardware—a computer with a webcam—making it accessible, cost-effective, and privacy-conscious compared to cloud-dependent alternatives.
At its core, VisionMate combines three key agents:
- Vision Agent: Uses YOLOv8 for object detection and EasyOCR for text recognition.
- Reasoning Agent: Generates concise, user-friendly descriptions from detected data.
- Action Agent: Converts descriptions into speech using pyttsx3.
This modular design ensures flexibility and ease of maintenance, while its open-source nature invites community contributions to enhance its features.
For visually impaired students, VisionMate serves as a transformative tool that bridges the gap between their visual limitations and the demands of educational and social environments.
Here’s how it benefits them:
1- Enhanced Classroom Participation:
VisionMate can describe objects in the classroom, such as "I see a book at x=150, y=300," helping students locate materials independently. In OCR mode, it reads aloud text on whiteboards, handouts, or labels (e.g., "The text says 'Homework due Friday'"), reducing reliance on teachers or peers for verbal descriptions.
This real-time feedback allows students to follow lessons more actively, keeping pace with sighted classmates.
2- Improved Navigation and Safety:
By identifying objects like "a person at x=100, y=200" or "a chair at x=50, y=400," VisionMate helps students navigate crowded classrooms, hallways, or cafeterias, avoiding obstacles and enhancing mobility confidence.
In OCR mode, it can read directional signs (e.g., "Exit" or "Room 12"), aiding orientation in unfamiliar school buildings.
3- Access to Printed Information:
Historically, visually impaired students depended on Braille or audio recordings, which weren’t always available for spontaneous materials like worksheets or notices. VisionMate’s OCR mode instantly reads such text aloud, ensuring timely access to critical information.
4- Fostering Independence:
Unlike past methods requiring constant assistance, VisionMate empowers students to explore and interpret their surroundings on their own. For example, a student can point the webcam at a science lab setup and hear "I see a beaker at x=200, y=250," enabling hands-on learning without a sighted guide.
5- Support Beyond the Classroom:
VisionMate extends its utility to homework (reading book pages), social settings (identifying friends or objects), and daily tasks (finding a backpack), preparing students for broader life skills and reducing the educational outcome gap highlighted in statistics.
```python
# Main loop: wires the Vision, Reasoning, and Action agents together.
import cv2

from vision_agent import VisionAgent
from reasoning_agent import ReasoningAgent
from action_agent import ActionAgent

# Initialize agents
vision_agent = VisionAgent()
reasoning_agent = ReasoningAgent(use_llm=False)  # Set to True if you have a local LLM
action_agent = ActionAgent()

print("VisionMate is running... Press 'q' to exit, '2' to toggle OCR mode.")

# Open webcam
cap = cv2.VideoCapture(0)

# Mode flag: False = Object Detection only, True = Object Detection + OCR
ocr_mode = False

while True:
    ret, frame = cap.read()
    if not ret:
        print("Failed to read frame from webcam.")
        break

    # Vision: Detect objects and text
    objects, text = vision_agent.detect(frame)

    # Depending on mode, include or exclude text in the description
    if ocr_mode:
        description = reasoning_agent.generate_description(objects, text)
        print("OCR Mode - Description:", description)
    else:
        description = reasoning_agent.generate_description(objects, "")  # No text in default mode
        print("Default Mode - Description:", description)

    # Action: Speak the description
    action_agent.speak(description)

    # Display the frame
    cv2.imshow("VisionMate", frame)

    # Check for key presses
    key = cv2.waitKey(1) & 0xFF
    if key == ord('q'):
        print("Exiting...")
        break
    elif key == ord('2'):
        ocr_mode = not ocr_mode  # Toggle OCR mode
        mode_status = "OCR Mode ON" if ocr_mode else "OCR Mode OFF"
        print(mode_status)
        action_agent.speak(mode_status)  # Announce mode change
    elif key != 255:  # 255 means no key was pressed
        print(f"Key pressed: {key}")  # Debug: Show ASCII value of pressed key

cap.release()
cv2.destroyAllWindows()
```
```python
# vision_agent.py
import cv2
import numpy as np
from ultralytics import YOLO
import easyocr


class VisionAgent:
    def __init__(self, model_path="models/yolov8n.pt"):
        # Load YOLOv8n pre-trained model
        self.yolo_model = YOLO(model_path, task='detect')  # Specify task='detect' for object detection
        # Initialize EasyOCR reader for English (offline)
        self.ocr_reader = easyocr.Reader(['en'], gpu=False)  # Set gpu=False to avoid GPU dependency

    def preprocess(self, frame):
        # Letterbox the frame for YOLO: resize while keeping aspect ratio, then pad
        shape = frame.shape[:2]  # (height, width)
        new_shape = (480, 480)
        r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
        new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
        dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # width/height padding
        dw, dh = np.mod(dw, 32), np.mod(dh, 32)  # Keep padding divisible by 32
        dw /= 2
        dh /= 2
        img = cv2.resize(frame, new_unpad, interpolation=cv2.INTER_LINEAR)
        img = cv2.copyMakeBorder(
            img,
            int(round(dh - 0.1)), int(round(dh + 0.1)),
            int(round(dw - 0.1)), int(round(dw + 0.1)),
            cv2.BORDER_CONSTANT, value=(114, 114, 114)
        )
        # Return the scale ratio and the padding actually applied so detections
        # can be mapped back to original-frame coordinates
        return img, r, (dw, dh)

    def detect(self, frame):
        # Preprocess the frame
        processed_frame, gain, pad = self.preprocess(frame)

        # Run YOLO inference
        results = self.yolo_model(processed_frame)

        # Process YOLO results
        objects = []
        boxes = results[0].boxes.xywh.cpu().numpy()        # Bounding boxes in xywh format
        confidences = results[0].boxes.conf.cpu().numpy()  # Confidence scores
        class_ids = results[0].boxes.cls.cpu().numpy()     # Class IDs
        class_names = results[0].names                     # Class names dictionary

        for box, conf, cls_id in zip(boxes, confidences, class_ids):
            if conf > 0.5:  # Confidence threshold
                x, y, w, h = box
                # Scale coordinates back to the original frame using the actual letterbox ratio and padding
                x = (x - pad[0]) / gain
                y = (y - pad[1]) / gain
                w = w / gain
                h = h / gain
                obj_name = class_names[int(cls_id)]
                objects.append({
                    "object": obj_name,
                    "position": f"x={int(x)}, y={int(y)}",
                    "confidence": float(conf)
                })

        # Detect text using EasyOCR
        ocr_results = self.ocr_reader.readtext(frame, detail=0)  # detail=0 returns only the text
        text = " ".join(ocr_results).strip()  # Combine all detected text into a single string

        return objects, text


if __name__ == "__main__":
    vision_agent = VisionAgent()
    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        objects, text = vision_agent.detect(frame)
        print("Objects:", objects)
        print("Text:", text)
        cv2.imshow("Frame", frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
```
```python
# action_agent.py
import pyttsx3


class ActionAgent:
    def __init__(self):
        self.engine = pyttsx3.init()
        self.engine.setProperty('rate', 150)    # Speed of speech
        self.engine.setProperty('volume', 0.9)  # Volume (0.0 to 1.0)

    def speak(self, description):
        self.engine.say(description)
        self.engine.runAndWait()


if __name__ == "__main__":
    action_agent = ActionAgent()
    description = "I see a person at x=100, y=200, and some text that says 'Exit sign'."
    action_agent.speak(description)
```
```python
# reasoning_agent.py
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class ReasoningAgent:
    def __init__(self, use_llm=True):
        self.use_llm = use_llm
        if use_llm:
            try:
                self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
                self.llm_model = GPT2LMHeadModel.from_pretrained("gpt2")
            except Exception as e:
                print(f"Failed to load LLM: {e}. Falling back to rule-based logic.")
                self.use_llm = False
        else:
            self.use_llm = False

    def generate_description(self, objects, text):
        if not self.use_llm:
            # Rule-based fallback
            description = "I see "
            if objects:
                for obj in objects:
                    description += f"a {obj['object']} at {obj['position']}, "
            if text:
                description += f"and some text that says '{text}'."
            else:
                description += "no text in the scene."
            return description.strip(", ")

        # LLM-based description
        prompt = f"Detected objects: {objects}. Detected text: {text}. Describe the scene for a visually impaired user in a concise way."
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        outputs = self.llm_model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=True,
            top_p=0.95,
            temperature=0.7
        )
        description = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return description


if __name__ == "__main__":
    reasoning_agent = ReasoningAgent(use_llm=False)  # Set to True if you have a local LLM
    objects = [{"object": "person", "position": "x=100, y=200", "confidence": 0.9}]
    text = "Exit sign"
    description = reasoning_agent.generate_description(objects, text)
    print("Description:", description)
```
Beyond object detection and spoken descriptions of spatial layout, our project incorporates multiple cutting-edge technologies to enhance accessibility and independence for visually impaired students. Our goal is to provide real-time assistance that empowers users in both academic and everyday environments.
Navigating unfamiliar indoor spaces, such as schools, libraries, and offices, can be a daunting challenge for visually impaired individuals. To address this, we are implementing an indoor navigation system based on ArUco markers: printable square visual codes that a computer vision system can detect and uniquely identify.
- ArUco markers are strategically placed throughout the environment, such as at entrances, hallways, classrooms, or important landmarks.
- A camera-based system detects these markers and interprets their location.
- The system then provides real-time voice-based instructions to guide the user through the space.
For example, if a student enters a library, the system can identify markers near bookshelves and seating areas and provide auditory directions like:
"You are near the entrance. Walk forward two steps to reach the study area."
By using this method, visually impaired students can navigate safely and independently without requiring human assistance. This technology is particularly useful in dynamic environments where traditional tactile navigation aids, such as raised floor markers or handrails, may not be sufficient.
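As a rough illustration of the detection step, here is a minimal sketch using OpenCV's `cv2.aruco` module (available in `opencv-contrib-python`, written against the `ArucoDetector` API introduced in OpenCV 4.7) together with the project's existing `ActionAgent` for speech. The marker IDs and guidance phrases are placeholder assumptions, not part of the current VisionMate code.

```python
# indoor_nav_sketch.py - illustrative only; marker IDs and guidance text are placeholders.
import cv2
from action_agent import ActionAgent

# Hypothetical mapping from marker IDs to spoken guidance for one building.
MARKER_GUIDANCE = {
    0: "You are near the entrance. Walk forward two steps to reach the study area.",
    1: "You are at the bookshelves. The seating area is to your right.",
    2: "Room 12 is directly ahead.",
}

action_agent = ActionAgent()
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector = cv2.aruco.ArucoDetector(dictionary, cv2.aruco.DetectorParameters())

cap = cv2.VideoCapture(0)
last_announced = None  # Avoid repeating the same instruction on every frame
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is not None:
        marker_id = int(ids.flatten()[0])  # Use the first visible marker
        guidance = MARKER_GUIDANCE.get(marker_id)
        if guidance and marker_id != last_announced:
            action_agent.speak(guidance)
            last_announced = marker_id
    cv2.imshow("Indoor Navigation", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
```

In a real deployment, the marker map would be authored per building, and marker pose estimation could be added to give distances and turning directions rather than fixed phrases.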
Interacting with smart devices can be challenging for individuals with visual impairments, as most modern technology relies heavily on touchscreens and visual cues. To bridge this gap, we are developing a smart remote with Braille buttons, built around an ESP Wi-Fi-enabled microcontroller that sends data over the internet.
- Braille-Embossed Buttons – Designed for ease of use, allowing users to control functions without needing visual assistance.
- IoT Connectivity via ESP Module – Enables remote control of smart devices such as lights, fans, classroom projectors, or even a computer screen.
- Customizable Actions – Users can program the remote to trigger specific commands, like adjusting brightness, turning on accessibility tools, or selecting a particular audio mode.
For instance, in a classroom setting, a student can use the remote to:
- Adjust the audio output for a lecture recording.
- Toggle between accessibility modes on a digital learning platform.
- Control a smart whiteboard to read out the contents displayed.
By integrating IoT and Braille technology, we ensure that visually impaired students have greater control over their surroundings, enhancing independence and ease of learning.
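The ESP firmware is beyond the scope of this sketch, but the computer-side receiver could look roughly like the following: a small HTTP endpoint that accepts a button code posted by the remote and announces the matching action through the existing `ActionAgent`. The port number, button codes, and action mapping are hypothetical placeholders, not a finished protocol.

```python
# remote_receiver_sketch.py - illustrative only; button codes and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer
from action_agent import ActionAgent

action_agent = ActionAgent()

# Hypothetical mapping from Braille-button codes sent by the ESP to spoken confirmations.
ACTIONS = {
    "OCR_TOGGLE": "OCR mode toggled",
    "VOLUME_UP": "Volume increased",
    "READ_BOARD": "Reading the whiteboard",
}

class RemoteHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The ESP posts a plain-text button code such as "OCR_TOGGLE".
        length = int(self.headers.get("Content-Length", 0))
        code = self.rfile.read(length).decode("utf-8").strip()
        message = ACTIONS.get(code, f"Unknown button: {code}")
        action_agent.speak(message)  # Confirm the action out loud
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

if __name__ == "__main__":
    # The remote's firmware would POST to http://<computer-ip>:8080/ on each button press.
    HTTPServer(("0.0.0.0", 8080), RemoteHandler).serve_forever()
```

HTTP is used here only because it keeps the computer side dependency-free; MQTT or a raw socket would work equally well if the firmware prefers it.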
Vision-Language Model for Real-Time Scene Descriptions
Understanding one’s surroundings is crucial for mobility and daily activities. While object detection helps recognize specific items, a Vision-Language Model (VLM) takes accessibility to the next level by providing full scene descriptions.
- The system captures real-time visual data using a camera.
- A pre-trained Vision-Language Model processes the scene, identifying multiple objects, their relationships, and contextual details.
- The model generates descriptive text, which is then converted into natural speech using a text-to-speech engine.
For example, if a user enters a classroom, instead of only identifying isolated objects, the system can say:
"You are in a classroom. There are five tables ahead, a whiteboard on the left, and a teacher standing near the desk."
This feature is especially useful in dynamic and cluttered environments, such as:
- Academic institutions – Helping students locate chairs, desks, and learning materials.
- Public transport stations – Describing bus stops, signboards, or nearby passengers.
- Shopping malls and offices – Providing awareness of counters, aisles, or workspaces.
By integrating computer vision with natural language understanding, we are building a more intuitive and real-time assistant for visually impaired individuals, enhancing their awareness and autonomy in various settings.
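As a minimal sketch of this pipeline, the code below assumes the BLIP image-captioning model from Hugging Face `transformers` as the pre-trained Vision-Language Model; the model choice is an assumption, and any locally runnable captioning model could be swapped in. It captures a single frame, generates a caption, and speaks it through the existing `ActionAgent`.

```python
# vlm_sketch.py - illustrative only; the BLIP model is an assumed stand-in for the VLM.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from action_agent import ActionAgent

# Load the captioning model once (downloaded on first use, then cached locally).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
action_agent = ActionAgent()

cap = cv2.VideoCapture(0)
ret, frame = cap.read()
cap.release()

if ret:
    # BLIP expects an RGB PIL image; OpenCV frames are BGR numpy arrays.
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    caption = processor.decode(output_ids[0], skip_special_tokens=True)
    action_agent.speak(caption)  # e.g. "a classroom with tables and a whiteboard"
```

The base BLIP model produces short captions rather than the richer, spatially grounded descriptions quoted above; reaching that level of detail would mean using a larger VLM or combining the caption with the object positions already produced by the Vision Agent.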
Our mission is to bridge the accessibility gap for visually impaired students, empowering them with cutting-edge technology that enhances their independence, confidence, and academic success. By integrating computer vision, AI-driven scene descriptions, IoT-enabled Braille controls, and real-time voice assistance, we are not just creating tools—we are shaping a future where no student is left behind.
This is only the beginning. As technology evolves, so will our solutions, ensuring that visually impaired individuals can navigate the world with greater ease, efficiency, and dignity. Together, we are building a world where education, mobility, and accessibility have no barriers.
Join us in making a difference—because innovation should be for everyone. (Team VisionMate)