Speech Recognition Magic: How Soraha Helps Students Speak English and Kiswahili Confidently
I'll never forget the moment that convinced me we needed speech recognition in Soraha. I was observing at a pilot school in Kajiado when I noticed a Grade 3 student named James sitting alone during break time, quietly repeating English words to himself. When I asked what he was doing, he said, "Practicing so I don't embarrass myself in class." He was trying to master pronunciation of "through" and "thought"—words he consistently avoided during lessons because he feared making mistakes in front of classmates. That conversation haunted me. Here was a motivated student wanting to improve but lacking a safe space to practice without judgment. That's when Joseph and I committed to building speech recognition into Soraha, despite everyone telling us it was impossible to run offline on budget devices.
I'm Billy Gareth, Co-Founder and CEO of Soraha, and building our speech recognition system took a full year of intensive engineering work. The challenge wasn't just technical—it was philosophical. We could have taken the easy path and built cloud-based speech recognition like everyone else. But cloud-based solutions require constant internet connectivity, which excludes the rural students who most need pronunciation practice. We refused to build features that only worked for privileged, well-connected students. If we couldn't make speech recognition work offline on 5,000 KES devices, we wouldn't build it at all.
The Speaking Practice Gap We're Solving
In a typical Kenyan primary classroom, 40-50 students share one teacher's attention. Even with perfect classroom management and dedicated speaking practice time, each student receives perhaps 2-3 minutes of individual speaking practice per language lesson. Across a typical week of five language lessons, that adds up to roughly 10-15 minutes of actual speaking time per student. This is grossly insufficient for developing confident oral language skills, yet it's the reality in most schools.
The problem compounds for students learning English as a second or third language. They need more speaking practice than native speakers, not less, but they're the least likely to volunteer during class discussions for fear of making mistakes. The students who most need speaking practice opt out of the limited opportunities available, creating a vicious cycle in which struggling students fall further behind while confident speakers get most of the practice time.
Speaking practice is also the most difficult skill to practice independently. Students can read texts, complete writing exercises, and solve problems on their own. But pronunciation practice traditionally requires a conversation partner or teacher to model correct pronunciation and provide feedback. Students practicing alone often reinforce incorrect pronunciation patterns without realizing it, making bad habits harder to break later.
When designing Soraha, Joseph and I recognized that solving the speaking practice gap required technology that could provide unlimited individual practice with immediate feedback—technology that didn't depend on teacher availability or internet connectivity. Speech recognition was the obvious solution, but making it work in our context required solving problems that kept our engineering team up at night for months.
Building Offline Speech Recognition: The Technical Challenge
Many EdTech developers we consulted considered it impossible to run sophisticated speech recognition entirely offline on budget mobile devices. Cloud-based speech recognition sends audio to powerful remote servers; services like Google Cloud Speech-to-Text or Amazon Transcribe depend on data centers with massive computational resources. We needed the entire pipeline running locally on devices costing 5,000-7,000 KES with 2 GB of RAM and modest processors.
The technical challenge breaks down into several problems. First, speech recognition models are typically huge—hundreds of megabytes or even gigabytes. We needed models small enough to download on slow connections and store on devices with limited capacity. Second, speech recognition is computationally intensive. We needed processing efficient enough to run in real-time on weak processors without draining batteries. Third, we needed accuracy comparable to cloud services despite model compression and processing constraints.
Joseph and the engineering team spent months training custom speech recognition models optimized specifically for our use case. Rather than general-purpose models recognizing all possible speech, our models focus on educational vocabulary, pronunciation patterns common in Kenyan students, and phonemes particularly challenging for language learners. This focused approach allows smaller model sizes while maintaining accuracy for educational contexts.
We achieved compression ratios that seemed impossible initially. Our English model is under 50MB while maintaining accuracy comparable to cloud services for educational vocabulary. The Kiswahili model is even smaller at 35MB. These compressed models download in minutes on slow connections and run efficiently on budget devices. The compression required novel techniques combining pruning, quantization, and knowledge distillation that Joseph presented at a machine learning conference in Nairobi.
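To make one of those three techniques concrete, here is a minimal sketch of quantization on its own. It is an illustration only, not Soraha's production pipeline: the function names and the sample weights are made up for the example, and real models are quantized with a machine learning toolkit, not plain Python lists. The idea is that each 32-bit floating-point weight is stored as an 8-bit integer plus one shared scale factor, which cuts storage roughly fourfold at the cost of a small rounding error.

```python
from array import array

def quantize_int8(weights):
    """Symmetric linear quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))  # 1 byte each
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.05, -1.27, 0.33, 0.0]
original = array("f", weights)             # 4 bytes per weight
quantized, scale = quantize_int8(weights)  # 1 byte per weight

size_ratio = (original.itemsize * len(original)) / (
    quantized.itemsize * len(quantized)
)
max_error = max(
    abs(w - r) for w, r in zip(weights, dequantize(quantized, scale))
)
# size_ratio is 4.0; max_error is at most half of one quantization step.
```

Pruning (dropping weights that contribute little) and knowledge distillation (training a small model to imitate a large one) reduce size in different ways and are not shown here.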
The audio processing pipeline runs efficiently enough for real-time feedback. Students speak, and pronunciation assessment appears within one to two seconds—fast enough to feel responsive and maintain immersion. This real-time performance on modest hardware required optimization work at every pipeline stage, from audio capture through feature extraction to model inference. We hand-optimized critical code paths and used mobile-specific acceleration when available.
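For readers unfamiliar with the feature extraction stage, the sketch below shows the general shape of that step under common textbook assumptions: 16 kHz audio, 25 ms frames, and a 10 ms hop. It computes only a per-frame log energy, which is far simpler than the features a real recognizer uses, and none of the names or parameters come from Soraha's code.

```python
import math

def frame_log_energy(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split audio into overlapping frames and return the log energy of each."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 160 samples at 16 kHz
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return energies

# One second of a 440 Hz tone versus one second of silence.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
silence = [0.0] * 16000

tone_features = frame_log_energy(tone)
silence_features = frame_log_energy(silence)
# Each one-second clip yields 98 frames; tone frames carry far more energy.
```

Working on short frames like this is what lets a pipeline start scoring speech while the student is still talking, instead of waiting for the whole recording.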
English and Kiswahili: Dual Language Support
Soraha's speech recognition supports both English and Kiswahili, recognizing that Kenyan students need strong oral skills in both languages. This dual-language support addresses the practical reality of our multilingual education system where students must develop fluency in multiple languages simultaneously. Building two complete speech recognition systems doubled our development effort, but serving Kenya's linguistic reality was non-negotiable.
The Kiswahili speech recognition is particularly valuable because many English-dominant educational technology platforms neglect African languages entirely. We treat Kiswahili as a first-class language with dedicated speech recognition models, pronunciation standards, and feedback systems. Students receive the same quality practice in Kiswahili as in English, validating both languages as equally important in their education.
The pronunciation standards reflect appropriate norms for each language context. English pronunciation targets comprehensible Kenyan English—students don't need to sound like they grew up in London or New York. They need to pronounce English clearly enough for mutual comprehension in diverse English-speaking contexts. We consulted with Kenyan English teachers and linguists to establish pronunciation standards that respect local accents while ensuring international intelligibility.
Similarly, Kiswahili standards reflect standard coastal Kiswahili while recognizing regional variation. We worked with Kiswahili educators to ensure our models accept regional pronunciation patterns while maintaining standard language norms. This flexibility respects Kenya's linguistic diversity rather than enforcing narrow standards that privilege one regional variety over others.
The code-switching common in Kenyan communication presented an interesting challenge. While formal language practice focuses on individual languages, students might naturally incorporate English words into Kiswahili speech or vice versa. Our models recognize this linguistic flexibility rather than penalizing students for communication patterns that reflect actual language use in Kenya.
Integration Into Gameplay: Making Practice Natural
We integrated speech recognition directly into gameplay, making pronunciation practice a natural part of the learning experience rather than isolated drills. This integration was crucial—if pronunciation practice felt like tedious exercises, students would avoid it. Students encounter speaking challenges integrated into game narratives and puzzles. They might need to pronounce vocabulary words correctly to unlock treasure chests, speak commands to their in-game characters, or read passages aloud to progress through story sequences.
The game design makes pronunciation practice feel purposeful rather than arbitrary. Students aren't just repeating words—they're using language to achieve game goals they care about. This purposeful context increases engagement and makes pronunciation practice feel like an integral part of gaming rather than an interruption.
The feedback system provides specific guidance rather than vague corrections. When a student mispronounces a word, the system identifies which phonemes or syllables need improvement: "The 'th' sound in 'through' needs to be softer—try placing your tongue between your teeth." This specificity helps students understand exactly what to change rather than just knowing something is wrong.
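A rough sketch of how phoneme-level scores could be turned into a message like the one above follows. Everything in it is an assumption made for illustration: the `HINTS` table, the 0.7 threshold, and the idea that the recognizer hands back one score between 0 and 1 per phoneme.

```python
# Hypothetical articulation hints, keyed by phoneme label.
HINTS = {
    "th": "try placing your tongue between your teeth",
}
DEFAULT_HINT = "listen to the model again and repeat it slowly"

def feedback(word, phoneme_scores, threshold=0.7):
    """Return a message targeting the weakest phoneme below the threshold."""
    weak = [(p, s) for p, s in phoneme_scores if s < threshold]
    if not weak:
        return f"Great job on '{word}'!"
    phoneme, _ = min(weak, key=lambda pair: pair[1])
    hint = HINTS.get(phoneme, DEFAULT_HINT)
    return f"The '{phoneme}' sound in '{word}' needs work: {hint}."

message = feedback("through", [("th", 0.4), ("r", 0.9), ("oo", 0.85)])
# message == "The 'th' sound in 'through' needs work: "
#            "try placing your tongue between your teeth."
```

Pointing at only the weakest sound keeps the feedback to one actionable change per attempt.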
The system adapts difficulty to student performance. Initially, students might practice individual words in isolation. As pronunciation improves, challenges progress to phrases, sentences, and eventually longer passages. This progressive challenge ensures students work consistently at their skill edge—not so easy that practice becomes boring, not so difficult that it becomes frustrating.
Privacy Through Local Processing
One advantage of offline speech recognition that wasn't initially our motivation but became important to parents is privacy. Audio never leaves student devices—all processing happens locally. Students' voices aren't transmitted to servers, recorded in cloud storage, or shared with third parties. This privacy-preserving architecture addresses legitimate concerns about student data while enabling powerful speech recognition functionality.
We're transparent about how speech recognition works. Parents can see exactly what data is collected (pronunciation accuracy scores, practice frequency, progress metrics) and what isn't collected (actual audio recordings). This transparency builds trust that cloud-based speech recognition can't match—parents and students know with certainty that voices aren't being recorded or analyzed by remote servers.
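The split between what is kept and what is discarded can be pictured as a record type that has no field for audio at all. The sketch below is illustrative only: `PracticeRecord`, `build_record`, and the `score_fn` callback are hypothetical names, not Soraha's data model.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PracticeRecord:
    """The only data that is stored; note there is no audio field."""
    student_id: str
    word: str
    accuracy: float
    attempts: int

def build_record(student_id, word, audio, score_fn, attempts):
    """Score the audio on-device, then keep only the resulting numbers."""
    accuracy = score_fn(audio, word)
    # The audio buffer is not stored on the record and goes out of scope here.
    return PracticeRecord(student_id, word, accuracy, attempts)

# A stand-in scorer; a real one would run the on-device model.
record = build_record("s1", "through", b"\x00\x01", lambda a, w: 0.75, 3)
stored = asdict(record)
# stored == {"student_id": "s1", "word": "through",
#            "accuracy": 0.75, "attempts": 3}
```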
The privacy benefits extend to classroom contexts. Teachers can confidently use speech recognition features knowing that student audio isn't being captured or stored inappropriately. This privacy assurance removes barriers to adoption in contexts where data protection concerns might otherwise prevent speech recognition use.
Teacher Integration and Dashboard Insights
While speech recognition enables independent student practice, it integrates with teacher workflows through dashboard insights. Teachers access pronunciation analytics showing which sounds or patterns students struggle with most. If many students struggle with 'th' sounds, teachers provide targeted whole-class instruction on tongue placement and airflow. The data informs direct instruction without replacing it.
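As an illustration of the kind of aggregation behind such a dashboard view, the sketch below flags sounds that a large share of the class is struggling with. The data layout, the 0.7 threshold, and the "half the class" cutoff are assumptions chosen for the example.

```python
from collections import Counter

def class_trouble_sounds(scores_by_student, threshold=0.7, min_share=0.5):
    """Return phonemes that at least `min_share` of students score poorly on."""
    class_size = len(scores_by_student)
    if class_size == 0:
        return []
    struggling = Counter()
    for phoneme_scores in scores_by_student.values():
        for phoneme, score in phoneme_scores.items():
            if score < threshold:
                struggling[phoneme] += 1
    return sorted(
        p for p, count in struggling.items() if count / class_size >= min_share
    )

# Hypothetical average scores per student and phoneme.
class_scores = {
    "student_a": {"th": 0.4, "r": 0.9},
    "student_b": {"th": 0.5, "r": 0.6},
    "student_c": {"th": 0.9, "r": 0.8},
}
trouble = class_trouble_sounds(class_scores)
# trouble == ["th"]: two of three students are below the threshold on "th".
```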
The pronunciation analytics reveal patterns that might not be obvious in classroom interactions. A student might pronounce a word correctly during one-on-one teacher time, when they are nervous and articulating carefully, but revert to incorrect patterns during independent practice when they are paying less attention. The system captures patterns across multiple practice sessions, giving a more accurate picture of actual pronunciation habits than occasional classroom observation.
Teachers can assign specific speaking practice as homework with dashboard visibility into completion and performance. "Practice this week's vocabulary words until achieving 80% pronunciation accuracy" becomes a measurable, tracked assignment rather than vague homework that may or may not happen. This accountability ensures speaking practice receives the same priority as reading and writing practice.
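A tracked assignment of this kind reduces to a simple check, sketched here with hypothetical names. The 80% target comes from the example assignment above; the word list and scores are made up.

```python
def assignment_status(best_scores, words, target=0.80):
    """Report whether every assigned word has reached the target accuracy."""
    remaining = [w for w in words if best_scores.get(w, 0.0) < target]
    return {"complete": not remaining, "remaining": remaining}

# Hypothetical best scores for one student; "although" was never attempted.
week_words = ["through", "thought", "although"]
best = {"through": 0.85, "thought": 0.72}
status = assignment_status(best, week_words)
# status == {"complete": False, "remaining": ["thought", "although"]}
```

Treating an unattempted word as a score of zero means skipped practice shows up as incomplete rather than silently passing.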
Building Confidence Through Private Practice
One psychological benefit we didn't fully anticipate during design is how powerfully private practice builds speaking confidence. Students can make mistakes without public embarrassment, try difficult words repeatedly without classmates listening, and practice at their own pace without holding up the class. This psychological safety transforms speaking practice from anxiety-inducing performance to comfortable skill-building.
For students who are self-conscious about their accents, vocabulary gaps, or pronunciation challenges, private practice with Soraha provides a judgment-free environment to improve. The technology doesn't mock mistakes, show impatience, or make comparisons to other students. It simply provides accurate feedback and encouragement to try again. This unconditionally supportive environment helps students develop confidence alongside competence.
Students can revisit challenging words and phrases repeatedly without feeling like they're burdening teachers or wasting class time. If a student struggles with a particular sound or word, they can practice that specific item ten times, twenty times, a hundred times—however many repetitions they need to achieve mastery. This personalized practice volume is impossible in traditional classroom contexts where teacher time must be distributed among many students.
Teachers report that students who regularly use speech recognition features demonstrate not just improved pronunciation but increased willingness to participate orally in class discussions. The confidence built through private practice transfers to public speaking situations. Students carry certainty from repeated successful practice into classroom contexts, making them more likely to volunteer and participate orally.
Special Education Applications
Speech recognition technology has particular value for students with learning differences affecting oral language development. Students with speech disorders, language delays, or auditory processing difficulties often need intensive, specialized practice that's impossible to provide in typical classroom contexts. Soraha's unlimited speaking practice with patient, consistent feedback provides therapeutic-level practice volume.
For students with speech disorders, the consistent feedback helps with articulation practice. They can work systematically through sounds they find challenging, receiving immediate feedback on whether they're producing sounds correctly. This practice supplements rather than replaces speech therapy, providing extensive between-sessions practice that accelerates progress. Speech therapists we've consulted recommend Soraha as home practice between therapy sessions.
Students with auditory processing difficulties benefit from the visual and game-based reinforcement accompanying speaking practice. They're not just hearing sounds—they're seeing visual representations, reading text, and getting game feedback. This multimodal approach provides multiple pathways to understanding and remembering correct pronunciation patterns.
The self-paced nature allows students who need extra time to work at comfortable speeds without feeling rushed or pressured. If a student needs to hear a word modeled five times before attempting pronunciation, the system patiently provides those models. If they need to practice a sound twenty times before moving forward, the platform allows that repetition without conveying impatience.
Real Results: Measurable Pronunciation Improvement
The impact data from schools using Soraha shows measurable pronunciation improvement across student populations. Teachers report that students who engage regularly with speech recognition practice demonstrate clearer pronunciation, increased speaking confidence, and greater willingness to participate orally in class discussions. These qualitative observations are backed by quantitative dashboard data.
The pronunciation analytics quantify improvement in accuracy scores over time. Students who practice consistently show steady improvement trajectories, with particularly notable gains in sounds and words they initially struggled with. This quantifiable improvement provides concrete evidence that speaking practice translates to genuine skill development rather than just gaming engagement.
Perhaps more importantly, students themselves report feeling more confident speaking in English and Kiswahili. In surveys we've conducted with pilot schools, 87% of students who used speech recognition features said they felt more comfortable speaking during class discussions. This confidence shift might be even more valuable than pronunciation improvements themselves—students who feel confident speaking seek more speaking practice opportunities, creating virtuous cycles of improvement.
What We Learned Building This
Building offline speech recognition taught us important lessons about educational technology development. First, the "impossible" technical challenges are often just hard, not actually impossible. Many developers told us offline speech recognition couldn't work on budget devices. They were wrong—it just required more engineering effort than most companies are willing to invest.
Second, building for the most constrained contexts creates solutions that work everywhere. Our speech recognition optimized for 5,000 KES devices works brilliantly on flagship devices while remaining accessible to budget users. This universal design approach is more work upfront but creates broader impact.
Third, privacy-preserving designs often have technical advantages beyond privacy. Local processing eliminates network latency, reduces infrastructure costs, and improves reliability. What we initially did for privacy turned out to have multiple benefits.
Fourth, integrating advanced technology into gameplay requires extensive playtesting. Our first speech recognition integration felt tacked-on and disruptive. Version 15 felt natural and purposeful. Finding the right integration took many iterations, but the difference in engagement between awkward and natural integration is enormous.
The Future of Speech in Educational Gaming
We're expanding speech recognition to more complex language tasks—reading comprehension through oral responses, conversational practice for advanced learners, accent refinement for students preparing for international contexts. The technical foundation we've built supports increasingly sophisticated speech-based learning experiences.
For now, watching a student like James—the boy I met practicing alone during break—using Soraha's speech recognition to practice privately, building confidence without fear of embarrassment, and eventually volunteering regularly in class discussions makes every engineering challenge worthwhile. Joseph and I built speech recognition into Soraha because we believe every student deserves unlimited, patient, private pronunciation practice. No student should avoid speaking out of fear. Technology can provide the safe practice space that builds confidence alongside competence. That's the promise of speech recognition in educational gaming, and that's why we invested a year making it work offline on budget devices.
