What We Learned Building a Self-Hosted Speech Translation Platform
5d 5h ago by lemmy.world/u/dhs in opensource@lemmy.ml
Over the past few months, we've been working on a project called PolyTalk.
The original goal was pretty simple: make real-time multilingual communication possible without depending on external translation APIs or cloud-only services.
While testing existing solutions, we noticed that many of them required sending conversations through third-party infrastructure. That works for some use cases, but it wasn't a great fit for organizations that care about privacy, deployment flexibility, or keeping communication workflows under their own control.
So we started building a self-hosted, open-source speech-to-speech translation platform instead.
A few things we've focused on:
- Real-time speech translation
- Self-hosted deployment
- Open-source core
- No external translation APIs
- Live audio translation
The project is still evolving, but it's been interesting exploring the challenges of multilingual communication, local AI infrastructure, and real-time translation workflows.
I'd be curious to hear how others here approach translation.
Are you using cloud-based services, self-hosted tools, or something in between?
GitHub: https://github.com/PolyTalkIO/polytalk
Website: https://polytalk.io/
What We Learned Building a Self-Hosted Speech Translation Platform
Okay, but... what did you actually learn? Your post doesn't go into it, and the links just go to the repository. (That's a long README, by the way.) And a question lingers on my mind since it's important to me personally: You use AI for the translating tech, of course, but how much AI is involved in the other parts of the project? (Such as code, documentation, testing, marketing posts like this one, ...)
Fair point. Looking back, the title probably promised more specifics than the post delivered.
A few things we've learned so far:
- Running speech recognition, translation, and TTS locally is absolutely possible, but latency becomes one of the biggest challenges.
- Supporting multiple audio sources (microphones, meetings, browser tabs, system audio, etc.) often ends up being more complex than the translation itself.
- Self-hosting is a much stronger requirement than we initially expected for organizations with privacy, compliance, or data sovereignty concerns.
- Choosing models is a constant tradeoff between quality, speed, hardware requirements, and language coverage.
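To make the latency point a bit more concrete, here is a minimal sketch of how per-stage timing in an STT, translation, and TTS chain could be measured. This is not PolyTalk's actual code: `run_pipeline` and the `fake_*` stage functions are hypothetical stand-ins, and real models would replace them.

```python
import time
from typing import Callable

Stage = tuple[str, Callable]


def run_pipeline(audio: bytes, stages: list[Stage]) -> tuple[object, dict[str, float]]:
    """Run each stage in order and record how long it took, in seconds."""
    timings: dict[str, float] = {}
    data: object = audio
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    # End-to-end latency is the sum of the sequential stages.
    timings["total"] = sum(timings.values())
    return data, timings


# Hypothetical placeholders for real STT, translation, and TTS models.
def fake_stt(audio: bytes) -> str:
    return "hello world"


def fake_translate(text: str) -> str:
    return "hallo welt"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


STAGES: list[Stage] = [
    ("stt", fake_stt),
    ("translate", fake_translate),
    ("tts", fake_tts),
]

if __name__ == "__main__":
    _, timings = run_pipeline(b"\x00\x01", STAGES)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```

Because the stages run one after another, whichever model is slowest dominates the total, which is why the model choice matters so much for latency.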
Regarding AI usage: the translation pipeline itself is AI-based. For the rest of the project, we've used AI tools where they were helpful, for example, coding assistance, drafting documentation, brainstorming, and editing content, but all code, documentation, testing, and releases are reviewed and validated by the team before becoming part of the project.
Thanks for the feedback. You're right that this post ended up being more of a project introduction than a lessons-learned write-up.
This is a pretty interesting project! Assuming one wanted to run everything locally, what’s the minimum viable hardware stack for near-realtime performance?
Thanks! We're still benchmarking different setups, so I don't want to give a misleading "minimum spec" number yet. In practice, the hardware requirements depend much more on the STT/translation/TTS models you choose than on PolyTalk itself. For a single-user setup, you don't necessarily need expensive hardware. As you push for lower latency, larger models, or multiple simultaneous streams, the requirements increase pretty quickly. Proper hardware benchmarks are something we plan to publish once we've tested a wider range of configurations.
It's nice to see you release this as an open-source, self-hosted project, and also nice to see a Python and FastAPI stack. There are already multiple open-source TTS/STT pipelines, so what are your project's selling points? I couldn't find them in your source. And what exactly is near real-time? Sub-200 ms?
Thanks! You're right that there are already excellent open-source STT, translation, and TTS projects. PolyTalk isn't trying to replace them, and we build on top of them. What we're focused on is creating a complete, self-hosted real-time communication platform that ties those components together and can handle different audio sources (microphones, meetings, browser tabs, system audio, etc.) through a single workflow.
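As a rough illustration of what "different audio sources through a single workflow" can look like in Python, the sketch below puts every source behind one small interface. The names (`AudioSource`, `InMemorySource`, `process`) are made up for this example and are not taken from the PolyTalk codebase.

```python
from typing import Callable, Iterator, Protocol


class AudioSource(Protocol):
    """Anything that can yield raw audio chunks (hypothetical interface)."""

    name: str

    def chunks(self) -> Iterator[bytes]: ...


class InMemorySource:
    """Stand-in for a microphone, meeting, or browser-tab capture."""

    def __init__(self, name: str, data: bytes, chunk_size: int = 4) -> None:
        self.name = name
        self.data = data
        self.chunk_size = chunk_size

    def chunks(self) -> Iterator[bytes]:
        for i in range(0, len(self.data), self.chunk_size):
            yield self.data[i:i + self.chunk_size]


def process(source: AudioSource, handle_chunk: Callable[[str, bytes], None]) -> int:
    """Feed every chunk from a source into the same handler; return the chunk count."""
    count = 0
    for chunk in source.chunks():
        handle_chunk(source.name, chunk)
        count += 1
    return count
```

The downstream code only sees `chunks()`, so adding a new kind of source means writing one new class rather than touching the translation path.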
Regarding latency, we're not targeting sub-200 ms. In our testing, we've intentionally favored translation quality and conversational flow over minimizing latency at all costs. Depending on the setup, end-to-end latency is typically around 2 seconds. One thing we've already improved is processing complete sentences rather than translating word-by-word. That gives the translation more context and generally produces much more natural results. We're also working on additional context-aware translation improvements, and tone adaptation is on our roadmap.
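For anyone curious what sentence-level processing means in practice, here is a minimal sketch that buffers partial transcripts and only releases complete sentences for translation. It assumes sentences end in `.`, `!`, or `?`, which is a simplification (abbreviations like "Dr." would split too early), and `SentenceBuffer` is a hypothetical name, not PolyTalk's implementation.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceBuffer:
    """Collects transcript fragments and emits only complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, fragment: str) -> list[str]:
        """Add a fragment; return any sentences that are now complete."""
        self._pending = f"{self._pending} {fragment}".strip()
        parts = _SENTENCE_END.split(self._pending)
        if parts[-1].endswith((".", "!", "?")):
            complete, self._pending = parts, ""
        else:
            # The last part is still an unfinished sentence; keep it buffered.
            complete, self._pending = parts[:-1], parts[-1]
        return [s for s in complete if s]

    def flush(self) -> list[str]:
        """Return whatever is left, e.g. when the speaker stops talking."""
        rest, self._pending = self._pending, ""
        return [rest] if rest else []
```

The cost of this approach is that nothing is translated until a sentence boundary arrives, which is one reason latency sits in the range of seconds rather than milliseconds.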