Text to speech

This feature is available only on the following devices: Roku Streaming Stick (3600X), Roku Express (3700X) and Express+ (3710X), Roku Premiere (4620X) and Premiere+ (4630X), Roku Ultra (4640X), and any Roku TV running Roku OS 7.2 or later.


Text to speech components

Components available since Roku OS 7.2

Text to speech (TTS) lets the developer provide an audible, spoken version of the strings shown to the user in the app. On platforms that must comply with the Twenty-First Century Communications and Video Accessibility Act of 2010 (CVAA), this capability can be used as part of CVAA compliance. The current text to speech library, flite_tts, is built into the firmware image. The Roku text to speech capability supports different languages, voices, speech rates, speech volumes, and other aspects of text to speech. Roku provides text to speech support in the following components, interfaces, and events:

Components available since Roku OS 7.5

Screen reader behavior for SceneGraph nodes

  • Button: text of button is spoken only if focused
  • ButtonGroup: speaks focused Button, followed by navigation hint (“button 1 of 4”), followed by button-specific hint, if any (a button-specific hint is spoken only for StarRatingButton)
  • CheckList: speaks focused item (ContentMetaData::AUDIO_GUIDE_TEXT if any; otherwise ContentMetaData::TITLE) followed by navigation hint (“checkbox, checked, 1 of 4”)
  • Dialog: speaks title, message, and bulletText (if any), then reads focused button
  • Keyboard: speaks hint about caps lock toggling (once), then speaks focused key
  • Label: speaks text field
  • MarkupGrid: speaks focused ContentMetaData::AUDIO_GUIDE_TEXT if any; otherwise speaks ContentMetaData::TITLE, followed by navigation hint, then ContentMetaData::AUDIO_GUIDE_SUFFIX (if any), then MEDIA speech (see below)
  • PinDialog: speaks dialog title, whether focus is in the keypad, then the focused key or button
  • Poster: if focused, speaks audioGuideText field (if set)
  • ProgressDialog: speaks dialog title, message, and bulletText every 15 seconds, then speaks the focused button (if any)
  • RadioButtonList: speaks focused item (ContentMetaData::AUDIO_GUIDE_TEXT if any; otherwise, ContentMetaData::TITLE), followed by navigation and selection hint
  • RenderableNode: if speaking focused item (depends on context), will speak focused descendant; otherwise, will speak all descendants
  • RowList: speaks row label (when row becomes focused), then speaks focused PosterGrid or MarkupGrid (MarkupGrid is used if itemComponentName is non-empty)
  • Video: speaks HUD if displayed by user
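
The fallback and ordering rules above can be modeled in a few lines. The following is a minimal sketch of the MarkupGrid rule only, written in Python purely for illustration; it is not Roku's implementation. The function name, the dictionary representation of the content metadata, and the "1 of 4" hint text are assumptions, and MEDIA speech is left out.

```python
def markup_grid_speech(item, navigation_hint):
    """Return the phrases for a focused MarkupGrid item, in spoken order.

    `item` is a plain dict standing in for ContentMetaData (an assumption
    made for this sketch); `navigation_hint` is supplied by the caller.
    """
    # AUDIO_GUIDE_TEXT takes precedence; TITLE is the fallback.
    primary = item.get("AUDIO_GUIDE_TEXT") or item.get("TITLE", "")
    # The navigation hint follows, then the optional suffix.
    parts = [primary, navigation_hint, item.get("AUDIO_GUIDE_SUFFIX")]
    # Drop anything that is missing or empty.
    return [part for part in parts if part]


title_only = markup_grid_speech({"TITLE": "News"}, "1 of 4")
with_guide = markup_grid_speech(
    {"TITLE": "News", "AUDIO_GUIDE_TEXT": "Evening news",
     "AUDIO_GUIDE_SUFFIX": "new episode"},
    "1 of 4",
)
```

In this model, setting AUDIO_GUIDE_TEXT replaces the title in the spoken output instead of being appended to it, which matches the "if any; otherwise" wording in the list.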

Screen reader behavior for built-in SceneGraph panels and scenes:

  • PanelSet:

    • If left panel is focused, speaks focused left panel, then unfocused right panel (if any)
    • If right panel is focused, speaks unfocused left panel, then focused right panel
    • If no panel is focused, speaks unfocused left panel, then unfocused right panel (if any)
  • Scene: speaks dialog (if any); otherwise speaks PanelSet (if any); otherwise speaks as RenderableNode

MEDIA speech is spoken in the following order:

Implementation tips

TTS interruptions

Many UI elements have default TTS behavior, and speech triggered by those defaults can occasionally interrupt speech that your app started. Keep track of the IDs of your TTS utterances, as returned by say() and silence(), and handle interruptions accordingly.
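
One way to organize that bookkeeping is sketched below in Python, purely as an illustration of the logic; on a device the same idea would be written in BrightScript. The class, the event names "completed" and "cancelled", and the literal IDs are hypothetical. The only behavior taken from the text above is that each utterance has an ID you can record.

```python
class UtteranceTracker:
    """Remembers which utterance IDs belong to this app (hypothetical helper)."""

    def __init__(self):
        self._pending = {}     # utterance ID -> text we asked to be spoken
        self.interrupted = []  # our texts that were cut off before finishing

    def record(self, utterance_id, text):
        # Call this with the ID returned when the utterance was queued.
        self._pending[utterance_id] = text

    def handle_event(self, kind, utterance_id):
        """Process one speech event; return False if the ID is not ours."""
        text = self._pending.pop(utterance_id, None)
        if text is None:
            return False  # speech started by a built-in UI element or other code
        if kind == "cancelled":
            self.interrupted.append(text)  # candidate for speaking again later
        return True


tracker = UtteranceTracker()
tracker.record(7, "Welcome to the channel")
tracker.record(8, "Press OK to continue")
tracker.handle_event("completed", 7)
tracker.handle_event("cancelled", 8)
foreign = tracker.handle_event("completed", 99)
```

Keeping the interrupted texts in a list leaves the recovery policy to the app: it can repeat them, drop them, or replace them with a shorter prompt.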

Other TTS implementation changes

Other TTS implementations may change the current voice, language, volume, pitch, and/or speech rate. Keep track of how these parameters might change rather than assuming they still hold the values you last set.
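
A defensive pattern is to apply the settings you need immediately before speaking and to restore the previous values afterward. The sketch below shows that pattern in Python for illustration only. FakeTts, its settings dictionary, the numeric values, and own_settings are all hypothetical stand-ins, not Roku APIs.

```python
from contextlib import contextmanager


class FakeTts:
    """Hypothetical stand-in for a shared TTS engine; not a Roku API."""

    def __init__(self):
        self.settings = {"voice": "default", "language": "en-US",
                         "volume": 5, "pitch": 0, "rate": 0}
        self.spoken = []

    def say(self, text):
        # Record the text together with the settings in force when it was spoken.
        self.spoken.append((text, dict(self.settings)))


@contextmanager
def own_settings(tts, **wanted):
    """Apply our settings for the duration of the block, then restore the old ones."""
    saved = dict(tts.settings)
    tts.settings.update(wanted)
    try:
        yield tts
    finally:
        tts.settings.clear()
        tts.settings.update(saved)


tts = FakeTts()
tts.settings["rate"] = 2  # simulate another implementation changing the rate
with own_settings(tts, rate=0, volume=8) as engine:
    engine.say("Settings menu")
```

Restoring the previous values matters as much as setting your own, because your code is one of the "other implementations" from the point of view of everything else that speaks.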

Long text delays

A long text string may take noticeably longer to begin speaking, at least for the first utterance of that string. To reduce the perceived delay, break up the string so that the first utterance is a reasonably short sentence, followed by longer sentences as needed. Do not break the string into individual words: doing so harms phrasing without noticeably improving the perceived delay.
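
The splitting strategy can be sketched as follows, in Python for illustration only. The function name, the sentence-boundary regular expression, and the 200-character grouping limit are assumptions chosen for the example, not values from Roku. Each returned chunk would be passed to the TTS engine as a separate utterance.

```python
import re


def split_for_tts(text, later_max=200):
    """Split text into utterances: the first sentence alone, then grouped sentences.

    Sentences are never split into words, so phrasing is preserved.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks = [sentences[0]]  # short first utterance so speech starts quickly
    current = ""
    for sentence in sentences[1:]:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > later_max:
            chunks.append(current)  # close the group before it grows too long
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = ("Welcome. This channel offers movies and shows. "
        "Use the arrow keys to browse. Press OK to play.")
chunks = split_for_tts(text)
```

The sketch does not shorten the first sentence if it happens to be long. In that case, rewording the string so that it opens with a short sentence is likely a better fix than splitting it mechanically.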