RTTM format specification and its application

Maksim Panfilov

Rich Transcription Time Marked (RTTM) is a widely used, text-based format for annotating audio and video, representing results of speech recognition, speaker diarization, and related metadata. Developed by NIST in the early 2000s, RTTM files consist of 10 fields per line, detailing aspects such as type, time boundaries, and identifiers for speech segments, words, and other audio events. The format supports overlapping events and is highly integrated into modern tools like pyannote.audio, NVIDIA NeMo, and DScore, making it a standard in speech processing for both research and industry. Despite its strict structure with redundant fields, RTTM remains a reliable and comprehensive means to describe audio annotations for various applications.

Rich Transcription Time Marked (RTTM) is a text-based format used for annotating audio/video, representing the results of speech recognition, diarization (speaker segmentation), and related metadata. The format was developed as part of the NIST Rich Transcription evaluations in the early 2000s and has since become the standard for storing results of automatic speech recognition (ASR) and diarization. An RTTM file is a simple text file, where each line corresponds to a single annotated segment (object), specifying its type, time boundaries, and other attributes. This unified format allows the description of speech segments with speaker identification, transcription words, noise events, and reference information about speakers, all in one file.

RTTM is widely used in datasets and tools for diarization and ASR. For instance, many corpora from LDC are accompanied by RTTM annotations, and competitions in diarization (such as DIHARD) and keyword spotting (OpenKWS) require output in RTTM format. Modern libraries, ranging from academic (e.g., pyannote.audio, DScore) to industrial (e.g., NVIDIA NeMo), support this format for exchanging annotations of results.

Below is a comprehensive description of the RTTM specification: segment types and their fields, the 10-field RTTM line format, differences between versions of the standard, examples of entries, and the capabilities and limitations of this format.

Types of RTTM segments and their fields

RTTM defines several segment types, each describing a specific event or metadata type within an audio recording. All RTTM lines have a fixed format consisting of 10 fields; however, depending on the segment type, some fields contain values, while others are not used (marked as <NA> – not applicable). Table 1 lists the main RTTM segment types, their purpose, and which fields are informational (mandatory) for each type, as well as which remain empty (<NA>).

| Type | Purpose (object) | Data fields | Non-data fields |
| --- | --- | --- | --- |
| SPEAKER | Speech segment from a specific speaker (for diarization) | File, Chnl, Tbeg, Tdur, Name (speaker ID) | Ortho, Stype, Conf, Slat always <NA> |
| LEXEME | Lexeme (word) in speech transcription | File, Chnl, Tbeg, Tdur, Ortho (word text), Stype (word subtype) | Name typically <NA>; Conf and Slat may be specified or <NA> |
| NON-LEX | Non-verbal vocalization (e.g., laughter, cough) | File, Chnl, Tbeg, Tdur, Stype (sound category, e.g., "cough") | Ortho not used (<NA>); Name <NA>; Conf/Slat optional |
| NON-SPEECH | Non-speech background activity (noise, music, etc.) | File, Chnl, Tbeg, Tdur, Stype (category, e.g., "noise") | Ortho <NA>; Name <NA>; Conf/Slat optional |
| SPKR-INFO | Speaker metadata (e.g., gender, age) | File, Chnl, Stype (speaker category), Name (speaker ID), Conf (optional) | Tbeg, Tdur, Ortho not used (<NA>); Slat <NA> |
| SEGMENT (structural) | Structural segment of the recording (e.g., evaluation region) | File, Chnl, Tbeg, Tdur; Stype may be "eval" for marked evaluation areas | Ortho, Name, Conf, Slat typically <NA> |
| NOSCORE (structural) | Area to be excluded from evaluation | File, Chnl, Tbeg, Tdur (skip interval) | All other fields (Ortho, Stype, Name, Conf, Slat) <NA> |
| NO_RT_METADATA (structural) | Area without timestamped (real-time) metadata | File, Chnl, Tbeg, Tdur (interval) | Remaining fields <NA> |
The fields File (file name) and Chnl (channel number) are present and populated for all types of RTTM lines. The fields marked as "optional" may contain values if the corresponding information or evaluation is available; otherwise, they are filled with <NA>. For example, for LEXEME segments and other lexical objects, the Conf field is often used to represent the probability of word correctness (range 0–1). However, if the system does not provide this value, <NA> is used. For most fields that are not applicable to a given type, the standard requires that the string <NA> (without quotes) be entered instead of leaving the field empty.

As seen in the table, different RTTM segment types are intended for different aspects of annotation:

  • SPEAKER and SPKR-INFO types relate to speakers: the former defines speech intervals associated with a speaker's label (e.g., Speaker A or B), while the latter stores information about the speaker (e.g., gender, age group, etc.).

  • LEXEME, NON-LEX, and NON-SPEECH types describe the content of speech: specifically, the words spoken (lexemes) and non-content sounds (such as filler pauses, laughter, and noise).

  • Structural types (SEGMENT, NOSCORE, NO_RT_METADATA) are used to denote areas on the timeline: for example, segments for evaluation (eval), regions that should not be evaluated (noscore), or the absence of metadata.

It is important to note that not all types are used in every scenario. In diarization tasks, SPEAKER lines are typically generated (and SPKR-INFO for each speaker if gender/age information is available). In speech recognition tasks with transcript annotations, LEXEME (words) and labels for noise/vocalizations (NON-SPEECH/NON-LEX) are more commonly found. However, the RTTM format is versatile, and if necessary, all types of objects can coexist in a single file (for example, in IARPA Babel datasets, RTTM contains both the full transcription with word segmentation and speaker information).

RTTM format: description of 10 line fields

Each RTTM line consists of 10 fields separated by spaces. Below is a description of these fields in order (1–10), along with their values and acceptable formats:

  1. Type – segment type (object). This is a text label for one of the types listed above: for example, SPEAKER, LEXEME, NON-SPEECH, SPKR-INFO, etc. This field is mandatory and cannot be <NA>. The value must strictly correspond to one of the types defined by the RTTM specification (see the table above).

  2. File – file identifier (record name). This is usually the base name of the audio file, without extension and path. It serves to group segments by one record: all lines with the same File refer to the same audio file. Alphanumeric strings, underscores, etc., are allowed; spaces are not (if the file name contains spaces, they are typically replaced with underscores). Example: lecture_01 for the file lecture_01.wav.

  3. Chnl – channel identifier (Channel ID). Specifies the number of the audio channel to which the segment belongs. Acceptable values are integer strings. Typically, "1" is used for mono recordings. In stereo recordings, "1" or "2" can be used, depending on the channel (left/right track). (In some tools, the channel may be denoted by "0" or another number, but the NIST standard assumes 1-indexed channels.) This field is mandatory; if the concept of a channel does not apply, "1" is often used by default.

  4. Tbeg – segment start time (Turn onset/beginning time). A floating-point number in seconds from the beginning of the recording, representing the start of the annotated interval. For example, 5.000 means 5.000 seconds from the start of the file. For segments without specific time binding (e.g., SPKR-INFO), <NA> is used. The start time must be non-negative. High precision is allowed (up to thousandths of a second, e.g., 5.230 or 5.23 – both are acceptable). In cases where time is unknown or inapplicable, <NA> should be used instead of a numerical value.

  5. Tdur – segment duration (Turn duration). A floating-point number (in seconds) representing the duration of the interval starting at Tbeg. For example, if Tbeg=5.000 and Tdur=3.000, the segment spans from 5.000 to 8.000 seconds. If the object has no duration (e.g., speaker metadata), <NA> is used.

    💡
    If both Tbeg and Tdur are "dummy" values (e.g., used for synchronization but not corresponding to actual absolute times), some specifications recommend marking them with an asterisk (e.g., 12.34* instead of 12.34).
  6. Ortho – orthography field. Contains the text content of the segment for lexical objects. For LEXEME, this indicates the word or transcribed speech unit as it is written (in normative orthography). For ordinary words (subtype lex, see Stype), Ortho holds the word itself (e.g., "hello"). For other subtypes, there may be special notations, such as spelled-out numbers or coded phrases. If no orthographic representation is available for this type (e.g., for noise labels or speaker segments), <NA> is used. Even for words, if the transcript is undefined or unclear, <NA> may be used (e.g., for the frag subtype – an unclear fragment of speech). Generally, acceptable values depend on the type: for LEXEME, a string of characters (which may include apostrophes, hyphens, etc., if part of the word); for NON-LEX and NON-SPEECH, always <NA>.

  7. Stype – subtype of the segment. A clarifying field that classifies the object within its type. The value is given as text (no spaces). The list of acceptable subtypes depends on Type:

    • For LEXEME (words), possible values include: lex (regular word), fp (filled pause, e.g., "uh"), frag (word fragment), un-lex (uncertainly recognized word or laughter within a word), for-lex (foreign word), alpha (spelling or alphabetic pronunciation), acronym, interjection, propernoun, and other. Example: LEXEME ... ortho="uh" stype="fp" for the sound "uh".

    • For NON-LEX (non-verbal vocalizations), subtypes include: laugh, breath, lipsmack, cough, sneeze, and other.

    • For NON-SPEECH (background sounds), subtypes include: noise (unidentified noise), music, and other. In some corpora, noise is detailed as <sta> (static background), and other covers sounds like <click>, <ring>, <dtmf>, <prompt>, and <overlap>.

    • For SPEAKER lines, the subtype is not applicable – always <NA>.

    • For SPKR-INFO, the subtype typically refers to the speaker category – the type of voice source: for example, adult_male (adult male), adult_female (adult female), child (child), or unknown if data is unavailable. Earlier standards (before Babel) may have used just male/female without age specification.

    • For SEGMENT, the subtype may indicate the segment’s purpose. For example, eval – a segment used for system evaluation (scoring), <NA> – a segment without special status.

    • For NOSCORE and NO_RT_METADATA, the subtype is not used (marked as <NA>).

The Stype field is mandatory for those types where subtypes are defined (see the table above); otherwise, <NA> is used. An invalid combination of Type/Stype (one not defined in the specification) makes the RTTM file invalid; a simple lookup-table check is sketched after this field list.

  8. Name – speaker or object name (identifier). Mainly used in lines related to speakers:

    • For SPEAKER, this specifies the speaker label to which the segment belongs. This label should uniquely identify the speaker within the file. For example, names like Speaker_A, spk1, guest1, etc., can be used. If the diarization system does not know the real identity, abstract identifiers are used. If the identity is known, the real name or ID (e.g., John_Doe) can be specified.

    • For SPKR-INFO, the Name is also mandatory and should match the speaker identifier to which these metadata refer. In other words, SPKR-INFO is linked to SPEAKER segments via a shared Name.

    • For lexical objects (LEXEME, NON-LEX, NON-SPEECH), the Name field is typically not used and is filled with <NA>. The RTTM format does not require specifying the speaker’s name for each lexeme – it is assumed that the information about who spoke is carried by the SPEAKER segments. Therefore, to determine which speaker pronounced a particular word (LEXEME), one must look at the overlap with SPEAKER segments.

      💡
      Some systems may extend the format by duplicating the speaker identifier in each word line, but this is not part of the official standard.
    • For structural types (SEGMENT, NOSCORE, etc.), Name is not used (<NA>).

If the speaker’s name is unknown or does not apply (e.g., in single-channel speech without speaker separation), <NA> is used.

  9. Conf – confidence score. A measure of the system’s confidence in the correctness of the information: a number in the range [0.0, 1.0] (as a text representation), where 1.0 indicates complete confidence. For example, for a recognized word, this could be the probability of correctness. If the concept of confidence does not apply to the object, or the system does not provide this metric, <NA> is used. In practice, the Conf field is most often used in LEXEME lines (for example, ASR output may provide a probability for each word). In diarization, confidence scores are usually not provided (set to <NA> for SPEAKER segments). The field is optional only in the sense that, when no value is available, <NA> is written in its place.

  10. Slat – Signal Lookahead Time. This is the time (in seconds) of the last signal sample that was used by the algorithm to make a decision about this object. In other words, some event detection algorithms may need to look a bit ahead of the nominal end of the event; this time is recorded in Slat. If the algorithm does not compute this parameter, <NA> is used. The Slat field was added in later versions of the standard and almost always remains <NA> (few systems explicitly fill it). Many tools ignore this field entirely. For example, the official NIST md-eval-v21.pl script for DER calculation in diarization does not read the 10th field, so the absence of Slat or its value being <NA> does not affect the evaluation. Nevertheless, the specification requires the presence of this field (even if it is just <NA>) so that each line contains exactly 10 fields.
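Because the valid Type/Stype combinations form a small closed set, they can be checked with a plain lookup table. Below is a minimal Python sketch based on the subtype lists above; the table name and helper function are ours for illustration, not part of any official tooling (note that older corpora use plain male/female for SPKR-INFO, so both variants are included):

```python
# Allowed Type -> Stype combinations, following the subtype lists above.
# None stands for <NA>. This is an illustrative table, not an official one.
ALLOWED_SUBTYPES = {
    "SPEAKER": {None},
    "LEXEME": {"lex", "fp", "frag", "un-lex", "for-lex", "alpha",
               "acronym", "interjection", "propernoun", "other"},
    "NON-LEX": {"laugh", "breath", "lipsmack", "cough", "sneeze", "other"},
    "NON-SPEECH": {"noise", "music", "other"},
    # Babel-era age/gender categories plus the older plain male/female labels.
    "SPKR-INFO": {"adult_male", "adult_female", "child", "unknown",
                  "male", "female"},
    "SEGMENT": {"eval", None},
    "NOSCORE": {None},
    "NO_RT_METADATA": {None},
}

def check_type_stype(rttm_type: str, stype: str) -> bool:
    """Return True if the Type/Stype pair is valid per the lists above."""
    subtype = None if stype == "<NA>" else stype
    return subtype in ALLOWED_SUBTYPES.get(rttm_type, set())
```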

The final structure of an RTTM line is: <Type> <File> <Chnl> <Tbeg> <Tdur> <Ortho> <Stype> <Name> <Conf> <Slat>. For example, the format of a diarization line (speaker and time) would be:

SPEAKER <file_id> <channel> <start> <duration> <NA> <NA> <speaker_id> <NA> <NA>

For instance: SPEAKER meeting_01 1 13.500 4.200 <NA> <NA> spk1 <NA> <NA> (speaker spk1 speaks from 13.5s to 17.7s). And a line for a recognized word:

LEXEME <file_id> <channel> <start> <duration> <orthography> <subtype> <NA> <confidence> <slat>

For example: LEXEME meeting_01 1 15.300 0.500 hello lex <NA> 0.95 <NA> indicates the word "hello" with a duration of 0.5s, recognized with a confidence of 0.95.
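To make the field layout concrete, here is a minimal Python parser for such lines. It is a sketch, not a reference implementation: <NA> is mapped to None, trailing asterisks on "dummy" times are tolerated, and (anticipating the version history below) legacy 9-field lines are padded with the missing Slat:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RTTMLine:
    type: str              # SPEAKER, LEXEME, SPKR-INFO, ...
    file: str              # recording identifier
    chnl: str              # channel ID (usually "1")
    tbeg: Optional[float]  # onset in seconds, None if <NA>
    tdur: Optional[float]  # duration in seconds, None if <NA>
    ortho: Optional[str]   # orthography (LEXEME only)
    stype: Optional[str]   # subtype
    name: Optional[str]    # speaker ID (SPEAKER / SPKR-INFO)
    conf: Optional[float]  # confidence in [0.0, 1.0]
    slat: Optional[float]  # signal lookahead time

def parse_rttm_line(line: str) -> RTTMLine:
    fields = line.split()
    if len(fields) == 9:        # legacy pre-RT-09 line: pad the missing Slat
        fields.append("<NA>")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")

    def text(s: str) -> Optional[str]:
        return None if s == "<NA>" else s

    def number(s: str) -> Optional[float]:
        t = text(s)
        return None if t is None else float(t.rstrip("*"))  # tolerate "12.34*"

    t, f, c, tb, td, o, st, n, cf, sl = fields
    return RTTMLine(t, f, c, number(tb), number(td), text(o), text(st),
                    text(n), number(cf), number(sl))

seg = parse_rttm_line("SPEAKER meeting_01 1 13.500 4.200 <NA> <NA> spk1 <NA> <NA>")
print(seg.name, seg.tbeg, seg.tbeg + seg.tdur)  # spk1 13.5 17.7
```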

Evolution of the RTTM standard and version differences

The RTTM format has gone through several iterations during a series of Rich Transcription evaluations and related projects. The core structure (10 fields, multiple segment types) was established in the mid-2000s, but there are some differences between versions of the specifications from different years and organizations:

  • Early Versions (RT-03 and LDC v1.3 Specification): Initially, the RTTM format contained 9 fields. In the 2003 Rich Transcription evaluation plans (RT-03F) and related LDC materials (the RTTM-format-v13.pdf for LDC2004T12), the Slat field was absent. The lines consisted of the Type, File, Chnl, Tbeg, Tdur, Ortho, Stype, Name, Conf fields. The addition of the 10th field came later. For example, in RT-03 for diarization, the output annotation included SPEAKER lines with 9 fields, and the probability was indicated in Conf only if the system provided it (often Conf was not used). The SPKR-INFO field existed but may not have been applied; when used, the subtype was typically male or female (without the adult_ prefix) to indicate the speaker's gender, as the "child" category might not have been considered in those datasets.

  • Introduction of the Slat Field (RT-09): By 2009, NIST expanded the format to 10 fields by adding the Signal Lookahead Time. The RT-09 Evaluation Plan (Appendix A) defined all ten fields, including Slat. This was part of the unification of formats for various tasks (diarization, meeting speech, keyword spotting). However, in practice, in 2009 and beyond, most systems left Slat empty (<NA>), and evaluation tools (DER, STT) did not consider this field. Thus, the transition from 9 to 10 fields did not alter the core output but formally required the addition of <NA> at the end of each line. Many existing RTTM files from the early 2000s did not have this field — using them with modern tools often required adding a dummy <NA> field to each line.

  • OpenKWS and Babel Data (around 2013–2015): The IARPA Babel and OpenKWS programs used RTTM to represent reference transcriptions, combining word-level annotations and speaker metadata. In the OpenKWS specifications (e.g., the 2015 Evaluation Plan), the RTTM format is described with the same 10 fields but with expanded subtype lists for better support of different languages and annotations. Specifically, the alpha (spelled-out sequences) and acronym subtypes were added for LEXEME, and tags for vocalizations and noises (e.g., breath, lipsmack) were more clearly defined. Babel annotation also adopted the practice of using SPEAKER lines for smoothed speaker segmentation (even with word-level segmentation available) to separate utterances. The subtypes for SPKR-INFO expanded to include age-based categories: adult_male, adult_female, child — a departure from older datasets where only male/female was typically used. In OpenKWS, structural types (SEGMENT, NOSCORE) were also explicitly defined to mark evaluation regions or excluded regions within a single RTTM, although earlier NIST evaluations often used separate files for these purposes (e.g., UEM – Un-partitioned Evaluation Map).

  • Differences in LDC and Other Sources: LDC in its documentation might make small adjustments. For example, the mentioned RTTM-format-v13 (LDC2004T12) did not include the Slat field, reflecting the state of the standard in 2003. Some LDC tools (such as XTrans) use their own formats based on RTTM (e.g., Tab-Delimited Format, TDF), but maintain compatibility by exporting data to RTTM when necessary. Overall, the RTTM format stabilized after 2009, with subsequent variations mainly concerning subtype dictionaries and naming conventions. Modern open-source tools (see below) typically expect RTTM in the format equivalent to RT-09/NIST OpenKWS: 10 fields, standard type and subtype names.

  • Practical Differences in Usage: Different subsets of RTTM may be used for different tasks. For example, for speech recognition quality evaluation (Speech-To-Text), the STM/CTM format is sometimes preferred, but RTTM is also supported — conversion scripts exist, such as rttm2ctm.pl, to convert RTTM to CTM for WER calculation using SCTK/SCLITE (a sketch of this conversion follows below). For diarization, however, RTTM has become the main output and evaluation format (DER). Diarization evaluation tools typically require only SPEAKER lines in RTTM hypothesis and reference files (other types are ignored). Therefore, some systems output only 9 fields (omitting Slat) or fill all optional fields with <NA> to match the expected SPEAKER ... <NA> <NA> pattern. This is tolerated in practice since the missing field does not affect the metric, but strictly according to the format, it is better to include the 10th field even if it is <NA>.
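As an illustration of the RTTM-to-CTM direction just mentioned, here is a rough Python sketch of what an rttm2ctm.pl-style conversion does for word lines. It assumes the usual CTM layout (file channel start duration word [confidence]) and is our simplification, not the actual script:

```python
def rttm_to_ctm(rttm_lines):
    """Yield CTM lines (file chnl start dur word [conf]) from RTTM LEXEME lines."""
    for line in rttm_lines:
        fields = line.split()
        if not fields or fields[0] != "LEXEME":
            continue  # only word entries translate to CTM
        _, file_id, chnl, tbeg, tdur, ortho, _stype, _name, conf, *_ = fields
        if ortho == "<NA>":
            continue  # skip entries without a usable orthography (e.g., frag)
        ctm = f"{file_id} {chnl} {tbeg} {tdur} {ortho}"
        if conf != "<NA>":
            ctm += f" {conf}"  # CTM's confidence column is optional
        yield ctm

# Example: convert a file and print the resulting CTM lines.
for ctm_line in rttm_to_ctm(open("meeting_01.rttm")):
    print(ctm_line)
```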

Examples of valid RTTM entries

Below are examples of RTTM lines for different types, illustrating the correct format and field values. Let's assume we have a snippet of annotation for an audio file meeting_audio.wav, where one participant (Speaker_A) is speaking, there is background noise before their speech, and a cough is noted during their speech. Also, assume the speaker’s gender is known. Then, the RTTM annotation could include the following lines:

NON-SPEECH meeting_audio 1 0.000 5.000 <NA> noise <NA> <NA> <NA>
SPEAKER    meeting_audio 1 5.000 3.000 <NA> <NA> Speaker_A <NA> <NA>
LEXEME     meeting_audio 1 5.000 0.500 hello lex <NA> <NA> <NA>
NON-LEX    meeting_audio 1 7.000 0.300 <NA> cough <NA> <NA> <NA>
SPKR-INFO  meeting_audio 1 <NA>   <NA>  <NA> male Speaker_A <NA> <NA>

Let's break down these lines:

  • NON-SPEECH: meeting_audio 1 0.000 5.000 <NA> noise <NA> <NA> <NA> – from 0.000 to 5.000 seconds in channel 1 of the file meeting_audio, background noise (subtype noise) is noted. The Ortho and Name fields are not applicable (<NA>). This line describes a non-content audio zone (e.g., room noise or silence) lasting 5 seconds before speech.

  • SPEAKER: meeting_audio 1 5.000 3.000 <NA> <NA> Speaker_A <NA> <NA> – speech segment from speaker Speaker_A, starting at 5.000 seconds with a duration of 3.000 seconds (i.e., until 8.000 seconds). The Ortho and Stype fields are not used for SPEAKER type (they are set to <NA>). This line means: “Speaker Speaker_A is speaking from 5.0 to 8.0 seconds of the file.” It might have been generated by a diarization system. Note: the Speaker_A identifier is local to this file — if the same speaker appeared in another file, they might have a different name if that wasn't predefined.

  • LEXEME: meeting_audio 1 5.000 0.500 hello lex <NA> <NA> <NA> – the word (lexeme) with the orthography "hello", starting at 5.000 seconds and lasting 0.500 seconds. The lex subtype indicates this is a regular word. Name is not provided (<NA>), which follows the rule – the specific speaker for the word is not explicitly stated, but we can infer that Speaker_A was speaking during the 5.0–5.5 second interval (based on the previous SPEAKER line). Confidence and Slat are not specified (<NA>), presumably because the system didn’t provide confidence for this word, or it was not needed (if this is a reference transcription line). Thus, this line represents a transcription word spoken within Speaker_A's segment.

  • NON-LEX: meeting_audio 1 7.000 0.300 <NA> cough <NA> <NA> <NA> – a non-verbal vocal event from 7.000 to 7.300 seconds, in this case, subtype cough. The orthography field is not applicable (<NA>) since coughing doesn’t have a text representation. The speaker’s name is not explicitly listed, but by the timing, this cough also belongs to Speaker_A (as it falls within the 5–8 second speech segment). This line might have been generated if the cough was marked separately during annotation (for example, in a manual transcription or with an event detector). It signals a brief cough sound in the recording.

  • SPKR-INFO: meeting_audio 1 <NA> <NA> <NA> male Speaker_A <NA> <NA> – speaker metadata for Speaker_A: the male subtype indicates that this is a male speaker. The time fields are not filled (<NA>) because these are global file-level metadata. Channel is marked as 1 (i.e., it is implied that the speaker is on channel 1, which makes sense for a mono recording). Confidence is marked as <NA> – typically, SPKR-INFO can include a confidence about gender/age if such a classifier was used, but here it is not provided. This line essentially declares: “In the file meeting_audio on channel 1, the speaker with the ID Speaker_A is male.” Similarly, it could be female or child in other cases. If the gender were unknown, unknown would be used. SPKR-INFO lines are typically placed either at the beginning or end of the RTTM file – their position doesn’t affect the timeline, as they have no time fields.

All the above lines together form a coherent annotation: first, background noise is noted, followed by speech from Speaker_A, where they say the word "hello" and cough, and it is specified that this speaker is male. The format is maintained: each line contains exactly 10 fields (including <NA> where there is no data).

In a real RTTM file, there could be additional lines, for example, for subsequent utterances or other speakers. If a second participant appeared, we would see new SPEAKER lines like Speaker_B and corresponding words, and possibly SPKR-INFO for Speaker_B. Also, if part of the audio should not be evaluated (e.g., overlapping speech that was excluded from error calculation), there could be a NOSCORE line for that interval.

The examples demonstrate the versatility of RTTM: it can reflect both content (words, noises) and the structure of the conversation (who spoke and when, who the speakers are), all in a single temporal space within the recording.
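Since the word-to-speaker link in these lines is only implicit, recovering it (e.g., concluding that "hello" belongs to Speaker_A) means intersecting time intervals. Here is a minimal sketch of that inference in Python, using plain tuples rather than any particular library:

```python
def assign_words_to_speakers(speaker_segs, words):
    """speaker_segs: (tbeg, tdur, speaker_id) tuples from SPEAKER lines;
    words: (tbeg, tdur, ortho) tuples from LEXEME lines.
    Assigns each word to the speaker whose segment overlaps it the most."""
    assigned = []
    for w_beg, w_dur, ortho in words:
        w_end = w_beg + w_dur
        best_speaker, best_overlap = None, 0.0
        for s_beg, s_dur, speaker in speaker_segs:
            overlap = min(w_end, s_beg + s_dur) - max(w_beg, s_beg)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        assigned.append((ortho, best_speaker))
    return assigned

# Using the example annotation above:
segs = [(5.000, 3.000, "Speaker_A")]
words = [(5.000, 0.500, "hello")]
print(assign_words_to_speakers(segs, words))  # [('hello', 'Speaker_A')]
```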

Features and limitations of the RTTM format

Advantages and Capabilities

  • RTTM is human-readable: Since it is a text format, RTTM files are easy to read and edit. Each line is a meaningful record that can be interpreted without specialized software.

  • Unified format for different types of annotations: One file can represent multiple types of annotations, including speaker segmentation, full speech transcriptions (with words and vocalizations), noise/music labels, and additional information about channels and speakers. This makes RTTM a powerful tool for integrated speech annotation. NIST specifically describes RTTM as a "cross-evaluation file format" that can be applied across different tasks.

  • Supports overlapping and simultaneous events: Unlike simple line-by-line transcription, where speakers alternate, an RTTM file can contain lines with overlapping time intervals (e.g., simultaneous speech from two speakers will be represented by two SPEAKER lines with overlapping Tbeg/Tdur). This allows for the representation of overlapping speech or background noises in parallel with speech. Many tools (e.g., DER scorer) correctly handle such overlaps when calculating metrics.

  • Detailed format: RTTM can store not only words but also word types (such as pauses or fragments, useful for normalization metrics in word search), noise types, and confidence values. For example, for keyword search systems, each word hypothesis can also be formatted as an RTTM entry with the Conf field, enabling RTTM to be used for this task as well.

  • Ease of parsing and generation: RTTM is easy to parse programmatically, as lines are split by spaces into a fixed number of columns. This reduces the chance of errors when integrating different systems. Many libraries have built-in functions for reading and writing RTTM files; a minimal writer is sketched after this list.

  • Standard for diarization quality evaluation: Tools like NIST md-eval, the Python library DScore, pyannote.metrics, and others calculate DER using RTTM files for hypotheses and references. Similarly, for calculating WER/STT, results can often be represented in RTTM and converted to CTM. Thus, a single RTTM annotation can serve as the source for different metrics.
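As an example of how little code generation takes, here is a minimal writer for diarization output in the SPEAKER ... <NA> <NA> pattern shown earlier (a sketch; the function name and tuple layout are ours):

```python
def write_speaker_rttm(path, file_id, turns, channel="1"):
    """Write diarization turns as 10-field SPEAKER lines.
    turns: iterable of (start_seconds, duration_seconds, speaker_id)."""
    with open(path, "w") as f:
        for tbeg, tdur, speaker in turns:
            f.write(f"SPEAKER {file_id} {channel} {tbeg:.3f} {tdur:.3f} "
                    f"<NA> <NA> {speaker} <NA> <NA>\n")

write_speaker_rttm("meeting_01.rttm", "meeting_01",
                   [(13.500, 4.200, "spk1"), (17.700, 2.100, "spk2")])
```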

Limitations and Nuances

  • RTTM does not explicitly link words to speakers: Because Name is left <NA> in LEXEME lines, associating a word with a speaker requires analyzing its time overlap with SPEAKER segments (as in the overlap-matching sketch in the examples section above). In reference data, where everything is accurately annotated, this is not an issue. However, for combined ASR + diarization hypotheses, you must merge two annotations: the diarization output (SPEAKER lines) and the transcription (words with timestamps). The format offers no explicit linkage key for this: the Name field technically exists in LEXEME lines, but the specification requires it to be <NA>, so the only link is time. This complicates tasks like evaluating who spoke what directly from one RTTM file. Some systems work around it by generating separate RTTM files for diarization and CTM files for words, or by violating the format and duplicating the speaker name in each word line.

  • Redundancy and fixed format: RTTM always has 10 fields, even if many of them are not used in a specific task. This results in many <NA> values in the files. The format is not like JSON, where insignificant keys can be omitted—it is strict about the positions. This increases the likelihood of mismatches (e.g., forgetting one <NA> and shifting columns). It also fixes a specific set of categories. Adding a new object type or extra information is difficult without breaking the standard. For instance, you cannot directly add a field like "emotion" or "speaker language"—you would either have to introduce a new type (which is not compatible with existing parsers), or encode it in one of the existing fields (which is not according to the specification).

  • Limited support for nesting or complex structures: RTTM is a flat list of events. It is not possible to explicitly reflect, for example, the structure of a dialogue (questions and answers) except by introducing special types in Stype (which is also static). There is no way to specify relationships between objects except by matching fields. Thus, RTTM is not a replacement for full XML/JSON when rich annotation structures are required; it is more of a list-based format.

  • Time agreement requirements: RTTM assumes that times within a single file are consistent. However, it does not store the explicit duration of the entire file. If a system outputs a segment that extends beyond the actual length of the audio, or overlapping segments for a single speaker, the format does not prohibit this, but such RTTM files are considered incorrect in terms of data. You must take care of consistency yourself, for example, ensuring that one speaker does not have overlapping segments (each file/channel has its own timeline, and one speaker should not speak at the same time as themselves). Checking such conditions is left to external scripts (e.g., validate_rttm.py in DScore warns about annotation rule violations); a simplified self-overlap check is sketched after this list.

  • Compatibility and variability in practice: Despite the existence of an official specification, variations are common in practice. As noted, some systems output 9 fields instead of 10—parsers must account for this (for example, DScore will accept 9 fields and automatically add the missing <NA>). Not all types are used in all cases: encountering SEGMENT or NO_RT_METADATA lines in real data is rare; UEM files are more often used for evaluation regions. Therefore, support for these fields in software may not always be complete. Integrating RTTM with other formats also requires conversion—for example, to calculate WER, RTTM is converted to CTM, and for human transcripts, RTTM is often derived from STM. This creates some barriers, and the format could be simpler, but this is how it has historically evolved.
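To illustrate the kind of consistency check mentioned above, here is a simplified self-overlap test in Python. It is our sketch of the idea, not the actual validate_rttm.py code:

```python
from collections import defaultdict

def find_self_overlaps(segments):
    """segments: (file, chnl, tbeg, tdur, speaker) tuples from SPEAKER lines.
    Returns cases where one speaker overlaps themselves on a file/channel."""
    by_key = defaultdict(list)
    for file_id, chnl, tbeg, tdur, speaker in segments:
        by_key[(file_id, chnl, speaker)].append((tbeg, tbeg + tdur))
    problems = []
    for key, spans in by_key.items():
        spans.sort()
        for (b1, e1), (b2, e2) in zip(spans, spans[1:]):
            if b2 < e1:  # next turn starts before the previous one ends
                problems.append((key, (b1, e1), (b2, e2)))
    return problems

print(find_self_overlaps([("f", "1", 0.0, 2.0, "A"), ("f", "1", 1.5, 1.0, "A")]))
```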

In summary, RTTM is a convenient and time-tested format for the unified representation of speech recognition and diarization results, but it is designed for a strict structure and is limited by the initial scope of RT.eval tasks. Nevertheless, thanks to open tools and scripts, most practical limitations can be easily circumvented.

Application of RTTM in modern tools and systems

The RTTM format is still widely used in research and industrial tools, especially those related to diarization. Below is a brief overview of how RTTM is integrated into some modern solutions:

  • Pyannote.audio – A popular library suite for diarization and audio analysis from HuggingFace/Hervé Bredin. It uses RTTM as the primary format for representing diarization hypotheses and references. Specifically, utilities from pyannote.metrics expect hypothesis and reference files to be in RTTM format for calculating DER, JER, and other metrics; the pyannote-metrics diarization tool accepts pairs of RTTM files (a usage sketch follows this list). Additionally, Pyannote can read RTTM files with multiple speakers and supports the entire NIST standard (10 fields, requiring <NA> where appropriate). There is even a GUI annotator (Interspeech 2024 demo) based on Gradio that allows for editing diarization and exporting it to RTTM.

  • NVIDIA NeMo – NVIDIA's toolkit for ASR and diarization. NeMo uses RTTM when preparing training data and when outputting diarization results. According to NVIDIA NeMo's documentation, for end-to-end diarization training, ground truth RTTM files with speaker segmentation are required. An example line provided by NeMo follows the standard: SPEAKER TS3012d.Mix-Headset 1 32.679 0.671 <NA> <NA> MTD046ID <NA> <NA> – This includes the 10 fields, channel 1, speaker name, and all unnecessary fields as <NA>. NeMo generates similar RTTM files during inference (e.g., for call diarization). Thus, NeMo ensures compatibility with common formats, allowing results to be directly fed into DScore or pyannote for evaluation.

  • DScore – An open set of Python scripts for diarization evaluation (DER, etc.), developed as a modern complement to the outdated md-eval.pl. DScore clearly describes the input file format on GitHub: “Rich Transcription Time Marked (RTTM) files are space-delimited text files containing one turn per line, each line containing ten fields...”. Essentially, the description repeats the specification: Type should be SPEAKER, and the remaining fields should be formatted correctly. DScore includes the script validate_rttm.py to check RTTM file compliance with the specification. The score.py script accepts lists of reference and hypothesis RTTM files and calculates DER. DScore has become the standard tool for diarization competitions (DIHARD, VoxSRC), so the RTTM format is mandatory for submissions – for example, in DIHARD, participants are required to submit diarization results in RTTM files.

  • Kaldi – While Kaldi does not usually use RTTM internally (preferring its own text segment files and utt2spk), it does include scripts in the egs recipes for generating RTTM. For example, in the CallHome diarization recipe, there is a make_rttm.py script that combines segment and label files into RTTM format. This script generates SPEAKER lines with fields like: recording-id, channel (fixed as 0 or 1), times, <NA> in orthography/type, speaker name, <NA> in conf/slat. Thus, even Kaldi, the de facto standard for ASR, supports exporting diarization to RTTM, which is convenient for subsequent use with standard scoring or visualization tools.

  • Other Tools and Formats: Many other packages ensure compatibility with RTTM. For instance, the ESPnet library (for ASR) contains an RTTM parser in the module espnet2.fileio.rttm. Third-party speech analysis scripts (e.g., for Voice Activity Detection, speaker turn detection) often also output RTTM to unify input/output formats. There are also widely used converters between RTTM and other formats: in addition to rttm2ctm, there are scripts for converting RTTM ↔ CSV, RTTM ↔ ELAN (for convenient GUI annotation), etc. Even if a system doesn't natively output RTTM, it is typically easy to write a converter because the format is very simple.

    💻
    For developers working with RTTM files, the RTTM Syntax HL extension for Visual Studio Code offers syntax highlighting. This enhancement simplifies the process of viewing and editing RTTM files, making it easier to work with speaker diarization, ASR, and audio annotation tasks.
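To show how little glue code this interoperability requires, here is a hedged sketch of scoring a diarization hypothesis with pyannote (assuming pyannote.database and pyannote.metrics are installed; the file names and the meeting_01 URI are placeholders):

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# load_rttm returns a dict mapping each file URI to a pyannote Annotation
reference = load_rttm("reference.rttm")["meeting_01"]
hypothesis = load_rttm("hypothesis.rttm")["meeting_01"]

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.3f}")
```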

In conclusion, RTTM remains a key format for representing diarization and transcription results. Its support is built into nearly all modern speech processing pipelines—ranging from dataset preparation to algorithm evaluation. The historical legacy (NIST RT Evaluations) has made RTTM a sort of "Latin" for audio annotation: it may seem a bit cumbersome, but almost every tool "understands" it. As a result, researchers and engineers can exchange speech results and metadata between systems without needing to develop a new format from scratch. The RTTM format, being technical and strict, has proven itself to be a reliable and comprehensive way to describe who spoke, when, what was said, and what was happening in the background — all within one coherent structure.
