To digitize vinyl records and cassette tapes, a turntable or a cassette deck is used, depending on the original format, along with an audio amplifier and a computer. The audio signal from the deck or turntable is routed to the computer through the amplifier. Recording software such as Audacity or WavePad captures the entire playback in a digital file. Depending on the condition of the cassette or record, the resulting file may require additional processing to remove noise and enhance sound quality. Once the audio has been processed, it is converted into a video file by adding an intro card specifically designed for this type of content. This conversion is automated with a script that calls FFmpeg. Finally, each file is manually extracted, catalogued, and tagged with the relevant information for inclusion in the corpus.
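The automated audio-to-video step can be illustrated with a minimal Python sketch. It is not the script used for the corpus: the function names, the file paths, and the choice of codecs and bitrate are assumptions made for illustration. It also assumes that the card is a still image displayed for the full length of the recording and that its dimensions are even, as the `yuv420p` pixel format requires.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(audio: Path, card: Path, output: Path) -> list[str]:
    """Build an FFmpeg argument list that pairs a still card with an audio track.

    Hypothetical helper: the codec and bitrate choices are illustrative only.
    """
    return [
        "ffmpeg",
        "-loop", "1",           # repeat the single card image as a video stream
        "-i", str(card),        # input 0: the card image
        "-i", str(audio),       # input 1: the digitized audio
        "-c:v", "libx264",      # encode the video stream as H.264
        "-tune", "stillimage",  # x264 tuning for static content
        "-pix_fmt", "yuv420p",  # widely supported pixel format
        "-c:a", "aac",          # encode the audio stream as AAC
        "-b:a", "192k",
        "-shortest",            # stop encoding when the audio ends
        str(output),
    ]


def convert(audio: Path, card: Path, output: Path) -> None:
    """Run FFmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(audio, card, output), check=True)
```

Calling `convert(Path("tape_side_a.wav"), Path("card.png"), Path("tape_side_a.mp4"))` would produce one video per recording, provided FFmpeg is installed and on the system path. Separating command construction from execution lets a batch script log or inspect each command before running it.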
Each video or audio file in the corpus is assigned a unique code that identifies it according to several criteria: the type of humorous text (mon=monologue; chis=joke; ske=sketch; ven=ventriloquist), the speaker’s sex (M=female, H=male), generation or age (young=1, adult=2, elderly=3), the speaker’s professional category (non-professional=0, professional=1), the province code, and the file’s number within the corpus. For example, the code 0171-CHI-BARH21 refers to a joke (number 0171 in Humcor) told by a male, adult, professional speaker from the province of Barcelona.
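The structure of these codes can be made explicit with a short parsing sketch in Python. The field layout is inferred from the single example 0171-CHI-BARH21 (number, text type, then province, sex, age, and professional category run together), so it is an assumption rather than a documented specification. In particular, the sketch assumes that the province code consists of uppercase letters, and it accepts both `chis` (as in the key) and `CHI` (as in the example) for jokes. The names `FileCode` and `parse_code` are hypothetical.

```python
import re
from dataclasses import dataclass

# Layout assumed from the example code: NUMBER-TYPE-PROVINCE+SEX+AGE+PROFESSIONAL
CODE_PATTERN = re.compile(r"(\d+)-([A-Za-z]+)-([A-Z]+)([MH])([123])([01])")

TEXT_TYPES = {
    "mon": "monologue",
    "chis": "joke",
    "chi": "joke",  # form used in the example code (assumed equivalent to "chis")
    "ske": "sketch",
    "ven": "ventriloquist",
}
SEXES = {"M": "female", "H": "male"}
AGES = {"1": "young", "2": "adult", "3": "elderly"}


@dataclass
class FileCode:
    number: int
    text_type: str
    province: str
    sex: str
    age: str
    professional: bool


def parse_code(code: str) -> FileCode:
    """Split a corpus file code into its labelled components."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized code format: {code!r}")
    number, text_type, province, sex, age, professional = match.groups()
    type_key = text_type.lower()
    if type_key not in TEXT_TYPES:
        raise ValueError(f"unknown text type: {text_type!r}")
    return FileCode(
        number=int(number),
        text_type=TEXT_TYPES[type_key],
        province=province,
        sex=SEXES[sex],
        age=AGES[age],
        professional=(professional == "1"),
    )
```

Under these assumptions, `parse_code("0171-CHI-BARH21")` yields a joke with number 171, province code `BAR`, and a male, adult, professional speaker, which matches the reading given above.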
The transcriptions follow the general rules of Spanish spelling, except that capital letters are reserved exclusively for proper names. For encoding, a minimal markup system based on Standard Generalized Markup Language (SGML) is used, following the specifications of the Text Encoding Initiative (TEI). Conventional punctuation marks are not used; instead, specific tags indicate pauses of varying durations.