AAWM Journal 13/No. 2 (2025)

What is a “Note”? Agreement and Disagreement in Transcriptions of Japanese Folk Songs

Gakuto Chiba, Yuto Ozaki, and Patrick E. Savage

Ethnomusicology; Japanese folk music; Music cognition; Cultural evolution

WHAT is a musical “note”? Previous analysis of cross-cultural transcriptions using Western staff notation suggests that disagreement about how to conceptualize a “note” is the main factor limiting transcription consistency and reliability. In min’yō (Japanese folk songs), melodies are traditionally highly ornamented and taught via oral transmission without written notation. Min’yō thus represents a good case to investigate agreement and disagreement among listeners in how musical “notes” are conceptualized and perceived within an orally transmitted musical culture. In this report, we focus on differences in transcription of Japanese folk songs by Japanese listeners with and without experience in singing the songs. We used Tony and Sonic Visualiser to automatically extract fundamental frequencies without constraining pitches to equal temperament or forcing note onsets into a fixed metrical structure. Two of us (first and second authors) independently segmented nine audio recordings of min’yō into what we perceived as notes and then compared our note segmentations against each other using quantitative and qualitative analysis. The recordings consisted of the first verse of three songs – Kuroda-bushi, Esashi-Oiwake and Yagi-bushi – in three different versions, totaling approximately 12 minutes. Our analyses suggest agreement but with substantial variation (57–96%) depending on the song/performer/agreement metric. They also indicate that different conceptualizations of “note” can lead to contrasting segmentation results. In particular, the first author (an experienced Japanese folk singer) showed higher sensitivity to microrhythm and microtonality, while the second author segmented into longer acoustic units using continuous pitch and linked acoustic units to the mora-like unit as seen in the notation of Japanese vocal music. Our results reveal within-culture variation in music perception, with implications for cross-cultural analysis of music based on transcription into musical “notes.”

Gakuto Chiba is a PhD candidate at the Graduate School of Media and Governance, Keio University Shonan Fujisawa Campus (gakuto.shamisen1207@keio.jp).

Yuto Ozaki is a senior researcher at the Keio Research Institute of Keio University Shonan Fujisawa Campus.

Patrick E. Savage is the director of the CompMusic Lab for comparative and computational musicology. He is a Senior Research Fellow in the School of Psychology at the University of Auckland / Waipapa Taumata Rau and an Associate Professor in the Faculty of Environment and Information Studies at Keio University Shonan Fujisawa Campus (psavage@keio.jp).


Introduction

NOTES are discretizations of musical phenomena that facilitate performance, comprehension, and analysis, and are often considered the building blocks of music (Nattiez 1990). They are represented visually using musical notation, and in Western staff notation, notes are defined in terms of their temporal sequence, pitch, and duration. But is that interpretation applicable to music outside the Western tradition? In Japanese folk songs, the concept of musical notes loosely corresponds to a term called “Fushi,” which originally denoted segments of pitch (melody) and duration (rhythm). It also refers to the entire melodic structure of the song itself. While most Japanese folk songs are transmitted orally without notation, “Esashi-Oiwake” is unusual in having an official notated score (Figure 1). It is based on the idea of a Western staff, but the notation is closer to a sonogram and doesn’t divide the melody into discrete “notes” in terms of specific pitch categories and durational values; instead it shows one continuous line per phrase (“Fushi”) with lots of continuous microvariation. This example demonstrates how musical notes can be conceived in different forms in different cultures (and potentially even by different individuals within the same culture). Our transcriptions may reflect such intra- and inter-cultural differences in music perception and cognition.

Figure 1. The staff notation of Esashi-Oiwake, a min’yō from Hokkaido in Japan (http://esashi-oiwake.com/utaikata). A continuous line in each section represents the melody of the song. Eight types of microvariations are used (as indicated at the bottom of the staff notation).

[2] Although Figure 1 is inspired by Western staff notation, it does not indicate specific pitches; instead, the rises and falls of pitch are shown by the upward and downward contours of the line. Additionally, the lyrics are written in Japanese within parentheses, but the timing as to when each word should be sung cannot be determined definitively from the staff notation itself.

[3] The methods of transcription and analysis within ethnomusicology, particularly the validity of using a “prescriptive” notation practice from one tradition (e.g., Western staff notation) as a “descriptive” notation for others, have been extensively discussed (Seeger 1958; Nettl & Bohlman 1991; Nettl 2015; Savage 2022). Despite the existence of various notation practices outside the Western sphere (Kaufmann 1967; Vandor 1975; Church 2015), most of the notations used in previous studies have been based on Western convention. This has even been referred to as a “universal notation” (Rastall 1983). Such applications of Western staff notation arguably lead to a distorted evaluation of music (Herzog 1964; Kolinski 1973). Although the development of new notation practices to accurately transcribe and describe musical processes outside the Western cultural sphere has been proposed (McCollester 1960; Koetting 1970; Reid 1977; Korde 1985; Killick 2020), these new practices often sacrifice readability and accuracy (Knust 1959).

[4] Additionally, musical transcription may take a large variety of forms, depending on analytic goals and the analyzed musical context (List 1963; England 1964; Herzog 1964; List 1974; Vandor 1975; Stanyek 2014). An early study of the commonalities and discrepancies between transcriptions of the same piece by several experts was made at a 1963 symposium on transcription. Four leading ethnomusicologists independently transcribed a single recording (a Hukwe song with musical bow) on Western staves (England 1964), resulting in “four rather different transcriptions” (Nettl 2015, 82). In more recent years, analysis of the degree of agreement in cross-cultural transcription using Western staves has suggested that disagreement about how to conceptualize a “note” is the main factor limiting transcription agreement and reliability (Ozaki et al. 2021).

[5] Manual transcription by ear is time-consuming and subjective (List 1974; England 1964). The technology for automatically generating spectrograms was developed (Metfessel 1928; Seeger 1957; Jairazbhoy and Balyoz 1977), and pitch traces were used to visualize musical sounds in lieu of abstract Western staff notation (Moore 1974). However, the reliability of automated transcription for vocal music, especially music from outside the Western tradition, remains low (Benetos et al. 2019; Müller et al. 2019; Rafii et al. 2018; Holzapfel & Benetos 2019; Ozaki et al. 2021).

[6] In a recent study, Ozaki et al. (2021) found that transcriptions by three expert raters trained in Western staff notation showed substantial agreement (~90%; κ ~.7) on a global sample of 32 traditional song excerpts. Automated methods, on the other hand, showed poor agreement (<60%; κ <.4). However, the three raters were not experts in the musical traditions they were transcribing. Therefore, we do not know how accurately their manual transcriptions captured the conceptions of the musical cultures they were analyzing. Indeed, it has been pointed out that manual transcription by those with knowledge of the tradition may be superior to automated transcription:

There is little doubt that automatic transcriptions, with their detailed, external view of music, will eventually help us to understand some of the physiological and cultural processes of [hu]man[s]. However, when the subject of study is concerned with the psychological or communicational aspects of music within a culture, aural transcriptions by a trained ethnomusicologist who has steeped [themself] in that culture may well be far more meaningful (Jairazbhoy 1977, 270).

[7] The current study was carried out by one of the five teams (Japanese, Chinese, Russian, Alpine, Romaniote Jewish) forming part of the VocalNotes project. This project aims to investigate cross-cultural patterns in how transcribers conceive of notes in vocal performances. To address the limitations identified above, it studies agreements and disagreements in transcriptions of music from cultures in which we ourselves are experts (Proutskova et al. 2026). Our team focused on Japanese folk songs, whose melodies are traditionally highly ornamented and taught via oral transmission without written notation (Hughes 2008; Chiba and Savage 2023).

[8] We utilized sound graph technology to visualize the music of our specialized cultural area instead of Western staff notation. This was supplemented by manual transcription to achieve more concrete and accurate descriptions. Additionally, we compared transcriptions of Japanese listeners with and without experience singing Japanese folk songs. The first author, Gakuto Chiba (hereinafter referred to as “GC”) had singing experience, while the second author, Yuto Ozaki (hereinafter referred to as “YO”) did not.^[1] We aimed to quantitatively and qualitatively analyze how transcribers conceptualized notes and what factors resulted in agreement or disagreement between transcriptions of Japanese folk songs. This unique dataset provided a complex interplay of recording age, sound quality, singer, place of origin, and existence of musical scores.

Methods

a) Dataset

[9] Japanese folk songs exist throughout the country and have been extensively documented, particularly through national broadcasting company NHK’s massive Min’yō Taikan (“Japanese folk song anthology”) project, which produced over 6,000 audio recordings over several decades during the mid-20th century (Machida 1944–1994; Savage et al. 2022). However, in this study, to balance this large data volume with the manual transcription time, we restricted ourselves to about 10 minutes of recordings, following the VocalNotes protocol. In order to take differences in Japanese folk song performance across time and space into account, we made selections from the Global Jukebox, hereinafter referred to as “GJB” (Wood et al. 2022). Specifically, three min’yō were selected: Esashi-Oiwake, a fishermen’s song from the northernmost island of Hokkaido; Yagi-bushi, a festival song from Tochigi Prefecture in central Japan; and Kuroda-bushi, a ballad from Fukuoka Prefecture in the southwestern island of Kyushu (Figure 2). These songs were recorded over 80 years ago in relatively distant parts of Japan and are still sung today to the extent that our two coauthors with expertise in the tradition already knew how to sing them.

[10] We used three versions of each min’yō, performed by three different singers:

The Global Jukebox (GJB): GJB is a free, public database containing almost 6,000 audio recordings of traditional songs from almost 1,000 societies around the world, coded for standardized features of performance style using the “Cantometrics” classification scheme (Wood et al. 2022). The three min’yō recordings used here all originally come from the Columbia World Library of Folk and Primitive Music (Lomax 1964). Each of the three GJB folk songs was recorded in 1943 by a folk song specialist in each region, but no records of the singers’ identity have survived;
Gakuto Chiba (GC): The first author, who has had experience with min’yō since the age of four; and
Patrick E. Savage (PES): The last author, who has been singing min’yō since 2011 (Table 1).

[11] GC and PES both made their recordings during 2021. The lyrics can change depending on the singer, but GC and PES matched the lyrics with one another (though not necessarily with the GJB lyrics). Min’yō do not have the concept of a fixed tempo, so the songs never have the same duration. The total duration of the nine audio recordings (i.e., three different versions of three min’yō) was approximately 12 minutes and 30 seconds.

Figure 2. The geographical origin of the three folk songs, Esashi-Oiwake, Yagi-bushi, and Kuroda-bushi, shown on a map of Japan.

Table 1. Overview of the nine audio recordings (three different versions of three min’yō). Year of recording and singer identity are specified.

b) Transcribers

[12] The first author (GC) and the second author (YO) transcribed the nine audio recordings. GC has been learning Japanese folk songs for many years, whereas YO has not performed them (though he has listened to them). YO has sung Western classical choir music for more than 10 years as a member of an amateur choir led by professional conductors and accompanied by professional orchestras. He has also received lessons on the classical guitar for around five years. GC started learning the electone (electronic piano) at the age of four and received musical education from his parents, who were both music teachers (his mother specialized in piano and was a choral instructor; his father specialized in percussion and was a wind band instructor). GC served as a choral conductor from junior high school to high school, winning gold medals in chorus competitions for five years in a row.

c) Segmentation

[13] GC and YO used Tony (Mauch et al. 2015) to automatically extract fundamental frequency curves and segment them into notes without constraining pitches to equal temperament or forcing note onsets into a metrical grid. We then independently corrected these automated segmentations into what we perceived as notes. Specifically, based on audio recordings and supplementary visualized audio waveforms and F0 traces, the lengths and onset positions of notes (gray boxes in Figure 3; light blue boxes in the actual Tony) were manually adjusted, and they were cut, added, or deleted as necessary. Furthermore, these segmented notes and the automatic pitch curves were exported from Tony and imported into Sonic Visualiser (Cannam et al. 2010). Here, pitch was manually adjusted (Proutskova et al. 2026). The order of the songs to be transcribed was left to the discretion of the transcribers, who were allowed to go back and forth repeatedly among the nine audio recordings to make adjustments until they were satisfied with the results.

d) Qualitative analysis

[14] Sonic Visualiser was used to visualize the segmentation matches and discrepancies between the two transcribers (cf. Proutskova et al. 2026). We subjectively considered segmented notes to be mismatched (shown in red in Figs. 3 and S1) in two cases: if the overlapping time, calculated by subtracting the earliest start time from the latest end time among the two split notes, was 5.2% or less of the total time; or if there were multiple segmented notes in one transcription and a single segmented note in the other. In the latter case, only the segmented note with the longest temporal overlap was deemed to be in agreement.
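
The matching rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the exact procedure behind Figs. 3 and S1, where agreement was judged by inspection. It assumes that the 5.2% threshold applies to the time two notes share, divided by the span from the earlier onset to the later offset, and that the many-to-one case is resolved by keeping only pairs that are each other's longest-overlap partner. The names `Note`, `shared_time`, `overlap_ratio`, `best_partner`, and `match_notes` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Note:
    onset: float   # seconds
    offset: float  # seconds


def shared_time(a: Note, b: Note) -> float:
    """Time (in seconds) during which both notes sound; 0 if they do not overlap."""
    return max(0.0, min(a.offset, b.offset) - max(a.onset, b.onset))


def overlap_ratio(a: Note, b: Note) -> float:
    """Shared time divided by the span from earliest onset to latest offset (assumption)."""
    span = max(a.offset, b.offset) - min(a.onset, b.onset)
    return shared_time(a, b) / span if span > 0 else 0.0


def best_partner(note: Note, others: List[Note]) -> Optional[int]:
    """Index of the note in `others` with the longest shared time, or None if no overlap."""
    best, best_shared = None, 0.0
    for idx, other in enumerate(others):
        s = shared_time(note, other)
        if s > best_shared:
            best, best_shared = idx, s
    return best


def match_notes(seq1: List[Note], seq2: List[Note],
                threshold: float = 0.052) -> List[Tuple[int, int]]:
    """Return (index in seq1, index in seq2) pairs judged to be in agreement."""
    matches = []
    for i, a in enumerate(seq1):
        j = best_partner(a, seq2)
        if j is None:
            continue  # no overlapping note in the other transcription
        if best_partner(seq2[j], seq1) != i:
            continue  # many-to-one case: only the longest overlap counts
        if overlap_ratio(a, seq2[j]) > threshold:
            matches.append((i, j))
    return matches
```

Under these assumptions, a long note in one transcription that spans three short notes in the other yields a single agreeing pair (the short note with the longest overlap), and the two remaining short notes count as mismatched.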

Figure 3. An example of determining agreements and disagreements between segmented notes with a significant temporal overlap (a) and segmented notes that do not overlap clearly (b). This is an excerpt of Esashi-Oiwake sung by PES, transcribed by GC (above) and YO (below). The x-axis represents the time of the song (in seconds), and the y-axis represents the pitch (in Hz); since the absolute pitch values are not important in this study, all pitch labels have been removed. The black dots forming a wavy line indicate the automated pitch curve estimation, while the gray boxes represent individual notes manually segmented by the transcribers.

e) Quantitative analysis

[15] To reveal the extent to which the manually performed segmentations of GC and YO were consistent, we adopted the simplest method for comparing pairs of sequences. Specifically, we calculated their percent identity (PID). PID corresponds to the number of segmented and aligned notes that are identical (ID; i.e., judged to be in agreement) divided by the sequence length (L) as specified in the following equation:

$$\mathrm{PID} = 100 \times \frac{\mathrm{ID}}{L}$$

[16] We also calculated the percentage of temporal agreement between the two segmentations, considering the variation in the durations of the segmented notes. Specifically, we subtracted the sum of the durations of the mismatched note segmentations in GC’s and YO’s transcriptions from the overall duration of the song, divided it by the overall duration, and multiplied by 100.
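
The two agreement measures described above can be sketched in Python as follows. This is an illustration rather than the analysis code used for Table 2. The text defines PID only in terms of a single sequence length L, so the sketch assumes that L is the mean of the two transcribers' sequence lengths (L1 and L2). It also assumes that the mismatched durations passed in do not double-count the same stretch of time. The function names and the example numbers in the usage note are hypothetical.

```python
from typing import Iterable


def percent_identity(n_identical: int, len1: int, len2: int) -> float:
    """PID = 100 * ID / L, with L assumed to be the mean of L1 and L2."""
    mean_length = (len1 + len2) / 2
    if mean_length == 0:
        raise ValueError("both sequences are empty")
    return 100.0 * n_identical / mean_length


def percent_of_time(total_time: float, mismatched_durations: Iterable[float]) -> float:
    """100 * (total time - summed mismatched durations) / total time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    matched_time = total_time - sum(mismatched_durations)
    return 100.0 * matched_time / total_time
```

For example, with hypothetical values ID = 3, L1 = 5, and L2 = 3, `percent_identity(3, 5, 3)` gives 75.0; a 10-second excerpt with mismatched segments lasting 0.5, 1.0, and 0.5 seconds gives `percent_of_time(10.0, [0.5, 1.0, 0.5])` = 80.0. Because percent of time weights each note by its duration, disagreements confined to short notes lower it less than they lower PID.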

Results

a) Degree of transcription agreement

[17] In our quantitative analysis of both percent identity (PID) and percent of time, GC’s and YO’s transcriptions showed general agreement for two of the songs (i.e., 75–96% agreement for Kuroda-bushi and Yagi-bushi), but lower agreement for Esashi-Oiwake (57–86% agreement; Table 2). Comparing percent identity and percent of time, except for Yagi-bushi, percent of time shows a slightly higher agreement. This suggests that disagreement increases near shorter, more “ornamental” notes. In terms of sequence length, like the qualitative analysis (Figure S1), the number of GC’s segmented notes (L1) is greater than the number of YO’s segmented notes (L2) in all nine audio recordings, indicating that GC segments notes in a more fine-grained manner, whereas YO segments notes in a way that allows for greater pitch variation within each segment. Tony, however, requires a note to have a single fixed pitch. Therefore, while Tony’s note representation captures YO’s sense of rhythmic units, it does not reflect his pitch perception.

Table 2. The agreement rate of manual segmentations by Gakuto Chiba (GC) and Yuto Ozaki (YO) across the nine recordings calculated from percent identity (PID) and percent of time. Sequence length 1 (L1) refers to the number of notes segmented and aligned by GC, while sequence length 2 (L2) refers to the number of notes segmented and aligned by YO. Identity (ID) refers to the number of notes for which the pairs of arrays in L1 and L2 are judged to match. Total time refers to the overall duration of a song from the first note to the last note. Duration indicates the time during which the arrays segmented by GC and YO match within the total time, and the calculated proportion is represented by percent of time.

b) Factors causing disagreement

Ornamental sounds and calls in Esashi-Oiwake

[18] Both the qualitative and the quantitative analyses show that the transcriptions of Esashi-Oiwake are overwhelmingly more inconsistent than the other two songs. We believe this is due to the fact that its melodies are traditionally highly ornamented, and it is the only one of the three songs that includes Ohayashi.^[2]

[19] We selected four highly ornamental “Setsudo^[3],” “Honsukuri^[4],” “Hansukuri^[5],” and “Sukui^[6]” passages from the musical score of Esashi-Oiwake and colored our segmented notes corresponding to them (Figure 4). It can be seen that the ornaments are perceived as separate musical notes only in GC’s transcriptions, whereas they are typically integrated into a single note in YO’s transcriptions. Furthermore, what stands out is that when GC transcribes his own performances, he clearly separates ornaments as distinct notes, whereas in his transcriptions of performances by PES and GJB (performances other than his own), ornaments are integrated with other notes.

[20] In the Esashi-Oiwake recordings used in this study, the GJB performances represent the formal rendition of the song, including accompaniment by shakuhachi and Ohayashi. However, GC’s and PES’s recordings were made in their respective homes, and there was no third party to provide calls. In these instances, some individuals, such as GC, imagine both the accompaniment and the Ohayashi, singing only the lyrics. Others, like PES, focus solely on the accompaniment and perform the Ohayashi while singing on their own, ensuring that pauses and the song’s flow are not disrupted. In all performance forms, Ohayashi serves solely as a signal or call to support the song and must be distinguished from the lyrics. However, in the case of GJB’s and PES’s rendition of Esashi-Oiwake, YO divides the Ohayashi (Soi calls), represented by light blue squares, into notes. This indicates that the segmentation of notes is influenced by his knowledge of Japanese folk songs. This could be one of the factors that contributed to the relatively large number of transcriptional discrepancies in Esashi-Oiwake.

Figure 4. Excerpts from Gakuto Chiba’s (GC) and Yuto Ozaki’s (YO) transcriptions of Esashi-Oiwake (sung by GJB, GC, and Patrick E. Savage). Based on the indication of ornaments in Esashi-Oiwake, segmented notes are divided into Setsudo (Purple), Honsukuri (Orange), Hansukuri (Yellow), Sukui (Green), and others (Red). The light blue squares are the Ohayashi (also called the “Soi call”) in Esashi-Oiwake. See Figure 3 for further details.

Differences in musical experience

[21] There is an interesting passage in our transcription of the three Japanese folk songs where YO has combined some of the finely segmented sounds identified by GC into a single note (Figure 5). While all sounds are sung in one breath with a sustained vowel, the F0 trace shows that the vowel exhibits pitch changes or ornamentation. Furthermore, when comparing the vowels of the three excerpted songs, the vowels in Esashi-Oiwake and Yagi-bushi are melismatic, whereas those in Kuroda-bushi exhibit characteristics reminiscent of a glissando. In fact, YO does recognize the pitch changes that are represented in GC’s transcriptions. However, because Tony requires a single fixed pitch for each note, as explained earlier, we believe that Tony’s notation does not adequately capture YO’s pitch perception, although our current methodology cannot prove this. How to segment melismas and glissandos with pitch changes and ornamentation as notes may reflect GC’s and YO’s different experience levels.

Figure 5. Excerpts from Gakuto Chiba’s (GC) and Yuto Ozaki’s (YO) transcriptions of Esashi-Oiwake and Kuroda-bushi (sung by an unknown singer from the Global Jukebox), and of Yagi-bushi (sung by Patrick E. Savage). See Figure 3 for further details.

Sound quality and instrumental sound

[22] The transcriptions of the three audio recordings from GJB show more disagreement between the two transcribers than those of the performances by GC and PES (Table 2). This may arise from the poor sound quality of the recordings from 1943, from changes in folk songs over the past 80 years, from difficulties for those of us in our 20s and 30s in interpreting them, and/or from differences in complexity from one folk song to another. In any case, the agreement rate for Yagi-bushi is higher than for the other two. Furthermore, there is almost no difference in the number of segmentations between GC and YO for Yagi-bushi as sung by GJB (L1 = 106 and L2 = 105). YO misinterpreted a taiko (drum) sound as a single note (Figure 6), but this mistake may only have had a marginal effect on the agreement calculation.

Figure 6. Excerpts from Gakuto Chiba’s (GC) and Yuto Ozaki’s (YO) transcriptions of Yagi-bushi (sung by an unknown singer). The light blue squares represent the Taiko (drum) sounds in Yagi-bushi. See Figure 3 for further details.

Discussion

[23] Our study investigated agreement and disagreement as to how musical “notes” were conceptualized and perceived in an orally transmitted musical culture. We focused on differences in transcription of Japanese folk songs by two Japanese listeners with and without experience in singing them. Although there were variations among songs and singers, Gakuto Chiba’s (GC) and Yuto Ozaki’s (YO) transcriptions suggested some general agreement. On the other hand, disagreement was mainly indicated by the difference in conceptualization of a “note,” as discussed in detail later. There were other types of disagreements such as the misclassification of the musical instruments and Ohayashi (i.e., signal calls) that potentially stemmed from differences in musical experience within the genre of min’yō. These factors, however, had only a minor influence on the agreement evaluation compared to the differences in conceptualization of a “note.” Our results indicated within-culture variation in music perception and representation.

a) Ground-truth perception

[24] Most of the materials used in our research were performed by two of the authors, GC and Patrick E. Savage (PES). One reason for this was a simple curiosity about how the performances of the authors, who are also musicians, would be transcribed by the transcribers. One of the authors (GC) transcribed his own performance. Generally, performers are believed to have the best understanding of their own performances. Therefore, the decision to transcribe one’s own performance was based on the belief that it would lead to a more accurate “ground-truth” perception of the musical notes. In fact, GC mentioned that transcribing his own performance took the least amount of time in that this recording was more easily segmented into musical notes compared to the others.

b) Limitations on generality

Sampling biases

[25] While our method of involving authors both as performers and transcribers has a number of strengths, it also entails significant limitations. Due to the amount of time required for manual transcription and the need for consistency with other VocalNotes datasets, we chose to limit our study to nine recordings of Japanese folk songs (three versions of each of three songs) transcribed by two people. In the future, other research designs that consider further generalization may be necessary. Specifically, it would be desirable to involve more transcribers with different levels of experience (e.g., different degrees of experience in singing or performing min’yō) and to simplify (or automate) the manual transcription process, enabling more diverse samples and larger sample sizes.

Historical changes in performance practice

[26] In our dataset, in addition to the performances by the two authors, GC and PES, we used folk songs recorded over 80 years ago from the Global Jukebox (GJB) (Wood et al. 2022). Nearly a century later, these same songs contain lyrics and performance expressions that are not very familiar. Additionally, the recording technology at that time was different, resulting in poorer sound quality. We felt that interpreting the notes is extremely difficult for anyone who did not live during the relevant recording era, even if they are experienced with folk songs. In fact, our performances do not perfectly replicate the lyrics and performance expressions from GJB, but rather are performed in a manner familiar to modern practice. While the tonal center and lyrics are aligned as much as possible in our performance, they do differ from those in GJB. Therefore, the accuracy of folk song transcriptions may be influenced by the historical context in which the music was transcribed (folk song experience) and originally recorded.

The order of transcribing the pieces

[27] In our experiment, the order of transcribing the pieces was not prescribed, but was left to each transcriber. However, considering the possibility of increasing the number of transcribers and samples in the future, it could be useful to randomize the order of transcribing the pieces to avoid potential order effects.

Degree of exposure to Japanese folk songs

[28] Transcriber YO has neither performed nor learned Japanese folk songs, but having collaborated on folk song research in the same seminar and on the same campus, he has probably listened to more folk songs than the average Japanese person. It is possible that if a Japanese listener without experience in folk songs had never heard the folk songs in the dataset, the degree of agreement and disagreement in the transcriptions might have changed.

c) Commonalities in transcriber backgrounds

[29] Both transcribers, GC and YO, coincidentally had experience with choral music from the Western tradition. Naturally, there are differences in singing style and expression between Japanese folk songs and Western choral pieces, but considering that we share some level of experience with both, it would not be surprising if we share a common understanding in interpreting musical notes. This aspect may also influence the degree of agreement across transcriptions.

d) The existence of staff notation of Japanese folk songs

[30] The existence of staff notation is generally seen as a culturally accepted and agreed-upon standard, leading to more consistent interpretations, even for beginners (Knust 1959; Hopkins 1966). However, the Esashi-Oiwake staff notation is primarily intended to be prescriptive (i.e., to ensure that performers know how and when to add micro-ornamentation) rather than descriptive (i.e., it is not intended to assist listeners in transcribing performances). GC (and PES) have learned to sing from the prescribed notation, but YO had not, and thus could not initially understand which part of the performance corresponds to which part of the notation. Thus, the existence of the staff notation did not affect his transcription. Additionally, there are various reasons for creating staff notation, one of which is the difficulty in performing (understanding) the piece. In this case, Esashi-Oiwake is often considered to be the most complicated and difficult Japanese folk song and part of the reason for creating the staff notation was to preserve its complex microtonality (Hughes 2008). Therefore, despite Esashi-Oiwake being the only one of the three songs with an existing staff notation, we believe its complexity and difficulty explain why the transcriptions of Esashi-Oiwake by the two individuals were the least consistent of the three songs.

e) Who decides what is a note?

[31] Regarding whether the note segmentations by GC or YO were appropriate, we discussed them among the three of us and the other VocalNotes members. Everyone’s opinions differed. Here we each express our own opinion:

GC

[32] I believe that my segmentation of notes is appropriate. The primary reason for this confidence is my extensive experience learning Japanese folk songs orally from my teacher, Kōzan Kanno. This has deeply informed my understanding of these songs. As mentioned at the beginning, staff notation is generally not used in the acquisition of Japanese folk songs. My teacher, in particular, emphasized learning folk songs through auditory means rather than visual aids. Consequently, I did not use sheet music to learn Esashi-Oiwake; instead, I learned entirely from my teacher’s guidance and recordings of professional folk singers. This experience not only exposed me to the challenges of learning folk songs but also trained me to be attentive to even the slightest nuances in sound. Here, I will describe the musical factors and judgments that determined my note segmentation in this study. First, the elements of phrases and Ma^[7] are crucial in shaping the overall rhythm of the piece and can be considered the most important musical expressions that define Japanese folk songs. I determined these based on acoustic features such as changes in relative loudness, length of resonance, and presence of breaths. Second, as to the changes in syllables, while Japanese folk songs have short lyrics, each syllable contains a wealth of musical information and meaning. The slow tempo further makes them easily distinguishable. However, when transitions to the next syllable are rapid or when vowels are continuous, it becomes difficult to differentiate them as separate notes, and multiple syllables are judged as a single note. Third, while Japanese folk songs lack fixed pitches like Western music, there are differences in the amount of pitch variation. However, for the unique Japanese vibrato known as Kobushi, I judged it as a single note due to its oscillation around a specific pitch. 
In this way, I believe that my experience and knowledge from learning folk songs have made me sensitive to subtle variations in all folk songs, not just the ornamentation in Esashi-Oiwake. This allows me to recognize each variation as a distinct note. On the other hand, I do not entirely discount YO’s interpretation of dividing segments as single acoustic units. In Japanese folk music, there is a term called Fushi Mawashi, which refers to the movement of a single line, akin to a sonogram in the notation of Esashi-Oiwake. Additionally, there is the term Nagare, which signifies the smoothness of transitions and developments in a song. Both of these are highly valued in the Japanese folk music community. Furthermore, in my lessons, my teacher often emphasized the importance of cohesion by stating: “Areas where you struggle to sing well may indicate issues in the sections before and after.” However, I do not believe that YO’s acoustic units necessarily align with my own acoustic units.

YO

[33] Unlike GC and PES, I have never learned Japanese folk songs, so my concept of “notes” for Japanese min’yō is largely influenced by my own interpretation of the notation systems used for Esashi-Oiwake and kodai-kayō of gagaku, which resemble staffless neumes. The notation system of kodai-kayō of gagaku is called hakase, and it is arguably one of the earliest notation systems developed for Japanese vocal music, as gagaku scores already existed in 747 CE (Tanaka 2021). Hakase is also used in Buddhist texts to indicate pitch contours in chanting. Similar to staffless neumes, hakase is a more text-oriented notation style in which the contour of pitch is assigned to individual words or morae. This notation system might suggest that pitch is regarded as continuous, and that linguistic units determine the segmentation of vocalization. Although several notation systems have been developed for different genres of Japanese vocal music (e.g., goma of heikyoku), min’yō has not adopted any notation system except for Esashi-Oiwake. However, even in the case of Esashi-Oiwake, it is evident from Figure 1 that the representation devised by Esashi-Oiwake singers does not attempt to segment vocalization by changes in pitch, beats, or isochronous rhythm. This approach is already apparent in earlier versions of Esashi-Oiwake scores (Tanabe 1967). This may suggest that Esashi-Oiwake singers have traditionally symbolized and segmented the vocalization of Esashi-Oiwake from a viewpoint very different from Western staff notation. Given these historical perspectives, I considered the acoustic units interpreted in the realm of Japanese vocal music to be characterized by the continuity of vocalization. I applied this concept to Tony/Sonic Visualiser’s segmentation. As a result, a single unit often encompassed multiple discretizable pitches (in the sense of Western staff notation), and segment boundaries tended to coincide with changes in linguistic unit.
However, I would like to note that Tony cannot express a change or glide in pitch within a segment. Because of this limitation, Tony’s playback of my segmentation does not accurately reflect my pitch perception. This explains why the agreement with GC’s transcriptions increases as the song becomes less melismatic: GC’s transcriptions are more descriptive of changes in pitch, and in less melismatic songs changes in pitch and changes in words/syllables/morae become more synchronized. Although my interpretation and GC’s interpretation of note and segmentation are quite different, especially in the context of performance, I agree that the detailed segmentation seen in GC’s transcription can be more practical because it allows us to follow more precisely the acoustic phenomena occurring during the performance. Furthermore, discrete pitch contours are more memorable than gliding pitch contours (Haiduk et al. 2020). Thus, chunking acoustic sounds into discretizable pitch units would be a more cognitively appropriate representation for performers who need to remember a large repertoire, especially if performers are expected to follow pitch contours accurately.

PES

[34] Unlike GC and YO, I am not ethnically Japanese, and did not learn Japanese as my first language or grow up in Japan. However, unlike YO, I have spent years learning Japanese folk songs under my teacher, Shūsei Ogita, through oral tradition, generally recording and imitating my teacher’s singing without the use of notation. Interestingly, while both GC and I learned to sing Esashi-Oiwake orally from our teachers, I also took separate lessons specifically for Esashi-Oiwake with a different, specifically authorized teacher, Tatsuo Matsunaga, in order to compete in the national competition. For this competition, adhering precisely to the notated score – including the exact type and number of ornamentations – was heavily emphasized by my teacher and the judges (cf. Figure 7). So, while I can understand YO’s argument about why he chose not to segment fine-grained ornamentation in Esashi-Oiwake and other songs based on historical traditions such as hakase notation, I am confident that GC’s segmentation of notes is a more accurate representation of the way that notes are conceptualized by contemporary min’yō performers and judges – at least for Esashi-Oiwake. At the same time, I want to point out that the very idea that there should be a prescriptive notation for a folk song (the melody of which itself evolved from other melodies through a process of oral transmission and variation; cf. Machida & Takeuchi 1965; Savage 2020) represents the adoption of values about notation from Western classical music culture into a tradition that formerly did not share these values (Hughes 2008).

Figure 7. PES’s specialist Esashi-Oiwake teacher, Tatsuo Matsunaga, uses a pointer to follow along the official notation as PES sings to ensure he follows each prescribed micro-ornamental note. PES’s daughter listens carefully. Photo by PES (June 10, 2014).

VocalNotes members

[35] The other teams involved in the VocalNotes project argued that, although GC and YO have different experiences with Japanese folk songs, both are experts rooted in Japanese music, so both segmentations are appropriate and there is no need to determine which interpretation is correct. In fact, the comparison demonstrated that varying cultural knowledge among transcribers can lead to variations in note interpretation.

Conclusion

[36] This investigation, conducted by our Japanese team, of agreement and disagreement among transcribers with and without performing experience of orally transmitted Japanese folk songs highlights intra-cultural variation in music cognition. Even within a single genre (Japanese min’yō), we observed substantial variation in the conceptualizations of “notes” between two Japanese transcribers, based on our diverse musical backgrounds and experiences. Our results also suggest that both differences in cultural knowledge and variations in specialized experience may shape how musical notes are conceptualized. When considered in the context of the broader VocalNotes study (Proutskova et al. 2026), our study highlights the role of intra-cultural variation in music cognition and note segmentation, and its balance against more cross-culturally universal features of pitch perception.

Data availability

Audio files are available at https://osf.io/8kg6y/

Author contributions

Conceptualization: Patrick E. Savage, Gakuto Chiba, Yuto Ozaki

Recording: Gakuto Chiba, Patrick E. Savage

Transcription: Gakuto Chiba, Yuto Ozaki

Analysis: Gakuto Chiba

Writing: Gakuto Chiba, Patrick E. Savage, Yuto Ozaki

Writing – reviewing/editing: Gakuto Chiba, Patrick E. Savage, Yuto Ozaki

Project administration/supervision/funding acquisition: Patrick E. Savage

References

Benetos, Emmanouil, Simon Dixon, Zhiyao Duan, and Sebastian Ewert. 2019. “Automatic Music Transcription: An Overview.” IEEE Signal Processing Magazine 36 (1): 20–30. https://doi.org/10.1109/MSP.2018.2869928

Cannam, Chris, Christian Landone, and Mark Sandler. 2010. “Sonic Visualiser: An Open Source Application for Viewing, Analysing, and Annotating Music Audio Files.” In Proceedings of the 18th ACM International Conference on Multimedia, 1467–68. Florence, Italy: ACM. https://doi.org/10.1145/1873951.1874248

Chiba, Gakuto, and Patrick E. Savage. 2023. “Traditional Folk Music in Contemporary Japan: Case Studies of Standardization and Diversification in Tsugaru Shamisen and Folk Song.” In Handbook of Japanese music in the modern era, edited by H. Johnson, 137–155. Brill. https://doi.org/10.1163/9789004687172_010

Church, Michael. (Ed.). 2015. The Other Classical Musics: Fifteen Great Traditions. Boydell Press.

England, Nicholas M, Robert Garfias, Mieczyslaw Kolinski, George List, Willard Rhodes, and Charles Seeger. 1964. “Symposium on Transcription and Analysis: A Hukwe Song with Musical Bow.” Ethnomusicology 8 (3): 223–33.

Haiduk, Felix, Cliodhna Quigley, and W. Tecumseh Fitch. 2020. “Song Is More Memorable Than Speech Prosody: Discrete Pitches Aid Auditory Working Memory.” Frontiers in Psychology 11. https://www.frontiersin.org/article/10.3389/fpsyg.2020.586723

Hopkins, Pandora. 1966. “The Purpose of Transcription.” Ethnomusicology 10 (3): 310–31.

Herzog, Avigdor. 1964. “Transcription and Transnotation in Ethnomusicology.” Journal of the International Folk Music Council 16: 100–101.

Holzapfel, Andre, and Emmanouil Benetos. 2019. “Automatic Music Transcription and Ethnomusicology: A User Study.” In Proceedings of the 20th International Society for Music Information Retrieval Conference, 678–84. Delft, The Netherlands.

Hughes, David W. 2008a. “Folk Music: From Local to National to Global.” In The Ashgate Research Companion to Japanese Music, edited by A. Tokita and D. Hughes, 281–302. Aldershot, UK: Ashgate.

———. 2008b. Traditional Folk Song in Modern Japan: Sources, Sentiment and Society. Kent: Global Oriental.

Jairazbhoy, Nazir A. 1977. “The ‘Objective’ and Subjective View in Music Transcription.” Ethnomusicology 21 (2): 263–282.

Jairazbhoy, Nazir A., and Hal Balyoz. 1977. “Electronic Aids to Aural Transcriptions.” Ethnomusicology 21 (2): 275–281.

Kaufmann, Walter. 1967. Musical Notations of the Orient. London: Indiana University Press.

Killick, Andrew. 2020. “Global notation as a tool for cross-cultural and comparative music analysis.” Analytical Approaches to World Music 8 (2): 235–279. ISSN 2158-5296.

Knust, Albrecht. 1959. “An Introduction to Kinetography Laban (Labanotation).” Journal of the International Folk Music Council 11: 73–76.

Koetting, James. 1970. “Analysis and Notation of West African Drum Ensemble Music.” Selected Reports in Ethnomusicology 1 (3): 115–146.

Kolinski, Mieczyslaw. 1973. “A Cross-Cultural Approach to Metro-Rhythmic Patterns.” Ethnomusicology 17 (3): 494–506.

List, George. 1963. “The Musical Significance of Transcription.” Ethnomusicology 7 (3): 193–197.

———. 1974. “The Reliability of Transcription.” Ethnomusicology 18 (3): 353–77. https://doi.org/10.2307/850519.

Lomax, Alan. 1964. Columbia World Library of Folk and Primitive Music, Volume XI: Japan, The Ryukyus, Formosa and Korea. Columbia Records KL 214.

Machida, Kashō, and Tsutomu Takeuchi (Eds.). 1965. Folk song genealogies: Esashi Oiwake and Sado Okesa [4 LPs]. Columbia. AL-5047/50.

町田佳声・竹内勉（編）「江差追分と佐渡おけさーー民謡源流行」（コロンビア）1965

Machida, Kashō (Ed.). 1944-1994. NHK Nihon Minyō Taikan [Japanese folk song anthology], NHK (Nippon Hōsō Kyōkai).

町田佳声（編）「NHK 日本民謡大観」（日本放送協会）1944–1994

Mauch, Matthias, Chris Cannam, Rachel Bittner, George Fazekas, Justin Salamon, Jiajie Dai, Juan Bello, and Simon Dixon. 2015. “Computer-Aided Melody Note Transcription Using the Tony Software: Accuracy and Efficiency.” In Proceedings of the 1st International Conference on Technologies for Music Notation and Representation. Paris, France.

McCollester, Roxane. 1960. “A Transcription Technique Used by Zygmunt Estreicher.” Ethnomusicology 4 (3): 129–32.

Metfessel, Milton. 1928. “The collecting of folk songs by phonophotography.” Science 67 (1724): 28–31. https://doi.org/10.1126/science.67.1724.28

Müller, Meinard, Emilia Gómez, and Yi-Hsun Yang (Eds.). 2019. “Computational Methods for Melody and Voice Processing in Music Recordings (Dagstuhl Seminar 19052).” Dagstuhl Reports 9 (1): 125–77. https://doi.org/10.4230/DAGREP.9.1.125

Nattiez, Jean-Jacques. 1990 [1987]. Music and Discourse: Toward a Semiology of Music [Musicologie Générale et Sémiologie]. Translated by Carolyn Abbate. Princeton, NJ: Princeton Univ. Press.

Nettl, Bruno, and Philip V. Bohlman. (Eds.). 1991. Comparative musicology and anthropology of music: Essays on the history of ethnomusicology. University of Chicago Press.

Nettl, Bruno. 2015. The Study of Ethnomusicology: Thirty-Three Discussions, 3rd Ed. Champaign IL USA: University of Illinois Press.

Ozaki, Yuto, John McBride, Emmanouil Benetos, Peter Pfordresher, Joren Six, Adam Tierney, Polina Proutskova, et al. 2021. “Agreement among Human and Automated Transcriptions of Global Songs.” In Proceedings of the 22nd International Conference on Music Information Retrieval (ISMIR 2021), 500–08. https://doi.org/10.31234/osf.io/jsa4u

Proutskova, Polina, et al. 2026. “VocalNotes – Investigating the Perception of Note Pitch and Boundaries through Varying Transcriptions of Vocal Performances from Five Musical Cultures.” Analytical Approaches to World Musics 13 (2).

Rafii, Zafar, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, Derry FitzGerald, and Bryan Pardo. 2018. “An Overview of Lead and Accompaniment Separation in Music.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (8): 1307–35. https://doi.org/10.1109/TASLP.2018.2825440

Reid, James. 1977. “Transcription in a New Mode.” Ethnomusicology 21 (3): 415–33.

Savage, Patrick E. 2020. “Measuring the cultural evolution of music: Cross-cultural and cross-genres case studies.” PsyArXiv Preprint. https://doi.org/10.31234/osf.io/mxrkw

———. In Press. “Comparative musicology: Evolution, universals, and the science of the world’s music.” Oxford University Press. https://doi.org/10.1093/9780191872303.001.0001

Savage, Patrick E., Sam Passmore, Gakuto Chiba, Thomas E. Currie, Haruo Suzuki, and Quentin D. Atkinson. 2022. “Sequence Alignment of Folk Song Melodies Reveals Cross-Cultural Regularities of Musical Evolution.” Current Biology 32 (6): 1395–402. E8. https://doi.org/10.1016/j.cub.2022.01.039

Seeger, Charles. 1957. “Toward a Universal Music Sound-Writing for Musicology.” Journal of the International Folk Music Council 9: 63–66.

———. 1958. “Prescriptive and Descriptive Music Writing.” The Musical Quarterly 44 (2): 184–95.

Stanyek, Jason. 2014. “Forum on Transcription.” Twentieth-Century Music 11 (1): 101–61. https://doi.org/10.1017/S1478572214000024

Tanabe, Hideo. 1967. Hokkaido no Oiwake-bushi ni tsuite [Oiwake-bushi in Hokkaido]. Research in Asiatic Music, 21: 51–63.

田辺秀雄「北海道の追分節について」 21号 51-63 東洋音楽研究 1967

Tanaka, Kenji. 2021. Zukai Nihon Ongaku shi Zouho Kaitei ban [Illustrated Japanese Music History – enlarged and revised edition]. Tokyodo Shuppan Co. Ltd.

田中健次「図解日本音楽史増補改訂版」東京堂出版 2021

Vandor, Ivan. 1975. “Tibetan Musical Notation.” The World of Music 17 (2): 3–7.

Wood, Anna L. C., Kathryn R. Kirby, Carol R. Ember, Stella Silbert, Sam Passmore, Hideo Daikoku, John McBride, Forrestine Paulay, Michael J. Flory, John Szinger, Gideon D’Arcangelo, Karen Kohn Bradley, Marco Guarino, Maisa Atayeva, Jesse Rifkin, Violet Baron, Miriam El Hajli, Martin Szinger, and Patrick E. Savage. 2022. “The Global Jukebox: A Public Database of Performing Arts and Culture.” PLOS ONE 17 (11): e0275469. https://doi.org/10.1371/journal.pone.0275469

Supplementary materials

Figure S1. Comparison of each segmentation (three folk songs x three versions) between Gakuto Chiba (GC) in the upper row and Yuto Ozaki (YO) in the lower row in the nine full audio recordings.

[1]. Our third author, Patrick Savage, is also trained in Japanese folk song performance and has won competitions (cf. Chiba & Savage 2023 for details of Chiba and Savage’s training in Japanese folk music performance).

[2]. “Ohayashi” is a signal or call for the beginning of a song, and it serves to enliven the atmosphere of the occasion.

[3]. “Setsudo” involves raising the pitch sharply from the bottom to the top, and vocalizing as if bouncing. It strengthens the flow of the entire song and helps sustain the breath.

[4]. “Honsukuri” means to sing as if turning the note, leading the next note to a higher note.

[5]. “Hansukuri” means to sing as if turning the note, but not as dynamic as “Honsukuri.” It leads the next note to a lower note.

[6]. “Sukui” has almost the same vocalization and role as “Setsudo,” but raises the pitch slightly higher than “Setsudo.”

[7]. “Ma” refers to the temporal interval created by the breathing of the folk song performer.
