Hong Kong Cantonese Child Language Corpus (CANCORP)

Established and made public in 1996, this corpus was the joint effort of Thomas Hun-tak Lee (Principal Investigator, CUHK), Colleen Wong (HKPU) and Samuel Cheung-Shing Leung (then HKU, now HKIEd), supported by an earmarked grant from the Hong Kong Research Grants Council (“The development of grammatical competence in Cantonese-speaking children”, 1991-1993, CUHK 335/95H). The following research students and assistants made a pivotal contribution to the project: Alice Shuk-yee Cheung, Patricia Man, Kitty Ka-Sinn Szeto, and Cathy Sin-Ping Wong.

The corpus is a longitudinal record of the early language development of eight Cantonese-speaking children, each of whom was observed for one year, starting when the child was between one and a half and two years old. Four of the children are male and four are female.

The corpus contains 171 files, 14 megabytes in total, coded in the CHAT format (Codes for the Human Analysis of Transcripts) and tagged with 33 part-of-speech labels. The transcripts record conversational exchanges between the children and various adults, mostly the investigators, and often caretakers and other family members as well.

There are several versions of the corpus. The original version contains transcripts with the utterances given in Chinese characters on the main tier and part-of-speech tags on a subsidiary tier. An updated version in this format can be downloaded from the following URLs:

http://www.arts.cuhk.edu.hk/~lal/

http://humanum.arts.cuhk.edu.hk/~cancorp/
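
The tier layout described above can be illustrated with a minimal Python sketch. It assumes the general CHAT conventions (header lines begin with "@", main-tier lines with "*" plus a speaker code, and subsidiary tiers with "%"). The function name parse_chat, the tier name "mor", the sample utterances and the placeholder tags TAG1 to TAG4 are hypothetical and are not taken from the corpus; the tier names and tag labels actually used in the CANCORP files may differ.

```python
from typing import Dict, List


def parse_chat(text: str) -> List[Dict[str, str]]:
    """Group each main-tier line with the subsidiary tiers that follow it."""
    utterances: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier, e.g. "*CHI:<tab>utterance"
            speaker, _, content = line.partition(":")
            utterances.append(
                {"speaker": speaker[1:], "utterance": content.strip()}
            )
        elif line.startswith("%") and utterances:
            # Subsidiary tier: attach it to the most recent main tier
            name, _, content = line.partition(":")
            utterances[-1][name[1:]] = content.strip()
        # Header lines ("@...") and tab-indented continuation lines are
        # skipped in this simplified sketch.
    return utterances


# Hypothetical sample, NOT taken from the corpus; tags are placeholders.
SAMPLE = (
    "@Begin\n"
    "*CHI:\t我 要 食 .\n"
    "%mor:\tTAG1 TAG2 TAG3 .\n"
    "*INV:\t好 .\n"
    "%mor:\tTAG4 .\n"
    "@End\n"
)

parsed = parse_chat(SAMPLE)
for item in parsed:
    print(item["speaker"], item["utterance"], item.get("mor"))
```

Pairing each utterance with its tag line in this way makes it straightforward to align words with their part-of-speech tags, provided both tiers contain the same number of space-separated tokens.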

Another version of the corpus, distributed as the zipped file ‘LeeWongLeung.zip’, contains transcripts with part-of-speech tags laid out in a format different from that of the original CANCORP corpus. This version, prepared by Paul Fletcher’s research group at HKU, can be downloaded from the CHILDES URL below:

http://childes.psy.cmu.edu/data/EastAsian/Cantonese/