Yonsei Corpus

Introduction

A corpus is a large, structured collection of digitized linguistic data. It is essential in the many academic fields that study language because it offers a comprehensive view of linguistic variation. The Yonsei Corpus project began in 1986 with the founding of the Korean Dictionary Society. In 1988, we began building a corpus for compiling dictionaries. Later, we extended the scope of the corpus to include a wider variety of linguistic data for studies in Korean linguistics, Korean education, Human Linguistics, and Teaching Korean as a Foreign Language.



List of Corpora

Number Name Size
1 Yonsei Corpus 1 2,900,000
2 Yonsei Corpus 2 1,100,000
3 Yonsei Corpus 3 5,980,000
4 Yonsei Corpus 4 770,000
5 Yonsei Corpus 5 8,600,000
6 Yonsei Corpus 6 7,230,000
7 Yonsei Corpus 7 13,670,000
8 Yonsei Corpus 8 870,000
9 Yonsei Corpus 9 1,500,000
10 Yonsei Corpus 10 780,000
11 Yonsei Corpus 11 730,000
12 Yonsei Corpus of Korean in the 20th Century 150,378,870
13 Corpus of Korean Textbooks (Complete) 724,856
14 Corpus of Korean Textbooks (Conversation) 119,598
15 Yonsei Korean Learner Corpus 278,542
16 Korean Elementary Textbook Corpus after Independence 1,496,280
17 The 6th and 7th Korean Elementary Textbook Corpus 1,681,769
18 Yonsei Balanced Corpus of Written Discourse 1,054,362
19 Yonsei Balanced Corpus of Spoken Discourse 998,934
20 Yonsei Corpus of Polysemy 1,165,224
21 Yonsei Corpus of Hangul Tripitaka 386,472
22 Corpus of <Tongnip Sinmun> Newspaper 144,309
23 Corpus of Popular Songs in the Modern Era 29,339
24 Yonsei Corpus of Multimodal Data 18,986
25 Twitter Corpus 945,175,620
26 Political Discourse Corpus 306,681
Total 1,148,089,842
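
The total above can be checked by summing the 26 listed sizes. The following is a minimal sketch in Python, not part of the original list. The variable names are illustrative assumptions, and because the list does not state the unit of Size, the values are treated as plain counts.

```python
# Sizes copied from the list above, in list order (entries 1 to 26).
# The unit is not stated in the list, so these are treated as plain counts.
sizes = [
    2_900_000, 1_100_000, 5_980_000, 770_000, 8_600_000, 7_230_000,
    13_670_000, 870_000, 1_500_000, 780_000, 730_000,
    150_378_870, 724_856, 119_598, 278_542, 1_496_280, 1_681_769,
    1_054_362, 998_934, 1_165_224, 386_472, 144_309, 29_339, 18_986,
    945_175_620, 306_681,
]

total = sum(sizes)

# Entry 25 (index 24) is the Twitter Corpus, by far the largest entry.
twitter_share = sizes[24] / total

print(f"total = {total:,}")                         # total = 1,148,089,842
print(f"Twitter Corpus share = {twitter_share:.1%}")  # Twitter Corpus share = 82.3%
```

The sum agrees with the stated total, and the Twitter Corpus alone accounts for roughly four fifths of it.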