MATAS corpus (version 3.0)

Description: updated, manually checked, morphologically annotated corpus MATAS.

Language: Lithuanian

Previous versions:
1. MATAS v0.2 (http://hdl.handle.net/20.500.11821/9)
2. MATAS v1.0 (http://hdl.handle.net/20.500.11821/33)

Formats, standards:
1. CoNLL-U (https://universaldependencies.org/format.html)
2. JABLONSKIS tagset v2 (https://sitti.vdu.lt/jablonskis-en/)
3. MULTEXT-East tagset (http://nl.ijs.si/ME/V4/msd/html/index.html)
4. UTF-8

Size:
Tokens (incl. punctuation): 2,137,287
Words: 1,694,819
Sentences: 144,047
Documents: 1,234

Genres: the corpus contains 5 genres: documents (14%), fiction (19%), periodicals (36%), scientific texts (24%), transcripts (7%).

Publisher: Institute of Digital Resources and Interdisciplinary Research (SITTI), Vytautas Magnus University.
Use this URL to cite the dataset: https://hdl.handle.net/20.500.12259/274272
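
Because the corpus is distributed in CoNLL-U, the size figures listed in the description can in principle be recomputed from the files. The following is a minimal Python sketch, not the publisher's own tooling. It assumes that punctuation is marked with the UPOS value PUNCT in the fourth column; the corpus may encode this differently (for example only in its JABLONSKIS or MULTEXT-East tags), in which case the check would need adapting. The function name count_conllu and the two-sentence sample are made up for illustration and are not taken from MATAS.

```python
from typing import Dict, Iterable


def count_conllu(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens, and non-punctuation tokens in CoNLL-U lines."""
    sentences = tokens = words = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed CoNLL-U token line; skip it.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
            # are not counted as tokens here.
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        # ASSUMPTION: punctuation carries UPOS "PUNCT" in column 4.
        if cols[3] != "PUNCT":
            words += 1
    return {"sentences": sentences, "tokens": tokens, "words": words}


def _row(idx: int, form: str, lemma: str, upos: str) -> str:
    """Build one 10-column CoNLL-U line with unused columns left as '_'."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", "_", "_", "_", "_"])


# Hypothetical two-sentence sample, NOT taken from the MATAS corpus.
SAMPLE = "\n".join([
    "# text = Jis skaito knygą.",
    _row(1, "Jis", "jis", "PRON"),
    _row(2, "skaito", "skaityti", "VERB"),
    _row(3, "knygą", "knyga", "NOUN"),
    _row(4, ".", ".", "PUNCT"),
    "",
    "# text = Taip.",
    _row(1, "Taip", "taip", "PART"),
    _row(2, ".", ".", "PUNCT"),
    "",
])

if __name__ == "__main__":
    print(count_conllu(SAMPLE.splitlines()))
```

To run it over a downloaded file, pass an open file handle (opened with encoding="utf-8") instead of SAMPLE.splitlines(). Whether the resulting counts match the published figures depends on how the publisher defined "words" and "tokens", which the description does not spell out.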