Unsupervised morphological analysis using tries

Title:

Personal Author:

Ak, Koray, 1985- author.

Publication Information:

[s.l. : s.n.], 2011.

Physical Description:

viii, 40 leaves : illustrations, tables ; 30 cm + 1 CD-ROM.

General Note:

Date of approval: 29.04.2011.

Includes list of figures, tables.

Abstract:

Morphological analysis, or decomposition, studies the structure, formation, and function of words, identifies the morphemes (the smallest meaning-bearing elements) of a language, and attempts to formulate rules that model the language. It is widely used in areas such as speech recognition, machine translation, information retrieval, text understanding, and statistical language modeling. Because natural language processing applications deal with large amounts of data, it is not feasible to have linguists analyze a text corpus by hand; the complexity and real-time processing requirements call for automated morphological analysis. As an alternative to hand-made systems, there are algorithms that work in an unsupervised manner and autonomously perform morphological analysis of the words in an unannotated text corpus. In this thesis, an unsupervised learning algorithm is proposed to extract information about the text corpus and the model of the language. The proposed algorithm constructs a trie whose nodes hold characters and the occurrence counts of the words. The algorithm then detects the roots of the given words by examining the occurrence counts along the path of each word. Once the root is found, the algorithm creates a new trie from the affix parts that remain after the root of each word. The algorithm continues recursively until there is no affix left to process. Experimental results on three languages (Finnish, English, and Turkish) show that our novel algorithm performs better than most of the previous algorithms in the field.

Morphological analysis, or decomposition, studies the structure, ordering, and functions of words, identifies the smallest meaning-bearing morphemes within words, and attempts to derive a model of the language. It is used in areas such as speech processing, machine translation, information retrieval, text understanding, and statistical language modeling. Because a text contains many word forms, morphological analysis is both difficult and necessary for most languages. In inflected languages, thousands of different word forms can belong to the same root, which makes it hard to build lists of inflected word forms. Given that natural language processing applications work with large amounts of data, having linguists do this work by hand is not feasible in terms of complexity and real-time processing. The task therefore has to be carried out by automated morphological analysis algorithms. In this context, raw text corpora can be processed by systems that use unsupervised morphological analyzers. In this study, an unsupervised learning algorithm is proposed to extract information about text corpora and the model of the language. The designed algorithm builds trees from the words in the text corpus and uses them to find the roots and affixes of the given words according to how frequently the words occur. After the roots are extracted, the algorithm builds affix trees from the remaining word parts and recursively finds all affixes. The algorithm was tested on Finnish, English, and Turkish and gave better results than most previous work.
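
The trie-based procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact criterion used to detect a root, so the sketch assumes one: the boundary is placed at the largest drop in occurrence counts along a word's path. The names `build_trie`, `split_head`, `segment`, and the `min_head` parameter are hypothetical and are not taken from the thesis.

```python
class TrieNode:
    """One character position; count = number of corpus words passing through."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0


def build_trie(words):
    """Build a character trie with an occurrence count on every node."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root


def split_head(trie, word, min_head=2):
    """Split a word into (head, rest).

    Assumed heuristic (not from the thesis): cut where the occurrence
    count along the word's path drops the most. With no drop, the whole
    word is returned as the head.
    """
    counts = []
    node = trie
    for ch in word:
        node = node.children[ch]
        counts.append(node.count)

    best_cut, best_drop = len(word), 0
    for i in range(min_head, len(word)):
        drop = counts[i - 1] - counts[i]
        if drop > best_drop:
            best_cut, best_drop = i, drop
    return word[:best_cut], word[best_cut:]


def segment(words, min_head=2):
    """Map each word to its list of segments, root first.

    The parts left after each root go into a new trie, and the same
    procedure is applied recursively until nothing is left to process.
    """
    trie = build_trie(words)
    splits = {}
    rests = []
    for word in words:
        head, rest = split_head(trie, word, min_head)
        splits[word] = (head, rest)
        if rest:
            rests.append(rest)

    # Every rest is strictly shorter than its word, so the recursion ends.
    sub = segment(rests, min_head) if rests else {}
    return {w: [head] + sub.get(rest, []) for w, (head, rest) in splits.items()}


corpus = ["walk", "walks", "walked", "walking",
          "talk", "talks", "talked", "talking"]
for word, parts in segment(corpus).items():
    print(word, "->", "+".join(parts))
```

On this toy corpus the sketch yields `walk+ing` for `walking` and `talk+ed` for `talked`. Real corpora would need a more careful boundary criterion than a single largest drop, which the thesis itself would specify.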

Subject Term:

Computer engineering.

Dissertations, Academic.

Added Author:

Added Corporate Author:

M.S. in Computer Engineering. Thesis.

Added Uniform Title: