And actually English is #33 in the Language Weirdness Index. Number two is spoken in Siberia by 22,000 people: Nenets (that’s where we get the word parka from). Number three is Choctaw, spoken by about 10,000 people, mostly in Oklahoma.īut here’s the rub-some of the weirdest languages in the world are ones you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin. The language that is most different from the majority of all other languages in the world is a verb-initial tonal languages spoken by 6,000 people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el Grande Mixtec). In this blog post, I’ll only report languages that have a value filled in for at least two-thirds of features (239 languages). But because different features have different numbers of values and we want to reduce skewing, we actually take the harmonic mean (and because we want bigger numbers = more weird, we actually subtract the mean from one). The Weirdness Index is then an average across the 21 unique structural features. So if we had included subject-object-verb order then English would’ve gotten a value of 0.355 (we actually normalized these values according to the overal entropy for each feature, so it wasn’t exactly 0.355, but you get the idea). We end up with 21 features in total.įor each value that a language has, we calculate the relative frequency of that value for all the other languages that are coded for it. We can focus in on features that aren’t strongly correlated with each other (between two correlated features, we pick the one that has more languages coded for it). Ideally, we’d like to judge weirdness based on unrelated features. Part of this is just the nature of the features listed in WALS-there’s one for overall subject/object/verb order and then separate ones for object/verb and subject/verb. Now, one problem is that if you just stop there you have a huge amount of collinearity. The data in WALS is fairly sparse, so we restrict ourselves to the 165 features that have at least 100 languages in them (at this stage we also knock out languages that have fewer than 10 of these-dropping us down to 1,693 languages). I’m just not ready for verbs when I open my mouth.) (Aside: I’ve done some work with Hawaiian and Majang and that’s how I learned that verbs are a big commitment for me. For what it’s worth, 41.0% of the world’s languages are actually SOV order. Meanwhile only 8.7% of languages start with a verb-like Welsh, Hawaiian and Majang-so cross-linguistically, starting with a verb is unusual. For example, English word order is subject-verb-object-there are 1,377 languages that are coded for word order in WALS and 35.5% of them have SVO word order. That is, we evaluate each language in terms of how unusual it is for each feature. So rather than take an English-centric view of the world, WALS allows us take a worldwide view. These features include word order, types of sounds, ways of doing negation, and a lot of other things-192 different language features in total. The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. But that’s a pretty irritating definition. To this end, we might choose to define “weirdness” in terms of English. The better that a system can deal with diverse data, the more confident that you can be in its ability to handle unseen data. So one of the best ways to test an NLP system is to try languages other than English. English is far and away the language that linguists have worked on the most and it’s also the language that has the most available resources for computer science projects (and more data is almost always better in computer science). Natural language processing (NLP) is about finding patterns in language-for example, taking heaps of unstructured text and automatically pulling out its structure. The open secret about NLP is that it’s very English-centric. So far we’ve worked on (big breath): English, Portuguese (Brazilian and from Portugal), Spanish, Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek, Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian, Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian, Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese, Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish, Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano, Danish, and Navajo. We’re in the business of natural language processing with lots of different languages.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |