Investigation of the Value of the Internet in Linguistics and Language-Based Research
Few would dispute that ‘the Information Age’ is in full swing. Its defining feature is the widespread storage and transmission of data made possible by the World Wide Web. The internet can therefore be used to understand our world on a larger scale: the Web can be seen as a microcosm of the physical world and its underlying structures. As such, it allows us to study language, its use, and its spread in ways that were not previously possible.
In the 1990s, content on the internet (or, more precisely, the World Wide Web) was written almost entirely in English, because the internet originated in the United States. At that point the internet was a poor microcosmic model of the world. However, some now predict that the Web will be largely non-English in the near future. Between 1995 and 2000 alone, the number of people with internet access in non-English-speaking countries increased from 7 million to 136 million (Crystal, 2001, p. 216). Now, in 2016, that number is presumably far larger. The study of language use and spread through the internet is therefore only beginning.
According to Danet and Herring in The Multilingual Internet (2007), the U.S. still has the largest share of internet users, at 20%. This may nonetheless be a fairly accurate reflection of the linguistic world around us. Scholars have grown increasingly concerned that English may soon dominate the world's languages, resulting in a sort of ‘linguistic imperialism’, and the internet has provided a new arena in which English can crowd out other languages. Nevertheless, the number of non-English-speaking internet users continues to increase, with users of Chinese, Japanese, and the languages of India projected to grow rapidly in the coming years. The internet is therefore becoming an increasingly valid resource for language study and for acquiring data on language usage.
The study of language through the internet has social implications as well. For example, web-based examination of this kind can support conclusions about the lasting effects of European imperialism. Take the case of Tanzania, an East African country that was first colonized by Germany and became a British mandate at the end of World War I. British rule ended in 1961, but the effects of colonialism can still be felt even in everyday language use. Swahili (Tanzania's official language, and one of African origin) is used for instruction only at the elementary level, which forces almost everyone to learn English (Danet & Herring, 2007, p. 17). Furthermore, while internet use in the nation has grown in the last five years, only the ‘elite’ have access. This elite, as a rule, speaks English almost exclusively. As a result, webpages in Tanzania are almost entirely in English, reflecting European-imposed issues of class and Western elitism.
Taken further, the internet can be used to explore the general connections between specific languages (whether or not those connections arise from imperialism). Babel (2012) is a program developed by Hannes Mühleisen to find and quantify the ways languages are connected on the Web. Without going too far into the methodology, the program sifts through webpages in order of popularity and determines the language each is written in. It then follows any links on those webpages and determines the language of each linked page. Compiling this information gives a generalized snapshot of which languages frequently occur together. For reasons discussed earlier, English has been removed from the data, as have outliers. The following is a table listing the most commonly connected languages, expressed as a percentage of connections out of all examined pages in that language.
As a visual representation of how numerous these connections are, the creators of Babel have also produced a chord diagram.
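The link-following tally described above can be illustrated with a minimal Python sketch. It is not Babel's actual code: the toy `PAGES` data, the function name `language_connections`, and the choice to normalize by each language's outgoing links are all assumptions made for illustration, and the language of every page is taken as already known rather than detected.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: URL -> (language code, list of outgoing links).
# Real pages would need a language-detection step, which is omitted here.
PAGES = {
    "a.example/1": ("da", ["b.example/1", "a.example/2"]),
    "a.example/2": ("da", ["b.example/1"]),
    "b.example/1": ("fa", ["a.example/1"]),
    "c.example/1": ("en", ["a.example/1", "b.example/1"]),
}


def language_connections(pages, excluded=("en",)):
    """Return {source_lang: {target_lang: percent of the source's links}}.

    Links to or from an excluded language (English by default) are ignored,
    as are links to pages whose language is unknown.
    """
    cross_links = defaultdict(Counter)  # links between different languages
    totals = Counter()                  # all counted links per source language
    for src_lang, links in pages.values():
        if src_lang in excluded:
            continue
        for target in links:
            if target not in pages:
                continue  # language of the linked page is unknown
            tgt_lang = pages[target][0]
            if tgt_lang in excluded:
                continue
            totals[src_lang] += 1
            if tgt_lang != src_lang:
                cross_links[src_lang][tgt_lang] += 1
    return {
        src: {tgt: 100.0 * n / totals[src] for tgt, n in counts.items()}
        for src, counts in cross_links.items()
    }


if __name__ == "__main__":
    for src, targets in language_connections(PAGES).items():
        for tgt, pct in targets.items():
            print(f"{src} -> {tgt}: {pct:.1f}% of links")
```

In this toy data, two of the three counted links from Danish pages lead to a Farsi page, so the Danish-to-Farsi connection comes out at roughly two thirds. The normalization Babel itself uses may differ from the one assumed here.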
As represented in both the chord diagram and the numerical chart, the connection between Farsi (one of the major languages spoken in Iran) and Danish is by far the most pronounced. To explain, Denmark and the area that was formerly Persia have an extensive trading history that extends back to 1687. In closely related research, a linguist from Amsterdam has been similarly combing through internet data to draw conclusions relating to ‘preferred multilingual usage patterns’ (Dorleijn, 2016). Her research focuses specifically on the connection between Turkish and Dutch, which are also connected through a trading history extending to the 17th century.
The internet provides a remarkably convenient way to collect linguistic data. No fieldwork is required, and data can be collected over long periods because material on the internet persists, waiting to be used for research. Moving forward, the internet makes further research into the dominance of English in the world, developing language connections, and the changing face of language in the Information Age both possible and relatively accessible.
References
Crystal, D. (2001). Language and the Internet. Cambridge, UK: Cambridge University Press.
Danet, B., & Herring, S. C. (2007). The Multilingual Internet: Language, Culture, and Communication Online. Oxford: Oxford University Press.
Dorleijn, M. (2016). Can Internet Data Help to Uncover Developing Preferred Multilingual Usage Patterns? An Exploration of Data from Turkish-Dutch Bilingual Internet Fora. Journal of Language Contact, 9(1), 130-162.
Huang, X., Acero, A., & Hon, H. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Upper Saddle River, NJ: Prentice Hall PTR.
Mühleisen, H. (2015, April 28). Babel 2012 Web Language Connections. Retrieved September 11, 2016, from https://github.com/norvigaward/2012-naward25/wiki/Babel-2012—Web-Language-Connections