Wikidata talk:WikiProject Indonesia/Archive


Ping project

Arupako Rachmat04 Fadirra Martin

Notified participants of WikiProject Languages in Indonesia

Beeyan Raisha Martin

Notified participants of WikiProject Ethnicities in Indonesia

Beeyan Raisha Rachmat04

Notified participants of WikiProject Administrative Units in Indonesia
@Arupako:

To make notifications easier, how about also adding your names to Wikidata:WikiProject Indonesia/Participants? Hddty (talk) 06:34, 3 August 2020 (UTC)

@Hddty: Sure, please. ··· 🌸 Rachmat04 · 05:41, 4 August 2020 (UTC)

Changes

Hello everyone! Over the next few days, the WikiProject Indonesia pages will be developed along the following outline.

  • Home: similar to a "Main Page", presenting the scope and goals of the WikiProject, the participating users, subprojects, and properties related to Indonesia.
  • Discussion: intended as the Q&A and coordination page for WikiProject Indonesia
  • Events: a list of Wikidata-related events held in or related to Indonesia, whether by the community or by Wikimedia
  • Topics: discussions of particular topics, and explanations of how to find data, properties, and items on a given theme, etc.
  • Queries: example queries about Indonesia, simple or complex, as well as requests for new queries
  • Consultation: planned as a page for 1-on-1 user training, with links to other Wikidata channels
  • Souvenirs: planned to carry information about weekly challenges with souvenir prizes from Wikimedia Indonesia

The WikiProject will now also have a page banner like the one at Wikidata:WikiProject Indonesia/Banner. If you have any suggestions, please leave a message on this page. RXerself (talk) 04:28, 21 October 2020 (UTC)

Babar Islands

Hello! Could someone please check Babar Islands (Q31437) (see also Babar Island (Q14906332)). I think there may be some conflation between an administrative district and an archipelago, but I'm having trouble sorting it out. Thanks, Bovlb (talk) 17:40, 15 April 2021 (UTC)

✓ Done RXerself (talk) 05:53, 20 April 2021 (UTC)

Lexemes in "Malay" and "Indonesian"

Akbar Apriadi_ap Arupako Bennylin Boesenbergia David Wadie Fisher-Freberg FarrasFN Fexpr Martin GhoziSeptiandri Hanamaria23 Joseagush Kenny Labdajiwa LilyanAgustine Marchella Angelina Arcuscloud Quvsn Rachmat04 Rafka Aditia Raflinoer32 Rahmatdenas Raizan1 Rifat Rachim Rtnf RXerself S Kartika Supardisahabu Titanboo Zhilal Darma Volstand WanaraLima Salsa66syifa Ulumarifah Uni Riska Tielumphd HaidirAndiNovianto Athayahisyam Syenirasheila Rocky Reviko T. Lembah

Notified participants of WikiProject Indonesia

You may have heard that there is an ongoing project on Wikidata to collect data about words in various languages as lexicographical data (lexemes). There are currently 42 lexemes with language "Indonesian" and 636 with language "Malay" (thanks @Tofeiku: for adding the Malay lexemes). A recent analysis of the Malay lexemes' ability to handle the Indonesian language suggests that it is quite possible to use just one set of lexemes for both languages. While I believe this approach has merit since Indonesian is a standardized variety of Malay, I wanted to get your opinions on the idea of merging identical lexemes between the two, such as bahasa/بهاس (L31550) and bahasa (L6539), into a single lexeme, given their identical origins/pronunciations/grammatical features/etc. (marking differences between the Malaysian and Indonesian standards within them if they exist), and changing the rest of the Indonesian lexemes to have language "Malay" with a separate indication in them that they occur only in the Indonesian standard. Mahir256 (talk) 17:43, 19 April 2021 (UTC)

Also pinging @Bennylin:, @Masjawad99:, and perhaps, if I may, @Meursault2004: and @Fadirra:. Speaking as an Indonesian, I personally disagree with listing the lexemes as Malay, since it may confuse some Indonesians who hold different views regarding the relation of Indonesian and Malay as languages. RXerself (talk) 06:03, 20 April 2021 (UTC)
@RXerself: The lexicographical data project tries its best to operate with respect to principles and classifications based on linguistics first and foremost. As a result of this, some pairs of languages which for purely political reasons are often split have their lexemes here merged: rather than maintain separate Hindi and Urdu lexemes, for example, Hindustani lexemes (with both Devanagari and Arabic spellings) are maintained instead; it helps keep them in sync with one another owing to their identical grammar and common vocabulary base, which if split not only risks the later unfair development of one variety over another, but more simply makes for effectively and possibly inconsistently duplicated work. (A similar situation exists with Punjabi (Western and Eastern), Azerbaijani (Northern and Southern), and Catalan (Catalan and Valencian) lexemes.) Seeing as the merge I have brought up here falls firmly into this scenario, I thus invite you to reconsider your implication that the possible "confusion" of some who hold a particular political stance would be detrimental to the development of a body of linguistic information that is suitable for both the Malaysian and Indonesian standards of this language. Mahir256 (talk) 10:20, 20 April 2021 (UTC)
I disagree because Malay and Indonesian aren't the same; it may make my country and my language disappear. (Sorry, my English isn't good enough; please forgive me if you misunderstand.) DetectivePro (talk) 14:00, 21 April 2021 (UTC)
Comments from Bennylin:
(@Bennylin: since there's a lot to unpack here, I am opting to respond to your comments inline for readability's sake.)
I believe this personal idea is against the wishes of the Indonesian Wikimedia community, readers, and speakers. I'm assuming it's a personal, fleeting idea, and not a proposal, RfC, or serious endeavor that would need to involve the bigger Indonesian and Malay(sian) communities, not just here in Wikidata but also in other projects, given the large impact Wikidata has on other projects.
I am floating it here first now, given that there is someone working on Malay lexemes at the moment and thus unilateral action on my part would not have been appropriate. (Granted, it would also have been helpful if there were people working on, or even aware of prior to my comment, Indonesian lexemes, whose thoughts would have been appreciated, but unfortunately there do not seem to be any. As your last edits to the Lexeme: namespace were about two years ago or more, on 24 July 2019 and 5 June 2018, since which time there has been much greater development of the namespace, I will assume that you are among those effectively unaware of the lexicographical data project before RXerself pinged you above.) If I had not thought about this idea and discussed it at length with Tofeiku before, I would not have brought it up here, and so calling it a 'personal, fleeting idea' is incorrect. Due to phab:T212843 lexemes are not currently accessible from other projects, so I do not see what 'large impact' the current small body of "Indonesian" lexemes currently has on those projects.
The Indonesian language has its own distinct ISO 639 code, and that's the standpoint we need to start from. No language with its own ISO 639 code should be merged under another language.
The distinct ISO 639 code is not itself grounds for keeping a language separate or not. Some incredibly divergent language varieties are unfortunately grouped under a single code (such as the 'dialects' of Southern Min), while others despite their being almost entirely identical save for vocabulary choices are split on these choices alone (such as Hindi/Urdu from the codeless Hindustani (Q11051) and Bosnian/Croatian/Montenegrin/Serbian from the codeless Shtokavian (Q148893)). As a look at the current mappings of language items to language codes shows, we have had to add non-standard extensions to the ISO 639 codes at our disposal in order to denote variations in spelling/writing system for lemmas and forms which are significant within a number of languages, and some of the pairs of languages I noted in my reply to RXerself already use separate ISO 639 codes for different writing systems. It therefore follows that ISO 639 codes, while important within a language for some distinctions, are not as useful for comparison across languages as far as lexicographical data goes compared to the Wikidata items used for the "language" of those lexemes.
First, for clarification: when you use the term "Malay" and link to the article, I will assume the language code "ms", and for Indonesian "id", and that the idea is to merge all "id" lexemes under "ms"; those are the assumptions for the rest of my comments. Second, it's not a single language; "ms"/"msa" is a macrolanguage (curiously, the term "macro-language" doesn't appear in the article), encompassing many other language codes (36, the biggest in Asia), and you're correct when you say "id" is part of the "ms" macrolanguage. But if you continue with the idea to merge all "id" lexemes under "ms", then be prepared to merge "btj", "mfb", "bjn", "bve", "kxd", "bvu", "pse", "coa", "liw", "dup", "hji", "jak", "jax", "vkk", "meo", "kvr", "mqg", "kvb", "lce", "lcf", "zlm", "xmm", "min", "mui", "zmi", "max", "orn", "ors", "mfa", "pel", "msi", "zsm", "tmw", "vkt", "urk" as well (current and future). Third, whether you agree or not, the fact is that those two languages are tied intrinsically to two different people groups and nations, as reflected in the names: Malay language for the Malay people of Malaysia (so not everybody in Malaysia, and sharing heritage with the Malay people in Indonesia), and Indonesian language for the Indonesian people of Indonesia (exclusively Indonesia). Thus any talk of merging and so forth intrinsically touches the subject of the people and the nations as well. You can't say it's just based on linguistics and not demographic and political, because those three are intrinsically tied in the real world.
Per the end of my last comment, the proposal here is not primarily focused on the code "ms" as it is on the Wikidata item Malay (Q9237). Continuing on that same point, the language being referred to with that item ("the language on which Standard Indonesian and Malaysian are based", as the hatnote at the top of the enwiki article reads, and the one I refer to here with the term "Malay") is therefore the one whose Wikipedia page lists fewer than half of the codes you mentioned; the rest of those codes, although grouped under the macrolanguage "ms" in ISO 639, may well warrant having separate bodies of lexemes on account of significant structural differences from the language described on that Wikipedia page (here, therefore, would be an example of a deficiency in ISO 639). Among the codes that are currently listed in the enwiki article ("zlm", "kxd", "ind", "zsm", "jax", "meo", "kvr", "xmm", "min", "mui", "zmi", "max", "mfa"), it may still be defensible for some of them to have separate lexemes if their differences are significant enough structurally speaking (Minangkabau comes to mind here). A difference being "reflected in the names" of languages, as I see it, exists between Malaysian Malay (Q15065) and Indonesian (Q9240) (two different standards of the same language for "two different people groups and nations"), rather than between Indonesian (Q9240) and Malay (Q9237) (the former being part of the latter, and there existing lexemes in the latter). 
I would like in this discussion to stick with linguistic arguments for or against such a merge, rather than "demographic and political" ones, since nowhere in my proposal did I suggest that differences existing between the Malaysian and Indonesian standards of Malay would be eliminated in such a merge, since I would like to keep things within the realm of civility as far as what is discussed goes, and since downstream users of this data (such as the Abstract Wikipedia project) can decide for themselves how they take this lexicographical data and use it to reflect a particular language/worldview.
Also, I'm arguing here that using "ms" as a language code for Wikimedia projects was a mistake, or is a mistake in retrospect (as it was reclassified as macrolanguage 5 years after "ms" projects was born), and it should be changed to what it really is, "zsm" (Standard Malay) or Malaysian language. Malay speakers from Malaysia (because not all Malaysians speak Malay) can keep their "zsm" projects, and Malay speakers from Indonesia could create "btj", "mfb", "bve", "pse", "liw", "dup", "hji", "jak", "vkk", "kvr", "mqg", "kvb", "lce", "lcf", "xmm", "mui", "max", "pel", "tmw" projects (following the "min" and "bjn"/"bvu" projects that already exist), but that's a different topic. Even so, in practice the code "ms" is still viewed as a synonym of Standard Malay, since the editors in "ms" projects mainly use "zsm"-specific vocabulary.
It is indeed unfortunate that the macrolanguage code was adopted for the Wikipedia that conforms most closely to the Malaysian standard of the Malay language (rather than adopt e.g. the code "zsm" as you suggest). A number of issues involving the Language Committee and those tasked with wiki setup prevent such a shift to appropriate language codes for long-standing projects, and thus given the inheritance of language codes at our disposal from the language codes assigned to different wikis, we are stuck adopting the "ms" code here for Malay (not "Malaysian"; again, here, we have to make do with a deficiency in ISO 639 at the time of mswiki's creation). I want to emphasize again, however, that the proposal here is not primarily focused on the code "ms" as it is on the Wikidata item Malay (Q9237); just because other communities make certain assumptions about certain codes does not mean we have to adopt those assumptions wholesale here.
Historically, the Malay language originated in Sumatra, Indonesia, and besides having been a lingua franca for centuries in the SE Asia archipelago and Malacca Peninsula, it is still spoken today in Sumatra, Indonesia. Thus, it's no surprise that Indonesians from Sumatra still understand and sometimes contribute to ms.wiki projects. That's not the case for Indonesians from other islands (Java, for example), for whom the Malay vocabulary used in ms.wiki projects is as foreign as that of any Indonesian local language not native to those readers (such as the jv, su, min, bjn, ban, gor, bug, tet, ace, nia projects). Because of that, Wikimedia Indonesia and Indonesian Wikipedia generally refer to ms.wiki as one of our local language projects, not as a foreign language project like the Tagalog wiki projects, for example; thus the reverse of the assumption that "id" is part of "ms".
(See the other comments I made regarding and using the terms "Malay", "Malaysian", and "Indonesian" in my replies. Also what the Wikimedia affiliate and the Wikipedia do are "political" arguments rather than linguistic ones, which I do not want to reply to in depth for now.)
The health of the Wiktionary community: based on the status of the two Wiktionary projects, "id" eclipses "ms" by a wide margin: id.wikt has ~86794 Indonesian words and phrases, while ms.wikt only has ~2263 Malay lemmas; that of course reflects the sheer difference in the number of native speakers of the two diverging languages (156 million+ vs ~10 million speakers). Your argument seems to be based on the minuscule amount of data available as of today in Wikidata and a lone contributor, and not on the bigger picture / point of view of the other projects and contributors. What if, for example, Wikidata suddenly gains 80 thousand+ Indonesian lexemes?
(This proposal is entirely independent of what Wiktionaries do; they can decide to adopt lexicographical data once T212843 is resolved, or they may decide not to. Despite some inconsistencies in the logic here that could be better pointed out, it is not germane to the discussion of the proposal for me to do so, and so for now I do not want to reply to this in depth.)
Internationality of the Indonesian language: aside from the number of speakers, Standard Indonesian (and not Standard Malaysian) has been touted as an international language of the ASEAN region and beyond (another Wikipedian argued that it already is by some definitions), and more Indonesian language courses are offered in non-Indonesian universities than Standard Malay courses in non-Malaysian universities. Any notion that tries to equate that success with the success of the "Malay language" instead of "Indonesian" would be wrong.
(These hinge on "demographic and political" arguments rather than linguistic ones, which again I do not want to reply to in depth for now.)
Difference and divergence in lexicon and conversation: on the topic of that 1000 word list (besides the fact that the page doesn't have any explanation of how to read the data, and other concerns), languages are not that easy to analyze using just a small sample of 1000 words/word pairs; it's essentially a long tail. Both our languages have diverged greatly since the beginning, and even in recent years, due to different systems for borrowing foreign Latin terms (Malaysia was influenced by/based on English, while Indonesia was influenced by/based on Dutch) and for borrowing and transliterating non-Latin terms (Arabic, Chinese, etc.). Indonesian also draws loanwords heavily, thousands of them, from local languages (700+ languages) and from Dutch, while Malay has very few of those loanwords in its dictionaries. Many words also have vastly different meanings in Standard Malay and Standard Indonesian, such as "banci" and "butuh". From technology-related (IT) terminology, health and culinary jargon, and slang to other vernaculars (formal as well as casual), the people of both countries don't understand each other well enough, so that many times we have to resort to conversing in another (third) language, such as English, to avoid miscommunications, embarrassments, and other faux pas. Television programmes and movies from both countries need subtitles for audiences in the other country. So languages that are mutually intelligible on paper don't translate into day-to-day interactions between the people. Thus, the two languages are not as similar as foreign people like to think.
The page to which I linked (sorry for not providing a more thorough explanation of that page initially) is a special variant of a section of this page showing how much lexemes for different languages cover the types and quantity of words on different Wikipedias, which I requested of User:Nikki when I saw how much more mswiki was covered by ms lexemes compared to idwiki with id lexemes. At the top you can see that at the time the page was generated, there were 2294 lexeme forms in Malay, 100,137 different word forms on idwiki, and 40 million distinct instances of words on idwiki; yet 48.1% of the instances of individual words on idwiki were handled perfectly by those 2294 lexeme forms. The "1000 word list" is a list of word forms on idwiki (with the number of times they occur on that wiki) that do not occur as forms on any Malay lexemes yet. (For comparison's sake, here's a similar "1000 word list" based on mswiki, which also has similar coverage according to the main coverage page.) Differences in vocabulary are directly representable on lexemes themselves (we have properties like variety of lexeme, form or sense (P7481) and location of sense usage (P6084) to mark these explicitly on forms and senses), so the sorts of differences in borrowings/meanings that you mention can all be reflected in Malay lexemes. The current set of Mandarin Chinese lexemes is one body of lexemes despite considerably large vocabulary differences of all the sorts you describe between the standards in Mainland China and in Taiwan (to provide an example). Thus while the Malaysian/Indonesian standards may be considerably different in terms of vocabulary, that by itself doesn't mean that merging lexemes is impossible.
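To make the coverage figures in the comment above concrete, here is a minimal sketch of how such numbers could be computed. It is not the code behind the actual coverage pages; the function names and the toy data are hypothetical. It only illustrates why a small set of lexeme forms can cover a large share of word instances (tokens) while covering few distinct word forms (types), and how a "missing forms" list ranked by frequency might be produced.

```python
from collections import Counter


def coverage(word_counts, lexeme_forms):
    """Return (type_coverage, token_coverage) as fractions in [0, 1].

    word_counts  : mapping of word form -> number of occurrences on a wiki
    lexeme_forms : set of forms that already exist on lexemes
    """
    total_types = len(word_counts)
    total_tokens = sum(word_counts.values())
    covered = {w: n for w, n in word_counts.items() if w in lexeme_forms}
    type_cov = len(covered) / total_types if total_types else 0.0
    token_cov = sum(covered.values()) / total_tokens if total_tokens else 0.0
    return type_cov, token_cov


def missing_forms(word_counts, lexeme_forms, limit=1000):
    """Most frequent word forms that do not yet occur on any lexeme."""
    missing = Counter(
        {w: n for w, n in word_counts.items() if w not in lexeme_forms}
    )
    return missing.most_common(limit)


# Hypothetical toy data, NOT real idwiki or Wikidata figures.
word_counts = Counter(
    {"yang": 50, "dan": 30, "di": 15, "sepeda": 3, "stasiun": 2}
)
lexeme_forms = {"yang", "dan", "bahasa"}

type_cov, token_cov = coverage(word_counts, lexeme_forms)
top_missing = missing_forms(word_counts, lexeme_forms, limit=2)
```

In this toy example two of five distinct forms are covered, but they account for 80 of 100 word instances, which is the same kind of gap as between the number of lexeme forms and the share of instances reported above.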
Addressing different usages, different meanings, different spellings, different pronunciations, cognates/false cognates, and a plethora of other problems, as well as meanings that technically exist in dictionaries but are archaic, rare (hundreds or thousands of them), and/or not frequently used in texts and conversations, and marking them correctly as such, demands far, far more work (if your idea of merging them together is implemented) than keeping separate the around 1000 out of 80 thousand+ words that have the same meaning. As far as analysis goes, please refer to real sources of authority in linguistics, not just Wikipedian/Wikidata user research. Such a proposal has very little gain to offer and much more undue work for the volunteers.
(The whole lexicographical data project is a lot of work, no doubt about it, and my proposal would entail such work being applied to fewer items in the long run; while merging or not merging does not change the amount of work this takes, I have noted the advantages of such a merge elsewhere in my replies to you.) In addition to different usages/meanings/spellings being representable on lexemes, different pronunciations can also be represented: blue (L3269), for example, has different pronunciations given on L3269-F1 (the base form of that adjective) from different standards in the world. I am skeptical of the claim that only "around 1000 out of 80 thousand+ words" are similar between the Malaysian and Indonesian standards; this warrants a "real source[] of authority in linguistics, not just Wikipedian/Wikidata user research". I am also skeptical of the idea that there is little gain to be had; once a single body of lexemes is developed, texts on Abstract Wikipedia could be generated in both the Malaysian and Indonesian standards of Malay with less (and not considerably duplicated) effort than if there were two separate bodies of lexemes to be considered.
Going back to macro-languages and lexemes, I believe our case with "ms"/"zsm" and "id" is not unique in the world. There are many other macro-languages with many similar lexemes among their member languages: Arabic, Hmong, Zhuang, Chinese, and so on. Many member languages also have language codes tied to a country, as with Arabic. First figure out how Wikidata lexemes should deal with macro-languages in general, as well as with diverging languages that came from a single source (Latin/Romance languages come to mind, for which the answer is clear: you don't conflate them together under "lat"), and then we can continue this discussion. Also, I believe we should find the answer on how to deal with this matter in technology: how technology can be used to link these lexemes together (Wiktionary, for example, works perfectly fine in this regard), rather than resorting to the easiest way out. Until then, keep "ms" and "id" separate.
The four cases you mentioned all actually do have far more significant and widespread intelligibility issues that go beyond just vocabulary differences into the morphological and syntactic realms (at least compared to the Malaysian and Indonesian standards of Malay, which largely appear to be confined to vocabulary differences). In fact, "dialects" of Arabic are being planned to be treated as separate languages for the purposes of lexeme development, and Maltese lexemes are taking a life of their own at the moment. Similarly, "dialects" of Chinese are already being treated separately (there are separate bodies of Cantonese and Southern Min lexemes, for example), and at least one variety of Zhuang has its own lexemes separate from the others. I thus would argue that there is an approach to the situation with macrolanguages, which is to consider the intelligibility between their variants and then make splits if there are significant enough differences (such as morphological or syntactic) to warrant doing so. Romance languages also have significant intelligibility issues going beyond just vocabulary differences into morphological and syntactic ones, so they already are handled in the appropriate way here (i.e. separate bodies of lexemes for them). "Deal[ing] with this matter in technology" is a great way to describe how Wikidata's lexeme namespace addresses the accessibility issues of wikitext-based storage of lexical information (similarly to how Wikidata itself addresses the accessibility issues of wikitext-based storage of information in general); calling a merge "the easiest way out" seems to contradict the point you made above about "far, far more work" in maintaining such a merge.
Please respect the wishes of the two nations that went separate way with the language, and do not enforce a foreign point of view toward this issue. Last but not least, do not reduce the national language of the fourth largest country in the world to being subservient to that of the 43rd; the logic behind it is mind-boggling. Bennylin (talk) 16:35, 21 April 2021 (UTC)
(See the set of comments I made regarding and using the terms "Malay", "Malaysian", and "Indonesian" in my replies above. No one is actively developing Indonesian lexemes, so the "separate way" is not being reflected in our lexicographical data, and by virtue of my discussions with Tofeiku and commenting here, rather than performing the merge directly, I have not made "a foreign point" come into existence yet. I have no idea what your "[l]ast but not least" point is supposed to be addressing.) Mahir256 (talk) 19:17, 21 April 2021 (UTC)
> using "ms" as a language code for Wikimedia projects was a mistake, or is a mistake in
> retrospect (as it was reclassified as macrolanguage 5 years after "ms" projects was born),
> and it should be changed to what it really is, "zsm" (Standard Malay) or Malaysian language
Agree with the above analysis.
Disagree with making -id- a subclass of -ms-. It is a pity that -id- and -ms- became two separate languages for stupid ideological and colonialistic reasons, but putting Indonesian words such as "sepeda" or "stasiun kereta api" under -ms- is ultimately wrong. Taylor 49 (talk) 07:05, 22 April 2021 (UTC)