Wikidata:WikiProject University degrees/Reports/Turkey

Turkey (Q43), Summer term 2018

Preparation

We started by exploring the Anabin database to find out what data we could extract and use for the Wikidata project. We realized that even though universities and courses are listed for Turkey, Anabin doesn't contain any information about the relations between them. We decided to add to Wikidata all universities that had not been entered previously, but to include the courses and degrees of only three universities from the database.

Early on we had long discussions on how to model a degree, especially since a lot of additional information is involved, such as the university, the study length and so on. To avoid creating a ton of overly specific items such as "Master of Science in Media Informatics at the HTW Berlin", we decided that a relation between university and degree would suit us best. That way we could create a relation between "Master of Science" and "HTW Berlin" and add qualifiers such as the academic discipline ("Media Informatics").
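In Wikidata terms this amounts to one statement per university and degree, with the subject attached as a qualifier. As a rough sketch (academic degree (P512) and academic major (P812) are plausible properties for this; the Q-IDs are placeholders, not the exact statements we pushed):

HTW Berlin (Qxxx)
    academic degree (P512): Master of Science (Qyyy)
        qualifier academic major (P812): media informatics (Qzzz)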

Getting the data

Anabin

To scrape the data from Anabin we wrote a JavaScript snippet utilizing jQuery (which Anabin already includes and uses on its website). Executed in the browser console, it scrapes the names of all universities listed for Turkey:

copy($("#institutionstabelle").find("tr").toArray().map(tr => Array.prototype.slice.call(tr.cells)).map(cells => {return cells.map(cell=>cell.innerHTML.replace(/\r?\n|\r/))}).map(uni=>uni.join("|")).join("\n"))

We could then paste the output into a CSV file, since the script copies one pipe-separated line per university to the clipboard.

Ege University

Initially we were looking for a university website that contained all relevant information and could be scraped automatically with relative ease. Since the website of Ege University (Figure 1) did not list all the necessary information in one place, we decided to write a small Node.js program. The script

  1. Collects all URLs of Bachelor and Master programs
  2. Collects name and duration of each program
  3. Exports the data to comma-separated values

The screenshot below shows links to different Bachelor programs the script had to follow and scrape individually.

Figure 1: https://ebys.ege.edu.tr/ogrenci/ebp/organizasyon.aspx?kultur=en-US&Mod=1&ustbirim=13&birim=1&altbirim=-1&program=2752&organizasyonId=157&mufredatTurId=932001

Looking at the website of a program, we were not able to scrape the duration of the degree right away, since the semesters were listed in a table (Figure 2). Instead we counted the rows of the table (Figure 3); a simplified sketch of this step follows below. The full scraper is published on GitHub at https://github.com/Alppasa/turkishUniScaper.git.
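The following is a minimal sketch of the row-counting step, assuming the axios and cheerio packages; the selector "#semesterTable" is a placeholder, and the actual scraper in the repository differs in its details:

// Sketch only: fetch a programme page and estimate the duration of the
// degree by counting the rows of the semester table.
const axios = require("axios");
const cheerio = require("cheerio");

async function semesterCount(programUrl) {
    const { data: html } = await axios.get(programUrl); // download the page
    const $ = cheerio.load(html);                       // parse it jQuery-style
    return $("#semesterTable tr").length - 1;           // subtract the header row
}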


Figure 2: Table of Semesters

Sabancı University and Istanbul Aydin University

During our research we found an aggregator website which provides information on the degrees and courses of several Turkish universities. Despite its name, the site (https://www.masterstudies.com/) lists not only master's degrees. Because the website's layout and content were not completely identical for each university, we looked for two universities featuring the same site layout, which boiled down to Sabancı University (https://www.masterstudies.com/universities/Turkey/Sabanci-University/) and Istanbul Aydin University (https://www.bachelorstudies.com/universities/Turkey/Istanbul-Aydin-University/). There we had to click on "Programs", then on each degree, and execute the following script; afterwards we pasted the data into a CSV file:

copy($("#listings > div.content > div > div > div > div.col-sm-10.school-info").toArray().map(s => $(s).find("header > h4 > a > span")[0].innerText.split(" in ").concat($(s).find(".labels-container .label:nth-child(3)")[0].innerText)).map(line => line.join(";")).reduce((t1, t2) => t1 + "\n" + t2))

Preparing the data

We added the Wikidata identifiers for the universities and cities to our table manually. Later we realized that we could have done this automatically using the free tool OpenRefine. In both steps we noticed around a dozen duplicate university entries in Anabin, which we removed from our data manually; in these cases the Wikidata labels came in very handy for verifying that the items were in fact duplicates. OpenRefine made a lot of tedious work more manageable by allowing us to reconcile Wikidata items in bulk. In many cases, though, we still ended up having to manually research the correct institution, academic discipline and degree.
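As a small illustration of the reconciliation step: once a column has been matched against Wikidata, the matched Q-IDs can be extracted in bulk via "Add column based on this column" with a GREL expression along these lines (a sketch, not our exact recipe):

if(cell.recon.matched, cell.recon.match.id, "")

This returns the Q-ID of the accepted match for each row and an empty string where no match was accepted.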

Figure 3: OpenRefine

OpenRefine wouldn't let us commit right away because no labels were given for the newly created academic disciplines. We ended up adding English labels identical to the names. We also decided not to include the study length: even though it would have been fairly easy to clean the data to match a format and unit such as "<X> semesters" or "<X> months", some academic disciplines were given a timespan, such as 3-4 semesters. We did not know what to make of this information; it could be the minimum and maximum duration, or how long the degree usually takes students to complete. Either way the duration was ambiguous and difficult to model, so we decided against it. In addition, we came to the conclusion that the duration property (which already exists in Wikidata) would vary from student to student; an estimated duration or a similar property would have been more precise.
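For what it's worth, missing labels can also be added in bulk outside of OpenRefine; in QuickStatements syntax a label command looks like the following (shown on the Wikidata sandbox item with a placeholder label, not one of our actual edits):

Q4115189|Len|"example English label"

Here Len sets the English label of the item; Ltr would set a Turkish one.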

Committing the data

During the upload of the data through OpenRefine we encountered a technical issue where the loading bar would always get stuck at 40%. After some trial and error and research we discovered the QuickStatements tool (https://tools.wmflabs.org/quickstatements/#/), which requested authentication and authorization through Wikidata. After several failed attempts we managed to get the tool to do its job. Because we had a hunch that the missing authorization had also caused OpenRefine to fail, we retried the upload, which then worked without a problem. We ended up having two different data sets and OpenRefine projects, one from the Node.js scraper and the other from the jQuery scrapers, simply because two different people had collected and edited the data on different machines. That way we ended up pushing two bulk commits.

Problems with the data

In many regards using OpenRefine to bulk edit our data worked really well, but unfortunately we still left a number of mistakes in our data which ended up being pushed to Wikidata. For example, we gave some academic majors a relation to "Baccalaureate" instead of "bachelor's degree", so we had to overwrite those relations by hand later. Probably the biggest mistake we made was to reconcile all the courses from our tables as academic discipline instead of academic major. Adding labels to each row also had the unfortunate side effect that, even though labels were correctly added to new items, the labels of existing items were overwritten as well. As a consequence some of our changes were quickly undone by other users. In addition we accidentally added references for the instance of statement to already existing academic disciplines like mathematics, which we probably should have avoided. Furthermore we made the mistake of adding the two reference properties (reference URL and retrieved) to the statements separately, producing two references instead of one.

For one commit (just 28 rows of data in OpenRefine) we decided to fix all these issues manually. Having amended the 28 rows of that commit still left us with around 140 rows to fix in the other commit, which would have taken forever to do by hand. So we decided to undo the whole commit and upload the newly corrected data again. Unfortunately we had to ask an admin to undo the bulk commit, and at the time of submitting this report we haven't received an answer to our request (see Figure 4). Thus one of the edits still needs to be fixed, which we plan on doing on Monday.
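To illustrate the reference mistake: in QuickStatements, reference properties only end up in the same reference block when they are appended to the same statement line; submitted separately they create two reference blocks. A sketch using the Wikidata sandbox item and placeholder values (not our actual data):

Q4115189|P31|Q3918|S854|"https://www.masterstudies.com/..."|S813|+2018-07-01T00:00:00Z/11

Here P31 is instance of, Q3918 is university, and S854 (reference URL) together with S813 (retrieved) forms a single reference on the statement.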

Figure 4: Undo not permitted

Community Feedback

We ended up getting a bit of feedback on our bulk edit, unfortunately none of it positive. Some people were concerned about specific properties we had overwritten, so they simply undid a handful of commits. Others started a discussion on our page (see Figure 5). They suggested using Turkish labels for all the academic majors we added, which makes sense, since the English translations provided by the universities were sometimes really strange.

Figure 5: People were not happy

Conclusion

In the end we realized that we probably should have uploaded just one or two rows initially instead of the whole table, which would have saved us much of the time we spent repairing the mess we made. Nonetheless we got a real insight into the workings of Wikidata and learned the hard way that the Semantic Web is not a place to simply dump a ton of data into, but a community effort that takes care and communication to contribute to.