Wikidata:Pywikibot - Python 3 Tutorial/Winter Storage

This chapter shows you how to save data to files in different file formats and to a database.

Now that we have seen how easily we can grab lots of data from Wikidata, we probably want to save it. This is important if you want to work with the data offline, run local scripts over it, use it in a spreadsheet, or plot it in a GIS. If you do not need any of this, you can skip this part of the tutorial.

We will look at three possible ways to save the data. In order of increasing difficulty, we will write the data to a CSV file (comma-separated values), a JSON file and an SQLite database. You can also skip ahead to the one you want to use.

What we are doing in this chapter can be referred to as serialization. We take a Python data structure (dictionary, list, string) and serialize it to text that can be written to a file or database. If your query gathers any Python objects (examples from previous chapters: ItemPage object, ...), these will not serialize readily and you have to use a slightly more complex approach.
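
One possible form of that more complex approach is sketched below, using the default hook of json.dumps. The FakeItem class and the to_serializable function are hypothetical names made up for this illustration; FakeItem merely stands in for an object such as an ItemPage and is not part of Pywikibot, so the sketch runs without network access.

```python
import json


class FakeItem:
    """Hypothetical stand-in for a non-serializable object (not a Pywikibot class)."""

    def __init__(self, item_id):
        self.item_id = item_id


def to_serializable(obj):
    """Called by json for every object it cannot serialize on its own."""
    if isinstance(obj, FakeItem):
        return obj.item_id  # keep only the identifier string
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


data = {"name": "Douglas Adams", "item": FakeItem("Q42")}
text = json.dumps(data, default=to_serializable, sort_keys=True)
print(text)
```

The idea is to reduce each object to plain strings, numbers, lists or dictionaries before writing; which attributes are worth keeping depends on your own data.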

Writing a simple dataset to a file

We will work with the first example of chapter 1 to show the basics of writing to a file. The second example will look at a more complicated dataset.

Write to a CSV-File

The csv module is part of the Python standard library, so import csv is all that is required to use it. Working with the example from chapter 1, see if you can understand how the data is written to a file named output_douglas.csv.

from collections import OrderedDict
import pywikibot
import csv

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Douglas Adams")
item = pywikibot.ItemPage.fromPage(page)

item_dict = item.get()
lbl_dict = item_dict["labels"]
lbl_sdict = OrderedDict(sorted(lbl_dict.items()))  # Sort the labels by language code

with open("output_douglas.csv", "w", newline="", encoding='utf-16') as csvf:
    fields = ["lang-code", "label"]
    writer = csv.DictWriter(csvf, fieldnames=fields)
    writer.writeheader()
    for key in lbl_sdict:
        writer.writerow({"lang-code": key, "label": lbl_sdict[key]})

The most important part of this new code is the sorting of the dictionary. This is needed because dictionaries are not sorted according to their keys. The last block is the standard way of opening (and closing!) a file: the with statement closes the file automatically when the block ends. For details see the CSV documentation.
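
The sorting step can be tried on its own with a small hand-written dictionary. This is only a sketch: the three labels below are sample data typed in by hand, not the result of a query.

```python
from collections import OrderedDict

# Sample labels, deliberately not in alphabetical order
labels = {"de": "Douglas Adams", "ar": "دوغلاس آدمز", "be": "Дуглас Адамс"}

# sorted() orders the (key, value) pairs by key; OrderedDict keeps that order
sorted_labels = OrderedDict(sorted(labels.items()))

print(list(sorted_labels))
```

Since Python 3.7 a plain dict also remembers insertion order, so dict(sorted(labels.items())) would give the same ordering.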

If you run the code above you will find the output file in the same directory as the Python script. Opening it with a text editor should show something similar to this:

lang-code,label
af,Douglas Adams
ak,Doglas Adams
als,Douglas Adams
an,Douglas Adams
ar,دوغلاس آدمز
arz,دوجلاس ادامز
ast,Douglas Adams
az,Duqlas Noel Adams
bar,Douglas Adams
be,Дуглас Адамс
be-tarask,Дуглас Адамз
...
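
To check that such a file can be loaded again, a minimal sketch with csv.DictReader is shown here. It writes two sample rows to a temporary file (the path and the rows are made up for this illustration) and reads them back with the same utf-16 encoding used above.

```python
import csv
import os
import tempfile

# Two sample rows in the same shape as the rows written above
rows = [
    {"lang-code": "af", "label": "Douglas Adams"},
    {"lang-code": "be", "label": "Дуглас Адамс"},
]

# Temporary location so the sketch does not touch your working directory
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

with open(path, "w", newline="", encoding="utf-16") as csvf:
    writer = csv.DictWriter(csvf, fieldnames=["lang-code", "label"])
    writer.writeheader()
    writer.writerows(rows)

# The reader must use the same encoding as the writer
with open(path, newline="", encoding="utf-16") as csvf:
    read_back = list(csv.DictReader(csvf))

print(read_back)
```

Each row comes back as a dictionary keyed by the header names, so the data can be processed just like the original labels.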

Write to a JSON-File

JSON is another file format that is popular for transferring data between programs, and the json module for handling it is also part of the Python standard library. Once you understand the concept, it is just as easy to handle as a CSV file. The code for our simple example should again be self-explanatory. Try to read it first and think about what the lines do. Then run the program and look at the output:

import pywikibot
import json

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Douglas Adams")
item = pywikibot.ItemPage.fromPage(page)

item_dict = item.get()
lbl_dict = item_dict["labels"]

with open("output_douglas.json", "w", newline="", encoding='utf-16') as jsonf:
    # dict() turns the label mapping into a plain dictionary that json can serialize
    json.dump(dict(lbl_dict), jsonf, ensure_ascii=False, sort_keys=True)

As you can see, the sorting of the output is handled by the keyword argument sort_keys=True. We also pass the keyword argument ensure_ascii=False. This ensures that non-ASCII characters are written as characters instead of escape sequences such as \u1234. Writing the JSON file is even easier than writing the CSV file, because it just requires writing one long string. The script will write to a file output_douglas.json, which should look something like this:

{"af": "Douglas Adams", "ak": "Doglas Adams", "als": "Douglas Adams", "an": "Douglas Adams", "ar": "دوغلاس آدمز", "arz": "دوجلاس ادامز", "ast": "Douglas Adams", "az": "Duqlas Noel Adams", "bar": "Douglas Adams", "be": "Дуглас Адамс", "be-tarask": "Дуглас Адамз", "be-x-old": "Дуглас Адамс", "bg": "Дъглас Адамс", "bn": "ডগলাস", "br": "Douglas Adams", "bs": "Douglas Adams", "ca": "Douglas Adams", "ckb": "دەگلاس ئادمز", ... }
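
The effect of ensure_ascii can be seen in a small sketch that serializes the same one-entry sample dictionary twice, once with the default setting and once with ensure_ascii=False.

```python
import json

labels = {"be": "Дуглас Адамс"}  # one sample label

escaped = json.dumps(labels)                       # default: ensure_ascii=True
readable = json.dumps(labels, ensure_ascii=False)  # keeps the Cyrillic letters

print(escaped)
print(readable)

# Both strings decode to the same Python dictionary
print(json.loads(escaped) == json.loads(readable))
```

Both forms are valid JSON; the choice only affects how readable the file is in a text editor.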

If you want the document to look more structured, you can also pass the keyword argument indent=4. The great thing about JSON files is that they are parsed back into Python correctly regardless of the indentation you choose. This is what the output looks like with indentation:

{
    "af": "Douglas Adams",
    "ak": "Doglas Adams",
    "als": "Douglas Adams",
    "an": "Douglas Adams",
    "ar": "دوغلاس آدمز",
    "arz": "دوجلاس ادامز",
    ...
}
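
That the indentation does not change the parsed result can be checked with a short sketch. The two labels are sample data, and json.dumps is used instead of a file so the example stays self-contained.

```python
import json

labels = {"af": "Douglas Adams", "ar": "دوغلاس آدمز"}  # sample data

compact = json.dumps(labels, ensure_ascii=False, sort_keys=True)
indented = json.dumps(labels, ensure_ascii=False, sort_keys=True, indent=4)

print(indented)

# Parsing either string gives back the original dictionary
print(json.loads(compact) == json.loads(indented) == labels)
```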

Write to a SQLite-Database

The module sqlite3 is part of Python's standard library. It enables you to store data in a database that is fast and simple to handle. Read over the examples in the official documentation and then see if you can connect your Pywikibot knowledge with the writing of the SQLite database. The code could look something like this:

import pywikibot
import sqlite3

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Douglas Adams")
item = pywikibot.ItemPage.fromPage(page)

item_dict = item.get()
lbl_dict = item_dict["labels"]

conn = sqlite3.connect('output-douglas.db')
c = conn.cursor()
c.execute('''CREATE TABLE labels
             (langcode text, label text)''')

for key in lbl_dict:
    row = [key, lbl_dict[key]]
    c.execute("INSERT INTO labels VALUES (?, ?)", row)

conn.commit()
conn.close()

When you view the contents of the database, you should see the table 'labels' with one row for each language code and its label.

If you want to take a look at the data, we recommend using 'DB Browser for SQLite' (http://sqlitebrowser.org/). It will allow you to look at the different tables you just created and has many practical functions.
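
The data can also be inspected from Python itself. The sketch below uses an in-memory database and two hand-written sample rows so that it runs on its own; to query the file written above, you would pass its file name to sqlite3.connect instead of ":memory:".

```python
import sqlite3

# In-memory database for illustration only
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE labels (langcode text, label text)")

# Two sample rows in the same shape as the rows inserted above
sample = [("be", "Дуглас Адамс"), ("af", "Douglas Adams")]
c.executemany("INSERT INTO labels VALUES (?, ?)", sample)
conn.commit()

# Read the rows back, sorted by language code
c.execute("SELECT langcode, label FROM labels ORDER BY langcode")
rows = c.fetchall()
conn.close()

print(rows)
```

Each row is returned as a tuple, and the ORDER BY clause does the sorting that the CSV and JSON examples handled in Python.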

More complicated example

[...]