Wikidata:Pywikibot - Python 3 Tutorial/Gathering data from Arabic-Wikipedia

Gathering data from a right-to-left written Wikipedia is just as easy as from the more common left-to-right ones. To demonstrate that with an example, I will walk you through code that iterates over all pages using the Place in Yemen infobox (Template:Infobox Place in Yemen (Q21403078)). Because it is only used for places in Yemen, we can more or less safely assume that each corresponding item should have the statement country (P17) => Yemen (Q805).

First of all we need to call the site-object of Arabic Wikipedia and create a page generator for all pages that use the infobox:

import pywikibot
from pywikibot import pagegenerators as pg

def list_template_usage(site_obj, tmpl_name):
    """Return a generator of all main-namespace pages transcluding the template."""
    name = "{}:{}".format(site_obj.namespace(10), tmpl_name)
    tmpl_page = pywikibot.Page(site_obj, name)
    ref_gen = pg.ReferringPageGenerator(tmpl_page, onlyTemplateInclusion=True)
    filter_gen = pg.NamespaceFilterPageGenerator(ref_gen, namespaces=[0])
    generator = site_obj.preloadpages(filter_gen, pageprops=True)
    return generator

template_name = "منطقة يمنية"
country = "Q805"
site = pywikibot.Site("ar", "wikipedia")
tmpl_gen = list_template_usage(site, template_name)

Note that we don't need the "قالب:" prefix in the template name, because we can call site.namespace(10) to look up the local name of the template namespace. The NamespaceFilterPageGenerator() filters the returned pages to include only main-namespace pages (no User talk pages etc.).
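If you are curious what that namespace lookup actually returns, here is a quick check (the printed values are what I would expect for Arabic Wikipedia, shown as comments):

import pywikibot

site = pywikibot.Site("ar", "wikipedia")
print(site.namespace(10))  # local name of the Template namespace: "قالب"
print(site.namespace(0))   # the main (article) namespace has an empty name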

Now that we have the page generator we can already start iterating over it and calling the corresponding Wikidata item:

for page in tmpl_gen:
    try:
        item = pywikibot.ItemPage.fromPage(page)
    except pywikibot.NoPage:
        continue

The try-except is important in case a Wikipedia page is not connected to a Wikidata item yet. If there is no corresponding item, the continue tells the for-loop to move on to the next page. If the item exists, we can check for the statement and, if it is not present, set it. We will source the new statement with imported from Wikimedia project (P143) => Arabic Wikipedia (Q199700), which honestly shows where the data came from.
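To see how pywikibot.ItemPage.fromPage() behaves on its own, here is a minimal sketch for a single page (the page title is only an illustrative example):

import pywikibot

site = pywikibot.Site("ar", "wikipedia")
page = pywikibot.Page(site, "صنعاء")  # Sanaa, used here purely as an example

try:
    item = pywikibot.ItemPage.fromPage(page)
    print(item.getID())  # the Q-id of the connected Wikidata item
except pywikibot.NoPage:
    print("Not connected to Wikidata")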

Now we will add a function that checks whether the statement already exists:

import pywikibot
from pywikibot import pagegenerators as pg

def list_template_usage(site_obj, tmpl_name):
    """Return a generator of all main-namespace pages transcluding the template."""
    name = "{}:{}".format(site_obj.namespace(10), tmpl_name)
    tmpl_page = pywikibot.Page(site_obj, name)
    ref_gen = pg.ReferringPageGenerator(tmpl_page, onlyTemplateInclusion=True)
    filter_gen = pg.NamespaceFilterPageGenerator(ref_gen, namespaces=[0])
    generator = site_obj.preloadpages(filter_gen, pageprops=True)
    return generator

def check_for_statement(item, country):
    """Return True if the item already has country (P17) set to the given Q-id."""
    item_dict = item.get()
    claims = item_dict["claims"]
    try:
        country_claim = claims["P17"][0]
        return country_claim.target.id == country
    except KeyError:
        return False

template_name = "منطقة يمنية"
country = "Q805"
site = pywikibot.Site("ar", "wikipedia")
tmpl_gen = list_template_usage(site, template_name)

data_site = pywikibot.Site("wikidata", "wikidata")
repo = data_site.data_repository()

for page in tmpl_gen:
    try:
        item = pywikibot.ItemPage.fromPage(page)
    except pywikibot.NoPage:
        print("Not connected to Wikidata")
        continue
    try:
        print(item.get()["labels"]["ar"])
    except KeyError:
        pass
    country_set = check_for_statement(item, country)
    print("Country set:", country_set)

Note that we also need a try-except for the Arabic label in the for-loop, to avoid a KeyError when an item has no Arabic label. The check_for_statement() function only looks at the first country (P17) statement (claims["P17"][0]) and checks whether its target is Yemen (Q805), returning True or False accordingly. If False is returned we want to add the statement to the item.
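You can also sanity-check check_for_statement() against a single known item first. A minimal sketch, reusing the function defined above and assuming Q2471 (Sanaa) still carries country (P17) => Yemen (Q805):

import pywikibot

repo = pywikibot.Site("wikidata", "wikidata").data_repository()
sanaa = pywikibot.ItemPage(repo, "Q2471")  # Sanaa; assumed to have P17 => Q805
print(check_for_statement(sanaa, "Q805"))  # expected: True

With that in place, this is the complete code of the example: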

import pywikibot
from pywikibot import pagegenerators as pg

def list_template_usage(site_obj, tmpl_name):
    """Return a generator of all main-namespace pages transcluding the template."""
    name = "{}:{}".format(site_obj.namespace(10), tmpl_name)
    tmpl_page = pywikibot.Page(site_obj, name)
    ref_gen = pg.ReferringPageGenerator(tmpl_page, onlyTemplateInclusion=True)
    filter_gen = pg.NamespaceFilterPageGenerator(ref_gen, namespaces=[0])
    generator = site_obj.preloadpages(filter_gen, pageprops=True)
    return generator

def check_for_statement(item, country):
    """Return True if the item already has country (P17) set to the given Q-id."""
    item_dict = item.get()
    claims = item_dict["claims"]
    try:
        country_claim = claims["P17"][0]
        return country_claim.target.id == country
    except KeyError:
        return False

def add_country_claim(item, country, repo):
    """Add country (P17) => country to the item and source it from Arabic Wikipedia."""
    claim = pywikibot.Claim(repo, "P17")
    target_item = pywikibot.ItemPage(repo, country)
    claim.setTarget(target_item)
    item.addClaim(claim, summary="Importing P17 from ar-wiki")

    source_target = pywikibot.ItemPage(repo, "Q199700")  # Q199700 = Arabic Wikipedia
    source_claim = pywikibot.Claim(repo, "P143", isReference=True)  # imported from Wikimedia project
    source_claim.setTarget(source_target)
    claim.addSources([source_claim])

template_name = "منطقة يمنية"
country = "Q805"
site = pywikibot.Site("ar", "wikipedia")
tmpl_gen = list_template_usage(site, template_name)

data_site = pywikibot.Site("wikidata", "wikidata")
repo = data_site.data_repository()
count = 0

for page in tmpl_gen:
    count += 1
    print("------({})------".format(count))
    try:
        item = pywikibot.ItemPage.fromPage(page)
    except pywikibot.NoPage:
        print("Not connected to Wikidata")
        continue
    try:
        print(item.get()["labels"]["ar"])
    except KeyError:
        pass
    country_set = check_for_statement(item, country)
    print("Country set:", country_set)
    if not country_set:
        add_country_claim(item, country, repo)

Running the example will not change anything on Wikidata, because this script is already running regularly (unless some vandalism happens in the meantime). But the approach can be applied to other infoboxes or other Wikipedias that still need to transfer basic information to Wikidata. Please note that sourcing statements with imported from Wikimedia project (P143) like this is deprecated and should only be done when no open data source is available.
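If an actual source is available, a reference built from stated in (P248) plus retrieved (P813) is the generally preferred pattern. A minimal sketch of such a sourcing function, with a hypothetical Q-id standing in for the real source item:

import pywikibot

data_site = pywikibot.Site("wikidata", "wikidata")
repo = data_site.data_repository()

def add_sourced_country_claim(item, country, source_qid):
    """Like add_country_claim(), but referenced with stated in (P248) and retrieved (P813)."""
    claim = pywikibot.Claim(repo, "P17")
    claim.setTarget(pywikibot.ItemPage(repo, country))
    item.addClaim(claim, summary="Adding sourced P17")

    stated_in = pywikibot.Claim(repo, "P248", isReference=True)  # stated in
    stated_in.setTarget(pywikibot.ItemPage(repo, source_qid))  # e.g. "Q123456", hypothetical
    retrieved = pywikibot.Claim(repo, "P813", isReference=True)  # retrieved
    retrieved.setTarget(pywikibot.WbTime(year=2020, month=1, day=1))
    claim.addSources([stated_in, retrieved])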