User:Envlh/Denelezh/Schema
This page is a draft of the physical data model for data persistence in Denelezh, a mixed version between the current state of the model and what could be done in the future.
Process
[edit]Quick overview of the data processing:
- From Wikidata JSON dump:
- Download last available Wikidata JSON dump.
- Parse dump with Wikidata Toolkit and generate CSV files.
- Load CSV files into MySQL database.
- From Wikidata Query Service (SPARQL):
- Fetch list of Wikimedia projects.
- Generate reports.
Tables
[edit]Technical tables
[edit]dump
[edit]List dumps loaded into Denelezh.
Field | Description | Example | Type |
---|---|---|---|
date | Date of the dump. | 2017-01-02 | DATE |
step | Step of the loading. Values: 0 = download in progress; 1 = parsing in progress; 2 = reports generation in progress; 3 = done. |
0 | TINYINT |
Raw data
[edit]These tables are partitioned by dump (one partition per dump). They all have the following column as part of their primary key:
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
human
[edit]Base table for humans.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
id | Wikidata Qid of the human (without initial Q). | 2689088 | BIGINT UNSIGNED |
gender | Wikidata Qid of the gender (without initial Q), if known. | 6581072 | BIGINT UNSIGNED |
year_of_birth | Year of birth, if known. | 1950 | SMALLINT |
year_of_death | Year of death, if known. | 2015 | SMALLINT |
human_country
[edit]List countries of citizenship for each human.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
id | Wikidata Qid of the human (without initial Q). | 2689088 | BIGINT UNSIGNED |
country | Wikidata Qid of the country (without initial Q). | 142 | BIGINT UNSIGNED |
human_occupation
[edit]List occupations for each human.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
id | Wikidata Qid of the human (without initial Q). | 2689088 | BIGINT UNSIGNED |
occupation | Wikidata Qid of the occupation (without initial Q). | 82955 | BIGINT UNSIGNED |
human_property
[edit]A line of this table shows that a property is used at least one time at normal or preferred rank in an item representing a human in Wikidata.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
id | Wikidata Qid of the human (without initial Q). | 2689088 | BIGINT UNSIGNED |
property | Wikidata Pid of the property (without initial P). | 39 | BIGINT UNSIGNED |
human_sitelink
[edit]Existing sitelinks for a human.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
id | Wikidata Qid of the human (without initial Q). | 2689088 | BIGINT UNSIGNED |
sitelink | Code of the project. | frwiki | VARCHAR(32) |
occupation_parent
[edit]Stores the hierarchy of occupations. For instance, Q182436 (librarian) is a sublcass (P279) of information professional (Q1662485).
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
occupation | Wikidata Qid (without initial Q). | 182436 | BIGINT UNSIGNED |
parent | Wikidata Qid (without initial Q). | 1662485 | BIGINT UNSIGNED |
label
[edit]This table stores all labels.
This table needs refining to handle internationalization.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
qid | Wikidata Qid (without initial Q). | 182436 | BIGINT UNSIGNED |
label | Label. | Librarian | VARCHAR(500) |
Reports
[edit]kpi
[edit]Statistics generated from the previous tables. Each row represents a subset of humans in Wikidata, grouped by gender, year of birth, country of citizenship, occupation, and Wikimedia project. A value of 0 on these columns means any value (for instance, a value of 0 in the column country means any country of citizenship, even none known).
This table needs refining to handle several reports (number of humans in Wikidata, number of humans in Wikidata with at least one sitelink, number of sitelinks for humans in Wikidata, ...).
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
gender | Values: 0 = any; 1 = female; 2 = male; 3 = other. | 1 | TINYINT UNSIGNED |
year_of_birth | Id from the table year, 0 for any. | 20 | SMALLINT UNSIGNED |
country | Id from the table country, 0 for any. | 30 | SMALLINT UNSIGNED |
occupation | Id from the table occupation, 0 for any. | 40 | SMALLINT UNSIGNED |
project | Id from the table project, 0 for any. | 50 | SMALLINT UNSIGNED |
total | Number of humans in this set. | 6478 | INT UNSIGNED |
total_with_gender | Number of humans with a known gender in this set. | 6156 | INT UNSIGNED |
total_with_year_of_birth | Number of humans with a known year of birth in this set. | 5246 | INT UNSIGNED |
total_with_year_of_death | Number of humans with a known year of death in this set. | 3425 | INT UNSIGNED |
total_with_place_of_birth | Number of humans with a known place of birth in this set. | 4123 | INT UNSIGNED |
total_with_place_of_death | Number of humans with a known place of death in this set. | 1245 | INT UNSIGNED |
total_with_firstname | Number of humans with a known first name in this set. | 4236 | INT UNSIGNED |
total_with_lastname | Number of humans with a known last name in this set. | 4152 | INT UNSIGNED |
total_with_country | Number of humans with a known country of citizenship in this set. | 4263 | INT UNSIGNED |
total_with_occupation | Number of humans with a known occupation in this set. | 5236 | INT UNSIGNED |
total_with_image | Number of humans with an image in this set. | 2354 | INT UNSIGNED |
discarded_country
[edit]Countries used less than 200 times are removed from reports and put in this table.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
country | Wikidata Qid (without initial Q). | 142 | BIGINT UNSIGNED |
count | Number of humans with this country of citizenship. | 17 | SMALLINT UNSIGNED |
discarded_occupation
[edit]Occupations used less than 1,000 times are removed from reports and put in this table.
Field | Description | Example | Type |
---|---|---|---|
dump | Date of the dump. | 2017-01-02 | DATE |
occupation | Wikidata Qid (without initial Q). | 41546637 | BIGINT UNSIGNED |
count | Number of humans with this occupation. | 17 | SMALLINT UNSIGNED |
Reports reference tables
[edit]These tables are shared across all reports.
year
[edit]Years that are used in at least one report.
Field | Description | Example | Type |
---|---|---|---|
id | Technical id. | 7 | SMALLINT UNSIGNED |
year | Year. | 1578 | INT |
country
[edit]Countries that are used in at least one report (when not discarded).
Field | Description | Example | Type |
---|---|---|---|
id | Technical id. | 1 | SMALLINT UNSIGNED |
qid | Wikidata Qid (without initial Q). | 142 | BIGINT UNSIGNED |
occupation
[edit]Occupations that are used in at least one report (when not discarded).
Field | Description | Example | Type |
---|---|---|---|
id | Technical id. | 1 | SMALLINT UNSIGNED |
qid | Wikidata Qid (without initial Q). | 41546637 | BIGINT UNSIGNED |
project
[edit]Projects that are used in at least one report.
Field | Description | Example | Type |
---|---|---|---|
id | Technical id. | 1 | SMALLINT UNSIGNED |
qid | Wikidata Qid (without initial Q). | 41546637 | BIGINT UNSIGNED |
code | Project code. | frwiki | VARCHAR(32) |
type | Type of project. Values: commons, incubator, mediawiki, meta, wikispecies, wikidata, wikimania, wikibooks, wikinews, wikipedia, wikiquote, wikisource, wikiversity, wikivoyage, wiktionary. |
wikipedia | VARCHAR(32) |
url | URL of the project. | https://fr.wikipedia.org/ | VARCHAR(128) |
Glossary
[edit]- WDQS (Wikidata Query Service): Wikidata SPARQL endpoint.
- Wikidata Toolkit: library used to parse Wikidata JSON dumps.