Module:PropertyPath

From Wikidata
Jump to navigation Jump to search
Lua
CodeDiscussionLinksLink count SubpagesDocumentationTestsResultsSandboxLive code All modules


The PropertyPath module is an implemetation and extension of SPARQL (Q54871)  View with Reasonator View with SQID's PropertyPath in Lua for Wikidata and Wikibase client.


There is however a few difference with SPARQL regular property path and this one are:

  • this implementation adds a feature to deal with qualifiers in paths
  • unlike full sparql paths, we need to provide and item for the starting point of the path
  • we can use Wikidata's identifiers for property Ids, including property labels in the main language of the client wiki
  • the operator "^" cannot work with lua because of the limitations of the lua mw.wikibase API

Pahs

[edit]

Introduction example

[edit]

A path to find the grandchildren. "child/child" will be our first path. Used with a template that exploits this modules we can expand this path starting from an item, say the item about the greek God Zeus :

The templates shows as a first result (when writing this line) Anteros (Q572133)  View with Reasonator View with SQID. It's here because the Zeus item has a statement
⟨ Zeus (Q34201)  View with Reasonator View with SQID ⟩ child (P40) View with SQID ⟨ Ares (Q40901)  View with Reasonator View with SQID ⟩
and on Ares (Q40901)  View with Reasonator View with SQID there is a statement
⟨ Ares (Q40901)  View with Reasonator View with SQID ⟩ child (P40) View with SQID ⟨ Anteros (Q572133)  View with Reasonator View with SQID ⟩
. Actually Zeus has a lot of grandchildren and you should not see all of them with this template without adding a parameter to add to the limit.

Similarly, if we want the mothers of Zeus children we can use a path child/mother :


The next section will briefly describe the set of operators we can put in a path. "/", which we exemplified here, is one of them, but there is several others that can be combined with each other to build more complex paths.

Operators

[edit]

In the following, an item path will be a sequence of items and (one or two) properties. An item path may be ended with a wikidata value. For example Q1001) ; instance of (P31) ; human (Q5)  View with Reasonator View with SQID ; subclass of (P279) ; person (Q215627)  View with Reasonator View with SQID or Zeus (Q34201)  View with Reasonator View with SQID ; child (P40) View with SQID ; Ares (Q40901)  View with Reasonator View with SQID ; child (P40) View with SQID ; Anteros (Q572133)  View with Reasonator View with SQID will be paths. It's possible to create item paths from Wikidata statements starting from an item and following the statements' main property and main values, but also as we will see later by following the qualifiers snaks of some property. In this case our item paths could look like Fyodor Dostoyevsky (Q991)  View with Reasonator View with SQID ; place of birth (P19) View with SQID ; Saint Petersburg (Q656)  View with Reasonator View with SQID ; official name (P1448) View with SQID ; start time (P580) View with SQID ;

We will discover the operation of "matching" an item path with a path. First we need to define and operation beetween a property and an item path : "eating". We will say that a property can "eat" P a path if the first property of a path is P. The result will be a path shortened up to this property - takin a property and a path :

P can "eat" the path which will result to Saint Petersburg (Q656)  View with Reasonator View with SQID ; official name (P1448) View with SQID ; start time (P580) View with SQID. Another property could not eat this path. The operation of "eating" is used to construct the operation of "matching" an item path to a property path - it finds all the item paths we can construct on Wikidata following statements from a starting item that matches a part.

A path matches an item path if the components of a path can entirely eat the item path - except for one item or value.

This module performs the operation of finding the item paths in Wikidata that can be totally eaten by a path pattern. Additionally, all parts of a path must eat something in the sense defined in the following section.

For example : the path Mexico ; shares border with ; USA ; shares border with ; Canada will be a match for the pattern shares border with/shares border with but won't be a match for shares border with nor for shares border with/shares border with/shares border with. The first one because there will be parts of the path that are not eaten, and the second because the third shares border with has nothing to eat.

We will extend this "eating" operation to more complex path in the following.


A path is constructed using the following operators and properties :

simple property
A simple property is either a property Pid or a property label like P31 or instance of. A property "eats" a path as explained above. If using the label of the property like in "instance of", only the label of a property can be used and not an alias. The language of the label also matters : it must be written in the language of the main wiki (english for wikidata, french for french wikipedia, ...)
/
The sequence operator. See the initial example. This operator is one basic operation to build a complex path from simpler path. if P1 is a path and P2 is a path then P1/P2 is also a path. P1/P2 will eat an item path IP if P1 eats IP at first, then if P2 eats the result of P1 eating IP.
|
The alternative operator. This operator is one other basic operation to build a complex path from simpler path. if P1 is a path and P2 is a path then P1|P2 is also a path. P1|P2 will eat an item path IP if either P1 eats IP or P2 eats IP.
?
an operator that make the preceding path optional. It always eats something. For example the pattern share borders with? starting from Mexico will match both the Mexico path and the Mexico ; shares border with ; USA one.
*
an operator that, like preceding one, always matches. It allows to match a path of indefinite size if it can be eaten by repeating the "eat" operation as many times as we like on the path that is preceded by the operator. For example the code (father|mother)* will find all the ancestors of the starting point in a genealogical tree, including itself. It will find the parents of someone (the father and the mother) then the parents of the parents and so on.
+
P+ is equivalent to P/P*. Like the * operator except it must be at least one math of P.
>
This operator is related to qualifier snaks matching. For example the path union of>of will eat the qualifier "of" snaks of the statements "union of" of an item. The values from which the next operator of the path will start is the value of the snak.
It also can be used either at the very beginning of the path, if the starting point of the path is a statement and not an item.
!
This operator is a way to define property that will not be eaten. It is followed by either a single property or a list of properties separated by a | put into parenthesis. for example !(instance of|subclass of) will eat any (one-propertied) path like Donald Duck ; Father ; Picsou, but neither Ghandi ; instance of ; human nor human ; subclass of ; animal. It can also be used with the > operator to exclude a set of qualifiers for matching.
{number} and {min, max}
This operator is used to specify that a path can match several times, but with a bound. P{4} is equivalent to P/P/P/P, that is P must eat exactly four times something, and P{4,6} must match between 4 and six times. It's equivalent to P/P/P/P/P?/P?

Results

[edit]

This module manipulates and generate a datastructure called "ResultNode" that represent nodes on a matching path (a path that is totally eaten by a path pattern).

The path is designed such as the result is a linked list of Nodes. Those nodes represents a path (say <human> ; <P31> ; <Q5> ; <P279> ; <Q42> ) in backward order. Actually each node can be interpreted as either a snak or a statement (like the <P279> ; <Q42>) in the usual lua Wikidata data model (it's actually a snak or a statement which is adjoined a few functions and attributes. So a result node can be used as a snak or a statement in the functions of Module:Wikidata.

The previous snak or statement can be retrieved thanks trough the "parent" field of the node.

Interface

[edit]
  • Main functions
    iterate
    path (string), start_item (string, entity object) -> iterator (on result node)
    function that takes a path and a start item and returns an iterable object, usable in lua for loops that iterates over the set of paths matching the pattern, each represented by a ResultNode (see above)

Code

local parser = require "Module:PropertyPath/parser"
local datastructure = require "Module:PropertyPath/Path"
local results = require "Module:PropertyPath/ResultStructure"

local props = require "Module:Properties"
local iter = require "Module:Iterators"

local path = {}

--------------
-- TODO : 
--        * Update the "between" path to handle it better epsilon paths
--        * Test full path rendering
--        * 
--------------

-- Definition of a PropertyPath class

local PropertyPath = {}
PropertyPath.__index = PropertyPath

--[[ Datastructure for the paths that will match a path pattern
A path matching the pattern "subclass of*" will be a chain of statements and snaks nodes.  
If we got statements of the form (no qualifiers here, just subject with the main statement snak) :
* <human> <subclass of> <ape>
* <ape> <subclass of> <mammal>
* <mammal> <subclass of> <animal>

a matching path like "<human> -> <ape> -> <mammal> -> <animal>" will be reprensented by a linked list of "ResultNode" objects. 
A result node object is a mw.wikibase "statement" standard object augmented with a few methods and a link that goes from the statement or snak to the previous node in the path.
{ 
   <mammal> <subclass of> <animal>
   "parent" = {
      <ape> <subclass of> <mammal>
      "parent" = {
         <human> <subclass of> <ape>
         "parent" = EpsilonRNode(<human>, "parent" = nil)
      }
   }
}

--]]

local ResultNode = results.ResultNode
local StatementRNode = results.StatementRNode
local QualifierRNode = results.QualifierRNode
local EpsilonRNode = results.EpsilonRNode


------------------------------------------------------------------------------------------------------


local function iterate_on_snaks(
	start_rnode, 
	property_filter_criteria, 
	snak_map_iterator,
	rnode_type)
    assert(snak_map_iterator)
	return iter.pair_map(
    	iter.flatten(
    		iter.select_vals(
    			iter.pair_filter(
    				snak_map_iterator, 
    				property_filter_criteria
    			)
    		),
    		iter.on_vals
    	),
    	function(value) return rnode_type:create(value, start_rnode) end
    )
end

-- creates an iterator that will iterate over all the statements
-- of a specific property of an item

local function iterate_on_statement(start_rnode, property_filter_criteria)
	local item = mw.wikibase.getEntity(start_rnode:item_value())
	return iterate_on_snaks(
		start_rnode, 
		property_filter_criteria,
		iter.on_pairs(item.claims),
		StatementRNode
	)
end

local function iterate_on_statement_qualifier(statement, qualifier_filter_criteria)
	if statement.qualifiers then
		return iterate_on_snaks(
			statement, 
			 qualifier_filter_criteria,
			iter.on_pairs(statement.qualifiers),
			QualifierRNode
		)
	else
		-- is there a qualifier table when the statement has no qualifier snak ??
		mw.log(" maybe a statement with no qualifier, or a bug")
		return function() return nil end
	end
end

local iterate_on_statement_from_property = function(start_rnode, pid)
    claims = mw.wikibase.getBestStatements(
    	start_rnode:item_value(),
      	props.normalize(pid)
    ) or {}
	
    return iter.pair_map(
    	iter.pair_filter(iter.on_pairs(claims), function(key, val) return true end),
    	function(key, value) return StatementRNode:create(value, start_rnode) end
    )
end


local function iterate_on_star(start_rnode, child_pnode, depth, iterated, max)
	
	-- start_rnode : the result node from which we will iterate
	-- child_pnode : the path within the star operator (for example P31/P31 if our node is (P31/P31)*
	iterated = iterated or {} -- iterated is the store of already iterated starting points items to avoid infinite loops
	-- max : the max number of iteration depth to go, nil for no limit
	
	depth = depth or 1
	
	--[[
	In pseudo code using a « yield » operator, the algorithm would be
	
	algo star(startnode)
	   for each value v which match child_pnode from startnode
	      yield v
	      for each value vchild in star(v)
	         yield vchild
	      end for
	   end 
	end
	
	But we can’t use a yield operator if the « coroutine » module on lua is not activated. 
	So we must translate this into something more complicated.
	
	Luckily the approach to write iterators in term of composition seems to pay off and
	it seem possible to write code structurally similar to this algorithm thanks to the 
	« flatten » iterator and a recursive closure that creates iterator to handle the
	recursivity implied by the « star » operator nature.

	--]]
	
	function creator()
		return function(start_rnode)
			local depth_overflow = not (not max or depth < max)
			
			if start_rnode:has_an_item() and not iterated[start_rnode:item_value()] and not depth_overflow then
				iterated[start_rnode:item_value()] = true
				return iterate_on_star(start_rnode, child_pnode, depth + 1, iterated, max)
			else
				return function() end
			end
		end
	end
		
	return iter.chain(
		iter.singleton(start_rnode),
		iter.flatten(
			child_pnode:iterate(start_rnode), 
			creator()
		)
	)
end

local iterate_on_plus = function(start_rnode, child_pnode, max_depth)
	local first = true
	iterated = iterated or {}
	
	return iter.flatten(
		child_pnode:iterate(start_rnode),
		function(rnode)
			return iterate_on_star(rnode, child_pnode, 1, iterated, max_depth)
		end
	)

end

--[[
Test :
p.test("Q5", "subclass of+") -- at the time writing, "Q215627" is the only direct superclass of human. It does not show up, but there is numerous superclass in the result
--]]

--[[ an iteraton to handle "/" operator sequences, for example « P31/P279* »
 "creators" is a table of functions that needs to create iterators of child nodes.
 In our example, the first cretors element will be a function that takes an item object and
   will return an iterator over P31-statements of this item
 the second one will create an iterator over the path « P279* » and so on.
 The resulting iteratior will iterate on each elements of the second iterator starting from each iterator over the second one
 for each elements in the first one.
--]]


local function iterate_on_iterator_creators(start_rnode, creators, i)
	i = i or 1
	if not(tonumber(i)) then i = 1 end
	-- main iterator : the iterator that will iterate on the values on this node of the path
	local main_iterator = creators[i]:iterate(start_rnode)
	
	if i < #creators then
		--trying to initialize the iterator for the next node with a value of the current one, if we can
		local rnode = main_iterator()
		
		while rnode and not(rnode:has_an_item()) do
			rnode = main_iterator()
		end

		-- could not initialize the next iterator with a proper item ; returnun the empty iterator function
		if not rnode then return function() return end end
		
		-- we found a proper value to iterate on for the next node in the path
		
		-- final iterator : the iterator that will iterate 
		-- on elems that will be returned by each iterations 
		-- on the iterator created by the main client
		
		local final_iterator = iterate_on_iterator_creators(rnode, creators, i+1)
		return function()
			while final_iterator ~= nil do
				-- pulling the element from the next node iterator in the sequence
				local final_elem = final_iterator()
				if final_elem then
					return final_elem
				else
					-- we pulled the last elem for this value, getting a new value 
					-- for this node path and regenerate the next node iterator to pull new final values
					
					local rnode_value = main_iterator()
					
					-- return the element pulled from the next node iterator
					-- if the property has item datatype is not a special value and has the right snaktype
					-- as we can't continue path on other kind of values
					
					if rnode_value then
						if rnode_value:has_an_item() then
							final_iterator = iterate_on_iterator_creators(rnode_value, creators, i+1)
						end
					else
						--we're over, no next value for this node to continue the path
						return
					end
				end
			end
		end
	elseif i == #creators then
		return main_iterator
	end
end

--[[ JSBach : Q1339 ;
Testing with :
test("Q1339", "child/child")

wikidata query equivalent query : 
select ?grandchild where {
  wd:Q1339 wdt:P40/wdt:P40 ?grandchild
}


Adam : wd:Q70899 
test("Q70899", "child/child/child")
wikidata query equivalent query : 
select ?grandgrandchild where {
  wd:Q70899 wdt:P40/wdt:P40/wdt:P40 ?grandgrandchild
}
--]]

local iterate_on_alternatives = function(start_rnode, pnodes)
	local i=1
	local current_iter = pnodes[i]:iterate(start_rnode)
	
    return function ()
    	-- loop to go to next iterator if there is empty one in the list
        while true do
	        local res = current_iter()
	        -- res is an iterator itself ; getting its result
	        if res then
	        	return res
	        else
	        	i = i + 1
	        	if i <= #pnodes then 
	        		-- following to next iterator and resume loop
	        		current_iter = pnodes[i]:iterate(start_rnode)
	        	else
	        		-- no current iterator : ending
	        		return nil
	        	end
	        end
        end
    end
end

--[[
Adam's father or mother : no value of course
p.test('Q70899', "P22|P25")

JS Bach's
p.test("Q1339", "P22|P25")

--]]

function iterate_on_nodes_beetween(start_rnode, pnode, min, max)
	seq = {}
	i  = 1

	while i <= min do
		table.insert(seq, pnode)
		i = i + 1
	end
	
	local sequence_obj = {}
	function sequence_obj:iterate(next_rnode)
		return iterate_on_iterator_creators(next_rnode, seq, min)
	end
	if max then
		local star_obj = {}
		function star_obj:iterate(next_rnode)
			return iterate_on_star(next_rnode, pnode, 1, iterated, max-min)
		end
		return iterate_on_iterator_creators(
			start_rnode,
			{
				sequence_obj,
				star_obj
			}
		)
	else
		return sequence_obj:iterate()
	end
end

function iterate_maybe(start_rnode, pnode)
	iterator = pnode:iterate(start_rnode)
	local self_done = false
	return function()
		if not self_done then
			val = iterator()
			if val then return val else
				self_done = true
				return start_rnode
			end
		end
	end
end

function PropertyPath:new(str)
    local obj = {["path"]=str} 
    setmetatable(obj, self)
    
    ast = parser.parse_path(str)
    assert(ast, "parser did not ruturn a node")
    obj.node  = ast
    
    return obj
end

local function entityId(entity)
	if type(entity) == 'string' then
		return entity
	end
	return entity.id
end


function norm_start_point(start_point)
	if type(start_point) == "string" then
		return EpsilonRNode:create(start_point)
	elseif type(start_point) == "table" then
		if start_point["claims"] ~= nil then
			-- assume this is an item or entity object
			return EpsilonRNode:create(start_point.id)
		elseif start_point["is_RNode"] then
			return start_point
		elseif start_point["qualifiers"] or start_point["mainsnak"] then
			local itemid = string.gmatch(start_point.id, "^.*[^$]")() -- extract the item id from the starting statement
			return StatementRNode:create(start_point, EpsilonRNode:create(itemid))
		end
	end
	mw.logObject(start_point)
	error("from function norm_start_point of module PropertyPath : wrong type for start_point", tostring(start_point)) -- TODO : Log a better error
end

function PropertyPath:iterate(start_point)
	start_point = norm_start_point(start_point)
	return self.node:iterate(start_point)
end

local PropertyNode = datastructure.PropertyNode
local AlternativeNode = datastructure.AlternativeNode
local SequenceNode = datastructure.SequenceNode
local QualifiedStatementNode = datastructure.QualifiedStatementNode
local NegatedPropertySetNode = datastructure.NegatedPropertySetNode
local PlusNode = datastructure.PlusNode
local StarNode = datastructure.StarNode
local BetweenNode = datastructure.BetweenNode
local MaybeNode = datastructure.MaybeNode
local QualifierSnakNode = datastructure.QualifierSnakNode

function PropertyNode:iterate(rnode)
	return iterate_on_statement_from_property(rnode, self.property)
end

--[[
test("Q5", "subclass of")
--]]

function AlternativeNode:iterate(rnode) 
	return iterate_on_alternatives(rnode, self.nodes)
end

function NegatedPropertySetNode:iterate(rnode)
	return iterate_on_statement(rnode,
		function (property, val) return self:matches(property) end
	)
end

--[[
test("Q90, ""!(P150)")
--]]

function SequenceNode:iterate(rnode) 
	return iterate_on_iterator_creators(rnode, self.nodes)
end

function QualifiedStatementNode:iterate(rnode)
	local statement_iterator = iterate_on_statement(
		rnode,
		function (key, value)
			return self.property:matches(key)
		end
	)
	local qualifier_iterator_creator = function(statement) 
		return iterate_on_statement_qualifier(
			statement, 
			function (key, value) return self.qualifier:matches(key) end
		)
	end
	
	return iter.flatten(statement_iterator, qualifier_iterator_creator)
end

--[[ to test with :
p.test("Q79529", "union of>of")
p.test("Q105019",'P22{1,6}'
--]]

function QualifierSnakNode:iterate(statementnode)
		return iterate_on_statement_qualifier(
			statementnode, 
			function (key, value) return self:matches(key) end
	)
end

--[[ to test with :
for x in p.iterate("Q79529", "union of") do p.test(x, ">of") end
--]]

function StarNode:iterate(rnode)
	return iterate_on_star(rnode, self.node)
end

function PlusNode:iterate(rnode)
	return iterate_on_plus(rnode, self.node)
end

function BetweenNode:iterate(rnode)
	return iterate_on_nodes_beetween(rnode, self.node, self.min, self.max)
end

function MaybeNode:iterate(rnode)
	return iterate_maybe(rnode, self.node)
end

-- returns an iterator on the result set of a path from a specific node
-- ppath acn either be a string representing a path or a compiled path
function path.iterate(start_node, ppath)
	if start_node == nil then error("the start node is mandatory to get result on a path, it is nil") end
		
	if type(ppath) == "table" then
		return ppath:iterate(start_node)
	else
		return path.PropertyPath:new(ppath):iterate(start_node)
	end
end

-- function that return a boolean
-- true if there is a path matching ppath from start_node that ends with the value "value"
-- (currently only works if "value" is a Qid string)
function path.matches(start_node, ppath, value)
        for val in path.iterate(start_node, ppath) do
                if val:item_value() == value then
                       return true
                end
        end
        return false
end

----------------------------

--[[
p.test("Q5", "P279")
p.test(mw.wikibase.getEntity("Q5"),  "P279")
for x in p.iterate(mw.wikibase.getEntity("Q5"), "P279") do p.test(x,  "P279") end -- test if we can continue iteration of an RNode object
Complex test : 
p.test("Q27929033","P1552>!()/P31") => OK
p.test("Q27929033","subclass of/P1552>!()/P31") => NOK

--]]
function path.test(start_point, ppath)
	for x in path.iterate(start_point, ppath) do 
		mw.log("woot")
		if x then
			mw.log(x:item_value())
		end
	end
end

----------------------------

path.PropertyPath = PropertyPath

return path