Wikidata:Requests for permissions/Bot/Handelsregister
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done @SebastianHellmann: This request seems to be abandoned, please reopen it if that is not the case. Thanks. Mike Peel (talk) 19:52, 21 July 2020 (UTC)[reply]
Handelsregister (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: SebastianHellmann (talk • contribs • logs)
Task/s: Crawl https://www.handelsregister.de/rp_web/mask.do, go to UT (Unternehmensträger), and add an entry to Wikidata for each German organisation with its basic information, especially the registering court and the id assigned by that court.
Code: The code is a fork of https://github.com/pudo-attic/handelsregister (small changes only)
Function details:
Task 1 (prerequisite for Task 2): Find all organisations currently in Wikidata that are registered in Germany and find the corresponding Handelsregister entry. Then add the data to the respective Wikidata items.
What data will be added? The Handelsregister collects information from all German courts, with which all organisations in Germany are obliged to register. The courts pass the data to a private company running the Handelsregister, which makes part of the information public (UT - Unternehmensträgerdaten, the core data) and sells the rest. Each organisation can be uniquely identified by the registering court and the number assigned by this court (the number alone is not enough, as two courts might assign the same number). Here is an example of the data:
- Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH
- Legal status: Gesellschaft mit beschränkter Haftung
- Capital: 25.000,00 EUR
- Date of entry: 29/08/2016
- (When entering date of entry, wrong data input can occur due to system failures!)
- Date of removal: -
- Balance sheet available: -
- Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
- Prager Straße 38-40
- 04317 Leipzig
Most items are stable: each organisation is registered when it is founded and assigned a number by the court, e.g. Saxony District court Leipzig HRB 32853. After that, only the address and the legal status can change. For Wikidata it is no problem to keep companies that no longer exist, as they should be preserved for historical purposes.
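For illustration, here is a minimal sketch (in Python) of the per-organisation record such a crawl would produce, based only on the example fields shown above; the combined court-plus-number key is what makes an entry unique:

```python
# Minimal sketch of one crawled Handelsregister record, using only the fields
# from the example above. The unique key combines registering court and
# register number, because the number alone can collide across courts.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RegisterEntry:
    court: str        # registering court, e.g. "Amtsgericht Leipzig"
    number: str       # number assigned by that court, e.g. "HRB 32853"
    name: str
    legal_form: str   # e.g. "Gesellschaft mit beschränkter Haftung"
    capital_eur: float
    entry_date: date

    @property
    def key(self) -> str:
        """Combined identifier: court plus number."""
        return f"{self.court} {self.number}"

entry = RegisterEntry(
    court="Amtsgericht Leipzig",
    number="HRB 32853",
    name="A&A Dienstleistungsgesellschaft mbH",
    legal_form="Gesellschaft mit beschränkter Haftung",
    capital_eur=25000.00,
    entry_date=date(2016, 8, 29),
)
print(entry.key)  # Amtsgericht Leipzig HRB 32853
```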
Maintenance should be simple: once a Wikidata item contains the correct court and number, it can be matched 100% to the corresponding entry in the Handelsregister. This way the Handelsregister can be queried once or twice a year to update the information in Wikidata.
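A sketch of what such a yearly maintenance pass could look like, assuming a Handelsregister ID property that does not exist yet (the placeholder P9999 below would first need a property proposal):

```python
# Sketch of the periodic maintenance run: fetch all items carrying the
# (hypothetical, still-to-be-proposed) Handelsregister ID property from the
# Wikidata Query Service, then re-check each entry against the register.
import requests

QUERY = """
SELECT ?item ?regId WHERE {
  ?item wdt:P9999 ?regId .   # P9999: placeholder for the Handelsregister ID
}
"""

def items_with_register_id():
    r = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "HandelsregisterBot/0.1 (example sketch)"},
    )
    r.raise_for_status()
    for b in r.json()["results"]["bindings"]:
        yield b["item"]["value"], b["regId"]["value"]

# For each pair, the bot would re-query the Handelsregister and update
# address, legal status and removal date where they have changed.
for item_uri, reg_id in items_with_register_id():
    print(item_uri, reg_id)
```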
Question 1 (bot or other tool): How should the data be added? I am keeping the bot request, but I will look at Mix'n'match first. Maybe that tool is better suited for Task 1.
Question 2 (modelling): Which properties should be used in Wikidata? I am particularly looking for a property for the court as registering organisation, i.e. the body that has the authority to define the identity of an organisation, and then also one for the number (HRB 32853). The types, i.e. the legal forms, can be matched to existing Wikidata items; most exist in the German Wikipedia. Any help with the other properties is appreciated.
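To make the modelling options concrete, here is a hedged illustration as QuickStatements-style lines generated from Python. Every ID below is a placeholder: none of these properties or items exist yet, and each property would need a property proposal.

```python
# Hypothetical modelling options for one register entry, emitted as
# QuickStatements V1 lines (item TAB property TAB value). All IDs are
# placeholders, not real Wikidata properties or items.
item = "Q99999999"        # placeholder item for A&A Dienstleistungsgesellschaft mbH
court_item = "Q88888888"  # placeholder item for Amtsgericht Leipzig

# Option A: two linked statements -- court as an item, number as a string.
print(f'{item}\tP8888\t{court_item}')   # P8888: "registering court" (placeholder)
print(f'{item}\tP7777\t"HRB 32853"')    # P7777: "register number" (placeholder)

# Option B: one combined external-id that is unique on its own.
print(f'{item}\tP9999\t"Amtsgericht Leipzig HRB 32853"')
```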
Question 3 (legal): I still need to read up on the legal situation around importing crawled data. Here is a hint given on the mailing list:
- https://en.wikipedia.org/wiki/Sui_generis_database_rights You'd need to check whether in Germany it applies to official acts and registers too... https://meta.wikimedia.org/wiki/Wikilegal/Database_Rights
Task 2: Add all missing identifiers for the remaining organisations in the Handelsregister. Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
It should meet notability criterion 2: https://www.wikidata.org/wiki/Wikidata:Notability
- 2. It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references. If there is no item about you yet, you are probably not notable.
The reference is the official German business registry, which is serious and publicly available. Organisations are also by definition clearly identifiable legal entities.
--SebastianHellmann (talk) 07:39, 16 October 2017 (UTC)[reply]
- Could you make a few example entries to illustrate what the items you want to create will look like? What strategy will you use to avoid creating duplicate items? ChristianKl (talk) 12:38, 16 October 2017 (UTC)[reply]
- I think this is a good idea, but I agree there needs to be a clear approach to avoiding the creation of duplicates - we have hundreds of thousands of organizations in Wikidata now, many of them businesses, many from Germany, so there should certainly be some overlap. Also, I'd like to hear how the proposer plans to keep this information up to date in the future. ArthurPSmith (talk) 15:13, 16 October 2017 (UTC)[reply]
- There was a discussion on the mailing list. It would be easier to complete the information for existing entries in Wikidata first. I will check Mix'n'match or other methods for this. Once this space is clean, we can rediscuss creating new identifiers. SebastianHellmann (talk) 16:01, 16 October 2017 (UTC)[reply]
- Is there an existing ID that you plan to use for authority control? Otherwise, do we need a new property? ChristianKl (talk) 20:40, 16 October 2017 (UTC)[reply]
- I think that the ID needs to be combined, i.e. registering court and register number. That might be two properties. SebastianHellmann (talk) 16:05, 29 November 2017 (UTC)[reply]
- Given that this data is updated fairly frequently, how do you plan to maintain it?
--- Jura 16:38, 16 October 2017 (UTC)[reply]
- * The frequency of updates is indeed high: a search for deletion announcements alone, in the limited timeframe of 1.9.-15.10.17, finds 6,682 deletion announcements (legally the most serious kind of change, accounting for approx. 10% of all announcements). That is roughly 150 deletions per day, so within one year more than 50,000 companies are deleted - which should certainly be reflected in the corresponding Wikidata entries. Jneubert (talk) 15:44, 17 October 2017 (UTC)[reply]
- Hi all, I updated the bot description, trying to answer all questions from the mailing list and here. I still have three questions, which I am investigating. Help and pointers highly appreciated. SebastianHellmann (talk) 23:36, 16 October 2017 (UTC)[reply]
- Given that German is the default language in Germany I would prefer the entry to be "Sachsen Amtsgericht Leipzig HRB 32853" instead of "Saxony District court Leipzig HRB 32853". Afterwards we can store that as an external ID and make a new property for that (which would need a property proposal). ChristianKl (talk) 12:33, 17 October 2017 (UTC)[reply]
- @SebastianHellmann: you must store the official DE ID as an external-id from the very beginning, otherwise it'll be impossible to have traceability or updates --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- Isn't it sufficient to store Legal Entity Identifier (P1278)?
--- Jura 13:21, 17 October 2017 (UTC)[reply]
- Legal Entity Identifier (P1278) is about IDs derived via ISO 17442. It's not clear to me that anything in the list is such an ID. ChristianKl (talk) 18:46, 17 October 2017 (UTC)[reply]
- As of today there are 66.3k DE companies in GLEIF. So @Jura1: no, LEI is not enough. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- Thanks for the updated details here. It sounds like a new identifier property may be needed (unless one of the existing ones like Legal Entity Identifier (P1278) suffices, but I suspect most of the organizations in this list do not have LEIs (yet?)). Ideally an identifier property has some way to turn the identifiers into a URL link with further information on that particular identified entity; that de-referenceability makes verification easy - see the "formatter URL" examples on some existing identifier properties. Does such a thing exist for the Handelsregister? ArthurPSmith (talk) 14:58, 17 October 2017 (UTC)[reply]
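For readers unfamiliar with formatter URLs, the mechanism is a plain "$1" substitution; the sketch below uses a made-up URL, since whether handelsregister.de offers stable, session-free per-entity links is exactly the open question above:

```python
# How a Wikidata formatter URL works: the property stores a URL template with
# a "$1" placeholder that is replaced by the external-id value. The template
# below is hypothetical -- handelsregister.de may not have stable URLs at all.
FORMATTER_URL = "https://example.org/handelsregister/$1"  # hypothetical

def format_link(register_id: str) -> str:
    return FORMATTER_URL.replace("$1", register_id)

print(format_link("Amtsgericht-Leipzig-HRB-32853"))
```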
@SebastianHellmann: for task 1, you might also be interested in OpenRefine (make sure you use the German reconciliation interface to get better results). See https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation for details of its reconciliation features. I suspect your dataset might be a bit big though: I think it would be worth trying it only on a subset (for instance, filtering out companies with a low capital). − Pintoch (talk) 14:52, 20 October 2017 (UTC)[reply]
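For scripted use, the same reconciliation service that OpenRefine talks to can be called directly. A rough sketch follows; the endpoint URL is an assumption based on the service mentioned above, so check the linked wiki page for the current one:

```python
# Hedged sketch of calling the Wikidata reconciliation service directly,
# restricted to businesses (Q4830453) located in Germany (P17 = Q183).
# The endpoint URL is an assumption; see the OpenRefine wiki linked above.
import json
import requests

ENDPOINT = "https://tools.wmflabs.org/openrefine-wikidata/de/api"  # assumed

def reconcile(name: str):
    queries = {"q0": {
        "query": name,
        "type": "Q4830453",                            # business
        "properties": [{"pid": "P17", "v": "Q183"}],   # country: Germany
    }}
    r = requests.get(ENDPOINT, params={"queries": json.dumps(queries)})
    r.raise_for_status()
    return r.json()["q0"]["result"]    # candidates: id, name, score, match

for cand in reconcile("A&A Dienstleistungsgesellschaft mbH"):
    print(cand["id"], cand["score"], cand["name"])
```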
Concerning Task 2, I'm a bit worried about the companies' notability (or lack thereof), since the Handelsregister includes any and all companies: not just the big ones, for which there's a good chance that Wikipedia articles, other sources, external IDs, etc. exist, but also tiny companies and even one-person companies, like someone selling stuff on eBay or some guy selling Christmas trees in his village. It would be very hard to find any data on these companies outside the Handelsregister and the phone book. --Kam Solusar (talk) 05:35, 21 October 2017 (UTC)[reply]
- Agreed. Do we really need to be a complete copy of the Handelsregister? What for? How about concentrating instead on a meaningful subset that addresses a clear use case? --LydiaPintscher (talk) 10:35, 21 October 2017 (UTC)[reply]
- That is of course true. A strict reading of Wikidata:Notability could be taken to require at least two reliable sources. But then, one of those could be the phone book. Do we have to make those criteria stricter? That would require an RfC. Lymantria (talk) 07:58, 1 November 2017 (UTC)[reply]
- I would at least try an RfC, but I am not immediately sure what to propose.--Ymblanter (talk) 08:05, 1 November 2017 (UTC)[reply]
- If there's an RfC, I would say it should state that for data imports of >1000 items the decision whether or not we import the data should be made via a request for bot permissions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)[reply]
- @SebastianHellmann: is well-intentioned, but I agree not all companies are notable. Even worse than one-man shops are inactive companies that nobody has bothered to close yet. Just "comes from a reputable source" is not enough: e.g. OpenStreetMap is reputable, and it would be OK to import all power stations (e.g. see Enipedia) but imho not OK to import all recyclable garbage cans. We have 950k BG companies at http://businessgraph.ontotext.com/ but we are hesitant to dump them on Wikidata. Unfortunately, official trade registers usually lack measures of size or importance...
- It's true that the Companies project has not gelled yet and there's no clear community of use for this data. On the other hand, if we don't start somewhere and experiment, we may never get large quantities of company data. So I'd agree to this German data dump by way of experiment --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- @Rjlabs: That hope is unfounded, because each jurisdiction does its own thing. OpenCorporates has a bunch of web-crawling scripts (some of them donated) that they consider significant IP. And as @SebastianHellmann: wrote, their data is sort of open, but not really. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- I Support importing the data. Having the data makes it easier to enter the employer when we create items for new people. Companies also engage in other actions that leave marks in databases, such as registering patents or trademarks, and it is easier to import such data when we already have items for the companies. The ability to run queries about the companies located in a given area is also useful. ChristianKl (talk) 17:20, 3 November 2017 (UTC)[reply]
- @ChristianKl: at least half of the 200M or so companies world-wide will never have notable employees or patents, so "let's import them just in case" is not a good policy --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- When it comes to these mass imports, I would only want to import datasets about companies from authoritative sources. For a country like Uganda, I think it would be great to have an item for every company that truly exists there. People in Uganda care about the companies that exist in their country, and their government might not have the capability to host that data in a user-friendly way. An African app developer could profit from the existence of a unique identifier that's the same across multiple African countries.
- When it comes to the concern about data not being up to date: there were multiple cases where I would have really liked data about 19th-century companies while doing research in Wikidata. Having data that's kept up to date is great, but having old data is also great. ChristianKl (✉) 20:11, 13 December 2017 (UTC)[reply]
- @Rjlabs: We did go back and forth with a lot of ideas on how to set some sort of criteria for company notability. I think any public company with a stock market listing should be considered notable, as there's a lot of public data available on those. For private companies we talked about some kind of size cutoff, but I suppose the existence of 2 or more independent reference sources with information about the company might be enough? ArthurPSmith (talk) 18:01, 3 November 2017 (UTC)[reply]
- @ArthurPSmith:@Denny:@LydiaPintscher: Arthur, let's make it: any public company that trades on a recognized stock exchange, anywhere worldwide, with a continuous bid and ask quote, that actually trades at least once per week, is automatically considered "notable" for Wikidata inclusion. This is by virtue of the fact that real people wrote real checks to buy shares, there is sufficient continuing trading interest in the stock to make it trade at least once per week, and some exchange somewhere lists that firm. We should also note that passing this hurdle means that SOME data on that firm is automatically allowable on Wikidata, provided the data is regularly updated. Rjlabs (talk) 19:35, 3 November 2017 (UTC)[reply]
- @Rjlabs, Denny, LydiaPintscher: Public companies are a no-brainer because there are only about 60k in the world (and about 2.6k exchanges); compare that to about 200M companies world-wide. --Vladimir Alexiev (talk) 15:46, 19 November 2017 (UTC)[reply]
- "Some data" means (for right now) information like LEI, name, address, phone, industry code(s), and a brief text description of what they do, plus about 10 high-level fields that cover the most frequently needed company data (such as sales, employees, assets, principal exchange(s) down to where at least 20% of the volume is traded, unique symbol on that exchange, CEO, URL to the investor-relations section of the website where detailed financial statements may be found, and Central Index Key (or equivalent) with a link to regulatory filings / structured data in the primary country where it's regulated). For now that is all that should be "automatically allowable": no detailed financial statements, line by line, going back 10-20 years, with adjustments for stock splits, etc.; no bid/offer/last-trade time series. Consensus on further detail has to wait for further gelling. I ping Lydia and Denny here to be sure they are good with this potential volume of linked data. (I think it would be great, a good start, and limited. I especially like it if it MANDATES an LEI, if one is available.) Moving down from here (after 100% of public companies that are alive enough to actually trade) there is of course much more; however, it's a very murky area. Requiring >=2 independent reference sources with information about the company might be too broad, causing Wikidata capacity issues, or too burdensome if someone has a structured data source that is much more reliable than Wikidata to feed in, but lacks that "second source". Even if there was one absolutely assured good-quality source, and Wikidata capacity was not an issue, I'd like to see a "sustainability" requirement up front: load no private-company data where it isn't AUTOMATICALLY updated or expired out. Again, it would be great to have further Denny/Lydia input here on any capacity concern. Rjlabs (talk) 19:35, 3 November 2017 (UTC)[reply]
- "A modicum of data" as you describe above is a good criterion for any company. --Vladimir Alexiev (talk)
- At WikidataCon there was a question from the audience about whether Wikidata would be okay with importing the 400 million entries about items in museums that are currently managed by various museums. User:LydiaPintscher answered by saying that her main concerns aren't technical but whether our community does well at handling a huge influx of items. Importing data like the Handelsregister will mean that there will be a lot of items that won't be touched by humans, but I don't think that's a major concern for our community. Having more data means more work for our community, but it also means that new people get interested in interacting with Wikidata. When we make decisions like this, however, technical capabilities do matter. I think it would be great if a member of the development team wrote a longer blog post explaining the technical capabilities, so that we can better factor them into our policy decisions. ChristianKl (talk) 12:35, 4 November 2017 (UTC)[reply]
- I agree with Lydia. The issue is hardly the scalability of the software - the software is designed in such a way that there *should* not be problems with 400M new items. The question is whether we have a story as a community to ensure that these items don't just turn into dead weight. Do we ensure that items in this set are reconciled with existing items where they should be? Can we deal with attacks on that dataset in some way, with targeted vandalism? That the software can scale, I am rather convinced. Whether the community can scale, I think we need to learn.
- Also, for the software, I would suggest not growing 10x at once, but rather increasing the total size of the database with a bit more measure, and never more than doubling it in one go. This is basically for stress-testing, and to discover, if possible, unexpected issues early. But the architecture itself should accommodate such sizes without much ado (again, "should": if we really go for 10x, I expect at least one unexpected bug to show up). --Denny (talk) 23:25, 5 November 2017 (UTC)[reply]
- Speaking of the community being able to handle dead weight, it seems we mostly lack the tools to do so. Currently we are somewhat flooded by items from cebwiki, and despite efforts by individual users to deal with one or another problem, we still haven't tackled them systematically, and this has led to countless items with unclear scope, complicating every other import.
--- Jura 07:00, 6 November 2017 (UTC)[reply]
- I don't think we should just add 400M new items in one go either. But I don't think that the amount of vandalism Wikidata faces scales directly with the number of items we host: if we double the number of items, we don't double the amount of vandalism.
- As far as the cebwiki items go, the problem isn't just that there are many items; the problem is that a lot of the items have unclear scope. For me that means that when we allow massive data imports, we have to make sure the imported data is of high enough quality that the scope of every item is clear. This means that having a bot approval process for such data imports is important, and it suggests to me that we should also get clear about the necessity of having bot approval for creating a lot of items via QuickStatements.
- Currently, we are importing a lot of items via WikiCite and it seems to me that process is working without significant issues.
- I agree that scaling the community should be a higher priority than scaling the number of items. One implication is that it makes sense to have higher standards for mass imports via bots than for items added by individuals (a newbie is more likely to become involved in our community when we don't greet them by deleting the items they created).
- Another implication is that the metric we celebrate shouldn't be the number of items or statements per item but the number of active editors. ChristianKl (✉) 09:58, 20 November 2017 (UTC)[reply]
Now what?
Lots of good discussion above. Would anyone care to summarize, and how do we move to a decision? --Vladimir Alexiev (talk) 15:10, 5 December 2017 (UTC)[reply]
- Some seem to consider it too granular. Maybe a test could be done with a subset. If no other criteria can be determined, a start could perhaps be made with companies with a capital > EUR 100 million.
--- Jura 20:21, 13 December 2017 (UTC)[reply]