Wikidata talk:WikiProject Lakes


Concept to consider incorporating into the Lakes model: lake associations and lake management organizations. For example, in Michigan there is the Michigan Lakes and Streams Association, which has lake-specific associations that help manage the lakes. The property should be bidirectional between the association and the lake: managed by and manages, or affiliated. I don't think "operator" conveys the right semantics here the way it does for a park. Comments on what would best describe the relationship? Wolfgang8741 (talk) 06:52, 4 May 2020 (UTC)[reply]

Location of the source data

For clarity, I found the source data downloadable from the U.S. Board on Geographic Names. The descriptions of what features are in the index differ a bit from what's used in Wikidata and might be a ripe topic for discussion. Regards, Gettinwikiwidit (talk) 05:02, 30 December 2020 (UTC)[reply]

@Gettinwikiwidit: The feature list you're referring to is a categorical starting point; it's fine as a broad category and useful for USGS naming, but not much help for granular segmentation of features. The mapping of these to Wikidata items is rough and in places definitely needs a fuzzy-match approach. Have you looked at the GNIS Data import? I've been slowly trying to create a good mapping to do an import, while some others have been using bots to repair an older partial import from ceb Wikipedia and expanding from that. I've mainly been working on Lakes and want to add Reservoirs soon. I'm long overdue on fleshing out the data import and the QA on that import of what I want to do. I've been spending a lot of time cleaning things up manually in Lakes to better understand what issues are present. Which aspect did you want to discuss? Wolfgang8741 (talk) 18:30, 6 January 2021 (UTC)[reply]
@Wolfgang8741: This post was mainly about me trying to understand where the source data for the Mix'n'Match came from. I wasn't 100% sure from the Project page where to find those data. With the source data available, I was able to import it into OpenRefine and take a different look at it. I hadn't seen the GNIS Data import. Thanks for the pointer. I'll have a look. I did this more or less by hand as described below. Gettinwikiwidit (talk) 00:56, 7 January 2021 (UTC)[reply]

Adjusting the Mix'n'Match

Hello,

I'm not sure how the Mix'n'Match works, but I looked at all the lake (Q23397) items which had a GNIS Feature ID (P590) claim on them and compared the coordinate location (P625) with the coordinates available from GNIS. I eyeballed all claims where the distance between the two was more than a kilometer (about 500 items) and fixed mislabeled items as appropriate. At the time of writing there are <1000 US lakes without any GNIS Feature ID (P590) claim, which is considerably fewer than the number of items in the Mix'n'Match.

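# Count US lake items with and without a GNIS Feature ID (P590) claim.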
SELECT ?hasGNISID (COUNT(*) as ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q23397.
  ?item wdt:P17 wd:Q30.
  OPTIONAL { ?item wdt:P590 ?gnisid }
  BIND(IF(BOUND(?gnisid),true,false) AS ?hasGNISID )
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} GROUP BY ?hasGNISID
Try it!
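For reference, the per-item distance check described at the top of this thread can also be sketched directly in SPARQL using geof:distance, with the GNIS coordinates supplied as VALUES. This is only a sketch: the single row below reuses the Shadehill Reservoir example from the query further down, and in practice the rows would be generated from the GNIS download.

SELECT ?item ?itemLabel ?distance WHERE {
  # (item, GNIS point) pairs; generate these rows from the GNIS file
  VALUES (?item ?gnisPt) {
    (wd:Q7363321 "Point(-102.2572521 45.731476)"^^geo:wktLiteral)
  }
  ?item wdt:P625 ?location .
  BIND(geof:distance(?location, ?gnisPt) AS ?distance)  # kilometres
  FILTER(?distance > 1)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY DESC(?distance)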

Should we adjust the Mix'n'Match to remove items which are already assigned to an entity? Or maybe make two separate ones: one to verify those which are already matched, and one for those which do not yet have a match? I realize this is coming at it from the other end, but it might be more approachable to first find matches for everything which is already in Wikidata and then upload those which don't yet have a match.

Regards, Gettinwikiwidit (talk) 06:42, 4 January 2021 (UTC)[reply]

@Jst4: @Wolfgang8741:

@Gettinwikiwidit: The ping was definitely needed, thanks. In short, it is a bit more involved than that. I would love to create a single clean import of USGS data with all fields filled and then merge duplicates, but Mix'n'Match is a middle ground where existing items are resolved before importing, which reduces the number of items that end up needing a merge and redirect. It is a little slower for such a large set of identifiers, but there are good reasons for this, hinted at below. Glad you were looking at the few Mix'n'Match sets I uploaded back in 2018 (I'm way overdue to update them, but found that more groundwork is needed first). It is a useful tool to monitor key items for duplicates, but it leaves much to be desired in terms of consistency checks against database values.

The USGS GNIS downloads export only one of possibly many points, depending on which source map the feature was created and linked from, since GNIS is a common identifier bridging US government naming of features. There can be significant variation between points for an item depending on its location; the crowdsourced work of The National Map in the US is one way to check, but even there entry errors occur, and more and more source documents for USGS items are being scanned, so you can check the scan to see whether the coordinate is consistent. Even with that, a satellite overview may show GNIS to be close but not exact for some features, and those might be updated in a review and a new dump. Where duplicate GNIS IDs appear on two or more features, the Wikidata items should be reviewed against the original feature for which the Q item was created. While manually cleaning up lakes (my starting work being in Michigan) I found many duplicates, and in a surprising number of cases there was either a feature conflation or a GNIS ID imported from a Wikipedia article that merely mentioned the feature, the article being about another subject, meaning the Wikidata item is not actually about that GNIS record. As more fields are filled out on a Q item, the item's concept or feature is better outlined and easier to identify with automation. This is why I went and added infoboxes to most related EN Wikipedia articles, which can then be imported to the associated Wikidata item. There are many EN Wikipedia articles where a GNIS ID is present but not yet on the Wikidata item, but an import of those should be reviewed because of how the GNIS template and IDs are used in articles.

The approach I've been using is to select a geographic region, clean up entries in that region, then import the remaining items in that region. I've nearly finished Michigan, but my import is not yet complete, as I need to generate automated descriptions that make lakes with the same name unique, i.e. Mud Lake or Long Lake where there are two or more in the same county. Take a look at the demo links on my Wikipedia profile; I also have a set of queries saved for Wikidata to help in the clean-up process. One thing I want to do is better outline the issues found and coordinate documented means to remediate them, finally creating a QA tool and a set of constraints on GNIS-related items that will keep the data in sync with USGS, or at least highlight deviations. Interestingly enough, from the QA process so far I've found that GeoNames and many other sites relying on USGS GNIS data don't disclose the GNIS ID with their items and, more importantly, aren't performing these QA checks.
Having a number of features that I reported and USGS corrected, thanks to a review process linking across Wikidata and OpenStreetMap, has been a pleasant surprise; it highlights the strength of open data that is used and checked, which can keep a check on an authority's data QA processes and fix oversights or spread resources. Would love to have you help, and to discuss and outline the issues more. Wolfgang8741 (talk) 18:52, 6 January 2021 (UTC)[reply]
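As a concrete handle on the duplicate-GNIS situation described above, a query along these lines lists GNIS Feature IDs that appear on more than one item. This is a sketch, restricted to lakes here to keep it within query-service limits:

# GNIS Feature IDs (P590) shared by two or more lake items: candidates for
# review and possible merging or splitting.
SELECT ?gnisid (COUNT(DISTINCT ?item) AS ?items)
       (GROUP_CONCAT(DISTINCT STR(?item); separator=" ") AS ?itemList) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q23397 ;
        wdt:P590 ?gnisid .
} GROUP BY ?gnisid
HAVING(COUNT(DISTINCT ?item) > 1)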
@Wolfgang8741: That sounds great. I'm more than willing to follow your lead. I hadn't seen any activity, so I followed my own nose. You can browse my edits in the last week or so to see what I've done. (Specifically from 22:43, 29 December 2020 through 08:24, 5 Jan 2021.) Regards, Gettinwikiwidit (talk) 01:01, 7 January 2021 (UTC)[reply]
FWIW, I've found that OpenStreetMap is occasionally not detailed enough to make determinations. In those cases, I would often consult Google Maps. GNIS maps also occasionally aren't detailed enough. For things like salt flats, it's hard to make any determination. Gettinwikiwidit (talk) 01:00, 7 January 2021 (UTC)[reply]
Now closer to 300. The last ones are, not surprisingly, the hardest. I suspect some are not in GNIS or have their coordinates wildly off in either GNIS or Wikidata. I only just discovered that USGS also publishes a list of alternate names for entities. I'll have a look at these next to see if I can knock off the rest.
I also labeled some lakes as instance of (P31) former lake (Q47486890). I'm removing them from my analysis. Regards, Gettinwikiwidit (talk) 12:35, 12 January 2021 (UTC)[reply]
I understand the OSM data quality can vary, but so can the GNIS points, somewhat. I actually went in entry by entry and added the Michigan water features to OSM where a GNIS item existed and aerial photos were present. There is still a large amount of missing data and data to be cleaned up on OSM, and QA to do for GNIS, but working on them in tandem and linking to Wikidata should lead to easier maintenance in the long term. For now there is legwork. One thing also on the TODO list is running a comparison of the coordinates on Wikidata to those of the Wikipedia entries and importing fields where no value exists. Have you looked at that at all? Wolfgang8741 (talk) 15:49, 1 February 2021 (UTC)[reply]
@Wolfgang8741 I must have missed your replies and I lost track of this project. I did some comparison of Wikidata to Wikipedia using DBPedia as mentioned below. I found some inconsistencies and patched up what seemed like clear errors, but this was a while ago so the details are not fresh in my memory. The process described below seems fruitful even if it adds DBPedia as a dependency. My main focus was on trying to get all the GNIS IDs added where appropriate. From the SPARQL above it looks like we're down to 317 United States of America (Q30) lake (Q23397) items which are missing a GNIS ID for one reason or another.
FWIW, part of what I got distracted with is creating a build-your-own-reconciliation-service tool for use with OpenRefine. It's called csv-reconcile and it has a plugin for reconciling based on distance. The idea would be to use SPARQL to generate a CSV file with all the values we'd like to reconcile against, then fire up a local reconciliation server using that CSV file, and use it with OpenRefine to do the fuzzy matching.
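A sketch of the kind of query that could generate such a CSV; the exact columns are an assumption, and any fields useful for matching could be added:

# Export id, label and coordinate for US lakes; save the result as CSV and
# point a local csv-reconcile instance at it.
SELECT ?item ?itemLabel ?location WHERE {
  ?item wdt:P31/wdt:P279* wd:Q23397 ;
        wdt:P17 wd:Q30 ;
        wdt:P625 ?location .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}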
I had been reconciling against the names of the lakes and then cross-referencing the distance. Given the number of overlapping names, and of plainly different names for the same lake, this started to feel inefficient. I had ad hoc ways of reducing my choices to nearby lakes, but this service seemed cleaner than that. Moreover, I wrote up a recipe for how to refine candidates returned from a reconciliation service.
I'm happy to try to contribute more to this project but might need some help with motivation. If you have a plan of action or would like to know more details about how I did anything mentioned below, please let me know. Regards, Gettinwikiwidit (talk) 05:52, 14 November 2021 (UTC)[reply]
I also took a crack at reconciling this data against the Getty Thesaurus of Geographic Names (Q1520117) but that data seemed extremely crufty and potentially abandoned. Gettinwikiwidit (talk) 05:55, 14 November 2021 (UTC)[reply]

Reservoir vs. lake

Hi there,

As mentioned above, I have eyeballed quite a few lake entries. It's not clear to me where the line between lake (Q23397) and reservoir (Q131681) is. Things marked in Wikidata as a lake are sometimes indicated as being a reservoir in Geographic Names Information System (Q136736). Should we try to establish guidelines about how we choose this classification, at least for US lakes? In the work mentioned above I was mostly interested in whether the GNIS Feature ID (P590) claim was pointing at the same entry, and I haven't attempted to clarify these data on any other front, but it seems to me that rationalizing this classification is a ripe area for future work.

Regards, Gettinwikiwidit (talk) 11:49, 4 January 2021 (UTC)[reply]
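One way to surface candidates for this cleanup is a query for items currently typed as both classes; a minimal sketch:

# US items carrying both lake (Q23397) and reservoir (Q131681) as instance of
# (P31): likely candidates for reclassification against GNIS.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q23397 ;
        wdt:P31 wd:Q131681 ;
        wdt:P17 wd:Q30 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}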

@Gettinwikiwidit: Thanks for the ping above; I didn't see your notes until then. I'll explain more above about the origins, what I've done and seen over the while, and what I'm still working on. I would say the guide for the US should follow what USGS GNIS qualifies as reservoir vs. lake (and they do make updates if conflicts are found and resolved). In my basic understanding the differentiation mainly rests on whether it is a natural feature (lake, inclusive of ponds and various other forms) or man-made (reservoir). There is much room for further refinement of these broad features into more scientific taxonomic terms related to the lake type, based on shape, water origin, etc. (not my area of scientific expertise, but I'm learning a lot about lake classification while making USGS GNIS, Wikidata, OpenStreetMap, and EN Wikipedia an intersecting component of my dissertation). I'd say where lake is used and USGS GNIS states reservoir, or the inverse, the values should be brought into alignment with USGS. The larger issue, though, is developing a workflow that keeps USGS GNIS and Wikidata in sync, or at least highlights the deviations. That would make this even better than a one-off import and is something I'm interested in developing as a more generalizable workflow. Mix'n'Match does fine for scraping pages, but doesn't have an integrated workflow for dumps on a regular basis... hence the current data is what I uploaded in 2018 to get a start, and there have been changes in GNIS that Mix'n'Match doesn't yet reflect. That is on my plate for updating once I get my proposal submitted. I would definitely be interested in collaborating. As I'll expand upon above, I've focused on Michigan as a way to work on a process which could be scaled to the full US. I'm very interested in building guidelines, especially QA checks for the values. Through some work that was reviewed manually, linking across Wikidata and OpenStreetMap, I was able to highlight points that were duplicates of the same lake, and USGS corrected them after a review, noting that some border cases can be missed in data reporting since each state submits its own data and lakes intersecting state lines can be duplicated until reviewed. Wolfgang8741 (talk) 18:24, 6 January 2021 (UTC)[reply]
@Wolfgang8741: FWIW, I've been looking at items which already had a coordinate location (P625). For those which had a GNIS Feature ID (P590) I used OpenRefine to create a table of the coordinates of each and calculated the distance between the two coordinates, sorted by distance and then used the query below to plot them side by side. For those without a GNIS Feature ID (P590) I searched for GNIS lakes or reservoirs which were nearby. For those which were close enough and/or matched the same name I carried out the same process just described. I found quite a few errors in Wikidata and a handful of problems with GNIS data which I notified them of.
#defaultView:Map
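# Plot each item's Wikidata coordinate ('wiki' layer) next to the GNIS point
# ('gnis' layer); the ?distance column is the precomputed gap in km.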
SELECT DISTINCT ?place ?placeLabel ?location ?layer ?distance WITH {
  SELECT ?place ?gnisPt ?distance WHERE {
    VALUES (?place ?name ?gnisPt ?gnisid ?distance) {
      (wd:Q7363321 "Shadehill Reservoir" "Point(-102.2572521,45.731476)"^^geo:wktLiteral "1261165" "4.883"^^xsd:double )
      (wd:Q6475393 "Lake Claiborne" "Point(-92.9607898,32.7581515)"^^geo:wktLiteral "553994" "5.434"^^xsd:double )
      (wd:Q7331554 "Somerville Lake" "Point(-96.5862171,30.3115649)"^^geo:wktLiteral "1376179" "5.846"^^xsd:double )
      (wd:Q6477811 "Lake Springfield" "Point(-89.6079334,39.706848)"^^geo:wktLiteral "419006" "5.884"^^xsd:double )
      (wd:Q6476037 "Lake Greeson" "Point(-93.789557,34.2410144)"^^geo:wktLiteral "75726" "6.39"^^xsd:double )
      (wd:Q6475467 "Lake Corpus Christi" "Point(-97.8646671,28.0435748)"^^geo:wktLiteral "1861919" "6.615"^^xsd:double )
      (wd:Q5181892 "Cranberry Lake" "Point(-74.8852263,44.125427)"^^geo:wktLiteral "976197" "7.143"^^xsd:double )
      (wd:Q6477632 "Lake San Antonio" "Point(-120.9011529,35.8082364)"^^geo:wktLiteral "254200" "7.638"^^xsd:double )
      (wd:Q451522 "Lake Havasu" "Point(-114.3128939,34.4343676)"^^geo:wktLiteral "243280" "8.447"^^xsd:double )
      (wd:Q6477985 "Lake Tawakoni" "Point(-95.9923037,32.8821102)"^^geo:wktLiteral "1376396" "10.617"^^xsd:double )
    }
    FILTER EXISTS { ?place wdt:P590 [] }
  }
} AS %vals
WHERE {
  {
    INCLUDE %vals
    ?place wdt:P625 ?location.
    BIND( 'wiki' as ?layer )
  } UNION {
    INCLUDE %vals
    BIND( ?gnisPt AS ?location )
    BIND( 'gnis' as ?layer )
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} ORDER BY ?placeLabel
Try it!
I guess I'm not understanding what redirects you're talking about. Once we've matched up all Wikidata items to GNIS IDs, the only way the remaining unmatched GNIS entries could conflict with existing Wikidata items is if they are duplicates within GNIS itself, right? There are cases of duplicates in GNIS, but I don't believe there are tons of them. I'm not suggesting we should do it blindly, just that this approach may be more productive. Regards, Gettinwikiwidit (talk) 23:01, 6 January 2021 (UTC)[reply]
@Gettinwikiwidit: Ah, the redirect I referred to in the other thread applies when duplicate Q items exist for the same entry. Once a Q item is merged, the second Q is treated as a redirect to the consolidated item on the same concept. Yes, if all Wikidata items are matched to a GNIS ID then any duplicate should be a duplicate, but that is not guaranteed, since it depends on how the GNIS ID was added and to what. For instance, there have been GNIS IDs added from an EN Wikipedia article where the article only cited the GNIS ID in passing for a feature with a similar name. In other cases GNIS IDs were added in an unmonitored fashion through a mass import to a Q item with the same name, and that Q is now conflated because the match was on name alone and the item didn't have other fields to distinguish it; the only check was that the article's Wikidata item didn't already have a GNIS ID. There are also times where the Wikidata item stayed true but the Wikipedia article changed focus. This is why I'm relying more on manual review for cleanup, then import, per selected region, given that many of these require splitting out conflated fields and additional IDs, etc. Given the moving target, a systematic cleanup per region, then moving to the next, seems to me more likely to produce quality and comprehensive coverage than focusing only on what exists here. This both reduces the chance of new items being added without GNIS and provides a strong representation of what the Q item represents for matching to new articles. Wolfgang8741 (talk) 16:08, 1 February 2021 (UTC)[reply]

Duplicates?

Atoka Lake certainly looks like it's the same thing as Lake Atoka Reservoir. Gettinwikiwidit (talk) 05:19, 7 January 2021 (UTC)[reply]

Merged Gettinwikiwidit (talk) 09:19, 17 January 2021 (UTC)[reply]

Grabbing GNIS from Wikipedia pages

Hello, using the Wikipedia API, I was able to find the GNIS Feature ID (P590) for the following entities. None of these had a GNIS Feature ID (P590) claim before, but I've since added them. Along the way I noticed that some of the coordinates were quite far off. We can/should probably extract the GNIS reference from all related pages to make sure they match what's in the entity, and similarly do a comparison of the coordinates. Gettinwikiwidit (talk) 03:43, 11 January 2021 (UTC)[reply]

I went ahead and pulled all GNIS references out of en-wiki US lake entries which are associated with a Wikidata entity. There were ~50 cases where the GNIS ID in the Wikidata entity wasn't mentioned in the en-wiki page but other GNIS entries were. Of these, some simply weren't referencing the lake itself but, for instance, the attached dam. Scattered among them were some mismatched entries, which I endeavored to resolve. Often these were cases with multiple lakes in the same area with the same name, but occasionally it looked like a bad cut-and-paste. In any event, I currently believe that the correct GNIS ID is now referenced in all these en-wiki pages. Gettinwikiwidit (talk) 09:24, 17 January 2021 (UTC)[reply]

Cebu entities merged

I have merged the following entities:

Gettinwikiwidit (talk) 00:13, 14 January 2021 (UTC)[reply]

en-wiki coords differ from Wikidata coords

Using the DBPedia SPARQL endpoint I extracted the en-wiki coordinates with the following query and then compared them with what was in Wikidata.

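# DBpedia query: fetch the coordinates DBpedia extracted from the listed en-wiki articles.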
PREFIX : <http://dbpedia.org/resource/>
PREFIX coord: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT DISTINCT ?lake ?title ?lat ?long ?geometry ?prov WHERE {
  VALUES ?title {
    "Big_Creek_Lake_(Iowa)"   
  } 
  BIND(URI(CONCAT(STR(:),?title)) AS ?lake)
  ?lake coord:lat ?lat.
  ?lake coord:long ?long.
  ?lake coord:geometry ?geometry.
}
Try it!
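The comparison step itself could also be sketched as a single federated query run from the Wikidata Query Service, assuming https://dbpedia.org/sparql is on its federation allow list. The article-title to DBpedia-resource mapping below is simplified and may miss titles needing percent-encoding; the single VALUES row is just Fort Peck Lake from the table below.

# For each item, pair the Wikidata coordinate (P625) with the coordinate
# DBpedia extracted from the linked en-wiki article and compute the gap in km.
PREFIX wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?item ?itemLabel ?wdLoc ?lat ?long ?distance WHERE {
  VALUES ?item { wd:Q4492006 }        # example: Fort Peck Lake
  ?item wdt:P625 ?wdLoc .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
  BIND(URI(CONCAT("http://dbpedia.org/resource/",
                  STRAFTER(STR(?article), "https://en.wikipedia.org/wiki/"))) AS ?dbp)
  SERVICE <https://dbpedia.org/sparql> {
    ?dbp wgs:lat ?lat ;
         wgs:long ?long .
  }
  BIND(STRDT(CONCAT("Point(", STR(?long), " ", STR(?lat), ")"), geo:wktLiteral) AS ?enwikiPt)
  BIND(geof:distance(?wdLoc, ?enwikiPt) AS ?distance)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}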

Here are the 50 worst cases. I've started working through them starting at the worst. Some are just large lakes and so both points are valid, but others are mismatched Wikidata items to Wikipedia items.

I've looked at all of those and am replacing the table with the next 50. Gettinwikiwidit (talk) 02:36, 15 January 2021 (UTC)[reply]
Now I've done a few more rounds of this. I'm not seeing terrible errors but there are an awful lot of en-wiki entries which are on the lip of the lake rather than somewhere in the center. I changed several of these to use the GNIS coordinates, but couldn't be bothered to do them all. I'd guess there aren't many actual errors inside of 1.5km, so I'll probably put this on hold for now. I've marked below where I left off. Gettinwikiwidit (talk) 10:15, 16 January 2021 (UTC)[reply]
I should mention that I was focused on en-wiki. I haven't touched any of the others. Gettinwikiwidit (talk) 10:15, 16 January 2021 (UTC)[reply]
entity | distance from en-wiki coords (km) | checked
Fort Peck Lake (Q4492006) 1.1
Singletary Lake (Q7349228) 1.1
Lake Lynn (Q16893907) 1.109
Loon Lake (Q6675634) 1.125
Lake Granby (Q6476029) 1.142
Trout Bog Lake (Q7361538) 1.145
Cataract Lake (Q5051463) 1.155
Lake Saint Francis (Q22352699) 1.173
Hoopes Reservoir (Q5898403) 1.183
Lower Otay Reservoir (Q1872606) 1.19
Griffy Lake (Q5608957) 1.19
North Lake (Q7055882) 1.193
Q2917246 1.196
Cross Lake (Q5188336) 1.212
Torch Bay (Q7825534) 1.22
Lake Iroquois, Illinois (Q6476287) 1.221
McDaniel Lake (Q6800839) 1.226
Waldo Lake (Q3215275) 1.232
Lake Sabbatia (Q6477599) 1.248
Proctor Lake (Q7247526) 1.259
Swimming River Reservoir (Q7656235) 1.267
Stansberry Lake (Q16107989) 1.268
Rocky Gorge Reservoir (Q7350741) 1.274
Barr Lake (Q808761) 1.281
Lake Chippewa (Q23950324) 1.282
Kent Lake (Q25238480) 1.294
Indian Lake (Q6020857) 1.303 checked
Moss Lake (Q30603602) 1.304 checked
Lake Kanawauke (Q6476405) 1.327 checked
Prien Lake (Q7242865) 1.328 checked
Pruess Lake (Q7253018) 1.361 checked
Mitchell Lake (Q6881286) 1.374 checked
Triadelphia Reservoir (Q7350748) 1.382 checked
Silver Lake (Q49702647) 1.383 checked
Candlewood Lake (Q5032008) 1.389 checked
Lake Georgetown (Q6475983) 1.396 checked
Lake Toxaway (Q6478136) 1.417 checked
Chevelon Canyon Lake (Q5094319) 1.418 checked
Lake Arrowhead (Q6474833) 1.423 checked
Lewiston Lake (Q6537401) 1.426 checked
Mona Lake (Q25238181) 1.429 checked
Jersey City Flowage (Q18152675) 1.45 checked
J C Murphey Lake (Q33264232) 1.452 checked
Kukaklek Lake (Q1791343) 1.498 checked
Lake Belle Taine (Q85775595) 1.504 checked
Paulina Lake (Q7154946) 1.506 checked
Lake Wichita (Q14710926) 1.523 checked
Malheur Lake (Q3215037) 1.526 checked
Fletcher Pond (Q5458814) 1.556 checked
Lake Alvin (Q6474756) 1.56 checked
Along the way I found a few more merges with ceb-wiki: Silver Lake (Q34716991), Lake Gibson (Q35731456), Mays Pond (Q49305471). Gettinwikiwidit (talk) 23:59, 14 January 2021 (UTC)[reply]
Q2917246 was a conflation of a census-designated place and a reservoir, which is why the coordinate deviated from EN; the reservoir concept was split out to Lake Waynoka (Q111977403). Wolfgang8741 (talk) 21:05, 12 May 2022 (UTC)[reply]