User talk:Kristian-Clausal

Reusing Wiktionary data
Hi there, thank you for your comments and edits here. You mentioned on the Feedback page that the company you work for uses Wiktionary data in some form, can you share some more details about this? – Jberkel 15:37, 10 November 2021 (UTC)


 * I'll have to ask my boss what I can disclose. I'm pretty sure it's reasonably public and he's been active here himself before I joined, but I want to be sure. Kristian-Clausal (talk) 06:17, 11 November 2021 (UTC)
 * Nothing to disclose then? Anyway, whatever it is, make sure you comply with the license. Seems obvious but a lot of sites reuse content without attribution. - Jberkel 12:13, 28 November 2021 (UTC)
 * We ended up not actually going through with it because Wiktionary's Wikimedia-style data structures (especially the category system) made it essentially unusable for our purposes. There are some other things in the pipeline, maybe. All the licensing stuff and such is above board, and has been since before I came along, so no need to worry about that. Kristian-Clausal (talk) 06:10, 29 November 2021 (UTC)

"changes for machine-readability"
Hi. Can you explain why you need to make changes like for "machine readability"? I do software development professionally and I can guarantee you that there is no need to make such changes. You changed the look of the page, which you should only do if this is the correct thing from a usability/UI standpoint, not simply to make your life easier as a programmer. Benwing2 (talk) 02:08, 23 July 2022 (UTC)
 * I see you did this to a lot of templates. I am going to revert them all; this is a really bad idea. Benwing2 (talk) 02:11, 23 July 2022 (UTC)
 * OK maybe that was a bit harsh. I think what you are trying to do is make it easier to identify the headers vs. the content. However, this causes all headers to be boldface, which changes the look of the page, which may not be the right thing for an individual page. Benwing2 (talk) 03:19, 23 July 2022 (UTC)
 * Thank you for not reverting the changes yet. I am on vacation right now, so I'll try to get someone else to explain what we're trying to do, if I can't do so well enough right now.
 * I work for Clausal Computing Oy in Finland, which runs kaikki.org. Wiktextract is a project that takes wikitext (specifically en.wiktionary.org text), processes it and tries to parse it to get data from Wiktionary word article headers and tables, then outputs it in JSON format.
 * And by parsing, I mean genuinely getting meaningful information out of human-written text. This is genuinely difficult, because Wiktionary is written by humans and their output is too variable and ambiguous (comparing one editor to another, or even one editor to themselves in a different article).
 * The changes I have been making to tables are, as you've seen, converting content cells into header cells when they are semantically headers, because that's one of the persistent problems we have with tables. We have some heuristic methods that guess whether a given cell should actually be treated as a header, but they produce a lot of false positives and garbage data: a content cell that merely looks like a pronoun or a grammatical term can scramble everything (amongst other problems). However, when the parser finds something inside a header block, it knows that it's not a content cell and can handle it appropriately, without guessing.
 * What I've been doing when I edit tables (and modules that generate tables) is change |-cells in the table into !-cells (by hand). This is the correct thing to do, from the perspective of the tables: these are headers, and 90% of the time there isn't even any kind of cosmetic change. Only sometimes, like with the Hindi modules you've apparently noticed, is there a visible change.
 * This is the first time since I've started doing these changes that anyone's actually noticed anything, or at least said anything, so I've taken it to mean that the changes haven't broken anything, and they shouldn't have.
 * If you wish, I can leave the specific language you're worried about alone and just add it to a list of languages in our code "with known difficult tables". Eventually, as we get the list populated, the languages in that list will have the above-mentioned guess-based heuristics used on them and won't output error messages telling me to fix the table cell headers. But that will have to wait until next month when I get back to work.
 * If you want to contribute to Wiktextract, it is open source and everyone is welcome! Seeing as you are a Wiktionary editor and programmer, perhaps you could even fix tables in a better way than I have up until now, by converting cells into headers when appropriate and applying some kind of style to override the header bolding? Kristian-Clausal (talk) 05:43, 23 July 2022 (UTC)
 * I'm all for making Wiktionary more machine-readable, but maybe it would be good to discuss this with the community as well. I think most editors here aren't even aware that this project exists (and when I asked you about it previously, you didn't mention it). – Jberkel 13:51, 23 July 2022 (UTC)
 * I have tried not to do anything big enough to involve the community, and have just made what I've felt are small common-sense changes. For example, changing the  in tables is the most common way to do tables in general on Wiktionary, so I had convinced myself that it had to be part of some general table formatting style guide and that tables that didn't do it were just incorrect. I will not touch tables in the future, unless something comes from your deliberations. Almost all of my other edits are very minor, and you can verify that easily by checking them out, if you are feeling suspicious. Kristian-Clausal (talk) 14:40, 23 July 2022 (UTC)
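
For readers following the thread, the difference between |-cells and !-cells discussed above can be illustrated with a minimal Python sketch. This is not Wiktextract's actual code: the function name `classify_cells` and the sample table are hypothetical, and the sketch assumes one simplified table layout (no cell attributes, no nested tables, no links containing pipes). It only shows that a parser can label a `!` cell as a header without guessing, while a header written as a `|` cell is indistinguishable from content.

```python
def classify_cells(wikitext_table):
    """Split a simplified wikitext table into rows of (kind, text) pairs.

    kind is "header" for cells written with "!" and "data" for cells
    written with "|". Hypothetical helper for illustration only.
    """
    rows = []
    current = []
    for raw in wikitext_table.splitlines():
        line = raw.strip()
        # Table start and caption lines carry no cells.
        if line.startswith("{|") or line.startswith("|+"):
            continue
        # Table end: stop reading.
        if line.startswith("|}"):
            break
        # Row separator: close the row collected so far.
        if line.startswith("|-"):
            if current:
                rows.append(current)
                current = []
            continue
        if line.startswith("!"):
            kind, separator = "header", "!!"
        elif line.startswith("|"):
            kind, separator = "data", "||"
        else:
            continue  # Anything else is ignored in this sketch.
        for cell in line[1:].split(separator):
            current.append((kind, cell.strip()))
    if current:
        rows.append(current)
    return rows


# Hypothetical sample: "first" is marked as a header, "second" is not,
# even though both play the same role for a human reader.
SAMPLE = """{| class="wikitable"
|-
! Person !! Singular
|-
! first
| amo
|-
| second
| amas
|}"""

for row in classify_cells(SAMPLE):
    print(row)
```

In the last row, "second" comes out as a data cell, so any tool reading the table would have to fall back on guessing (for example, by checking whether the text looks like a grammatical term) to recognise it as a row header.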

"The subst: quotation template is malformed and I can't figure out how to fix "
If you can't figure out how to fix it, ask on the BP or GP, or mark it for cleanup, instead of just deleting it. Jberkel 08:30, 15 February 2024 (UTC)


 * I'll keep that in mind. Kristian-Clausal (talk) 08:32, 15 February 2024 (UTC)