User talk:Equinox/code/ExtractBookWords

Hello,

I used this, changing the line if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')) to if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || (ch == 'æ') || (ch == 'ø') || (ch == 'å') || (ch == 'Æ') || (ch == 'Ø') || (ch == 'Å')), yet the program still throws out those extra letters. It works fine apart from that. Perhaps you can tell me what I've done wrong? Gamren (talk) 14:00, 5 October 2016 (UTC)


 * My immediate thought is that the program uses File.ReadAllText(string) and not File.ReadAllText(string, Encoding): if your file contains complex characters, you might need to specify the encoding, such as UTF-8 or Windows-Latin-1. That's just a guess. Since "throws out" in your comment might mean either "discards" or "outputs" (isn't English dumb?), I don't really understand what problem you are having. Equinox ◑ 20:00, 5 October 2016 (UTC)
 * Thank you! UTF8 didn't work for some reason, so I used

string s = File.ReadAllText(INPUT_FILE, Encoding.UTF7);
 * which worked. I am a complete newbie to C#, or C in general. By "throws out" I meant "discards". Gamren (talk) 06:26, 6 October 2016 (UTC)
 * By the way, how would you recommend easily getting books in text format, apart from Gutenberg? Gamren (talk) 09:20, 6 October 2016 (UTC)
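
On the encoding question above, here is a minimal sketch of how the choice of encoding can affect the letter test. It rests on an assumption that is not confirmed in this thread: that the input file was saved in a single-byte encoding such as Latin-1 rather than UTF-8, which is one possible reason UTF8 "didn't work". The class name EncodingSketch, the method ExtractWords, and the sample text are made up for illustration and are not taken from the ExtractBookWords program. The sketch also uses char.IsLetter in place of listing æ, ø, å by hand, which is a swapped-in technique rather than what the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class EncodingSketch
{
    // Splits text into words, treating any Unicode letter
    // (so also æ, ø, å and their capitals) as part of a word.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        foreach (char ch in text)
        {
            if (char.IsLetter(ch))
            {
                current.Append(ch);
            }
            else if (current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }
        return words;
    }

    static void Main()
    {
        // Assumption: the book file is Latin-1 (ISO-8859-1), not UTF-8.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Hypothetical sample: "blåbær og rødgrød", written with escapes so the
        // example does not depend on the encoding of this source file.
        string sample = "bl\u00E5b\u00E6r og r\u00F8dgr\u00F8d";
        string path = Path.GetTempFileName();
        File.WriteAllText(path, sample, latin1);

        // Decoding Latin-1 bytes as UTF-8 turns the invalid bytes into
        // U+FFFD, which is not a letter, so words get cut at those points.
        string asUtf8 = File.ReadAllText(path, Encoding.UTF8);
        // Decoding with the encoding the file was written in keeps the letters.
        string asLatin1 = File.ReadAllText(path, latin1);

        Console.WriteLine(string.Join("|", ExtractWords(asUtf8)));
        Console.WriteLine(string.Join("|", ExtractWords(asLatin1)));

        File.Delete(path);
    }
}
```

If the file really is in a single-byte encoding, naming that encoding explicitly is probably a safer choice than Encoding.UTF7, which is marked obsolete in newer versions of .NET.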


 * I don't really know any other (legal!) sources. Equinox ◑ 16:11, 7 October 2016 (UTC)


 * If copyright is the issue, how about if one changed the sequence of words, e.g. through alphabetization? Surely that would not constitute infringement? Gamren (talk) 09:40, 8 October 2016 (UTC)


 * You're using the copyrighted work to create another work, so I think that counts as a "derivative work" or something. IANAL. Equinox ◑ 09:51, 8 October 2016 (UTC)


 * I mean, I don't think that generating a word list from a typical novel etc. is a problem, but I thought you were asking where to get hold of computerised copies of books that are still in copyright. You'd have to go to illegal torrents etc. (or maybe hack Amazon Kindle's DRM!). Equinox ◑ 09:53, 8 October 2016 (UTC)