Wiktionary talk:Searchable external archives

Portuguese
Could someone please make a new page on the Portuguese Wiktionary, move the reference to Biblio there, and add an interwiki link? DAVilla 14:18, 24 January 2008 (UTC)

German
Likewise for the periodicals. DAVilla 15:56, 6 January 2009 (UTC)

Policy
Amazingly, or shockingly, this seems to be the closest we have to defining what a "durably archived source" is. I mean the question I always think of on hearing this phrase is "Durably? How durably?" Mglovesfun (talk) 12:34, 18 January 2010 (UTC)


 * There is no real policy, but we could try to backform one that is both simple and consistent with our current philosophy. I suggest: for something to be "durably" archived, it has to be archived in three independent locations, where "archived" implies stored with a view to keeping it stored as long as possible, and "independent" implies not in the same place or operated/funded by the same people/organisations. The majority of the internet does not fit this, as it is only archived by http://archive.org; however, a case can be made for Wikipedia, which is downloaded by researchers worldwide. Usenet is a harder issue: many of the companies that used to archive it have disappeared, perhaps prompting the arrival of Google Groups (see their announcement). I think it's currently being taken on trust that Usenet is archived enough; besides, like Wikipedia, Usenet posts are very widely distributed, so it's quite likely that copies of messages are stored in smaller private places worldwide. For most periodicals and printed works, there are national libraries and private collections. It would be interesting to see if someone could use this definition to include something that would normally be excluded, or vice versa. Conrad.Irwin 13:51, 18 January 2010 (UTC)
 * Well, what about printed sources? Everything you've said seems to refer to online stuff, which is good, but do we accept any printed source as durable? Your three separate locations idea would work well with this, as clearly any decent book is going to be in print more than three times! Mglovesfun (talk) 14:24, 18 January 2010 (UTC)
 * " For most periodicals and printed works, there are national libraries and private collections.", I didn't think there was much to say about them :). I went for three to tie into our "three independent cites", but the number was arbitrary. Conrad.Irwin 14:34, 18 January 2010 (UTC)


 * Whatever policy or guideline we choose, accepting three quotations from the complete world wide web seems undesirable. That is a really all-inclusive standard. The number 3 is rather low; that is why it is accompanied by the stricter requirement of "durably archived sources", in CFI under the heading of "permanently recorded media". --Dan Polansky 16:02, 19 May 2010 (UTC)


 * I think we could stand to be more honest about the reasons for our having such additional filters. If durability were the only issue, then any redundantly-archived wiki would be acceptable.  Yet we do not allow citations from wiki revisions, or AFAIK from other highly durable archives like the Enron corpus.  A primary reason for this would seem to be editorial oversight -- which would also argue against allowing self-published books or Usenet posts, but in favor of third-party-published e-books and e-journals, as well as things like RFCs and open-source software documentation.  (Printed collections of correspondence are an interesting case as well...)  Durability seems to be necessary, but not really sufficient, for a citation to be considered adequate. -- Visviva 16:08, 24 May 2010 (UTC)


 * Why exactly is the Web undesirable? The only objection that I can think of is possible manipulation in the case of anonymously created dynamic content (e.g. blog or forum posts). But these themselves are often the only sources of popular slang or vulgar terms. I think that we should openly support all Web sources if the author is signed (especially in the case of online newspapers, or personal websites or blogs of already established authors), and anonymous sources only in certain special cases. "Durability" is a completely irrelevant point IMHO, if the citation is provided and verified. --Ivan Štambuk 17:06, 24 May 2010 (UTC)


 * Someone (Conrad?) already said this elsewhere, but I just want to point out again that, even if Google Books is taken down some day (the legality of their having scanned and republished bits of the books is in question), the books and magazines themselves are "durably archived" because they were published in book form. Equinox ◑ 14:51, 22 May 2010 (UTC)
 * Exactly. --Dan Polansky 13:40, 24 May 2010 (UTC)


 * However, there are some books on b.g.c. from the past decade which have never been published except in electronic form. These can be difficult to sniff out, though the page layout is frequently a tipoff. -- Visviva 16:08, 24 May 2010 (UTC)

See also Votes/pl-2007-12/Attestation criteria and its talk page in particular. --Dan Polansky 08:56, 3 June 2010 (UTC)

Confusing
The three headers are electronic, online and offline. The offline section contains a link, and how are the first two different? Anyone have a clue? Mglovesfun (talk) 23:23, 18 March 2011 (UTC)

Google fail for Usenet searching on Google Groups
Google formerly had an "advanced search" function for Google Groups, which allowed us to locate stuff from Usenet a lot more reliably. We could put in start/end date ranges, and chronological ordering was more consistent.

See for example this 2004 post where this is mentioned:
 * Shannon Bauman
 * Associate Product Manager, Google Groups
 * P.S. In related news, we have restored advanced date search to Google Groups.

So apparently it was lost sometime in 2004 and brought back. Could the same be happening in 2014? I don't know when the feature was introduced or when it was removed, but I did come across complaints about its absence from late 2013, so I would estimate it went away sometime last year.

With Google having bought Deja News, is there any way to do advanced search functions for Usenet content anymore? Or is this resource lost to us? Without being able to do specific chronological searches, we end up resorting to randomly looking for the oldest uses of certain terms like "trolling".

I came across https://support.google.com/groups/answer/2371405?rd=1 which lists some search operators that I would like to try out, though I am not sure whether they all work. Of primary interest:
 * before:YYYY/MM/DD - Include only messages created before the YYYY/MM/DD date. For example, before:2011/11/02

From what I can tell this works, although I think you need to put in the full month/day. When I cut those off and just did the year, it didn't appear to function. The easiest solution if you want prior to a year is to add 1/1 for January 1. So if I did before:1990/1/1 I would get results from 1989 and earlier.

Under Template talk:google a 2008 discussion mentions that it may also be possible to add a string to searches which restricts the result to Usenet as opposed to searching non-Usenet results also consolidated under Google Groups.

Does anyone know if Wiktionary has any features or tools that would simplify this process for us?

I also found two videos from 2012 which show the Advanced Search feature, recorded before it got taken away:
 * July by flippedlearning
 * October by TempusNova

These are the types of assets I would like Wiktionary editors to have access to once more. Is it possible that any browser extensions could perform these functions for us? Etym (talk) 14:51, 11 August 2014 (UTC)


 * Just realized I did not give a clear example of how this works. As a base, if you do a basic search for "Wiktionary" under Google Groups you get:
 * https://groups.google.com/forum/#!search/wiktionary
 * This returns posts from recent years, seemingly in no particular order; I'm not sure what actually determines the order of the results. Basically, you include the after/before string as a search parameter. Now if I want to exclude recent years, since Wiktionary began in 2002, I might add a before string and amend the URL to be:
 * https://groups.google.com/forum/#!search/wiktionary$20before$3A2004$2F01$2F01
 * By doing before 1 January 2004, I limit it to 2002 and 2003 (mostly 2003) posts. I imagine you could also stack before/after strings to get a range. I'm mostly interested in 'before' to find earlier mentions of words. Etym (talk) 05:27, 20 September 2014 (UTC)
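
The URL construction described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `groups_search_url` is made up here, the `$XX` escaping is inferred from the example URL above (it looks like ordinary percent-encoding with `%` swapped for `$`), and the `after:` operator and the stacking of both operators are untested guesses. It reflects the 2014 URL scheme and may not match how Google Groups behaves now.

```python
from urllib.parse import quote

# Base of the search URLs quoted in the thread above (2014-era scheme).
BASE = "https://groups.google.com/forum/#!search/"

def groups_search_url(term, before=None, after=None):
    """Build a Google Groups search URL with optional date operators.

    `before` and `after` are strings in YYYY/MM/DD form. A full date
    seems to be required; a bare year did not appear to work.
    Hypothetical helper, not an official Google or Wiktionary tool.
    """
    parts = [term]
    if after:
        parts.append("after:" + after)    # assumed to mirror before:
    if before:
        parts.append("before:" + before)
    query = " ".join(parts)
    # Percent-encode everything, then swap "%" for "$" to match the
    # escaping seen in the example URL (an inference, not documented).
    return BASE + quote(query, safe="").replace("%", "$")

print(groups_search_url("wiktionary", before="2004/01/01"))
```

To search for everything before a given year, pass 1 January of that year, for example `before="1990/1/1"` to get results from 1989 and earlier.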

Non-durable websites, removed
I removed these because they are actively misleading people (on numerous occasions, people have cited news websites in RFV and we've had to explain that when there's no way to tell whether or not an article appeared in print, the news website is useless) and because we now have good engines for searching printed media, anyway, in the form of Issuu and to a lesser extent Google Scholar.


 * not spider friendly
 * British Broadcasting Corporation
 * Official Vatican website


 * Computers: Government Computer News www.gcn.com/print/
 * Computers:IT Managing Information Strategies www.misweb.com/magarticle.asp
 * Computers:OS:Linux LinuxInsider www.linuxinsider.com/story/
 * recent articles free, older also searchable but full article at a fee
 * United States: Onion www.theonion.com
 * United States, California, central: Modesto Bee
 * United States, Illinois, Chicago: Chicago Sun-Times www.suntimes.com
 * United States, Massachusetts, Boston: Boston Globe www.boston.com
 * United States, New York: Northern New York Historical Newspapers
 * United States, Washington D.C.: Washington Post www.washingtonpost.com
 * NPR npr.org
 * CNN.com <tt>edition.cnn.com rss.cnn.com www.cnn.com</tt>
 * AOL <tt>news.aol.com</tt>
 * Yahoo! <tt>news.yahoo.com</tt>
 * Google

- -sche (discuss) 18:18, 2 August 2015 (UTC)

I also removed this from the section on YouTube: "In rare cases the clips are professional productions whose publishers have real, non-virtual contact information. Those special cases can be cited, but they are not considered durably archived on those grounds alone." YouTube isn't durably archived, so even if a professional outlet is publishing its videos on that platform, the videos aren't durably archived unless the media is also published somewhere else that is durable, in which case the advice to "cite the original [/other, durable] source" applies. - -sche (discuss) 18:39, 2 August 2015 (UTC)


 * If we're really at the point of rejecting words because they are "only" attested on sites like the BBC and NPR, that's really unfortunate and a sign of how ludicrous the "durably archived" criterion has become in practice. But it's certainly the case that our project pages should reflect current practice, however broken that practice may be, so thank you for updating. -- Visviva (talk) 21:09, 2 August 2015 (UTC)


 * In the years since my initial post above, some of these may have become acceptable. Someone with more time than I have right now might compile a new set of acceptable online news sites to add 'back' to the list. - -sche (discuss) 23:49, 7 March 2023 (UTC)

Quiet Quentin
Some part of this should be added here, maybe a quick "hey, there's this thing":

Quotation gadgets
A gadget called Quiet Quentin is available. This gadget allows users to search Google Books for a term, and it creates quotations formatted to Wiktionary’s standards. A modified version of Quiet Quentin also exists which formats quotations using templates, as described at this Grease Pit discussion.


 * Good idea. - -sche (discuss) 23:50, 7 March 2023 (UTC)