These are crawler settings for Applefritter's search engine, The Wild Hunt. Please post your recommended changes to this thread, as well as any technical issues with the search engine.
This thread will not be treated like a typical forum discussion. It will be updated periodically as crawler settings change. Posts will be deleted as they become outdated or as their suggestions are integrated or declined.
Default settings, unless otherwise noted:
- unlimited depth for .*
- Treat documents loaded more than 1 day ago as stale and load them again. Younger documents are ignored.
- Do not delete any document before the crawl is started.
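As a rough illustration of the staleness default above, here is a minimal sketch in Python. It is not the crawler's own code; the names `MAX_AGE` and `is_stale` are made up for this example, and the only thing taken from the settings is the 1-day threshold.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the default above; the name MAX_AGE is hypothetical.
MAX_AGE = timedelta(days=1)

def is_stale(last_loaded, now, max_age=MAX_AGE):
    """Return True if the document was loaded longer ago than max_age."""
    return now - last_loaded > max_age

# Example timestamps are invented for illustration.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
loaded_25h_ago = now - timedelta(hours=25)
loaded_12h_ago = now - timedelta(hours=12)

print(is_stale(loaded_25h_ago, now))  # stale, would be loaded again
print(is_stale(loaded_12h_ago, now))  # fresh, would be ignored
```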
www.applefritter.com
- restrict to starting domains
- no query URLs
- document filter must not match: (.*/user/.*|.*/users/.*|.*/forum/.*|.*/taxonomy/.*|.*/page/.*)
asimov.applefritter.com
- restrict to starting domains
info-mac.applefritter.com
- restrict to starting domains
- document filter must not match: .*/_.*
download.info.applefritter.com
old.applefritter.com
- restrict to starting domains
jagubox.applefritter.com
- restrict to starting domains
www.reddit.com
- Restrict to sub-path(s)
- document filter must match: .*comments.*
lowendmac.com
- restrict to starting domains
- document filter must not match: (.*/page/.*|.*/tag/.*|.*/category/.*|.*/wp-json/.*)
www.mactech.com
- Restrict to sub-path(s)
- document filter must not match: .*/page/.*
tidbits.com
- Restrict to sub-path(s)
- document filter must not match: .*/page/.*
folklore.org
- restrict to starting domains
- document filter must not match: .*-index.html.*
emaculation.com
- restrict to starting domains
- crawler filter must not match: (.*\?do.*|.*macemucompatibilitysheet.*|.*/lib/exe.*)
- document filter must not match: .*viewforum.php.*
macintoshgarden.org/
- restrict to starting domains
- follow depth: 1
- document filter must match: (.*/games/.*|.*/apps/.*)
- restrict to starting domains
- crawl filter must match: (https://macintoshgarden.org/forum/.*|https://macintoshgarden.org/forums/.*)
- document filter must match: .*/forum/.*
macintoshrepository.org
- restrict to starting domains
- document filter must not match: (.*\?p.*|.*/mac-specs/.*|.*.php.*|.*/img/.*)
68kmla.org
- restrict to starting domains
- document filter must match: (.*index.php\?threads.*|.*/archive/*)
- not working
- Restrict to sub-path(s)
- document filter must match: .*topic.asp.*
floodgap.com
- restrict to starting domains
geekhack.org
- restrict to starting domains
- document filter must match: .*\?topic.*
reddit.com
- Restrict to sub-path(s)
- document filter must match: .*comments.*
multicians.org
- not working
bitsavers.org
- restrict to starting domains
- disallow query URLs
bytecollector.com
- restrict to starting domains
retrotechnology.com
- restrict to starting domains
history.computer.org/pioneers/
- Restrict to sub-path(s)
- crawler filter must not match: .*.pdf.*
vtda.org
- restrict to starting domains
forums.atariage.com
- restrict to starting domains
- no query URLs
- document filter must match: .*/topic/.*
minuszerodegrees.net
- restrict to starting domains
marc.info/?l=classiccmp
- crawler filter must match: .*l=classiccmp.*w=4.*
- document filter must match: .*l=classiccmp\&m=.*
forum.vcfed.org
- restrict to starting domains
- document filter must match: .*index.php\?threads.*
- not working
git.applefritter.com
- restrict to starting domains
- document filter must match: .*README.*
insidemacgames.com
- restrict to starting domains
- crawler filter must not match: .*/forum/.*
- document filter must not match: .*Page=.*
32by32.com/
- restrict to starting domains
- crawler filter must not match: .*32by32.com/19.*
- not working
textfiles.com
- restrict to starting domains
fidonet.applefritter.com
- restrict to starting domains
tinkerdifferent.com
- link-list
- restrict to starting domains
- document filter must match: .*/threads/.*
- crawler filter must not match: (.*/members/.*|.*/latest)
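For anyone who wants to test a suggested filter before posting it, here is a minimal Python sketch of how the two filter types in the list above interact, using the tinkerdifferent.com patterns. It assumes each pattern has to match the entire URL, and that a page the crawler filter rejects is never fetched and so never indexed; neither assumption is confirmed in this thread. The function names and sample URLs are invented for the example.

```python
import re

# Patterns copied from the tinkerdifferent.com entry.
DOCUMENT_MUST_MATCH = re.compile(r".*/threads/.*")
CRAWLER_MUST_NOT_MATCH = re.compile(r"(.*/members/.*|.*/latest)")

def may_crawl(url):
    """Crawler filter: may the URL be fetched and its links followed?"""
    # Assumption: the pattern must match the whole URL, not a substring.
    return CRAWLER_MUST_NOT_MATCH.fullmatch(url) is None

def may_index(url):
    """Document filter: may the fetched page be added to the index?"""
    # Assumption: a URL the crawler never fetches cannot be indexed.
    return may_crawl(url) and DOCUMENT_MUST_MATCH.fullmatch(url) is not None

# Hypothetical sample URLs, for illustration only.
samples = [
    "https://tinkerdifferent.com/forums/",
    "https://tinkerdifferent.com/threads/example-thread.123/",
    "https://tinkerdifferent.com/threads/example-thread.123/latest",
    "https://tinkerdifferent.com/members/someone.1/",
]
for url in samples:
    print(url, "crawl:", may_crawl(url), "index:", may_index(url))
```

Under these assumptions, a forum listing page is crawled for its links but not indexed, a thread page is both crawled and indexed, and member pages and /latest links are skipped entirely.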
Leafing through my bookmarks, here are some suggestions for sites to add, mostly blogs:
If 68kMLA is on the list, might want to add tinkerdifferent.com too.
That's another XenForo site. I'll need to figure out why those aren't crawling properly, first.
tinkerdifferent.com seems to be working. I have it crawling now. I think the difference between it and VCFed and 68kmla is that it uses clean URLs. I'm going to take a closer look at the latter two now.
Edit: scratch that, tinkerdifferent.com isn't working. I'll look at it some more. And if anybody else is interested in playing with this, I have a test engine set up for that purpose that you can have access to.
Edit 2: tinkerdifferent is done.
For the XenForo sites that don't use clean URLs, a /= is getting appended to URLs, like this:
https://forum.vcfed.org/index.php?threads/how-do-you-remove-a-ps-2-55sx-microchannel-slot-cover.1248201/=
I'm not sure why that's happening, but it causes the crawler to load the main index rather than the specific thread page.
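To make the problem concrete, here is a small Python sketch that spots the stray suffix and removes it. This is only a way to check URLs by hand; it does not explain why the crawler appends /= and is not a setting the search engine offers. The function name is made up for the example.

```python
def strip_trailing_slash_equals(url):
    """Remove a stray '=' after a trailing '/', keeping the slash itself."""
    if url.endswith("/="):
        return url[:-1]
    return url

# URL as reported above, with the unwanted suffix.
broken = (
    "https://forum.vcfed.org/index.php?threads/"
    "how-do-you-remove-a-ps-2-55sx-microchannel-slot-cover.1248201/="
)
fixed = strip_trailing_slash_equals(broken)
print(fixed)
```

URLs without the suffix are returned unchanged, so running the function over a whole list of crawled URLs should be safe.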
Will the index used by the search tool on this site ("Search this site") be updated as well?
It doesn't return any results newer than 2020.