Wild Hunt Crawler settings

Tom Owad's picture
Offline
Last seen: 1 day 2 hours ago
Joined: Dec 16 2003 - 15:14
Posts: 3379

These are crawler settings for Applefritter's search engine, The Wild Hunt. Please post your recommended changes to this thread, as well as any technical issues with the search engine.

This thread will not be treated like a typical forum discussion. It will be updated periodically as crawler settings change. Posts will be deleted as they become outdated or as their suggestions are integrated or declined.

 

Default settings (apply to every site below unless otherwise noted)

  • unlimited depth for .*
  • Treat documents that were loaded more than 1 day ago as stale and load them again. Younger documents are ignored.
  • Do not delete any document before the crawl is started.
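
The staleness rule above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the crawler's actual code: the name `should_reload`, the treatment of never-loaded documents, and the handling of a document loaded exactly one day ago are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold, taken from the "1 day" setting in the list above.
STALE_AFTER = timedelta(days=1)

def should_reload(last_loaded: Optional[datetime], now: datetime,
                  stale_after: timedelta = STALE_AFTER) -> bool:
    """Return True if a document should be fetched again."""
    if last_loaded is None:
        # Assumption: a document that was never loaded is always fetched.
        return True
    # Stale documents are reloaded; younger ones are ignored.
    return now - last_loaded >= stale_after

if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0)  # hypothetical crawl time
    print(should_reload(datetime(2024, 1, 1, 11, 0), now))  # loaded 25 hours earlier
    print(should_reload(datetime(2024, 1, 2, 6, 0), now))   # loaded 6 hours earlier
```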
 

www.applefritter.com

  • restrict to starting domains
  • no query URLs
  • document filter must not match: (.*/user/.*|.*/users/.*|.*/forum/.*|.*/taxonomy/.*|.*/page/.*)
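
For anyone who wants to check a suggested change against these rules before posting it, here is a minimal Python sketch. It assumes the crawler applies the pattern as a full match against the whole URL, which this thread does not confirm; the function name `passes_filters` and all sample URLs are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Pattern copied from the "document filter must not match" setting above.
DOCUMENT_MUST_NOT_MATCH = re.compile(
    r"(.*/user/.*|.*/users/.*|.*/forum/.*|.*/taxonomy/.*|.*/page/.*)"
)

def passes_filters(url: str) -> bool:
    """Return True if the URL survives both listed rules."""
    if urlsplit(url).query:
        # "no query URLs": anything with a ?query part is skipped.
        return False
    # Assumption: full-match semantics against the entire URL.
    return DOCUMENT_MUST_NOT_MATCH.fullmatch(url) is None

if __name__ == "__main__":
    samples = [  # hypothetical URLs, for illustration only
        "https://www.applefritter.com/content/example-article",
        "https://www.applefritter.com/forum/example-board",
        "https://www.applefritter.com/forum",
        "https://www.applefritter.com/users/example",
        "https://www.applefritter.com/node/1?page=2",
    ]
    for url in samples:
        print(passes_filters(url), url)
```

Note that each alternative requires a slash on both sides of the path segment, so under these assumptions a URL ending in a bare `/forum` is not excluded while `/forum/example-board` is.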
 

asimov.applefritter.com

 

info-mac.applefritter.com

 

old.applefritter.com

 

jagubox.applefritter.com

 

www.reddit.com

 

lowendmac.com

 

folklore.org

 

emaculation.com

 

macintoshgarden.org/

 

macintoshrepository.org

 

68kmla.org

 

floodgap.com

 

geekhack.org

 

multicians.org

 

bitsavers.org

 

bytecollector.com

 

retrotechnology.com

 

history.computer.org/pioneers/

 

vtda.org

 

forums.atariage.com

 

minuszerodegrees.net

 

marc.info/?l=classiccmp

 

forum.vcfed.org

 

git.applefritter.com

 

insidemacgames.com

 

32by32.com/

  • restrict to starting domains
  • crawler filter must not match: .*32by32.com/19.*
  • not working
 

fidonet.applefritter.com

 

tinkerdifferent.com

  • link-list
  • restrict to starting domains
  • document filter must match: .*/threads/.*
  • crawler filter must not match: (.*/members/.*|.*/latest)
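
The interplay of the two tinkerdifferent.com filters can be sketched like this. It assumes that the crawler filter decides which URLs are fetched at all, that the document filter decides which fetched pages end up in the index, and that both are applied as full matches; none of that is spelled out in this thread. The function names and sample URLs are hypothetical.

```python
import re

# Patterns copied from the tinkerdifferent.com settings above.
DOCUMENT_MUST_MATCH = re.compile(r".*/threads/.*")
CRAWLER_MUST_NOT_MATCH = re.compile(r"(.*/members/.*|.*/latest)")

def should_fetch(url: str) -> bool:
    """Assumed meaning of the crawler filter: skip URLs that match it."""
    return CRAWLER_MUST_NOT_MATCH.fullmatch(url) is None

def should_index(url: str) -> bool:
    """Assumed meaning of the document filter: index only matching pages."""
    return should_fetch(url) and DOCUMENT_MUST_MATCH.fullmatch(url) is not None

if __name__ == "__main__":
    samples = [  # hypothetical URLs, for illustration only
        "https://tinkerdifferent.com/threads/example-thread.123/",
        "https://tinkerdifferent.com/threads/example-thread.123/latest",
        "https://tinkerdifferent.com/members/example.1/",
        "https://tinkerdifferent.com/forums/example.5/",
    ]
    for url in samples:
        print(f"fetch={should_fetch(url)} index={should_index(url)} {url}")
```

Under these assumptions, forum listing pages are still fetched (so their links can be followed) but only thread pages are indexed.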

     

Offline
Last seen: 1 month 3 weeks ago
Joined: Dec 20 2003 - 10:38
Posts: 70

Leafing through my bookmarks, here are some suggestions for sites to add, mostly blogs:

 

  • https://oldvcr.blogspot.com/
  • http://www.ysflight.com/
  • https://www.leadedsolder.com/
  • https://www.bigmessowires.com/
  • https://modelrail.otenko.com/
  • http://basalgangster.macgui.com/
  • https://eggfreckles.net/
  • https://tenfourfox.blogspot.com/
  • https://appletothecore.me/
  • https://preserve.mactech.com/
  • https://www.benshoof.org/blog/
  • https://oldcrap.org/
  • https://retroviator.com/
  • https://whatisthe2gs.apple2.org.za/
Dr. Webster's picture
Online
Last seen: 1 hour 9 min ago
Joined: Dec 19 2003 - 17:34
Posts: 1753

If 68kMLA is on the list, might want to add tinkerdifferent.com too.

Tom Owad's picture
Offline
Last seen: 1 day 2 hours ago
Joined: Dec 16 2003 - 15:14
Posts: 3379
Dr. Webster wrote:

If 68kMLA is on the list, might want to add tinkerdifferent.com too.

That's another XenForo site. I'll need to figure out why those aren't crawling properly first.

Tom Owad's picture
Offline
Last seen: 1 day 2 hours ago
Joined: Dec 16 2003 - 15:14
Posts: 3379

tinkerdifferent.com seems to be working. I have it crawling now. I think the difference between it and VCFed and 68kmla is that it uses clean URLs. I'm going to take a closer look at the latter two now.

Edit: scratch that, tinkerdifferent.com isn't working. I'll look at it some more. And if anybody else is interested in playing with this, I have a test engine set up for that purpose that you can have access to.

Edit 2: tinkerdifferent is done.

Tom Owad's picture
Offline
Last seen: 1 day 2 hours ago
Joined: Dec 16 2003 - 15:14
Posts: 3379

For the Xenforo sites that don't use clean URLs, a /= is getting appended to URLs, like this:

https://forum.vcfed.org/index.php?threads/how-do-you-remove-a-ps-2-55sx-microchannel-slot-cover.1248201/=

I'm not sure why that's happening, but it causes the crawler to load the main index rather than the specific thread page.
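
One way a stray `=` can end up after the trailing slash, offered purely as a hypothesis and not as a diagnosis of this crawler: the thread path sits in the query string without any `=`, so a normalizer that parses the query into key/value pairs and writes it back out would treat the whole path as a key with an empty value. The sketch below reproduces that effect in Python; `naive_normalize` is a hypothetical helper, not part of any crawler.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

def naive_normalize(url: str) -> str:
    """Hypothetical normalizer that rebuilds the query as key=value pairs."""
    parts = urlsplit(url)
    # A query with no "=" becomes a single key with an empty value.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Writing every pair back as key=value adds the trailing "=".
    query = "&".join(f"{key}={value}" for key, value in pairs)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

if __name__ == "__main__":
    original = ("https://forum.vcfed.org/index.php?threads/"
                "how-do-you-remove-a-ps-2-55sx-microchannel-slot-cover.1248201/")
    print(naive_normalize(original))
```

If something like this is the cause, it would also fit the observation that the clean-URL sites are unaffected, since their thread path is not in the query string.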

Offline
Last seen: 6 hours 9 min ago
Joined: Feb 27 2021 - 18:59
Posts: 554
Site index

Will the index used by the search tool on this site ("Search this site") be updated as well?

It doesn't return any results newer than 2020.
