Example of always running crawl app

The following example shows a gocrawl application that reads its seeds from a database and, presumably, saves the harvested URLs back to that database as well, with, for example, a nextCrawl date field or a crawled flag initially set to false. It runs gocrawl indefinitely, limiting the number of seeds per host sent in each cycle.

package main

import (
	"log"
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

const (
	// You probably want to limit the number of hits you will continually make on 
	// the websites, so this allows some throttling. Adjust as required.
	SeedsLimitPerSource = 100
	ForeverLoopDelay    = 10 * time.Minute
)

type customExtender struct {
	*gocrawl.DefaultExtender
	// Possibly some additional fields, as required
}

// Omitted: overridden Extender methods, as required
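
// As one hedged possibility, a Visited override could mark the URL as
// crawled in the database, using the state data carried in the gocrawl.S
// map (gocrawl exposes it as the ctx.State field). The int64 database ID
// and the crawled flag are assumptions of this sketch, not part of gocrawl.
func (x *customExtender) Visited(ctx *gocrawl.URLContext, harvested interface{}) {
	if id, ok := ctx.State.(int64); ok {
		// Omitted: UPDATE the row for this id, e.g. set crawled to true
		// or push nextCrawl to a future date, and possibly INSERT the
		// harvested URLs as new rows to crawl in a later cycle.
		_ = id
	}
}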

var (
	ext = &customExtender{
		new(gocrawl.DefaultExtender),
	}
	crawler = gocrawl.NewCrawler(ext)
)

func main() {
	// Omitted: Open connection to the database, defer the close
	// Adjust the options as required, for example:
	crawler.Options.LogFlags = gocrawl.LogError
	loopForever()
}

func loopForever() {
	for {
		delay := time.After(ForeverLoopDelay)
		seeds := getNextSeeds()
		err := crawler.Run(seeds)
		if err != nil {
			log.Print("error crawling URLs: ", err)
		}
		<-delay
	}
}

func getNextSeeds() gocrawl.S {
	ret := make(gocrawl.S)
	/*
		Omitted: implementation that loops over a range of hosts and queries
		up to SeedsLimitPerSource URLs for each one (one hedged sketch follows
		this listing). The gocrawl.S type allows specifying each URL as an
		entry in a map[string]interface{}, where the key is the URL and the
		value is some state data associated with it (e.g. the ID of the URL
		in the database, or the whole struct representing the URL).
	*/
	return ret
}
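
For reference, here is a minimal sketch of what getNextSeeds could look like. It assumes a database/sql connection held in a package-level db variable (opened in main, as per the omitted step above, with "database/sql" added to the imports) and a hypothetical urls table with url, id, host and crawled columns; these names are illustrative only and do not come from gocrawl.

// Assumption of this sketch: db is opened in main with a driver of your choice.
var db *sql.DB

func getNextSeeds() gocrawl.S {
	ret := make(gocrawl.S)
	// Hypothetical query: fetch the uncrawled URLs, grouped by host. The
	// table and column names are assumptions of this sketch.
	rows, err := db.Query(
		"SELECT url, id, host FROM urls WHERE crawled = false ORDER BY host")
	if err != nil {
		log.Print("error querying seeds: ", err)
		return ret
	}
	defer rows.Close()

	// Enforce the per-host limit in code, since per-group LIMIT clauses
	// vary between database engines.
	counts := make(map[string]int)
	for rows.Next() {
		var rawurl, host string
		var id int64
		if err := rows.Scan(&rawurl, &id, &host); err != nil {
			log.Print("error scanning seed row: ", err)
			continue
		}
		if counts[host] < SeedsLimitPerSource {
			counts[host]++
			// The URL is the key; the database ID is the state data that
			// gocrawl hands back as ctx.State in the Extender methods.
			ret[rawurl] = id
		}
	}
	if err := rows.Err(); err != nil {
		log.Print("error iterating seed rows: ", err)
	}
	return ret
}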