Skip to content

noahhaon/scrapi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

== NOTES

This is a fork of aasaf's scrapi-1.2.0 gem.  The intent of the fork was to
support handling of cookies.  

You may pass in your cookie string in the Scrapi.scrape options hash like
so:

scraper.scrape(url,	:user_agent => 'Mosaic for Amiga/1.2', 
			:cookies => 'chocolate_chip=1; peanut_butter=5;' )




== ScrAPI toolkit for Ruby

A framework for writing scrapers using CSS selectors and simple
select => extract => store processing rules.

Here’s an example that scrapes auctions from eBay:

  ebay_auction = Scraper.define do
    process "h3.ens>a", :description=>:text,
                        :url=>"@href"
    process "td.ebcPr>span", :price=>:text
    process "div.ebPicture >a>img", :image=>"@src"

    result :description, :url, :price, :image
  end

  ebay = Scraper.define do
    array :auctions

    process "table.ebItemlist tr.single",
            :auctions => ebay_auction

    result :auctions
  end

And using the scraper:

  auctions = ebay.scrape(html)

  # No. of auctions found
  puts auctions.size

  # First auction:
  auction = auctions[0]
  puts auction.description
  puts auction.url


To get the latest source code with regular updates:

git clone [email protected]:noahhaon/scrapi.git


== Using TIDY

By default scrAPI uses Tidy to cleanup the HTML.

You need to install the Tidy Gem for Ruby:
  gem install tidy

And the Tidy binary libraries, available here:

  http://tidy.sourceforge.net/

By default scrAPI looks for the Tidy DLL (Windows) or shared library (Linux) in the directory lib/tidy. That's one place to place the Tidy library.

Alternatively, just point Tidy to the library with:

  Tidy.path = "...."

On Linux this would probably be:

  Tidy.path = "/usr/local/lib/libtidy.so"

On OS/X this would probably be:

  Tidy.path = “/usr/lib/libtidy.dylib”

For testing purposes, you can also use the built in HTML parser. It's useful for testing and getting up to grabs with scrAPI, but it doesn't deal well with broken HTML. So for testing only:

  Scraper::Base.parser :html_parser


== License

Copyright (c) 2006 Assaf Arkin, under Creative Commons Attribution and/or MIT License

Developed for http://co.mments.com

Code and documention: http://labnotes.org

HTML cleanup and good hygene by Tidy, Copyright (c) 1998-2003 World Wide Web Consortium.
License at http://tidy.sourceforge.net/license.html

HTML DOM extracted from Rails, Copyright (c) 2004 David Heinemeier Hansson. Under MIT license.

HTML parser by Takahiro Maebashi and Katsuyuki Komatsu, Ruby license.
http://www.jin.gr.jp/~nahi/Ruby/html-parser/README.html

About

fork of scrapi 1.2.0

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages