- Can now specify a range for the number of words to group together when grouping.
- Specify multiple grouping characters.
- Fixed accidental concatenation of words when stripping HTML tags.
- Grouping words together.
- Added Docker support.
- Updated the parser so that it looks at the content on all pages which are returned, not just those with a 200 return code.
- Added the --allowed parameter to limit crawling to URLs matching the passed RegEx. Work done by 5p1n.
- Added the --lowercase parameter to convert all letters to lower case.
- Added the --convert-umlauts parameter to convert Latin-1 umlauts (e.g. "ä" to "ae", "ö" to "oe", etc.).
- Added the --with-number parameter to make words include letters and numbers.
- Merged an update to change the way usage instructions are shown.
- Updated instructions on installing gems.
- Updated README.
- A line to add a / to the end of the URL had been commented out. I don't remember why it was done but I'm putting it back in. See issue 26.
- Steven van der Baan added the ability to hit ctrl-c and keep the results so far.
- Added the ability to handle non-standard port numbers.
- Added lots more debugging and a new --debug parameter.
- Added the command line argument --header (-H) to allow headers to be passed in.
- Parameters are specified in name:value pairs and you can pass multiple.
Loads of changes including:
- Code refactoring by @g0tmi1k
- Internationalisation - should now handle non-ASCII sites much better
- Found more ways to pull words out of JavaScript content and other areas that aren't normal HTML
- Lots of little bug fixes
- Added the GPL-3+ licence to allow inclusion in Debian.
- Added a Gemfile to make installing gems easier.
- Adds proxy support from the command line and the ability to pass in credentials for both basic and digest authentication.
- A few other smaller bug fixes as well.
CeWL now sorts the words found by count and optionally (new --count argument) includes the word count in the output. I've left the words in the case they have on the pages, so "Product" is different to "product". I figure that if the list is being used for password generation then the case may be significant, so I let the user strip it if they want to. There are also more improvements to the stability of the spider in this release.
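As a rough illustration of case-sensitive counting and sorting by count, here is a minimal sketch. It is not CeWL's actual code: the `count_words` helper, the tie-breaking rule, and the printed format are assumptions made for the example.

```ruby
# Hypothetical helper: count words case-sensitively, most frequent first.
def count_words(text, min_length = 3)
  counts = Hash.new(0)
  # Keep the original case, so "Product" and "product" are separate entries.
  text.scan(/[[:alpha:]]+/) do |word|
    counts[word] += 1 if word.length >= min_length
  end
  # Sort by descending count; ties are broken alphabetically (an assumption).
  counts.sort_by { |word, count| [-count, word] }
end

sample = "Product pages list each product. Product names vary."
count_words(sample).each { |word, count| puts "#{word}, #{count}" }
```

Anyone who does not care about case could downcase the text before counting.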
By default, CeWL sticks to just the site you have specified and will go to a depth of 2 links; this behaviour can be changed by passing arguments. Be careful if setting a large depth and allowing it to go offsite, as you could end up drifting on to a lot of other domains. All words of three characters and over are output to stdout. This length can be increased, and the words can be written to a file rather than the screen so the app can be automated.
Fixes a pretty major bug that I found while fixing a smaller bug for @yorikv. The bug was related to a hack I had to put in place because of a problem I was having with the spider. While I was looking into it I spotted this line, which is the one that the spider uses to find new links in downloaded pages:
web_page.scan(/href="(.*?)"/i).flatten.map do |link|
This is fine if all the links look like this:
<a href="test.php">link</a>
But if the link looks like either of these:
<a href='test.php'>link</a>
<a href=test.php>link</a>
The regex will fail so the links will be ignored.
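The difference is easy to check in isolation. The sketch below runs the regex quoted above against the three link styles; the `HREF_REGEX` constant and the `pages` array are names chosen here for illustration only.

```ruby
# The link-finding regex quoted above.
HREF_REGEX = /href="(.*?)"/i

pages = [
  %q(<a href="test.php">link</a>), # double-quoted attribute
  %q(<a href='test.php'>link</a>), # single-quoted attribute
  %q(<a href=test.php>link</a>)    # unquoted attribute
]

# Collect whatever the regex captures from each snippet.
results = pages.map { |html| html.scan(HREF_REGEX).flatten }

pages.zip(results).each do |html, links|
  puts "#{html} => #{links.inspect}"
end
```

Only the double-quoted form yields a match; the other two produce empty arrays.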
To fix this up I've had to override the function that parses the page to find all the links. Rather than use a regex, I've changed it to use Nokogiri, which is designed to parse a page looking for links rather than just running through it with a custom regex. This brings in a new dependency but I think it is worth it for the fix to the functionality. I also found another bug where a link like this:
<a href='#name'>local</a>
which should be ignored as it just links to an internal name, was actually being translated to '/#name', which may unintentionally mean referencing the index page. I've fixed this one as well after a lot of debugging to find how best to do it.
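One way to picture the intended behaviour for fragment-only links is the sketch below. It is not the fix that went into CeWL: `resolve_link` is a hypothetical helper, the example.com URLs are placeholders, and it uses Ruby's standard `uri` library rather than the spider's own code.

```ruby
require "uri"

# Hypothetical helper: resolve an href against the page it was found on,
# skipping links that only point at a name inside the same page.
def resolve_link(page_url, href)
  return nil if href.start_with?("#") # internal anchor, nothing new to fetch
  URI.join(page_url, href).to_s
end

base = "http://example.com/docs/index.html"
puts resolve_link(base, "test.php").inspect
puts resolve_link(base, "#name").inspect
```

Returning `nil` for the anchor keeps it out of the queue instead of turning it into a request for the index page.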
A final addition is to allow a user to specify a depth of 0 which allows CeWL to spider a single page.
I'm only putting this out as a point release as I'd like to rewrite the spidering to use a better spider; that will come out as the next major release.
The main change in version 4.0/1 is the upgrade to run with Ruby 1.9.x. This has been tested on various machines and on BT5, as that is a popular platform for running it, and it appears to run fine. Another minor change: up to version 4, all HTML tags were stripped out before the page was parsed for words, which meant that text in alt and title attributes was missed. I now grab the text from those attributes before stripping the HTML to give those extra few words.
Addresses a problem spotted by Josh Wright. The Spider gem doesn't handle JavaScript redirection URLs, for example an index page containing just the following:
<script language="JavaScript">
self.location.href =
'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
</script>
wasn't spidered because the redirect wasn't picked up. I now scan through a page looking for any lines containing location.href= and then add the given URL to the list of pages to spider.
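The scan described above might look something like the following sketch. The `js_redirect_urls` helper and its regex are assumptions for illustration, not the code used in CeWL; in particular, this version tolerates whitespace and a line break around the `=`, as in the example page.

```ruby
# Hypothetical helper: pull URLs out of JavaScript location.href assignments.
def js_redirect_urls(page)
  # \s* also matches newlines, so the URL may sit on the line after the "=".
  page.scan(/location\.href\s*=\s*["']([^"']+)["']/i).flatten
end

page = <<~HTML
  <script language="JavaScript">
  self.location.href =
  'http://www.FOO.com/FOO/connect/FOONet/Top+Navigator/Home';
  </script>
HTML

# Each URL found would be added to the list of pages to spider.
js_redirect_urls(page).each { |url| puts url }
```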
Version 2 of CeWL can also create two new lists: a list of email addresses found in mailto links and a list of author/creator names collected from metadata found in documents on the site. It can currently process documents in pre-2007 Office, Office 2007 and PDF formats. This user data can then be used to create the list of usernames to be used in association with the password list.