Prettify function #67
This already looks good, except that there is no ordering by match length and no ordering of the match-all clause. I don't know whether that's the case here, but a parser might match "User-agent: *" if it is not the last entry, even when a more specific entry like "User-agent: GoogleBot" exists. "User-agent: *" should always be the last entry in the generated file. The same may hold for the paths: I doubt a parser matches "/themes/flowers" if "/themes" comes first. So the entries within the User-agent blocks should be ordered with the longest matching paths first and the shorter ones later. For example:
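A minimal sketch of that ordering (the concrete blocks and paths here are my own illustration, not the original example): the specific user agent comes before the match-all block, and longer paths come before shorter ones.

    # Illustrative only
    User-agent: GoogleBot
    Disallow: /themes/flowers
    Disallow: /themes

    User-agent: *
    Disallow: /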
I quickly checked the Google and Yandex specs. They say that more specific paths override more general ones. However, there may be other, less smart parsers out there. So the ordering has two use cases: satisfying less smart parsers and making the file more readable for humans.
Implemented.

    // Disable
    //$this->convert('self::prepareRegexRule');
    $robots = $parser->getRules();
    $output = array();
    // Make the match-all clause the last entry
    if ( array_key_exists( '*', $robots ) ) {
        $match_all = $robots['*'];
        unset( $robots['*'] );
        $robots['*'] = $match_all;
    }
    foreach ( $robots as $ua => $rules ) {
        $output[] = 'User-agent: ' . $ua;
        foreach ( $rules as $name => $value ) {
            // Not multibyte-safe
            $nice_name = ucfirst( $name );
            if ( is_array( $value ) ) {
                // Longest paths first, shorter ones later
                usort( $value, function ( $a, $b ) {
                    return mb_strlen( $b ) - mb_strlen( $a );
                } );
                foreach ( $value as $elem ) {
                    $output[] = $nice_name . ': ' . $elem;
                }
            } else {
                $output[] = $nice_name . ': ' . $value;
            }
        }
        $output[] = '';
    }
    $sitemaps = $parser->getSitemaps();
    foreach ( $sitemaps as $sitemap ) {
        $output[] = 'Sitemap: ' . $sitemap;
    }
    $output[] = '';
    print implode( PHP_EOL, $output );
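To illustrate what the snippet produces, here is a made-up rules array together with the rendered output; the array shape is an assumption for illustration only and may differ from what getRules() actually returns.

    // Hypothetical input, for illustration only; the real getRules() structure may differ.
    $robots = array(
        '*'         => array( 'disallow' => array( '/private' ) ),
        'GoogleBot' => array(
            'disallow'    => array( '/themes', '/themes/flowers' ),
            'crawl-delay' => 5,
        ),
    );
    // With the reordering above, the rendered file reads:
    //
    // User-agent: GoogleBot
    // Disallow: /themes/flowers
    // Disallow: /themes
    // Crawl-delay: 5
    //
    // User-agent: *
    // Disallow: /private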
@szepeviktor I'm only an end user of your WordPress plugin :) I know what a pull request is, but I don't have anything to do with this library.
@t1gor @JanPetterMG Could you add this as a new function and solve the disabling of prepareRegexRule?
I'm quite busy right now, but give me a day or two and I'll come back to this. Stay tuned.
OK!
@fte378 Just want to hear your opinion: why do the user agents really have to be sorted Z-A (which normally results in * as the last one)?
I know the UA matching code in this library has a lot of bugs, but at least it selects the correct UA in most cases. I'll probably fix that one within a couple of days...
It is going into my plugin: https://wordpress.org/plugins/multipart-robotstxt-editor/
Added sitemaps.
@JanPetterMG Could you guide me on how to disable prepareRegexRule?
@szepeviktor Sorry for the delay, I'll take a look at it.
Thank you for your help.
- Render, see #67
- Rearranged code, based on usage
- Fixed PHPUnit bootstrap
- Added PHP 7.1 for testing
- Added future tests, currently disabled due to bugs
How about now?
Oh! Thanks.
It's PHP 5.4 compatible too; I pushed b74ad59 a few minutes later with a build fix... Still, there is work to do:
ucfirst() is implemented above.
@szepeviktor I know, but that doesn't fix the inline directives.
It should be something like:
Doesn't Host stand on a separate line?
Rendering of the (regular) Host directive is now implemented: 627082d. I'm not sure where I saw it, but there were some robots.txt files where it was used as an "in-line" directive. It isn't coming from any of the Yandex examples, Google doesn't support it, and these two directives aren't listed in any draft either, so I don't have many references for it... Anyway, Yandex states this, whichever interpretation is right or wrong:
and
There is also the
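If the standalone Host directive ever needs to be appended in the same way as the sitemaps above, a minimal sketch could look like the following; getHost() is only an assumed accessor here, not a confirmed method of this library.

    // Sketch only: getHost() is an assumption, not a confirmed API of this parser.
    $host = $parser->getHost();
    if ( ! empty( $host ) ) {
        // Yandex expects a single Host directive naming the preferred mirror
        $output[] = 'Host: ' . $host;
    }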
Do you plan to increase coverage?
@fte378 It seems it is ready to integrate into the robots.txt plugin.
Reopen for todos in #67 (comment)
There's a link on the front page, at the top of the README.md to be exact (the red box); from there it should be pretty easy to get an overview and even view the coverage line by line... There are a lot more bugs in this library than I expected, so I'm not sure how much time I'm going to spend fixing things... It should be refactored and rewritten, the sooner the better... Take a look at VIPnytt/RobotsTxtParser; it's originally based on this library, but refactored and built from scratch as of version 2.0.
Why reinvent the wheel?
Yeah, why fix this library at all when VIPnytt/RobotsTxtParser has everything this library is missing? I needed to handle caching, proper rendering, automatic download, and proper user-agent matching, and it had to be as bug-free as possible for my spider, so that I can crawl up to 10k webpages a day without having to worry about anything...
I do not care. I've just released this class:
Seems like you've just moved most of the code and the idea to your own library now :( I'm not really sure how to behave in this situation... Taking a closer look at your lib, I have several suggestions. Maybe we could join efforts and make a better version of the parser together?
@t1gor You're using the MIT license, which allows me to do pretty much anything. I've moved the idea, but 95% of the code is unique. The only reasons behind this decision were too many bugs, lots of missing features, and that backwards compatibility would have been really hard to maintain... Anyway, I'm happy to join forces again; my library is rather over-engineered and could be simplified.
It would be nice to output a cleanly formatted and valid robots.txt.