
Prettify function #67

Open
szepeviktor opened this issue Jul 22, 2016 · 29 comments

@szepeviktor
Contributor

It would be nice if the parser could output a clean and valid robots.txt.

@fte378

fte378 commented Jul 22, 2016

This already looks good, except that there is no ordering by match length and no special handling of the match-all clause.

I don't know if that's the case, but a parser might match "User-agent: *" if it is not the last entry, even when there is a more specific entry like "User-agent: GoogleBot". "User-agent: *" should always be the last entry in the generated file.

The same may hold true for the paths. I doubt that a parser matches "/themes/flowers" if "/themes" comes first. So the entries within the User-agent blocks should be ordered with the longest matching paths first and the shorter ones later. For example:

User-agent: *
Allow: /themes/flowers/roses
Disallow: /themes/flowers
Allow: /themes
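
A minimal PHP sketch of that ordering (the $paths array here is invented for illustration; the full renderer snippet further down in this thread does the same thing inside its loop):

$paths = array( '/themes', '/themes/flowers/roses', '/themes/flowers' );

// Longer (more specific) paths first, shorter ones later
usort( $paths, function ( $a, $b ) {
    return mb_strlen( $b ) - mb_strlen( $a );
} );

// $paths is now: /themes/flowers/roses, /themes/flowers, /themes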

@fte378

fte378 commented Jul 22, 2016

I quickly checked the Google and Yandex specs. They say that more specific paths override more general ones. However, there may be other, less capable parsers out there. So the ordering serves two purposes: satisfying less capable parsers and making the file more readable for humans.

@szepeviktor
Contributor Author

szepeviktor commented Jul 22, 2016

Implemented.
@fte378 Would you open a PR and solve the disabling of $this->convert('self::prepareRegexRule');?

// Disable the regex conversion inside the library
//$this->convert('self::prepareRegexRule');

// $parser is an already constructed instance of the parser class
$robots = $parser->getRules();
$output = array();

// Move the match-all clause to the end of the array
if ( array_key_exists( '*', $robots ) ) {
    $match_all = $robots['*'];
    unset( $robots['*'] );
    $robots['*'] = $match_all;
}

foreach ( $robots as $ua => $rules ) {
    $output[] = 'User-agent: ' . $ua;
    foreach ( $rules as $name => $value ) {
        // Directive names are ASCII, so ucfirst() (not multibyte) is fine
        $nice_name = ucfirst( $name );
        if ( is_array( $value ) ) {
            // Longest paths first, shorter ones later
            usort( $value, function ( $a, $b ) {
                return mb_strlen( $b ) - mb_strlen( $a );
            } );
            foreach ( $value as $elem ) {
                $output[] = $nice_name . ': ' . $elem;
            }
        } else {
            $output[] = $nice_name . ': ' . $value;
        }
    }
    // Blank line between User-agent blocks
    $output[] = '';
}

$sitemaps = $parser->getSitemaps();

foreach ( $sitemaps as $sitemap ) {
    $output[] = 'Sitemap: ' . $sitemap;
}
$output[] = '';

print implode( PHP_EOL, $output );
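
For illustration only: assuming the parser was fed a file containing a GoogleBot group, a match-all group and a single sitemap (all values below are made up, and the grouping of Allow/Disallow lines depends on the parser's internal key order), the snippet would print something along these lines:

User-agent: GoogleBot
Disallow: /private

User-agent: *
Allow: /themes/flowers/roses
Allow: /themes
Disallow: /themes/flowers

Sitemap: http://example.com/sitemap.xml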

@fte378

fte378 commented Jul 22, 2016

@szepeviktor I'm only an end user of your WordPress plugin :) I know what a pull request is, but I don't have anything to do with this library.

@szepeviktor
Contributor Author

@t1gor @JanPetterMG Could you add this as a new function and solve the disabling of $this->convert('self::prepareRegexRule');?

@JanPetterMG
Collaborator

I'm quite busy right now, but give me a day or two and I'll come back to this. Stay tuned.

@szepeviktor
Contributor Author

OK!

@JanPetterMG
Collaborator

JanPetterMG commented Jul 22, 2016

@fte378 Just want to hear your opinion: why do the user-agents really have to be sorted Z-A (which normally results in * being the last one)?
I know there are a lot of pros/cons here, but if the file is parsed by this library, it shouldn't make any difference anyway. Do you plan to use this library to render an existing robots.txt and then publish it, or to use the rendered result outside of this library in some other way?

Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other groups of records are ignored by the crawler. The user-agent is non-case-sensitive. All non-matching text is ignored (for example, both googlebot/1.2 and googlebot* are equivalent to googlebot). The order of the groups within the robots.txt file is irrelevant.

Source: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt#order-of-precedence-for-user-agents
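
As a hypothetical illustration of that rule: given the two groups below, Googlebot would obey only the Googlebot group and ignore the * group entirely, no matter which of the two appears first in the file:

User-agent: *
Disallow: /private

User-agent: Googlebot
Disallow: /tmp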

I know the UA matching code in this library has a lot of bugs, but at least it's selecting the correct UA in most cases. I'm probably fixing that one within a couple of days...

@szepeviktor
Contributor Author

It is going into my plugin: https://wordpress.org/plugins/multipart-robotstxt-editor/

@szepeviktor
Contributor Author

*Added sitemaps.

JanPetterMG self-assigned this Jul 26, 2016
@szepeviktor
Contributor Author

@JanPetterMG Could you guide me to disable $this->convert('self::prepareRegexRule');?

@JanPetterMG
Collaborator

@szepeviktor sorry for the delay, I'll take a look at it

@szepeviktor
Contributor Author

Thank you for your help.
I appreciate it.

JanPetterMG added a commit that referenced this issue Aug 10, 2016
- Render, see #67
- Rearranged code, based on usage
- Fixed PHPUnit bootstrap
- Added PHP 7.1 for testing
- Added future tests, currently disabled due to bugs
@JanPetterMG
Collaborator

How about now?
$parser->render();
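
A minimal usage sketch (the class name RobotsTxtParser, the constructor taking the raw robots.txt content, and the content string itself are assumptions here, not taken from this thread):

$content = "User-agent: *\nDisallow: /private\nSitemap: http://example.com/sitemap.xml";
$parser = new RobotsTxtParser( $content );
echo $parser->render();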

@szepeviktor
Contributor Author

Oh! Thanks.
814737c

@JanPetterMG
Collaborator

JanPetterMG commented Aug 10, 2016

It's PHP 5.4 compatible too; I pushed b74ad59 a few minutes later, with a build fix...

Still, there is work to do:

  • Render the Host directive
  • ucfirst and prettify all inline directives
  • Verify that the inline directives actually are valid, and not just some random text
  • Typo in the input parameter, should be "/r/n"

@szepeviktor
Contributor Author

ucfirst() is implemented above.

@JanPetterMG
Collaborator

@szepeviktor I know, but that doesn't fix the inline directives.
This is what it outputs:

Disallow: clean-param:token path // invalid path
Disallow: host:a.b // invalid host

It should be something like:

Disallow: Clean-param: token /path
Disallow: Host: example.com

@szepeviktor
Contributor Author

szepeviktor commented Aug 10, 2016

Doesn't Host stand on a separate line?
https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml#host

@JanPetterMG
Collaborator

Rendering of the (regular) host directive now implemented. 627082d
Host: example.com


I'm not sure where I saw it, but there were some robots.txt files where it was used as an "inline" directive. It doesn't come from any of the Yandex examples, Google doesn't support them, and these two directives aren't listed in any draft either, so I don't have many references for it...

Anyway, Yandex states the following, whichever interpretation is right or wrong...

However, the Host directive is intersectional, so it will be used by the robot regardless of its location in robots.txt.

and

The Clean-Param directive is intersectional, so it can be indicated in any place within the robots.txt file.

There is also the Cache-delay directive (implemented); I don't know where it comes from and haven't found any links or references to date, yet I can list a couple of handfuls of robots.txt URLs where it's been used...
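
For reference, the standalone (non-inline) form discussed here would look something like this (domain, parameter name and paths are placeholders):

User-agent: Yandex
Disallow: /private
Clean-param: token /path
Host: example.com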

@szepeviktor
Contributor Author

szepeviktor commented Aug 10, 2016

Do you plan to increase coverage?
https://codeclimate.com/github/t1gor/Robots.txt-Parser-Class/source/robotstxtparser.php
(There is no direct link from the Source view to Coverage.)

@szepeviktor
Contributor Author

@fte378 It seems it is ready to be integrated into the robots.txt plugin.

@szepeviktor
Contributor Author

Reopening for the todos in #67 (comment)

szepeviktor reopened this Aug 10, 2016
@JanPetterMG
Collaborator

There's a link on the front page, at the top of the README.md to be exact (the red box); from there it should be pretty easy to get an overview and even view the coverage line by line...

There are a lot more bugs in this library than I expected, so I'm not sure how much time I'm going to spend fixing things... It should be refactored and rewritten, the sooner the better...

Take a look at VIPnytt/RobotsTxtParser; it's originally based on this library, but refactored and rebuilt from scratch as of version 2.0.
Any bugs/issues from t1gor/Robots.txt-Parser-Class are cross-checked and eliminated in VIPnytt/RobotsTxtParser...
Coverage: 96.5%
It even has an integrated render function, a little bit better than the one included in this library 😄

JanPetterMG removed their assignment Aug 10, 2016
@szepeviktor
Contributor Author

Why reinvent the wheel?

@JanPetterMG
Collaborator

Yeah, why fix this library at all, when VIPnytt/RobotsTxtParser has everything this library is missing??

I needed to handle caching, proper rendering, automatic download, proper user-agent matching, and it had to be as bug-free as possible for my spider, so that I can crawl up to 10k webpages a day without having to worry about anything...
That's something this library will never be able to do :p
That's why I built it :D

@szepeviktor
Contributor Author

I do not care. I've just released this class:
szepeviktor/multipart-robotstxt-editor@71e0a87

@t1gor
Owner

t1gor commented Dec 5, 2017

Yeah, why fix this library at all, when VIPnytt/RobotsTxtParser has everything this library is missing??

I needed to handle caching, proper rendering, automatic download, proper user-agent matching, and it had to be as bug-free as possible for my spider, so that I can crawl up to 10k webpages a day without having to worry about anything...
That's something this library will never be able to do :p
That's why I built it :D

Seems like you've just moved most of the code and the idea to your own library now :( I'm not really sure how to behave in this situation...

Taking a closer look at your lib, I have several suggestions. Maybe we could join efforts and make a better version of the parser together?

@JanPetterMG
Collaborator

@t1gor You're using the MIT license, which allows me to do pretty much anything. I've moved the idea, but 95% of the code is unique. The only reasons behind this decision were too many bugs, lots of missing features, and the fact that backwards compatibility would be really hard to maintain...

Anyway, I'm happy to join forces again; my library is rather over-engineered and could be simplified.
I've included support for every rule that has ever existed, including drafts, but I'm open to dropping support for some of them, as they're practically never used on the internet anymore...
