Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add '--convert-html-to-text' option #103

Closed
wants to merge 1 commit into from

Conversation

unsolaci
Copy link
Contributor

@unsolaci unsolaci commented Apr 7, 2022

Introduce the '--convert-html-to-text' option, that converts .html
license files (exported using '--export-license-texts') to .txt, by
stripping and decoding HTML.

(Partially?) fix #81.

This is certainly not a perfect implementation, and could see some refactoring
and improvement.

All it really does is get the innerText DOM property and decode HTML entities,
so the formatting of the output text files is sometimes a bit awkward.

That said, at least it gives you an option to get text instead of HTML.
And since in some cases these HTML license files are huge (with CSS and JS
embedded 🤯), I still consider it a step up (i.e. works for my project).

Example excerpt from System.Resources.ResourceManager_4.0.0.html:

<h1 style='margin-top:6.0pt;margin-right:0in;margin-bottom:0in;margin-left:
.25in;margin-bottom:.0001pt;text-indent:-.25in'><span style='font-size:10.0pt'>1.<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp; </span></span><span
style='font-size:10.0pt'>INSTALLATION AND USE RIGHTS. </span></h1>

<p class=Bullet3 style='margin-top:0in;margin-right:0in;margin-bottom:6.0pt;
margin-left:.25in;text-indent:0in'><span style='font-size:10.0pt'>You may
install and use any number of copies of the software </span><span
style='font-size:10.0pt;color:black'>to develop and test your applications.&nbsp;
</span></p>

<h1><span style='font-size:10.0pt'>2.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;
</span></span><span style='font-size:10.0pt'>THIRD PARTY COMPONENTS. </span><span
style='font-weight:normal'>The software may include third party components with
separate legal notices or governed by other agreements, as may be described in
the ThirdPartyNotices file(s) </span><span style='font-size:10.0pt;font-weight:
normal'>accompanying the software.</span></h1>

Converted to System.Resources.ResourceManager_4.0.0.txt:

1.    INSTALLATION AND USE RIGHTS. 

You may
install and use any number of copies of the software to develop and test your applications. 


2.   
THIRD PARTY COMPONENTS. The software may include third party components with
separate legal notices or governed by other agreements, as may be described in
the ThirdPartyNotices file(s) accompanying the software.

Not perfect, but much more human-readable.

Introduce the '--convert-html-to-text' option, that converts .html
license files (exported using '--export-license-texts') to .txt, by
stripping and decoding HTML.

public void ConvertHtmlFilesToText(string htmlFolder)
{
var htmlFiles = Directory.GetFiles(htmlFolder, "*.html");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @unsolaci,

Instead of saving all HTML files and then converting them into text files, would you mind intercepting the stream at :990 and saving converted txt directly?

I think this would eliminate unnecessary file management.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Lexy2,

Your suggestion does sound like a good idea, but I'm unsure about the
implementation.

HtmlAgilityPack provides
methods that work with files and strings. I don't know on how to work with that
asynchronously in the context of ExportLicenseTexts().

I could simply do something like this:

public string ConvertHtmlToText(string html) {
    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    
    htmlDocument.LoadHtml(html);
    
    string htmlStrippedText =
        System.Web.HttpUtility.HtmlDecode(
            htmlDocument.DocumentNode.InnerText
        );
    
    return htmlStrippedText;
}

And then inside ExportLicenseTexts():

string licenseFileContents =
    await response.Content.ReadAsStringAsync();

string htmlStrippedLicenseFileContents =
    ConvertHtmlToText(licenseFileContents);

System.IO.File.WriteAllText(
    path: outpath, 
    contents: htmlStrippedLicenseFileContents
);

But I have a feeling that this might be defeating the purpose of originally
using async here.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Copy link
Collaborator

@Lexy2 Lexy2 Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I like the idea with ReadAsStringAsync!

If you're worried about asyncness further down the road, use StreamWriter.WriteAsync (don't forget to flush the stream after write).

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?
In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And best to have a test for this option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?

I've tried it with https://opensource.org/licenses/MIT and while the output
doesn't look particularly pretty (some non-license text from the page remains),
the license text itself is not stripped off.

In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

I'm completely fine with this being marked as an experimental option. While not
very sophisticated, this implementation is about as much as I can contribute at
this time. Any improvements are of course welcome.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Do you have any suggestions on how I should go about checking whether the
--convert-html-to-text option has been passed from within
ExportLicenseTexts()?

@Lexy2 Lexy2 mentioned this pull request May 15, 2022
@Lexy2
Copy link
Collaborator

Lexy2 commented May 15, 2022

@unsolaci please check the updated PR based on our discussion and see if it fits your purpose.

Thanks!

@Lexy2 Lexy2 closed this May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Option to include full license texts in licenses.txt output
2 participants