Add '--convert-html-to-text' option #103

unsolaci · 2022-04-07T21:34:04Z

Introduce the '--convert-html-to-text' option, that converts .html
license files (exported using '--export-license-texts') to .txt, by
stripping and decoding HTML.

(Partially?) fix #81.

This is certainly not a perfect implementation, and could see some refactoring
and improvement.

All it really does is get the innerText DOM property and decode HTML entities,
so the formatting of the output text files is sometimes a bit awkward.

That said, at least it gives you an option to get text instead of HTML.
And since in some cases these HTML license files are huge (with CSS and JS
embedded 🤯), I still consider it a step up (i.e. works for my project).

Example excerpt from System.Resources.ResourceManager_4.0.0.html:

<h1 style='margin-top:6.0pt;margin-right:0in;margin-bottom:0in;margin-left:
.25in;margin-bottom:.0001pt;text-indent:-.25in'><span style='font-size:10.0pt'>1.<span
style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp; </span></span><span
style='font-size:10.0pt'>INSTALLATION AND USE RIGHTS. </span></h1>

<p class=Bullet3 style='margin-top:0in;margin-right:0in;margin-bottom:6.0pt;
margin-left:.25in;text-indent:0in'><span style='font-size:10.0pt'>You may
install and use any number of copies of the software </span><span
style='font-size:10.0pt;color:black'>to develop and test your applications.&nbsp;
</span></p>

<h1><span style='font-size:10.0pt'>2.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;
</span></span><span style='font-size:10.0pt'>THIRD PARTY COMPONENTS. </span><span
style='font-weight:normal'>The software may include third party components with
separate legal notices or governed by other agreements, as may be described in
the ThirdPartyNotices file(s) </span><span style='font-size:10.0pt;font-weight:
normal'>accompanying the software.</span></h1>

Converted to System.Resources.ResourceManager_4.0.0.txt:

1.    INSTALLATION AND USE RIGHTS. 

You may
install and use any number of copies of the software to develop and test your applications. 


2.   
THIRD PARTY COMPONENTS. The software may include third party components with
separate legal notices or governed by other agreements, as may be described in
the ThirdPartyNotices file(s) accompanying the software.

Not perfect, but much more human-readable.

Introduce the '--convert-html-to-text' option, that converts .html license files (exported using '--export-license-texts') to .txt, by stripping and decoding HTML.

Lexy2 · 2022-04-11T02:49:32Z

src/Methods.cs

+
+        public void ConvertHtmlFilesToText(string htmlFolder)
+        {
+            var htmlFiles = Directory.GetFiles(htmlFolder, "*.html");


Hi @unsolaci,

Instead of saving all HTML files and then converting them into text files, would you mind intercepting the stream at :990 and saving converted txt directly?

I think this would eliminate unnecessary file management.

Thanks!

Hey @Lexy2,

Your suggestion does sound like a good idea, but I'm unsure about the
implementation.

HtmlAgilityPack provides
methods that work with files and strings. I don't know on how to work with that
asynchronously in the context of ExportLicenseTexts().

I could simply do something like this:

public string ConvertHtmlToText(string html) { var htmlDocument = new HtmlAgilityPack.HtmlDocument(); htmlDocument.LoadHtml(html); string htmlStrippedText = System.Web.HttpUtility.HtmlDecode( htmlDocument.DocumentNode.InnerText ); return htmlStrippedText; }

And then inside ExportLicenseTexts():

string licenseFileContents = await response.Content.ReadAsStringAsync(); string htmlStrippedLicenseFileContents = ConvertHtmlToText(licenseFileContents); System.IO.File.WriteAllText( path: outpath, contents: htmlStrippedLicenseFileContents );

But I have a feeling that this might be defeating the purpose of originally
using async here.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Hey, I like the idea with ReadAsStringAsync!

If you're worried about asyncness further down the road, use StreamWriter.WriteAsync (don't forget to flush the stream after write).

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?
In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

And best to have a test for this option.

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?

I've tried it with https://opensource.org/licenses/MIT and while the output
doesn't look particularly pretty (some non-license text from the page remains),
the license text itself is not stripped off.

In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

I'm completely fine with this being marked as an experimental option. While not
very sophisticated, this implementation is about as much as I can contribute at
this time. Any improvements are of course welcome.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Do you have any suggestions on how I should go about checking whether the
--convert-html-to-text option has been passed from within
ExportLicenseTexts()?

Lexy2 · 2022-05-15T12:53:43Z

@unsolaci please check the updated PR based on our discussion and see if it fits your purpose.

Thanks!

feat: Add '--convert-html-to-text' option

29b688b

Introduce the '--convert-html-to-text' option, that converts .html license files (exported using '--export-license-texts') to .txt, by stripping and decoding HTML.

Lexy2 requested changes Apr 11, 2022

View reviewed changes

Lexy2 mentioned this pull request May 15, 2022

Convert html to text #111

Merged

Lexy2 closed this May 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add '--convert-html-to-text' option #103

Add '--convert-html-to-text' option #103

unsolaci commented Apr 7, 2022 •

edited

Loading

Lexy2 Apr 11, 2022

unsolaci Apr 12, 2022

Lexy2 Apr 12, 2022 •

edited

Loading

Lexy2 Apr 12, 2022

unsolaci Apr 20, 2022

Lexy2 commented May 15, 2022

Add '--convert-html-to-text' option #103

Add '--convert-html-to-text' option #103

Conversation

unsolaci commented Apr 7, 2022 • edited Loading

Lexy2 Apr 11, 2022

Choose a reason for hiding this comment

unsolaci Apr 12, 2022

Choose a reason for hiding this comment

Lexy2 Apr 12, 2022 • edited Loading

Choose a reason for hiding this comment

Lexy2 Apr 12, 2022

Choose a reason for hiding this comment

unsolaci Apr 20, 2022

Choose a reason for hiding this comment

Lexy2 commented May 15, 2022

unsolaci commented Apr 7, 2022 •

edited

Loading

Lexy2 Apr 12, 2022 •

edited

Loading