-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add '--convert-html-to-text' option #103
Conversation
Introduce the '--convert-html-to-text' option, that converts .html license files (exported using '--export-license-texts') to .txt, by stripping and decoding HTML.
|
||
public void ConvertHtmlFilesToText(string htmlFolder) | ||
{ | ||
var htmlFiles = Directory.GetFiles(htmlFolder, "*.html"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @unsolaci,
Instead of saving all HTML files and then converting them into text files, would you mind intercepting the stream at :990 and saving converted txt directly?
I think this would eliminate unnecessary file management.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @Lexy2,
Your suggestion does sound like a good idea, but I'm unsure about the
implementation.
HtmlAgilityPack provides
methods that work with files and strings. I don't know on how to work with that
asynchronously in the context of ExportLicenseTexts()
.
I could simply do something like this:
public string ConvertHtmlToText(string html) {
var htmlDocument = new HtmlAgilityPack.HtmlDocument();
htmlDocument.LoadHtml(html);
string htmlStrippedText =
System.Web.HttpUtility.HtmlDecode(
htmlDocument.DocumentNode.InnerText
);
return htmlStrippedText;
}
And then inside ExportLicenseTexts()
:
string licenseFileContents =
await response.Content.ReadAsStringAsync();
string htmlStrippedLicenseFileContents =
ConvertHtmlToText(licenseFileContents);
System.IO.File.WriteAllText(
path: outpath,
contents: htmlStrippedLicenseFileContents
);
But I have a feeling that this might be defeating the purpose of originally
using async here.
On top of that, if --convert-html-to-text
is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I like the idea with ReadAsStringAsync
!
If you're worried about asyncness further down the road, use StreamWriter.WriteAsync
(don't forget to flush the stream after write).
The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?
In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And best to have a test for this option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?
I've tried it with https://opensource.org/licenses/MIT and while the output
doesn't look particularly pretty (some non-license text from the page remains),
the license text itself is not stripped off.
In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.
I'm completely fine with this being marked as an experimental option. While not
very sophisticated, this implementation is about as much as I can contribute at
this time. Any improvements are of course welcome.
On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().
Do you have any suggestions on how I should go about checking whether the
--convert-html-to-text
option has been passed from within
ExportLicenseTexts()
?
@unsolaci please check the updated PR based on our discussion and see if it fits your purpose. Thanks! |
Introduce the '--convert-html-to-text' option, that converts .html
license files (exported using '--export-license-texts') to .txt, by
stripping and decoding HTML.
(Partially?) fix #81.
This is certainly not a perfect implementation, and could see some refactoring
and improvement.
All it really does is get the innerText DOM property and decode HTML entities,
so the formatting of the output text files is sometimes a bit awkward.
That said, at least it gives you an option to get text instead of HTML.
And since in some cases these HTML license files are huge (with CSS and JS
embedded 🤯), I still consider it a step up (i.e. works for my project).
Example excerpt from
System.Resources.ResourceManager_4.0.0.html
:Converted to
System.Resources.ResourceManager_4.0.0.txt
:Not perfect, but much more human-readable.