Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add '--convert-html-to-text' option #103

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions src/Methods.cs
Original file line number Diff line number Diff line change
Expand Up @@ -1000,6 +1000,41 @@ public async Task ExportLicenseTexts (List<LibraryInfo> infos) {
}
}

public void ConvertHtmlFileToText(string htmlFile)
{
var htmlDocument = new HtmlAgilityPack.HtmlDocument();

htmlDocument.Load(htmlFile);

string htmlStrippedText =
System.Web.HttpUtility.HtmlDecode(
htmlDocument.DocumentNode.InnerText
);

string textFile = htmlFile.Replace(".html", ".txt");

WriteOutput(
line: $"Converting {htmlFile} to {textFile}...",
logLevel: LogLevel.Verbose
);
File.WriteAllText(textFile, htmlStrippedText);

WriteOutput(
line: $"Deleting {htmlFile}...",
logLevel: LogLevel.Verbose
);
File.Delete(htmlFile);
}

public void ConvertHtmlFilesToText(string htmlFolder)
{
var htmlFiles = Directory.GetFiles(htmlFolder, "*.html");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @unsolaci,

Instead of saving all HTML files and then converting them into text files, would you mind intercepting the stream at :990 and saving converted txt directly?

I think this would eliminate unnecessary file management.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Lexy2,

Your suggestion does sound like a good idea, but I'm unsure about the
implementation.

HtmlAgilityPack provides
methods that work with files and strings. I don't know on how to work with that
asynchronously in the context of ExportLicenseTexts().

I could simply do something like this:

public string ConvertHtmlToText(string html) {
    var htmlDocument = new HtmlAgilityPack.HtmlDocument();
    
    htmlDocument.LoadHtml(html);
    
    string htmlStrippedText =
        System.Web.HttpUtility.HtmlDecode(
            htmlDocument.DocumentNode.InnerText
        );
    
    return htmlStrippedText;
}

And then inside ExportLicenseTexts():

string licenseFileContents =
    await response.Content.ReadAsStringAsync();

string htmlStrippedLicenseFileContents =
    ConvertHtmlToText(licenseFileContents);

System.IO.File.WriteAllText(
    path: outpath, 
    contents: htmlStrippedLicenseFileContents
);

But I have a feeling that this might be defeating the purpose of originally
using async here.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Copy link
Collaborator

@Lexy2 Lexy2 Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I like the idea with ReadAsStringAsync!

If you're worried about asyncness further down the road, use StreamWriter.WriteAsync (don't forget to flush the stream after write).

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?
In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And best to have a test for this option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question though is: how do you ensure that the license text is not stripped during conversion, like the ones from opensource.org?
Have you tested HtmlAgilityPack in these tough cases?

I've tried it with https://opensource.org/licenses/MIT and while the output
doesn't look particularly pretty (some non-license text from the page remains),
the license text itself is not stripped off.

In this case, it should definitely be a separate experimental option as we can't guarantee that HtmlAgilityPack does the correct conversion.

I'm completely fine with this being marked as an experimental option. While not
very sophisticated, this implementation is about as much as I can contribute at
this time. Any improvements are of course welcome.

On top of that, if --convert-html-to-text is supposed to be an option, then
I'd also have to check for the option from within ExportLicenseTexts().

Do you have any suggestions on how I should go about checking whether the
--convert-html-to-text option has been passed from within
ExportLicenseTexts()?

foreach (var htmlFile in htmlFiles)
{
ConvertHtmlFileToText(htmlFile);
}
}

private bool IsGithub(string uri)
{
return uri.StartsWith("https://github.com", StringComparison.Ordinal);
Expand Down
1 change: 1 addition & 0 deletions src/NugetUtility.csproj
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@

<ItemGroup>
<PackageReference Include="CommandLineParser" Version="2.8.0" />
<PackageReference Include="HtmlAgilityPack" Version="1.11.42" />
<PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
<PackageReference Include="NuGet.Versioning" Version="6.1.0" />
<PackageReference Include="System.IO.Compression" Version="4.3.0" />
Expand Down
3 changes: 3 additions & 0 deletions src/PackageOptions.cs
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,9 @@ public class PackageOptions
[Option('t', "include-transitive", Default = false, HelpText = "Include distinct transitive package licenses per project file.")]
public bool IncludeTransitive { get; set; }

[Option("convert-html-to-text", Default = false, HelpText = "Convert html licenses to plain text.")]
public bool ConvertHtmlToText { get; set; }

[Option("ignore-ssl-certificate-errors", Default = false, HelpText = "Ignore SSL certificate errors in HttpClient.")]
public bool IgnoreSslCertificateErrors { get; set; }

Expand Down
15 changes: 15 additions & 0 deletions src/Program.cs
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,14 @@ private static async Task<int> Execute(PackageOptions options)
return 1;
}

if (options.ConvertHtmlToText && !options.ExportLicenseTexts)
{
Console.WriteLine("ERROR(S):");
Console.WriteLine("--convert-html-to-text\tThis option requires the --export-license-texts option.");

return 1;
}

if (options.UseProjectAssetsJson && !options.IncludeTransitive)
{
Console.WriteLine("ERROR(S):");
Expand Down Expand Up @@ -53,6 +61,13 @@ private static async Task<int> Execute(PackageOptions options)
await methods.ExportLicenseTexts(mappedLibraryInfo);
}

if (options.ConvertHtmlToText)
{
methods.ConvertHtmlFilesToText(
methods.GetExportDirectory()
);
}

mappedLibraryInfo = methods.HandleDeprecateMSFTLicense(mappedLibraryInfo);

if (options.Print == true)
Expand Down