Web Scraping and Generating PDFs Using C# and .NET

This blog post is part of the 2022 C# Advent Calendar. Go check out the other 49 great posts after you read mine, of course.

One of the most valuable things I've ever made is a simple web scraper that could generate PDFs and compare versions. I'm sure plenty of services will do this for a small fee, but I'd rather create a custom solution and save myself the cost.

This idea started when I worked for a company that made digital versions of government forms. One of the most annoying parts of that job was that the government often updated records without notice. We usually relied on our clients to tell us when the government revised forms.

I decided I wouldn't stay in the dark, so I created a simple application to let me know when things had changed. It had three primary functions: check whether a known webpage or PDF had changed, make a copy if it had, and save the current document in a versioned database. Eventually, I added an email function, but I won't cover that in today's post.

Let's get started!

Dependencies

You'll install three packages to build this application, though it's really only two tools, since PdfSharpCore and MigraDocCore work as a pair.

The first is AngleSharp, which will let you create a copy of a web page's Document Object Model (DOM). I recently upgraded to this from HtmlAgilityPack because it seemed easier to use.

Next is MigraDocCore, built on top of PdfSharpCore and created by the same developer. They'll help you create PDF documents.

dotnet add package AngleSharp --version 0.17.1
dotnet add package PdfSharpCore --version 1.3.41
dotnet add package MigraDocCore.Rendering --version 1.3.41

Code

First, create a model of the data you want to store. You could easily make this a Record, but some habits die hard for me.

public class Country
{
    public string? Name { get; set; }
    public string? Capital { get; set; }
    public string? Population { get; set; }
    public string? Area_KM_Squared { get; set; }
}
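
If you'd rather go the record route, a minimal sketch might look like the following. CountryRecord is a hypothetical name chosen so it doesn't clash with the Country class, the sample values are only illustrative, and positional records need C# 9 or later.

```csharp
#nullable enable
using System;

// Two records with the same values compare as equal, unlike two class instances.
var first = new CountryRecord("Andorra", "Andorra la Vella", "84000", "468.0");
var second = new CountryRecord("Andorra", "Andorra la Vella", "84000", "468.0");
Console.WriteLine(first == second); // True: records use value-based equality

// "with" copies a record while changing selected properties.
var updated = first with { Population = "85000" };
Console.WriteLine(first == updated); // False: Population differs

// Hypothetical record version of the Country class (the name is an assumption).
public record CountryRecord(string? Name, string? Capital, string? Population, string? Area_KM_Squared);
```

Value-based equality can be handy in a change-detection tool, because two scrapes that produce the same data compare as equal without any extra code.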

Next, you'll set up AngleSharp to go out and fetch the DOM to manipulate.

The two essential parts are the address and what you want to select from the DOM.

var document = await context.OpenAsync("https://www.scrapethissite.com/pages/simple/");

var countries = document
    .QuerySelectorAll("*")
    .Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
    .ToList();

The above selects every div element whose class attribute is exactly "col-md-4 country" and places the matches in a list.
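
AngleSharp also accepts CSS selectors, so the same kind of filter could be expressed in the selector string itself. The sketch below assumes that approach and runs against a small made-up HTML fragment rather than the live page, so the markup and values in it are illustrative only.

```csharp
using System;
using AngleSharp;

// Made-up fragment for illustration; the real page's markup may differ.
const string html =
    "<div class='col-md-4 country'>Andorra</div>" +
    "<div class='country col-md-4'>Angola</div>" +
    "<div class='col-md-4'>Not a country</div>";

var context = BrowsingContext.New(Configuration.Default);

// Parse the string directly instead of fetching a URL.
var document = await context.OpenAsync(req => req.Content(html));

// Matches div elements that carry both classes, in any order.
var countries = document.QuerySelectorAll("div.col-md-4.country");

foreach (var country in countries)
{
    Console.WriteLine(country.TextContent);
}
```

Here the first two div elements match and the third does not. A class selector ignores class order and extra classes, so it is slightly more permissive than the exact ClassName comparison.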

using AngleSharp;
using AngleSharp.Dom;

public class WebpageData
{
    public async Task<List<Country>> CountriesAsync()
    {
        // The default loader lets AngleSharp fetch the page over HTTP.
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync("https://www.scrapethissite.com/pages/simple/");

        // Keep only the div elements that wrap a single country.
        var countries = document
            .QuerySelectorAll("*")
            .Where(e => e.LocalName == "div" && e.ClassName == "col-md-4 country")
            .ToList();

        List<Country> parsedCountries = new();
        foreach (var country in countries)
        {
            // Split the element's text into trimmed, non-empty lines.
            var lines = country
                .Text()
                .Split("\n")
                .Select(s => s.Trim())
                .Where(s => !string.IsNullOrWhiteSpace(s))
                .ToArray();

            // The first line is the name; the others take the value after the colon.
            parsedCountries.Add(new Country()
            {
                Name = lines[0],
                Capital = lines[1].Split(':')[1].Trim(),
                Population = lines[2].Split(':')[1].Trim(),
                Area_KM_Squared = lines[3].Split(':')[1].Trim()
            });
        }

        return parsedCountries.OrderBy(c => c.Name).ToList();
    }
}
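
The Split(':')[1] calls above assume every line contains exactly one colon. As a hedged alternative, the sketch below uses a hypothetical FieldParser.ValueAfterColon helper (not part of the original code) that splits on the first colon only and returns null when no colon is present.

```csharp
#nullable enable
using System;

Console.WriteLine(FieldParser.ValueAfterColon("Capital: Andorra la Vella")); // Andorra la Vella
Console.WriteLine(FieldParser.ValueAfterColon("Note: a: b"));                // a: b
Console.WriteLine(FieldParser.ValueAfterColon("No separator") is null);      // True

// Hypothetical helper; the name and behavior are assumptions for illustration.
public static class FieldParser
{
    // Returns the trimmed text after the first colon, or null if there is no colon.
    public static string? ValueAfterColon(string line)
    {
        var parts = line.Split(':', 2); // at most two pieces, so later colons survive
        return parts.Length == 2 ? parts[1].Trim() : null;
    }
}
```

Returning null instead of throwing lets the caller decide whether a malformed line should be skipped or logged, which matters when the page layout changes without notice.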

Next, you'll transform List<Country> into a PdfDocument using extension methods in a static helper class. I prefer extension methods when converting types or returning bools because they keep the calling code cleaner.

using MigraDocCore.DocumentObjectModel;
using MigraDocCore.Rendering;
using PdfSharpCore.Pdf;

// Extension methods must be declared in a static class; the class name here is arbitrary.
public static class Extensions
{
    public static PdfDocument ToPDF(this List<Country> input)
    {
        Document document = new();

        Section section = document.AddSection();

        // One paragraph per field, with an empty paragraph between countries.
        foreach (var item in input)
        {
            section.AddParagraph($"Name: {item.Name}");
            section.AddParagraph($"Capital: {item.Capital}");
            section.AddParagraph($"Population: {item.Population}");
            section.AddParagraph($"Area KM Squared: {item.Area_KM_Squared}");
            section.AddParagraph();
        }

        PdfDocumentRenderer pdfRenderer = new() { Document = document };

        pdfRenderer.RenderDocument();

        return pdfRenderer.PdfDocument;
    }

    public static bool IsTheSameAsLatestDBVersion(this PdfDocument input)
    {
        var current = DB.GetCurrentDBVersion();

        // Compares object references, not PDF content.
        return input == current;
    }
}

This is where you'd connect to your database and pull a copy of the latest saved page. For this demo, the stub below returns a new, empty PdfDocument, which is a cheeky way to make sure the two PdfDocuments always differ, since our source data never really changes.

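using PdfSharpCore.Pdf;

public class DB
{
    // Stand-in for a real database lookup; always returns a new, empty document.
    public static PdfDocument GetCurrentDBVersion()
    {
        return new PdfDocument();
    }
}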
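
In a real version, comparing PdfDocument objects directly probably won't tell you whether the content changed, and rendered PDF bytes may also carry metadata that differs between runs. One possible approach, sketched here as an assumption rather than what the original tool did, is to fingerprint the scraped data itself with SHA-256 and compare that string with the one stored alongside the last saved version. Fingerprint.Of and the sample rows are hypothetical; each row stands in for the four fields of a Country.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

var previous = new List<string[]> { new[] { "Andorra", "Andorra la Vella", "84000", "468.0" } };
var unchanged = new List<string[]> { new[] { "Andorra", "Andorra la Vella", "84000", "468.0" } };
var changed = new List<string[]> { new[] { "Andorra", "Andorra la Vella", "85000", "468.0" } };

Console.WriteLine(Fingerprint.Of(previous) == Fingerprint.Of(unchanged)); // True
Console.WriteLine(Fingerprint.Of(previous) == Fingerprint.Of(changed));   // False

// Hypothetical helper; not part of the original code.
public static class Fingerprint
{
    // Hashes rows of field values into an uppercase hex string.
    public static string Of(IEnumerable<string[]> rows)
    {
        var text = new StringBuilder();
        foreach (var row in rows)
        {
            // Unit separator (U+001F) between fields, newline between rows.
            text.Append(string.Join('\u001F', row)).Append('\n');
        }

        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(text.ToString()));
        return Convert.ToHexString(hash);
    }
}
```

Storing a short fingerprint next to each saved PDF would make the "has it changed?" check a plain string comparison.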

Everything finally comes together in this simple section of code. All that happens here is that the current page is returned as a list, it's transformed into a PDF, compared to the last saved version, and if it's the same, the program exits. If it's not the same, a new PDF is created, and an entry in the database is made.

var result = await new WebpageData().CountriesAsync();

var pdf = result.ToPDF();

var pdfIsTheSameAsLatestDBVersion = pdf.IsTheSameAsLatestDBVersion();

if (pdfIsTheSameAsLatestDBVersion)
    return;

var todaysDate = DateTime.Now.Date.ToString("yyyy-MM-dd");
var filename = $"CountryData_{todaysDate}.pdf";
pdf.Save(filename);
// pdf.AddToDB();
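
The commented-out AddToDB call is where versioning would happen. As a stand-in for a database, the sketch below keeps versions as numbered files in a folder and only writes a new one when a fingerprint string (any stable summary of the scraped data, such as a hash) differs from the last one saved. FileVersionStore, its method names, and the file naming scheme are all assumptions made for illustration.

```csharp
using System;
using System.IO;

var folder = Path.Combine(Path.GetTempPath(), "country-versions-" + Guid.NewGuid().ToString("N"));
var store = new FileVersionStore(folder);

Console.WriteLine(store.SaveIfChanged("abc", new byte[] { 1, 2, 3 })); // True: first version
Console.WriteLine(store.SaveIfChanged("abc", new byte[] { 1, 2, 3 })); // False: same fingerprint
Console.WriteLine(store.SaveIfChanged("def", new byte[] { 4, 5, 6 })); // True: fingerprint changed
Console.WriteLine(Directory.GetFiles(folder, "*.pdf").Length);         // 2

// Hypothetical file-backed version store; a real tool might use a database instead.
public class FileVersionStore
{
    private readonly string _folder;
    private string FingerprintPath => Path.Combine(_folder, "latest.txt");

    public FileVersionStore(string folder)
    {
        _folder = folder;
        Directory.CreateDirectory(folder);
    }

    // Writes a new numbered file only when the fingerprint differs from the last one saved.
    public bool SaveIfChanged(string fingerprint, byte[] content)
    {
        if (File.Exists(FingerprintPath) && File.ReadAllText(FingerprintPath) == fingerprint)
            return false;

        int next = Directory.GetFiles(_folder, "*.pdf").Length + 1;
        var date = DateTime.Now.ToString("yyyy-MM-dd");
        File.WriteAllBytes(Path.Combine(_folder, $"CountryData_{date}_v{next}.pdf"), content);
        File.WriteAllText(FingerprintPath, fingerprint);
        return true;
    }
}
```

Adding a version number to the filename also avoids overwriting an earlier file when the tool runs more than once on the same day.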

This may seem like a simple tool, but I can't count the times I've found it helpful to be alerted to a change in a webpage or a PDF.

If you found this helpful, let me know! If you have any suggestions or feedback, I'd love to hear it.