I’m working through converting my blog from Drupal to Jekyll (it’s a long story), and one of the things I needed to do was convert a bunch of posts originally written in HTML into Markdown. With a little application of PowerShell, most of the heavy lifting was done fairly quickly, leaving just a manual review and tweak of each post.

Here’s the core of the PowerShell script I used:

foreach( $source in (get-childitem .\_posts\*.md )) {
    $sourceName = $source.Name
    Write-Host $sourceName
 
    # Load the contents of the file as a single string, joining lines with CRLF
    $content = (get-content $source.FullName) -join "`r`n"
    
    # Convert Links from <a> to Markdown style
    $content = $content -replace '<a\s+href="([^"]+)">([^<]+)</a>', '[$2]($1)'

    # Convert paragraphs and lists
    $content = $content -replace "\s*<ul>\s*", "`r`n"
    $content = $content -replace "\s*</ul>\s*", "`r`n"
    $content = $content -replace "\s*<ol>\s*", "`r`n"
    $content = $content -replace "\s*</ol>\s*", "`r`n"
    $content = $content -replace "<p>", "`r`n"
    $content = $content -replace "</p>", "`r`n"
    $content = $content -replace "<li>", "`r`n  *  "
    $content = $content -replace "</li>", ""
    
    # Word wrap each paragraph
    $content = $content -split "`r`n" | foreach-object { wrap-string $_ 120 } | join-string -separator "`r`n"
    
    # Word/Phrase highlighting    
    $content = $content -replace "<em>", "*"
    $content = $content -replace "</em>", "*"
    $content = $content -replace "<b>", "**"
    $content = $content -replace "</b>", "**"
    $content = $content -replace "<strong>", "**"
    $content = $content -replace "</strong>", "**"
    $content = $content -replace "&quot;", "'"
    
    $content = $content -replace "<!--break-->", ""
    
    # Eliminate excess whitespace: empty out whitespace-only lines,
    # then collapse any run of three or more line breaks down to two
    $content = $content -replace '(?m)^[ \t]+(?=\r?$)', ''
    $content = $content -replace '(\r\n){3,}', "`r`n`r`n"

    set-content .\_processed\$sourceName -value $content 
}
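
The script relies on a `wrap-string` helper that isn't shown here. As a rough sketch only, assuming it simply breaks a paragraph at spaces so that no line exceeds the given width, a stand-in might look like the following. The parameter names `Text` and `Width` are my own invention for illustration, not taken from the original helper.

```powershell
# Hypothetical stand-in for the wrap-string helper called by the script above.
# Assumption: it wraps at spaces and joins the resulting lines with CRLF.
function Wrap-String {
    param(
        [string] $Text,
        [int] $Width = 120
    )

    # Short lines (including blank ones) pass through untouched
    if ($Text.Length -le $Width) {
        return $Text
    }

    $lines = @()
    $current = $null
    foreach ($word in ($Text -split ' ')) {
        if ($null -eq $current) {
            # First word of a new line
            $current = $word
        }
        elseif ($current.Length + 1 + $word.Length -le $Width) {
            # The word still fits on the current line
            $current = "$current $word"
        }
        else {
            # Line is full: emit it and start a new one
            $lines += $current
            $current = $word
        }
    }
    $lines += $current

    $lines -join "`r`n"
}
```

A word longer than the width (a long URL, say) is left on a line of its own instead of being split, which keeps Markdown links intact. Leading spaces on the first line survive, so list items that start with `  *  ` keep their indent.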
