HTML to Markdown converter using regex

Do you want to easily convert HTML to Markdown? Simply by copy & paste?

I have a series of blog posts written in HTML with TinyMCE that I want to convert to Markdown Extra. There are several scripts for this task:

Neither of them really satisfied me. Furthermore, I want to convert my blog posts without saving each of them to a file first. So I've written a little JavaScript script.

Because I'm using regex, the conversion is limited. The HTML must be well formed and free of errors. The script simply replaces <p> <br> <strong> <em> <h1> <h2> <h3> <h4> <h5> <a> <li> <table> with their Markdown equivalents. A conversion using this script is therefore not fully automatic, but for my task it is perfect.
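To give an idea of what this kind of regex replacement looks like, here is a minimal sketch. It is not the code of the actual script: the function name htmlToMarkdown and the exact patterns are assumptions made for illustration, tables are left out, and it only works on well-formed HTML without nested tags of the same kind.

```javascript
// Minimal sketch of regex-based HTML-to-Markdown conversion.
// htmlToMarkdown is a hypothetical name, not the function from the author's script.
function htmlToMarkdown(html) {
  return html
    // <h1>..<h5> become 1 to 5 leading '#' characters.
    .replace(/<h([1-5])(?:\s[^>]*)?>([\s\S]*?)<\/h\1>/gi,
      (match, level, text) => "#".repeat(Number(level)) + " " + text.trim() + "\n\n")
    // Inline emphasis.
    .replace(/<strong(?:\s[^>]*)?>([\s\S]*?)<\/strong>/gi, "**$1**")
    .replace(/<em(?:\s[^>]*)?>([\s\S]*?)<\/em>/gi, "*$1*")
    // Links: only double-quoted href attributes are recognised here.
    .replace(/<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // List items become '-' bullets; the surrounding <ul>/<ol> tags are dropped.
    .replace(/<li(?:\s[^>]*)?>([\s\S]*?)<\/li>/gi, "- $1\n")
    .replace(/<\/?(?:ul|ol)(?:\s[^>]*)?>/gi, "\n")
    // Line breaks and paragraphs.
    .replace(/<br\s*\/?>/gi, "  \n")
    .replace(/<p(?:\s[^>]*)?>([\s\S]*?)<\/p>/gi, "$1\n\n")
    // Collapse runs of blank lines left over from the replacements.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

console.log(htmlToMarkdown(
  '<h2>Foo</h2><p>Hello <strong>world</strong>, see <a href="http://example.com/">this</a>.</p>'
));
```

The order of the replacements matters: inline tags are handled before the block tags that contain them, so the paragraph rule sees text that is already Markdown.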

Converter

Source Code

Source code on GitHub.

The JavaScript source file.

License: GPL 3.

Appendix

Japanese Overview

There is an overview on a Japanese site.

Google Translated Text

11 libraries (+ α) worth knowing for converting HTML to Markdown

We examined a few libraries and tools that convert HTML to Markdown notation.

<h2>Foo</h2> → ## Foo

This is not about embedding a converter in an app. It is for when you want to re-edit an article in Markdown, for example before moving to Markdown in WordPress, or when a post's text has turned from Markdown into HTML after a conversion went wrong and you want to turn it back. When I looked into it, I found a lot of options.

Comparison of 11 Libraries + Tools & Editors

Library feature list.

Name                                                Language    License  Extended notation
reMarked.js                                         JavaScript  MIT      ○
to-markdown                                         JavaScript  MIT      ×
HTML2 Markdown                                      JavaScript  unknown  ×
Simple HTML to Markdown Extra converter with regex  JavaScript  GPLv3    ○
Markdownify                                         PHP         LGPL     ○
HTML To Markdown for PHP                            PHP         MIT      ×
html2text                                           Python      GPLv3    ×
url2markdown (using html2text)                      Python      GPLv3    ×
reverse_markdown                                    Ruby        WTFPL    ○
html2markdown                                       Ruby        MIT      ×
Html2Markdown                                       C#          unknown  ×
Pandoc (conversion tool for various formats)        -           -        ○
Markable (Online Editor)                            -           -        ×

※ I tried some of these briefly, but I have not verified that they convert everything without problems. There are various extended notations, so I marked a library "○" if it seemed to support at least some of them.

One point to watch in the conversion behavior: where Markdown and HTML can be mixed, as in the body of a WordPress post, information is lost if script tags and the like are not output as raw HTML after conversion. A proper (?) library includes a step that deletes script tags, since they have no meaning as document content. Conversely, a simple conversion library just replaces the HTML tags it knows and outputs the tags it cannot handle as they are.
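The difference between the two behaviors can be sketched roughly as follows. Both functions are hypothetical and written only for illustration; they are not taken from any of the libraries listed here.

```javascript
// Sketch of the two behaviours described above; both function names are hypothetical.

// "Proper" style: remove <script> elements together with their contents.
function dropScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
}

// "Simple" style: rewrite only the tags it knows and leave everything else untouched.
function replaceKnownTags(html) {
  return html.replace(/<strong>([\s\S]*?)<\/strong>/gi, "**$1**");
}

const input = "<strong>Hi</strong><script>track();</script>";

// The script element is removed, so its information is lost.
console.log(dropScripts(replaceKnownTags(input)));
// The script element survives as raw HTML inside the Markdown output.
console.log(replaceKnownTags(input));
```

Which behavior is preferable depends on whether the Markdown will later be rendered somewhere that accepts embedded HTML.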

In the following, "early Markdown" means Markdown 1.0.1 without any extensions. I do not know whether each library is actually 100% compliant.

reMarked.js

  • JavaScript, MIT
  • demo

reMarked.js is a JavaScript library that supports not only the early Markdown notation but also tables. The demo site converts whatever HTML text you give it, which is handy for small conversions. It is used by pronama.jp/md.

It also offers a variety of conversion options (not available on the demo site). By default, the contents of script are not output, but you can specify which tags are output and which are not.

var reMarker = new reMarked({unsup_tags: {ignore: ""}}); // no tags are excluded from the output
var markdown = reMarker.render(document.body);

to-markdown

  • JavaScript, MIT
  • demo

to-markdown is a simple JavaScript library. The site has a bug related to blockquote, and unfortunately development seems to have stopped.

It converts simple HTML without any problems, and you can try it on the demo site right away. Content that cannot be converted is output as is.

HTML2 Markdown

  • JavaScript, license unknown
  • No demo site

It appears to do the conversion using an HTML parser library. I have not tried it.

HTML to Markdown converter using regex

  • JavaScript, GPLv3
  • demo

This is a simple piece of JavaScript code that converts HTML to Markdown with regular expressions. Its functionality is limited, but it does cover extended notation.
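As an illustration of what regex-based support for an extended notation can look like, here is a sketch that turns a simple HTML table into a Markdown Extra pipe table. The helper tableToMarkdown is hypothetical and is not the converter's actual code; it assumes the first row is the header and ignores colspan, rowspan, and nested tables.

```javascript
// Sketch: convert a simple, well-formed HTML table to a Markdown Extra (pipe) table.
// tableToMarkdown is a hypothetical helper written for illustration only.
function tableToMarkdown(html) {
  const rows = [];
  const rowPattern = /<tr(?:\s[^>]*)?>([\s\S]*?)<\/tr>/gi;
  let rowMatch;
  while ((rowMatch = rowPattern.exec(html)) !== null) {
    const cells = [];
    // Matches <th> and <td> cells, but not <thead>.
    const cellPattern = /<t[hd](?:\s[^>]*)?>([\s\S]*?)<\/t[hd]>/gi;
    let cellMatch;
    while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
      cells.push(cellMatch[1].trim());
    }
    rows.push(cells);
  }
  if (rows.length === 0) return "";

  const line = (cells) => "| " + cells.join(" | ") + " |";
  // The separator row is what marks the first row as the table header.
  const separator = line(rows[0].map(() => "---"));
  return [line(rows[0]), separator, ...rows.slice(1).map(line)].join("\n");
}

console.log(tableToMarkdown(
  "<table><tr><th>name</th><th>license</th></tr><tr><td>html2text</td><td>GPLv3</td></tr></table>"
));
```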

Markdownify

Markdownify is a library that supports Markdown Extra. It can also convert tables and HTML attribute values to Markdown.

<?php
$converter = new Markdownify\ConverterExtra;
$converter->parseString('<h1 id="md">Heading</h1>');
// Returns: # Heading {#md}
?>

It provides one class that converts to early Markdown and one that converts to extended Markdown. Content that cannot be converted is output as is, but content such as script is not output. Reading the code shows that there are conversion options, but to have that omitted content output as is you need to modify the code.

My impression was that it supports extended notation the most thoroughly (though I do not know how well it actually converts). However, development has stopped and the official site is down.
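The header-id conversion in the Markdownify example above (an id attribute becoming {#md}) can be sketched in JavaScript with a regular expression. The helper headingWithId is hypothetical and has nothing to do with Markdownify's own implementation; it assumes well-formed headings with double-quoted attributes.

```javascript
// Sketch: carry an HTML heading's id over as a Markdown Extra header id.
// headingWithId is a hypothetical helper, unrelated to Markdownify's implementation.
function headingWithId(html) {
  return html.replace(
    /<h([1-6])((?:\s[^>]*)?)>([\s\S]*?)<\/h\1>/gi,
    (match, level, attrs, text) => {
      // Look for id="..." among the heading's attributes.
      const id = /\sid\s*=\s*"([^"]*)"/i.exec(attrs);
      const suffix = id ? " {#" + id[1] + "}" : "";
      return "#".repeat(Number(level)) + " " + text.trim() + suffix;
    }
  );
}

console.log(headingWithId('<h1 id="md">Heading</h1>'));
```

Headings without an id come out as plain Markdown headings, so the same function covers both the early and the extended notation.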

HTML To Markdown for PHP

  • PHP, MIT
  • No demo site

HTML To Markdown for PHP is a PHP library that supports early Markdown. It has fewer features, but it does offer conversion options.

html2text

  • Python, GPLv3
  • demo

html2text is an old Python library. It supports the early Markdown notation. There is a demo that converts web pages from URLs, but it seems to fail most of the time.

Content that cannot be converted is deleted. Here too, reading the code reveals conversion options, but there is no setting to output everything.

import html2text
h = html2text.HTML2Text()
h.ignore_links = True  # drop link targets and keep only the link text
print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))

Perhaps because it is an old library, my impression is that many tools use html2text. Markdown Service Tools, a set of tools for OS X, uses it as well.

url2markdown

  • Python, GPLv3
  • demo

url2markdown is a library that fetches a web page from a URL and converts it to Markdown. The conversion part is html2text. However, it uses Readability's Parser API to fetch web pages, and the results are out of date and information is lost.

reverse_markdown

  • Ruby, WTFPL
  • No demo site

reverse_markdown is a Ruby library distributed as a gem. I have not actually tried it, but according to its description, tables are also supported.

html2markdown

  • Ruby, MIT
  • No demo site

html2markdown is also a Ruby library distributed as a gem. It supports only simple conversion, as its description "Simple html to Markdown" says. It is also available on Bitbucket.

Html2Markdown

  • C#, license unknown
  • No demo site

Html2Markdown is a C# library that is also published on NuGet. Judging from the code, the conversion is fairly simple.

Pandoc (conversion tool)

Pandoc is a tool that can convert between various document formats, including HTML to Markdown. It supports various Markdown notations. However, the table conversion result was questionable, with columns missing. For details on Pandoc, see the following.

HTML - Supports a variety of formats! Get to know the document conversion tool Pandoc - Qiita

Markable (Online Editor)

Markable is an online Markdown editor. I include it here for its HTML to Markdown conversion function. Once you register an account, you can use the HTML import feature. The conversion from HTML seems to support only the early Markdown notation, but the Markdown to HTML conversion also supports extended notation such as tables.

Among the online editors I examined, only Markable supported conversion from HTML.

In conclusion

At first I found only a few libraries and checked how they behaved, but the more I searched, the more turned up. Everyone is making one.

If you just want to convert a little, the reMarked.js demo site is useful. The end goal of reMarked.js itself seems to be integration with existing WYSIWYG HTML editors.

Please let me know if you have any other information or mistakes.