Writing your own Markdown to LaTeX parser - Subhaditya's Website

# Writing your own Markdown to LaTeX parser

## What we want

## Disclaimers

## Overview of Steps

## Let’s Make It!

## Libraries

```python
import markdown
import argparse as ap
from pathlib import Path
import re
from html.parser import HTMLParser
from html.entities import name2codepoint
```

### Base Templates

```python
# Preamble written at the top of the .tex file when -d is passed.
# Backslashes are doubled so Python keeps them as literal backslashes.
default_template = """
\\documentclass[12pt]{article}
\\usepackage[a4paper, total={6in, 8in}]{geometry}
\\usepackage[utf8]{inputenc}
\\usepackage[T1]{fontenc}
\\usepackage[english]{babel}
\\usepackage{graphicx}
\\usepackage[dvipsnames]{xcolor}
\\usepackage{hyperref}
\\usepackage{listings}
\\newcommand\\myshade{85}
\\colorlet{mylinkcolor}{violet}
\\colorlet{mycitecolor}{YellowOrange}
\\colorlet{myurlcolor}{Aquamarine}
\\hypersetup{
  linkcolor  = mylinkcolor!\\myshade!black,
  citecolor  = mycitecolor!\\myshade!black,
  urlcolor   = myurlcolor!\\myshade!black,
  colorlinks = true,
}
\\author{}
"""
```

### HTML Parser

```python
class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        # Collect every (name, value) attribute pair of every tag seen
        for attr in attrs:
            self.attrs.append(attr)

    def get_attrs(self):
        return self.attrs

    def handle_endtag(self, tag):
        pass

    # The handlers below only print debug output
    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    # With the default convert_charrefs=True, references are decoded and
    # passed to handle_data, so the next two handlers are not called
    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)
```

```python
def get_html_attributes(text):
    # A fresh parser per call, so attributes do not leak between tags
    parser = MyHTMLParser()
    parser.feed(text)
    return parser.get_attrs()
```

### Replace strings

```python
# HTML fragment -> LaTeX fragment, applied in insertion order
replacer_dict = {
    "<head>": "",
    "</head>": "",
    "<html>": "",
    "</html>": "",
    "<p>": "",
    "</p>": "",
    "<h1>": "\\begin{document}\n\\title{",
    "</h1>": "}\n\\maketitle{}\n",
    "<h2>": "\\section{",
    "<h3>": "\\subsection{",
    "<h4>": "\\subsubsection{",
    # "<body>": "\\begin{document}\n",
    "</body>": "\\end{document}\n",
    "<ul>": "\\begin{itemize}\n",
    "</ul>": "\\end{itemize}\n",
    "<ol>": "\\begin{enumerate}\n",
    "</ol>": "\\end{enumerate}\n",
    "<code>": "\\begin{lstlisting}[language=Python]\n",
    "</code>": "\\end{lstlisting}\n",
    "<li>": "\\item ",
    "</li>": "",
    # Escape characters that are special in LaTeX
    "%": "\\%",
    "&": "\\&",
}
```

### Close Tags

```python
def add_end_brace(list_of_vals, replacer_dict):
    # "<h2>, <h3>" -> ["<h2>", "<h3>"]
    list_of_vals = [x.strip() for x in list_of_vals.split(",")]
    for i in list_of_vals:
        # Map each closing tag (e.g. "</h2>") to the brace that ends the command
        replacer_dict[i.replace("<", "</")] = "}\n"
```

```python
add_end_brace(
    list_of_vals="<h2>, <h3>, <h4>",
    replacer_dict=replacer_dict
)
```

### Images

```python
def figure_code(text):
    # Non-greedy match, so several images on one line are found separately
    found_links = re.findall(r"<img .*? />", text)
    for link in found_links:
        attrs = get_html_attributes(link)
        caption_data = ""
        file_path = ""
        for i in attrs:
            if i[0] == "alt":
                caption_data = i[1]
            if i[0] == "src":
                file_path = i[1]
        gen_latex = (
            "\\begin{figure}[!htbp]\n\\centering\n"
            "\\includegraphics[width=.75\\columnwidth]{" + file_path + "}\n"
            "\\caption{" + caption_data + "}\n\\label{}\n\\end{figure}"
        )
        text = text.replace(link, gen_latex)
    return text
```

### CLI input

```python
ags = ap.ArgumentParser("md2tex")
ags.add_argument("-f", help="Full file path", required=True)
ags.add_argument("-d", help="Insert default formatting code", action='store_true')
aps = ags.parse_args()
f_name = Path(aps.f)
```

### Running the pipeline

```python
# Read the file
with open(f_name, 'r') as f:
    text = f.read()

html = markdown.markdown(text)

# Replacing things
text = figure_code(html)
for key, value in replacer_dict.items():
    text = text.replace(key, value)

# Write the file next to the input, with a .tex extension
with open(f_name.parent/f"{f_name.stem}.tex", 'w') as f:
    if aps.d:
        f.write(default_template)
    f.write(text)
    if aps.d:
        f.write("\\end{document}")
```

## Fin

I wrote this article in the hope that it will help someone out, and maybe give them the help that I did not have. I do not know who it will reach, but to whoever it does, best of luck :)

Like these/Want more? Buy me a coffee! [Kofi](https://ko-fi.com/subhadityamukherjee)

Want articles on something specific? Just ask! You can always contact me on [LinkedIn](https://www.linkedin.com/in/subhaditya-mukherjee-a36883100), or drop me an [[mailto:[email protected]|Email]].

For all the code, drop by my [GitHub](https://github.com/SubhadityaMukherjee/).
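As a footnote to the Images step above: that step can be tried on its own, without the `markdown` package, by feeding it a hand-written HTML fragment. The block below is a minimal sketch using only the standard library. The names `ImgAttrParser` and `img_to_figure` are hypothetical and not part of the original script, and the sketch uses `re.sub` with a callback instead of `findall` plus `str.replace`. It assumes the `alt` text and `src` path contain no `>` character and need no LaTeX escaping.

```python
import re
from html.parser import HTMLParser


class ImgAttrParser(HTMLParser):
    """Collects the attributes of every <img> tag fed to it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag calls handle_starttag.
        if tag == "img":
            self.images.append(dict(attrs))


def img_to_figure(html, width=0.75):
    """Replace each <img> tag in `html` with a LaTeX figure environment."""

    def build_figure(match):
        parser = ImgAttrParser()
        parser.feed(match.group(0))
        parser.close()
        attrs = parser.images[0] if parser.images else {}
        return (
            "\\begin{figure}[!htbp]\n"
            "\\centering\n"
            f"\\includegraphics[width={width}\\columnwidth]{{{attrs.get('src', '')}}}\n"
            f"\\caption{{{attrs.get('alt', '')}}}\n"
            "\\end{figure}"
        )

    # [^>]* stops at the first ">", so each tag is matched separately
    # even when several images share one line.
    return re.sub(r"<img\b[^>]*>", build_figure, html)


sample = '<p><img alt="First" src="a.png" /> <img alt="Second" src="b.png" /></p>'
print(img_to_figure(sample))
```

Each tag is matched on its own, so two images in the same paragraph become two separate `figure` environments instead of being swallowed by one greedy match.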