pandoc-php/pandoc
Latest stable version: 3.3.2
Composer install command:
composer require pandoc-php/pandoc
Package description
A native PHP 8.4 port of the Pandoc document converter.
README
A native PHP 8.4 port of the Pandoc document converter. This library converts Word (.docx), Excel (.xlsx), PowerPoint (.pptx), HTML (.html), Markdown (.md), and Jupyter (.ipynb) documents to LaTeX without requiring the system-level Pandoc binary.
Features
- Native PHP 8.4: Uses readonly classes, Enums, and property hooks.
- AST-Centric Architecture: Mirrors Pandoc's Abstract Syntax Tree for robust conversions.
- Modular Reader System: Factory pattern and ReaderInterface for easy format expansion.
- Deep Docx Parsing: Paragraphs, headers, tables, lists, images, bold/italic/underline/strikeout, superscript/subscript, text and background colors, hyperlinks (external \href/\url, internal \hyperref), footnotes and endnotes (\footnote), automatic run-merging (consecutive runs with identical styling are collapsed into one command), and black-color suppression (spurious \textcolor[HTML]{000000} commands are dropped).
- Excel (XLSX): All sheets as booktabs tables, shared strings, bold/italic, embedded images, chart extraction (JSON metadata + CSV data for Chart.js), per-sheet CSV export with locale-aware separators, and a metadata.json summary of document locale.
- PowerPoint (PPTX): Each slide becomes a slide environment, all slides wrapped in a slider environment. Images, embedded videos (\begin{video}...\end{video}), and audio (\begin{audio}...\end{audio}) extracted to MediaBag.
- LaTeX Generation: Standalone documents or body fragments.
- Automatic ZIP Bundling: When a document contains images or chart data, output is a .zip with the .tex and all media files in the same directory. Plain .tex otherwise.
- Full UTF-8: End-to-end UTF-8, supporting CJK, Cyrillic, Arabic, Thai, and all Latin-extended scripts.
- No External Dependencies: Pure PHP 8.4+.
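The run-merging and black-color suppression described in the Docx feature can be illustrated with a small standalone sketch. It is not the library's implementation: the function name mergeRuns and the run shape (a text string plus a style array) are hypothetical, chosen only to show the idea of collapsing adjacent runs that share identical styling.

```php
<?php
declare(strict_types=1);

/**
 * Merge consecutive runs whose styling is identical.
 * The run shape ['text' => string, 'style' => array] is hypothetical,
 * not the library's internal representation.
 */
function mergeRuns(array $runs): array
{
    $merged = [];
    foreach ($runs as $run) {
        $style = $run['style'];
        ksort($style); // make key order irrelevant when comparing styles

        // Black is the default text color, so an explicit 000000 adds nothing.
        if (strtoupper($style['color'] ?? '') === '000000') {
            unset($style['color']);
        }

        $last = array_key_last($merged);
        if ($last !== null && $merged[$last]['style'] === $style) {
            // Same styling as the previous run: extend it instead of starting a new one.
            $merged[$last]['text'] .= $run['text'];
        } else {
            $merged[] = ['text' => $run['text'], 'style' => $style];
        }
    }
    return $merged;
}

$runs = [
    ['text' => 'Hello ', 'style' => ['bold' => true]],
    ['text' => 'World',  'style' => ['bold' => true, 'color' => '000000']],
    ['text' => '!',      'style' => []],
];
$merged = mergeRuns($runs);
// $merged now holds two runs: bold "Hello World" and unstyled "!".
```

With merged runs, a writer would emit one bold command for "Hello World" rather than two adjacent ones.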
Installation
Requires PHP 8.4 or higher.
composer require pandoc-php/pandoc
Basic Usage
Converting a Word Document to LaTeX
use Pandoc\Reader\DocxReader;
use Pandoc\Writer\LatexWriter;

$reader = new DocxReader();
$writer = new LatexWriter();

$doc = $reader->read('document.docx');
$latex = $writer->write($doc, standalone: true);
file_put_contents('document.tex', $latex);
Converting Markdown to a LaTeX Fragment
use Pandoc\Reader\MarkdownReader;
use Pandoc\Writer\LatexWriter;

$reader = new MarkdownReader();
$writer = new LatexWriter();

$markdown = "# Hello World\nThis is a paragraph.";
$doc = $reader->read($markdown);

// standalone: false → body only, no \documentclass preamble
$fragment = $writer->write($doc, standalone: false);
Converting HTML to LaTeX
use Pandoc\Reader\HtmlReader;
use Pandoc\Writer\LatexWriter;

$reader = new HtmlReader();
$writer = new LatexWriter();

$doc = $reader->read("<h1>Hello</h1><p>World</p>");
$latex = $writer->write($doc);
Converting an Excel Spreadsheet to LaTeX
use Pandoc\Reader\XlsxReader;
use Pandoc\Writer\LatexWriter;

$reader = new XlsxReader();
$writer = new LatexWriter();

$doc = $reader->read('spreadsheet.xlsx');
$latex = $writer->write($doc);
Each sheet produces a level-2 header followed by a booktabs table. If the spreadsheet contains embedded images or charts, use the ZIP output pattern below.
Note: Only .xlsx (OOXML) is supported. Legacy .xls files must be converted first (e.g. via LibreOffice).
Chart extraction: Charts are exported as two companion files added to the MediaBag:
chart1.json — Chart.js-ready metadata:
{
"type": "bar",
"title": "Sales by Quarter",
"dataFile": "chart1.csv",
"options": {
"indexAxis": "x",
"scales": {
"x": { "title": { "display": true, "text": "Quarter" }, "stacked": false },
"y": { "title": { "display": true, "text": "Revenue" }, "stacked": false }
}
},
"series": [
{ "label": "Product A" },
{ "label": "Product B" }
]
}
chart1.csv — the data (categories + one column per series):
Category,Product A,Product B
Q1,120,85
Q2,135,90
Q3,128,95
Q4,145,110
A comment marker is inserted in the LaTeX at the chart's position:
% [pandoc-chart: chart1.json]
Your app reads the marker → loads the JSON → finds dataFile → loads the CSV → renders with Chart.js.
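The marker-to-chart flow above can be sketched in plain PHP. This is a minimal sketch under stated assumptions, not part of the library: the function name chartConfigsFromLatex and the filename => contents array (for example, the entries read back from the ZIP) are hypothetical, and the numeric cast assumes the chart CSV uses dot decimals.

```php
<?php
declare(strict_types=1);

/**
 * Build a Chart.js config for every chart marker found in the LaTeX source.
 * $files maps filename => contents; this shape is an assumption of the sketch.
 */
function chartConfigsFromLatex(string $latex, array $files, string $delimiter = ','): array
{
    // Find comment markers of the form: % [pandoc-chart: chart1.json]
    preg_match_all('/^%\s*\[pandoc-chart:\s*([^\]\s]+)\s*\]/m', $latex, $matches);

    $configs = [];
    foreach ($matches[1] as $jsonName) {
        $meta = json_decode($files[$jsonName], true, 512, JSON_THROW_ON_ERROR);

        // Load the companion CSV named by dataFile (simple line split:
        // quoted fields containing newlines are not handled here).
        $lines = preg_split('/\R/', trim($files[$meta['dataFile']]));
        $rows = array_map(
            fn(string $line): array => str_getcsv($line, $delimiter, '"', ''),
            $lines
        );
        array_shift($rows); // drop the header row: Category, series names

        // Column 0 holds the categories; column i + 1 holds series i.
        $datasets = [];
        foreach ($meta['series'] as $i => $series) {
            $datasets[] = [
                'label' => $series['label'],
                'data'  => array_map(fn(array $r): float => (float) $r[$i + 1], $rows),
            ];
        }

        $configs[$jsonName] = [
            'type'    => $meta['type'],
            'data'    => ['labels' => array_column($rows, 0), 'datasets' => $datasets],
            'options' => $meta['options'] ?? [],
        ];
    }
    return $configs;
}

$files = [
    'chart1.json' => '{"type":"bar","title":"Sales by Quarter","dataFile":"chart1.csv",'
        . '"options":{"indexAxis":"x"},"series":[{"label":"Product A"},{"label":"Product B"}]}',
    'chart1.csv'  => "Category,Product A,Product B\nQ1,120,85\nQ2,135,90\nQ3,128,95\nQ4,145,110\n",
];
$latex = "Quarterly results\n% [pandoc-chart: chart1.json]\n";

$configs = chartConfigsFromLatex($latex, $files);
echo json_encode($configs['chart1.json'], JSON_PRETTY_PRINT), PHP_EOL;
```

The resulting array can be JSON-encoded and passed to the Chart.js constructor in the browser. For spreadsheets exported with a semicolon delimiter or comma decimals, the delimiter argument and the numeric cast would need to follow metadata.json.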
Per-sheet CSV export: Each worksheet is also exported as a standalone CSV file (e.g. sheet-Sales.csv) added to the MediaBag. Trailing empty rows and columns are stripped automatically.
Locale detection: The reader inspects docProps/core.xml for a <dc:language> tag and selects separators accordingly:
| Language group | Decimal sep. | Thousands sep. | Column delim. |
|---|---|---|---|
| en, ja, zh, pt-BR, … | . | , | , |
| fr, de, it, es, nl, pl, ru, … | , | . | ; |
When no language tag is present, the export falls back to en-US conventions.
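The table can be read as a lookup from language tag to separators. The sketch below is an illustration only: the function name separatorsFor is hypothetical, only the languages the table names are listed (the elided "…" entries are not guessed, so unlisted tags return null), and matching regional variants such as fr-CA by their primary subtag is an assumption rather than documented behavior.

```php
<?php
declare(strict_types=1);

/**
 * Look up CSV separators for a <dc:language> value.
 * Returns null for languages the table does not name explicitly.
 */
function separatorsFor(?string $languageTag): ?array
{
    $dot   = ['decimal' => '.', 'thousands' => ',', 'delimiter' => ','];
    $comma = ['decimal' => ',', 'thousands' => '.', 'delimiter' => ';'];

    // No language tag: en-US conventions, as stated in the README.
    if ($languageTag === null || trim($languageTag) === '') {
        return $dot;
    }

    $groups = [
        'en' => $dot, 'ja' => $dot, 'zh' => $dot, 'pt-br' => $dot,
        'fr' => $comma, 'de' => $comma, 'it' => $comma, 'es' => $comma,
        'nl' => $comma, 'pl' => $comma, 'ru' => $comma,
    ];

    $tag = strtolower(str_replace('_', '-', trim($languageTag)));
    $primary = explode('-', $tag)[0];

    // Try the full tag first (pt-BR), then the primary subtag (fr-FR -> fr).
    return $groups[$tag] ?? $groups[$primary] ?? null;
}

$french  = separatorsFor('fr-FR');
$missing = separatorsFor(null);
$unknown = separatorsFor('pt-PT');
```

In practice the metadata.json file written by the reader is the more reliable source for these values, since it records what was actually used for the export.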
metadata.json: Always added to the MediaBag alongside the CSVs:
{
"language": "fr-FR",
"decimalSeparator": ",",
"thousandsSeparator": ".",
"columnDelimiter": ";",
"quoteCharacter": "\"",
"sheets": ["Sheet1", "Sheet2"]
}
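A consumer of the exported CSVs can use metadata.json to parse them correctly regardless of locale. The following is a sketch, not code shipped with the library: the function names parseLocalizedNumber and parseSheetCsv are hypothetical, and it assumes the CSV cells use exactly the separators recorded in metadata.json.

```php
<?php
declare(strict_types=1);

/** Convert a locale-formatted number such as "1.234,50" to a float, or null if not numeric. */
function parseLocalizedNumber(string $value, array $meta): ?float
{
    // Strip the thousands separator first, then normalize the decimal separator to a dot.
    $plain = str_replace($meta['thousandsSeparator'], '', trim($value));
    $plain = str_replace($meta['decimalSeparator'], '.', $plain);
    return is_numeric($plain) ? (float) $plain : null;
}

/**
 * Parse an exported sheet CSV into rows using the separators from metadata.json.
 * Lines are split on newlines, so quoted cells containing line breaks are not handled.
 */
function parseSheetCsv(string $csv, array $meta): array
{
    $rows = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        $cells = str_getcsv($line, $meta['columnDelimiter'], $meta['quoteCharacter'], '');
        // Numeric-looking cells become floats; everything else stays a string.
        $rows[] = array_map(
            fn(?string $cell): float|string|null =>
                $cell === null ? null : (parseLocalizedNumber($cell, $meta) ?? $cell),
            $cells
        );
    }
    return $rows;
}

$meta = json_decode(
    '{"language":"fr-FR","decimalSeparator":",","thousandsSeparator":".",'
    . '"columnDelimiter":";","quoteCharacter":"\"","sheets":["Sheet1"]}',
    true,
    512,
    JSON_THROW_ON_ERROR
);
$csv = "Produit;Prix\nCafé;1.234,50\n\"Thé; vert\";3,75\n";

$rows = parseSheetCsv($csv, $meta);
```

Note that this simple conversion treats any cell that becomes numeric after separator removal as a number, so identifier-like text such as version strings may need a stricter check.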
Utility script: export_xlsx_media.php converts any .xlsx file to a ZIP containing its CSVs and metadata.json:
php export_xlsx_media.php spreadsheet.xlsx output.zip
Converting a PowerPoint Presentation to LaTeX
use Pandoc\Reader\PptxReader;
use Pandoc\Writer\LatexWriter;

$reader = new PptxReader();
$writer = new LatexWriter();

$doc = $reader->read('presentation.pptx');
$latex = $writer->write($doc, standalone: true);
Each slide is wrapped in a slide environment (with the slide title as argument), and all slides are enclosed in a slider environment:
\begin{slider}
\begin{slide}{Slide Title}
Paragraph content here.
\end{slide}
\begin{slide}{Second Slide}
More content.
\end{slide}
\end{slider}
These are custom environments — define them in your LaTeX preamble to control rendering. All images (including slide master/template graphics) are extracted into the MediaBag.
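One way to supply those definitions is to take the body fragment and wrap it in a document of your own. The sketch below is only one possible rendering: the function name wrapSlides and the chosen definitions (title as an unnumbered section, page break after each slide) are assumptions, and a real preamble would also need packages such as graphicx when the slides contain images.

```php
<?php
declare(strict_types=1);

/**
 * Wrap a LaTeX body fragment in a document that defines the slide and
 * slider environments. The definitions are illustrative, not shipped
 * with the library.
 */
function wrapSlides(string $body): string
{
    // Nowdoc: backslashes and #1 are passed through to LaTeX untouched.
    $preamble = <<<'TEX'
\documentclass{article}
% slider: container for all slides; adds nothing of its own
\newenvironment{slider}{}{}
% slide: title as an unnumbered section, page break after each slide
\newenvironment{slide}[1]{\section*{#1}}{\clearpage}
\begin{document}
TEX;

    return $preamble . "\n" . $body . "\n\\end{document}\n";
}

$body = "\\begin{slider}\n\\begin{slide}{Intro}\nHello.\n\\end{slide}\n\\end{slider}";
$tex = wrapSlides($body);
file_put_contents('slides.tex', $tex);
```

The fragment passed in would typically be the writer's output with standalone: false, so that the preamble is fully under your control.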
Embedded videos are exported as a video environment:
\begin{video}
\url{media1.mp4}
\type{mp4}
\end{video}
Embedded audio is exported as an audio environment:
\begin{audio}
\url{recording.mp3}
\end{audio}
All media files (images, video, audio) are included in the ZIP output alongside the .tex.
Converting Jupyter Notebooks to LaTeX
use Pandoc\Reader\IpynbReader;
use Pandoc\Writer\LatexWriter;

$reader = new IpynbReader();
$writer = new LatexWriter();

$json = file_get_contents('notebook.ipynb');
$doc = $reader->read($json);
$latex = $writer->write($doc);
Output: Plain .tex or .zip
When a document contains images, charts, or other media, you need to bundle them alongside the .tex file. The MediaBag tells you whether there are any attachments:
use Pandoc\Reader\ReaderFactory;
use Pandoc\Writer\LatexWriter;

$reader = ReaderFactory::createForExtension('docx'); // or xlsx, pptx, etc.
$doc = $reader->read($filePath);
$latex = (new LatexWriter())->write($doc, standalone: true);

if (!$doc->mediaBag->isEmpty()) {
    // Bundle .tex + all media into a ZIP
    $zip = new ZipArchive();
    // open() returns true on success and an error code otherwise
    if ($zip->open('output.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        throw new RuntimeException('Cannot create output.zip');
    }
    $zip->addFromString('document.tex', $latex);
    foreach ($doc->mediaBag->getAll() as $filename => $media) {
        $zip->addFromString($filename, $media['contents']);
    }
    $zip->close();
    // → distribute output.zip
} else {
    // No media: plain .tex is sufficient
    file_put_contents('document.tex', $latex);
}
All media files (images, chart JSON/CSV) are stored at the root of the ZIP, so \includegraphics{image.png} and chart references resolve correctly when the .tex is compiled or processed from the same directory.
Web Interface
The project includes a web-based demonstration tool in web/.
- Point your web server to the php-pandoc/web/ folder.
- Open index.html in your browser.
- Upload a .docx, .xlsx, .pptx, .html, .ipynb, or .md file.
- Choose Standalone or Fragment output.
- Download the result: a plain .tex if the document has no media, or a .zip if it does.
Supported Structures
See SUPPORTED_STRUCTURES.md for a full feature list. Highlights:
- Word: Headers (H1–H6, Title), bold/italic/underline/strikeout/color, lists, tables, images, headers & footers, hyperlinks, footnotes/endnotes, automatic run-merging.
- Excel: Multi-sheet tables, cell formatting, embedded images, Chart.js-ready chart extraction, per-sheet CSV export with locale-aware separators.
- PowerPoint: Slide titles, body text, bullet/ordered lists, images, tables, slide/slider LaTeX environments.
- HTML: Full block and inline element support.
- Jupyter: Markdown cells, code blocks, output images.
Development and Testing
./vendor/bin/phpunit
Credits
This project is a port of Pandoc, originally created by John MacFarlane.
License
GPL v2 or later, mirroring the original Pandoc license.
Other information
- License: GPL-2.0-or-later
- Last updated: 2026-01-07