User:Ans/pdftotext
This document describes how I convert Thai PDF files to text.
On GNU/Linux:
- use the command pdftotext (bundled with xpdf or poppler-utils)
- put the following configuration in the file ~/xpdfrc-thai-unicode:
include /etc/xpdf/xpdfrc
unicodeMap UTF-8-Thai /home/sysadmin/UTF-8-Thai.unicodeMap
- generate the file /home/sysadmin/UTF-8-Thai.unicodeMap with the following Perl one-liner (based on the code in xpdf/UTF8.h):
$ perl -e 'print "0000 007f 00\n"; for ($u=0x0080; $u<0x0800; $u+=0x40) { printf("%04x %04x %02x%02x\n", $u, $u+0x40-1, 0xc0+($u>>6), 0x80+($u&0x3f) ); } for ($u=0x0800; $u<0x10000; $u+=0x40) { printf("%04x %04x %02x%02x%02x\n", $u, $u+0x40-1, 0xe0+($u>>12), 0x80+($u>>6 & 0x3f), 0x80+($u&0x3f) ); } for ($u=0x10000; $u<0x110000; $u+=0x40) { printf("%06x %06x %02x%02x%02x%02x\n", $u, $u+0x40-1, 0xf0+($u>>18), 0x80+($u>>12 & 0x3f), 0x80+($u>>6 & 0x3f), 0x80+($u&0x3f) ); } ' > ~/UTF-8-Thai.unicodeMap
- edit the file /home/sysadmin/UTF-8-Thai.unicodeMap as shown in the following patch (derived from the mapping in /usr/share/xpdf/thai/TIS-620.unicodeMap):
--- - 2008-09-22 18:46:28.681553000 +0700
+++ /home/sysadmin/UTF-8-Thai.unicodeMap 2008-09-22 18:36:41.000000000 +0700
@@ -985,7 +985,16 @@
f640 f67f ef9980
f680 f6bf ef9a80
f6c0 f6ff ef9b80
-f700 f73f ef9c80
+f700 e0b890
+f701 f704 e0b8b4
+f705 f709 e0b988
+f70a f70e e0b988
+f70f e0b88d
+f710 e0b8b1
+f711 e0b98d
+f712 f717 e0b987
+f718 f71a e0b8b8
+f720 f73f ef9ca0
f740 f77f ef9d80
f780 f7bf ef9e80
f7c0 f7ff ef9f80
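To see what the patch does, the added entries can be decoded back into characters. The sketch below is only an illustration, not part of the original procedure. It assumes that a range line "start end out" maps code point start+i to the byte value out+i, which is my understanding of how xpdf reads unicodeMap ranges. The names PATCH_ENTRIES, expand and decode_patch are made up for this example.

```python
import unicodedata

# Entries added by the patch:
# (first code point, last code point, hex bytes emitted for the first code point)
PATCH_ENTRIES = [
    (0xF700, 0xF700, "e0b890"),
    (0xF701, 0xF704, "e0b8b4"),
    (0xF705, 0xF709, "e0b988"),
    (0xF70A, 0xF70E, "e0b988"),
    (0xF70F, 0xF70F, "e0b88d"),
    (0xF710, 0xF710, "e0b8b1"),
    (0xF711, 0xF711, "e0b98d"),
    (0xF712, 0xF717, "e0b987"),
    (0xF718, 0xF71A, "e0b8b8"),
]


def expand(first, last, out_hex):
    """Expand one map entry into {input code point: output character}.

    Assumption: for a range, the output bytes are treated as one big-endian
    number that is incremented together with the input code point.
    """
    base = int(out_hex, 16)
    width = len(out_hex) // 2
    result = {}
    for offset in range(last - first + 1):
        raw = (base + offset).to_bytes(width, "big")
        result[first + offset] = raw.decode("utf-8")
    return result


def decode_patch(entries=PATCH_ENTRIES):
    """Return the full input-to-output table for all patch entries."""
    table = {}
    for first, last, out_hex in entries:
        table.update(expand(first, last, out_hex))
    return table


if __name__ == "__main__":
    for cp, ch in sorted(decode_patch().items()):
        name = unicodedata.name(ch, "(unnamed)")
        print(f"U+{cp:04X} -> U+{ord(ch):04X} {name}")
```

Under that assumption, every private-use code point from U+F700 to U+F71A is mapped to an ordinary character in the Thai block (U+0E00 to U+0E7F) instead of being written out as a private-use character.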
- convert the PDF file with the command:
$ pdftotext -raw -enc UTF-8-Thai /tmp/1.PDF /tmp/1.txt -cfg ~/xpdfrc-thai-unicode
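The two map-building steps above (generate the plain UTF-8 map, then patch it) could also be folded into one script. The following is a minimal sketch in Python rather than Perl. It reproduces the same line format as the one-liner and then swaps the single line "f700 f73f ef9c80" for the lines added by the patch. The script name make_thai_map.py and the function names are hypothetical.

```python
import sys

# Lines that the patch puts in place of "f700 f73f ef9c80" (copied from the patch).
THAI_OVERRIDES = [
    "f700 e0b890",
    "f701 f704 e0b8b4",
    "f705 f709 e0b988",
    "f70a f70e e0b988",
    "f70f e0b88d",
    "f710 e0b8b1",
    "f711 e0b98d",
    "f712 f717 e0b987",
    "f718 f71a e0b8b8",
    "f720 f73f ef9ca0",
]


def base_map_lines():
    """Build the plain UTF-8 map, one line per block of 0x40 code points."""
    lines = ["0000 007f 00"]
    # Two-byte sequences: U+0080 to U+07FF
    for u in range(0x80, 0x800, 0x40):
        lines.append("%04x %04x %02x%02x" % (
            u, u + 0x3F, 0xC0 + (u >> 6), 0x80 + (u & 0x3F)))
    # Three-byte sequences: U+0800 to U+FFFF
    for u in range(0x800, 0x10000, 0x40):
        lines.append("%04x %04x %02x%02x%02x" % (
            u, u + 0x3F, 0xE0 + (u >> 12),
            0x80 + ((u >> 6) & 0x3F), 0x80 + (u & 0x3F)))
    # Four-byte sequences: U+10000 to U+10FFFF
    for u in range(0x10000, 0x110000, 0x40):
        lines.append("%06x %06x %02x%02x%02x%02x" % (
            u, u + 0x3F, 0xF0 + (u >> 18),
            0x80 + ((u >> 12) & 0x3F),
            0x80 + ((u >> 6) & 0x3F), 0x80 + (u & 0x3F)))
    return lines


def thai_map_lines():
    """Return the base map with the Thai overrides applied."""
    lines = base_map_lines()
    i = lines.index("f700 f73f ef9c80")
    return lines[:i] + THAI_OVERRIDES + lines[i + 1:]


if __name__ == "__main__":
    sys.stdout.write("\n".join(thai_map_lines()) + "\n")
```

It would be run as, for example, python3 make_thai_map.py > ~/UTF-8-Thai.unicodeMap, after which the pdftotext command above is used unchanged.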
Licensing
- My contribution here is released under a dual license:
- GFDL
- The same license as xpdf