How can I improve Hindi text extraction?
I am trying to extract Hindi text from a PDF. I tried all the direct PDF text-extraction methods, but none of them worked. There are explanations of why this fails, but no working answer. So I decided to convert the PDF to an image and then use pytesseract to extract the text. I downloaded the Hindi trained data, but this also produces very inaccurate text.
This is the actual Hindi text in the PDF (download link):
Here is my code so far:
import fitz
filepath = r"D:\BADI KA BANS-Ward No-002.pdf"
doc = fitz.open(filepath)
page = doc.loadPage(3) # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)
from PIL import Image
import pytesseract
# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Create an image object of PIL library
image = Image.open('outfile.png')
# pass image into pytesseract module
# pytesseract is trained in many languages
image_to_text = pytesseract.image_to_string(image, lang='hin')
# Print the text
print(image_to_text)
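One likely cause of the inaccurate OCR is resolution: PyMuPDF's getPixmap renders at its base resolution of 72 dpi by default, which is usually too coarse for Tesseract. A sketch of rendering at a higher dpi instead (the helper name and the 300 dpi value are my choices, not from the question; the answer below achieves the same with pdf2image):

```python
def render_page_as_png(pdf_path, page_index, out_png, dpi=300):
    """Rasterize one PDF page at `dpi` using PyMuPDF (base resolution: 72 dpi)."""
    import fitz  # PyMuPDF; imported here so the helper is self-contained
    zoom = dpi / 72  # scale factor applied before rasterizing
    doc = fitz.open(pdf_path)
    # Scaling the page up before rasterizing keeps glyphs legible for OCR
    pix = doc.load_page(page_index).get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    pix.save(out_png)

# Usage with the file from the question:
# render_page_as_png(r"D:\BADI KA BANS-Ward No-002.pdf", 3, "outfile.png", dpi=300)
```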
Here is some sample output:
??? ???? ???? ? ?... ??? ??????? ???? ??? ??
??? ?? ???: ???? ???. “50?... ???? ?? ????????.... “??? ?? ???: ??????
??? ??: 43 ?????????: 93?. ???? ?????: 3?
??: 29 _ ???? ??. | ?? 57 ???? ????? ??: 62 ???? ??
???????????? (???? ???????????
??? ????? ???? ??... ?? ???????? ???... ???? ??? ??
???? ?? ???????? ????.“ ??? | ???? ?? ??????????? ?????... 0 2... | ???? ???????? ???? .... “20?
|??????????: 43? ?????????: 43?. ??? ??????: 44
???: 27 ???? ?? ??: 27 ?? ?? ?? ???? ?????
There is an answer to the question "I want to scrape a hindi (Indian language) pdf file with python" which seems to show how to do this, but it provides no explanation.
Is there any way to do this besides training a language model myself?
Answer
I'll give some ideas on how to work with your image, but I'll limit this to page 3 of the given document, i.e. the page shown in the question.
To convert the PDF page to an image, I used pdf2image.
For the OCR, I use pytesseract, but instead of lang='hin' I use lang='Devanagari', cf. the tessdata GitHub page. In general, also make sure to work through Improving the quality of the output from the Tesseract documentation, especially the page segmentation methods.
This is the (lengthy) description of the whole procedure:
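For context, page segmentation mode 6 (used throughout the code below) tells Tesseract to treat the input as a single uniform block of text, which suits individual table cells and text lines better than the default full-page layout analysis. A minimal sketch of how such a config string is built (the tess_config helper is mine, not part of pytesseract):

```python
# Tesseract page segmentation modes (--psm) most relevant here:
#   3 = fully automatic page layout analysis (the default)
#   6 = assume a single uniform block of text (good for one cell or line)
#   7 = treat the image as a single text line
PSM_SINGLE_BLOCK = 6

def tess_config(psm, extra=""):
    """Build the string passed to pytesseract's `config` parameter."""
    return f"--psm {psm} {extra}".strip()

print(tess_config(PSM_SINGLE_BLOCK))   # --psm 6
# Then, e.g.: pytesseract.image_to_string(img, config=tess_config(6), lang='Devanagari')
```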
- Inverse binarize the image for contour finding: white text, shapes, etc. on a black background.
- Find all contours, and filter out the two very large ones, i.e. the two tables.
- Extract the texts outside of the two tables:
  - Mask out the tables in the binarized image.
  - Do morphological closing to connect the remaining text lines.
  - Find contours and bounding rectangles of those text lines.
  - Run pytesseract to extract the texts.
- Extract the texts inside of the two tables:
  - From the current table, extract the cells, or better: their bounding rectangles.
  - For the first table:
    - Run pytesseract to extract the texts as-is.
  - For the second table:
    - Floodfill the rectangles around the numbers to prevent wrong OCR output.
    - Mask the left (Hindi) and right (English) parts.
    - Run pytesseract using lang='Devanagari' on the left part, and lang='eng' on the right part, to improve the OCR quality for both.
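Several of the steps above sort the bounding rectangles returned by cv2.boundingRect into reading order before OCR. Since each rectangle is a plain (x, y, w, h) tuple, this is a pure-Python sort; a minimal sketch with made-up rectangles:

```python
# Each rectangle is (x, y, w, h) as returned by cv2.boundingRect.
# Sorting by (y, x) approximates top-to-bottom, left-to-right reading order.
rects = [(120, 300, 50, 20), (10, 10, 40, 15), (200, 10, 40, 15)]
ordered = sorted(rects, key=lambda r: (r[1], r[0]))
print(ordered)  # [(10, 10, 40, 15), (200, 10, 40, 15), (120, 300, 50, 20)]
```

Sorting by y alone (as done for the text blocks outside the tables) is enough when no two blocks share a row; the cells inside the tables need the (y, x) tie-breaker.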
This would be the whole code:
import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# STEP 1: Extract texts outside of the two tables

# Mask out the two tables
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)

# Find bounding rectangles of texts outside of the two tables
no_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255))
cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1])

# Extract texts from each bounding rectangle
print('\nExtract texts outside of the two tables\n')
for (x, y, w, h) in rects:
    text = pytesseract.image_to_string(page_3[y:y+h, x:x+w],
                                       config='--psm 6', lang='Devanagari')
    text = text.replace('\n', '').replace('\f', '')
    print('x: {}, y: {}, text: {}'.format(x, y, text))

# STEP 2: Extract texts from inside of the two tables
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],
               key=lambda r: r[1])

# Iterate each table
for i_r, (x, y, w, h) in enumerate(rects, start=1):

    # Find bounding rectangles of cells inside of the current table
    cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2],
                            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                         key=lambda r: (r[1], r[0]))

    # Extract texts from each cell of the current table
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:

        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]

        # For table 1, simply extract texts as-is
        if i_r == 1:
            text = pytesseract.image_to_string(cell, config='--psm 6',
                                               lang='Devanagari')
            text = text.replace('\n', '').replace('\f', '')
            print('x: {}, y: {}, text: {}'.format(xx, yy, text))

        # For table 2, extract single elements
        if i_r == 2:

            # Floodfill rectangles around numbers
            ys, xs = np.min(np.argwhere(cell == 0), axis=0)
            temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]
            mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(),
                                 None, (xs, ys), 0)[1]

            # Extract left (Hindi) and right (English) parts
            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                                    np.full((2 * hh, 5), 255))
            cnts = cv2.findContours(mask,
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
            cnts = cnts[0] if len(cnts) == 2 else cnts[1]
            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                           key=lambda b: b[0])

            # Extract texts from each part of the current cell
            for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1):

                # For the left (Hindi) part, extract Hindi texts
                if i_b == 1:
                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='Devanagari')
                    text = text.replace('\f', '')

                # For the right (English) part, extract English texts
                if i_b == 2:
                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='eng')
                    text = text.replace('\f', '')

                print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))
And these are the first few lines of the output:
Extract texts outside of the two tables
x: 972, y: 93, text: ????? ???????? ????, ????????
x: 971, y: 181, text: ?????? ????? ???????? ???????, 2021
x: 166, y: 610, text: ????? ?? ???? ,??????? ?? ????
x: 151, y: 3417, text: ??? 1 ????? 2021 ?? ??????
x: 778, y: 3419, text: ????? ?????? : 3 / 10
Extract texts inside table 1
x: 146, y: 240, text: ????????? ?? ??? : ?????
x: 1223, y: 240, text: ??° ?° ????? ???????? ??????? : 21
x: 146, y: 327, text: ?????? ????? ?? ??? : ????????
x: 1223, y: 327, text: ??° ?° ????? ???????? ??????? : 6
x: 146, y: 415, text: ??????????? : ??? ?? ????
x: 1223, y: 415, text: ????? ??????? : 2
x: 146, y: 502, text: ???????? ??????? ?? ?????? ??? ???:- 56-????
Extract texts inside table 2
x: 142, y: 665, text:
1 RBP2469583
???: ???? ?????
???? ?? ???????? ??? ?????
???? ??????? ??
???? 21 ????? ??????
x: 142, y: 665, text:
Photo is
Available
x: 867, y: 665, text:
2 MRQ3101367
???? ???? ????
???? ?? ????????????
???? ??????? ?? /18
???? 44 ????? ??????
x: 867, y: 665, text:
Photo is
Available
I checked some of the texts using manual, word-by-word comparison, and I think it looks fine, but being unable to understand Hindi or read Devanagari script, I can't comment on the overall OCR quality. Please let me know!
Annoyingly, the 9 in the number on the corresponding "card" is falsely extracted as 2. I assume that's due to the different font compared to the rest of the text, in combination with lang='Devanagari'. I couldn't find a solution for that, short of separately extracting those rectangles from the "cards".
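One hypothetical mitigation for the 9-read-as-2 problem: crop just the number region from each "card" (the crop coordinates would have to be found per card, as in the table code above) and OCR that crop with a digit whitelist. tessedit_char_whitelist is a real Tesseract parameter, but note its support varies across Tesseract versions and OCR engine modes (early LSTM-based 4.0 builds ignored it), so this is only a sketch:

```python
# --psm 7: treat the crop as a single text line;
# the whitelist constrains recognition to the digits 0-9.
DIGITS_ONLY = "--psm 7 -c tessedit_char_whitelist=0123456789"

# Hypothetical usage, with `number_crop` being the per-card number region:
# text = pytesseract.image_to_string(number_crop, config=DIGITS_ONLY, lang='eng')
```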
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.19041-SP0
Python: 3.9.1
PyCharm: 2021.1.1
NumPy: 1.19.5
OpenCV: 4.5.2
pdf2image: 1.14.0
pytesseract: 5.0.0-alpha.20201127
----------------------------------------
- I guess there's no "context awareness" or similar for the Devanagari `traineddata`, which I assume to be purely script-specific. So I don't think there will be a Hindi (language) dictionary. As long as your inputs are quite "perfect" like the given example (upright font, proper resolution, quality, etc.), I'd guess the output for any Devanagari text should be good.