Interpreter takes very long on page with rotated text #231

thf24 · 2019-03-10T11:41:22Z

Hi, thanks for the great project.

I noticed that the interpreter takes very long (5 minutes+ on my machine) to process a page when there is a lot of rotated text on it (90 degrees in the document, see attached).

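This is the (shortened) code for reference:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

fp = open(PATH_TO_FILE, 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser, '')  # empty password
rsrcmgr = PDFResourceManager()
laparams = LAParams()  # default layout parameters
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page_index, page in enumerate(PDFPage.create_pages(document)):
    interpreter.process_page(page)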

rotated.pdf


SVasilev · 2019-08-28T15:21:38Z

+1

pietermarsman · 2019-09-10T07:23:10Z

I've used the profiler to find out which lines take the most time to execute. It is the '.find()' call in the inline 'isany()' function, in the 'group_textboxes()' method of 'LTLayoutContainer', that takes 65% of the time!
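For anyone who wants to repeat this kind of measurement, here is a minimal sketch. The thread does not say which profiler was used, so the standard-library cProfile is an assumption, and `profile_call` and `slow_sum` are hypothetical names made up for illustration.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_sum(n):
    # Stand-in workload (hypothetical); replace it with a call that
    # processes one PDF page, e.g. interpreter.process_page(page).
    return sum(i * i for i in range(n))


result, report = profile_call(slow_sum, 10000)
print(report)
```

Sorting by cumulative time makes it easier to see which high-level method dominates before drilling down into the calls it makes.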

This method takes so long because the boxes input is a long list. This is directly caused by group_objects() not grouping vertically aligned objects by default. Grouping them can be enabled by setting LAParams.detect_vertical to True.

So you can fix your problem by using laparams = LAParams(detect_vertical=True).
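Applied to the snippet from the original report, the change could look roughly like the sketch below. The function name `process_with_vertical_detection` and its `path` and `password` parameters are assumptions for illustration; the pdfminer classes are the same ones used in the report.

```python
def process_with_vertical_detection(path, password=""):
    """Lay out every page of a PDF with detect_vertical enabled."""
    # Imported inside the function so the sketch can be loaded even
    # where pdfminer.six is not installed.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # The only change compared to the report: detect_vertical=True.
    laparams = LAParams(detect_vertical=True)
    layouts = []
    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), password)
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # get_result() returns the layout of the page just processed.
            layouts.append(device.get_result())
    return layouts
```

Calling `process_with_vertical_detection("rotated.pdf")` would then return one layout object per page, assuming the attached file has been saved locally under that name.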

pietermarsman · 2019-09-10T07:23:51Z

If @thf24 or @SVasilev agree, this issue can be closed.

thf24 · 2019-09-11T11:29:43Z

OK, thanks. I just ran some unit tests on normal (non-rotated) pages with detect_vertical=True and didn't see much of a performance loss, so I wonder why this is not enabled by default. Can be closed though.

pietermarsman · 2019-09-11T17:00:37Z

I'm not sure either why False is the default. Maybe it deteriorates the quality of the output?

SVasilev · 2019-10-02T12:56:24Z

Hey there guys. Thanks for spending time on that issue. For some of my PDFs I do see a slight improvement if I use detect_vertical=True.
But for some of the PDFs it does not seem to help. Here is an example:
https://patentimages.storage.googleapis.com/93/e9/69/f780b3b43c6635/EP1338665A1.pdf

For this document, for example, page 17 takes 6.6 seconds and page 18 takes 5.71 seconds, both with and without the flag. Any further suggestions? Maybe I can play with other LAParams properties?
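Per-page timings like these can be collected with a small helper. The sketch below is only one way to do it; `time_each`, `time_pdf_pages` and `slowest` are hypothetical names, and the page indices it reports are 0-based, so they may be off by one from the page numbers quoted above.

```python
import time


def time_each(process, items):
    """Return a list of (index, seconds) for each call process(item)."""
    timings = []
    for index, item in enumerate(items):
        start = time.perf_counter()
        process(item)
        timings.append((index, time.perf_counter() - start))
    return timings


def slowest(timings, count=3):
    """Return the count slowest (index, seconds) pairs, slowest first."""
    return sorted(timings, key=lambda pair: pair[1], reverse=True)[:count]


def time_pdf_pages(path, detect_vertical):
    """Time process_page() for every page; needs pdfminer.six installed."""
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    laparams = LAParams(detect_vertical=detect_vertical)
    with open(path, "rb") as fp:
        document = PDFDocument(PDFParser(fp), "")
        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        return time_each(interpreter.process_page,
                         PDFPage.create_pages(document))


# Example usage, assuming the linked PDF was downloaded locally:
# for flag in (False, True):
#     print(flag, slowest(time_pdf_pages("EP1338665A1.pdf", flag)))
```

Comparing the slowest pages with the flag off and on shows whether detect_vertical makes any difference for a given document.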

pietermarsman · 2019-11-17T18:57:11Z

Can be closed though.

SVasilev · 2019-11-22T16:01:26Z

I don't understand why this issue is closed. There is currently no solution to the problem.
Or do you keep your bugs in a separate issue tracker?

Mglt-b · 2020-10-26T11:00:42Z

Hello, any other solutions?

pietermarsman · 2020-10-27T16:28:59Z

@SVasilev @migliorati the issue described by @thf24 is fixed by enabling detection of vertical text boxes. I consider this issue closed because this specific question is answered.

I get that this solution does not work for all PDFs and for all code. If you have performance issues with specific PDFs, or if you think pdfminer.six is slow in general for some subset of PDFs, feel free to open a new issue.

pietermarsman added the type: question label Oct 13, 2019

pietermarsman added the component: converter and type: bug labels Oct 28, 2019

pietermarsman closed this as completed Nov 17, 2019
