Interpreter takes very long on page with rotated text #231

Closed
thf24 opened this issue Mar 10, 2019 · 10 comments
@thf24

thf24 commented Mar 10, 2019

Hi, thanks for the great project.

I noticed that the interpreter takes very long (5 minutes+ on my machine) to process a page when there is a lot of rotated text on it (90 degrees in the document, see attached).

This is the (shortened) code for reference:

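from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

fp = open(PATH_TO_FILE, 'rb')  # PATH_TO_FILE is a placeholder for the PDF path
parser = PDFParser(fp)
document = PDFDocument(parser, '')  # empty password
rsrcmgr = PDFResourceManager()
laparams = LAParams()  # defaults, so detect_vertical is False
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page_index, page in enumerate(PDFPage.create_pages(document)):
    interpreter.process_page(page)  # layout analysis runs here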

rotated.pdf

@SVasilev

+1

@pietermarsman
Member

I used a profiler to find out which lines take the most time to execute. The .find() call inside the inline isany() function, in the group_textboxes() method of LTLayoutContainer, accounts for 65% of the run time!
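A minimal sketch of how such a profile could be collected with Python's built-in cProfile module. The helper name profile_call and its top parameter are made up for this illustration; the thread does not say which profiler or settings were actually used.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    # Render the statistics into a string instead of printing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

With the setup from the original post, profile_call(interpreter.process_page, page) would return the report for a single page. Sorting by cumulative time is one reasonable choice for spotting a method such as group_textboxes(); sorting by "tottime" instead highlights the innermost hot function.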

This method takes so long because the boxes input is a long list. That in turn is caused by group_objects() not grouping vertically aligned objects by default. Grouping them can be enabled by setting LAParams.detect_vertical to True.

So you can fix your problem by using laparams = LAParams(detect_vertical=True).
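A back-of-the-envelope sketch of why the length of that list matters so much. It assumes the grouping step has to consider every pair of boxes at least once, which is an assumption about the algorithm and not something checked against the pdfminer.six source. The box counts are hypothetical.

```python
def pair_count(n):
    """Number of unordered pairs among n boxes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical page: 2,000 rotated characters left as separate boxes,
# versus the same text merged into 50 vertical lines.
ungrouped_pairs = pair_count(2000)  # 1,999,000 pairs
grouped_pairs = pair_count(50)      # 1,225 pairs

print(ungrouped_pairs, grouped_pairs)
```

Under that assumption, merging the rotated characters into lines first shrinks the pairwise work by roughly three orders of magnitude in this made-up case, which is consistent with the large speedup reported for detect_vertical=True.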

@pietermarsman
Member

If @thf24 or @SVasilev agree, this issue can be closed.

@thf24
Author

thf24 commented Sep 11, 2019

OK, thanks. I just ran some unit tests on normal (non-rotated) pages with detect_vertical=True and did not see much of a performance loss, so I wonder why this is not enabled by default. Can be closed though.

@pietermarsman
Member

I'm not sure either why False is the default. Maybe it deteriorates the quality of the output?

@SVasilev

SVasilev commented Oct 2, 2019

Hey there guys. Thanks for spending time on this issue. For some of my PDFs I do see a slight improvement if I use detect_vertical=True.
But for some of the PDFs it does not seem to help. Here is an example:
https://patentimages.storage.googleapis.com/93/e9/69/f780b3b43c6635/EP1338665A1.pdf

For this document, for example, page 17 takes 6.6 seconds and page 18 takes 5.71 seconds, both with and without the flag. Any further suggestions? Maybe I can play with other LAParams properties?
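A minimal sketch of how per-page timings like these could be measured with and without the flag. The helper names time_each and page_timings are made up for this illustration, and this is not necessarily how the numbers above were obtained. The pdfminer.six imports sit inside page_timings so that the generic timing helper can be used on its own.

```python
import time


def time_each(process, items):
    """Call process(item) for each item; return elapsed seconds per item."""
    timings = []
    for item in items:
        start = time.perf_counter()
        process(item)
        timings.append(time.perf_counter() - start)
    return timings


def page_timings(path, detect_vertical):
    """Time layout analysis of every page of the PDF at `path`."""
    # Imported lazily; calling this function requires pdfminer.six.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    laparams = LAParams(detect_vertical=detect_vertical)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        # Pages are processed while the file is still open.
        pages = list(PDFPage.get_pages(fp))
        return time_each(interpreter.process_page, pages)
```

Calling page_timings(path, False) and page_timings(path, True) on a local copy of the document and comparing the two lists page by page shows which pages, if any, benefit from the flag.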

@pietermarsman
Member

Can be closed though.

@SVasilev

I don't understand why this issue is closed. There is currently no solution to the problem.
Or do you keep your bugs in a separate issue tracker?

@Mglt-b

Mglt-b commented Oct 26, 2020

Hello, any other solutions?

@pietermarsman
Member

@SVasilev @migliorati the issue described by @thf24 is fixed by enabling detection of vertical text boxes. I consider this issue closed because this specific question is answered.

I get that this solution does not work for all PDFs and for all code. If you have performance issues with specific PDFs, or if you think pdfminer.six is slow in general for some subset of PDFs, feel free to open a new issue.
