This is the mail archive of the cygwin mailing list for the Cygwin project.



RE: pdftk and apropos - general questions





----------------------------------------
> Date: Wed, 4 Mar 2009 09:56:49 -0800
> From: garyjohn@spocom.com
> To: cygwin@cygwin.com
> Subject: Re: pdftk and apropos - general questions
>
> On 2009-03-04, Mike Marchywka wrote:
>
>>> Mike Marchywka wrote:
>>>> I've had a persistent problem getting apropos to work
>>>> as it never finds anything appropriate. Is there
>>>> something I need to do to make this work?
>>>>
>>> After each setup session, you need to run, /usr/sbin/makewhatis -u.
>>
>>
>> Thanks, but I did get that far after earlier hints, and your list
>> below is about what I ended up with too. One problem
>> I ran into was trying to extract sensible text from the
>> IRS instructions.
>
> I have that problem with the printed versions.
>
>> I used the pdftotext utility IIRC from
>>
>> http://www.foolabs.com/xpdf/download.html
>>
>> and it didn't seem to be able to separate multi-column text
>> automatically ( with sed and awk I got what I needed but what
>> a mess).
>
> Did you use the -layout option to pdftotext? It makes a huge
> difference on the documents I've converted, but they've all been
> single column.

I played with the options, but I'm not sure the column information
is in the source PDF. I don't imagine the authors cared much
about layout. IIRC, selecting text gave rectangles spanning the whole
page width, whereas in scientific papers selection normally
goes column by column. Somewhere between intelligent formatting
and a scanned PDF is probably the authoring tool that just
puts out blocks of text that can't be extracted properly
(probably even by design, to stop people from using the information
without the pictures that someone spent a lot of time authoring).
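
For what it's worth, the sed/awk column splitting could be sketched in
Python roughly as follows. This is only an illustration, not what I
actually ran: it assumes the text came from "pdftotext -layout", that the
page has exactly two columns, and that the gutter sits at a fixed
character offset. The function name split_columns and the offset of 20
are made up for the example.

```python
def split_columns(page_text, split_at):
    """Split fixed-width two-column text at character offset split_at.

    Returns the left-column lines followed by the right-column lines.
    Assumes every line of the page uses the same gutter position.
    """
    left, right = [], []
    for line in page_text.splitlines():
        left.append(line[:split_at].rstrip())
        right.append(line[split_at:].strip())
    # Drop empty fragments so each column reads as continuous text.
    return [s for s in left if s] + [s for s in right if s]


# Hypothetical sample: two lines, right column starting at offset 20.
sample = "\n".join([
    "Line 1 of left".ljust(20) + "Line 1 of right",
    "Line 2 of left".ljust(20) + "Line 2 of right",
])
print("\n".join(split_columns(sample, 20)))
```

A real page would need the offset worked out per page (or per block),
which is where the mess comes from.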

I did try pdftk on an f1040.pdf download,
but I finally had to install Acrobat Reader to look
at the form and fill it in. pdftk let me examine the
filled-in form, but there was no immediate way to
identify form fields; I have to look for meaningful names, etc.
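
One way to hunt for meaningful names is to list the fields with pdftk's
dump_data_fields operation (e.g. "pdftk f1040.pdf dump_data_fields") and
sort through the output with a script. A minimal sketch, assuming the
usual output shape of "Key: value" lines with "---" between fields; the
field names f1_01 and f1_02 in the sample are invented, not taken from
the real form.

```python
def parse_fields(dump_text):
    """Parse pdftk dump_data_fields text into a list of dicts.

    Each field block is assumed to start with a '---' line and to
    contain 'Key: value' lines. Keys that repeat within one block
    (such as FieldStateOption) keep only their last value here.
    """
    fields, current = [], {}
    for line in dump_text.splitlines():
        if line.strip() == "---":
            if current:
                fields.append(current)
            current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        fields.append(current)
    return fields


# Hypothetical sample output; real field names will differ.
sample_dump = """---
FieldType: Text
FieldName: f1_01
FieldFlags: 0
FieldValue: 100
---
FieldType: Text
FieldName: f1_02
FieldFlags: 0
FieldValue:
"""
for field in parse_fields(sample_dump):
    print(field["FieldName"], repr(field.get("FieldValue", "")))
```

Filling in recognizable dummy values in Reader first and then dumping the
fields makes it easier to match names to boxes on the form.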

I guess if I could enter input data into something I could use
it would be worthwhile writing a script to fill out the form.
I'll use a web form for a few lines of input but if I have
to type 100 numbers into an information black hole I'm
happy to kill a tree or two.
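
If I did write that script, it would probably generate an FDF file and
hand it to pdftk's fill_form operation. A rough sketch only: the helper
names make_fdf and fdf_escape and the field names are hypothetical, and
I have not tried this against the actual IRS form.

```python
def fdf_escape(value):
    """Escape backslashes and parentheses for an FDF string literal."""
    return (value.replace("\\", "\\\\")
                 .replace("(", "\\(")
                 .replace(")", "\\)"))


def make_fdf(values):
    """Build a minimal FDF document from a {field_name: value} dict."""
    fields = "\n".join(
        "<< /T (%s) /V (%s) >>" % (fdf_escape(name), fdf_escape(str(val)))
        for name, val in values.items()
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n" + fields + "\n] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )


# Hypothetical field names and values.
fdf_text = make_fdf({"f1_01": 100, "f1_02": "Smith (Jr)"})
print(fdf_text)
```

The output would be saved to a file, say data.fdf, and applied with
something like "pdftk f1040.pdf fill_form data.fdf output filled.pdf".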




>
> Regards,
> Gary
>
>
>
> --
> Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
> Problem reports: http://cygwin.com/problems.html
> Documentation: http://cygwin.com/docs.html
> FAQ: http://cygwin.com/faq/
>



