

Identify Related PDFs

Analyze word frequency to find relationships between PDFs.

Organizing a large collection into categories requires firsthand familiarity with every document. That level of care is rarely practical, and even when it is, some documents inevitably get filed into the wrong categories.

Here is a pair of Bourne shell scripts that measure the similarity between two PDF documents. You can use them to help categorize PDFs, to help identify misfiled documents, or to suggest related material to your readers. Their logic is easy to reproduce using any scripting language. To install the Bourne shell on Windows, see [Hack #97].
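To give a feel for this kind of comparison, here is a minimal sketch. It is not the pair of scripts from this hack. It assumes each PDF has already been converted to plain text, and it scores two files by the percentage of distinct words they share. That measure, the four-letter minimum word length, and the function names words and similarity are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative sketch only: the measure and the names are assumptions,
# not the scripts described in this hack.

# words: read text on stdin and print its distinct lowercase words,
# one per line. Words shorter than four letters are dropped as a
# crude stop-word filter (an arbitrary choice for this sketch).
words() {
    LC_ALL=C sed -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' \
        -e 's/[^a-z][^a-z]*/\
/g' | sed '/^.\{0,3\}$/d' | sort | uniq
}

# similarity: print an integer from 0 to 100, the share of distinct
# words that two text files have in common, relative to their
# combined vocabulary.
similarity() {
    sim_a="${TMPDIR:-/tmp}/sim_a.$$"
    sim_b="${TMPDIR:-/tmp}/sim_b.$$"
    words < "$1" > "$sim_a"
    words < "$2" > "$sim_b"
    # Each list is already unique, so a word found in both files
    # appears exactly twice in the merged, sorted list.
    sim_shared=$(( $(cat "$sim_a" "$sim_b" | sort | uniq -d | wc -l) ))
    sim_total=$(( $(cat "$sim_a" "$sim_b" | sort | uniq | wc -l) ))
    rm -f "$sim_a" "$sim_b"
    if [ "$sim_total" -eq 0 ]; then
        echo 0
    else
        echo $(( 100 * sim_shared / sim_total ))
    fi
}
```

To try it on PDFs, extract the text first with pdftotext first.pdf first.txt and pdftotext second.pdf second.txt, then run similarity first.txt second.txt. A higher number suggests the two documents draw on more of the same vocabulary. The sketch ignores how often each word occurs, so treat its result as a rough hint, not a verdict.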

The scripts use the following command-line tools: pdftotext [Hack #19], sed (Windows users visit http://gnuwin32.sf.net/packages/sed.htm), sort, uniq, cat, and wc (Windows users visit http://gnuwin32.sf.net/packages/textutils.htm). These tools are available on most platforms. Here are some brief descriptions of what the tools do:

