Sunday, May 5, 2013

I Really Need a Document Management System


[gregm@feynman ~]$ du -hs $HOME
52G     /home/gregm
[gregm@feynman ~]$ find $HOME -type f -name '*pdf' | wc -l
5056
[gregm@feynman ~]$ 

That 'working set' is nothing, in terms of storage requirements; it's a speck on terabyte drives. The requirements are so low because there are very few media files. I don't 'do' media, as a rule. I want my favorite entertainment producers to get paid, so that they can keep entertaining me. I buy DVDs, have only ripped MP3s from CDs that I bought (years ago, and still have), etc.

But there are also large numbers of text files (notes and code), email, spreadsheets, etc. That stuff, combined with the PDFs, is important, in the sense that having this stuff on tap, efficiently accessible, is directly related to whether I can make a living at protecting people and the things that are important to them. I'd obviously like to continue to do that, as it's one of the more worthwhile things I can do with my life. And life is all too short.

Currently, I am most concerned with the PDFs. Some small percentage are chapter-by-chapter downloads of books from my Safari account. Those can be recreated, and there are other examples of PDFs that I am not very concerned about. But a large number, were they lost, would be difficult to replace, for a variety of reasons:
  • An academic has changed positions
  • A commercial entity has ceased operations, deleted old files, etc.
  • A security industry private researcher has lost interest and allowed their site to lapse
  • I do not have data on why that missing file is important, and/or the source
  • Other. And 'Other' is large.
I'm very old-school about having a well-organized file system; I know how my directories are organized, and I'm far from reliant on the file indexing systems that bog so many systems down. Nor am I a fan of various 'tagging' systems; their usefulness seems ephemeral, in that a tag is mostly useful within the scope of a single project, or a small number of related projects. Those projects would likely be ephemeral as well, while I am also interested in the broad sweep of history, and how these things evolve over time.

The notion of keeping a good mental reference breaks down at 5000 files. Is something filed under privacy, breaking anonymized data, or what?
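As a stopgap for the "where did I file that?" problem, something like the following might help. It is only a sketch: the function name pdf_where and its arguments are made up for illustration, and it assumes a find that supports -iname and -ipath (GNU and BSD find both do). It lists PDFs whose path, directory names included, contains a keyword, ignoring case.

```shell
# Hypothetical helper, not part of any existing tool.
# Usage: pdf_where KEYWORD [DIRECTORY]
# Lists files ending in "pdf" (any case) whose full path contains KEYWORD,
# matched case-insensitively. DIRECTORY defaults to $HOME.
# Note: KEYWORD is used inside a glob pattern, so characters such as
# '*', '?' and '[' in it are treated as wildcards.
pdf_where() {
    find "${2:-$HOME}" -type f -iname '*pdf' -ipath "*$1*" | sort
}
```

Running pdf_where anonymi would show whether those papers ended up under a privacy directory or somewhere else. It only searches names and paths, not contents, so it is no substitute for a real catalog.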

I need more formal document control, or a library system. Current approaches seem to revolve around the Semantic Web; the 2012 ACM Computing Classification System (http://www.acm.org/about/class/2012) is one example. One program I am particularly interested in is Invenio, which has its roots at CERN (birthplace of the Web, and home of the Large Hadron Collider), but is now a collaboration involving the Stanford Linear Accelerator Center, Fermilab, and others. Details are at http://invenio-software.org/.



