Cataloguing A 50,000 Image Photo Collection

Recently I sorted, catalogued and ranked my photo collection of 50,000 unsorted images. This was such a daunting and intensive task that I thought it might be worthwhile to share some of my experiences and see if anybody in the RedBubble community has thoughts on it or has found themselves in a similar situation.

But first, some backstory (which you can skip if you like)!

The Big Setback

Eighteen months ago I made a decision to start looking towards becoming self-employed and attempt to at least partially support myself with photography. It was obvious that the first thing I would need to do would be to take stock of what I already had, and that this would require a good amount of time and effort. I took a step back from a lot of volunteer commitments and rebalanced my day job to give me more time to start to set myself up.

I was using the spare time I’d been freeing up to get some yard work done (which I’m ashamed to admit had been neglected for a couple of years) and pitch in for big projects with my day job, so though I was pulling back from things, I was still moderately overworking myself.

In November last year, due to a combination of many things, my hands, wrists and elbows gave up. My doctor diagnosed me as having tenosynovitis. I was unable to do things like open doors or tie shoelaces for nearly three months, and it was closer to six before I had my first pain free day. I spent the intervening time reading up on a shortlist of photo library management apps I’d made beforehand, and thinking about how I was going to handle the import process/structure my collection.

As I recovered, I imported all of my images and slowly set about the task of removing duplicates, tagging, and rating. By the end of July, I had a list of 15 photos that I consider my best and that required minimal extra work to be considered ‘done’. These shots are in my RedBubble portfolio, and will hopefully be joined by others as I find time to polish them up.

Selecting Software

Before making a start on this epic project, I knew I’d need to do some planning. I had to decide on the kind of structure I wanted for my collection, and in turn find or write some software that could do it.

In 2004 I wrote my own photo cataloguing system in JSP called imageCat, which allowed me to import/upload photos, add tags, comments and ratings, and share stuff with friends. For the most part, it did what I needed, but in the long run, maintaining it when there were services out there like Flickr seemed to not be the best use of my time.

My experiences with imageCat helped me identify several things that were important to me in a cataloguing application:

  • the ability to access and make use of my collection without software (in case I broke the app I was using or wanted to migrate to something else down the track)
  • the ability to interoperate with other sharing/editing tools (such as the ability to upload directly to Flickr, or to open a selected photo in an image editor without having to locate it on my hard drive)
  • the ability to apply batch tagging, descriptions, ratings to images (because I want to spend all day taking photos, not sorting them)
  • the ability to identify and remove duplicates (through various system upgrades, hasty file dumps and silly mistakes, I’d accidentally introduced a couple of hundred duplicates into my collection, and getting rid of them as well as having the tools to stop it from happening again are pretty helpful)
  • Linux support essential, cross platform support desirable (if it can’t run on my primary desktop, I don’t want it, and in the unlikely event that I move to another OS someday, I want to be able to take whatever software I use with me)

Up until now, I’ve always kept the default naming and foldering that my Canon cameras have provided, with those nested within folders named by year. 100 shots per folder isn’t too overwhelming for displaying thumbnails, and I have a good memory for the order things happen in, so locating specific stuff is pretty easy. I wanted to at least on some level maintain this hierarchy in whatever solution I ended up with, not only because it would mean less work, but it would also allow my knowledge of where things were to still be useful.

I wanted to be able to tag, rate and add descriptions to images (preferably embedded within the images themselves), and to use that metadata as a navigable hierarchy in itself. Search and sort facilities (by date, filename, tag, rating, and description at least – by camera settings would be a bonus) were also a must.

After some hunting around, I eventually settled on DigiKam, which met most of my requirements. In addition to supporting Linux, macOS and Windows, it is also open source – a big plus for me. There were also a couple of extra search features (such as sketch recognition) that interested me. Note that this isn’t meant to be an ad for DigiKam, just a demonstration that it’s important to make sure your software matches your needs.

Planning The Process

As I still wasn’t physically in a position to start ploughing through my photos, I decided I’d try to put a decent amount of thought into what exactly it was I wanted to achieve and how I would go about doing that. This turned out to be beneficial because it highlighted to me that I hadn’t really given enough consideration to how to structure my tags.

There were two obvious end-goals for me:

  • having all my existing photos accessible in a new tool
  • identifying the best photos within my collection

This in turn implies a couple of things:

  • To identify the best of my images, I need to rate every single one
  • To have my images accessible in a meaningful fashion, it would make sense to tag all my images whilst rating them
  • Before I started rating and tagging, it would make sense to sort out my duplicates
  • Before I could import any new photos, I had to know that the underlying structure of my collection was sound otherwise I’d be making more work for myself

I started to think about how I’d rate my photos. From my experiences with imageCat, I knew it wasn’t possible to work through from start to end and rate each image on its own. Hoping to maintain any level of consistency over the kind of data set I had to deal with would be foolish, so I had to come up with an alternate way of rating my images. After discounting a few ideas, I settled on doing several rating passes and using friends’ suggestions of my “best shots” as a control/safety net to make sure I didn’t miss anything important.

The benefit of doing a couple of passes is that it’s easier and more reliable to weed out the not-so-good shots than it is to identify the best. To start with, I gave every photo a default rating of 2 (out of 5) and marked it up or down depending on whether I thought it was a decent photo or garbage. After that was done, I went through everything rated 3 and above and compared them, moving anything outstanding up to 4, and then went through once more to separate the 4s from the 5s.

Suggestions from friends helped reveal a few 3s that I had skipped over, and gave me a chance to reconsider similar photos that had been taken at the same time.
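For anyone who thinks better in code, here is a minimal Python sketch of that multi-pass idea. It isn’t anything DigiKam does for you; the `Photo` class, the function names and the sample file names are all made up for illustration, and the "is it decent?" and "is it outstanding?" decisions are stand-ins for the judgement calls I made by eye. I have also assumed the first pass only ever moves a photo one step up or down from the default.

```python
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    rating: int = 2  # every photo starts at the default of 2 out of 5


def first_pass(photos, is_decent):
    """Move each photo one step up (3) or down (1) from the default of 2."""
    for photo in photos:
        photo.rating = 3 if is_decent(photo) else 1


def promote(photos, from_rating, is_outstanding):
    """Look only at photos rated from_rating and lift the standouts one step."""
    for photo in photos:
        if photo.rating == from_rating and is_outstanding(photo):
            photo.rating = from_rating + 1


def overlooked(photos, suggested_names, threshold=3):
    """Safety net: photos friends suggested that I rated below the threshold."""
    return [p for p in photos if p.name in suggested_names and p.rating < threshold]


# Hypothetical sample data; the lambdas stand in for human judgement.
photos = [Photo("IMG_0001"), Photo("IMG_0002"), Photo("IMG_0003")]
first_pass(photos, lambda p: p.name != "IMG_0002")
promote(photos, 3, lambda p: p.name == "IMG_0003")  # separate 4s from 3s
promote(photos, 4, lambda p: False)                 # separate 5s from 4s

print([(p.name, p.rating) for p in photos])
print([p.name for p in overlooked(photos, {"IMG_0002"})])
```

The point of the structure is that each later pass only looks at the survivors of the previous one, so the expensive side-by-side comparisons happen on a much smaller set.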

Whilst planning this out, it seemed most efficient to do my tagging during the first rating pass, and it quickly became obvious that if I just made up tags as I went, I’d get to the end and realise I’d come up with new tags that my early images should have had and have to go back and find them again. Since DigiKam supports nested tagging, I decided to map out a simple tag hierarchy that would cover most of what I wanted to do, and verified that against a random sampling of my shots.

I tried to make sure I was grouping/nesting my tags in ways that would make sense when looking for generalisations or inspiration whilst at the same time keeping sub tags succinct and specific. My specific tags probably aren’t very relevant, but the gist of what I came up with was that instead of just having a tag for roses, I’d have a tag for plants, inside that a tag for flowers, and pop the one for roses inside that. By the time I felt I was done, I had an outline of about 80% of the parent tags that I would use and maybe 20% of the sub tags. It may not sound like much, but this made a real difference in terms of preventing backtracking and making sure new tags were consistent.
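To show why nesting pays off, here is a small Python sketch. Writing a nested tag as a slash-separated path such as `plants/flowers/roses` is just my notation for this example, and the function names and file names are hypothetical. The idea is that tagging a photo with the most specific tag automatically makes it findable under every parent.

```python
def ancestors(tag_path):
    """'plants/flowers/roses' -> ['plants', 'plants/flowers', 'plants/flowers/roses']"""
    parts = tag_path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def find(photos, query):
    """Names of photos carrying a tag at or below the queried tag."""
    return sorted(
        name
        for name, tags in photos.items()
        if any(query in ancestors(tag) for tag in tags)
    )


# Hypothetical sample: each photo is tagged once, with its most specific tag.
photos = {
    "IMG_0101": ["plants/flowers/roses"],
    "IMG_0102": ["plants/trees"],
    "IMG_0103": ["animals/birds"],
}

print(find(photos, "plants"))
print(find(photos, "plants/flowers"))
```

Searching on `plants` picks up both the rose and the tree shot, while `plants/flowers` narrows it to the rose, which is the "generalisation versus specific" behaviour I was after.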

Getting It Done

Importing my collection proved to be straightforward. This was pretty much a click-the-button-and-come-back-a-few-hours-later type process. Once they were in, I focused on getting the duplicates out.

DigiKam has some interesting ‘fuzzy’ duplicate detection which allows you to set how close a match has to be before it’s included. At 100%, a copy of the same file but with a different name won’t match, whilst at 90%, consecutive shots of the same scene are likely to match. It took some time and a lot of concentration to be sure I was being consistent with which images I kept/discarded, but after several days, I’d pulled out nearly 1,000 extraneous images.
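If you don’t have a tool with fuzzy matching, a much cruder check is still useful. The Python sketch below is not how DigiKam’s fuzzy detection works; it is a plain byte-for-byte comparison using SHA-256 hashes, and the function names are my own. It will catch a renamed copy of the same file, but it will not catch resized, re-saved or near-identical shots.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by content; return groups with more than one file."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates` on the top folder of a collection returns lists of paths with identical contents. It only reports them; deciding which copy to keep is still a manual job.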

Moving on to rating and tagging, I set myself the goal of dividing each pass up into monthly segments to give me somewhere easy to stop that was achievable within a day (generally speaking). This worked out pretty well for the most part, and in three and a half months I had completed the first pass. By being ruthless and staying focused, I was able to narrow 50,000 images down to 2,000. This isn’t to say that 96% of my photos over the past decade have been junk, just that they didn’t match the “Is this image fit for sale, or could it be with a minimal amount of work?” qualifier I was working against. Holiday photos, reference images and video frames are great things to have, but they need to be put to one side when they don’t fit with your priorities.
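The monthly segments came naturally from browsing by date, but the same chunking is easy to sketch in Python if you are working from a plain list of files. The names and dates below are invented for the example, and I am assuming you already have a capture time for each shot (from EXIF data or file timestamps).

```python
from collections import defaultdict
from datetime import datetime


def monthly_batches(shots):
    """Group (name, taken) pairs into {(year, month): [names]}, oldest first."""
    batches = defaultdict(list)
    for name, taken in sorted(shots, key=lambda shot: shot[1]):
        batches[(taken.year, taken.month)].append(name)
    return dict(batches)


# Hypothetical sample data.
shots = [
    ("IMG_0003", datetime(2009, 2, 1, 9, 30)),
    ("IMG_0001", datetime(2009, 1, 5, 14, 0)),
    ("IMG_0002", datetime(2009, 1, 20, 8, 15)),
]

for (year, month), names in monthly_batches(shots).items():
    print(f"{year}-{month:02d}: {len(names)} photo(s)")
```

Each batch is one sitting’s worth of work, which gives you an obvious place to stop and an easy way to see how far through the collection you are.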

My second pass brought the number down to 50, and my final pass pulled out 15 images, which were the first photos I uploaded to RedBubble.
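To put those passes in proportion, here is the arithmetic as a few lines of Python, using the round figures quoted above (the helper name `keep_rates` is just for this example).

```python
def keep_rates(stages):
    """For each pass, the percentage of the previous stage's photos that survived."""
    return [
        (label, after / before * 100)
        for (_, before), (label, after) in zip(stages, stages[1:])
    ]


stages = [
    ("start", 50_000),
    ("first pass", 2_000),
    ("second pass", 50),
    ("final pass", 15),
]

for label, kept in keep_rates(stages):
    print(f"{label}: kept {kept:.1f}%, set aside {100 - kept:.1f}%")
```

The first pass keeps 4% (hence the 96% figure), the second keeps 2.5% of those, and the final pass keeps 30% of what was left.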

Looking Back

I can’t stress enough how much time and work I saved for myself by really putting some thought into how I was going to structure my collection and how I was going to manage the importing process. As it was, it still took me over five months to finish cataloguing my collection – I can only imagine how long it would have taken if I hadn’t done all this groundwork first.

It also turned out to be an unexpectedly intense experience. As I mentioned earlier, a good portion of my collection is day to day stuff and travel photos. It was worthwhile and for the most part enjoyable, but I honestly wasn’t prepared to relive the past ten years of my life.

If I were to take away three things from this experience they would be:

  • Use the right tools – identify your needs and make sure the resources and tools you use can actually meet them
  • Think and plan before starting – if something can’t be done or needs to be done differently, it’s a lot better to find that out before you start
  • Break your project down – by having small, achievable chunks, you can turn an overwhelmingly massive job into something that can actually be done
