As you’ve probably figured out by now, I have a family of hamsters living in my brain and the hamster wheel is frequently out of service. I think of myself as barely house trained - I am naturally chaotic and housekeeping is SO not on the top of my list of fun things to do, but at the same time, visual clutter makes me ANXIOUS. ADHD is a barrel of laughs.

In my personal life I try to organise the layout of my environment in such a way that the correct place for storing an item is also the place my hand naturally reaches for it: mugs directly above the coffee machine, sugar next to the mugs, pantry items stored and grouped based on what is most likely to be used together, and so on. That way, both packing items away and retrieving them can essentially be done on autopilot. It's not perfect, but it helps.

As it turns out, good content management uses a lot of the same principles I use when organising my spice rack. And man, there is no better feeling than logging into my workspace (or a client’s environment) and finding that STUFF JUST MAKES SENSE. It makes me all warm and fuzzy inside.

Unfortunately, the vast majority of corporations (and households) I’ve dealt with are nowhere near that organised. A lot of them THINK they are, but precious few come anywhere close. And bad content management costs companies in all kinds of ways that many haven’t even started to quantify.

So how on earth do we get from this utter fiasco:

To this?

You hear that sound? That’s the sound of your will to live flushing itself down the toilet. But there is hope: as overwhelming as it is to contemplate bringing order to terabytes worth of entropy, it can be done. It all starts with a plan. In this post, we’ll deal with the “Discovery” portion:

Take Stock
Filter
Categorize
Group

Taking Stock

I realize I sound like an idiot saying this, but it’s amazing how often people forget this uber-obvious first step: FIGURE OUT WHAT YOU’VE GOT.

I no longer assume that clients know or understand their own content. I’m prepared to accept that they may understand the day-to-day procedural running of their departments, but unless a division head or Subject Matter Expert has a background in Enterprise Content Management, I take anything a client says about the nature of their content with a grain of salt. Rather than depending on the client to tell me what they need or don’t need, I prefer to discuss the department’s operational processes and responsibilities, develop a basic grasp of their ideal end result, and then view the data myself.

I generally like to start with a basic “inventory” of content storage areas – something like this:

Department	System	Location / Path	Owner	No of Files	Storage Size	Description
HR	Payroll	\\servername\payroll\reports	Joe Soap	120,000	1.2GB	Monthly payroll reports
HR	N/A	\\servername\hr\recruitment\2018	Jenny Soap	150,000	1.4GB	Recruitment files from 2018

Business owners and / or their representatives are usually able to provide this information with an acceptable rate of accuracy. In short, we want to know:

The physical path (shared drive, legacy SharePoint space, whatever) where the data is stored;
Who officially owns the data stored at each location;
Whether there are any system dependencies on the current content structure (e.g. in the table above, there is a dependency / relationship between the “payroll reports” drive and the Payroll system);
Basic description of the data stored at each path; and
Some information on expected volumes

We can then use this information to request content structure reports from the client’s infrastructure team for each path identified in the Inventory sheet (I usually provide the infrastructure team with pre-built PowerShell scripts so they don’t have to waste time putting one together themselves). I generally ask for .txt files, but .csv will work in a pinch.
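For anyone who would rather produce the export in Python, here is a minimal sketch of the idea. It is a stand-in, not the pre-built script mentioned above: the function name export_structure and the column names (path, size_bytes, modified_utc) are assumptions made up for illustration, and a real export may well want more fields (owner, created date, and so on).

```python
import csv
import os
from datetime import datetime, timezone


def export_structure(root, out_path):
    """Walk `root` and write one tab-separated row per file to `out_path`.

    Returns the number of files listed. Keep `out_path` outside `root`,
    otherwise the export file ends up listing itself.
    """
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["path", "size_bytes", "modified_utc"])
        for folder, _subfolders, files in os.walk(root):
            for name in files:
                full_path = os.path.join(folder, name)
                try:
                    info = os.stat(full_path)
                except OSError:
                    # Skip anything we cannot read (permissions, broken links).
                    continue
                modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([full_path, info.st_size, modified.isoformat()])
                rows += 1
    return rows
```

Tab-separated output is used here because commas turn up in file names far more often than tabs do.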

These exports typically look something like this:

Oh joy. Two million rows of absolute misery. Patience, grasshopper. It’s (probably) not (quite) as bad as you think.

Filtering

I’m going to assume you’ve looked at the first screen shot of the content export and thought to yourself “this is gobbledygook”. And you would be right.

Dollars to doughnuts, your export is going to include junk data. People upload all sorts of stuff to shared spaces. I’ve seen entire SharePoint sites created for an employee’s kid’s birthday party pics, users backing up their entire hard drives to shared network drives (i.e. not their personal network drives), CD rips, pirated movies – even uber-NSFW phone gallery dumps. I’m still paying for the therapy from that one.

Point is, it’s almost guaranteed that you will have content in your export that is of no value to you, so step one is to strip out any junk. If your data set looks anything like the ones I’m used to, stripping out the obvious junk could shrink your overall data set by as much as 30%. You can usually spot patterns for this, e.g. anything with “\Windows\” or “\Program Files\” in the path is likely to be from a hard drive backup, etc. Zap as much of that nonsense as you can before moving to the next step.
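As a rough illustration of that pattern-spotting, here is a small Python sketch. The two markers come straight from the examples above. The names JUNK_MARKERS, is_junk and strip_junk are hypothetical, and the sample paths are invented. Treat the marker list as a starting point you would extend after eyeballing your own export.

```python
# Path fragments that usually signal a hard drive backup rather than
# business content. Compared in lowercase so "Windows" and "WINDOWS" both hit.
JUNK_MARKERS = ("\\windows\\", "\\program files\\")


def is_junk(path):
    """Return True if the path contains any of the junk markers."""
    lowered = path.lower()
    return any(marker in lowered for marker in JUNK_MARKERS)


def strip_junk(paths):
    """Keep only the paths that do not look like junk."""
    return [path for path in paths if not is_junk(path)]


sample = [
    r"\\servername\payroll\reports\2018-01.pdf",
    r"\\servername\shared\joe_backup\Program Files\Game\setup.exe",
    r"\\servername\shared\joe_backup\Windows\System32\notepad.exe",
]
kept = strip_junk(sample)  # only the payroll report survives
```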

Once all the obvious junk is stripped out, you’ll want to strip out the ROT. In other words, any content that is:

Redundant;
Obsolete; or
Trivial

Simple filter examples might include anything in a folder structure that contains the word “archive”, or “old” or “backup”, and so on. You will most likely be dropping more rows of data throughout the process, but the more you strip out now, the less you’ll have to step through later.
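A filter like that might be sketched in Python as follows. The three words are the ones listed above. Everything else (the names ROT_WORD, folder_parts and looks_like_rot, and the decision to check only folder names and not file names) is an assumption for illustration. The pattern insists on a non-letter either side of the word, so that "old" doesn't accidentally match "folder" or "holdings".

```python
import re

# Match "archive(s)", "old" or "backup(s)" only when not glued to other
# letters, so "Archive_2015" and "old" hit but "folder" and "gold" do not.
ROT_WORD = re.compile(r"(?<![a-z])(archives?|old|backups?)(?![a-z])", re.IGNORECASE)


def folder_parts(path):
    """Split a path on slashes or backslashes and drop the file name."""
    parts = [part for part in re.split(r"[\\/]+", path) if part]
    return parts[:-1]


def looks_like_rot(path):
    """Return True if any folder in the path contains a ROT word."""
    return any(ROT_WORD.search(part) for part in folder_parts(path))
```

Rows flagged this way are candidates, not verdicts: it is worth skimming a sample of them before dropping anything.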

What was that? You want to know what toolset you’re supposed to use for this exercise? Well, darling, whatever tools you have. Some folks are most comfortable with Excel and provided the data set isn’t too monstrous, Excel will work. I am personally quite partial to RegEx, SQL, and VB macros, so a lot of my initial scrubbing happens in a combination of Notepad++ and good ol’ MS Access. If you like Python, use Python. The tool you use is less important than your comfort and proficiency with it.

Categorizing

Once the initial junk data and the most obvious ROT have been stripped out, we can start categorising the remaining content. This bit trips people up. People often think content should be categorised based on who it belongs to, or whether it’s a Word document or a PDF file, etc. None of that matters here. What we need to do at this point is categorise content BY FUNCTION. What do I mean by that? Look at this snippet from Bob’s Tyres:

Depending on who you speak to, there are probably at least 6 different categorisation “options” regular users might choose to employ, based on their personal work experience and how their brains work. I’ve seen clients (usually engineers) categorise their content based on the file extension and then be surprised when they can’t find anything. I’ve seen HR people try to categorise content based on the person who deals with the documents in question and be terribly disappointed when I tell them that “Joe’s Documents” is not an acceptable categorisation. These approaches make perfect sense to the person you’re asking, but they offer absolutely no value whatsoever to the content management process.

Solid content management requires that you evaluate the function a document performs. For example, in the screenshot above, we can see that the majority of documents seem to deal with legal or regulatory matters. We have a folder called “Contracts”, which further breaks down into folders like “Agreements” (self-explanatory, I guess), “Companies Act” (documents the company generates in compliance with the Companies Act, maybe?), as well as a mishmash of what looks like supplier contracts grouped by product / service (“Event Management”, “Web Solutions”, etc.). There also appear to be a few consultant employment contracts in the mix. It’s messy, but we can make a few educated guesses about what we are dealing with.

This block of data implies this company needs to manage compliance data required as evidence under the Companies Act, Anti-Money Laundering legislation, etc., as well as multiple types of contracts and agreements, which we could probably divide into 2 main categories:

Supplier contracts (e.g. service contracts, SLAs, sales agreements, etc.); and
Employee contracts (e.g. employment contracts, mentorship agreements, bursary agreements, etc.)

Why am I making a distinction between supplier contracts and employment contracts? Because these contract types are likely to have different security configurations, different compliance requirements, and different lifecycles. Employee contracts are likely to be considered personal data, which may require special protections, so we want to be able to separate them out from the other contracts in our final design to make all those requirements a little easier to manage.

I usually add 2 columns to my data output to capture the content type / category (in the case of this snippet, most of these would be “contracts” or “agreements”) and the document type in question (e.g. employee contract > bursary agreement).
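If you are doing this step in Python, those 2 columns could be filled in with a simple keyword lookup along these lines. This is only a sketch: the rules in RULES are hypothetical examples loosely based on the contract types discussed above, the sample path is invented, and a real rule list would be built up by reading the export, not guessed in advance.

```python
import re

# (keyword, category, document type) - hypothetical rules for illustration.
# Earlier rules win, so put the most specific keywords first.
RULES = [
    ("bursary", "Contracts", "Employee contract > Bursary agreement"),
    ("employment", "Contracts", "Employee contract > Employment contract"),
    ("sla", "Contracts", "Supplier contract > SLA"),
]


def categorise(path):
    """Return (category, document_type) for a path, or ("", "") if no rule fits."""
    # Break the path into lowercase words so "sla" matches a folder called
    # "SLA" but not the middle of a word like "translation".
    tokens = set(re.findall(r"[a-z0-9]+", path.lower()))
    for keyword, category, doc_type in RULES:
        if keyword in tokens:
            return category, doc_type
    return "", ""


category, doc_type = categorise(r"\\servername\legal\Contracts\Bursary\2019_smith.pdf")
```

Leaving both values empty when nothing matches makes the uncategorised rows easy to find and review by hand afterwards.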

Let’s look at another example:

This snippet seems to deal primarily with travel, so we would expect to see things like travel requests, booking confirmations, tickets, and the like. In this case, we can also see folders for purchase orders, invoices, proof of payment, and so on. We can therefore extrapolate that we need to cater for:

Travel documents (travel requests, booking confirmations, tickets, etc.); and
Financial transaction documents (quotes, purchase orders, invoices, proof of payment, etc.)

Why am I splitting the travel quotes, invoices, etc. from the rest of the travel documents? Because those travel documents probably contain personal data. Travellers’ passport numbers, itineraries, etc. would more likely than not be governed by privacy legislation, just like the employee contracts in the previous snippet. The financial transaction data, by contrast, is likely to be used as transactional evidence by the finance or procurement departments, and therefore wouldn’t really fit under the “personal data” label.

In short, every document used or generated by the company fulfils a particular function or purpose, and that function is the primary indicator we need to establish the requirements our eventual content infrastructure needs to fulfil.

NOTE: there will be documents you absolutely cannot categorise based on their structural data. That is normal and to be expected. If you can categorise upwards of 85% of the data in your export, consider it good enough. No point driving yourself to drink over a handful of cryptic nonsense.
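Checking where you stand against that 85% mark takes a few lines of Python. A minimal sketch, assuming each row of the export is a dictionary with a "category" key that is left empty when the row could not be categorised (both the row layout and the names GOOD_ENOUGH and coverage are assumptions):

```python
GOOD_ENOUGH = 0.85  # the "upwards of 85%" rule of thumb from the note above


def coverage(rows):
    """Return the share of rows (0.0 to 1.0) that have a non-empty category."""
    if not rows:
        return 0.0
    categorised = sum(1 for row in rows if row.get("category"))
    return categorised / len(rows)


rows = [
    {"path": "a.pdf", "category": "Contracts"},
    {"path": "b.pdf", "category": ""},
    {"path": "c.pdf", "category": "Travel documents"},
    {"path": "d.pdf", "category": "Financial documents"},
]
done = coverage(rows) >= GOOD_ENOUGH  # 3 of 4 categorised, so not there yet
```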

Grouping

Once the categorisations are complete, we can roll them up and look at the volumes. It should look something like this:

Of course, no 2 companies are alike, so the exact categorisations used will most likely differ slightly from one company to the next, but across the enterprise, you should see similar content categories emerge. These typically include but are not limited to:

Communications documents (e.g. letters, memos, emails, etc.)
Compliance documents (e.g. registers, checklists, etc.)
Contracts and Agreements
Employee documentation (privacy protected) (e.g. onboarding packs, payslips, identity documents, passports, performance assessments, etc.)
Financial documents (e.g. accounting records, transactional documents, etc.)
Governance documents (e.g. policies, procedures, standards, etc.)
HR admin documentation (e.g. employee benefit statements, remuneration package calculators, organograms, etc.)
Marketing material (e.g. brochures, marketing copy, etc.)
Meeting documents (e.g. agendas, minutes, presentations, board packs, etc.)
Training material (e.g. training manuals, facilitator manuals, course exam sheets, etc.)
Training operations documents (e.g. training bookings, attendance registers, trainee feedback forms, etc.)
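The roll-up described at the start of this section can be sketched in Python as well. Again, this assumes each row is a dictionary with "category" and "size_bytes" keys. Those column names, the roll_up function and the "Uncategorised" label are all assumptions for illustration.

```python
from collections import Counter


def roll_up(rows):
    """Return (category, file_count, total_bytes) tuples, biggest count first."""
    counts = Counter()
    sizes = Counter()
    for row in rows:
        # Rows that were never categorised get their own bucket.
        category = row.get("category") or "Uncategorised"
        counts[category] += 1
        sizes[category] += int(row.get("size_bytes", 0))
    return [(name, count, sizes[name]) for name, count in counts.most_common()]


rows = [
    {"category": "Contracts and Agreements", "size_bytes": "100"},
    {"category": "Contracts and Agreements", "size_bytes": "50"},
    {"category": "", "size_bytes": "10"},
]
summary = roll_up(rows)
```

Sorting by file count puts the heaviest categories at the top, which is usually where the design effort needs to go first.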

Courage, grasshopper. If you've gotten this far, the worst is (most likely) behind you. In the next instalment I'll discuss the function of a Content Framework, what elements constitute a well designed framework, and give you some tips on how to create your own. Until then, stay hydrated and remember to floss.


Divine by Zero

CM008 | Content Management - Start At The Beginning
