Handling Unstructured Data

A colleague of mine asked what unstructured data is, why it is a concern, and what we can do about it. I gave him a quick explanation.  Here’s a longer one.

First of all, what is unstructured data?  It is data that is not broken down into individual named pieces.  Examples include a Word document, a web page, the body of an email, a text message, or a tweet.  Contrast this with a name and address record stored in a database, with well-defined fields for name, address, email, etc.

Unstructured information is a problem because a huge amount of what an organization could know about itself is stored in formats that don’t lend themselves to finding things.  By the way, it might also be stored in locations that don’t lend themselves to finding things, such as personal folders, email, dropboxes, etc., but that’s another story.

As always, before solutioning, we should have scenarios in mind.  Let’s consider a couple.

  • our Helpdesk wants to be able to manage the large number of emails between it and its clients, in order to (a) improve responsiveness, (b) find a particular email, and (c) mine the set of emails for patterns
  • the client wants an enterprise search where its users can find intellectual assets as confidently, reliably and easily as they search a large on-line catalog for physical assets (think cameras, DIY products, clothing).

There are two main strategies for handling unstructured data:

  • adding structure
  • adding description.

We’ll talk about these in turn.

Adding Structure

Although we use the term “unstructured data”, the documents and emails we deal with are not random strings of characters but already have structures that we might be able to exploit.

For example, a news article has a predictable structure. By creating a template in a web content management system with separate fields for title, author, and body content, we can achieve benefits at local and enterprise levels:

  • locally, we can provide more powerful searches and filter for news articles; for example the query “find articles authored by John Doe” which is more specific than the basic full text search “find articles containing the phrase John Doe”
  • enterprise wide, we want the article to show up in a comprehensive set of information about staff; so when we query John Doe, we get
      Person: John Doe
        News Articles
          “Handling Unstructured Data”
        White Papers
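
To make the difference between the two queries concrete, here is a minimal sketch in Python. The Article record, the sample articles, and the two search functions are all made up for illustration; a real content management system would provide its own query interface.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # The three template fields mentioned above.
    title: str
    author: str
    body: str


# Hypothetical sample content.
ARTICLES = [
    Article("Handling Unstructured Data", "John Doe",
            "Notes on adding structure and description."),
    Article("Staff Awards", "Jane Roe",
            "This year the award goes to John Doe."),
]


def full_text_search(articles, phrase):
    """Return articles where the phrase appears anywhere in the text."""
    needle = phrase.lower()
    return [a for a in articles
            if needle in f"{a.title} {a.author} {a.body}".lower()]


def search_by_author(articles, author):
    """Return only the articles whose author field matches."""
    return [a for a in articles if a.author.lower() == author.lower()]
```

With this sample data, the full text search for "John Doe" returns both articles, while the author query returns only the one he actually wrote.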

Sometimes a technology change will make it easier to add structure.  If we replace emails to the helpdesk with web forms, we are more likely to get a well-formed request, which improves cycle times and makes analytics easier to generate. Contrast an unstructured email

Hi, helpdesk!!! I just finished putting in a bunch of entries and got an error and now I can’t do anything.

with a structured form

Reason for Call: Problem with A Program
Program Name: Tran120
What Were You Doing: Posting some invoices
Last Thing You Did: Posted the batch, and tried to create a batch template so I could use it again
Error Condition: Screen froze with error 22,B4.
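
Once requests arrive in this shape, simple analytics become a few lines of code. The sketch below is only an illustration: the HelpdeskRequest class is an assumed representation of the form, and every request other than the first is invented sample data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class HelpdeskRequest:
    # One attribute per field on the form.
    reason_for_call: str
    program_name: str
    what_were_you_doing: str
    last_thing_you_did: str
    error_condition: str


REQUESTS = [
    # The request from the form above.
    HelpdeskRequest("Problem with A Program", "Tran120",
                    "Posting some invoices",
                    "Posted the batch, and tried to create a batch template",
                    "Screen froze with error 22,B4."),
    # Hypothetical requests, made up for this sketch.
    HelpdeskRequest("Problem with A Program", "Tran120",
                    "Printing a report", "Clicked Print",
                    "Nothing happened"),
    HelpdeskRequest("Problem with A Program", "Payroll",
                    "Entering timesheets", "Saved the timesheet",
                    "Screen went blank"),
]


def requests_per_program(requests):
    """Count how many requests mention each program."""
    return Counter(r.program_name for r in requests)
```

Calling requests_per_program(REQUESTS) counts two requests for Tran120 and one for the made-up Payroll program, a pattern that would be hard to extract reliably from free-form emails.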

As a final example, we can add structure to short text messages using agreed codes; for example, “to vote for John Doe, text LUVJD to this number”.  This approach works fine for text messages.  Unfortunately, it is often used for structuring names of documents, so we often see document names like

RFP Response for Big Enterprise – final v 10 – Joe’s comments.

When we have a folder full of names like this, we realise we have a problem of clarity.  Full text search will not help, as all versions of this document will likely show up in the same search.  That’s why we need the other approach to handling unstructured data, namely “Adding Description” and that’s what we’ll talk about next.

Adding Description

Whether or not we can add structure to unstructured content, we often want to add description.  This is done by adding metadata, information external to the content that describes some useful aspects of the content.  Much of the time, this is done to help us search or file the content, but different metadata can be used for other aspects of information management, for example describing how long the content is to be retained, or whether it is to be archived.  We will focus on metadata for categorizing and finding content, but want to make it clear that this is not the only possible purpose for metadata.

One part of the information architect’s job is to help the client define suitable metadata. We explore with the client how they might want to search, filter, and categorize their information. If we were dealing with a library of sales collateral, for example, we might consider searching it by product or business sector, and filtering on whether it was overview or detailed material, or whether it was aimed at a business or a technical audience.

Coming up with initial metadata ideas is not too hard, but additional work is needed to turn it into a practical tool. The sales collateral examples illustrate various situations that might be encountered:

  • Product
    • there might already be a product catalog that we can leverage
    • products might have both numbers and names
    • products can often be arranged in a hierarchical structure (look at a complex parts catalog or a consumer electronics site)
    • a document might refer to several products or product categories
  • Business Sector
    • this might already exist in the organization
    • if it doesn’t, we should consider looking for a scheme that could apply enterprise-wide, not just to sales collateral
    • we can help the user develop this scheme using card sorting and other familiar techniques; the final result is likely to be comprehensible to users tagging content and viewing content
  • Overview or Detail
    • this might not already exist
    • it might be difficult to get a definition of what we mean by Overview and Detail
    • even if we did, the final result might not be reliably comprehensible to users tagging or viewing content
    • we might have an ah-ha moment and realise that the sales staff already talk about their documentation in terms of Two-Pager, Briefing Notes, etc., and that Collateral Type might be a more useful piece of metadata, especially for internal audiences
    • going with Collateral Type, this metadata is likely to be applicable just to the Sales Department rather than being enterprise-wide.
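
One way to picture where this analysis ends up is as a small metadata schema with controlled vocabularies. The Python sketch below is an assumption for illustration only: the product catalog entries and business sectors are invented, and the only values taken from the discussion above are the two collateral types.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical controlled vocabularies; in practice these would come from
# an existing product catalog and an enterprise-wide sector scheme.
PRODUCT_CATALOG = {
    # category -> {product number: product name}
    "Cameras": {"C-100": "Compact Camera", "C-200": "Studio Camera"},
    "Lenses": {"L-10": "Zoom Lens"},
}
BUSINESS_SECTORS = {"Finance", "Healthcare", "Retail"}
COLLATERAL_TYPES = {"Two-Pager", "Briefing Notes"}  # Sales Department only


@dataclass
class CollateralMetadata:
    title: str
    # A document may refer to several products or product categories.
    products: List[str] = field(default_factory=list)
    business_sector: str = ""
    collateral_type: str = ""


def validate(meta: CollateralMetadata) -> List[str]:
    """Return a list of problems; an empty list means every tag is known."""
    known = set(PRODUCT_CATALOG)          # category names
    for items in PRODUCT_CATALOG.values():
        known.update(items)               # product numbers
    problems = [f"unknown product: {p}" for p in meta.products if p not in known]
    if meta.business_sector not in BUSINESS_SECTORS:
        problems.append(f"unknown business sector: {meta.business_sector}")
    if meta.collateral_type not in COLLATERAL_TYPES:
        problems.append(f"unknown collateral type: {meta.collateral_type}")
    return problems
```

Checking tags against a fixed vocabulary like this is what keeps the metadata usable for filtering; free-text tags would drift back toward the document naming problem described earlier.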

The next part of the information architect’s job, now that we have got the client excited, is to raise the question of how the metadata will get assigned to content.  There are two options: adding it by hand or adding it by program.

In a few lucky cases, adding metadata by hand is feasible.  This is the case where we have professionals whose job is corporate communications or corporate librarianship, who believe in the importance of tagging content, who understand the domain they are working in, where the volume of publications is small, and where the metadata structure is simple.  A feasible example might be corporate communications tagging internal news releases with category, any applicable departments, and any applicable projects.

Otherwise, there are a lot of “ifs” that might not pertain.  Some employees might not care about tagging the content they create.  Some metadata might be so complex that it would not be feasible to reliably tag a big document with all applicable instances of metadata.

In this case, we will need autotagging software to do the job of tagging.  We still need to define the metadata that we will use. The autotagging software scans content and applies metadata tags based on rules that are set up and maintained by the organization.  For example, an HR department might have the rules:

If the document contains the words “Human Resources”, tag it with “Human Resources”
If the document contains “HR”, tag it with “Human Resources”.

The software makes it easy to test the applicability of rules and tweak them based on what we find.  For example, when we run the second rule on a set of documents, it will show which documents match the rule, like this:

….. contact HR at hr@ourco.com …
… the HR Department provides services …
… to get a horizontal rule in your web page, use the hr tag …

In this case, we have learned that “HR” also matches the HTML horizontal rule tag, so we can modify the rule:

If the document contains “HR” and not “tag” and not “HTML”, tag it with “Human Resources”.

Retesting the rule will now exclude the document talking about horizontal rules.
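
The test-and-tweak cycle can be sketched in a few lines of Python. This is not how any particular autotagging product works; the rule format (a tag, the terms that must appear, and the terms that must not) and the function names are assumptions made for this example, and matching is assumed to be on whole words, ignoring case.

```python
import re


def contains(text, term):
    """True if the term appears as a whole word or phrase, ignoring case."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


# Each rule: (tag to apply, terms that must all appear, terms that must not appear).
RULES = [
    ("Human Resources", ["Human Resources"], []),
    ("Human Resources", ["HR"], ["tag", "HTML"]),  # the refined second rule
]


def autotag(text, rules=RULES):
    """Return the set of tags whose rules match the text."""
    tags = set()
    for tag, required, excluded in rules:
        if all(contains(text, t) for t in required) and \
                not any(contains(text, t) for t in excluded):
            tags.add(tag)
    return tags


# The three matches shown above, used here as test documents.
SNIPPETS = [
    "contact HR at hr@ourco.com",
    "the HR Department provides services",
    "to get a horizontal rule in your web page, use the hr tag",
]
```

Running autotag over SNIPPETS tags the first two with Human Resources and leaves the horizontal rule document untagged. Note that the exclusion is blunt: a genuine HR document that happens to use the word “tag” would also be skipped, which is why rules need ongoing testing and maintenance.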

There can be quite an elaborate syntax for building tagging rules.  Some things that we can handle are Booleans, words that sound alike even though they have different spellings, and pattern matching.

By the way, pattern matching has some interesting uses.  For example, we can use it to tag documents that contain email addresses or phone numbers.  This is useful if we want to scrub a set of documentation to make sure that it does not contain any personally identifiable information.  And of course, using the rules, we could also test for the anti-bot form of email, name[at]provider[dot]extension.
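
As a rough sketch of what such patterns might look like, here are three regular expressions in Python. They are deliberately simplified assumptions: the email pattern is far looser than the real address grammar, the phone pattern only covers one North American style of number, and the tag names are invented.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# The anti-bot form, e.g. name[at]provider[dot]extension.
OBFUSCATED_EMAIL = re.compile(
    r"[A-Za-z0-9._%+-]+\s*\[at\]\s*[A-Za-z0-9-]+(?:\s*\[dot\]\s*[A-Za-z0-9-]+)+",
    re.IGNORECASE,
)
# Assumes numbers written like 555-123-4567 or (555) 123-4567.
PHONE = re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b")


def pii_tags(text):
    """Return tags for the kinds of personal information found in the text."""
    tags = set()
    if EMAIL.search(text) or OBFUSCATED_EMAIL.search(text):
        tags.add("Contains Email Address")
    if PHONE.search(text):
        tags.add("Contains Phone Number")
    return tags
```

A scrubbing pass could then list every document where pii_tags returns a non-empty set, for a person to review before the documentation is released.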

