Tuesday, September 18, 2012

Unstructured vs Structured

There are many terms I dislike, from the misuse of the term innovation to the whole computer utility vs cloud debate. Another example is the unstructured vs structured data argument.

The terms unstructured vs structured imply there is a difference in the data sets themselves, i.e. unstructured data is by its very nature unstructured and unlike structured data. They imply this is a permanent state of affairs: a set of data which has no structure and therefore cannot be modeled.

However, our entire history of scientific endeavour can be broken down into three stages: discovery of data we didn't understand, creation of models to explain the data, and finally data we now understand. In other words, we constantly move from unstructured to structured via the creation of a model.

Hence I prefer the terms un-modeled data vs modeled data. This implies that the difference lies not in the data sets but simply in our ability to model them. It also implies that there will be movement from one to the other.

What is your view? Am I the only person who dislikes this framing of unstructured vs structured?

7 comments:

Anonymous said...

I dislike it too. I've also encountered your point, although I've phrased it differently; I see data that has "explicit structure" in terms of its particular representation (eg, where that structure is reflected by splitting into records and columns in a relational database), and data that has "implicit structure" (eg, it's XML with lots of rich structure, but to the software storing it, it's just a series of bytes). Implicit structure is just waiting for a conversion process, or simply the attaching of metadata, to make it explicit. There's nothing deeper than that.

But there's another situation where structure remains implicit - when the storage model in use can't handle the complexity of the structure. The relational model's rigidity has led to much interesting data being stored in "unstructured blobs" inside a SQL database. But, again, it's merely unstructured from the viewpoint of the SQL database; the higher-level application sucks that blob up and applies some structure to it to work with it...
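To make the blob case concrete, here is a minimal sketch in Python, assuming an in-memory SQLite database and a made-up XML order document. The table name `docs`, the payload and both helper functions are hypothetical and chosen only for illustration. The database sees nothing but bytes; the structure only becomes explicit once the application parses the blob.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical payload: rich structure to us, just a run of bytes to the database.
PAYLOAD = b"<order id='42'><item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>"


def store_blob(conn, payload):
    """Store the document as an opaque BLOB; the SQL layer knows nothing about its shape."""
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body BLOB)")
    cur = conn.execute("INSERT INTO docs (body) VALUES (?)", (payload,))
    return cur.lastrowid


def explicit_rows(conn, doc_id):
    """Read the blob back and apply a model to it: one (order_id, sku, qty) row per item."""
    (body,) = conn.execute("SELECT body FROM docs WHERE id = ?", (doc_id,)).fetchone()
    root = ET.fromstring(body)
    return [
        (root.get("id"), item.get("sku"), int(item.get("qty")))
        for item in root.iter("item")
    ]


conn = sqlite3.connect(":memory:")
doc_id = store_blob(conn, PAYLOAD)
rows = explicit_rows(conn, doc_id)
print(rows)
```

The same bytes are "unstructured" in `store_blob` and "structured" in `explicit_rows`; only the model applied to them has changed.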

Andrew Elmhorst said...

Hmmm... I'm trying to understand the scope of your statements. Unstructured data is very common and makes up 80% of the world of data and information around us by some estimates (see http://en.wikipedia.org/wiki/Unstructured_data for one discussion). Do you believe that all unstructured data, as in all text, structured and semi-structured, will eventually be modelled?

Or is the scope of your post specifically limited to usages of NoSQL databases? In both cases, I probably have opinions, but I'm trying to determine how wide a net you cast...

Simon Wardley said...

Hi Andrew Elmhorst,

The post was a bit rambling so I've shrunk it down to the actual point.

Anonymous said...

I tend to think of this as data that's already modelled, data that could be modelled, and data that you could spend your whole life trying to model without ever achieving very much.

We probably agree on the central point, which is that modelling comes along and tidies up data sets that really matter to people. It would be interesting to come up with a function for modelling latency versus data value (or the value that can be extracted more effectively once the data is bashed into shape).

The Watcher! said...

Thank you. I created this blog post (http://e-trust.blogspot.co.uk/2012/09/data-entropy-my-new-battleground.html) as a direct result of your post, Simon.
I enjoyed the journey and find myself even more motivated to fight data entropy, and to add or maintain the order, structure/form and meaning of my personal data.

Frankly, today it is largely unstructured! ;-)

The Watcher! said...

My final thought is that the existence of a model doesn't necessarily result in the related data being organised. In the physical world my model for my home office is close to perfect, and I keep nearly attaining it, but the majority of the time reality differs greatly. There goes that darned entropy again!

Mark Wilson said...

This is a very interesting discussion, and I tend to agree with Simon's point that structure is less significant than whether or not it's been modelled. His latest tweet on the topic nails it for me though: http://twitter.com/swardley/statuses/254182225552240640

At some point data arrives. It may lack structure. It might be put into a NoSQL database and some analysis might be performed. Once we have derived some meaning from it we have to decide what to do with it. Keep it in NoSQL? Discard it? Move it to a data warehouse?

What we really need to avoid is unnecessary duplication of data, and that's where a Linked Data database has potential. Ultimately it doesn't matter which database it's in (structured, unstructured, semi-structured) but that we can access the data and make use of it, and that's where the modelling comes in.