String is not a good universal datatype

Abstract

In this post I want to make clear that using Strings as a universal datatype is terrible for object design. Development may be faster in the short run, but in the long run a price has to be paid.

Introduction

I’m working on an EJB 2.1/Struts project where a lot of the fields of objects are Strings. These fields are used to store different kinds of information: account numbers, phone numbers, status information (an enumeration of values), etc.

The problems

The problems with using Strings as datatypes:

  1.   Operations on the strings are scattered all over the place. If you need a function to format a phone number, it is often created ad hoc instead of in a suitable class. This makes the logic very hard to find again, and as a consequence you get more code duplication.
  2.   It is easy to get inconsistent data: a phone number of 20 digits, for example, is not possible, but you can store anything in a String. This makes it very difficult to control the consistency of the data.
  3.   You don’t have any type safety: you could pass a telephone number to a bank account format function. Using a specific datatype instead of a String gives you compile-time type safety, and catching such mistakes at compile time is a good way to reduce development time.
  4.   It is more difficult to understand what a field means (especially if the documentation is not that great): if you have a field of type String named imsi, how can you tell it contains a phone number? If the field is of type PhoneNumber, you at least know it is a phone number.

And I expect there are more, but these were the first that came to mind.
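
To make points 1 to 3 a bit more concrete, here is a minimal sketch of what such a datatype could look like. The class names (PhoneNumber, BankAccountNumber), the parse method and the length limits are all assumptions for illustration and are not taken from the project; the upper bound of 15 digits is the E.164 maximum, the lower bound of 3 is simply a guess that leaves room for short codes.

```java
import java.util.Objects;

/** Hypothetical value object: validation and normalisation live in one place. */
final class PhoneNumber {
    private static final int MIN_DIGITS = 3;  // assumed lower bound
    private static final int MAX_DIGITS = 15; // E.164 maximum length

    private final String digits;

    private PhoneNumber(String digits) {
        this.digits = digits;
    }

    /** Strips common separators and rejects anything that is not a plausible number. */
    static PhoneNumber parse(String raw) {
        Objects.requireNonNull(raw, "raw");
        String cleaned = raw.replaceAll("[\\s()+.-]", "");
        if (!cleaned.matches("\\d{" + MIN_DIGITS + "," + MAX_DIGITS + "}")) {
            throw new IllegalArgumentException("Not a phone number: " + raw);
        }
        return new PhoneNumber(cleaned);
    }

    String digits() {
        return digits;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof PhoneNumber && ((PhoneNumber) other).digits.equals(digits);
    }

    @Override
    public int hashCode() {
        return digits.hashCode();
    }

    @Override
    public String toString() {
        return digits;
    }
}

/** Second hypothetical type, only here to show the compile-time check. */
final class BankAccountNumber {
    private final String value;

    BankAccountNumber(String value) {
        this.value = Objects.requireNonNull(value, "value");
    }

    String value() {
        return value;
    }
}

class PhoneNumberDemo {
    static String formatAccount(BankAccountNumber account) {
        return "Account " + account.value();
    }

    public static void main(String[] args) {
        PhoneNumber phone = PhoneNumber.parse("+31 (20) 555-0100");
        System.out.println(phone.digits());
        // formatAccount(phone); // does not compile: PhoneNumber is not a BankAccountNumber
        try {
            PhoneNumber.parse("12345678901234567890"); // 20 digits
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With plain Strings the commented-out call would compile without complaint, and the 20-digit value would be stored as is. How strict the check in parse should be is a separate decision; the point is only that there is one obvious place to put it.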

Conclusion

Of course it is easier in the short run (you don’t have to create any special classes), but in the long run development will slow down. I thought most developers knew this, but it appears that not all developers share the same knowledge.

14 Responses to String is not a good universal datatype

  1. John Smith says:

    It is obvious you don’t have much real world experience. Beginners feel this way and design each data type separately. Then after the project is late (or cancelled) because they have to make so many custom classes to support the custom types, they learn that quite often, it is OK to use Strings and accept that there might be bad data.

    After all, who wants to create an Address Line 2 data type? We should never assign an address line 1 type to an address line 2 field, right? The cost of having Java catch this mistake is very high. Leaving it as a string and writing a unit test would be much easier in the real world.

    If your projects don’t have due dates, this is great advice. If they do, using Strings is a good compromise.

  2. pveentjer says:

    Why should a project be late if less time has to be spent debugging and understanding a system? I have worked on systems where everything is a String, and on my current project I see the same difficulties; that is the reason I wrote this blog. I’m spending way too much time on understanding what a field means, and this time could be reduced by using good datatypes.

    Why should a project be ‘easier’ to unit test if everything is a String? I have never had any problems using types other than Strings in my unit tests.

    So I certainly don’t agree with your ‘professional’ conclusions.

  3. Using value objects instead of Strings effectively requires a habit that you only develop by practicing it. However, applying this practice pays off.

    Using only Strings makes your life harder when it comes to maintenance or debugging. When maintaining business logic you are distracted by a lot of String manipulation that would naturally fit into the value objects. When debugging you wonder how the database got a strange value (missed validation) or how a portal page got invalid zip codes. Creating value objects makes the code clear and helps to detect problems earlier.
    One recurring issue with value objects is validation that is too strict (the “assigning address line 1 to address line 2” issue). It can easily be avoided by relaxing the validation rules and keeping only those that are really necessary.

  4. Nate says:

    I also disagree with the “professional” comment. Professional developers create quality code fast enough to meet deadlines without creating support nightmares. That being said, strings are the universal datatype because the grand majority of information in a system is human-input or at least must be human-readable. The problem that you’ve actually met up with is a question of logical control — who “owns” the logic that manipulates the data? Should it be the data itself, or should it be some sort of controlling object?

    The question you should ask is: is the information represented by a string actually an object, or is it information *describing* an object? I find it helpful to consider the physical nature of objects in order to answer this. For example, a “user” is an actual entity in the physical world, so it should probably be represented as an object in the system. However, a “phone number” exists only to describe a user, and thus it shouldn’t have its own separate datatype.

    Just my two cents. :)

  5. Greg says:

    John Smith didn’t say that using Strings made it easier to unit test; he said it was an acceptable and pragmatic approach to the storage of a variety of data. He simply made the point that it was effective enough to not have a microscopic level of type safety, and rely on unit testing catching any argument mis-matches.

    It’s also interesting the comment you make about phone numbers …

    In the SMPP specification for example (which defines the protocol for a computer to send and receive SMS text messages via an SMSC in a Telco data centre), telephone numbers are defined as strings. Furthermore, telephone numbers don’t have a fixed length. For example, in the SMS world you can have shortcodes which are typically 6 digits long, but can also be 7 or 8 digits. You also have a fully qualified international format mobile number such as 61455500042. Additionally, the telephone number field does not have to be restricted to digits. It is possible to send SMS messages where the ‘from’ is in fact a string of alphanumeric characters such as 19POLL.

    So type safety is good; but strings are often an acceptable, pragmatic solution.

  6. Although I personally can understand your arguments, I’ll have to go with String being a good-enough universal data type. My argument would be that, as far as development time is concerned, using Strings takes less time than developing new class types (and thus leaves less code to maintain later), you can map a String easily to a DB, and you can use annotations to express in metadata all that is necessary to make sure your model is sound and well formatted (I believe both the Hibernate and Spring people are working on frameworks, and there is word a JSR is coming as well). As for compile-time safety, I think the argument doesn’t hold: inserting data is mostly a run-time thing, so you’ll get an error on inserting wrongly formatted data in both cases and you’d need to handle that in both cases (and I’d also add that one should not handle validation using exceptions). On 1) and 4) I’d go with naming conventions and helper classes for formatting. You should really move on from EJB 2.1.

  7. Ashley Aitken says:

    I agree with the blog entry, Peter.

    If anything should be a class, PhoneNumber definitely should be. PhoneNumbers are a real abstraction that (by their nature) don’t change much.

    As for those who complain about the extra work of writing the class: it should be done once and incorporated into a library.

    That’s the whole point of OO.

    Cheers,
    Ashley.

  8. Nathan Lee says:

    Actually, phone numbers are one example where different countries have quite different numbers of digits and formats (e.g. where people put the brackets, if that’s allowed), and mobiles are different from fixed lines, which are different from full international codes. There’s nothing worse than someone deciding to adopt some restrictive pattern based on what they think phone numbers should look like. Also: what if Skype usernames replace the concept of phone numbers ;)

    You might say that “zip codes” never change either, but the rest of the world has different schemes too.
    I don’t know how many useless address forms I’ve had to butcher data to fit into because someone assumed that the USA is the world, e.g. *postcodes* in Australia are 4 digits, and states in Australia might be more than just two characters (e.g. NSW, QLD, SA, WA, NT, VIC, ACT).

    I also think that Strings are a really powerful way of loosening up your data model and anticipating change. Having an architecture where you plug in validators, formatters etc. that work on Strings means you can get a lot more reuse bang for your buck. Treating the world as a hashtable of strings works quite well for form data: you don’t need to keep adding fields to objects, and it is pretty efficient and quick to code for user data. Of course, it depends on what type of application you’re writing (I’ve often had to deal with forms that need to be built to change over time a hell of a lot).

  9. Ivan says:

    Every wonderful line of TCL code I have written disagrees with you :)

  10. ctran says:

    I bet you think Ruby sucks too! (lame bait dropped…)

  11. pveentjer says:

    Maybe I was a little bit unnuanced (something I’m very good at). Creating a datatype for every ‘type’ of information is overkill. But if you start doing things with your data, like formatting it or making decisions based on its content, then it would be a good idea to introduce a specific datatype instead of just storing everything in plain strings.

  12. Jelmer says:

    This is what Ken Pugh in “Prefactoring” refers to as ADTs, or abstract datatypes. I think there’s certainly a lot to be said for this. But making an ADT for each and every field seems overkill. Even in this book, which advocates their use, CommonString, which is basically just a regular string, is used in lots of places.
    For fields that have a finite number of possible values, using an object is a natural fit. For something like PhoneNumber there seems less to be gained, and for something like AddressLine even less. So yes, I would agree, but I think it’s hard to balance their use.

  13. Ashley Aitken says:

    Nope, I think PhoneNumber is a great example of where OO should be used. The fact that different places have different formats for phone numbers etc. reinforces this. A simple constructor parameter (or subclass) can be used to choose the type of phone number required. Hide all that detail in the class, and let all the different types of phone numbers share common code. The abstraction of a phone number is constant; it is the detail that varies. Abstraction is all about hiding detail. Classes were made primarily to allow us to implement better Abstract Data Types (a class is an ADT).

    Cheers,
    Ashley.

  14. Stan says:

    Actually, phonenumbers are one example where different countries have quite different numbers of digits, formats (e.g. where people put the brackets if that’s allowed), mobiles are different from fixed lines which are different from full international codes.

    What a great argument for an abstract PhoneNumber class and a subclass for each country. Where would you rather put validation and formatting code?

    My line of work deals with id numbers for insurance and other financial contracts. A class per format works very nicely, though it sometimes takes some external indicator to parse ambiguous user input, perhaps a country code for phone numbers.
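
The “abstract class plus a subclass per country” idea from the last two comments could be sketched roughly as follows. This is only an illustration under simplified assumptions: the class names, the PhoneNumbers.parse factory and the rule that a national number is exactly 10 digits starting with a trunk “0” are made up for the example and do not cover real numbering plans. The two-letter country code plays the role of the external indicator mentioned above.

```java
/** Hypothetical base class: shared cleaning, validation and formatting code. */
abstract class PhoneNumber {
    private final String national; // digits only, including the leading trunk "0"

    protected PhoneNumber(String raw, int expectedDigits) {
        String digits = raw.replaceAll("\\D", ""); // drop spaces, brackets, dashes
        if (digits.length() != expectedDigits || !digits.startsWith("0")) {
            throw new IllegalArgumentException("Invalid phone number: " + raw);
        }
        this.national = digits;
    }

    /** The only detail each country has to supply in this simplified sketch. */
    abstract String countryCallingCode();

    final String national() {
        return national;
    }

    /** Replaces the trunk "0" with "+" and the country calling code. */
    final String international() {
        return "+" + countryCallingCode() + national.substring(1);
    }
}

final class AustralianPhoneNumber extends PhoneNumber {
    AustralianPhoneNumber(String raw) {
        super(raw, 10); // assumed length rule
    }

    @Override
    String countryCallingCode() {
        return "61";
    }
}

final class DutchPhoneNumber extends PhoneNumber {
    DutchPhoneNumber(String raw) {
        super(raw, 10); // assumed length rule
    }

    @Override
    String countryCallingCode() {
        return "31";
    }
}

/** Hypothetical factory: the country is the external indicator for ambiguous input. */
final class PhoneNumbers {
    private PhoneNumbers() {
    }

    static PhoneNumber parse(String country, String raw) {
        switch (country) {
            case "AU":
                return new AustralianPhoneNumber(raw);
            case "NL":
                return new DutchPhoneNumber(raw);
            default:
                throw new IllegalArgumentException("Unsupported country: " + country);
        }
    }
}

class CountryPhoneNumberDemo {
    public static void main(String[] args) {
        PhoneNumber mobile = PhoneNumbers.parse("AU", "0455 500 042");
        System.out.println(mobile.international());
    }
}
```

Short codes and alphanumeric senders such as 19POLL would not pass this validation; whether they deserve their own subclass or a separate type altogether is exactly the kind of trade-off the thread is arguing about.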
