From Street Coder by Sedat Kapanoǧlu

This article discusses the importance of using data types in programming for writing and maintaining code bases.




Types

Programmers take data types for granted. Some even argue that programmers are faster in dynamically typed languages like JavaScript or Python because they don’t have to deal with intricate details like deciding the type of each variable.

HINT  Dynamically typed means that data types of variables or class members in a programming language can change during runtime. You can assign a string to a variable then assign an integer to the same variable in JavaScript because it’s a dynamically typed language. A statically typed language like C# or Swift doesn’t allow that.

Yes, specifying types for every variable, parameter, and member in the code is a chore, but you need to adopt a holistic approach to being faster. Being fast isn’t solely about writing code but about maintaining it too. There could be a few cases where you don’t need to care about maintenance because you’ve learned that you got fired and you couldn’t care less, but that isn’t going to be the case all the time (hopefully). Apart from that, software development is a marathon, not a sprint.

Failing early is one of the best practices in development. Data types are one of the earliest defenses against development friction in coding. Types let you fail early and fix your mistakes before they become a burden. Aside from the obvious benefit of not confusing a string with an integer accidentally, you can make types work for you in other ways.

Being strong on the type

I’m sure most programming languages have types. Even the simplest programming languages like BASIC had types: strings and integers; some of its dialects even had real numbers. Types were invented to make our lives easier. Types are free checks for correctness, and understanding the underlying type system can help tremendously to make you a productive programmer.

How programming languages implement types is strongly correlated with whether they’re interpreted or compiled:

  • Interpreted programming languages like Python or JavaScript let you run code in a text file immediately without a compilation step. Because of their immediate nature, variables tend to have flexible types: you can assign a string to a variable that previously held an integer, and in some of them, like JavaScript, you can even add strings and numbers together. These are usually called dynamically typed languages because of how they implement types. You can write code much faster in interpreted languages because you don’t need to declare types.
  • Compiled programming languages tend to be stricter. How strict they are depends on how much pain the language designer wants to inflict upon you. For example, Rust can be considered the German engineering of programming languages: extremely strict, perfectionist, and therefore error-free. C can also be considered German engineering, but more like Volkswagen: it lets you break the rules and pay the price later. Both languages are statically typed: once a variable is declared, its type can’t change. Rust, like C#, is called strongly typed, whereas C is considered weakly typed.

Strongly typed and weakly typed describe how relaxed a language is about assigning values of different types to each other. C is more relaxed in that sense: you can assign a pointer to an integer or vice versa without issues. C# is stricter: pointers/references and integers are incompatible types.
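To make the C# side of that comparison concrete, here’s a minimal sketch of where the compiler insists on explicit conversions. The Conversions class and its method names are made up for illustration.

```csharp
using System;

public static class Conversions
{
    public static int ParseCount(string text) {
        // int count = text;       // doesn't compile: no implicit string-to-int conversion
        return int.Parse(text);    // the conversion has to be spelled out
    }

    public static long Widen(int value) {
        return value;              // allowed implicitly: int to long can't lose data
    }

    public static int Narrow(long value) {
        // Narrowing needs an explicit cast; checked makes it throw
        // OverflowException instead of silently truncating
        return checked((int)value);
    }
}
```

The compiler only lets the lossless conversion through silently; everything else has to be requested on purpose.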

Table 1. Flavors of type strictness in programming languages

|                | Statically typed | Dynamically typed |
|----------------|------------------|-------------------|
| Strongly typed | Variable type can’t change at runtime. Different types can’t be substituted for each other. Examples: C#, Java, Rust, Swift, Kotlin, Pascal | Variable type can change at runtime. Different types can’t be substituted for each other. Examples: Python, Ruby, Lisp |
| Weakly typed   | Variable type can’t change at runtime. Different types can be substituted for each other. Examples: C, Visual Basic | Variable type can change at runtime. Different types can be substituted for each other. Examples: JavaScript, VBScript |

Strict programming languages can be frustrating. They can even make you question life and why we exist in the universe when it comes to languages like Rust. Declaring types and converting them explicitly when needed may look like a lot of bureaucracy. You don’t need to declare types of every variable, argument, and member in JavaScript for example. If many programming languages can work without explicit and strict types, why do we burden ourselves with them?

The answer is simple: types can help us write code which is safer, faster, and easier to maintain. We can reclaim the time we lose declaring the types of variables and annotating our classes with the time we gain by debugging fewer bugs and solving fewer performance issues. Apart from the obvious benefits of types, there are some subtle benefits too. Let’s go over them.

Proof of validity

Proof of validity is one of the lesser-known benefits of having predefined types. Suppose that you’re developing a microblogging platform which only allows a certain number of characters in every post, and you’re not judged for being too lazy to write something longer than a sentence. In this hypothetical microblogging platform, you can mention other users in a post with an “@” prefix and mention other posts with a “#” prefix followed by the post’s identifier. You can even retrieve a post by typing its identifier in the search box. If you type in a username with an “@” prefix in the search box, that user’s profile is shown.

User input brings a new set of problems with validation. What happens if the user provides letters after the “#” prefix? What if they input a longer number than allowed? It might seem like those scenarios work themselves out, but usually your app crashes because somewhere in the code path, something that doesn’t expect an invalid input throws an exception. It’s the worst possible experience for the user: they don’t know what has gone wrong and they don’t even know what to do next. It can even become a security problem if you display that given input without sanitizing it.

Data validation doesn’t provide a proof of validity throughout the code. You can validate the input on the client, but somebody, a third-party app for example, can send a request without validation. You can validate in the code that handles web requests, but another app of yours, such as your API code, can call your service code without the necessary validation. Similarly, your database code can receive requests from multiple sources, like the service layer and a maintenance task, and you need to make sure that you are inserting the right records in the database.


Figure 1. Unvalidated data sources and places where you need to validate data repetitively


That might eventually make you validate the input at multiple places around the code, and you need to make sure that you are consistent in validation too. You don’t want to end up with a post with an identifier of “-1”, or a user profile named “' OR 1=1--” (which is a basic SQL injection attack).

Types can carry over proof of validity. Instead of passing an integer for blog post identifiers or strings for usernames, you can have classes or structs that validate their input on construction, which makes it impossible for them to contain an invalid value. It’s simple, yet powerful. Any function that receives a post identifier as a parameter asks for a PostId class instead of an integer. This allows you to carry over proof of validity after the first validation in the constructor. If it’s an integer, it needs to be validated; if it’s a PostId, it has already been validated. There’s no need to check its contents, because there’s no way to create it without validation, as you can see in the following snippet. The only way to construct a PostId in the code snippet is to call its constructor, which validates its value and throws an exception if it fails. That means it’s impossible to have an invalid PostId instance:

 
 public class PostId
 {
     public int Value { get; private set; }    #A
     public PostId(int id) {   #B
         if (id <= 0)  {
             throw new ArgumentOutOfRangeException(nameof(id));
         }
         Value = id;
     }
 }
  

#A Our value is impossible to be changed by external code.

#B Constructor is the only way to create this object.
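To tie this back to the search box scenario, here’s a minimal sketch of validating raw input once at the boundary and passing the typed value along afterward. SearchBox, TryParsePostId, and ShowPost are hypothetical names used only for illustration, and PostId is repeated so the snippet compiles on its own.

```csharp
using System;

public class PostId
{
    public int Value { get; private set; }
    public PostId(int id) {
        if (id <= 0) {
            throw new ArgumentOutOfRangeException(nameof(id));
        }
        Value = id;
    }
}

public static class SearchBox
{
    // Hypothetical helper: turns raw input such as "#42" into a PostId.
    // Returns false instead of throwing when the input is malformed.
    public static bool TryParsePostId(string input, out PostId result) {
        result = null;
        if (string.IsNullOrEmpty(input) || input[0] != '#') {
            return false;
        }
        // int.TryParse rejects letters and numbers that don't fit in an int
        if (!int.TryParse(input.Substring(1), out int id) || id <= 0) {
            return false;
        }
        result = new PostId(id);
        return true;
    }

    // Hypothetical consumer: it asks for a PostId, so it has nothing to validate
    public static string ShowPost(PostId id) {
        return "Showing post " + id.Value;
    }
}
```

Only TryParsePostId ever touches the raw string. Everything downstream, like ShowPost, can rely on the type alone.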

The style of code examples

Placement of curly braces is the second most debated topic in programming that hasn’t been settled by consensus yet, right after tabs vs. spaces. I prefer Allman style for most C-like languages, like C# and Swift. Allman style is where every curly brace character resides on its own line. Swift officially recommends using 1TBS (One True Brace Style), aka improved K&R style, where an opening brace is on the same line as the declaration. People still feel the need to leave extra blank lines after every block declaration because 1TBS is too cramped. When you add blank lines, it effectively becomes Allman style, but people can’t bring themselves to admit it.

Allman style is the default for C# where every brace is on its own line. I find it much more readable than 1TBS or K&R. Java uses 1TBS by the way.

I’ve had to format the code in 1TBS style because of the publisher’s typesetting restrictions, but I suggest you consider Allman-style when using C# not only because it’s more readable, but because it’s the most common style for C#.

When you decide to go down that path, it’s not as easy as the example I’ve shown though. For example, comparing two different PostId instances with the same value wouldn’t work as you’d expect because, by default, comparison only compares references, not the contents of the classes. You need to add a whole scaffolding around it to make it work without issues. Here’s a quick checklist:

  • You need to at least implement an override for the Equals method, as some framework functions and some libraries can depend on it to compare two instances of your class.
  • If you plan on comparing values yourself using equality operators (“==” and “!=”), you have to implement their operator overloads in the class.
  • If you plan to use it in a Dictionary<K,V> as a key, you need to override the GetHashCode method.
  • String formatting functions such as String.Format use the ToString method to get a string representation of the class suitable for printing.

Don’t use operator overloading

Operator overloading is a way to change how operators like “==”, “!=”, “+”, and “-” in a programming language behave. Developers who learn about operator overloading might go overboard and create their own language with weird behavior for irrelevant classes, like overloading the “+=” operator to insert a record into a table with a syntax such as db += record. It’s almost impossible to understand the intent of such code. Don’t be that person. Even you’ll forget what it does and you’ll beat yourself up over this. Use operator overloading only to provide alternatives to equality and typecasting operators, and only when needed. Don’t waste time implementing them if they won’t be needed.

A PostId class with all the necessary plumbing to make sure it works in all equality scenarios is shown in listing 1. We overrode ToString() so our class becomes compatible with string formatting and its value is easier to inspect when we debug. We overrode GetHashCode() and it returns Value directly because the value itself fits perfectly into an int. We overrode the Equals() method so equality checks in collections of this class work correctly in case we need unique values or we’d like to search against this value. Finally, we overloaded the “==” and “!=” operators so we can directly compare two PostId values without accessing their Value properties.

Listing 1. Full implementation of a class encompassing a value

 
 public class PostId
 {
     public int Value { get; private set; }
     public PostId(int id) {
         if (id <= 0) {
             throw new ArgumentOutOfRangeException(nameof(id));
         }
         Value = id;
     }
     public override string ToString() => Value.ToString();    #A
     public override int GetHashCode() => Value;    #A
     public override bool Equals(object obj) {    #A
         return obj is PostId other && other.Value == Value;
     }
     public static bool operator ==(PostId a, PostId b) {    #B
         // a null left operand would otherwise throw NullReferenceException
         return a is null ? b is null : a.Equals(b);
     }
     public static bool operator !=(PostId a, PostId b) {    #B
         return !(a == b);
     }
 }
  

#A System.Object overrides, using arrow syntax notation

#B Overloading code for equality operators
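As a quick check of what all that plumbing buys you, here’s a sketch that exercises each override. PostIdDemo is a hypothetical name, and the class is repeated in compact form (with the equality operators written so that a null operand doesn’t throw) so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public class PostId
{
    public int Value { get; private set; }
    public PostId(int id) {
        if (id <= 0) {
            throw new ArgumentOutOfRangeException(nameof(id));
        }
        Value = id;
    }
    public override string ToString() => Value.ToString();
    public override int GetHashCode() => Value;
    public override bool Equals(object obj) {
        return obj is PostId other && other.Value == Value;
    }
    // Written so that a null operand on either side doesn't throw
    public static bool operator ==(PostId a, PostId b) {
        return a is null ? b is null : a.Equals(b);
    }
    public static bool operator !=(PostId a, PostId b) {
        return !(a == b);
    }
}

public static class PostIdDemo
{
    public static void Run() {
        var first = new PostId(42);
        var second = new PostId(42);    // different instance, same value

        Console.WriteLine(first == second);                 // True (operator ==)
        Console.WriteLine(first.Equals(second));            // True (Equals override)
        Console.WriteLine(ReferenceEquals(first, second));  // False (still two objects)

        // Dictionary and HashSet rely on GetHashCode and Equals
        var titles = new Dictionary<PostId, string> { [first] = "Hello" };
        Console.WriteLine(titles.ContainsKey(second));      // True

        var unique = new HashSet<PostId> { first, second };
        Console.WriteLine(unique.Count);                    // 1

        // String interpolation calls ToString
        Console.WriteLine($"Post #{first}");                // Post #42
    }
}
```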

The arrow syntax

The arrow syntax was introduced in C# 6.0, and it’s equivalent to the normal method syntax with a single return statement. You can opt for arrow syntax if the code is easier to read that way. It isn’t right or wrong to use arrow syntax; readable code is right, unreadable code is wrong.

The method

 
 public int Sum(int a, int b) {
     return a + b;
 }
  

is equivalent to:

 
 public int Sum(int a, int b) => a + b;
  

It’s not usually needed but in case your class needs to be in a container which is sorted or compared, you have to implement these two additional features too:

  1. You need to provide ordering by implementing IComparable<T> because equality itself isn’t sufficient to determine the order. We didn’t use it in listing 1 because identifiers aren’t ranked.
  2. If you plan on comparing values using less than or greater than operators, you need to implement related operator overloads (“<”, “>”, “<=”, “>=”) for them too.
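Those two features might be sketched as follows. Because identifiers aren’t ranked, the example uses PostScore, a hypothetical value type that isn’t part of the original listings; a real implementation would also carry the equality plumbing from listing 1.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical value type for illustration only
public class PostScore : IComparable<PostScore>
{
    public int Value { get; private set; }
    public PostScore(int value) {
        Value = value;
    }

    // By .NET convention, any instance sorts after null
    public int CompareTo(PostScore other) {
        return other is null ? 1 : Value.CompareTo(other.Value);
    }

    // C# requires < and > to be overloaded as a pair, and likewise <= and >=
    public static bool operator <(PostScore a, PostScore b) => Compare(a, b) < 0;
    public static bool operator >(PostScore a, PostScore b) => Compare(a, b) > 0;
    public static bool operator <=(PostScore a, PostScore b) => Compare(a, b) <= 0;
    public static bool operator >=(PostScore a, PostScore b) => Compare(a, b) >= 0;

    // Null-tolerant helper shared by all four operators
    private static int Compare(PostScore a, PostScore b) {
        if (a is null) {
            return b is null ? 0 : -1;
        }
        return a.CompareTo(b);
    }
}

public static class PostScoreDemo
{
    public static List<int> SortedValues() {
        var scores = new List<PostScore> {
            new PostScore(30), new PostScore(10), new PostScore(20)
        };
        scores.Sort();    // List<T>.Sort picks up IComparable<PostScore>
        return scores.ConvertAll(s => s.Value);
    }
}
```

With IComparable<T> in place, sorting works without passing a custom comparer, and the operators read naturally in conditions such as `if (a < b)`.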

This can look like a lot of work when you can pass an integer around, but it pays off in large projects, like when working in a team. You’ll see more of the benefits in following sections.

You don’t always need to create new types in order to use a validation context. You can use inheritance to create base types which contain certain primitive types with common rules. For example, you can have a generic identifier type that can be adapted to other classes. You can rename PostId class in listing 1 to DbId and derive all types from it.

Whenever you need a new type like PostId, UserId, or TopicId, you can inherit it from DbId and extend it as needed. Here we can have fully functional varieties of the same type of identifier to be able to distinguish them better from other types. You can also add more code in the classes to specialize them in their own way:

 
 public class PostId: DbId {    #A
     public PostId(int id): base(id) { }
 }
 public class TopicId: DbId {    #A
     public TopicId(int id) : base(id) { }
 }
 public class UserId: DbId {    #A
     public UserId(int id): base(id) { }
 }
  

#A We use inheritance to create new flavors of the same type.

Having separate types for your design elements makes it easier to semantically categorize different uses of our DbId type if you’re using them together and frequently. It also protects you from passing an incorrect type of identifier to a function.
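Here’s a minimal sketch of that protection in action. It also covers a detail worth checking once the equality code moves into the base class: a plain `obj is DbId other` test would treat a PostId and a UserId with the same value as equal. The GetType() comparison is one possible way to prevent that, under the assumption that identifiers of different kinds should never be equal. Posts.Delete and DbIdDemo are hypothetical names.

```csharp
using System;

public class DbId
{
    public int Value { get; private set; }
    public DbId(int id) {
        if (id <= 0) {
            throw new ArgumentOutOfRangeException(nameof(id));
        }
        Value = id;
    }
    public override int GetHashCode() => Value;
    // Assumption: identifiers of different kinds are never equal,
    // so runtime types are compared along with the values
    public override bool Equals(object obj) {
        return obj is DbId other
            && other.GetType() == GetType()
            && other.Value == Value;
    }
}

public class PostId: DbId {
    public PostId(int id): base(id) { }
}
public class UserId: DbId {
    public UserId(int id): base(id) { }
}

public static class Posts
{
    // Hypothetical function: only a PostId is accepted here
    public static string Delete(PostId id) {
        return "Deleted post " + id.Value;
    }
}

public static class DbIdDemo
{
    public static void Run() {
        var postId = new PostId(5);
        var userId = new UserId(5);

        Console.WriteLine(Posts.Delete(postId));
        // Posts.Delete(userId);    // doesn't compile: a UserId isn't a PostId

        Console.WriteLine(postId.Equals(userId));          // False: same value, different kind
        Console.WriteLine(postId.Equals(new PostId(5)));   // True
    }
}
```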

RULE OF THUMB  Whenever you see a solution to a problem, make sure that you also know when not to use it. This is no exception. You may not need such elaborate work for your simple prototype; you may not even need a custom class. When you see that you’re passing the same kind of value to functions frequently, and you keep forgetting whether it needs validation or not, it might be beneficial to encompass it in a class and pass that around instead.

Custom data types are powerful because they can explain your design better than primitive types and can help you avoid repetitive validation, and therefore bugs. They can be worth the hassle to implement. Moreover, the framework you’re using might already provide the types you need.

Don’t framework hard, framework smart

.NET, like many other frameworks, comes with a set of useful abstractions for certain data types which are usually unknown or ignored. Custom text-based values like URLs, IP addresses, file names, or even dates are often stored as strings. We’ll look at some of those ready-made types and how we can use them.

Some of you may already know about the .NET classes for those data types, but might still prefer to use a string because it’s simpler to handle. The issue with strings is that they lack proof of validation: your functions don’t know if a given string is already validated, causing either inadvertent failures or unnecessary re-validation code, slowing you down. Using a ready-made class for a specific data type is a better choice in those cases.

When the only tool you have is a hammer, every problem looks like a nail. The same applies to strings. Strings are great generalized storage for content, and they are easy to parse, split, merge, or play around with. They’re tempting, but this confidence in strings makes you inclined to reinvent the wheel occasionally. When you start handling things with a string, you tend to do everything with string processing functions, although that can be entirely unnecessary.

Consider this example: you’re tasked to write a lookup service for a URL shortening company called Supercalifragilisticexpialidocious which is in financial trouble for unknown reasons, and you’re Obi-Wan, their only hope. Their service works like this:

  • The user provides a long URL such as: https://llanfair.com/pwllgw/yngyll/gogerych/wyrndrobwll/llan/tysilio/gogo/goch.html
  • The service creates a short code for the URL and creates a new short URL such as: https://su.pa/mK61
  • Whenever a user navigates to the shortened URL from their web browser, they get redirected to the address in the long URL they provided.

The function you need to implement must extract the short code from a shortened URL. A string-based approach looks like this:

 
 public string GetShortCode(string url)
 {
     const string urlValidationPattern =
         @"^https?://([\w-]+.)+[\w-]+(/[\w- ./?%&=])?$";    #A
     if (!Regex.IsMatch(url, urlValidationPattern)) {               
         return null; // not a valid URL
     }
     // take the part after the last slash
     string[] parts = url.Split('/');
     string lastPart = parts[^1];    #B
     return lastPart;
 }
  

#A This is an alien language called a regular expression. It’s used in string parsing and occult invocation rituals.

#B This is a new syntax introduced in C# 8 which refers to the last item of a sequence; the ^ operator counts the index from the end.

This code might look okay at first, but it contains bugs already, based on our hypothetical specification. The validation pattern for the URL is incomplete; it allows invalid URLs. It doesn’t take the possibility of multiple slashes in the URL path into account. It unnecessarily creates an array of strings to get the final portion of the URL.

NOTE  A bug can only exist against a specification. If you don’t have any specification, you can’t claim anything to be a bug. This lets companies avoid PR scandals by dismissing bugs with “oh, that’s a feature.” You don’t need a written document for a specification either; it can exist in your mind, as long as you can answer the question “is this how this feature is supposed to work?”

More importantly, the logic isn’t apparent from the code. Better code might use the Uri class from the .NET Framework and look like the example below:

 
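 public string GetShortCode(Uri url)    #A
 {
     string path = url.AbsolutePath.Substring(1);    #B
     if (path.Contains('/')) {
         return null;    #C
     }
     return path;
 }
  

#A It’s clear what we’re expecting.

#B Look ma, no regular expressions! For an HTTP URL, AbsolutePath always starts with a slash, as in “/mK61”, so we skip it.

#C More than one path segment means it isn’t a valid short URL.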

This time, we don’t deal with string parsing at all. All of that has been handled already by the time our function gets called. Our code is more descriptive, it’s easier to write, only because we wrote Uri instead of string. Because validation happens earlier in the code, it becomes easier to debug too. The best debugging is not having to debug in the first place.
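“By the time our function gets called” implies that some boundary code turned the raw string into a Uri. Here’s a minimal sketch of what that boundary might look like, using Uri.TryCreate. ShortUrls and TryParseShortUrl are hypothetical names, the empty-path check is an extra assumption about the specification, and GetShortCode is included so the snippet compiles on its own.

```csharp
using System;

public static class ShortUrls
{
    // Boundary code: the only place that deals with raw strings.
    // The scheme check keeps out absolute URIs that aren't web addresses.
    public static bool TryParseShortUrl(string input, out Uri url) {
        return Uri.TryCreate(input, UriKind.Absolute, out url)
            && (url.Scheme == Uri.UriSchemeHttp || url.Scheme == Uri.UriSchemeHttps);
    }

    // Returns the short code, or null if the path isn't a single segment
    public static string GetShortCode(Uri url) {
        // AbsolutePath includes the leading slash, as in "/mK61", so skip it
        string path = url.AbsolutePath.Substring(1);
        // Assumption: a URL without any code (just "/") is also invalid
        if (path.Length == 0 || path.Contains('/')) {
            return null;
        }
        return path;
    }
}
```

A caller would check TryParseShortUrl once when the request arrives, and every function after that point can take a Uri and skip string validation entirely.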

In addition to primitive data types like int, string, and float, .NET provides many other useful data types that we can use in our code.

IPAddress is a better alternative to string for storing IP addresses, not only because it has validation built in, but also because it supports IPv6, which is in use today. Unbelievable, I know. The class also has shortcut members for defining a local address:

 
 var testAddress = IPAddress.Loopback;
  

This way, you avoid writing 127.0.0.1 whenever you need a loopback address, which makes you faster. If you make a mistake with the IP address, you catch it earlier than you would with a string.
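When the address comes from outside your code, IPAddress.TryParse gives you the same early failure without exceptions. The following is a minimal sketch; IpDemo and Describe are hypothetical names.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class IpDemo
{
    // Classifies raw input, or reports it as invalid
    public static string Describe(string input) {
        if (!IPAddress.TryParse(input, out IPAddress address)) {
            return "invalid";
        }
        string family = address.AddressFamily == AddressFamily.InterNetworkV6
            ? "IPv6"
            : "IPv4";
        string kind = IPAddress.IsLoopback(address) ? "loopback" : "non-loopback";
        return family + " " + kind;
    }
}
```

Both 127.0.0.1 and ::1 come back as loopback addresses here, which a string comparison against "127.0.0.1" would miss.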

Another such type is TimeSpan. It represents a duration as the name implies. Durations are used almost everywhere in a software project, like when talking about caching or expiration mechanics. We tend to define durations as compile time constants. The worst possible way is this:

 
 const int cacheExpiration = 5; // minutes
  

It’s not immediately clear that the unit of cache expiration is minutes. It’s impossible to know the unit without looking at the source code. It’s a better idea to incorporate it in the name at least, so that your colleagues, or even you in the future, know its unit without looking at the source code:

 
 public const int cacheExpirationMinutes = 5;
  

It’s better this way but when you need to use the same duration for a different function that receives a different unit, you’ll have to convert it, like:

 
 cache.Add(key, value, cacheExpirationMinutes * 60);
  

This is extra work. You have to remember to do this. It’s prone to errors too. You can mistype 60 and have a wrong value in the end and maybe spend days debugging it or try to optimize performance needlessly because of such a simple miscalculation.

TimeSpan is amazing in that sense. There’s no reason to represent any duration as anything other than a TimeSpan, even when the function you’re calling doesn’t accept TimeSpan as a parameter.

 
 public static readonly TimeSpan cacheExpiration = TimeSpan.FromMinutes(5);
  

Look at that beauty! You already know it’s a duration and its unit where it’s declared. What’s better is that you don’t have to know its unit anywhere else. For any function that receives a TimeSpan, you pass it along. If a function receives a specific unit, say, minutes, as an integer, you can call it like this instead (TotalMinutes is a double, so it needs a cast):

 
 cache.Add(key, value, (int)cacheExpiration.TotalMinutes);
  

And it gets converted to minutes. Brilliant.
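A few TimeSpan details are worth knowing before you rely on it, sketched below under some assumptions: CacheSettings and AddWithSeconds are hypothetical stand-ins for whatever cache API you actually call.

```csharp
using System;

public static class CacheSettings
{
    public static readonly TimeSpan CacheExpiration = TimeSpan.FromMinutes(5);

    // Hypothetical legacy API that wants whole seconds as an integer
    public static string AddWithSeconds(string key, int expirationSeconds) {
        return key + " expires in " + expirationSeconds + "s";
    }

    public static void Run() {
        // The Total* properties are doubles, so an int parameter needs a cast
        Console.WriteLine(AddWithSeconds("home", (int)CacheExpiration.TotalSeconds));

        // Durations can be added and compared directly
        TimeSpan extended = CacheExpiration + TimeSpan.FromSeconds(30);
        Console.WriteLine(extended > CacheExpiration);    // True

        // Minutes is only the minutes component (0 to 59);
        // TotalMinutes is the whole duration expressed in minutes
        TimeSpan ninety = TimeSpan.FromMinutes(90);
        Console.WriteLine(ninety.Minutes);                // 30
        Console.WriteLine(ninety.TotalMinutes);           // 90
    }
}
```

Mixing up Minutes and TotalMinutes is an easy mistake to make, so it’s worth double-checking which one a call site uses.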

Many more types are useful in a similar sense, like DateTimeOffset, which represents a specific date and time like DateTime but along with its offset from UTC, so you don’t lose data when your computer’s or server’s time zone setting suddenly changes. In fact, you should always try to use DateTimeOffset over DateTime, as it’s also easily convertible to and from DateTime. You can even use arithmetic operators with TimeSpan and DateTimeOffset, thanks to operator overloading:

 
 var now = DateTimeOffset.Now;
 var birthDate =
     new DateTimeOffset(1976, 12, 21, 02, 00, 00,
         TimeSpan.FromHours(2));
 TimeSpan timePassed = now - birthDate;
 Console.WriteLine($"It’s been {timePassed.TotalSeconds} seconds since I was born!");
  

NOTE  Date and time handling is a delicate concept that’s easy to break, especially in global projects. This is why there are separate third-party libraries that cover the missing use cases, such as Noda Time by Jon Skeet.

.NET is like that gold pile that Uncle Scrooge jumps and swims in. It’s full of great utilities that make our lives easier. Learning about them might seem wasteful or boring, but it’s much faster than trying to use strings or to come up with your own makeshift implementations.

Types over typos

Writing code comments can be a chore. Even without the code comments, your code doesn’t have to lack descriptiveness. Types can help you to explain your code.

Suppose you encounter this snippet in the vast dungeons of your project’s code base:

 
 public int Move(int from, int to) {
     // ... quite a code here
     return 0;
 }
  

What’s this function doing? What’s it moving? What kind of parameters is it taking? What kind of result is it returning? These are all vague without types. You can try to understand the code or try to look up the encompassing class, but they all take time. Your experience could be much better had the naming been better:

 
 public int MoveContents(int fromTopicId, int toTopicId) {
     // ... quite a code here
     return 0;
 }
  

It’s much better now, but you still have no way to know what kind of result it’s returning. Is it an error code, is it the number of items moved, or is it the new topic identifier resulting from conflicts in the move operation? How can you convey this without relying on code comments? With types. Consider this code snippet instead:

 
 public MoveResult MoveContents(int fromTopicId, int toTopicId) {
     // ... still quite a code here
     return MoveResult.Success;
 }
  

It’s slightly clearer. I mean it doesn’t add much because we already knew that the int was the result of the move function, but there’s a difference: we can now explore what’s in the MoveResult type to see what it’s doing by pressing F12 in Visual Studio on Windows:

 
 public enum MoveResult
 {
     Success,
     Unauthorized,
     AlreadyMoved
 }
  

We’ve got a much better idea now. Not only does it improve the understanding of the method’s API, but it also improves the code in the function itself: instead of some constants or, worse, hardcoded integer values, you see a clear MoveResult.Success. Unlike constants in a class, enums constrain the possible values that can be passed around, and they come with their own type name, which gives them a better chance of describing the intent.
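The calling side benefits too. Here’s a minimal sketch of how a caller might handle each result; MoveMessages, Describe, and the message texts are made up for illustration, and the enum is repeated so the snippet compiles on its own.

```csharp
using System;

public enum MoveResult
{
    Success,
    Unauthorized,
    AlreadyMoved
}

public static class MoveMessages
{
    // Hypothetical helper that maps each result to a user-facing message
    public static string Describe(MoveResult result) {
        switch (result) {
            case MoveResult.Success:
                return "Contents moved.";
            case MoveResult.Unauthorized:
                return "You aren't allowed to move these contents.";
            case MoveResult.AlreadyMoved:
                return "The contents were already moved.";
            default:
                // A cast can still smuggle an undefined number into an enum,
                // so unexpected values fail loudly
                throw new ArgumentOutOfRangeException(nameof(result));
        }
    }
}
```

Every case names its meaning, which a chain of comparisons against 0, 1, and 2 wouldn’t do.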

Because the function receives integers as parameters, it needs to incorporate some validation because it’s a publicly facing API. You can tell that it might even be needed in internal or private code because of how pervasive validation has become. The function looks better when the validation logic already lives in the parameter types:

 
 public MoveResult MoveContents(TopicId from, TopicId to) {
     // ... still quite a code here
     return MoveResult.Success;
 }
  

As you can see, types can work for you by moving code to their relevant place, and making it easier to understand. Because the compiler checks if you wrote a type’s name correctly, they prevent you from having typos too.
