Missing Values

When your data consists of floating point numbers (e.g. 2.0), it is often easiest to let NaN denote missing values. In other cases (and more generally), use missing.

Load Packages


In [1]:
using Dates

include("printmat.jl")


Out[1]:
printyellow (generic function with 1 method)

NaN

The NaN (Not-a-Number) can be used to indicate that a float "number" is missing or otherwise strange. For other types of data than floats, you may want to use missing (see below) instead.

Most computations involving NaNs give NaN as the result.

NaNs are often used to represent missing data, but this works only with floating point numbers (e.g. 2.0), not with integers (e.g. 2).


In [2]:
println(2.0 + NaN)


NaN
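
A few further properties of NaN are worth keeping in mind. The sketch below uses only Base functions; the variable names (x, can_convert) are illustrative and not part of the original notebook.

```julia
x = NaN

println(x == x)          # false: NaN is not equal to anything, not even itself
println(isnan(x))        # true: use isnan() to test for NaN
println(isequal(x, x))   # true: isequal() treats two NaNs as equal

# An integer cannot hold NaN, so the conversion throws an error
can_convert = try
    convert(Int, NaN)
    true
catch
    false
end
println(can_convert)     # false
```

Because x == NaN is always false, a test like any(z .== NaN) never finds anything. Use isnan instead.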

Loading Data

When your data (loaded from a csv file, say) has special values for missing data points (for instance, -999.99), then you can simply replace those values.


In [3]:
data = [1.0 -999.99;
        2.0 12.0;
        3.0 13.0]

z = replace(data,-999.99=>NaN)    #replace -999.99 by NaN
println("z: ")
printmat(z)


z: 
     1.000       NaN
     2.000    12.000
     3.000    13.000
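
If a file uses more than one code for missing data, replace accepts several pairs, or a function. The sketch below is a minimal illustration: the second code (-9999.0) and the cutoff -999 are made-up assumptions, and the exact matching assumes that the codes are stored exactly as written.

```julia
raw = [1.0 -999.99;
       2.0 -9999.0;
       3.0 13.0]

# replace several exact codes at once
z_a = replace(raw, -999.99 => NaN, -9999.0 => NaN)

# alternatively, apply a rule: everything at or below -999 becomes NaN
z_b = replace(x -> x <= -999 ? NaN : x, raw)

println(isequal(z_a, z_b))    # true: both approaches give the same matrix
```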

NaNs in a Matrix

If a matrix contains NaNs, then many calculations (e.g. summing all elements) give NaN as the result.


In [4]:
if any(isnan.(z))                      #check if any NaNs
  println("z has some NaNs")
end

println("\nThe sum of each column: ")
printmat(sum(z,dims=1))


z has some NaNs

The sum of each column: 
     6.000       NaN
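
If you instead want the sum of the non-NaN elements in each column, you can filter out the NaNs column by column. This is just one possible approach, sketched under the assumption of Julia 1.1 or later (for eachcol); col_sums and col_counts are illustrative names.

```julia
z = [1.0 NaN;
     2.0 12.0;
     3.0 13.0]

# sum of the non-NaN elements in each column
col_sums = [sum(filter(!isnan, col)) for col in eachcol(z)]
println(col_sums)      # [6.0, 25.0]

# number of non-NaN observations in each column
col_counts = [count(!isnan, col) for col in eachcol(z)]
println(col_counts)    # [3, 2]
```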

Getting Rid of NaNs

It is a common procedure in statistics to throw out all cases with NaNs. For instance, if z[t,:] is the data for period $t$ and it contains one or more NaN values, then it is common to throw out that entire row.

This is a reasonable approach if it can be argued that the fact that the data is missing is random - and not related to the subject of the investigation. It is much less reasonable if, for instance, the returns for all poorly performing mutual funds are listed as "missing" - and you want to study which fund characteristics drive performance.

The code below shows a simple way to do this.


In [5]:
vb = any(isnan.(z),dims=2)    #indicates rows with NaNs

z2 = z[.!vec(vb),:]           #keep only rows without NaNs
println("z2: a new matrix where all rows with any NaNs have been pruned:")
printmat(z2)


z2: a new matrix where all rows with any NaNs have been pruned:
     2.000    12.000
     3.000    13.000
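
Before discarding rows, it can be useful to record which rows are affected and how large a share of the sample they make up, for instance to judge whether the missingness looks random. A minimal sketch, with illustrative names (bad_rows, share_dropped):

```julia
z = [1.0 NaN;
     2.0 12.0;
     3.0 13.0]

bad_rows      = findall(vec(any(isnan.(z), dims=2)))   # row numbers with at least one NaN
share_dropped = length(bad_rows)/size(z, 1)            # fraction of rows that would be pruned

println("rows with NaNs: ", bad_rows)                  # rows with NaNs: [1]
```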

Missings

A missing can be used to indicate missing values for most types (not just floats).

Similarly to NaNs, computations involving missing (for instance, 1+missing) result in missing.

In contrast to NaNs, working with missing sometimes involves converting a traditional array to an array that can include missing (or the other way around). The Missings package has helper routines.
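
The basic behaviour of missing can be illustrated with Base functions only (no package needed). This is a small sketch; the vector x is made up for the illustration.

```julia
println(1 + missing)            # missing
println(missing == missing)     # missing (neither true nor false)
println(ismissing(missing))     # true: use ismissing() to test

x = [1, missing, 3]             # an integer vector that also holds a missing

println(sum(x))                 # missing
println(sum(skipmissing(x)))    # 4: skipmissing() iterates over the non-missing elements
println(coalesce.(x, 0))        # [1, 0, 3]: replace each missing by 0
```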


In [6]:
using Missings

In [7]:
data = [1 -999;
        2 12;
        3 13]
z = allowmissing(data)                #convert to an array that can include missing
z = replace(z,-999=>missing)          #replace -999 by missing
println("z: ")
printmat(z)


z: 
         1   missing
         2        12
         3        13


In [8]:
if any(ismissing.(z))                      #check if any missings
  println("z has some missings")
end


z has some missings

In [9]:
vc = any(ismissing.(z),dims=2)

z2 = z[.!vec(vc),:]                #keep only rows without missings
println("z2: a new matrix where all rows with any missings have been pruned:")
printmat(z2)


z2: a new matrix where all rows with any missings have been pruned:
         2        12
         3        13

Since z2 no longer contains any missing (although its element type still allows them), you can typically use it like any other array. However, if you (for some reason) need to work with a traditional array, then convert z2 (see below).


In [10]:
println("The type of z2 is ", typeof(z2))

z3 = disallowmissing(z2)       #convert to traditional array,
                               #same as convert.(Int,z2)

println("\nThe type of z3 is ", typeof(z3))


The type of z2 is Array{Union{Missing, Int64},2}

The type of z3 is Array{Int64,2}
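
The same conversion can also be sketched with Base functions only, which may be handy if the Missings package is not loaded. The matrix below is constructed directly to mimic z2 above. Note that this conversion only works when no missing is actually present.

```julia
# a matrix whose element type allows missing, but which contains none
z2 = Union{Missing,Int}[2 12;
                        3 13]
println(eltype(z2) == Union{Missing,Int})   # true

z3 = convert(Matrix{Int}, z2)               # traditional integer matrix
println(eltype(z3) == Int)                  # true
```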
