Tuesday, 15 July 2014

Numpy masked array and NaNs

A quick look at NaNs, Numpy arrays and specifically masked arrays.
Now, these are very useful but like much of Python suffer from a surfeit of functions, methods, whatever. In short, it can be very confusing and frustrating to work with Numpy as behaviour isn’t consistent. To be fair this fact is admitted to in the help pages.
Lets’s start by importing what we need:
import sys
import os
import numpy as np
import numpy.ma as ma
import scipy.stats as stats
from scipy.stats.stats import nanmean
import textwrap
That last one is simply for presentation in the terminal and the first two are “just in case”.

Basic arrays

We’ll start by creating an array from a list.
  • Create “a” then convert it to array “b”.
  • Do the same with “c” to “d”.
These last two will be templates for our masks. The “1” indicates “mask this value” and “0” indicates “do not mask this value”. This is easy to confuse when starting with numpy masks and feels somewhat counterintuitive but the mask is a separate entity within the numpy masked_array object and so “1” means “do my thing”, which is to mask.
“a1” and “b1” are masked arrays each created from list objects and numpy arrays respectively. Using the .mean method the mean of the masked values can be calculated.
And finally, in “a2” a masked array is created by stating which values to mask out.
a = [1,2,3,4,5,6]
b = np.array(a)
c = [0,0,0,1,0,0]
d = np.array(c)
a1 = ma.masked_array(a, mask=c)
b1 = ma.masked_array(b, mask=d)
a1mean = a1.mean()
b1mean = b1.mean()
a2 = ma.masked_values(a,4)
#
# PRINT TO TERMINAL
print textwrap.dedent("""\
    a = [1,2,3,4,5,6]
    b = np.array(a)
    c = [0,0,0,1,0,0]
    d = np.array(c)
    a1 = ma.masked_array(a, mask=c)
    b1 = ma.masked_array(b, mask=d)
    a1mean = a1.mean()
    b1mean = b1.mean()
    a2 = ma.masked_values(a,4)
    """)
print "a: ", a
print "b: ", b
print "a1: ", a1
print "mean of a1: ", a1mean
print "a2: ", a2
print "b1: ", b1
print "mean of b1: ", b1mean
print "\n"*3

2D arrays and how they are masked

Now for a 2 dimensional array, the mask must also be 2D or we can specify a value.
This time, when calculating the mean we can either call .mean() which returns the mean of all unmasked values or we can specify an axis thus: .mean(0). This would return the mean of each column.
e = np.array([[1,2,3],[4,5,6]])
f = np.array([[0,0,0],[1,0,0]])
e1 = ma.masked_array(e, mask=f)
e2 = ma.masked_values(e,4)
e1mean = e1.mean()
e1mean0 = e1.mean(0)
e1mean1 = e1.mean(1)
#
# PRINT TO TERMINAL
print textwrap.dedent("""\
    e = np.array([[1,2,3],[4,5,6]])
    f = np.array([[0,0,0],[1,0,0]])
    e1 = ma.masked_array(e, mask=f)
    e2 = ma.masked_values(e,4)
    e1mean = e1.mean()
    e1mean0 = e1.mean(0)
    e1mean1 = e1.mean(1)
    """)
print "e: ", e
print "e1: ", e1
print "e2: ", e2
print "mean of e1: ", e1mean
print "mean of each column: ", e1mean0
print "mean of each row: ", e1mean1
print "\n"*3

Dealing with NaN values

So now we get onto a section that can cause some irritation: Not a Number. The NaN, if you don’t already know, means “No data here”, which is distinct from e.g. “0”. Numpy comes with a specific NaN object np.nan (note that the “np” is short for numpy and was defined at the import). Unfortunately the standard mean function can’t handle NaN’s so you have to use nanmean which is imported from scipy.stats.stats (I know, dumb as hell, but there you go). If we try to use .mean on an array containing np.nan it returns “nan”. nanmean gives us the mean of each column though and we can’t just take the mean of the means.
g = np.array([[1,2,3],[np.nan,5,6]])
gmean = g.mean()
g_nanmean = nanmean(g)
g_nanmean_all = nanmean(g_nanmean)
#
# PRINT TO TERMINAL
print textwrap.dedent("""\
    g = np.array([[1,2,3],[np.nan,5,6]])
    gmean = g.mean()
    g_nanmean = nanmean(g)
    g_nanmean_all = nanmean(g_nanmean)
    """)
print "g: ", g
print "gmean: ", gmean
print "nanmean: ", g_nanmean
print "g_nanmean_all: ", g_nanmean_all
print "\n"*3
So what to do? We could reshape the array first, make it 1D and then .meandoes what we want. np.ravel() will flatten it for us.
print "The np.ravel() function: "
print "np.ravel(g) to flatten e: "
print  g, "\nwhich becomes ", np.ravel(g), "\n"
gravelnmean = nanmean(np.ravel(g))
#
# PRINT TO TERMINAL
print textwrap.dedent("""\
    gravelnmean = nanmean(np.ravel(g))
    """)
print "gives: ", gravelnmean
This works for some of the basic, statistical functions but isn’t practical when using other, more advanced packages and methods in Numpy and Scipy. So why not just use masks? If your data contains NaNs why bother with np.nan when the masked_array seems more flexible? Well, consistency in implementation in numpy is a bit of a problem. Some packages within numpy contain functions that work in different ways, which can be very frustrating to a newcomer.

Mask np.nan values to be doubly sure

So, to make sure you have yourself covered create a mask using either:
  1. from the NaNs in your array (h1)
  2. from the NaNs in another array (h2)
  3. from the mask in another masked_array (h3)
The last two require the other array to be of equal dimensions to your array.
h = np.array([[1,2,3],[np.nan,5,6]])
h1 = ma.masked_where(np.isnan(h),h)
h2 = ma.masked_where(np.isnan(g1),h)
h3 = ma.masked_where(ma.getmask(g1), h)
#
# PRINT TO TERMINAL
print textwrap.dedent("""\
    h = np.array([[1,2,3],[np.nan,5,6]])
    h1 = ma.masked_where(np.isnan(h),h)
    h2 = ma.masked_where(np.isnan(g1),h)
    h3 = ma.masked_where(ma.getmask(g1), h)
    """)
print "h: ", h
print "h1: ", h1
print "h2: ", h2
print "h3: ", h3