Summary
This tutorial is a brief introduction to the numpy Python package and some of the things you can do with it.
Introduction to numpy
numpy is one of the most widely used tools among data scientists coding in Python. Because numpy is written mainly in C and C++, it is much faster for numerical work than pure Python. Among other things, the package gives the user access to a new data type, the array, which allows fast computations on large amounts of data.
Install numpy
numpy should be included in your base conda environment, but it is good practice to create a new environment to work in.
  Bash
conda create -n science numpy ipython
conda activate science
This will create a new conda environment called science and install numpy and ipython in it. Then we can use conda activate to enter the new environment.
import numpy
The first thing we need to do is import the numpy module so we can access its code.
  Python
>>> import numpy
>>> numpy # tells you where the code is located
Any time you want to use this library, you have to import it first. To make life easier, we can also give numpy a shorter name.
  Python
>>> import numpy as np
>>> np
Otherwise we would have to type numpy many times; now we only have to type np. This alias is the convention among numpy users, so if you see np in other people's code it more than likely refers to numpy.
Congratulations! You are now part of a global community accessing the code written by the numpy team. Let’s learn what we can do with it.
numpy array
Now we have access to all of the tools numpy provides. There are many useful functions and classes included. The first and most basic is the numpy array, created with np.array.
  Python
>>> x = np.array([1, 2, 3])
>>> x
array([1, 2, 3])
At first glance this might look very similar to a list, and it is. Numpy arrays are like lists, but every element of an array must have the same data type; in this case, integers. We can also use floats, strings, and booleans.
  Python
>>> a = np.array([1, 2, 3])
>>> b = np.array([1.2, 3.5, -3.99999])
>>> c = np.array([True, True, False])
>>> d = np.array(["apple", 'foo', "grape"])
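Each array records its element type in its dtype attribute. The sketch below inspects it for the four arrays above and shows what happens when the input mixes types: numpy converts everything to one common type. The name mixed is only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1.2, 3.5, -3.99999])
c = np.array([True, True, False])
d = np.array(["apple", "foo", "grape"])

# dtype.kind is a one-letter code: 'i' integer, 'f' float, 'b' boolean, 'U' text
for name, arr in [("a", a), ("b", b), ("c", c), ("d", d)]:
    print(name, arr.dtype, arr.dtype.kind)

# Mixed input is converted to a single common type
mixed = np.array([1, 2.5, 3])
print(mixed)        # the integers have become floats
print(mixed.dtype)  # float64
```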
Now we can work with large amounts of data very easily.
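As a rough illustration of the speed difference, the sketch below sums the squares of one million numbers with a pure Python loop and with a numpy array. The array size and the use of time.perf_counter are choices made for this example, and the actual timings depend on your machine.

```python
import time
import numpy as np

n = 1_000_000  # size chosen only for this illustration

# Pure Python: visit the list one element at a time
values = list(range(n))
start = time.perf_counter()
total_py = sum(v * v for v in values)
elapsed_py = time.perf_counter() - start

# numpy: one expression applied to the whole array
arr = np.arange(n, dtype=np.int64)
start = time.perf_counter()
total_np = int(np.sum(arr * arr))
elapsed_np = time.perf_counter() - start

print(f"pure Python: {elapsed_py:.4f} s")
print(f"numpy:       {elapsed_np:.4f} s")
print(total_py == total_np)  # both approaches give the same answer
```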
Math
One of the most useful features of numpy is that an operation or function is applied to every element of the array at once. This is called vectorization. In this case, we are doing a basic math operation.
  Python
>>> q = np.array([1, 2, 3])
>>> s = q + 2
>>> s
array([3, 4, 5])
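The same idea extends to other operators and to operations between two arrays of the same length. A short sketch, with the second array r made up for illustration:

```python
import numpy as np

q = np.array([1, 2, 3])
r = np.array([10, 20, 30])

print(q * 2)       # [2 4 6]
print(q ** 2)      # [1 4 9]
print(q + r)       # [11 22 33]
print(q * r)       # [10 40 90]
print(np.sqrt(r))  # square root of each element
```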
Indexing
Numpy arrays can be indexed just like Python lists. However, there are some fancy ways to index that are exclusive to numpy arrays.
  Python
>>> q = np.array(['a', 'b', 'c'])
>>> i = np.array([1, 0])
>>> b = np.array([True, False, True])
>>> q[i]
array(['b', 'a'], dtype='<U1')
>>> q[b]
array(['a', 'c'], dtype='<U1')
Notice the use of a Boolean (True/False) array to index the numpy array. Wherever the Boolean array is True, the corresponding value in the numpy array is selected. This opens the door to using conditional statements to index numpy arrays.
  Python
>>> q = np.array([1.1, 0.2, 4.3, 1.4])
>>> q[q>1]
array([1.1, 4.3, 1.4])
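Conditions can also be combined. The sketch below stores the Boolean array in a variable (the name mask is just for illustration) and uses & (and), | (or), and ~ (not). Each comparison needs its own parentheses because these operators bind more tightly than > and <.

```python
import numpy as np

q = np.array([1.1, 0.2, 4.3, 1.4])

# The comparison itself produces a Boolean array
mask = q > 1
print(mask)                  # [ True False  True  True]

print(q[(q > 1) & (q < 4)])  # [1.1 1.4]
print(q[(q < 1) | (q > 4)])  # [0.2 4.3]
print(q[~mask])              # [0.2]
```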
Indexing in these fancy ways allows us to select different sections of the data in order to operate on them or change them.
  Python
>>> q = np.array(['a', 'b', 'c'])
>>> b = np.array([True, False, True])
>>> q[b] = "z"
>>> q
array(['z', 'b', 'z'], dtype='<U1')
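The same pattern works for numeric data and is a common way to clean up values. A minimal sketch, assuming made-up readings in which negative values are invalid:

```python
import numpy as np

# Hypothetical data for illustration only
readings = np.array([0.5, -1.0, 2.5, -3.0])

# Replace every negative reading with zero
readings[readings < 0] = 0.0
print(readings.tolist())  # [0.5, 0.0, 2.5, 0.0]

# Double only the readings greater than 1
readings[readings > 1] *= 2
print(readings.tolist())  # [0.5, 0.0, 5.0, 0.0]
```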
n-dimensional arrays
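In addition to one-dimensional arrays like the ones we have been working with, numpy can create multi-dimensional arrays. These are called n-dimensional arrays, or ndarrays (ndarray is in fact the type of every numpy array, including the one-dimensional ones above).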
  Python
>>> t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> t.shape
(3, 3)
>>> t + 3
array([[ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
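Elements of a multi-dimensional array are selected with one index per dimension, separated by commas. A brief sketch using the same array:

```python
import numpy as np

t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(t.ndim)     # 2, the number of dimensions
print(t[0, 2])    # 3, the value at row 0, column 2
print(t[1])       # [4 5 6], the whole second row
print(t[:, 0])    # [1 4 7], the whole first column
print(t[:2, 1:])  # rows 0 and 1, columns 1 and 2
```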
This allows us to start doing math on whole matrices, element by element.
  Python
>>> t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> t + t
array([[ 2,  4,  6],
       [ 8, 10, 12],
       [14, 16, 18]])
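For the matrix product from linear algebra, as opposed to element-wise arithmetic, numpy provides the @ operator. A short sketch contrasting the two, with a small matrix m chosen for illustration:

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])
identity = np.eye(2, dtype=int)  # 2x2 identity matrix

print(m * m)         # element-wise: [[1, 4], [9, 16]]
print(m @ m)         # matrix product: [[7, 10], [15, 22]]
print(m @ identity)  # multiplying by the identity gives m back
print(m.T)           # transpose: [[1, 3], [2, 4]]
```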
Useful functions
numpy has a large number of predefined functions that can be run on arrays as well. These make for easy and efficient ways to calculate things like mean and standard deviation.
  Python
>>> q = np.array(['a', 'b', 'a'])
>>> a = np.array([1, 1.2, 5.8])
>>> np.sum(a)
8.0
>>> np.mean(a)
2.6666666666666665
>>> np.median(a)
1.2
>>> np.unique(q)
array(['a', 'b'], dtype='<U1')
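Standard deviation and a few other summaries work the same way, and on multi-dimensional arrays the axis argument controls the direction of the calculation. A sketch reusing arrays from earlier sections:

```python
import numpy as np

a = np.array([1, 1.2, 5.8])
print(np.std(a))             # population standard deviation (the default)
print(np.min(a), np.max(a))  # smallest and largest value

t = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.sum(t))           # 45, the sum of every element
print(np.sum(t, axis=0))   # [12 15 18], one sum per column
print(np.mean(t, axis=1))  # [2. 5. 8.], one mean per row

# return_counts=True also reports how often each unique value occurs
values, counts = np.unique(np.array(["a", "b", "a"]), return_counts=True)
print(values, counts)      # ['a' 'b'] [2 1]
```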
These are just a few of the great functions the numpy team have provided us with! Now we can do a wide variety of data science related tasks.