Recently, I have been experimenting with a few tools to process and display larger sets of data. I’m looking at this because, increasingly, there are much larger amounts of data available that may contain valuable information about an area of study. Having some tools that can deal with this data easily and quickly would be a great help in spotting things I might want to look at more closely using some of my standard tools like AutoCAD Map 3D and Autodesk Infrastructure Map Server.
Python is a language that lends itself to this type of experiment. It is easy to use and fast, and it has a wide range of capabilities. In this post, I’m going to show some New York City taxi data, which is openly available and makes an interesting dataset for testing. I’ve contained all of this experiment in a Jupyter Notebook. This is a web application that allows you to create and share documents that contain live code (in a number of languages), equations, visualizations and explanatory text. It is an ideal environment for cleaning and transforming data, simulation, statistical modeling, plotting, and a lot more. These notebooks can be shared through email and through services like Dropbox, GitHub and the Jupyter Notebook Viewer.
So, let’s take a look. Here are some excerpts from my notebook.

With a few lines of Python, the data is loaded and processed. You can see the last five records of the file displayed. Next, I’ll import some diagnostics and the graphics system, datashader.
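
For readers following along without the notebook, here is a minimal sketch of what the loading step might look like with pandas. The column names and the handful of sample rows are hypothetical stand-ins, not the actual taxi file; in practice you would pass the path of the downloaded CSV to the same function.

```python
import io

import pandas as pd

# Hypothetical sample rows with assumed column names (NOT the real taxi file).
# Coordinates are made-up projected x/y values, used only for illustration.
SAMPLE_CSV = """passenger_count,trip_distance,dropoff_x,dropoff_y,fare_amount
1,1.2,100.0,200.0,7.5
2,3.4,110.0,210.0,13.0
1,0.8,120.0,220.0,5.5
3,9.1,130.0,230.0,28.0
1,2.0,140.0,240.0,9.0
1,17.3,150.0,250.0,52.0
"""


def load_trips(source, columns=None):
    """Read trip records from a CSV path or file-like object into a DataFrame.

    `columns` optionally restricts the load to the named columns, which helps
    keep memory use down on multi-million-row files.
    """
    return pd.read_csv(source, usecols=columns)


# A StringIO object stands in for the real file path here.
df = load_trips(io.StringIO(SAMPLE_CSV))

# Show the last five records, as in the notebook excerpt.
print(df.tail())
```

Restricting the load to the columns you actually plan to plot is a simple way to keep a laptop-sized machine comfortable with files of this size.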

Now, I’ll set some bounds and aggregate the data based on the location of the drop-off points. Note that it took under a minute to process almost 11.5 million points (this is just on my laptop). The points are plotted using the aggregation values, shaded from a low value (light blue) to a high value (dark blue).
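
To make the idea of aggregation concrete, here is a rough sketch of the same two steps (count the points that fall in each grid cell, then shade each cell from light blue to dark blue by its count). Note that this is a plain NumPy stand-in written for illustration, not datashader itself and not the code from my notebook; the function names, the grid size and the randomly generated points are all assumptions.

```python
import numpy as np


def aggregate_counts(x, y, x_range, y_range, width, height):
    """Count points per grid cell inside the given bounds.

    Returns an array of shape (height, width); points outside the bounds
    are ignored.
    """
    counts, _, _ = np.histogram2d(
        y, x, bins=(height, width), range=(y_range, x_range)
    )
    return counts


def shade_linear(counts, low=(173, 216, 230), high=(0, 0, 139)):
    """Map counts linearly from light blue (low) to dark blue (high).

    Empty cells are left white. Returns an RGB array of shape
    (height, width, 3) with dtype uint8.
    """
    peak = counts.max()
    t = counts / peak if peak > 0 else np.zeros_like(counts)
    low = np.array(low, dtype=float)
    high = np.array(high, dtype=float)
    rgb = low + t[..., None] * (high - low)
    rgb[counts == 0] = 255  # white background where no points landed
    return rgb.astype(np.uint8)


# Synthetic points clustered around the origin (hypothetical data).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
y = rng.normal(0.0, 1.0, 100_000)

# Bounds of the area of interest (assumed values for this sketch).
x_range = (-3.0, 3.0)
y_range = (-2.0, 2.0)

counts = aggregate_counts(x, y, x_range, y_range, width=90, height=60)
img = shade_linear(counts)
print(counts.shape, img.shape)
```

A linear ramp like this tends to be dominated by the busiest cells; with real trip data, a logarithmic or histogram-equalized mapping usually shows more detail in the quieter areas.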


As you would expect, the greatest density is in Manhattan and at the airports. No surprises there. But this is only a basic plot of the drop-off locations. I can take this a lot further by looking at parameters like time of day, number of passengers, trip distances, fares, and so on. I can share these results easily with colleagues and interactively change the values of the parameters I’m using in the notebook.
I hope to explore this further in upcoming posts and tie it together with some additional applications. Stay tuned!