Quick Analysis with Python

Picture this scenario: Your boss sends you an email with a task they need you to complete. What is the task? You open the email and notice a large attachment along with a short note. The attachment is a large data extract (.csv) with hundreds of thousands of rows

Picture this scenario: Your boss sends you an email with a task they need you to complete. What is the task? You open the email and notice a large attachment along with a short note. The attachment is a large data extract (.csv) with hundreds of thousands of rows, and the note reads, 

“Can you give me the correlation coefficient for variables A & B in this dataset?”

TACKLING THE PROBLEM

You’re a problem solver, so you quickly download and open the attachment with Microsoft Excel, attempting to tackle the problem head-on. You find columns A & B to perform the correlation; then you start typing in a new cell, =CORREL(arrayA,arrayB), and press ENTER. The Excel formula returns a correlation coefficient, but it can’t be that easy, right? That’s because it probably isn’t. There are other variables in this dataset that need to be considered in order to produce the best results, so in order to tackle this problem, you must break down those other variables as they relate to variables A & B. In other words, we need to partition the correlation of variables A & B on variables C, D, E, F, etc. Now, that all sounds very abstract, so let’s dive into something more concrete.

USING THE RIGHT TOOLS

For example, let’s say you’re looking to correlate all of your utility’s load profile data with temperature/weather data. You’re expecting a strong positive correlation because as it gets hotter, people tend to use more electricity. The issue, however, is that your utility supplies residential, commercial, and industrial customers. The usage pattern for a common residential household will vary significantly from that of an industrial plant, so in order to get the most accurate correlation, you need to partition your analysis on rates, customer class indicators, seasons, and potentially even location (e.g. zip code). Unless you really want to go through hundreds of thousands of rows of data to manually perform this kind of partitioning, Excel is probably not the best tool to use for your analysis. Why work harder than you need to when there are other tools available to help you work smarter?

Look no further than Python, as there’s an abundance of capabilities that you can now explore. Consider using libraries such and pandas, numpy, and matplotlib to read your data, perform correlation calculations, and plot graphics. You can group your data by whatever variables you’d like (e.g. rates and zip codes) and write loops to quickly and efficiently to perform all of your analysis and print the results. Once you’ve written your program, simply save the output, send it back to your boss, and your work is finished. You now have some robust and reusable code in your arsenal, and on top of that, you learned a new skill.

LEARNING SOMETHING NEW

While self-quarantining and social distancing are in full effect thanks to COVID-19, you might have some extra time on your hands, so consider taking full advantage of that time by tackling a problem that calls for learning a new toolset. Maybe there’s a problem that you haven’t quite gotten to yet, or maybe you’re consistently tackling the same problem by way of brute force. Either way, now might be the best time to try a new tool, one that is more appropriate to the problem at hand and can be kept in your toolbox for others that arise in the future.

—————————————————–

Follow QBA on social media: LinkedIn | Twitter | Facebook

Leave a Reply

Your email address will not be published. Required fields are marked *