5 Things to consider when analysing CSV data
Introduction
There is so much data available to us today, almost every website, social media platform, software package, etc, lets us export data as a CSV. The problem is that these CSV files on their own are not particularly useful, some deliberately so (to encourage you to buy the premium reporting package). By using a spreadsheet to analyse and report on the data, it can really give you good insight into what is going on. It can show you what changes are required, and then how successful your new approach is. So, how do we go about creating this spreadsheet, and what do you need to consider? Here are 5 things to think about.
The consistency and reliability of the data structure
Firstly, how reliable is the structure of the data. What I mean by this is how the data is presented. Some exports are made to ‘look good’, so they are laid out almost as you would if you were doing a report in Word. So, merged headers, data in random cells, and very little consistency. These might look slightly easier to understand, but they are horrible to try and analyse as the data is all over the place. The better data set (for analysing) is when the data is all neat and tidy in rows and columns. One big block of data (in a table layout) is the best for analysing, but it doesn’t look particularly helpful to the naked eye. The other thing to consider is how likely the layout is to change. I usually assume that columns may switch around, get added or removed, or headers may change names. I create a spreadsheet to allow for these changes, so that it does not break should any of these happen.
How clean and accurate the data is
If you have a nice batch of data, all neatly in in columns and rows, you’re then off to a good start. The next thing to consider is how ‘clean’ the data is. This includes things like the formatting of the data. For example, CSVs often have monetary values formatted as text (so it has the £ or $ sign typed in). Things like this can be a pain, as they can’t be used in calculations like that. You’ll often get other things like numbers formatted as text, random spaces at the end of text (which you don’t see, but Excel sees it), and then dates in all sorts of weird formats that Excel doesn’t read as a date. The other issue is how accurate the data is. Now, you can’t always do alerts for this, as there may be extremes, but you can do some absolute checks. For example, if there is a ‘score’ as a percentage, and it is impossible to score more than 100%, you can do a check to see if the score exceeds that. Then if there is some other random figure in there that is greater than 100%, it will flag up. There are ways and means of dealing with all the above issues, and then doing checks to see if anything wasn’t corrected. This way, you end up with the cleanest data possible, as well as being alerted to anything that doesn’t seem right. Once this is sorted, we’re now ready to get to the fun part.
What you would like to see reported
Cleaning the data and getting it all sorted, is often the biggest job. However, deciding what you want in the report is the most exciting part. After all, if you didn’t want the report, you wouldn’t be doing any of this. At this point, I actually lay out the report, asking what I would like to see in it. I make the tables of supporting data (with dummy data at this stage). I make all the graphs I want to see (using the dummy data) and do the layout. As this report needs to be visually appealing, I concentrate on the layout, and don’t concern myself too much with actual data yet. On occasions the supporting data isn’t enough to do all the reporting, in which case I need to add further tabs to collect more information (either from another source or manually), but for the sake of this article, let’s assume we have all the desired data. Once my client and I are happy that we have everything covered in the report, I then know exactly what data I need from the raw data, and in what format. I am now ready to extract it and do the required formulas. I do need to consider what raw data is available, and what calculations are possible, so I know what I can use in the report. That’s all going on in my head as I set this up.
What raw data needs to be used
At this point, you’ll likely realise that you only need 6 out of the 20 columns of available data. Instead of checking and extracting it all, we simply focus on the required columns. This is a good reason to finalise the required report before you start extracting and analysing data. I then go and pull out the required columns, making sure to use the headers as references (so it knows what to pull out even if the columns move around). The extracted data then gets checked, adapted (if need be), and shown in a new table of ‘used data’ (behind the scenes). This way, when importing new data, you can see what is being used, and immediately spot if there is a major issue. At this point, I have one more thing to do. Connect this extracted data to the dummy data being used in the report, so that it is all flowing nicely. This is the part that freaks most people out, but in all honesty, this is often one of the easiest parts for me.
The calculations to bring it all together
I then do all the formulas to filter, sum, average, manipulate, re-order, combine, etc, the raw data, so that it can be used in the tables I set up for the reporting. All of this is done behind the scenes and hidden. As I go along, more and more of the report is being fed by the raw data as I replace the dummy data, until I have finished the process and it is ready. Then as the raw data is dumped in, the data flows through the checks, and the report springs into action. This means that all you (the user) needs to do, is paste the data in (using paste values, not normal paste), check that there are no alerts (making some minor selection changes if there are) and then click to the report tab and enjoy seeing what your data is telling you!
Conclusion
So, there you have it. This is how to make use of your data when exported as a CSV. Yes, it is more complicated that simply referencing the data and reporting, but it is still achievable, and very useful if done correctly. Most of the ‘back-end’ work takes longer that the fun part of making the report, but it is vital. The last thing you need is a shiny report which is broken or using dirty data and producing incorrect figures.
If you have CSV data that you wish to analyse on an ongoing basis, please get in touch. I’d love to make a spreadsheet for you to be able to do this.
Recent Comments