One of the first things we are taught in Programming 101 is to write well-structured, commented code. And like any newbie, we ignore this lesson and focus on getting the end result. Recently, I wrote an R script (the R language!) to run on files amounting to 30 GB! This was my first professional experience after graduation and I did not want to fuck up. So I structured the code, wrote all the comments, and ran it on all the files. And what happened next?!

Not all the files were structured the same way, so my script broke on a few of them, leaving my final data set missing some very important data. Moreover, my script was deleting some rows from every file, so I was tampering with the original data set without any logical, concrete reason. That might not sound as serious as failing to reach the final goal, but believe me, in data science, if your data is not representative of the true data set, your analysis is considered void.

A log file is a file that records the events that occur during a process. It helps you trace back through the process and discover whether anything has gone wrong.

Reasons to Keep a Log File

So how do you account for such cases? Maintain a log file! If you need more reasons for maintaining one, here are a few I can think of:

  1. Large data sets follow Murphy’s Law: anything that can go wrong will go wrong. A log file is the best way to keep a check on it.
  2. When you run a common script on multiple files, a log file gives you the gist of the whole process.
  3. A log file helps with future reference, both for yourself and for others who use the script or the data set again.

What to Write in a Log File

So, okay! I know a log file is important, but what do I write in it? It depends on the use case. As someone who works with data daily, I usually record the following in my log file:

  • Total number of files the script was run on
  • File names
  • Number of rows in each file (before and after processing)
  • Number of columns in each file (before and after processing)
  • Any specific parameters important to the particular data set
  • Processing time
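The items above can be sketched as a small helper that appends one entry per file. This is only an illustration under some assumptions: `log_file_summary`, its arguments, and the example data frame are hypothetical names made up here, and the "cleaning step" is just a stand-in for whatever your script really does.

```r
# Hypothetical helper: append one log entry describing how a file changed.
log_file_summary <- function(file_name, before, after, elapsed_secs,
                             log_path = "process.log") {
  entry <- c(
    paste("File:", file_name),
    paste("Rows before/after:", nrow(before), "/", nrow(after)),
    paste("Columns before/after:", ncol(before), "/", ncol(after)),
    paste("Rows dropped:", nrow(before) - nrow(after)),
    paste("Processing time (s):", round(elapsed_secs, 2)),
    ""  # trailing empty element so the entry ends with a newline
  )
  # append = TRUE keeps entries from earlier files instead of overwriting them
  cat(entry, file = log_path, sep = "\n", append = TRUE)
  invisible(entry)
}

# Usage with a made-up data frame and a made-up cleaning step
raw <- data.frame(id = 1:5, value = c(10, NA, 30, NA, 50))
start <- Sys.time()
clean <- raw[!is.na(raw$value), ]
elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))

demo_log <- file.path(tempdir(), "process.log")
log_file_summary("example.csv", raw, clean, elapsed, log_path = demo_log)
```

Logging the row counts before and after processing is what would have exposed the silently deleted rows from my story: a non-zero "Rows dropped" line that you cannot explain is a signal to stop and look.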

How to Keep a Log File in R

createLog <- function(df) {
  log_con <- file("process.log", open = "w")
  cat(nrow(df), file = log_con, sep = "\n")
  ...
}

This is a very basic way to keep a log file. I prefer using this function in every script because it gives me the freedom to choose the contents of my log file.
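Since my original problem was a script that broke on a few oddly structured files, here is one possible way to extend the idea to a whole batch. It is a rough sketch, not the script I actually ran: `process_file`, `run_with_log`, the `value` column, and the temporary CSV files are all hypothetical. It wraps each file in `tryCatch` so that one bad file gets logged instead of stopping the run.

```r
# Hypothetical per-file step; stands in for whatever the real script does.
process_file <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  if (!"value" %in% names(df)) stop("missing column 'value'")
  df[!is.na(df$value), , drop = FALSE]
}

# Run process_file() on every path and record the outcome of each in a log.
run_with_log <- function(paths, log_path) {
  log_con <- file(log_path, open = "w")
  on.exit(close(log_con))  # close the connection even if something fails

  # Small local helper: write one line to the log
  log_line <- function(...) cat(paste(...), "\n", sep = "", file = log_con)

  log_line("Run started:", format(Sys.time()))
  log_line("Total files:", length(paths))

  failed <- character(0)
  for (p in paths) {
    outcome <- tryCatch({
      out <- process_file(p)
      paste("OK", basename(p), "rows kept:", nrow(out))
    }, error = function(e) {
      failed <<- c(failed, p)  # remember which files broke
      paste("FAILED", basename(p), "reason:", conditionMessage(e))
    })
    log_line(outcome)
  }

  log_line("Files failed:", length(failed))
  invisible(failed)
}

# Usage: two made-up files, one of which is structured differently
demo_dir <- tempfile("logdemo")
dir.create(demo_dir)
good <- file.path(demo_dir, "good.csv")
bad <- file.path(demo_dir, "bad.csv")
write.csv(data.frame(id = 1:3, value = c(1, NA, 3)), good, row.names = FALSE)
write.csv(data.frame(id = 1:3, amount = 1:3), bad, row.names = FALSE)

log_path <- file.path(demo_dir, "process.log")
failed <- run_with_log(c(good, bad), log_path)
writeLines(readLines(log_path))
```

The design choice here is to keep going after a failure and collect the broken files, so that a single pass over all the data tells you exactly which files need a second look.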

Happy coding!


Achyut Joshi, a Data For Impact Fellow at SocialCops, originally published this article on his personal blog

