AC Daily Log Part 4 : Pre-process 2019 Data

Andrew Chien
4 min readMay 18, 2020

Following the article pre-process 2018 data, today’s article is going to talk about how to pre-process 2019 data. The 2019 daily log data are put in 12 separate sheets from Google sheet based on the month sequence. First of all, the first data frame df_2019 is created for storing the whole 2019 daily log data. After that, the second data frame in the first loop is used to store the temporary data from Google sheet for df_2019. And it can be found that the index start from 2 instead of 0 in the second loop because the first two rows in each sheet are the field row and the goal row that they can be omitted.

Get 2019 daily log data

After getting the result of 2019 daily log, then we merge it with the daily log data from 2017 to 2018. Finally, we can have the result like the following.

Daily log result from 2017 to 2019

Nonetheless, something weird happened when we apply describe function on the data frame. Firstly, it’s a categorical Series. Secondly, the column PD should only have value 0 or 1 but unique value here shows 3.

Descriptive statistics of daily log

In order to clear up this, groupby function is applied to check the issue. As we can see from the following image, it can be inferred that the 2nd and 3rd value in PD column is in the string type based on the result of the sum of PD columns’ value.

Group by column PD

In addition, actually we can also know that no pure int or float value in each column from the result of describe function as it return the categorical series. Because the describe function would return only numeric fields by default. Therefore, if categorical series is returned, then it means no pure numerical value in each column. Let’s take one more column as the experiment. From the following image, here we can see that the column E is also not the pure numerical column.

Group by column E

Now, we know that we have to change some columns’ data type from string to numerical data type. Consequently, apply function is used here to change the data type. Let’ s apply it to column E first to see how it goes. Sadly, we get the error message as the following. From the error message, we know some cells have the empty value.

Following the error message’s guidance, here we do something as the following to replace the empty value with 0:

df2_17to19=df_17to19.replace('', 0, regex=True)

After replacing the empty value with 0 and assigning it to the new data frame with new data type, then we can have something like the following. As we can see, now the result after applying describe function turns out to be the numerical series rather than categorical one.

Describe function of 2019 daily log

Lastly, we separate the year, month and day from the Date column and then input these newly made columns into the data frame.

The final result of data frame from 2017 to 2019

Finally, the destination to merging data from 2017 to 2019 is achieved. After that, next time’s article will talk more about how we update the data to Google sheet and how we do some basic analyses.

ps. if you want to know the definition of columns here, please check here

Photo by pine watt on Unsplash

--

--