A Reflection on My Data Scientist Traineeship Experience

Surbana Jurong Data Science Team
May 19, 2021

Graduating in the midst of a global pandemic isn’t exactly how I imagined my four years in university would end. Nevertheless, I think things worked out pretty well and I am grateful for the opportunity to take up a traineeship at Surbana Jurong (SJ). The SGUnited Traineeship Programme was an initiative introduced by the Government as a way for fresh graduates to gain industry experience and learn new skills during the economic downturn. Personally, I saw it as a good way for me to gain some working experience while trying out a data science role to see if it was a career I would enjoy.

The Surbana Jurong Data Science team applies data science and analytics to solve problems and optimise processes in the built environment industry. For example, the team uses anomaly detection and machine learning to identify problematic lifts so that maintenance can be carried out, improving lift safety. This was also one of the aspects that drew me to the position, as I could see how the work has real-world impact.

As a trainee, I had the opportunity to work on many different projects, allowing me to gain exposure to various domains that data science can be applied to, from contact tracing to reclamation projects. Below, I will outline some of the skills I’ve learnt during my time at SJ.

Extracting data using Python libraries

In the projects I worked on in university, the data was usually provided, or at most we had to obtain it from Kaggle or other data repositories. As a result, my experience in collecting and extracting data before the traineeship was rather limited.

I was first exposed to data extraction while working on a project to track the progress of land reclamation projects. These projects generate monthly reports, and the aim was to automate the extraction of data from these PDF reports. The extracted data would then be ingested into a database and used to create visualisations in Power BI.

The original table as seen in a PDF file, taken from the documentation of the Camelot library [1]
Table obtained as a dataframe after extraction using Camelot

To extract tables from the PDFs, I used Camelot, a Python library for PDF table extraction. Throughout this process, there was a lot of troubleshooting involved, partly because Camelot's default configuration did not work well for this set of reports. For example, it would sometimes merge two separate columns into one. After looking through the documentation, I rectified the problem by switching to a different configuration. One of my takeaways from this process was to look through the documentation first whenever I am working with something new; I could have saved myself a lot of pain had I done so, as I initially tried to fix the merged columns by brute force.

Creating visualisations with the audience in mind

The main tool I used for data visualisation during my traineeship was Power BI, which was also new to me. Throughout my traineeship, I made various types of visualisations. For example, for the reclamation project mentioned above, I created bar and line charts to track the progress of key activities and to check whether the payment made each month was consistent with the project's progress. Another interesting visual was created using movement data of an individual in an office building. By selecting a time with two filters, one for hours and one for minutes, different rooms would be highlighted based on where the individual was at that moment.

Movement of an individual across different rooms, part of a proof-of-concept project. The locations have been masked to protect the identity of the client.

My experiences during these few months have led me to realise that while creating informative charts is important, what is perhaps more important is to always consider what the target audience can take away from it. This also highlighted the importance of communicating with key stakeholders, as being able to understand their needs would be vital when it comes to creating data visualisations that are insightful.

Creating simple prototypes with Flask

Learning how to create webpages with Flask was one of my biggest takeaways. Having never created webpages on my own before, I found it rather daunting at first. However, thanks to the wide availability of resources online, I found myself creating my first webpage soon enough.

Flask is really useful when it comes to quickly creating simple webpages and this was particularly helpful whenever we needed to come up with a prototype to demonstrate how certain features would work. The first webpage I created using Flask was used to run a contact tracing algorithm which would generate a network graph in Power BI to show the individuals that someone with COVID-19 had come in contact with.
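
Part of what makes Flask so handy for prototypes is that a working app really is only a few lines. This sketch shows the basic shape (the route and page content are illustrative placeholders, not the actual project code):

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # In the actual prototype, the page triggered the contact tracing
    # algorithm; here it simply returns a placeholder landing page.
    return "<h1>Contact Tracing Prototype</h1>"

# Serve locally during development with: app.run(debug=True)
```

From this skeleton, adding more pages is just a matter of adding more routes.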

Feedback page with dynamic outputs

Another instance where I used Flask was a webpage which takes residents' feedback as input and returns predicted categories as output. Currently, a town council operator has to manually select the category based on the feedback that the resident provides. For example, if a resident reports that a lift is not operating, the feedback would be placed under the 'Lift Breakdown' category. Using JavaScript and Flask, I created a webpage that provides predicted categories in real time, with outputs that change dynamically as feedback is entered. This makes categorising residents' feedback more efficient by providing suggestions to operators.
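
The server side of such a page can be sketched as a small Flask endpoint that the page's JavaScript calls each time the feedback text changes. The keyword rules below are a toy stand-in for the actual NLP model, and the route and names are illustrative assumptions:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Toy stand-in for the NLP model: simple keyword rules
# instead of a trained classifier.
KEYWORDS = {
    "lift": "Lift Breakdown",
    "litter": "Cleanliness",
    "light": "Lighting Fault",
}

def predict_category(text):
    text = text.lower()
    for word, category in KEYWORDS.items():
        if word in text:
            return category
    return "Others"

@app.route("/predict", methods=["POST"])
def predict():
    # The page's JavaScript POSTs the current feedback text here
    # (e.g. via fetch) and updates the displayed category on response.
    feedback = request.get_json().get("feedback", "")
    return jsonify({"category": predict_category(feedback)})
```

On the front end, a small script listens for input events on the feedback box and re-queries this endpoint, which is what makes the output feel dynamic.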

Eventually the requirements of the project changed and I created an API, once again with Flask (woo hoo!), which would return the predicted categories and the respective probabilities of each category. This time, we also needed to be able to accept incoming data through the API in order to collect data that could be used to periodically retrain the NLP model. One of the benefits of using Flask is that it is simple at its core, but you can easily add extensions as required. For example, in order to connect the API to the database so that the data could be stored, I made use of SQLAlchemy. Another extension I used was Flask-Login, which was helpful when it came to adding authentication to the API.
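
A sketch of what such an API might look like: the endpoint returns a probability for every category and keeps a copy of the incoming feedback for later retraining. The scoring logic is a toy stand-in for the real model, the in-memory list stands in for the SQLAlchemy-backed table, and authentication via Flask-Login is omitted for brevity:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for the retrained NLP model: keyword counts
# normalised into probabilities.
def predict_proba(text):
    text = text.lower()
    scores = {
        "Lift Breakdown": text.count("lift"),
        "Cleanliness": text.count("litter"),
        "Lighting Fault": text.count("light"),
        "Others": 1,  # smoothing so the probabilities never all collapse to zero
    }
    total = sum(scores.values())
    return {category: score / total for category, score in scores.items()}

# In the real project this was a database table accessed via SQLAlchemy;
# an in-memory list keeps the sketch self-contained.
collected_feedback = []

@app.route("/categories", methods=["POST"])
def categories():
    payload = request.get_json()
    feedback = payload.get("feedback", "")
    collected_feedback.append(feedback)  # retained for periodic retraining
    return jsonify({"probabilities": predict_proba(feedback)})
```

Returning the full probability distribution, rather than a single label, lets the consuming application decide how confident a suggestion must be before showing it to an operator.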

Wrapping up

Overall, I have truly enjoyed my time at SJ so far and this traineeship has reaffirmed my decision to take up data science as a career. I was fortunate enough to convert my traineeship into a full-time opportunity and am currently working on a contact tracing platform for migrant workers in Singapore. Migrant workers are issued a token which exchanges data with other tokens nearby. In the event of a COVID-positive case, close contacts can be identified and isolated quickly to prevent further spread.

In conclusion, I am grateful to my teammates for always being supportive and willing to provide guidance to me and look forward to continuing to learn new skills and taking on new challenges in my role.

About the author

Evangel is a data scientist at Surbana Technologies. She graduated with a bachelor’s degree in Data Science and Analytics from the National University of Singapore. She began her career as a trainee at Surbana Technologies where she worked on a variety of projects with real-world applications.


References

  1. “Camelot: PDF Table Extraction for Humans”. Accessed May 19, 2021. https://camelot-py.readthedocs.io/en/master/_static/pdf/foo.pdf

