Big Dive

Big Data and data visualization kick off

28 August 2017

Big data - big unknown

When we play with data that are not so big...

Experience with big data is among the top skills in a data scientist's profile. However, it is not trivial to learn big data tools without the necessary infrastructure and proper tutoring. During my PhD in data science I work with medium-sized data frames (< 1 GB), which can usually be processed on one machine. To get a taste of the methods and challenges of Big Data, I completed the Big Dive course organized by TOP-IX.

... but we want to get a taste of BIG data

Image rights: TOP-IX

During the intense 5-week course I put my Python and pandas skills into practice and learned new libraries for distributed computing, as well as PySpark basics. We also spent a significant amount of time setting up the infrastructure on Amazon AWS and MongoDB. Invited keynote speakers showed us important aspects and challenges of data science for business and science. A bonus was an introduction to building beautiful visualizations with D3.js. Finally, for me the most interesting and important part of the course was a project carried out for one of 3 companies (IFC, Eduscopio and TesiSquare), where in just a few days we did our best to make sense of hundreds of gigabytes of data!

About the course

The course brought together 20 students with STEM, linguistics or design backgrounds, mostly from Italy, along with a few foreigners like me. Our diversity helped a lot during the group project, where each of us could bring a different perspective and skill set.

To apply, a candidate needs to send a video explaining why he or she is a good fit for Big Dive and stating his or her level in the prerequisites.

If you want to read more about the content of the course, you can check out the Big Dive website, the great LinkedIn posts by Christian Racca, the Facebook page and Twitter (@bigdive_eu).

For most of the participants, me included, it was the first encounter with Big Data. I got the chance to face problems and tools I don't encounter in my everyday work. The content was densely packed, and I will need some time after the course to practice the new skills and read all the material. Thanks to great teachers such as Alex Comunian and Fabio Franchino, among others, the classes were very interactive and easy to follow.


Big Dive Yearbook 2017

Personal highlights

  • The course was a great networking opportunity: partner companies posted their job openings, the organizers created a CV repository to help participants with their job search, and the other participants are new connections in my professional network that may prove priceless in the future.
  • D3.js is a very powerful tool for data visualization, but its learning curve is quite steep. Creating a serious final product also requires learning jQuery, communication with a remote data server and the basics of web design. So if you don't want to spend a significant amount of time learning the whole toolkit, better go for a ready-made solution such as R Shiny, if you know R for instance.
  • The Dask library seems to be an interesting alternative to Spark for smaller projects; the big plus is that it uses pandas vocabulary. That can also be misleading: we encountered quite a lot of problems building a setup for our data, a lot of RAM seems to be necessary, and understanding how Dask handles objects differently from pandas is crucial for a successful application.
  • Spark's algorithm implementations differ in important ways from those in R or scikit-learn. Testing locally with those libraries and expecting the distributed version to behave the same is therefore not a good idea, and the difference in performance can be significant. It is important to keep this in mind from the beginning of the project.
  • GitHub has some really cool project management features, such as the issue tracker and the wiki. In academia it is usually not used extensively, or only as a code repository, while for us the management part was really handy for organizing work. If you are a student, you can get the Student Developer Pack for free.
  • We got a refresher on ML and network science, together with some cool mini-project implementations (like a Twitter API analysis) that can be a great starting point for a demo project for one's portfolio.
  • A field expert is an important part of the team. Their needs should guide the analysis, which makes the data scientist's work really useful in its domain of application.
  • Data ring, a tool proposed by TOP-IX and IFC, helps to plan the project and to communicate within the data science team and with the business partner. You can find more information about it in the comprehensive handbook DATA ANALYTICS AND DIGITAL FINANCIAL SERVICES.
  • In my project I worked with e-money transaction data. I found the subject pretty interesting, and it encouraged me to read finance and economics articles.
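
One way to picture the Dask and Spark points above: a distributed library splits the data into partitions, computes something on each one, and then combines the partial results, so code that looks like pandas can still behave differently underneath. Below is a minimal sketch of that idea in plain Python. It is not Dask's or Spark's actual code; the helper names (`partition`, `partial_stats`, `combined_mean`) and the transaction amounts are hypothetical and made up for illustration.

```python
from statistics import fmean


def partition(rows, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks each take one additional row.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Map step: each partition reports only its own sum and row count."""
    return sum(chunk), len(chunk)


def combined_mean(chunks):
    """Reduce step: merge the partial sums and counts into a global mean."""
    total, count = 0.0, 0
    for chunk_sum, chunk_len in map(partial_stats, chunks):
        total += chunk_sum
        count += chunk_len
    return total / count


# Hypothetical transaction amounts, split over two "workers".
amounts = [10.0, 20.0, 30.0, 40.0, 100.0]
chunks = partition(amounts, 2)

correct = combined_mean(chunks)          # combines sums and counts
naive = fmean(fmean(c) for c in chunks)  # averages the per-chunk averages

print(chunks, correct, naive)
```

With unequal partitions, averaging the per-partition averages gives a different number from the true mean, which is why the combine step has to carry sums and counts. Assumptions like this are easy to miss when a distributed API borrows a single-machine vocabulary.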

    Final comments

    Even if you cannot become an expert in big data in 5 weeks, this course was a great introduction to new concepts. It was also a lot of fun: an intense immersion in Italian culture, with lots of sun and pizza. Don't hesitate to contact me if you want to hear more about the course.


    Useful links

  • Big Dive website: http://www.bigdive.eu/
  • posts on LinkedIn: https://www.linkedin.com/in/christianracca/detail/recent-activity/posts/
  • Facebook page: https://www.facebook.com/bigdive.eu/
  • Twitter: @bigdive_eu
  • post that inspired me to participate in Big Dive: From Science to Data Science, a Comprehensive Guide for Transition
    Acknowledgements

    I would like to thank Amodsen Chiota and Pierre Girard for supporting my application, as well as my PhD supervisor Andrei Zinovyev for allowing me to participate in the course during my PhD. Funding was provided by “Ecole Doctorale Frontières du Vivant (FdV) – Programme Bettencourt.”


    Data Scientist

    PhD in Bio-Mathematics, Data Science & Machine Learning