Data Wrangling at Scale with R’s data.table

GitHub repo: Data-Wrangling-R-data.table

This session introduces the modern data wrangling workflow with data.table. Data wrangling is one of the core steps in the data science workflow: cleaning raw data sets into a format that is readily analyzable. The data.table package offers a fast, memory-efficient file reader and writer, aggregations, updates, and equi, non-equi, rolling, range, and interval joins, all in a short and flexible syntax that speeds up development. It is widely used for data manipulation tasks such as filtering, transforming, and summarizing datasets and variables.
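
To give a flavour of the syntax the session covers, here is a minimal sketch of reading a file with fread(), filtering and aggregating with the dt[i, j, by] form, updating by reference with :=, joining, and writing with fwrite(). The file, column, and table names are invented for illustration and are not the session's data.

    library(data.table)

    sales <- fread("sales.csv")                 # fast file reader

    # i: filter rows, j: aggregate, by: group
    sales[region == "EU",
          .(total = sum(revenue), n = .N),
          by = .(year, product)]

    # Update a column by reference (no copy of the table is made)
    sales[, revenue_eur := revenue * exchange_rate]

    # Equi join against a small lookup table on a shared key
    categories <- data.table(product = c("A", "B"), category = c("food", "tech"))
    sales[categories, on = "product"]

    fwrite(sales, "sales_clean.csv")            # fast file writer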

The main learning objectives of this session are to:

  1. equip you with conceptual knowledge about the data.table package and data wrangling workflow
  2. demonstrate the ease of using data.table through highlighting the most common data wrangling functions
  3. provide you with a practice exercise and further resources.



Text Classification with BERT

GitHub repo: EU-DMA-Text-Classification-BERT

In this tutorial, we build a text classification pipeline for unstructured documents using a transformer-based deep learning model called Bidirectional Encoder Representations from Transformers (BERT). We demonstrate the application of a document classification algorithm for the European Commission (EC).

Whenever new legislation is proposed, the EC opens public consultations in which various stakeholders (e.g. businesses, academia, law firms, associations, and private individuals) submit documents detailing their views on the proposal. The EC receives anywhere from 10,000 to 4 million of these public consultation documents annually. Using machine learning and deep learning methods to process these documents will streamline the Commission's review of stakeholder comments, allowing it to integrate more information into its policymaking process.

By the end of this tutorial you will understand how to:

  1. Extract, clean, and pre-process information from unstructured PDF documents
  2. Use the pre-processed text as input to machine learning/deep learning models
  3. Build a text/document classifier with BERT
  4. Compare BERT with text classifiers built using other models
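
The sketch below illustrates the shape of such a pipeline (objectives 1–3): extracting raw text from a PDF and classifying it with a BERT model via Hugging Face transformers. It is an assumption-laden outline, not the tutorial's actual code: pypdf stands in for whatever PDF extraction tool the tutorial uses, "bert-base-uncased" is a placeholder checkpoint (the tutorial fine-tunes its own classifier), and the file path and label count are invented.

    import torch
    from pypdf import PdfReader
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # 1. Extract raw text from an unstructured PDF (path is a placeholder)
    reader = PdfReader("consultation_submission.pdf")
    text = " ".join(page.extract_text() or "" for page in reader.pages)

    # 2. Pre-process / tokenize for the model (BERT accepts at most 512 tokens)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

    # 3. Classify with a BERT sequence-classification head
    #    (untrained head here; the tutorial fine-tunes it on labelled documents)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_class = logits.argmax(dim=-1).item()
    print(predicted_class)  # index of the predicted document category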