Technical Workshops

Building data science capacities in statistical teams

About

Designed for professionals who play a role in defining, from a technical perspective, how data can drive social progress, the workshops introduce attendees to the data science skills needed to work with non-traditional sources of information, e.g. satellite images, Call Detail Records, bank transactional data or web-derived data.

The use of Big Data is accelerating within development and humanitarian practice. Used well, it can foster inclusion, improve efficiency, and lower project costs, benefiting the public and private organizations involved in development programs. Our courses therefore cover different aspects of data science and data engineering relevant to official statistics and sustainable development.

“Excellent course. It's great having the opportunity to participate in qualifications relevant to our work.”
John Doe
Computational Engineer from IBGE Brazil
“[I appreciated the] Promptness, cordiality and knowledge of the instructors.”
Gustavo Tavares Lameiro da Costa
Technologist in Geographic and Statistical Information
Target audience

If programming is part of your daily tasks or you lead technical teams, these courses are for you (some programming experience is required; Python is preferable though not necessary). 

Methodology

All the programming material is provided in Python, using the conventional open-source libraries for data science. Most sessions are interactive and run in Jupyter Notebooks (.ipynb). A practical exercise is completed at the end of each session.

Format

We offer three different on-site courses of 18 teaching hours each, distributed over 3 days and designed for 20 participants. They are delivered by a team of 2 training specialists.

Course 1

Web data collection and analysis

About

Whether the website has interactive components or a fully fledged API, this course will teach you how to programmatically extract data from the web and which models can be applied to draw conclusions from these data. As case studies, we develop a PPP exchange rate using only real estate rent prices in selected Latin American countries, and we study migration flows using the Facebook Marketing API.

Testimonials

“The whole course was excellent. It is great having the opportunity to participate in qualifications on modern issues relevant to IBGE’s work.”

Syllabus

Modern websites usually have interactive components. We focus on using web browser emulators, namely the Selenium WebDriver, to exploit those components programmatically.

The use-case of this module is a real-estate rental platform, where rent prices are collected. 

The collection methods used for this platform are applicable to many e-commerce websites that present a similar catalogue structure for exhibiting their products.
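To give a flavor of what this collection code looks like, here is a minimal sketch that extracts listing prices from a catalogue-style page using only Python's standard library. The `price` class name and the HTML structure are hypothetical; in the course, the page itself is first rendered and navigated with Selenium before its HTML is parsed.

```python
from html.parser import HTMLParser

class ListingPriceParser(HTMLParser):
    """Collect the text of elements tagged with class="price".
    The class name is a hypothetical catalogue convention,
    not the markup of any specific platform."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())
            self._in_price = False

page = """
<div class="listing"><span class="price">R$ 1.500</span></div>
<div class="listing"><span class="price">R$ 2.300</span></div>
"""
parser = ListingPriceParser()
parser.feed(page)
print(parser.prices)  # ['R$ 1.500', 'R$ 2.300']
```

The same pattern generalizes to any catalogue whose items share a repeated structure: locate the repeating element, extract its fields, and accumulate them into a table.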

Large and medium-sized web platforms commonly expose their data through web Application Programming Interfaces (APIs). We learn how these can be easily manipulated through general open-source libraries such as requests, or through dedicated libraries in the case of large platforms, such as Facebook's official Facebook Marketing SDK.
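The typical pattern is a `requests.get(url, params=...)` call followed by parsing of the JSON response. A hedged sketch of the parsing half, using a canned payload — the `data`/`estimate_mau` field names below are illustrative of an audience-estimate response shape, not a guaranteed API schema:

```python
import json
# In practice the payload comes from the network, e.g.:
#   payload = requests.get(endpoint, params={"access_token": token}).text
# Here we use a canned sample so the sketch is self-contained.

def parse_audience_estimate(payload: str) -> int:
    """Extract a monthly-active-users estimate from a JSON payload.
    Field names are illustrative, not a guaranteed schema."""
    doc = json.loads(payload)
    return doc["data"][0]["estimate_mau"]

sample = '{"data": [{"estimate_mau": 1200000}]}'
print(parse_audience_estimate(sample))  # 1200000
```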

The Facebook modules were inspired by the excellent work done by our colleagues from the Qatar Computing Research Institute at HBKU, UNICEF, MIT Media Lab, iMMAP Colombia and the Global Protection Cluster from UNHCR, entitled “Real-Time Monitoring of the Venezuelan Exodus through Facebook’s Advertising Platform”.

The analysis of web-collected data is challenging because these data are frequently noisy and biased. We study how these challenges can be addressed with adequate modeling techniques. Additionally, visualization, as a means of extracting knowledge from data, is also fundamental when working with web-collected data.

We cover basic cleaning procedures, different Machine Learning models, and both static and interactive visualizations, all using the most popular open-source libraries from the Python data science stack.
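As an example of such a cleaning step, the rent-price case study needs scraped price strings normalized into numbers before any modeling. The Brazilian-style format below ("R$ 1.500,00") is assumed for illustration; real listings vary and need more defensive parsing:

```python
import re

def clean_price(raw: str) -> float:
    """Normalize a Brazilian-format price string such as
    'R$ 1.500,00' into a float (1500.0). The input format is
    assumed for illustration only."""
    digits = re.sub(r"[^\d,]", "", raw)    # drop currency symbol and thousands dots
    return float(digits.replace(",", "."))  # decimal comma -> decimal point

print(clean_price("R$ 1.500,00"))  # 1500.0
print(clean_price("R$ 980,50"))    # 980.5
```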

Course 2

Machine learning on satellite imagery

About

Skills in Machine Learning (ML) are indispensable for 21st-century statisticians and Data Scientists. This course introduces you to the field by covering the most popular tasks from supervised and unsupervised ML.

Motivated by the fact that remote-sensing imagery is already being used to address development issues, e.g. revealing changes in soil quality or water availability, informing agricultural interventions and even measuring poverty, we structured this course around ML methods that can be applied to satellite imagery, aiming to help statistical teams leverage this modern and omnipresent data source.

Syllabus

Methods from supervised machine learning are those that have progressed the most in both academic and industrial environments. We cover the task of classification: training a model to categorize the observations it is given. We examine algorithms such as Logistic Regression, Support Vector Machines, Gradient Boosted Trees and Neural Networks, going through their theoretical bases and applying them in practice.
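To illustrate the core idea behind classification, here is a didactic logistic regression on one-dimensional inputs, trained with stochastic gradient descent. This is a teaching sketch only; the course itself works with the standard open-source libraries rather than hand-rolled training loops:

```python
import math

def train_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit 1-D logistic regression p(y=1|x) = sigmoid(w*x + b)
    by stochastic gradient descent. Didactic sketch only."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # predicted probability
            w -= lr * (p - y) * x                 # gradient step on weight
            b -= lr * (p - y)                     # gradient step on bias
    return w, b

# Toy, linearly separable data: negatives below 0, positives above
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(xs, ys)
predict = lambda x: 1 / (1 + math.exp(-(w * x + b))) > 0.5
print([predict(x) for x in xs])  # [False, False, False, True, True, True]
```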

Unsupervised machine learning is currently the second most popular area of the field. We cover the task of clustering: forming groups of observations that are similar in a previously defined sense. We go through the theoretical principles and put into practice different algorithms such as k-means, Gaussian Mixture Models and others that are popular in the field of computer vision.
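The k-means idea fits in a few lines: alternately assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A tiny one-dimensional sketch (real pipelines use a library implementation, and initialization matters far more than this naive version suggests):

```python
def kmeans_1d(points, k=2, iters=20):
    """Naive 1-D k-means. Initial centroids are the first k points;
    didactic sketch only, not production clustering."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.25, 0.75, 10.0, 10.5, 9.5]
print(kmeans_1d(points))  # [1.0, 10.0]
```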

This case study is developed with Landsat 8 satellite images, which are free and accessible to everyone. Both the supervised and unsupervised machine learning methods presented previously are used to infer urban extent from open satellite images, as part of the calculation of SDG indicator 11.3.1 (Tier 2).

We focus this exercise on using satellite imagery to bridge data gaps in measuring the aforementioned Tier 2 indicator, meaning one that is “conceptually clear, has an internationally established methodology and standards are available, but for which data is not regularly produced by countries”.
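One concrete building block for urban-extent mapping is the Normalized Difference Built-up Index (NDBI), (SWIR − NIR) / (SWIR + NIR), computed for Landsat 8 from the SWIR1 (band 6) and NIR (band 5) reflectances; positive values tend to indicate built-up surfaces. A minimal sketch with toy pixel values (the reflectances are illustrative, not real scene data):

```python
def ndbi(swir, nir):
    """Normalized Difference Built-up Index per pixel:
    (SWIR - NIR) / (SWIR + NIR). For Landsat 8, SWIR1 is band 6
    and NIR is band 5. Positive values suggest built-up areas."""
    return [(s - n) / (s + n) for s, n in zip(swir, nir)]

# Toy reflectances for three pixels (illustrative only)
swir = [0.30, 0.15, 0.25]
nir  = [0.20, 0.35, 0.25]
print([round(v, 2) for v in ndbi(swir, nir)])  # [0.2, -0.4, 0.0]
```

In a full pipeline, indices like this become features for the classifiers and clustering methods covered in the previous modules.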

Course 3

Statistical methods for correcting selection bias

About

Just as questionnaires are the means for observing reality through surveys, electronic platforms play the same role for big data. Most big data sources offer a non-probabilistic sample of the population of study, where several errors are induced by the self-selection of the individuals present in the sample, by targeting decisions made by the owners of the electronic platform, and by limitations in the platform's coverage.

In this course you will learn the principal techniques for correcting bias through a statistical approach. This work is based on previous research by Data-Pop Alliance on correcting bias in mobile network data and on a fundamental book published by Eurostat.

Syllabus

We define a statistical approach to big data and highlight the main challenges and opportunities it presents when used as a statistical source. We go through the specific difficulties of different big data sources of interest, such as mobile network data, bank transactional data and social media, among others.

Correcting selection bias in big data is analogous to procedures used for other data sources that share the same problem of non-random selection and have been studied for some time: web and telephone opt-in surveys.

We go through different techniques for correcting selectivity bias at the unit-of-observation level, where units will most often be individuals. Even though the methods are analogous to those used in opt-in surveys, what you will learn in this course is how to apply these procedures to massive data, by leveraging big data frameworks such as Spark through its easy-to-use Python API: PySpark.
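One such unit-level technique is inverse-probability weighting: weight each observed unit by the inverse of its estimated probability of appearing in the sample, so under-covered units count more. The sketch below uses plain Python and made-up inclusion probabilities for clarity; in the course the same computation is expressed over massive data in PySpark:

```python
def ipw_estimate(values, inclusion_probs):
    """Weighted (Hajek-style) mean: each unit is weighted by the
    inverse of its estimated inclusion probability. Probabilities
    are assumed already modeled; the numbers here are illustrative."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy example: the first two units are over-represented (prob 0.8),
# the last two under-represented (prob 0.2) in the platform's sample
values          = [10, 12, 30, 34]
inclusion_probs = [0.8, 0.8, 0.2, 0.2]
print(round(ipw_estimate(values, inclusion_probs), 2))  # 27.8
```

Note how the weighted mean (27.8) is pulled toward the under-covered units, whereas the naive sample mean would be 21.5.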


We also go through techniques for correcting selectivity bias at the domain level, i.e. for subgroups of the population such as geographic areas or demographic groups. Again, even though the methods are analogous to those used in opt-in surveys, you will learn how to apply them to massive data by leveraging big data frameworks such as Spark through its easy-to-use Python API: PySpark.
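A standard domain-level technique is post-stratification: compute the estimate within each domain, then recombine the per-domain results using known population shares (e.g. from a census) instead of the sample's own, biased, composition. A minimal sketch with illustrative numbers:

```python
def poststratify(domain_means, population_shares):
    """Post-stratified estimate: reweight per-domain means by known
    population shares rather than the sample's composition.
    All numbers here are illustrative."""
    return sum(domain_means[d] * population_shares[d] for d in domain_means)

# Toy example: the platform over-represents urban users, but the
# census says the population is majority rural
domain_means      = {"urban": 0.60, "rural": 0.20}
population_shares = {"urban": 0.45, "rural": 0.55}
print(round(poststratify(domain_means, population_shares), 2))  # 0.38
```

Because the rural domain carries most of the population weight, the post-stratified estimate (0.38) sits well below what the urban-heavy sample alone would suggest.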

Contact

For more information or questions regarding these courses, please contact us at trainings@datapopalliance.org