Does poor educational infrastructure influence school dropout and child labor in Brazil?

This project (Write a Data Science Blog Post) is part of the Udacity Data Scientist Nanodegree Program. The detailed analysis, with all the required code, is posted in my GitHub repository.


The objective of this article is to apply some Data Science techniques I’ve learned in the Udacity Data Scientist Nanodegree Program, extracting insights from the data to answer the following questions:

  1. Is school dropout rate higher in schools with less infrastructure?
  2. Is child labor rate higher in regions where the schools have less infrastructure?
  3. Is child labor rate higher in regions with high school dropout rate?

Data Source

I’ve joined two datasets: the 2012 school infrastructure data (link) and a 2013 ENEM survey with school dropout and child labor data (link). Both references are in Brazilian Portuguese.
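A minimal sketch of how such a join could look in pandas. The column names, the `city_code` join key and the values are hypothetical; the real datasets use their own identifiers, and the join in my repository may be done at a different level.

```python
import pandas as pd

# Hypothetical school infrastructure records (2012), one row per school.
schools = pd.DataFrame({
    "city_code": [1001, 1001, 2002],
    "has_electricity": [1, 0, 1],
    "has_bathroom": [1, 0, 1],
})

# Hypothetical ENEM survey answers (2013), one row per candidate.
candidates = pd.DataFrame({
    "city_code": [1001, 2002, 2002, 3003],
    "dropped_out": [0, 1, 0, 0],
})

# Aggregate schools per city first, so the join cannot duplicate candidates.
city_infra = schools.groupby("city_code", as_index=False).mean()

# Left join keeps every candidate; `indicator` flags cities with no school data,
# and `validate` raises if a city appears more than once on the right side.
merged = candidates.merge(
    city_infra, on="city_code", how="left", validate="m:1", indicator=True
)
print(merged)
```

Aggregating before merging is a design choice: joining candidates directly against individual schools would repeat each candidate once per school in the city and inflate every rate computed afterwards.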

ENEM is a nationwide exam whose grades students can use to apply to universities across the country and also in Portugal.

While taking this exam, the candidates are asked to fill in a socioeconomic survey where they answer questions regarding their study period, social conditions and family composition, among other aspects. I used some of those answers to identify candidates who worked as children or dropped out of school at least once in the past.

Data exploration

A national survey conducted by IBGE (the national research institute) points to a 20% school dropout rate, which might indicate a difference between the ENEM candidates and the general Brazilian profile.

Based on Brazilian law, I counted as child labor everyone who stated they had started working before the age of 14, since it’s forbidden for a child of this age to work under any circumstances. According to IBGE, Brazil had 1.8 million children in a child labor situation in 2019.
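The rule above can be sketched in a few lines of pandas. The column names and the sample answers are hypothetical (the real survey encodes its answers differently); only the threshold of 14 years comes from the text.

```python
import pandas as pd

CHILD_LABOR_AGE_LIMIT = 14  # threshold described in the text

# Hypothetical survey extract: the age at which each candidate started working
# (missing when the candidate never worked).
survey = pd.DataFrame({
    "state": ["MA", "MA", "MS", "MS", "MS"],
    "age_started_working": [16, None, 12, 13, 18],
})

# Missing ages compare as False, so candidates who never worked are not flagged.
survey["child_labor"] = survey["age_started_working"] < CHILD_LABOR_AGE_LIMIT

# The mean of a boolean column is the share of True values.
rate_by_state = survey.groupby("state")["child_labor"].mean()
national_rate = survey["child_labor"].mean()

print(rate_by_state)
print(national_rate)
```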

As with our school dropout findings, child labor is not specific to a single region. The Brazilian child labor rate based on the ENEM candidates’ answers is 10.2%, and four regions have states with rates higher than the national average. The only exception is the Northeast region, where every state is below the national rate.

In both aspects, school dropout and child labor, the Midwest candidates are at the top of the rates, while the Northeast region is at the opposite end. Mato Grosso do Sul (MS) has 5 times as many candidates stating they had dropped out of school at least once, and 3 times as many students who started working before turning 14, as Maranhão (MA), which appears among the lowest rates in both charts.

Schools infrastructure

What surprised me most is that there are several towns where the schools have no bathrooms available for the students, no electricity, or not even a water supply.

As seen in the charts above, the data looks contradictory: almost half of the schools in the North have no bathroom and almost a quarter have no electricity, yet it’s the region with the highest meal offering rate. Around 95% of the schools in the North offer meals to the students during school time, and this rate is even higher for the Northern schools with no electricity: 99% of them provide meals for the students.
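Figures like these come from two kinds of aggregation: the share of schools per region lacking a feature, and the meal rate restricted to schools without electricity. A small sketch, with hypothetical column names and made-up rows that do not reproduce the real percentages:

```python
import pandas as pd

# Hypothetical school records, one row per school.
schools = pd.DataFrame({
    "region": ["North", "North", "North", "North", "South", "South"],
    "has_bathroom": [False, False, True, True, True, True],
    "has_electricity": [False, True, True, True, True, True],
    "offers_meals": [True, True, True, False, True, False],
})

schools["no_bathroom"] = ~schools["has_bathroom"]
schools["no_electricity"] = ~schools["has_electricity"]

# Share of schools per region lacking each feature, plus the overall meal rate.
by_region = schools.groupby("region")[
    ["no_bathroom", "no_electricity", "offers_meals"]
].mean()

# Meal rate restricted to the schools that have no electricity.
no_power = schools[schools["no_electricity"]]
meals_without_power = no_power.groupby("region")["offers_meals"].mean()

print(by_region)
print(meals_without_power)
```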

Schools infrastructure score

In this study, the researchers divided the features a school can have into two categories: basic and advanced. The basic category includes electricity and water supply, sewage disposal, bathrooms and some equipment such as TVs, DVRs, computers and printers. All other items make up the advanced category, for example a teachers’ room, library, science lab and sports court.

To determine the infrastructure of a school, I’ve assigned a weight to each basic or advanced feature it has and combined those weights into a single score. I could then use this value to compare different combinations of features effectively.
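To illustrate the idea, here is a minimal sketch of a weighted score. The weights (1 for basic, 2 for advanced) and the feature names are assumptions made for this example only; they are not the values used in the analysis.

```python
# Hypothetical weights: the real values used in the analysis are not shown here.
BASIC_WEIGHT = 1.0
ADVANCED_WEIGHT = 2.0

# Hypothetical feature names, grouped as in the two categories described above.
BASIC = {"electricity", "water_supply", "sewage", "bathroom", "tv", "computer", "printer"}
ADVANCED = {"teachers_room", "library", "science_lab", "sports_court"}


def infrastructure_score(features):
    """Sum the weight of every feature the school has."""
    features = set(features)
    unknown = features - BASIC - ADVANCED
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return (
        BASIC_WEIGHT * len(features & BASIC)
        + ADVANCED_WEIGHT * len(features & ADVANCED)
    )


school_a = ["electricity", "water_supply", "bathroom"]
school_b = ["electricity", "library", "science_lab"]

score_a = infrastructure_score(school_a)  # three basic features
score_b = infrastructure_score(school_b)  # one basic, two advanced
print(score_a, score_b)
```

With a single number per school, two schools with different feature combinations become directly comparable, and the scores can be averaged per city or state.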

The Southern, Midwestern and Southeastern regions have similar infrastructure score averages, and their states make up the top half of the infrastructure per state chart.

When it comes to connecting school dropout and infrastructure score, the charts above don’t suggest a strong relation. The Northeastern region has the lowest school dropout rate, yet its highest state in terms of infrastructure score, Ceará (CE), which has a low school dropout rate compared with the other states, is not even in the “top 10” in terms of school infrastructure.

Is school dropout rate higher in schools with less infrastructure?

This means that I couldn’t find a significant reduction in the school dropout rate when comparing the highest school infrastructure scores against the lowest ones.

Even if we analyze a single state, for example Maranhão (MA), which has the lowest school dropout rate, it’s not possible to confirm any growth when comparing such values.

So the answer to this question is no: school dropout is not higher when the school infrastructure is low.
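One simple way to compare the highest infrastructure scores against the lowest ones is to split the observations into quartiles by score and compare the mean dropout rate of the bottom and top groups. This is only a sketch of that kind of check, not necessarily the method used in the repository; the column names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical per-city aggregates.
cities = pd.DataFrame({
    "infra_score": [1, 2, 3, 4, 5, 6, 7, 8],
    "dropout_rate": [0.10, 0.12, 0.09, 0.11, 0.10, 0.12, 0.11, 0.09],
})

# Split the cities into four equally sized groups by infrastructure score.
cities["score_quartile"] = pd.qcut(
    cities["infra_score"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)

mean_dropout = cities.groupby("score_quartile", observed=True)["dropout_rate"].mean()

# Difference between the lowest-score and highest-score groups.
gap = mean_dropout["Q1"] - mean_dropout["Q4"]
print(mean_dropout)
print(gap)
```

A gap close to zero, as in this made-up sample, is what "no significant reduction" looks like in numbers.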

Is child labor rate higher in regions where the schools have less infrastructure?

Going deeper on this matter, I’ve analyzed the charts of the states with the lowest and highest child labor rates, respectively Piauí (PI) and Mato Grosso (MT).

Both states have a similar, nearly flat trend line, showing that child labor is not strongly connected with school infrastructure, whether the child labor rate is high or low.
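A trend line of this kind is a least-squares straight line, and its slope tells how flat it is. A minimal sketch with NumPy, using made-up points rather than the real values for PI or MT:

```python
import numpy as np


def trend_slope(x, y):
    """Slope of the least-squares line y = slope * x + intercept."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


# Hypothetical points: infrastructure score vs. child labor rate in two states.
infra_score = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
labor_rate_low_state = np.array([0.05, 0.05, 0.06, 0.05, 0.05])
labor_rate_high_state = np.array([0.20, 0.21, 0.20, 0.19, 0.20])

slope_low = trend_slope(infra_score, labor_rate_low_state)
slope_high = trend_slope(infra_score, labor_rate_high_state)
print(slope_low, slope_high)
```

Slopes near zero in both states mean the child labor rate barely moves as the infrastructure score changes, regardless of how high the rate itself is.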

Another aspect that points in the opposite direction of the first impression is the 41% correlation between the two axes, which is considered moderate.
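For reference, a correlation like this is usually the Pearson coefficient, which can be computed with NumPy as sketched below. I’m assuming Pearson here, and the sample values are made up.

```python
import numpy as np

# Hypothetical paired observations (e.g. infrastructure score and child labor rate).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(x, y)[0, 1]

# For a straight-line fit, r squared is the share of variance explained.
r_squared = r ** 2
print(r, r_squared)
```

Squaring the coefficient helps put a "moderate" value in perspective: a correlation of 0.41 corresponds to roughly 17% of the variance explained by a linear fit.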

So the answer to this question is also no, although it’s not possible to state the opposite either.

Is child labor rate higher in regions with high school dropout rate?

Unlike in the previous questions, the trend line shows a positive slope.

Analyzing further, I’ve repeated the analysis from the previous question for the states with the lowest and highest child labor rates, respectively Piauí (PI) and Mato Grosso (MT).

Although the 31% correlation between the axes is only considered moderate, the upward trend in all scenarios makes me believe that school dropout and child labor are somehow related.

So the answer to this question is yes: the data shows that higher child labor rates come with higher school dropout rates.


As a Data Science exercise, I’ve learned that when the dataset is interesting, the CRISP-DM process tends to be fun and pleasant.