• Melissa Rosales

Data Exploration Journal

Updated: Nov 25, 2019

For our group's data visualization project, we wanted to explore the different healthcare facilities around Boston and the services they offer. Through the Boston Government website, Melissa was able to find this data set collected by the Division of Health Care Facility Licensure and Certification. Since this data set is from a government website, Melissa thought it would be cleaned and organized. She quickly learned that no data set is ever clean.

The original data set did not just include Boston, but all of Massachusetts. The data also included multiple services. Our group decided to narrow down the data to focus on three things: 1) clinics only located in Boston 2) alcohol/substance abuse services and, 3) mental health services. We also kept the clinic name, street, city/town, zipcode, and telephone number.

After cleaning the data and creating a pivot table, Melissa noticed that the city/town corresponding to the zipcode entries do not match. Areas that should be under Jamaica Plain or Hyde Park, are also entered as Boston. Boston may be small but it's important that we set certain group areas for the clinics that are in common areas. So Melissa decided to organize it by neighborhood instead. This was when the process became more tedious. Certain zipcodes do not automatically correspond to a neighborhood. So she found an open resource Google map that has accurate and clear delineates of the neighborhoods in the city of Boston. Melissa manually entered each address just to make sure the neighborhoods were right. Simultaneously, she checked if the addresses correspond to the correct clinic names.


She found a few inaccuracies.


Three clinics didn't exist anymore. Mattapan CHC SBHC @Young Achievers turned out to be a temporary clinic in the basement of the Young Achievers Science and Math Pilot School. Melissa confirmed this in a phone call with the Mattapan Community Health Center.


Some zipcodes were entered wrong. Harvard Vanguard in West Roxbury's zipcode entered was 02167, but the correct zipcode was 02467. Melissa also changed Boston ASAP's from 02108 to 02111, and Sidney Borum Jr. Health Center of Fenway from 02115 to 02111


She also noticed that some of these clinics only offered certain services so she removed them too. She removed Ryan Center for Sports Medicine at Boston University, Northeastern Speech-Language and Hearing Center, and Planned Parenthood. Melissa also removed all dental clinics and Shields MRI X-ray Labs.


Here are the results.


The whole point of this data research was to show how available are these services and clinics are in Boston, specifically in which neighborhoods too. The addresses and zipcodes are given but that's not enough for the visualization we aim to make. We wanted to create a map in Tableau. Tableau can only register zipcodes, countries, states, counties, latitudes, and longitudes. We had street addresses. Melissa found a free web program called Geocodio that geocodes and data matches US and Canadian addresses to exact longitudes and latitudes.

Since Melissa only needed the exact coordinates, she selected the longitudes and latitudes and pasted it to the clean data set she made. Melissa's processed data set has come a long way from the original data but if it wasn't for the work she's done, we wouldn't be able to show these amazing maps.


For Gabi, the first week of working with this data had been centered around figuring out how to work with GEOID 10’s from a Boston social vulnerability data set. Instead of providing zip codes, Climate Ready Boston gave GEOID 10, which are “numeric codes that uniquely identify all administrative/legal and statistical geographic areas for which the Census Bureau tabulates data,” according to the US Census website. Although these are useful for government entities, it doesn’t help the public with easily identifying the areas the data set outlines.


The information Gabi was specifically interested in with Climate Ready Boston’s data set was the Low English Proficiency numbers. This would help with ultimately seeing how the mental health and substance abuse clinics intersect with multilingual communities. Gabi wanted to specifically check if Boston is providing services to more than just English-speaking people. The landscape of Boston is so diverse, and clinics shouldn’t solely focus their mental health and substance abuse resources on a specific population (white, English-speaking).

The first step Gabi took was downloading and opening the data in Excel. The image below displays what she saw at first.



Her first impressions were mixed. She didn’t understand the need to include the GEOIDs instead of zip codes, but Gabi figured that the reason Climate Ready Boston had this data in the first place was from the US censu. She also knew she wouldn’t be needing any other columns besides LEP (Lower English Proficiency), GEOID 10, and Neighborhoods (called ‘Names’). Gabi decided to clean up the data in Tableau to include only those columns.


The problem, however, is that she can’t plot GEOID 10 in Tableau. Gabi tried following a US Census document that outlined “How to Join the ACS Summary File to the TIGER/Line Shapefiles.” Gabi was confused about what ACS Summary Files or TIGER/Line Shapefiles were, but she still tried to follow the steps. It was easy enough at first to follow the guide. She had to create two columns to the right of the GEOID column, as her modified GEOID would go into the second column.



Gabi was then instructed to select the GEOID column, select the ‘Data’ button from the top menu, then select ‘Text to Columns,’ which would prompt a ‘“Convert Text to Columns Wizard’ box to appear. I was then told to select the ‘Delimited’ option as a description of her data. They then asked for the deliminator, but this is where the problems began appearing. 


In the GEOID 10’s provided by Climate Ready Boston, there were no letters/separation of number types in the GEOID. In a regular GEOID, there would be five digits then a ‘US’ following them. The deliminator in that case would be the ‘S’ as it would separate all the following digits after that letter. These digits would then be a listed census tract, a specific location that is outlined with borders. This meant Gabi would get parameters and the LEP of that area. These parameters would then be layered with the zip code parameters, and the group could see the overlap of mental health and substance abuse clinics with low English proficiency areas.


But why would data ever be this easy? Gabi’s Climate Ready Boston data had no ‘US’ characters in the GEOID, so her deliminator couldn’t be ‘S.’ This meant that the data couldn’t be separated easily through the ‘Text to Column’ option. Instead, she had to know what the US and county digits were so that the census tract digits could be identified. But the next challenge? It’s impossible to back-search the census tract digits to receive any information that would plot it. This website is dedicated to identifying the census tract from an address. But there was no vice-versa search option. So the GEOIDs were becoming more useless with each step.


However, Gabi found another guide online about finding census blocks with QGIS, which is “a free, open-source geographic information system (GIS) software package you can use to create, analyze, and process geographic data,” according to the guide. She followed the instructions of downloading the QGIS3.10 application on my Mac, which was a bit difficult as Apple has special security measures of not opening apps from not licensed developers. So, she had to override the security warning.


Upon opening the app, this was the image Gabi saw.


The instructions then say to download the census block boundaries from the US Census website. Afterwards, a plugin was supposed to be installed, called ‘OpenLayers.’ There was no plugin like this available on the plugin download menu. This already created a hitch in the process. Gabi decided to skip this step, and tried dragging the downloaded data into the open area. She was presented with a window of layers that her dataset included.


When clicking ‘Select All,’ the map appeared for a moment, then disappeared. Error messages appeared instead.

The GEOID data was obviously presenting a hard time for Gabi, so she decided to look for other data sources regarding Low English proficiency (or in that realm). She found a data table from The Boston Foundation that listed the amount of English-language students by Boston neighborhoods. There was an important footnote that recognized that not all Elementary and Secondary Education providers responded to the survey.


This was a start. There were neighborhood names and there were numbers attached to the question of “How many English-language learners are there?” The chart indicates that Chinatown/South Industrial have the most language learners, followed by Central/North End, then Dorchester. But the problem was that Tableau doesn’t take neighborhood names as a geographic identifier. So, zip codes had to be attached to the neighborhoods. This is where the Boston Business Partner’s table set of zip codes organized by neighborhoods came in. Combining the knowledge of these two, the number of clinics in most English-language learners neighborhoods (above) are 10, 9, 9, respectively.

With these two tables, Gabi turned them into two separate Excel data tables and then put them into Tableau. The two files uploaded as different sheets, which seemed simple at first. But when it came to mapping the different levels of English-langauge learners with zip codes attached to the neighborhoods, the Tableau map wasn’t working out the way Gabi wanted it to be.


First of all, not all of the zip codes were showing up. More importantly, the SUM total of the English-language learners in each zip code were not differentiated. This is a question to be posed for Lina in regards to working with multiple sheets of data in one map, and how this affects the outcome of the mapping.


Going forward, Gabi wants to work on several things in regards to data she worked with this week. Her first goal is to figure out the multiple sheets situation in Tableau so that she can map the English-language learners in Boston. This will then be helpful in assessing if clinics are placed equally in English-dominated zip codes as well as in multilingual zip codes. The second goal is to figure out if there is still hope for the GEOIDs. The third goal is to begin calling the clinics in Boston to figure out their language capabilities. This will be done through individual calling and asking a set of survey questions (to be developed before the calls begin, and uniform for all calls). That data will be compiled on to another spreadsheet, which will be analyzed and presented in a fitting format to best tell the story. The fourth and final goal is to research and begin setting up a timeline of mental health and substance abuse treatment in Boston, in order to give readers context about the story being told.


For Maya’s part, the data set being used was looking at the median income of households in each Boston’s neighborhood to see if the average income of the neighborhood had anything to do with how many healthcare facilities are in the area. The data set came from bostonplans.org and it started out looking like this once it was downloaded to excel:

First, the dataset had to be cleaned.  The dataset was mostly clean already, so all Maya had to do was take out “United States” and “Massachusetts” as data points because they were irrelevant to the data. The rest of the data is useful, because looking at the aggregate income of the whole neighborhood as well as the per-capita income are both helpful in determining the wealth of a neighborhood. Once the data was clean and uploaded into tableau it looked like this:


After the data was clean, Maya imported the data into tableau, but there was a problem because there were no zip codes for the neighborhoods, so making a map would have been impossible. To solve this, we imported another data set from Lina that was of Boston neighborhood zip codes. From there, the data was combined by using the “add” tool on tableau:


Next, Maya created a visualization showing both the per capita and aggregate income of each neighborhood, with the Boston Area zip codes. She dragged in the per capita income and aggregate income tab, the community tab and the zipcodes tab. She used both the size and the color features to show the income of each Boston neighborhood. The size feature helps show the income of each neighborhood relative to the other neighborhoods, and the color helps differentiate between neighborhoods. The final product looked like this:

Vivi’s part of the project was to find a dataset that explained the amount of deaths related to opioid use in Massachusetts. Originally, she struggled to find a dataset that tracked opioid-related deaths in Boston specifically because most datasets tracked opioid deaths in Massachusetts by town/city by year, but not by region in Boston. Vivi was able to find, however, a dataset that outlined drug-related deaths in the US by State. This dataset is the one she settled on because it outlines specifically which drug type caused the death, meaning she could compare opioid-related deaths to other substances. This dataset also gives her the freedom to compare opioid-related death rates to other states.


This is what the dataset originally looked like. As you can see, it is overwhelmingly large and divided by state. This photo only depicts the first few columns which is data for the state of Arizona. 


Initially, Vivi wanted to see the statistics for drug-related deaths in the US. This graph above is depicting the indicator (which drug caused the death) and the Sum from the Data Value. The graph did not initially look like this, Vivi had to omit the total number of deaths because it was a number so large, it took up the majority of the graph and made it harder to see the differences between the drug types. With that excluded from the graph, we are able to clearly see the differences in drug-related deaths. 


This graph is the same as the one above, however, instead of data for the entire country, Vivi added State values and only kept Massachusetts data. This allows her to compare the data from Massachusetts to that of the entire country.


This graph is the same as the one above, however, instead of data for the entire country, Vivi added State values and only kept Massachusetts data. This allows her to compare the data from Massachusetts to that of the entire country.

Based on this map Vivi was able to demonstrate that Massachusetts has a relatively high count of opioid-related deaths. In order to properly compare it to other states, she will make the statistics relative to state population.