Report on a Capstone Project for Applied Data Science
As part of the IBM Data Science course, I was required to complete a project involving clustering of geospatial data and to communicate my findings. I chose to carry out an exploratory study of Chicago neighborhoods and compare them on several key parameters. This is the result of my work. You can find the details of the analysis in the notebook on GitHub (to see the rendered maps, use nbviewer).
Introduction and Context
Finding the right area to live in, one that fits your personality and preferences, can often feel like a challenge. It is especially difficult to choose among various locations when moving to a new city that you know little or nothing about. The decision should take into account a variety of factors (such as affordability, available infrastructure, transport, etc.), as well as evaluate possible risks. With so many different options to consider, how can one better weigh all the possibilities and make the right choice?
For this project, I would like to analyze a particular city of interest, using the methods of data science to better understand and compare its neighborhoods. I have chosen the city of Chicago, which is completely unknown to me, so I will also be learning a lot in the process (which is part of the fun). I will try to evaluate the city’s various regions based on the following three criteria:
1. Which areas are most dangerous / safest?
2. Where is the cost of living more affordable, and where is it high?
3. What are the differences and similarities between neighborhoods in terms of local infrastructure (e.g., food or entertainment venues)?
To answer each question, I am going to leverage open data sources to draw a few general conclusions on the relative merits of different areas, which should help in making a better-informed decision. Thus, the project is mostly exploratory in nature and, hopefully, can be useful to someone new to Chicago (like me) who plans to visit or even stay there in the future. Finding patterns in the venues among different neighborhoods can also provide valuable insights about where to open a new business.
In accordance with the above outline, I will combine various types of data, acquired from multiple available resources.
1. Geographic data on Chicago’s administrative divisions and official boundaries from the Chicago Data Portal (CDP), plus web scraping (e.g., Wikipedia) and geocoding APIs (e.g., Google or OpenStreetMap) for local neighborhoods and their locations.
2. Up-to-date crime statistics and population data from CDP, in order to evaluate and compare the crime rate for different areas.
3. Information on typical rental prices, again obtained by web scraping (or a specialized API).
4. Finally, I will use the Foursquare API location data to obtain information on various categories of local venues for each of the found neighborhoods.
Understanding Chicago’s Geography
As a first step, I work out the territorial divisions of Chicago, which turn out to be quite a mess. This allows me to formulate data requirements precisely, organize different types of data, and conveniently represent them on the map.
Although Chicago is often called “the city of neighborhoods”, this type of division is not official. Formally, the city is made up of 77 community areas that are often grouped into 9 districts (or “sides”). The community boundaries are well defined and do not overlap. Each community area has one or more neighborhoods in it. The neighborhoods are often not well defined and overlap each other. According to the Wikipedia page:
There are sometimes said to be more than 200 neighborhoods in Chicago, though residents differ on their names and boundaries.
The bottom line is that the intuitive neighborhoods have much more name recognition, but they have nothing to do with how the city governs or tracks data. City data is usually organized by community areas or City Council wards (which often fail to reflect the physical reality on the ground).
I first visually represent the above divisions on the map, combining both official community areas with informal neighborhoods. To keep it informative and not too congested, different visual means are used for different levels:
1. A choropleth map outlines the community area boundaries, obtained from CDP and stored in a GeoPandas dataframe. Each community area is colored according to one of the 9 sides (districts, or boroughs).
2. Neighborhoods are put on top as clickable pop-up markers.
To obtain the list of neighborhoods, I parse the contents of the above-mentioned Wikipedia page. Apart from the main table that classifies neighborhoods by community area, two other tables tell us which of the 9 sides each area belongs to, and whether a neighborhood is recognized by the City of Chicago (“recognized”) or is one of the other districts and areas recognized by the local community (“unofficial”).
Finally, I fetch the coordinates of each neighborhood by its name, using the combination of results from 2 geocoding APIs (Google and Nominatim) to obtain the most exhaustive and accurate catalogue of 233 locations:
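Combining two geocoders can be sketched as a simple fallback chain: try the first service, and query the second one only when the first returns nothing. This is a minimal illustration rather than the notebook's actual code. The function `geocode_with_fallback`, the two stand-in geocoders, the `", Chicago, IL"` suffix, and the approximate coordinates are all hypothetical; in practice each callable would wrap a real client for Google or Nominatim.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]                   # (latitude, longitude)
Geocoder = Callable[[str], Optional[Coords]]   # returns None when nothing is found


def geocode_with_fallback(names: Iterable[str],
                          geocoders: Iterable[Geocoder],
                          suffix: str = ", Chicago, IL") -> Dict[str, Optional[Coords]]:
    """Try each geocoder in order and keep the first hit for every name."""
    geocoders = list(geocoders)
    result: Dict[str, Optional[Coords]] = {}
    for name in names:
        result[name] = None                    # stays None if every geocoder fails
        for geocode in geocoders:
            coords = geocode(name + suffix)
            if coords is not None:
                result[name] = coords
                break
    return result


# Hypothetical stand-ins for wrappers around two real geocoding services.
def primary(query: str) -> Optional[Coords]:
    known = {"Hyde Park, Chicago, IL": (41.794, -87.594)}   # approximate
    return known.get(query)


def secondary(query: str) -> Optional[Coords]:
    known = {"Pilsen, Chicago, IL": (41.857, -87.657)}      # approximate
    return known.get(query)


locations = geocode_with_fallback(["Hyde Park", "Pilsen", "Nowhere"],
                                  [primary, secondary])
```

Names that no service can resolve stay `None`, which makes it easy to list them afterwards and correct them by hand.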
With the obtained data, I then create an interactive overview map of Chicago, using the Folium library. This map can be used to quickly look up the neighborhoods and community areas.
Communities Crime Rate Using Chicago Data Portal
To start exploring the city, it might be a good idea to understand the crime situation in its various regions. This information can suggest which regions are safest, and which are better to avoid.
CDP provides very detailed and (almost) up-to-date crime statistics, starting from 2001. As a first rough approximation, I retrieve the overall number of crimes (over the last two years, for a good average), grouped by community area. Since the full dataset is too large to download, I access the database through the SODA API, which uses the specialized “SoQL” query language, to obtain the total crime numbers by community area (restricted to years after 2018).
The overall frequency counts depend on area size and population, and are therefore not very representative. A better approach is to combine them with the population data, which is available by small census blocks, organized into larger census tracts. To aggregate the required values by area, I have to refer to tract boundaries to get the corresponding area number for each tract. Having done this, I obtain the following crime statistics and plot the results:
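A grouped count of this kind can be expressed as a single SoQL request, so the aggregation happens on the server. The sketch below only builds the request URL and does not call the network. The dataset endpoint and the column names `community_area` and `year` are assumptions about the CDP crimes dataset and should be checked against the portal; `build_crime_count_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint of the CDP crimes dataset; verify on the portal before use.
BASE_URL = "https://data.cityofchicago.org/resource/ijzp-q8t2.json"


def build_crime_count_url(min_year: int, base_url: str = BASE_URL) -> str:
    """Build a SODA request that counts crimes per community area after min_year."""
    params = {
        # Column names below are assumptions about the dataset schema.
        "$select": "community_area, count(*) AS crime_count",
        "$where": f"year > {int(min_year)} AND community_area IS NOT NULL",
        "$group": "community_area",
        "$limit": 100,   # comfortably above the 77 community areas
    }
    return base_url + "?" + urlencode(params)


url = build_crime_count_url(2018)
```

The resulting URL can be passed to any HTTP client; the response would be a small JSON list with one record per community area instead of millions of individual crime rows.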
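The normalization itself is simple arithmetic: sum the tract populations within each community area, then divide the crime count by that total. Below is a minimal sketch with made-up numbers. Expressing the rate per 1,000 residents is an assumption made for readability, and both helper functions are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def population_by_area(tracts: Iterable[Tuple[str, int, int]]) -> Dict[int, int]:
    """Sum tract populations per community area.

    Each record is (tract_id, community_area_number, population).
    """
    totals: Dict[int, int] = defaultdict(int)
    for _tract_id, area, population in tracts:
        totals[area] += population
    return dict(totals)


def crime_rate_per_1000(crimes: Dict[int, int],
                        population: Dict[int, int]) -> Dict[int, float]:
    """Crimes per 1,000 residents for areas with a known, non-zero population."""
    return {
        area: 1000.0 * count / population[area]
        for area, count in crimes.items()
        if population.get(area)
    }


# Made-up numbers, for illustration only.
tracts = [("T1", 1, 4000), ("T2", 1, 6000), ("T3", 2, 5000)]
crimes = {1: 500, 2: 1000, 3: 50}   # area 3 has no population record and is skipped

rates = crime_rate_per_1000(crimes, population_by_area(tracts))
# rates == {1: 50.0, 2: 200.0}
```

Area 2 has twice as many crimes as area 1 but only half the population, so its rate is four times higher, which is exactly the effect that raw counts hide.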
I put the crime rate on the map in order to see the spatial distribution by area.
Neighborhoods Average Rent
Another important factor for someone who plans to stay in Chicago for a longer period is the cost of living. In particular, how expensive is it to buy or rent an apartment in various parts of Chicago? With a newcomer to the city primarily in mind, I do not consider housing prices, but instead focus on ranking various locations by typical rent.
I found the relevant information on average rent by neighborhood on RENTCafe. After scraping the webpage, correcting some of the neighborhood names, and formatting the values, one can easily see the top 5 neighborhoods with the highest and the lowest rent (among the 166 with available data):
The rent prices are distributed as follows:
In principle, we could manually cut the rent prices into bins, in order to categorize neighborhoods into selected price ranges. Instead, let’s combine rent and crime data and see if we can identify similar groups of neighborhoods, using machine learning techniques.
I cluster the neighborhoods based on crime and rent statistics, using the simple k-Means algorithm. Before training the model, one first has to normalize the data and specify the number of clusters. I run a search over candidate values of k and analyze the resulting scores:
The “elbow” on the left graph is roughly at k=3, but it is not very pronounced. In the plot on the right, there is also a relative peak in the silhouette score at k=3. For a more refined clustering, k=7 also seems to be a reasonable choice. Running k-Means first with k=3 gives the following division of neighborhoods into clusters on the map:
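To make the two preparation steps concrete, here is a dependency-free sketch of normalization followed by k-Means for several values of k. It is an illustration under stated assumptions, not the notebook's code, and a library implementation would normally be used instead. The data are made up, the scaling is a plain z-score, and the farthest-point initialization is chosen only to keep the run reproducible.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def standardize(rows: Sequence[Sequence[float]]) -> List[Point]:
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> Tuple[List[int], float]:
    """Lloyd's algorithm; returns (labels, inertia)."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:   # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


# Made-up (crime rate, average rent) pairs, for illustration only.
rows = [(20, 1000), (25, 1100), (22, 1050), (90, 800), (95, 850), (30, 2300), (35, 2400)]
points = standardize(rows)
inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}   # values for an elbow plot
labels3, _ = kmeans(points, 3)
```

Without the scaling step, rent (in the thousands) would dominate the distance and the crime rate would barely influence the clusters. Plotting `inertias` against k gives the kind of curve in which one looks for an elbow.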
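For reference, the silhouette score compares, for every point, the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), and averages (b - a) / max(a, b) over all points. The function below is a minimal pure-Python sketch of that definition with Euclidean distance, meant only to show what is being measured; the name `silhouette_score` and the toy data are illustrative.

```python
import math
from typing import List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores: List[float] = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Mean distance from p to each cluster, excluding p itself.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(p, q)
                     for j, (q, lab) in enumerate(zip(points, labels))
                     if lab == c and j != i]
            mean_dist[c] = sum(dists) / len(dists) if dists else None
        a = mean_dist[own]
        if a is None:              # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        b = min(d for c, d in mean_dist.items() if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy one-dimensional data: two obvious groups.
toy = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(toy, [0, 0, 1, 1])   # natural grouping, close to 0.9
bad = silhouette_score(toy, [0, 1, 0, 1])    # mixed grouping, negative score
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlap, which is why a peak in this score is used as a hint for choosing k.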
In order to understand the cluster characteristics, I aggregate the mean values of average rent and crime rate:
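This aggregation is a plain group-by on the cluster label. A minimal sketch, using made-up values and a hypothetical helper `cluster_means`:

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def cluster_means(labels: Sequence[int],
                  rows: Sequence[Tuple[float, float]]) -> Dict[int, Tuple[float, float, int]]:
    """Return {cluster: (mean crime rate, mean rent, number of neighborhoods)}."""
    grouped: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for label, row in zip(labels, rows):
        grouped[label].append(row)
    return {
        label: (sum(r[0] for r in members) / len(members),
                sum(r[1] for r in members) / len(members),
                len(members))
        for label, members in sorted(grouped.items())
    }


# Made-up (crime rate, average rent) pairs and cluster labels.
rows = [(20.0, 1000.0), (30.0, 1200.0), (90.0, 800.0), (40.0, 2400.0)]
labels = [0, 0, 1, 2]
profile = cluster_means(labels, rows)
```

Note that the means are taken over the original, unscaled values, so the resulting profile can be read directly in crimes per capita and dollars.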
Clustering with k=7 provides a more detailed picture:
The structure of the clusters can be inferred from the following aggregated statistics as well:
Types of Venues
To gain insights into different types of neighborhoods, one can finally focus on the local infrastructure. Knowing the available local facilities can certainly help in making the right choice. Also, within a given cluster, can we observe any patterns between neighborhoods in terms of local venues?
Using the Foursquare API and geographic coordinates, I collect information on the most popular venues near each neighborhood. One can easily see what types of venues are most popular in each neighborhood by taking the mean of the frequency of occurrence of each category. This should assist in finding an appropriate area to live in, according to your taste:
Running clustering based on the types of venues, using “one hot encoding”, is also possible, but does not lead to any meaningful results in our case.
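The frequency computation can be sketched as follows. Counting categories per neighborhood and dividing by the number of venues gives the same numbers as one-hot encoding each venue's category and averaging the columns within a neighborhood. The venue records and both helper functions below are made up for illustration and do not come from Foursquare.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def category_frequencies(venues: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Share of each venue category per neighborhood.

    Each record is (neighborhood, category). The result equals the column
    means of a one-hot encoded category table, grouped by neighborhood.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


def top_categories(freqs: Dict[str, float], n: int = 3) -> List[str]:
    """Most frequent categories, ties broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, _ in ranked[:n]]


# Made-up venue records, for illustration only.
venues = [
    ("Pilsen", "Mexican Restaurant"), ("Pilsen", "Mexican Restaurant"),
    ("Pilsen", "Art Gallery"), ("Pilsen", "Bar"),
    ("Hyde Park", "Bookstore"), ("Hyde Park", "Coffee Shop"),
    ("Hyde Park", "Coffee Shop"), ("Hyde Park", "Park"),
]
freqs = category_frequencies(venues)
pilsen_top = top_categories(freqs["Pilsen"], 2)
```

The per-neighborhood frequency vectors are also the features one would feed into k-Means when clustering by venue type.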
Results and Discussion
Based on the available crime and population statistics, I compared different community areas by their calculated crime rate. It seems that West Garfield Park has been the most crime-prone area, followed by Fuller Park and the Loop. So I would recommend being cautious when visiting the areas found in the top 12! One can visually recognize the most and least dangerous areas on the choropleth map.
This is of course only a quick glance at the overall situation, but it is enough for our purposes. Our estimate is based on an average of all crimes over the last couple of years. Many other factors contribute to the crime rate and could give a better picture. For example, we did not distinguish between various categories of crime. Note also that the crime rate can vary greatly over the course of the day and between areas. This is a very broad and interesting topic in its own right, but here we are just exploring the city areas.
By combining the crime rate with the average rent for a 1-BR apartment and applying a machine learning algorithm (k-Means), I obtained two classifications of neighborhoods. The first division identifies 3 big clusters, based on the above parameters. Cluster 0 is relatively “safe” on the crime scale and is in the medium price range. Clusters 1 and 2 are more crime-prone on average, falling into lower and higher average rents, respectively.
One can notice that the lower-rent neighborhoods in cluster 1 (“orange”) roughly correspond to areas where the crime rate is high, which is natural. The neighborhoods in cluster 2 (“green”) break this pattern: due to their mostly central location, the rent there is the highest (above $2,000). I would recommend focusing on the dominant cluster 0 (“blue”), with more affordable rents and safer neighborhoods.
In the second, more detailed division, I was able to identify 7 clusters. Here there are 3 categories of crime level: “low” (clusters 3, 6), “medium” (clusters 2, 1, 0) and “high” (clusters 4, 5). Within each of these categories, clusters naturally vary in average rent.
For each neighborhood, the list of the most popular types of venues has been found using the Foursquare API. The types of venues have been sorted by frequency of occurrence for each neighborhood and for each of the 7 identified clusters. These results can help in choosing the living area that best suits your preferences.
Running k-Means clustering based on types of venues did not reveal any discernible pattern. This is understandable, given that the whole idea of clustering is to identify similar items. In this case, the diversity of venue types within each neighborhood is very high. One can find a wide variety of venues near each place, so the resulting classification is often not very telling. Some additional requirements would be needed to narrow down the search further.
After performing exploratory and cluster analysis, I have determined the crime rates, typical rents, and types of venues across the city of Chicago. I have built interactive maps to visualize these analyses and help determine the ideal neighborhood to move to.
Through this project I learned more about the important stages of the data analysis workflow: how to work with open data sources, and how to prepare and clean the data. I also worked through some practical applications of data science methodology and created some nice-looking visualizations with Folium. I feel rewarded by the new skills I learned during the course and this project.