Many more ideas/projects are forthcoming. Let’s build some cool stuff together!

OpenStreetMap Data Wrangling Project

29631130-atlanta-georgia-aerial-view.jpg
 

This project is for the data wrangling portion of Udacity’s Data Analyst Nanodegree. I take xml data for Greater Atlanta from OpenStreetMap. After cleaning, transforming, and storing the data in a SQL database, I run a few queries to analyze the data.

 

Wrangle OpenStreetMap Data

In this project, xml data from OpenStreetMap is taken from a location in the world. That xml data is audited and cleaned and then converted to csv. That csv data will then be loaded into a SQL database where it will be queried and explored. Documentation of the process will also be included.

 

Project Objectives


  • Gather uncleaned data from OpenStreetMap from somewhere in the world

  • Clean, Audit, and Transform the data

  • Convert cleaned data to csv

  • Convert csv files to SQL database using SQLite

  • Perform some exploratory analysis on final data in the database

 

Map Area


atlanta.png
 

Problems Encountered


The whole file that I downloaded from OpenStreeMap was 203MB. I created a 20MB sample file to do most of my preliminary testing. I actually didn't encounter many expected problems in this data set. The following are a list of issues that I did encounter:

  • Issues determining tag type

  • Ways that pass through two (or more) counties

 

Issues determining tag type

Determing the tag type was tricky because it was somewhat open-ended and I had to decide what would be the most informative value that I could use. In solving this problem, I went through the usual workflow of getting the data into the database. I was then able to look through the records in the database and ascertain which values in the key/value pair were really useful for certain elements. For example, I saw the value 'leisure' for the parks key and 'neighborhood' for the place key. I went through, in this manner, and determined what kind of tag types I could ascertain (or were worth ascertaining).

The values that I came up with were stored in the tag_type_list variable and used to programmatically set the tag type in the get_tag_type method. This logic is demonstrated in the code snippet below.

Here is a before and after example of the correction


Here is a before and after example of the correction

"Ways" that pass through two (or more) counties

Some way elements pass through more than one region with a different county name or different zip code. When I'm searching for this information on the database, I would probably like to know both regions that the way passes through, so I decided to create a new field for those compound values. When I search for them, I should see both - which seems more accurate. The code snippet below demonstrates this logic.

Here is a before and after example of the correction. When I search for the county that this way passes through in SQL, it returns both, which is technically true.

Here is a before and after example of the correction. When I search for the county that this way passes through in SQL, it returns both, which is technically true.

Other Auditing Measures

In addition to addressing the tag type and compound values, I did the following:

  • Basic validation of ways, nodes, way-nodes, and tags

  • Validated zip code for form and locality

  • Audited phone numbers to ensure that they met the standardized ###-###-#### format common in Atlanta

Note that problematic entities are added to a dictionary where I can analyze them manually.

 

Running the project


Setup

Load the necessary Python modules

Group the "way" and "node" elements and their respective tags into lists

Top-level node and way elements - and then their respective tags - are processed in a recursive manner

A list of problems that occured. Note that certain problems (such as zip code problem) were fixed in runtime.

Screen Shot 2021-07-22 at 10.30.54 AM.png

Integrate the list of elements into a csv file

Add the csv files to SQL database

 

Data Overview


File Sizes

  • map.osm --- 194MB

  • sample.osm --- 19MB

  • nodes.csv --- 8MB

  • node_tags.csv --- 155KB

  • ways.csv --- 750KB

  • way_tags.csv --- 1.3MB

  • way_nodes.csv --- 2.2MB

  • atlanta.db --- 108MB

Number of Nodes

Screen Shot 2021-07-22 at 10.37.54 AM.png

Number of Ways

Screen Shot 2021-07-22 at 10.38.02 AM.png

Number of Unique Users

Screen Shot 2021-07-22 at 10.38.09 AM.png

Number of 'Peachtree' and 'Ponce' street names

There are 71 streets in Atlanta that include the name "Peachtree". It's often joked that if you ask anyone for directions, they will always send you down Peachtree. Ponce is also a common street name in the city. I wanted to see how many Peachtrees and Ponce's are in my city map and what are their actual names?

Screen Shot 2021-07-22 at 10.38.23 AM.png
 

Additional Ideas/Statistics


Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:


Additional Ideas/Statistics

Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:

Downtown

Midtown

Uptown/Buckhead

Number of restaurants in each region

Screen Shot 2021-07-22 at 10.50.34 AM.png

Number of bike-friendly roads and paths

 

Conclusion


Based on my analysis, it seems like the data is in good form as I didn't find many obvious errors. The issues that I did have with the data had more to do with getting it in a desirable format (e.g. phone number, tag types). It might be useful to add a code which represent the neighborhood or district to which an element belongs. This would allow policymakers to use OpenStreetMap in conjunction with other datasets to understand how their district compares to other districts in the city and to make necessary adjustments. Neighborhood boundaries change, however, and are constantly subject to dispute and (gulp) politics. It is, however, a compelling angle given the historically disadvantaged nature of some areas versus others in terms of walkability, food choices, utility placement, etc. Population density and many other factors are not taken into account, so there's also the risk that any analysis with this dataset would be too simplistic.

Another correction would be to add more logic to the 'type' field. In the restaurant example from before, 'restaurant' and 'fast food' nodes could reasonably be aggregated into a single category called 'restaurant' or 'dining'. Just as the neighborhood suggestion, this would be subject to politics and disagreements on what should and shouldn't be aggregated.

 
 

Quick Exploratory Analysis of HIV Prevalence Data

Quick Analysis of the Titanic Passenger Dataset