This project is for the data wrangling portion of Udacity’s Data Analyst Nanodegree. I take xml data for Greater Atlanta from OpenStreetMap. After cleaning, transforming, and storing the data in a SQL database, I run a few queries to analyze the data.

Wrangle OpenStreetMap Data

In this project, xml data from OpenStreetMap is taken from a location in the world. That xml data is audited and cleaned and then converted to csv. That csv data will then be loaded into a SQL database where it will be queried and explored. Documentation of the process will also be included.

Project Objectives

Gather uncleaned data from OpenStreetMap from somewhere in the world
Clean, Audit, and Transform the data
Convert cleaned data to csv
Convert csv files to SQL database using SQLite
Perform some exploratory analysis on final data in the database

Map Area

Problems Encountered

The whole file that I downloaded from OpenStreeMap was 203MB. I created a 20MB sample file to do most of my preliminary testing. I actually didn't encounter many expected problems in this data set. The following are a list of issues that I did encounter:

Issues determining tag type
Ways that pass through two (or more) counties

Issues determining tag type

Determing the tag type was tricky because it was somewhat open-ended and I had to decide what would be the most informative value that I could use. In solving this problem, I went through the usual workflow of getting the data into the database. I was then able to look through the records in the database and ascertain which values in the key/value pair were really useful for certain elements. For example, I saw the value 'leisure' for the parks key and 'neighborhood' for the place key. I went through, in this manner, and determined what kind of tag types I could ascertain (or were worth ascertaining).

The values that I came up with were stored in the tag_type_list variable and used to programmatically set the tag type in the get_tag_type method. This logic is demonstrated in the code snippet below.

Here is a before and after example of the correction

"Ways" that pass through two (or more) counties¶

Some way elements pass through more than one region with a different county name or different zip code. When I'm searching for this information on the database, I would probably like to know both regions that the way passes through, so I decided to create a new field for those compound values. When I search for them, I should see both - which seems more accurate. The code snippet below demonstrates this logic.

Here is a before and after example of the correction. When I search for the county that this way passes through in SQL, it returns both, which is technically true.

Other Auditing Measures

In addition to addressing the tag type and compound values, I did the following:

Basic validation of ways, nodes, way-nodes, and tags
Validated zip code for form and locality
Audited phone numbers to ensure that they met the standardized ###-###-#### format common in Atlanta

Note that problematic entities are added to a dictionary where I can analyze them manually.

Running the project

Setup

Load the necessary Python modules

Group the "way" and "node" elements and their respective tags into lists

Top-level node and way elements - and then their respective tags - are processed in a recursive manner

A list of problems that occured. Note that certain problems (such as zip code problem) were fixed in runtime.

Screen Shot 2021-07-22 at 10.30.54 AM.png

Integrate the list of elements into a csv file

Add the csv files to SQL database

Data Overview

File Sizes

map.osm --- 194MB
sample.osm --- 19MB
nodes.csv --- 8MB
node_tags.csv --- 155KB
ways.csv --- 750KB
way_tags.csv --- 1.3MB
way_nodes.csv --- 2.2MB
atlanta.db --- 108MB

Number of Nodes

Screen Shot 2021-07-22 at 10.37.54 AM.png

Number of Ways

Screen Shot 2021-07-22 at 10.38.02 AM.png

Number of Unique Users

Screen Shot 2021-07-22 at 10.38.09 AM.png

Number of 'Peachtree' and 'Ponce' street names

There are 71 streets in Atlanta that include the name "Peachtree". It's often joked that if you ask anyone for directions, they will always send you down Peachtree. Ponce is also a common street name in the city. I wanted to see how many Peachtrees and Ponce's are in my city map and what are their actual names?

Screen Shot 2021-07-22 at 10.38.23 AM.png

Additional Ideas/Statistics

Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:

Additional Ideas/Statistics

Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:

Downtown

Midtown

Uptown/Buckhead

Number of restaurants in each region

Screen Shot 2021-07-22 at 10.50.34 AM.png

Number of bike-friendly roads and paths

Conclusion

Based on my analysis, it seems like the data is in good form as I didn't find many obvious errors. The issues that I did have with the data had more to do with getting it in a desirable format (e.g. phone number, tag types). It might be useful to add a code which represent the neighborhood or district to which an element belongs. This would allow policymakers to use OpenStreetMap in conjunction with other datasets to understand how their district compares to other districts in the city and to make necessary adjustments. Neighborhood boundaries change, however, and are constantly subject to dispute and (gulp) politics. It is, however, a compelling angle given the historically disadvantaged nature of some areas versus others in terms of walkability, food choices, utility placement, etc. Population density and many other factors are not taken into account, so there's also the risk that any analysis with this dataset would be too simplistic.

Another correction would be to add more logic to the 'type' field. In the restaurant example from before, 'restaurant' and 'fast food' nodes could reasonably be aggregated into a single category called 'restaurant' or 'dining'. Just as the neighborhood suggestion, this would be subject to politics and disagreements on what should and shouldn't be aggregated.

Many more ideas/projects are forthcoming. Let’s build some cool stuff together!

Wrangle OpenStreetMap Data

Project Objectives

Map Area