This project is for the data wrangling portion of Udacity’s Data Analyst Nanodegree. I take xml data for Greater Atlanta from OpenStreetMap. After cleaning, transforming, and storing the data in a SQL database, I run a few queries to analyze the data.
Wrangle OpenStreetMap Data
In this project, xml data from OpenStreetMap is taken from a location in the world. That xml data is audited and cleaned and then converted to csv. That csv data will then be loaded into a SQL database where it will be queried and explored. Documentation of the process will also be included.
Project Objectives
Gather uncleaned data from OpenStreetMap from somewhere in the world
Clean, Audit, and Transform the data
Convert cleaned data to csv
Convert csv files to SQL database using SQLite
Perform some exploratory analysis on final data in the database
Map Area
Problems Encountered
The whole file that I downloaded from OpenStreeMap was 203MB. I created a 20MB sample file to do most of my preliminary testing. I actually didn't encounter many expected problems in this data set. The following are a list of issues that I did encounter:
Issues determining tag type
Ways that pass through two (or more) counties
Issues determining tag type
Determing the tag type was tricky because it was somewhat open-ended and I had to decide what would be the most informative value that I could use. In solving this problem, I went through the usual workflow of getting the data into the database. I was then able to look through the records in the database and ascertain which values in the key/value pair were really useful for certain elements. For example, I saw the value 'leisure' for the parks key and 'neighborhood' for the place key. I went through, in this manner, and determined what kind of tag types I could ascertain (or were worth ascertaining).
The values that I came up with were stored in the tag_type_list variable and used to programmatically set the tag type in the get_tag_type method. This logic is demonstrated in the code snippet below.
"Ways" that pass through two (or more) counties¶
Some way elements pass through more than one region with a different county name or different zip code. When I'm searching for this information on the database, I would probably like to know both regions that the way passes through, so I decided to create a new field for those compound values. When I search for them, I should see both - which seems more accurate. The code snippet below demonstrates this logic.
Other Auditing Measures
In addition to addressing the tag type and compound values, I did the following:
Basic validation of ways, nodes, way-nodes, and tags
Validated zip code for form and locality
Audited phone numbers to ensure that they met the standardized ###-###-#### format common in Atlanta
Note that problematic entities are added to a dictionary where I can analyze them manually.
Running the project
Setup
Load the necessary Python modules
Group the "way" and "node" elements and their respective tags into lists
Top-level node and way elements - and then their respective tags - are processed in a recursive manner
A list of problems that occured. Note that certain problems (such as zip code problem) were fixed in runtime.
Integrate the list of elements into a csv file
Add the csv files to SQL database
Data Overview
File Sizes
map.osm --- 194MB
sample.osm --- 19MB
nodes.csv --- 8MB
node_tags.csv --- 155KB
ways.csv --- 750KB
way_tags.csv --- 1.3MB
way_nodes.csv --- 2.2MB
atlanta.db --- 108MB
Number of Nodes
Number of Ways
Number of Unique Users
Number of 'Peachtree' and 'Ponce' street names
There are 71 streets in Atlanta that include the name "Peachtree". It's often joked that if you ask anyone for directions, they will always send you down Peachtree. Ponce is also a common street name in the city. I wanted to see how many Peachtrees and Ponce's are in my city map and what are their actual names?
Additional Ideas/Statistics
Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:
Additional Ideas/Statistics
Different neighborhoods define different cultural variations of a city and can give information about what to expect when you visit that city. I wanted to use coordinates to separate the OSM map by different regions and do some exploratory analysis comparing them:
Number of restaurants in each region
Number of bike-friendly roads and paths
Conclusion
Based on my analysis, it seems like the data is in good form as I didn't find many obvious errors. The issues that I did have with the data had more to do with getting it in a desirable format (e.g. phone number, tag types). It might be useful to add a code which represent the neighborhood or district to which an element belongs. This would allow policymakers to use OpenStreetMap in conjunction with other datasets to understand how their district compares to other districts in the city and to make necessary adjustments. Neighborhood boundaries change, however, and are constantly subject to dispute and (gulp) politics. It is, however, a compelling angle given the historically disadvantaged nature of some areas versus others in terms of walkability, food choices, utility placement, etc. Population density and many other factors are not taken into account, so there's also the risk that any analysis with this dataset would be too simplistic.
Another correction would be to add more logic to the 'type' field. In the restaurant example from before, 'restaurant' and 'fast food' nodes could reasonably be aggregated into a single category called 'restaurant' or 'dining'. Just as the neighborhood suggestion, this would be subject to politics and disagreements on what should and shouldn't be aggregated.