EXPLORING URBAN DATA WITH MACHINE LEARNING

Urban Mobility Index

Kit Nga Chou | Kirthi Balakrishnan | Michelle Chen | Lizzie Lee
INTRODUCTION

Why Mobility?

The Need for Better Walkability & Transit

One of the most pressing issues in urban mobility today is the dependency on vehicular transportation.

Cities are slowly but surely recognizing the importance of walkability, not only for sustainability, but also as a way to ease growing congestion and shorten commutes for a better quality of life.

A tool that assesses a city's walkability from its street characteristics can be useful to urban planners, especially those working on a city that has neither pre-calculated metrics nor the capacity to produce and work with raw data.

Research Question

How can we use Walkscore.com’s pre-existing datasets of major cities to train a model that can predict the mobility efficiency of any city and/or neighborhood based on its street connectivity & transit density?
METHODOLOGY

Framework + Pipeline

Datasets

Three open-source, API-based datasets used to attempt to reverse-engineer Walkscore.com's methodology

1. Road Maps: image classification with Keras to identify correlation between the visual street network & Walk Scores

2. Bus Stops: neighborhood-wise bus stop location identification & occurrence counts

3. Intersection Nodes: extracting intersection nodes from OpenStreetMap plots & calculating densities for each neighborhood

CITIES

Training Data

Boulder, CO | Ann Arbor, MI | Chicago, IL
Washington, D.C. | New York, NY | San Francisco, CA

Validation Data

Madison, WI | Seattle, WA | Tulsa, OK
DATA PREPARATION

Web Scraping for Existing Walk Scores

using Beautiful Soup

WEB SCRAPING

Extracting Boundaries

using regex & JavaScript via js2py

Python code that accepts a neighborhood URL, finds the encoded polygon in the page source, decodes it, and returns the vertices

JavaScript function that reproduces the Google Maps dynamic API's encoded polygon decoder
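The find-and-decode step above can be sketched in pure Python. This is a minimal illustration, not the project's code: it assumes the page stores the boundary in Google's encoded polyline format, and the `"enpath"` key in the regex is a hypothetical placeholder for whatever marker actually appears in the page source. The decoder follows the published encoded polyline algorithm, so it could stand in for the JavaScript function run through js2py.

```python
import re

# Hypothetical pattern: the real key name in the page source may differ.
HYPOTHETICAL_ENPATH_RE = re.compile(r'"enpath"\s*:\s*"([^"]+)"')


def find_encoded_path(page_source, pattern=HYPOTHETICAL_ENPATH_RE):
    """Return the first encoded polygon string found in the page, or None."""
    match = pattern.search(page_source)
    # Note: backslashes escaped by the page (e.g. in JSON) would need
    # unescaping before decoding; that step is omitted here.
    return match.group(1) if match else None


def decode_polyline(encoded, precision=5):
    """Decode a Google encoded polyline into a list of (lat, lng) vertices."""
    factor = 10 ** precision
    vertices = []
    index = lat = lng = 0
    while index < len(encoded):
        deltas = []
        for _ in range(2):  # latitude delta first, then longitude delta
            shift = result = 0
            while True:
                chunk = ord(encoded[index]) - 63
                index += 1
                result |= (chunk & 0x1F) << shift
                shift += 5
                if chunk < 0x20:  # no continuation bit: value is complete
                    break
            # The lowest bit carries the sign; the rest is the magnitude.
            deltas.append(~(result >> 1) if result & 1 else result >> 1)
        lat += deltas[0]
        lng += deltas[1]
        vertices.append((lat / factor, lng / factor))
    return vertices
```

Each vertex is stored as an offset from the previous one, which is why the decoder keeps running totals for latitude and longitude.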

IMAGE CLASSIFICATION WITH KERAS

URL → EnPath → Polygon

1

Dynamic Google Maps API to Static Image

Static map images do not accept overlaid polygons with holes, which were necessary to extract street data for only a specific boundary, so the dynamic map is written to an HTML file and captured as an image instead

Replace parameters in HTML file & write

Convert written HTML file to PNG (the PNG image is a raw screenshot)

Use Pillow (PIL) package to clean up image

KERAS IMAGE CLASSIFIER: CATEGORICAL MODEL

Two Types of Images Compared

Unprocessed: the PNG image is a raw screenshot

Processed: the image is cleaned up with the Pillow (PIL) package

ISSUES FACED

Overfitting + Low Validation Accuracy

Image classification dropped from methodology

Unprocessed Image, Accuracy: training accuracy increases; validation accuracy fluctuates

Unprocessed Image, Loss: training loss decreases; validation loss fluctuates

Processed Image, Accuracy: training accuracy increases; validation accuracy stagnates

Processed Image, Loss: training loss decreases; validation loss increases

LINEAR REGRESSION DATA PREPARATION

2

Bus Stop Density Mapping

STEP 1

Query

Use Overpass Turbo wizard to generate query

STEP 2

Extract

Use Overpass API to extract points to Python

STEP 3

Count

Use bounding box + count to find number of bus stops

STEP 4

Get Density

Per sq. km of neighborhood area & per 1,000 residents of the neighborhood
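Steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the function names are hypothetical, the query uses the standard OpenStreetMap tag `highway=bus_stop`, and the response is assumed to be Overpass JSON output, where each node element carries `lat` and `lon` fields. Sending the query to an Overpass endpoint is left out so the sketch stays self-contained.

```python
def overpass_bus_stop_query(south, west, north, east):
    """Build an Overpass QL query for bus stop nodes inside a bounding box."""
    return (
        "[out:json];"
        f'node["highway"="bus_stop"]({south},{west},{north},{east});'
        "out;"
    )


def points_from_overpass(response):
    """Extract (lat, lon) pairs from a parsed Overpass JSON response."""
    return [
        (element["lat"], element["lon"])
        for element in response.get("elements", [])
        if element.get("type") == "node"
    ]


def count_in_bbox(points, south, west, north, east):
    """Count the points that fall inside the bounding box."""
    return sum(
        1 for lat, lon in points
        if south <= lat <= north and west <= lon <= east
    )


def densities(count, area_sq_km, population):
    """Return (stops per sq. km, stops per 1,000 residents)."""
    return count / area_sq_km, count / (population / 1000)
```

A bounding box is only an approximation of a neighborhood boundary, so stops just outside an irregular boundary can be counted as inside it.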

3

Intersection Density Mapping

Extracting line plots from OpenStreetMap via the OSMnx package in Python
OSMnx Street Graph → Graph Nodes

All of the city's intersection nodes are extracted and saved to a GeoDataFrame, which is then spatially joined to the polygon GeoDataFrame built from the boundaries decoded from the Google Maps API JavaScript.

The count of nodes within a boundary is used to calculate the neighborhood's node density by area and per 1,000 residents.
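The counting logic behind that spatial join can be illustrated without any geospatial library. The sketch below swaps in a plain ray-casting point-in-polygon test for the GeoDataFrame join, so it is a stand-in and not the method used in the project. The function names, and the area and population inputs, are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is (lat, lon) inside the polygon's vertex ring?"""
    lat, lon = point
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


def intersection_density(nodes, polygon, area_sq_km, population):
    """Count nodes inside the boundary and convert the count to densities."""
    count = sum(1 for node in nodes if point_in_polygon(node, polygon))
    return {
        "count": count,
        "per_sq_km": count / area_sq_km,
        "per_1000_residents": count / (population / 1000),
    }
```

The test treats coordinates as planar, which is a reasonable approximation at neighborhood scale. It does not handle holes in a boundary.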

The dataframe containing density data for both bus stops and intersections is then passed to a predictive model to predict the Walk Score range of a neighborhood.

APPLICATION + ALGORITHM

Prediction Models

3 Clustering Models Attempted

Three different clustering methods were applied after splitting the data into 10 classes based on Walk Score
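One simple way to form the 10 classes is equal-width binning of the 0 to 100 score range. The slides do not say how the classes were cut, so the 10-point bin width and the function name below are assumptions made for illustration.

```python
def walk_score_class(score, n_classes=10, max_score=100):
    """Map a score in [0, max_score] to a class index in [0, n_classes - 1]."""
    width = max_score / n_classes
    # A score equal to max_score is folded into the top class.
    return min(int(score // width), n_classes - 1)
```

With these defaults, a score of 71.07 falls in class 7 and a score of 100 falls in class 9.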

K-MEANS

Clusters split clearly along intersection density

AGGLOMERATIVE CLUSTERING

Clusters split clearly along intersection density

GAUSSIAN MIXTURE

Clustering seems more realistic

PREDICTIVE MODELS

Linear Regression

Raw data shows a diagonal correlation pattern
Parameters
Bus stop and intersection densities per sq. km and per 1,000 residents are used as predictors of Walk Scores


Model Metrics
  • Mean Walk Score: 71.07

  • Root Mean Squared Error: 17.04

  • R-Squared: 0.38

Bus stop & intersection densities explain about 38% of the variance in Walk Scores
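For reference, the two error metrics reported above can be computed as in the sketch below. The function names are hypothetical, and any numbers passed in are the caller's own: the sketch does not reproduce the project's data or results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted scores."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Share of the variance in the observed scores explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

An R-squared of 0.38 means the residual sum of squares is 62% of the total sum of squares, so most of the variation in Walk Scores is left unexplained by the two density measures.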

RESEARCH FINDINGS

Results + Implications

We are confident that with more predictors and a stronger dataset, for instance one that adds cities with diverse neighborhoods and a wider range of walkability, we could predict a city’s Walk Scores more accurately.


Limitations

The training and test datasets differ in how their Walk Scores are distributed

Training Set

Test Set

Improvement Gaps

1. Accuracy: insufficient; needs more data points and more computing power

2. Parameters: the current predictors' R-squared (0.38) is too low; more predictors can be added

3. Scalability: front-end development to accept different kinds of input and return a Walk Score

CONCLUSION

Next Steps

Our ultimate goal in creating a predictive Walk Score is to encourage planners to create dynamic, walkable neighborhoods, which provide health and sustainability benefits and improve connectivity for disadvantaged populations who may not have access to vehicles.

As a next step, we hope to devote more time to classifying each neighborhood by its street pattern, such as grids vs. loops, so that we can test whether street patterns correlate with higher or lower Walk Scores.