This report documents the design, construction, and presentation of the NYC Taxi Cartographic Atlas, a full-stack web application that ingests, cleans, normalises, indexes, and visualises the NYC Taxi Trip Duration dataset published by the NYC Taxi and Limousine Commission. The application processes 1,458,644 fare records from the first half of 2016 and serves them through a Flask backend, a normalised SQLite database, fourteen REST endpoints, and a single-page browser dashboard rendered as a folded paper map.
New York City's taxi fleet produces one of the largest and most studied urban mobility datasets in the world. The NYC Taxi and Limousine Commission publishes per-trip data that is rich enough to answer questions about congestion, demand, neighbourhood inequality, and the rhythms of city life, but large enough (~1.46 million rows for the first half of 2016 alone) that naive approaches loading the whole file into pandas, linking it to a real geocoder, or dropping it into a default dashboard tool are either slow, ugly, or both.
Our assignment asked us to build a full-stack Urban Mobility Data Explorer that ingests this dataset, cleans it, stores it in a normalised relational database, exposes it through a REST API, and presents it through a browser dashboard with at least three meaningful insights. We chose to interpret the brief literally: an atlas. A folded paper map of the city's taxi traffic, with plates and marginalia and a compass rose built on a Flask + SQLite backend that runs in a single Python process.
An exploratory pass over the raw CSV revealed several systematic data-quality issues that required cleaning. Each was logged in the cleaning_log table with counts and reasons, providing a transparent audit trail.
| Challenge | How we addressed it |
|---|---|
| GPS outliers | Reject pickup or dropoff coordinates outside the NYC bounding box (40.49–40.92°N, 74.27–73.68°W) |
| Impossible durations | Reject trips shorter than 60 seconds or longer than 3 hours (10,800 s) |
| Invalid passenger counts | Reject rows where passenger_count < 1 or > 6 |
| Zero-distance trips | Reject trips with haversine distance below 100 m (pickup ≈ dropoff) |
| Unrealistic distances | Reject trips over 200 km (impossible inside NYC) |
| Impossible speeds | Reject trips whose computed velocity exceeds 150 km/h |
| Mixed datetime formats | Accept both YYYY-MM-DD HH:MM:SS and MM/DD/YYYY HH:MM:SS |
| CSV delimiter variance | Auto-detect comma vs. tab from the header line |
| Missing or malformed fields | Wrap the numeric parse in try/except and count rejections by category |
Three explicit assumptions shaped the cleaning logic. First, we assumed that trips with store_and_fwd_flag = "Y" (about 0.5% of records) carry accurate timestamps despite being logged after the fact, since the device buffered them locally before uploading. Second, we assumed that any computed velocity above 150 km/h represents bad data rather than a physically realistic trip, because NYC streets rarely permit sustained speeds above 60 km/h. Third, we treated trips shorter than 100 metres as GPS jitter rather than genuine very-short rides; the device's positional uncertainty in dense urban canyons is comparable to that distance.
The most surprising pattern we found in the cleaned data is that rush hour is asymmetric: morning rush (7–9 AM) slows traffic by about 11 % vs. off-peak, but evening rush (5–7 PM) slows it by about 14 %. We expected the two peaks to be mirror images; they are not. The evening peak is both deeper and longer, and it coincides with the city's busiest dinner and nightlife windows. This single observation drove our framing of Insight 1 ("the rush-hour tax"), Chapter IV.
The application is a single deployable Flask process serving a static frontend and a JSON API. There are no microservices, no message queues, no caches, and no external databases. Everything runs out of one Python process, against one local SQLite file, behind one HTTP port.
Python 3.11 was chosen for the backend because the standard library alone is sufficient for every cleaning, indexing, and serving operation we needed, no pandas, no NumPy, no sklearn. Flask was chosen over FastAPI because the application is synchronous, there is no streaming requirement, and Flask's footprint is half the size with a simpler deployment story. SQLite was chosen over Postgres because a single 1.46M-row read-mostly dataset does not justify a separate database process; SQLite indexes are competitive with Postgres at this scale and the file-based deployment is trivial.
The frontend uses Chart.js for the eight quantitative plates and a hand-built HTML canvas for the topographic density map. There is no React, no Vue, no build step. The CSS is hand-written. The total uncompressed frontend is roughly 60 KB.
The schema is normalised to third normal form. Trips link to zones via pickup_zone_id and to time_slots via time_slot_id. Zone classification produces a bounded set of thirty-one New York zones, pre-seeded at schema-creation time so that the streaming insert can look up the foreign key inline without a second pass over the data. The time_slots table is pre-populated with all 168 hour-of-week buckets at startup. Anomalies are recorded separately in trip_flags (high speed, long trip, long distance) so that diagnostic queries do not pollute the main fact table.
We chose to perform every cleaning, statistics, and indexing operation with Python's standard library. This made the project portable, the install footprint trivial, and the logic completely transparent. The trade-off is that pandas would have been a few lines shorter; however, our hand-written ETL is faster than equivalent pandas code at this scale because we avoid the overhead of building a DataFrame.
SQLite was chosen because the dataset is read-mostly, fits comfortably on disk, and does not require concurrent writes. Deployment is a single file. The trade-off is that horizontal scaling is impossible, but at 1.46 million rows we are nowhere near needing it. Index-backed queries return in under 50 ms.
The schema creates only the tables at startup; the twelve indexes are created after the bulk insert finishes. Building indexes on an already-populated table is roughly five to ten times faster than maintaining them during the insert, because each row would otherwise trigger twelve B-tree updates instead of one append.
We grouped pickup coordinates into thirty-one lat/lon-bounded zones so trips can be aggregated by area without an external geocoder. The bucket names (Midtown South, Upper East Side, etc.) are informal labels based on where those neighbourhoods appear on a public map of Manhattan, descriptive shortcuts for aggregation, not surveyed boundaries.
We rejected the default dark-theme dashboard look in favour of a folded paper-map aesthetic. Each chart is a "plate" in an atlas, the heatmap is a "topographic density map", and the trip ledger is a "field notebook". The frontend is pure black and white, every distinction made through value, typography, and ornament rather than colour.
Two custom data structures and algorithms were implemented from scratch as required by the assignment specification. They are exposed via the /api/zones endpoint, where the top zones are sorted by trip count and the most common pickup hour is computed by frequency counting.
We implemented an in-place QuickSort with a median-of-three pivot strategy to avoid the worst case on already-sorted or nearly-sorted inputs. The median-of-three picks the median of arr[lo], arr[mid], and arr[hi] as the pivot, which significantly reduces the likelihood of O(n²) behaviour on adversarial inputs.
Approach. Naive QuickSort picks arr[lo] as the pivot, which degrades to O(n²) on already-sorted input exactly the kind of input we hit when ranking zones that arrive partly ordered from the database. Median-of-three picks the median of three elements (first, middle, last) as the pivot, making the partition more balanced in practice. We then partition in place using the Lomuto scheme and recurse on the two halves.
Pseudo-code.
FUNCTION quicksort(arr, key_func):
IF length(arr) ≤ 1:
RETURN arr
FUNCTION _sort(a, lo, hi):
IF lo ≥ hi:
RETURN
// 1. Median-of-three pivot selection
mid ← (lo + hi) / 2
IF key_func(a[lo]) > key_func(a[mid]): swap a[lo], a[mid]
IF key_func(a[lo]) > key_func(a[hi]): swap a[lo], a[hi]
IF key_func(a[mid]) > key_func(a[hi]): swap a[mid], a[hi]
// 2. Move pivot to end
swap a[mid], a[hi]
pivot ← key_func(a[hi])
// 3. Lomuto partition
i ← lo
FOR j FROM lo TO hi - 1:
IF key_func(a[j]) ≤ pivot:
swap a[i], a[j]
i ← i + 1
swap a[i], a[hi]
// 4. Recurse on the two halves
_sort(a, lo, i - 1)
_sort(a, i + 1, hi)
_sort(arr, 0, length(arr) - 1)
RETURN arr
Real implementation:
def quicksort(arr, key_func):
if len(arr) <= 1:
return arr
def _sort(a, lo, hi):
if lo >= hi:
return
mid = (lo + hi) // 2
if key_func(a[lo]) > key_func(a[mid]):
a[lo], a[mid] = a[mid], a[lo]
if key_func(a[lo]) > key_func(a[hi]):
a[lo], a[hi] = a[hi], a[lo]
if key_func(a[mid]) > key_func(a[hi]):
a[mid], a[hi] = a[hi], a[mid]
a[mid], a[hi] = a[hi], a[mid]
pivot_val = key_func(a[hi])
i = lo
for j in range(lo, hi):
if key_func(a[j]) <= pivot_val:
a[i], a[j] = a[j], a[i]
i += 1
a[i], a[hi] = a[hi], a[i]
_sort(a, lo, i - 1)
_sort(a, i + 1, hi)
_sort(arr, 0, len(arr) - 1)
return arr
Complexity analysis.
For finding the most common pickup hour, we implemented a frequency counter from scratch rather than using collections.Counter. The counter is a dict-backed accumulator followed by sorting for ranking.
Approach. The input is first processed in a single pass to count occurrences using a dictionary. The result is then converted into a list of {key, count} pairs and sorted in descending order of count using the custom QuickSort algorithm. This produces a ranked list of the most frequent values.
Pseudo-code.
FUNCTION frequency_count(items):
freq ← empty map (key → integer)
FOR each item IN items:
IF item IN freq:
freq[item] ← freq[item] + 1
ELSE:
freq[item] ← 1
pairs ← empty list
FOR each (k, v) IN freq:
append {key: k, count: v} TO pairs
// Sort pairs by count descending
quicksort(pairs, key = (x → -x.count))
RETURN pairs
Real implementation:
def frequency_count(items):
freq = {}
for item in items:
if item in freq:
freq[item] += 1
else:
freq[item] = 1
pairs = [{"key": k, "count": v} for k, v in freq.items()]
quicksort(pairs, lambda x: -x["count"])
return pairs
Complexity analysis.
The streaming ETL pipeline processes the entire 1.46-million-row dataset in approximately 30 seconds on a modest laptop (~115,000 rows per second). Nine concrete optimisations contribute to this throughput.
| Optimisation | Effect |
|---|---|
| Single-pass streaming, no intermediate row list | Saves ~500 MB of RAM on the full dataset |
csv.reader instead of csv.DictReader | Avoids 1.46 million dict allocations |
Inlined haversine formula with local references to sin/cos/sqrt | ~2× math speedup |
Manual datetime parsing (split + int) instead of strptime | ~15× datetime speedup |
Zeller's congruence for day-of-week, no datetime object construction | Eliminates 1.46 million object allocations |
| All 31 zones pre-seeded at schema time, foreign key looked up inline | Eliminates a full second pass over the data |
Indexes created after bulk load via CREATE INDEX | ~5–10× insert speedup |
SQLite pragmas tuned for bulk insert (synchronous=OFF, journal_mode=MEMORY, 64 MB cache) | ~3× transaction speedup |
| Single explicit transaction wrapping all inserts | Avoids per-row commit overhead |
Three insights emerged from the data that we considered worth surfacing in the dashboard's "Marginalia" sheet. Each is supported by a SQL query against the cleaned dataset and a visual representation in the application.
Derivation. The query buckets every trip into "Rush Hour" or "Off-Peak" using a SQL CASE expression on hour_of_day, then averages velocity and duration over each bucket.
SELECT
CASE
WHEN hour_of_day BETWEEN 7 AND 9
OR hour_of_day BETWEEN 17 AND 19 THEN 'Rush Hour'
ELSE 'Off-Peak'
END AS period,
AVG(speed_kmh) AS avg_speed,
AVG(trip_duration)/60 AS avg_duration_min,
COUNT(*) AS trips
FROM trips
GROUP BY period;
Visual. The diurnal pulse on Plate I (Sheet A) renders all 24 hours of the day as a bar chart with rush-hour bars in solid black and off-peak bars in light gray. The visual signature is unmistakable: two dark "wings" rising above a quieter middle:
Interpretation. During the morning peak (07:00–09:00) the mean trip velocity drops by roughly 11% versus off-peak, and the evening peak (17:00–19:00) is even more punishing, closer to 14% slower with trips averaging nearly 40% longer in duration. The relationship is not symmetric: the morning peak is sharp and brief (commuters arriving on a fixed schedule), while the evening peak is broader and longer (commuters leaving when their day ends, which varies by industry). For an urban mobility planner, this has direct implications: capacity-based interventions like congestion pricing should be priced higher in the evening than the morning, because each minute of evening congestion affects more trips and more total person-minutes than the equivalent morning minute. It also means dispatch algorithms that assume a symmetric peak (the standard textbook model) are systematically miscalibrated against this dataset.
Derivation. The query buckets trips into "Weekend" (Saturday/Sunday) or "Weekday" using a CASE on day_of_week, then computes average distance, duration, velocity, and total trip count for each bucket.
SELECT
CASE WHEN day_of_week >= 5 THEN 'Weekend'
ELSE 'Weekday' END AS type,
AVG(distance_km) AS avg_distance,
AVG(trip_duration)/60 AS avg_duration_min,
AVG(speed_kmh) AS avg_speed,
COUNT(*) AS trips
FROM trips
GROUP BY type;
Visual. Plate V on Sheet B renders the seven days of the week as a bar chart with weekend bars shaded lighter than weekday bars, making the volume gap immediately visible:
Interpretation. Weekend trips are longer in distance, faster in velocity, and concentrated at later hours, while weekday trips cluster around the morning and evening commute. The distance difference is the most diagnostic finding: weekend trips average roughly 10% farther than weekday trips, even though weekday commuters are more likely to live further out. This is consistent with weekend trips serving a different purpose, leisure, shopping, social trips that span longer distances by choice rather than the constrained commute pattern of the weekday. For urban mobility, this implies that weekend taxi demand is qualitatively different from weekday demand, and a service designed exclusively to optimise commute reliability will under-serve weekend users. It also suggests that the late-Friday and Saturday-night peaks (visible on Plate IV's heatmap above 22:00) represent a third demand mode entirely, the nightlife pulse that neither the morning nor the evening commute models capture.
Derivation. The query joins trips to the pre-seeded zones table on the foreign key pickup_zone_id, groups by zone, and orders descending by trip count using our custom QuickSort. Average velocity per zone is computed in the same query. The thirty-one zones are informal lat/lon buckets labelled after where those neighbourhoods appear on a public map of Manhattan and the outer boroughs, they are descriptive shortcuts for aggregation, not surveyed boundaries.
SELECT
z.zone_name,
z.trip_count,
AVG(t.speed_kmh) AS avg_speed
FROM zones z
JOIN trips t ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_id
ORDER BY z.trip_count DESC
LIMIT 5;
Visual. Plate IX on Sheet C renders pickup density by zone as a horizontal bar chart, ranked from highest to lowest. Plate X provides the full thirty-one-row register so a reader can audit the long tail:
| # | Zone | Trips | Share | Mean duration | Mean velocity |
|---|---|---|---|---|---|
| 01 | Midtown West / Times Square | 163,319 | 11.3461% | 13.33 min | 13.69 km/h |
| 02 | East Village / NoHo | 155,797 | 10.8236% | 12.74 min | 12.96 km/h |
| 03 | Midtown East / Grand Central | 134,628 | 9.3529% | 13.33 min | 12.74 km/h |
| 04 | Upper East Side South | 127,564 | 8.8621% | 11.51 min | 14.30 km/h |
| 05 | Midtown South / Flatiron | 107,438 | 7.4640% | 13.32 min | 12.70 km/h |
| 06 | Upper East Side | 107,173 | 7.4455% | 11.75 min | 14.81 km/h |
| 07 | Gramercy / Murray Hill | 92,995 | 6.4606% | 12.85 min | 13.43 km/h |
| 08 | Upper West Side | 85,433 | 5.9352% | 11.65 min | 15.08 km/h |
| 09 | West Village / Meatpacking | 83,819 | 5.8231% | 13.26 min | 13.98 km/h |
| 10 | Stuyvesant / LES North | 69,489 | 4.8276% | 12.24 min | 14.10 km/h |
| 11 | Lower East Side / Chinatown | 56,128 | 3.8993% | 13.86 min | 14.35 km/h |
| 12 | Lower Manhattan / Financial District | 48,084 | 3.3405% | 17.30 min | 15.71 km/h |
| 13 | East Queens | 36,422 | 2.5303% | 30.07 min | 21.63 km/h |
| 14 | Tribeca / SoHo | 33,875 | 2.3534% | 14.70 min | 14.31 km/h |
| 15 | South Queens | 30,911 | 2.1475% | 41.69 min | 28.11 km/h |
| 16 | West Queens | 25,384 | 1.7635% | 13.56 min | 15.98 km/h |
| 17 | Chelsea / Hudson Yards | 21,619 | 1.5019% | 12.66 min | 13.77 km/h |
| 18 | West Brooklyn | 16,174 | 1.1236% | 14.81 min | 16.64 km/h |
| 19 | Morningside Heights | 12,181 | 0.8462% | 12.84 min | 16.33 km/h |
| 20 | North Brooklyn | 8,029 | 0.5578% | 13.27 min | 16.03 km/h |
| 21 | Outer Boroughs | 7,594 | 0.5276% | 13.57 min | 14.69 km/h |
| 22 | East Harlem | 7,372 | 0.5121% | 11.54 min | 16.16 km/h |
| 23 | Harlem | 3,369 | 0.2341% | 13.41 min | 18.70 km/h |
| 24 | Central Brooklyn | 2,187 | 0.1519% | 14.39 min | 17.40 km/h |
| 25 | Washington Heights | 1,114 | 0.0774% | 14.85 min | 21.15 km/h |
| 26 | South Bronx | 507 | 0.0352% | 14.71 min | 17.65 km/h |
| 27 | Central Bronx | 287 | 0.0199% | 14.13 min | 18.78 km/h |
| 28 | South Brooklyn | 273 | 0.0190% | 14.96 min | 20.72 km/h |
| 29 | East Bronx | 136 | 0.0094% | 14.69 min | 20.80 km/h |
| 30 | Inwood | 118 | 0.0082% | 14.21 min | 18.27 km/h |
| 31 | Staten Island | 6 | 0.0004% | 24.22 min | 48.75 km/h |
Interpretation. Pickup density is overwhelmingly concentrated in Manhattan's core. The top three zones, Midtown West / Times Square, East Village / NoHo, and Midtown East / Grand Central account for more than 31.5% of all pickups in the dataset, with the top six Manhattan zones clearing 55% combined. Manhattan as a whole accounts for roughly 94–95% of cleaned pickups, while the Bronx, Brooklyn, and Staten Island combined contribute fewer than 1,500 trips. This is a striking demonstration of the Pareto distribution in urban mobility: a tiny geographic area generates the overwhelming majority of demand. For a real urban mobility planner this has three implications. First, supply must follow density, idle taxis should be repositioned toward the Midtown corridor and East Village axis rather than spread evenly across boroughs. Second, the outer boroughs are under-served by yellow cabs, which is the well-documented gap that ride-hailing services were designed to fill. Third, pricing experiments should be A/B tested on the dense top-6 Manhattan zones first, where statistical power is highest, before being rolled out city-wide.
Building this application taught us as much about restraint as it did about engineering. The hardest decisions were not about what to add but about what to leave out. We could have added a join with weather data, an interactive map with real geographic coordinates, a machine-learning model to predict trip duration, or a streaming pipeline that ingested live TLC data. We chose not to, because each addition would have widened the surface area of the project past what we could maintain confidently in the time available.
The single most impactful technical decision was pre-seeding the zones and time_slots tables before the ETL loop begins. This let us look up the foreign key inline during the streaming insert, eliminating an entire second pass over 1.46 million rows. It is the reason the pipeline finishes in 30 seconds instead of 90.
Six concrete improvements we would pursue if we had another sprint: