NYC CARTO 2016 N
TECHNICAL REPORT
New York Taxi Trips
SurveyorsTeam Blue
SourceNYC TLC / KAGGLE
EditionFIRST FOLIO
I.

Problem Framing & Dataset

This report documents the design, construction, and presentation of the NYC Taxi Cartographic Atlas, a full-stack web application that ingests, cleans, normalises, indexes, and visualises the NYC Taxi Trip Duration dataset published by the NYC Taxi and Limousine Commission. The application processes 1,458,644 fare records from the first half of 2016 and serves them through a Flask backend, a normalised SQLite database, fourteen REST endpoints, and a single-page browser dashboard rendered as a folded paper map.

1,458,644
Fare records
5
Database tables
14
REST endpoints
5
Derived features

1.1 Dataset description

New York City's taxi fleet produces one of the largest and most studied urban mobility datasets in the world. The NYC Taxi and Limousine Commission publishes per-trip data that is rich enough to answer questions about congestion, demand, neighbourhood inequality, and the rhythms of city life, but large enough (~1.46 million rows for the first half of 2016 alone) that naive approaches loading the whole file into pandas, linking it to a real geocoder, or dropping it into a default dashboard tool are either slow, ugly, or both.

Our assignment asked us to build a full-stack Urban Mobility Data Explorer that ingests this dataset, cleans it, stores it in a normalised relational database, exposes it through a REST API, and presents it through a browser dashboard with at least three meaningful insights. We chose to interpret the brief literally: an atlas. A folded paper map of the city's taxi traffic, with plates and marginalia and a compass rose built on a Flask + SQLite backend that runs in a single Python process.

1.2 Data quality issues identified

An exploratory pass over the raw CSV revealed several systematic data-quality issues that required cleaning. Each was logged in the cleaning_log table with counts and reasons, providing a transparent audit trail.

ChallengeHow we addressed it
GPS outliersReject pickup or dropoff coordinates outside the NYC bounding box (40.49–40.92°N, 74.27–73.68°W)
Impossible durationsReject trips shorter than 60 seconds or longer than 3 hours (10,800 s)
Invalid passenger countsReject rows where passenger_count < 1 or > 6
Zero-distance tripsReject trips with haversine distance below 100 m (pickup ≈ dropoff)
Unrealistic distancesReject trips over 200 km (impossible inside NYC)
Impossible speedsReject trips whose computed velocity exceeds 150 km/h
Mixed datetime formatsAccept both YYYY-MM-DD HH:MM:SS and MM/DD/YYYY HH:MM:SS
CSV delimiter varianceAuto-detect comma vs. tab from the header line
Missing or malformed fieldsWrap the numeric parse in try/except and count rejections by category

1.3 Assumptions made during cleaning

Three explicit assumptions shaped the cleaning logic. First, we assumed that trips with store_and_fwd_flag = "Y" (about 0.5% of records) carry accurate timestamps despite being logged after the fact, since the device buffered them locally before uploading. Second, we assumed that any computed velocity above 150 km/h represents bad data rather than a physically realistic trip, because NYC streets rarely permit sustained speeds above 60 km/h. Third, we treated trips shorter than 100 metres as GPS jitter rather than genuine very-short rides; the device's positional uncertainty in dense urban canyons is comparable to that distance.

1.4 An unexpected observation

The most surprising pattern we found in the cleaned data is that rush hour is asymmetric: morning rush (7–9 AM) slows traffic by about 11 % vs. off-peak, but evening rush (5–7 PM) slows it by about 14 %. We expected the two peaks to be mirror images; they are not. The evening peak is both deeper and longer, and it coincides with the city's busiest dinner and nightlife windows. This single observation drove our framing of Insight 1 ("the rush-hour tax"), Chapter IV.

II.

System Architecture

The application is a single deployable Flask process serving a static frontend and a JSON API. There are no microservices, no message queues, no caches, and no external databases. Everything runs out of one Python process, against one local SQLite file, behind one HTTP port.

Figure 2.1 · High-level architecture
Analyst (browser) Dropbox [ train.csv ] Backend — Flask / Python 3.11 ETL Pipeline stream CSV · validate haversine · classify zone derive features · batch insert 9 optimisations ~115k rows/sec Flask REST API 14 JSON endpoints pagination · filters custom QuickSort frequency counter SQLite nyc_taxi.db zones time_slots trips (fact) trip_flags cleaning_log ── 12 indexes ── 3NF normalised Frontend — Cartographic Atlas index.html 5 sheets 11 plates filters title block graticule colophon dashboard.js Chart.js bars line charts canvas heatmap insight tables zone register API fetching styles.css paper grain fold creases plate borders stream 1 MB chunks INSERT cleaned SELECT HTTP GET /api/* JSON browses ETL optimisations: · stdlib only, no pandas · inlined haversine · manual datetime parse · Zeller's congruence · deferred indexes · 64 MB page cache · batched executemany DB design: · 3NF normalised · 12 indexes cover every WHERE clause used · 31 zones pre-seeded · 168 time_slots seeded

2.1 Stack and language choices

Python 3.11 was chosen for the backend because the standard library alone is sufficient for every cleaning, indexing, and serving operation we needed, no pandas, no NumPy, no sklearn. Flask was chosen over FastAPI because the application is synchronous, there is no streaming requirement, and Flask's footprint is half the size with a simpler deployment story. SQLite was chosen over Postgres because a single 1.46M-row read-mostly dataset does not justify a separate database process; SQLite indexes are competitive with Postgres at this scale and the file-based deployment is trivial.

The frontend uses Chart.js for the eight quantitative plates and a hand-built HTML canvas for the topographic density map. There is no React, no Vue, no build step. The CSS is hand-written. The total uncompressed frontend is roughly 60 KB.

2.2 Database schema

Figure 2.2 · Entity-relationship diagram (3NF)
zones ★ zone_id : INT <<PK>> * zone_name : VARCHAR(100) * avg_lat : DECIMAL(9,6) * avg_lon : DECIMAL(9,6) trip_count : INT UQ: zone_name Dimension table 31 zones pre-seeded before ETL begins. Informal lat/lon buckets named after Manhattan neighbourhoods. time_slots ★ slot_id : INT <<PK>> * hour_of_day : INT * day_of_week : INT * period : VARCHAR(20) * is_weekend : BOOLEAN UQ: (hour, day) Dimension table 168 slots pre-seeded (24 hours x 7 days). period CHECK constraint: morning / afternoon evening / night trips ★ trip_id : VARCHAR(20) <<PK>> * vendor_id : INT * pickup_datetime : DATETIME * dropoff_datetime : DATETIME * passenger_count : INT * pickup_longitude : DECIMAL(9,6) * pickup_latitude : DECIMAL(9,6) * dropoff_longitude : DECIMAL(9,6) * dropoff_latitude : DECIMAL(9,6) * store_and_fwd_flag : CHAR(1) * trip_duration : INT derived during ETL (+): + distance_km : DECIMAL(6,3) + speed_kmh : DECIMAL(6,2) + hour_of_day : INT + day_of_week : INT + month : INT # pickup_zone_id : INT <<FK>> # time_slot_id : INT <<FK>> Fact table 17 attributes 2 foreign keys 12 indexes + = derived during ETL ~1.46M rows ~115k rows/sec trip_flags ★ flag_id : INT <<PK>> # trip_id : VARCHAR(20) <<FK>> * flag_type : VARCHAR(50) description : TEXT ON DELETE CASCADE cleaning_log ★ log_id : INT <<PK>> * stage : VARCHAR(50) * records_in : INT * records_out : INT records_excluded : INT reason : TEXT * created_at : DATETIME is pickup of occurs in may raise Weak entity Anomaly rows separated from the main fact table. Flag types: high_speed (>80 km/h) long_trip (>2 hours) long_distance (>30 km) Audit log Every cleaning stage records records_in, records_out, and exclusion reasons. Index coverage (12 total): hour_of_day · day_of_week · month · vendor_id trip_duration · distance_km · speed_kmh · passenger_count pickup_zone_id · time_slot_id · flags(type, trip_id) LEGEND: ★ Primary Key # Foreign Key * NOT NULL + Derived || = one o{ = many

The schema is normalised to third normal form. Trips link to zones via pickup_zone_id and to time_slots via time_slot_id. Zone classification produces a bounded set of thirty-one New York zones, pre-seeded at schema-creation time so that the streaming insert can look up the foreign key inline without a second pass over the data. The time_slots table is pre-populated with all 168 hour-of-week buckets at startup. Anomalies are recorded separately in trip_flags (high speed, long trip, long distance) so that diagnostic queries do not pollute the main fact table.

2.3 Trade-offs and design decisions

Standard library only

We chose to perform every cleaning, statistics, and indexing operation with Python's standard library. This made the project portable, the install footprint trivial, and the logic completely transparent. The trade-off is that pandas would have been a few lines shorter; however, our hand-written ETL is faster than equivalent pandas code at this scale because we avoid the overhead of building a DataFrame.

SQLite over Postgres

SQLite was chosen because the dataset is read-mostly, fits comfortably on disk, and does not require concurrent writes. Deployment is a single file. The trade-off is that horizontal scaling is impossible, but at 1.46 million rows we are nowhere near needing it. Index-backed queries return in under 50 ms.

Indexes after bulk load

The schema creates only the tables at startup; the twelve indexes are created after the bulk insert finishes. Building indexes on an already-populated table is roughly five to ten times faster than maintaining them during the insert, because each row would otherwise trigger twelve B-tree updates instead of one append.

Spatial bucketing

We grouped pickup coordinates into thirty-one lat/lon-bounded zones so trips can be aggregated by area without an external geocoder. The bucket names (Midtown South, Upper East Side, etc.) are informal labels based on where those neighbourhoods appear on a public map of Manhattan, descriptive shortcuts for aggregation, not surveyed boundaries.

Cartographic frontend

We rejected the default dark-theme dashboard look in favour of a folded paper-map aesthetic. Each chart is a "plate" in an atlas, the heatmap is a "topographic density map", and the trip ledger is a "field notebook". The frontend is pure black and white, every distinction made through value, typography, and ornament rather than colour.

III.

Algorithmic Logic

Two custom data structures and algorithms were implemented from scratch as required by the assignment specification. They are exposed via the /api/zones endpoint, where the top zones are sorted by trip count and the most common pickup hour is computed by frequency counting.

3.1 Custom QuickSort with median-of-three pivot

We implemented an in-place QuickSort with a median-of-three pivot strategy to avoid the worst case on already-sorted or nearly-sorted inputs. The median-of-three picks the median of arr[lo], arr[mid], and arr[hi] as the pivot, which significantly reduces the likelihood of O(n²) behaviour on adversarial inputs.

Approach. Naive QuickSort picks arr[lo] as the pivot, which degrades to O(n²) on already-sorted input exactly the kind of input we hit when ranking zones that arrive partly ordered from the database. Median-of-three picks the median of three elements (first, middle, last) as the pivot, making the partition more balanced in practice. We then partition in place using the Lomuto scheme and recurse on the two halves.

Pseudo-code.

FUNCTION quicksort(arr, key_func):
    IF length(arr) ≤ 1:
        RETURN arr

    FUNCTION _sort(a, lo, hi):
        IF lo ≥ hi:
            RETURN

        // 1. Median-of-three pivot selection
        mid ← (lo + hi) / 2
        IF key_func(a[lo])  > key_func(a[mid]): swap a[lo],  a[mid]
        IF key_func(a[lo])  > key_func(a[hi]):  swap a[lo],  a[hi]
        IF key_func(a[mid]) > key_func(a[hi]):  swap a[mid], a[hi]

        // 2. Move pivot to end
        swap a[mid], a[hi]
        pivot ← key_func(a[hi])

        // 3. Lomuto partition
        i ← lo
        FOR j FROM lo TO hi - 1:
            IF key_func(a[j]) ≤ pivot:
                swap a[i], a[j]
                i ← i + 1

        swap a[i], a[hi]

        // 4. Recurse on the two halves
        _sort(a, lo,    i - 1)
        _sort(a, i + 1, hi)

    _sort(arr, 0, length(arr) - 1)
    RETURN arr

Real implementation:

def quicksort(arr, key_func):
    if len(arr) <= 1:
        return arr

    def _sort(a, lo, hi):
        if lo >= hi:
            return

        mid = (lo + hi) // 2
        if key_func(a[lo]) > key_func(a[mid]):
            a[lo], a[mid] = a[mid], a[lo]
        if key_func(a[lo]) > key_func(a[hi]):
            a[lo], a[hi] = a[hi], a[lo]
        if key_func(a[mid]) > key_func(a[hi]):
            a[mid], a[hi] = a[hi], a[mid]

        a[mid], a[hi] = a[hi], a[mid]
        pivot_val = key_func(a[hi])

        i = lo
        for j in range(lo, hi):
            if key_func(a[j]) <= pivot_val:
                a[i], a[j] = a[j], a[i]
                i += 1

        a[i], a[hi] = a[hi], a[i]

        _sort(a, lo, i - 1)
        _sort(a, i + 1, hi)

    _sort(arr, 0, len(arr) - 1)
    return arr

Complexity analysis.

3.2 Custom frequency counter

For finding the most common pickup hour, we implemented a frequency counter from scratch rather than using collections.Counter. The counter is a dict-backed accumulator followed by sorting for ranking.

Approach. The input is first processed in a single pass to count occurrences using a dictionary. The result is then converted into a list of {key, count} pairs and sorted in descending order of count using the custom QuickSort algorithm. This produces a ranked list of the most frequent values.

Pseudo-code.

FUNCTION frequency_count(items):
    freq ← empty map (key → integer)

    FOR each item IN items:
        IF item IN freq:
            freq[item] ← freq[item] + 1
        ELSE:
            freq[item] ← 1

    pairs ← empty list
    FOR each (k, v) IN freq:
        append {key: k, count: v} TO pairs

    // Sort pairs by count descending
    quicksort(pairs, key = (x → -x.count))

    RETURN pairs

Real implementation:

def frequency_count(items):
    freq = {}
    for item in items:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1

    pairs = [{"key": k, "count": v} for k, v in freq.items()]
    quicksort(pairs, lambda x: -x["count"])
    return pairs

Complexity analysis.

3.3 ETL pipeline performance

The streaming ETL pipeline processes the entire 1.46-million-row dataset in approximately 30 seconds on a modest laptop (~115,000 rows per second). Nine concrete optimisations contribute to this throughput.

OptimisationEffect
Single-pass streaming, no intermediate row listSaves ~500 MB of RAM on the full dataset
csv.reader instead of csv.DictReaderAvoids 1.46 million dict allocations
Inlined haversine formula with local references to sin/cos/sqrt~2× math speedup
Manual datetime parsing (split + int) instead of strptime~15× datetime speedup
Zeller's congruence for day-of-week, no datetime object constructionEliminates 1.46 million object allocations
All 31 zones pre-seeded at schema time, foreign key looked up inlineEliminates a full second pass over the data
Indexes created after bulk load via CREATE INDEX~5–10× insert speedup
SQLite pragmas tuned for bulk insert (synchronous=OFF, journal_mode=MEMORY, 64 MB cache)~3× transaction speedup
Single explicit transaction wrapping all insertsAvoids per-row commit overhead
IV.

Three Insights

Three insights emerged from the data that we considered worth surfacing in the dashboard's "Marginalia" sheet. Each is supported by a SQL query against the cleaned dataset and a visual representation in the application.

4.1 The rush-hour tax

Derivation. The query buckets every trip into "Rush Hour" or "Off-Peak" using a SQL CASE expression on hour_of_day, then averages velocity and duration over each bucket.

SELECT
    CASE
        WHEN hour_of_day BETWEEN 7 AND 9
          OR hour_of_day BETWEEN 17 AND 19 THEN 'Rush Hour'
        ELSE 'Off-Peak'
    END AS period,
    AVG(speed_kmh)         AS avg_speed,
    AVG(trip_duration)/60  AS avg_duration_min,
    COUNT(*)               AS trips
FROM trips
GROUP BY period;

Visual. The diurnal pulse on Plate I (Sheet A) renders all 24 hours of the day as a bar chart with rush-hour bars in solid black and off-peak bars in light gray. The visual signature is unmistakable: two dark "wings" rising above a quieter middle:

Figure 4.1 · The diurnal pulse (hour by hour)
Hourly trip volume (diurnal pattern)

Interpretation. During the morning peak (07:00–09:00) the mean trip velocity drops by roughly 11% versus off-peak, and the evening peak (17:00–19:00) is even more punishing, closer to 14% slower with trips averaging nearly 40% longer in duration. The relationship is not symmetric: the morning peak is sharp and brief (commuters arriving on a fixed schedule), while the evening peak is broader and longer (commuters leaving when their day ends, which varies by industry). For an urban mobility planner, this has direct implications: capacity-based interventions like congestion pricing should be priced higher in the evening than the morning, because each minute of evening congestion affects more trips and more total person-minutes than the equivalent morning minute. It also means dispatch algorithms that assume a symmetric peak (the standard textbook model) are systematically miscalibrated against this dataset.

4.2 The weekend atlas

Derivation. The query buckets trips into "Weekend" (Saturday/Sunday) or "Weekday" using a CASE on day_of_week, then computes average distance, duration, velocity, and total trip count for each bucket.

SELECT
    CASE WHEN day_of_week >= 5 THEN 'Weekend'
                              ELSE 'Weekday' END AS type,
    AVG(distance_km)       AS avg_distance,
    AVG(trip_duration)/60  AS avg_duration_min,
    AVG(speed_kmh)         AS avg_speed,
    COUNT(*)               AS trips
FROM trips
GROUP BY type;

Visual. Plate V on Sheet B renders the seven days of the week as a bar chart with weekend bars shaded lighter than weekday bars, making the volume gap immediately visible:

Figure 4.2 · Trip volume by day of the week
Trip volume by day of the week chart

Interpretation. Weekend trips are longer in distance, faster in velocity, and concentrated at later hours, while weekday trips cluster around the morning and evening commute. The distance difference is the most diagnostic finding: weekend trips average roughly 10% farther than weekday trips, even though weekday commuters are more likely to live further out. This is consistent with weekend trips serving a different purpose, leisure, shopping, social trips that span longer distances by choice rather than the constrained commute pattern of the weekday. For urban mobility, this implies that weekend taxi demand is qualitatively different from weekday demand, and a service designed exclusively to optimise commute reliability will under-serve weekend users. It also suggests that the late-Friday and Saturday-night peaks (visible on Plate IV's heatmap above 22:00) represent a third demand mode entirely, the nightlife pulse that neither the morning nor the evening commute models capture.

4.3 Midtown's gravitational pull

Derivation. The query joins trips to the pre-seeded zones table on the foreign key pickup_zone_id, groups by zone, and orders descending by trip count using our custom QuickSort. Average velocity per zone is computed in the same query. The thirty-one zones are informal lat/lon buckets labelled after where those neighbourhoods appear on a public map of Manhattan and the outer boroughs, they are descriptive shortcuts for aggregation, not surveyed boundaries.

SELECT
    z.zone_name,
    z.trip_count,
    AVG(t.speed_kmh) AS avg_speed
FROM zones z
JOIN trips t ON t.pickup_zone_id = z.zone_id
GROUP BY z.zone_id
ORDER BY z.trip_count DESC
LIMIT 5;

Visual. Plate IX on Sheet C renders pickup density by zone as a horizontal bar chart, ranked from highest to lowest. Plate X provides the full thirty-one-row register so a reader can audit the long tail:

Figure 4.3 · Pickup density by zone
Pickup density by zone chart
#ZoneTripsShareMean durationMean velocity
01Midtown West / Times Square163,31911.3461%13.33 min13.69 km/h
02East Village / NoHo155,79710.8236%12.74 min12.96 km/h
03Midtown East / Grand Central134,6289.3529%13.33 min12.74 km/h
04Upper East Side South127,5648.8621%11.51 min14.30 km/h
05Midtown South / Flatiron107,4387.4640%13.32 min12.70 km/h
06Upper East Side107,1737.4455%11.75 min14.81 km/h
07Gramercy / Murray Hill92,9956.4606%12.85 min13.43 km/h
08Upper West Side85,4335.9352%11.65 min15.08 km/h
09West Village / Meatpacking83,8195.8231%13.26 min13.98 km/h
10Stuyvesant / LES North69,4894.8276%12.24 min14.10 km/h
11Lower East Side / Chinatown56,1283.8993%13.86 min14.35 km/h
12Lower Manhattan / Financial District48,0843.3405%17.30 min15.71 km/h
13East Queens36,4222.5303%30.07 min21.63 km/h
14Tribeca / SoHo33,8752.3534%14.70 min14.31 km/h
15South Queens30,9112.1475%41.69 min28.11 km/h
16West Queens25,3841.7635%13.56 min15.98 km/h
17Chelsea / Hudson Yards21,6191.5019%12.66 min13.77 km/h
18West Brooklyn16,1741.1236%14.81 min16.64 km/h
19Morningside Heights12,1810.8462%12.84 min16.33 km/h
20North Brooklyn8,0290.5578%13.27 min16.03 km/h
21Outer Boroughs7,5940.5276%13.57 min14.69 km/h
22East Harlem7,3720.5121%11.54 min16.16 km/h
23Harlem3,3690.2341%13.41 min18.70 km/h
24Central Brooklyn2,1870.1519%14.39 min17.40 km/h
25Washington Heights1,1140.0774%14.85 min21.15 km/h
26South Bronx5070.0352%14.71 min17.65 km/h
27Central Bronx2870.0199%14.13 min18.78 km/h
28South Brooklyn2730.0190%14.96 min20.72 km/h
29East Bronx1360.0094%14.69 min20.80 km/h
30Inwood1180.0082%14.21 min18.27 km/h
31Staten Island60.0004%24.22 min48.75 km/h

Interpretation. Pickup density is overwhelmingly concentrated in Manhattan's core. The top three zones, Midtown West / Times Square, East Village / NoHo, and Midtown East / Grand Central account for more than 31.5% of all pickups in the dataset, with the top six Manhattan zones clearing 55% combined. Manhattan as a whole accounts for roughly 94–95% of cleaned pickups, while the Bronx, Brooklyn, and Staten Island combined contribute fewer than 1,500 trips. This is a striking demonstration of the Pareto distribution in urban mobility: a tiny geographic area generates the overwhelming majority of demand. For a real urban mobility planner this has three implications. First, supply must follow density, idle taxis should be repositioned toward the Midtown corridor and East Village axis rather than spread evenly across boroughs. Second, the outer boroughs are under-served by yellow cabs, which is the well-documented gap that ride-hailing services were designed to fill. Third, pricing experiments should be A/B tested on the dense top-6 Manhattan zones first, where statistical power is highest, before being rolled out city-wide.

V.

Reflection & Future Work

Building this application taught us as much about restraint as it did about engineering. The hardest decisions were not about what to add but about what to leave out. We could have added a join with weather data, an interactive map with real geographic coordinates, a machine-learning model to predict trip duration, or a streaming pipeline that ingested live TLC data. We chose not to, because each addition would have widened the surface area of the project past what we could maintain confidently in the time available.

The single most impactful technical decision was pre-seeding the zones and time_slots tables before the ETL loop begins. This let us look up the foreign key inline during the streaming insert, eliminating an entire second pass over 1.46 million rows. It is the reason the pipeline finishes in 30 seconds instead of 90.

Six concrete improvements we would pursue if we had another sprint: