Languages above the NYC Subway

ITP student Michelle Johnson (Linguistics) reflects on her independent study project 

  1. Introduction

Languages above the NYC Subway, the project I built for the ITP Independent Study (IS), illustrates how languages are distributed along the New York City Subway system. Ultimately, it is a large, interactive infographic where users can explore which languages are living above each subway station. I initially thought I was going to build a website about the population I’m looking at in my dissertation: Long-Term English Language Learners. I spent my time in Core II exploring ways to make this topic fit within the IS (it’s more of a topic than an idea). At that time, I wanted a digital component to my dissertation and a platform to increase awareness about the population. Once I let go of those external goals, I focused on what I am most interested in: Language and New York City.

Around this time, the New Yorker Magazine put out a very interesting infographic examining income inequality in New York City as viewed along the subway lines (Buchanan, 2013). I immediately wanted to make something similar for languages. What I liked about the New Yorker piece was how simple the idea is. In New York City, the subway defines a second geography. It’s a common sentiment in my part of Brooklyn (Bay Ridge, in the southwest corner) to prefer going to Central Park over the [geographically closer] Williamsburg because it’s a more direct and easier subway ride. The subway system has altered the concept of space in New York City.

Language is also defined by a special relationship to space, albeit a more abstract one. Human language is a unique phenomenon that is separate from human communication [1]. It is notable that learning a first language is almost exclusively conducted through face-to-face interaction, while additional languages can be learned anywhere. Even today, as we look at digital communication technologies and globalization, we are witnessing a breaking down of the barriers to communication, but not to how people learn their first language. Communication is not tied to space, but language is. Communication was only removed from space with the advent of writing, but for most of human history, it has been tied to being within a few yards of one’s interlocutor. Language also defines a cultural community. People who speak the same language (at home) usually identify with that language (Bucholtz & Hall, 2004). In many instances [2], this corresponds with the food people eat, worldview, family structure, how commerce is conducted, and other indicators of a cultural group (Bucholtz & Hall, 2004; Gumperz, 1983).

Every language group with significant representation in New York City immigrated [3]. The languages included in this infographic are those with a significant community in NYC, which means most groups have been represented in New York for at least a generation. Obviously English is the dominant language, and speaking English allows for greater access to opportunity, power, and resources. By looking at languages other than English along the subway lines, it is possible to track which language groups have access to mobility and within that, opportunity (Kim & Garcia, 2014). Almost uniformly, the ends of these maps are more interesting than the middle. Rent gets cheaper at the end, and linguistic minority groups follow cheap rent. This is not to suggest that income and this map have anything to do with each other. That is not what is being represented here, but ultimately, language critically influences education, access to jobs, and economic mobility (and so do a lot of other factors).

  2. Goals

My first goal in this project was to capture the second space of New York City (the subway system) and an element of human communication that is still tied to space: Language. I wanted to show who speaks what where in New York, and by extension, what linguistic neighborhoods are on which transit lines.

My second goal in this project was to show how resources can be distributed with easy transit between them. This is interesting in terms of education because in order to offer dual language education, there must be a high enough population. An alternative to a high population in one particular area is the ability for students and educators to travel between nearby places.

My third goal in this project was to plan and execute a medium-sized digital project. This was also a requirement of the Interactive Technology and Pedagogy Certificate, but equally important, I wanted the experience of building something that could be used by non-academics.

  3. Methodology

In this section, I will discuss both how I intended to do the project and how I actually did it. The plan I had would have been more precise, but in the end, I was not actually sure that the greater precision would be more informative than admitting that the data is inherently abstract.

I made one graph for every language that had at least 2 data points above 5% on a given subway line. There was one exception to this rule. With these criteria, Chinese was represented on all lines except the J, which is a very short line. I included Chinese on the J in order to represent the 3 most popularly spoken languages in the city on all of the lines. New York City is dominated by English, Spanish and Chinese in a very identifiable way, and I wanted to preserve that information.
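
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the criterion, not the process actually used for the project (the graphs were built from spreadsheets); the function name, the `always_include` parameter, and the percentages are all hypothetical.

```python
def languages_to_graph(line_shares, threshold=5.0, min_points=2, always_include=()):
    """Pick the languages that get a graph on one subway line.

    line_shares maps a language name to its percentage of speakers at each
    station on the line. A language qualifies when at least `min_points`
    stations are above `threshold` percent, or when it is listed in
    `always_include` (the exception made for Chinese on the J).
    """
    selected = []
    for language, shares in line_shares.items():
        points_above = sum(1 for share in shares if share > threshold)
        if points_above >= min_points or language in always_include:
            selected.append(language)
    return selected


# Made-up percentages for four stations on a hypothetical line.
example_line = {
    "Spanish": [12.4, 30.1, 8.7, 3.2],
    "Chinese": [1.1, 6.3, 2.0, 4.9],
    "Russian": [0.4, 5.6, 7.2, 1.0],
}

print(languages_to_graph(example_line))
print(languages_to_graph(example_line, always_include=("Chinese",)))
```

In this made-up example, Chinese has only one station above 5%, so it appears only when it is explicitly forced in, which mirrors the exception described above.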

3.1   The Data

The data is based on zipcodes from the 2010 U.S. American Community Survey (ACS). The subway stations were aligned to their zipcodes based on latitude and longitude data. When a station was on the boundary between one zipcode and another, the numbers of speakers from those zipcodes were averaged with a simple mean. Drawing the boundaries based on zipcodes, which are obviously geographically skewed, is meant to represent a community of speakers who share resources (a post office and space) and are vaguely identified by the neighborhood names. This is obviously not an exact measure, but it is, in some way, a cultural measure. Secondly, in much of the rest of the country, a zipcode represents a town or community, and representing the data this way is meant to parallel that.
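
The simple-mean rule for boundary stations can be illustrated with a short Python sketch. It is not the spreadsheet work used for the project; the function name, the placeholder zipcode labels, and the speaker counts are assumptions made up for the example.

```python
from statistics import mean


def station_speakers(station_zipcodes, speakers_by_zipcode):
    """Return the speaker count assigned to a station.

    A station inside a single zipcode takes that zipcode's count. A station
    on the boundary between zipcodes takes the simple mean of their counts.
    """
    return mean(speakers_by_zipcode[zipcode] for zipcode in station_zipcodes)


# Made-up counts of Spanish speakers for two placeholder zipcodes.
spanish_speakers = {"ZIP_A": 4000, "ZIP_B": 10000}

# A station inside ZIP_A, and a station on the ZIP_A / ZIP_B boundary.
print(station_speakers(["ZIP_A"], spanish_speakers))
print(station_speakers(["ZIP_A", "ZIP_B"], spanish_speakers))
```

Matching each station to its zipcode from latitude and longitude is left out of this sketch, since that step depends on the boundary data being used.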

The original plan was to organize the language data based on census tract, using QGIS. I worked on this in the New Media Lab for roughly 6 months. I successfully imported the language data, laid the subway lines and station information on top of it, wrote Python code to calculate distance from a subway point, and figured out an equation to weight the averages. The basic idea was to take the subway stations and create a circle around them and then take the percent of the circle that each tract represented and weight the number of speakers based on that percentage. I could not figure out how to draw an area around the stations, though. Below is an example of the math for Spanish:


  1. p1 = percent of the circle made up by the tract
  2. t1 = total number of speakers in that tract
  3. s1 = number of Spanish speakers in that tract
  4. q1 = p1 * s1 (number of Spanish speakers from that tract)
  5. sraw = Σ qi = q1 + q2 + … + qn (the raw number of Spanish speakers within the radius of the subway station)
  6. t = Σ ti = t1 + t2 + … + tn (total number of people within the radius of the subway station)
  7. s = sraw / t (percent Spanish speakers in the radius around the subway station)
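
The steps above can be sketched in Python as follows. This is a minimal illustration that follows the list exactly as written, not the code from the project; the function name and the tract numbers are made up, and p is assumed to be given as a fraction between 0 and 1 rather than as a percentage.

```python
def share_of_speakers(tracts):
    """Apply the weighting steps listed above to one station.

    tracts is a list of (p, t, s) tuples, one per census tract touched by
    the circle around the station:
      p = fraction of the circle made up by the tract (assumed 0..1)
      t = total number of people in the tract
      s = number of speakers of the language in the tract
    Returns s_raw / t as a fraction; multiply by 100 for a percentage.
    """
    s_raw = sum(p * s for p, _t, s in tracts)  # steps 4 and 5
    t_total = sum(t for _p, t, _s in tracts)   # step 6
    return s_raw / t_total                     # step 7


# Two made-up tracts: half of the circle in one, a quarter in the other.
example_tracts = [(0.5, 1000, 200), (0.25, 2000, 400)]
print(round(100 * share_of_speakers(example_tracts), 2))
```

As listed, the weight p is applied to the speaker counts but not to the tract totals, and the sketch keeps that as is. Applying the same weight to the totals would be one possible variation, but that goes beyond what the steps above describe.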

Even after I figured this out, there were still theoretical problems with this equation. First, just like the zipcode data, one side of a tract can have a higher concentration of speakers of one language than another. Secondly, it would still be impossible to determine who is riding the subway because people take the bus, walk to different stations, work in different places, and do all kinds of things for different reasons. Therefore, the zipcode and tract data could both only describe the neighborhood one is stepping into.

The real problem with this approach, however, was that creating the radiuses around the stations made QGIS crash for 3 weeks. After 6 months of investing time into learning QGIS, I gave up and decided to use zipcode data because I was not any closer than I had been 5 months earlier and there were still theoretical problems with the more precise data.

3.2   The platform/interface

My priority for the interface was to host it on the CUNY Academic Commons (AC), both to connect it to the rest of the work that I do at the Graduate Center and because the mission of the CUNY AC gives me hope for the future of higher education. This priority required me to build it using WordPress. Since I had some familiarity with WordPress, this seemed reasonable. Ultimately, because of a technical issue, this was not possible with the site I built. I then made a new site, but did not like how it looked, so I went with external hosting. More on this technical issue below.

I planned to show the information on line graphs. I wanted to make one line graph for every language/subway line combination that met the criteria I set out above. After hosting on the CUNY AC, my first priority in terms of features was to have hover-over tooltips that would display each station’s exact information. In terms of design, I wanted the colors, font and circles to evoke the feel of the subway. Finally, I received permission from the MTA to use the subway map, so I wanted to incorporate it in some way to give a sense of place based on a constructed system rather than geography. Ideally, I wanted to keep with a circle theme but was open to navigation buttons being rectangular if need be.

3.3   Graphs

Making the graphs have tooltips was going to be my biggest challenge. The tooltips are dependent upon JavaScript, and this did not seem like the time to take on learning it. Initially, I tried building these graphs with Adobe Illustrator, knowing that it’s good for making graphs and showing relationships. Within this program, I really struggled with getting the tooltips to work, so I tried taking Illustrator images and enhancing them with Muse. I actually had some success with Muse and made 2 graphs. The first one took me 15 hours, which seemed fine for making a template. But the second one took me 12 hours since everything had to be adjusted. This seemed like a fine solution for one or two graphs without a lot of data, but I was talking about over 100 graphs. It wasn’t going to work (ultimately this was a valuable lesson in “do it without code” style tools). After going around in circles with these tools for about a month, I went back to the one thing that I know well enough to make things quickly and easily: WordPress plugins.

I went looking for a WordPress plugin that I could make the graphs with, but the free and easily integrated ones did not natively support tooltips. I did not want to purchase one because I wanted to still have control to alter the plugin. I considered altering the plugin to make the tooltips work, but then I found Easy Visualization Tools from Code Canyon for $15. This is what I ended up using since it allowed me to do everything I wanted: tooltips were natively supported, I could use custom colors, and I could set the CSS for batches of graphs at a time. Towards the end of the project, I would regret using this plugin, as I could have used an open source one that I could have altered with a jQuery event. This was probably the most valuable takeaway from this project: the balance of having control of the project versus getting it done quickly should fall on the side of control.

Once I purchased the Code Canyon plugin, I made a template for each line, set the features each graph would have, loaded all the data from my Excel spreadsheets, and cleaned up each graph as needed.

3.4   The site

I wanted the graphs to be viewable by language and by line. To do this, I decided to use the Spun Theme because its homepage allows blog posts to be identified by just a word in a circle, and I altered the theme to allow a pages menu as well. I decided that languages would be in blog posts and subway lines in pages so that navigation by line would be visible from every page/post. I decided to prioritize the lines because people are more likely to have a personal connection to several lines than to several languages, and in my preliminary versions, most people’s first response was to look for lines rather than languages.

I altered the CSS to change the appearance of this theme in a variety of ways. I made small changes to the HTML to change some of the functionality as well: adding a pages menu, changing the navigation, fixing a glitch with the tagline, and linking the title to the homepage. While I was new to the theme, I was familiar enough with the HTML and the CSS from other projects to feel confident in making these changes.

After it was done, I made some changes to make it easier to read, including showing the divisions between one borough and the next. Again, this is a case where I regret using a non-open source plugin since I was not able to include lines on the graphs to show borough boundaries. Ultimately, I decided to add the <Man>, <Bk>, <Bx>, <Qns> tags to each station to indicate the borough. It is not very elegant, but it does display the information without cluttering the graphs.

3.5   Importing

The last stage of this project was to import it to the CUNY AC. Boone Gorges installed the theme for me, and the Commons team bought the Code Canyon plugin so that I could host the project there. I then went to work exporting and uploading the site. There was a problem, though. By building it on my computer, I had created a database for the site that allowed me to title pages with code rather than with just letters and numbers. This made my menu obsolete. On a multisite install, the main WordPress install will always overwrite the .htaccess/mod_rewrite rules and cause problems. So I remade the site without the menu. But too much was lost on that site, so I went with private hosting on A Small Orange. This allows me to have my own site and connect this site to it. For the time being, the Commons site exists, but I will be taking it down shortly and leaving the privately hosted site.

  4. Conclusion

As I stated above, the biggest takeaway from this for me was how limiting a non-open source plugin was. When I bought it, it seemed like a good idea, but it was really a quick fix: it allowed the project to be completed while making some of the goals potentially impossible to achieve. At the same time, I’m not sure I would have understood what needed to be done to make the graphs without using the plugin and trying to alter it as best I was allowed.

Secondly, if I were to do this project again, I would be more explicit with myself about what the end product would look like and what features I would like it to have. I taught a project-based class this semester at Lehman College, and getting my students to do any kind of planning was a major challenge. It was very eye-opening to see myself doing less than thorough planning in my own project at the same time.

Overall, the information can be understood and seen with the way that I’ve built this project, and I learned a lot about planning and executing digital projects. I’ve used this analogy before, but it is a lot like construction: the plans need to be thorough and detailed before any boards get cut.


Buchanan, L. (2013, April 16). Idea of the Week: Inequality and New York’s Subway. The New Yorker Blogs. Retrieved from

Bucholtz, M., & Hall, K. (2004). Language and identity. A Companion to Linguistic Anthropology, 1, 369–394.

Gumperz, J. J. (Ed.). (1983). Language and Social Identity (2nd ed.). Cambridge University Press.

Kim, W. G., & Garcia, S. B. (2014). Long-Term English Language Learners’ Perceptions of Their Language and Academic Learning Experiences. Remedial and Special Education, 0741932514525047. doi:10.1177/0741932514525047

  1. The last speaker of Birked (extinct, 199*, Nilo-Saharan, Sudan) still spoke his language even when there was no one left to speak it with.
  2. This statement is true for languages like Yiddish, Hebrew, Italian, but not true for languages like English, Spanish or Chinese. The issue is more complicated for dialects such as Puerto Rican or Mexican Spanish, but that complication is not what this paper is about.
  3. Which is to say that Munsee (the language spoken by the Lenape in pre-colonial New York) is all but unrepresented within New York City today.