Perhaps you are one to plan trips such that your destination has a good farmers market 🥑 🍎 🌶 🥦 🍠🧅 🍓. Given a data set of farmers markets, let’s learn how to access, browse, and query data in Python. Time. for. some. data. science.👩🏾‍🔬
Let’s start by reviewing ways that we can get data into a program.
Hardcode the data directly in the program.
Use the input function to prompt a user for data.
Use command line arguments (sys.argv).
Today we’ll learn a fourth way: we’ll read data from a file.
Go back and take a look at your geography data from the Carmen Sandiego lab. The data was defined directly inside of Python code, so it was usable only in a Python program. It would have been better if it had come from a file (outside of the program) and been stored in a language-independent format. Many such formats exist. One of the best-known is called CSV. A CSV file for our Carmen Sandiego app would look like:
country,capital,latitude,longitude,region,landmarks
colombia,Bogotá,4.7110,-74.0721,South America,Monserrate|El Peñón de Guatapé|Las Lajas Sanctuary
barbados,Bridgetown,13.0971,-59.6132,Caribbean,Crane Beach|Harrison's Cave|St. Nicholas Abbey
vanuatu,Port Vila,-17.7430,168.3173,Oceania,Mount Yasur|Champagne Beach|Chief Roi Mata's Domain
eswatini,Mbabane,-26.3264,31.1442,Southern Africa,Sibebe Rock|Mlilwane Wildlife Sanctuary|Mantenga Nature Reserve
bhutan,Thimphu,27.4716,89.6386,South Asia,Paro Taktsang|Punakha Dzong|Rinpung Dzong
iceland,Reykjavik,64.1470,-21.9408,Northern Europe,Blue Lagoon|Hallgrímskirkja|Gullfoss Waterfall
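As a rough sketch of how a program might consume this format, the snippet below parses two of the rows above with Python’s built-in csv module (covered later in this lab) and splits the landmarks field on the | character. The in-memory string and the geography variable are assumptions made for illustration, not code from the Carmen Sandiego lab.

```python
import csv
import io

# Two rows copied from the sample above, held in a string so the sketch
# runs without needing a file on disk.
SAMPLE = """country,capital,latitude,longitude,region,landmarks
colombia,Bogotá,4.7110,-74.0721,South America,Monserrate|El Peñón de Guatapé|Las Lajas Sanctuary
bhutan,Thimphu,27.4716,89.6386,South Asia,Paro Taktsang|Punakha Dzong|Rinpung Dzong
"""

geography = {}
for row in csv.DictReader(io.StringIO(SAMPLE)):
    geography[row['country']] = {
        'capital': row['capital'],
        # Every CSV value arrives as a string, so convert the numbers.
        'latitude': float(row['latitude']),
        'longitude': float(row['longitude']),
        'region': row['region'],
        # Landmarks share one CSV field, separated by | characters.
        'landmarks': row['landmarks'].split('|'),
    }

print(geography['bhutan']['landmarks'])
```

Using | inside the landmarks field keeps the list from colliding with the commas that separate the columns.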
But what we really need is data about farmers markets in CSV form.
Create the folder ~/cmsi1010/lab17, make this your current folder, and fetch the free Farmers Market CSV file with curl:
mkdir ~/cmsi1010/lab17
cd ~/cmsi1010/lab17
curl -L https://github.com/rtoal/cmsi-1010-classroom/raw/refs/heads/main/labs/farmers_markets.csv -o farmers_markets.csv

DISCLAIMER: This dataset is for educational purposes only and emphatically does NOT reflect the most current information about farmers markets.
Let’s do some exercises to warm up our skill set in using files.
A good programming skill to develop when learning a new language is to read a file in and print it right back out. Here’s how we do it in Python. Start by creating a new Python file called markets.py in your ~/cmsi1010/lab17 directory with this code:
with open('farmers_markets.csv', 'r') as file:
    for line in file:
        print(line.strip())
Run the program, so you can see the raw form of the CSV file.
The with statement: Python’s with statement wraps the execution of a block with something called a context manager. When used to open a file, it ensures that the file is properly closed after the block finishes, even if an error is raised.
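To see the closing behavior for yourself, here is a small sketch that checks the file’s closed attribute inside and after a with block, then writes out roughly the same thing by hand with try/finally. The throwaway demo.txt file and the variable names are assumptions for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as file:
    file.write('hello\n')

# Inside the block the file is open; after the block it is closed.
with open(path, 'r') as file:
    first_line = file.readline().strip()
    open_inside = not file.closed
closed_after = file.closed

# Roughly what the with statement does for a file, written by hand.
manual = open(path, 'r')
try:
    manual_line = manual.readline().strip()
finally:
    manual.close()  # Runs even if the read above raises an error.

print(first_line, open_inside, closed_after, manual.closed)
```

The with version is shorter and makes it hard to forget the close, which is why it is the usual choice for files.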
CSV files are sooooooo common in the world that Python provides some built-ins for reading and writing CSV files. Like magic!
To read a CSV file in Python, we can use the built-in csv module. Replace the existing code with:
import csv

with open('farmers_markets.csv', 'r') as file:
    for row in csv.reader(file):
        print(row)
The result looks a bit different. It has the same content, but each row is printed as a Python list. Aha, so the csv.reader parses the comma-separated lines into lists.
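Why not just call split(',') on each line? A field in a CSV file can itself contain a comma when it is wrapped in double quotes, and csv.reader understands that convention. The sketch below compares the two approaches on a single made-up line; the market name in it is hypothetical and not taken from the dataset.

```python
import csv
import io

# One made-up line in which a quoted field contains a comma.
line = '1,"Main St. Market, East Lot",Springfield\n'

# Naive splitting breaks the quoted field into two pieces.
naive = line.strip().split(',')

# csv.reader keeps the quoted field together and drops the quotes.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The naive list ends up with four pieces while the parsed list has the intended three fields.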
Take a look at the output. The first line is:
['FMID', 'MarketName', 'street', 'city', 'County', 'State', 'zip', 'x', 'y', 'Website', 'Facebook', 'Twitter', 'Youtube', 'OtherMedia', 'Organic', 'Tofu', 'Bakedgoods', 'Cheese', 'Crafts', 'Flowers', 'Eggs', 'Seafood', 'Herbs', 'Vegetables', 'Honey', 'Jams', 'Maple', 'Meat', 'Nursery', 'Nuts', 'Plants', 'Poultry', 'Prepared', 'Soap', 'Trees', 'Wine', 'Coffee', 'Beans', 'Fruits', 'Grains', 'Juices', 'Mushrooms', 'PetFood', 'WildHarvested', 'updateTime', 'Location', 'Credit', 'WIC', 'WICcash', 'SFMNP', 'SNAP', 'Season1Date', 'Season1Time', 'Season2Date', 'Season2Time', 'Season3Date', 'Season3Time', 'Season4Date', 'Season4Time']
This is called the header.
Each of the remaining 8,546 lines contains information about a specific farmers market. One of these lines is:
['1001461', 'South Pasadena Farmers Market', 'Meridan Ave. and El Centro Street', 'South Pasadena', 'Los Angeles', 'California', '91030', '-118.1572818', '34.1147385', 'http://www.southpasadenafarmersmarket.org', 'https://www.facebook.com/pages/South-Pasadena-Farmers-Market/130261473671096', '', '', '', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', '3/31/2014 11:23:28 PM', 'Closed-off public street', 'Y', 'Y', 'N', 'Y', 'Y', '03/14/2013 to 11/07/2013', 'Thu: 4:00 PM-8:00 PM;', '11/14/2013 to 03/13/2014', 'Thu: 4:00 PM-7:00 PM;', '', '', '', '']
It’s as if each farmers market is an object with properties named after the header fields. But they’re not, really; they are just lists, which are pretty clunky to search. For example, to find whether or not a market m sells seafood, you would look at the value of m[21]. But that’s silly. How do we know it is in the 22nd position? Because "Seafood" is in the 22nd position of the header. That is, header.index('Seafood') == 21. So to find out whether market m sells seafood:
m[header.index('Seafood')]
index method.
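Here is a minimal sketch of the header lookup in action, using a tiny in-memory CSV in place of the real file. The four-column layout and the market names are made up for illustration, so the position of Seafood here differs from the real dataset.

```python
import csv
import io

# A tiny stand-in for the real file, with hypothetical markets.
SAMPLE = """MarketName,State,Nuts,Seafood
Hypothetical Harbor Market,Maine,Y,Y
Hypothetical Hilltop Market,Vermont,Y,N
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
header, market_rows = rows[0], rows[1:]

# Look the position up once instead of hardcoding a number.
seafood_position = header.index('Seafood')

for m in market_rows:
    print(m[header.index('MarketName')], m[seafood_position])
```

Because the position comes from the header, the code keeps working even if the columns are reordered.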
It works, but it is painful. What if instead of lists for each market, we made them dictionaries? It would suck up a lot more memory because the keys are repeated for every row, but it makes our code a lot more readable. Tradeoffs are everywhere in programming. Let’s go for it. We just have to change one thing:
import csv

with open('farmers_markets.csv', 'r') as file:
    for market in csv.DictReader(file):
        print(market)
Try it out. Note that the SouthPas market is no longer just a list but is now a more readable dictionary:
{'FMID': '1001461', 'MarketName': 'South Pasadena Farmers Market', 'street': 'Meridian Ave. and El Centro Street', 'city': 'South Pasadena', 'County': 'Los Angeles', 'State': 'California', 'zip': '91030', 'x': '-118.1572818', 'y': '34.1147385', 'Website': 'http://www.southpasadenafarmersmarket.org', 'Facebook': 'https://www.facebook.com/pages/South-Pasadena-Farmers-Market/130261473671096', 'Twitter': '', 'Youtube': '', 'OtherMedia': '', 'Organic': 'Y', 'Tofu': 'N', 'Bakedgoods': 'Y', 'Cheese': 'Y', 'Crafts': 'N', 'Flowers': 'Y', 'Eggs': 'Y', 'Seafood': 'Y', 'Herbs': 'Y', 'Vegetables': 'Y', 'Honey': 'Y', 'Jams': 'Y', 'Maple': 'N', 'Meat': 'Y', 'Nursery': 'N', 'Nuts': 'Y', 'Plants': 'N', 'Poultry': 'Y', 'Prepared': 'Y', 'Soap': 'Y', 'Trees': 'Y', 'Wine': 'N', 'Coffee': 'N', 'Beans': 'Y', 'Fruits': 'Y', 'Grains': 'N', 'Juices': 'Y', 'Mushrooms': 'Y', 'PetFood': 'N', 'WildHarvested': 'N', 'updateTime': '3/31/2014 11:23:28 PM', 'Location': 'Closed-off public street', 'Credit': 'Y', 'WIC': 'Y', 'WICcash': 'N', 'SFMNP': 'Y', 'SNAP': 'Y', 'Season1Date': '03/14/2013 to 11/07/2013', 'Season1Time': 'Thu: 4:00 PM-8:00 PM;', 'Season2Date': '11/14/2013 to 03/13/2014', 'Season2Time': 'Thu: 4:00 PM-7:00 PM;', 'Season3Date': '', 'Season3Time': '', 'Season4Date': '', 'Season4Time': ''}
Shouldn’t we have a Market class? I suppose, if you were a purist, you might say that dictionary objects aren’t ideal for representing objects like this. But it turns out this is fine in data science type applications, for reasons that an AI chatbot might find for you or hallucinate for you. But seriously, it’s good to be pragmatic, so for this lab, dictionaries it is!
Looking at our data, you might start thinking of so many possible questions, such as:
Let’s answer these questions in just plain ol’ Python. First, rather than printing each line, we’ll change our code to gather everything up into a list of market dictionaries:
import csv

with open('farmers_markets.csv', 'r') as file:
    markets = []
    for row in csv.DictReader(file):
        markets.append(row)
Oh wait wait wait wait. How about this way? It does exactly the same thing as the previous code.
import csv

with open('farmers_markets.csv', 'r') as file:
    markets = [row for row in csv.DictReader(file)]
That’s right, people. We have to start getting used to comprehensions! (We saw comprehensions in an earlier lab but did not study them in depth. It’s time to learn them for reals and make them part of your natural programming toolbox.) It is important to know what exactly the comprehensions are doing, so that’s why we presented the long, readable, completely understandable for-loops right next to the concise comprehensions.
Let’s build some code to answer some of the questions above. Ask questions during the code-alongs if you need clarification, or ask a chatbot for explanations. Also delight in all the practice you are getting with comprehensions! You see comprehensions a lot in...job interviews!
def california_market_names():
    return [m['MarketName'] for m in markets if m['State'] == 'California']

def alaska_market_names():
    return [m['MarketName'] for m in markets if m['State'] == 'Alaska']
Ooh how about this:
def market_names_in_state(state):
    return [m['MarketName'] for m in markets if m['State'] == state]
More filtering:
def markets_selling_nuts():
    return [m for m in markets if m['Nuts'] == 'Y']

def markets_in_maine_selling_nuts_but_not_seafood():
    return [
        m for m in markets
        if m['State'] == 'Maine' and m['Nuts'] == 'Y' and m['Seafood'] == 'N'
    ]

def markets_with_you_tube_west_of_100():
    return [
        (m['MarketName'], m['street'], m['city'], m['zip'])
        for m in markets
        if float(m['x']) < -100 and m['Youtube'] != ''
    ]
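One thing to watch in the last function: csv.DictReader hands back every value as a string, and float('') raises a ValueError. This page does not say whether any row in the dataset has an empty x, so treat the following as a defensive sketch only. The to_float and names_west_of helpers and the sample rows are hypothetical.

```python
def to_float(text):
    """Return text as a float, or None when it is blank or not a number."""
    try:
        return float(text)
    except ValueError:
        return None

# Made-up rows shaped like the dictionaries from csv.DictReader.
sample_markets = [
    {'MarketName': 'Hypothetical West Market', 'x': '-118.2'},
    {'MarketName': 'Hypothetical East Market', 'x': '-71.1'},
    {'MarketName': 'Hypothetical Unknown Market', 'x': ''},
]

def names_west_of(markets, longitude):
    names = []
    for m in markets:
        x = to_float(m['x'])
        # Skip rows whose longitude is missing instead of crashing.
        if x is not None and x < longitude:
            names.append(m['MarketName'])
    return names

print(names_west_of(sample_markets, -100))
```

Rows with a missing longitude are silently skipped here; depending on the question you are answering, you might prefer to report them instead.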
Now to get the counts by state, we need to do some planning. Some thinking. What we want to end up with is a dictionary in which the keys are the states and the values are the counts of markets in those states. It is a super common operation in data analysis, database theory, data science, etc. It’s a thing that has been in computer science since long before people ever came up with the name “data science” as some kind of separate field of study.
Let’s learn how to group and count. We’ll look first at a direct approach, then a more powerful approach.
Okay, first pass:
def market_counts_by_state():
    state_counts = {}
    for m in markets:
        state = m['State']
        state_counts[state] = state_counts.get(state, 0) + 1
    return state_counts
The expression state_counts.get(state, 0) means get the value in the state_counts dictionary for the given state, or return 0 if the state is not found. If you instead just wrote state_counts[state], it would raise a KeyError if the state was not found. Make sure to always use .get() when accessing dictionary keys that may not exist!
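The difference between the two lookups is easy to check on a small dictionary. This sketch uses a made-up starting dictionary and list of states purely for illustration.

```python
state_counts = {'Maine': 2}

# .get returns the fallback instead of raising when the key is missing.
missing_with_get = state_counts.get('Alaska', 0)

# Square brackets raise KeyError for a missing key.
try:
    state_counts['Alaska']
    raised = False
except KeyError:
    raised = True

# The counting idiom from above, applied to a made-up list of states.
for state in ['Alaska', 'Maine', 'Alaska']:
    state_counts[state] = state_counts.get(state, 0) + 1

print(missing_with_get, raised, state_counts)
```

Note that .get does not add the missing key to the dictionary; it is the assignment in the loop that creates the entry.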
Now to glow up a little. Since grouping and counting are so common, wouldn’t Python come with some kind of powerful built-in mechanism for this? Of course it does! Here’s how it is done:
from collections import Counter

def market_counts_by_state():
    return dict(Counter(m['State'] for m in markets))
Woah, right?
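A Counter can do a bit more than a plain dictionary if you keep it around instead of converting it right away. As a small sketch on made-up data, its most_common method returns the largest counts first, and looking up a key that was never counted gives 0 instead of an error. The sample_markets list is hypothetical.

```python
from collections import Counter

# Made-up rows; only the State key matters for this sketch.
sample_markets = [
    {'State': 'California'}, {'State': 'Maine'}, {'State': 'California'},
    {'State': 'Alaska'}, {'State': 'California'}, {'State': 'Maine'},
]

counts = Counter(m['State'] for m in sample_markets)

# The two states with the most markets, largest count first.
top_two = counts.most_common(2)
print(top_two)

# A state that never appeared counts as zero rather than raising KeyError.
print(counts['Hawaii'])
```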
We installed matplotlib in the last lab and used it to create visualizations. Now, let’s use it to create a histogram of the number of farmers markets by state. Let’s jump right to the code:
def plot_market_histogram():
    state_counts = market_counts_by_state()
    plt.bar(state_counts.keys(), state_counts.values())
    plt.xticks(rotation=90)
    plt.xlabel('State')
    plt.ylabel('Number of Farmers Markets')
    plt.title('Farmers Markets by State')
    plt.tight_layout()
    plt.show()
To make this work, you will need to add
import matplotlib.pyplot as plt
to the top of your file. Add code to call the function (you know where, right?) and check out the result! If it’s not working, and you’ve checked that your code is correct, make sure you are in your virtual environment.
Python does have a lot of powerful built-ins for processing data (some of which you will explore in the challenges), which is why it is popular in data science. But there’s a super popular third-party library called Pandas that makes working with structured data even easier! And it comes with even more features. Here is just a brief taste of what it can do.
But first, as you may have guessed:
pip install pandas
For this lab, we’ll have two separate programs. So leave your markets.py file as is. This will be your file for practicing with Python built-ins for data science. Make a new file called markets_pandas.py starting with this code:
import pandas as pd
import matplotlib.pyplot as plt

markets = pd.read_csv('farmers_markets.csv')
state_counts = markets['State'].value_counts()
state_counts.plot(kind='bar')
plt.xticks(rotation=90)
plt.xlabel('State')
plt.ylabel('Number of Farmers Markets')
plt.title('Farmers Markets by State')
plt.tight_layout()
plt.show()
We’ll talk about Pandas a bit in class, but not much, as our focus is on programming and not on specific libraries like Pandas. You’ll see a lot more Pandas in the future, and learn that the object we got from reading in the CSV file is a data frame—a powerful data structure for working with tabular data—and that the .value_counts() method is a convenient way to get the frequency of unique values in a column and can be directly sent to Matplotlib via its .plot() method. Pandas makes working with data pretty easy. Make sure, though, as time allows, to practice using it, and dive into documentation and tutorials. The teaching staff will be happy to guide you, too.
Now it’s your turn. Here are some ideas for you to extend the activities above:
Python is known for making it easy to, so:
We’ve covered:
index method
csv.DictReader
Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.
input(), sys.argv, reading from files, hardcoded data
with open(filename) as f:
with-statement, what happens to the file when the block is exited?
csv module to read a CSV file?
csv.reader and csv.DictReader
csv.DictReader?
csv.reader, uses more memory.
[row for row in csv.DictReader(open(filename))]
Counter(items) from the collections module.
plt.bar(keys, values)