Farmers Markets
LAB17


Welcome

Perhaps you are one to plan trips such that your destination has a good farmers market 🥑 🍎 🌶 🥦 🍠 🧅 🍓. Given a data set of farmers markets, let’s learn how to access, browse, and query data in Python. Time. for. some. data. science.👩🏾‍🔬

Activity

Let’s start by reviewing ways that we can get data into a program.

Today we’ll learn a fourth way: we’ll read data from a file.

Go back and take a look at your geography data from the Carmen Sandiego lab. The data was defined directly inside of Python code, so it was usable only in a Python program. It would have been better if it had come from a file (outside of the program) and been stored in a language-independent format. Many such formats exist. One of the best-known is called CSV. A CSV file for our Carmen Sandiego app would look like:

country,capital,latitude,longitude,region,landmarks
colombia,Bogotá,4.7110,-74.0721,South America,Monserrate|El Peñón de Guatapé|Las Lajas Sanctuary
barbados,Bridgetown,13.0971,-59.6132,Caribbean,Crane Beach|Harrison's Cave|St. Nicholas Abbey
vanuatu,Port Vila,-17.7430,168.3173,Oceania,Mount Yasur|Champagne Beach|Chief Roi Mata's Domain
eswatini,Mbabane,-26.3264,31.1442,Southern Africa,Sibebe Rock|Mlilwane Wildlife Sanctuary|Mantenga Nature Reserve
bhutan,Thimphu,27.4716,89.6386,South Asia,Paro Taktsang|Punakha Dzong|Rinpung Dzong
iceland,Reykjavik,64.1470,-21.9408,Northern Europe,Blue Lagoon|Hallgrímskirkja|Gullfoss Waterfall
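
To get a feel for how a program might consume text like this, here is a rough sketch using only plain string methods. Holding two of the rows in a string (rather than a file) and the names sample, record, and countries are assumptions made for illustration. Splitting on commas works here only because none of these fields contains a comma itself.

```python
# Two rows from the sample above, kept in a string so the example is
# self-contained; a real program would read them from a file.
sample = """country,capital,latitude,longitude,region,landmarks
barbados,Bridgetown,13.0971,-59.6132,Caribbean,Crane Beach|Harrison's Cave|St. Nicholas Abbey
bhutan,Thimphu,27.4716,89.6386,South Asia,Paro Taktsang|Punakha Dzong|Rinpung Dzong
"""

lines = sample.strip().splitlines()
header = lines[0].split(',')

countries = {}
for line in lines[1:]:
    # Pair each header name with the value in the same position.
    record = dict(zip(header, line.split(',')))
    countries[record['country']] = {
        'capital': record['capital'],
        # Every value arrives as a string, so convert the numbers.
        'latitude': float(record['latitude']),
        'longitude': float(record['longitude']),
        'region': record['region'],
        # The landmarks column packs several values into one field.
        'landmarks': record['landmarks'].split('|'),
    }

print(countries['bhutan']['landmarks'])
```

Notice that the landmarks needed a second separator (|) because the comma was already taken.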

But what we really need is data about farmers markets in CSV form.

Create the folder ~/cmsi1010/lab17, make this your current folder, and fetch the free Farmers Market CSV file with curl:

DISCLAIMER

This dataset is for educational purposes only and emphatically does NOT reflect the most current information about farmers markets.

Reading Files

Let’s do some exercises to warm up our skill set in using files.

A good programming skill to develop when learning a new language is to read a file in and print it right back out. Here’s how we do it in Python. Start by creating a new Python file called markets.py in your ~/cmsi1010/lab17 directory with this code:

with open('farmers_markets.csv', 'r') as file:
    for line in file:
        print(line.strip())

Run the program, so you can see the raw form of the CSV file.

The with statement

Python’s with statement wraps the execution of a block with something called a context manager. When using it with opening a file, it ensures that the file is properly closed after the block finishes, even if an error is raised.
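
If you would like to see that closing behavior for yourself, here is a small sketch. It writes a throwaway file in a temporary folder (the file name demo.txt and the variable names are made up for this example) and then checks the file object's closed attribute after each block.

```python
import os
import tempfile

# Write a small throwaway file so the example does not need the lab data.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as file:
    file.write('hello\n')

# The variable still exists after the block, but the file is closed.
with open(path, 'r') as file:
    first_line = file.readline()
closed_after_block = file.closed

# The same holds when the block is cut short by an error.
try:
    with open(path, 'r') as file:
        raise ValueError('simulated failure')
except ValueError:
    closed_after_error = file.closed

print(first_line.strip(), closed_after_block, closed_after_error)
```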

CSV files are sooooooo common in the world that Python provides some built-ins for reading and writing CSV files. Like magic!

Reading CSV Files

To read a CSV file in Python, we can use the built-in csv module. Replace the existing code with:

import csv

with open('farmers_markets.csv', 'r') as file:
    for row in csv.reader(file):
        print(row)

The result looks a bit different. It has the same content, but each row is printed as a Python list. Aha, so the csv.reader parses the comma-separated lines into lists.
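
Why not just split each line on commas ourselves? One reason is that a field may contain a comma, in which case CSV wraps the field in double quotes. The sketch below uses a made-up two-line CSV (the market name is hypothetical, and io.StringIO stands in for a real file) to compare a naive split with csv.reader.

```python
import csv
import io

# A hypothetical CSV in which one field contains a comma,
# so that field is wrapped in double quotes.
sample = 'MarketName,city\n"Fruits, Nuts and More",Springfield\n'

# Naive approach: split every line on commas.
naive_rows = [line.split(',') for line in sample.splitlines()]

# csv.reader understands the quoting rules.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(naive_rows[1])   # the quoted field is wrongly cut in two
print(parsed_rows[1])  # the quoted field stays whole
```

One more detail worth knowing: the csv module documentation recommends opening files with newline='' so that line breaks inside quoted fields are handled correctly.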

Take a look at the output. The first line is:

['FMID', 'MarketName', 'street', 'city', 'County', 'State', 'zip', 'x', 'y', 'Website', 'Facebook', 'Twitter', 'Youtube', 'OtherMedia', 'Organic', 'Tofu', 'Bakedgoods', 'Cheese', 'Crafts', 'Flowers', 'Eggs', 'Seafood', 'Herbs', 'Vegetables', 'Honey', 'Jams', 'Maple', 'Meat', 'Nursery', 'Nuts', 'Plants', 'Poultry', 'Prepared', 'Soap', 'Trees', 'Wine', 'Coffee', 'Beans', 'Fruits', 'Grains', 'Juices', 'Mushrooms', 'PetFood', 'WildHarvested', 'updateTime', 'Location', 'Credit', 'WIC', 'WICcash', 'SFMNP', 'SNAP', 'Season1Date', 'Season1Time', 'Season2Date', 'Season2Time', 'Season3Date', 'Season3Time', 'Season4Date', 'Season4Time']

This is called the header.

Each of the remaining 8,546 lines contains information about a specific farmers market. One of these lines is:

['1001461', 'South Pasadena Farmers Market', 'Meridian Ave. and El Centro Street', 'South Pasadena', 'Los Angeles', 'California', '91030', '-118.1572818', '34.1147385', 'http://www.southpasadenafarmersmarket.org', 'https://www.facebook.com/pages/South-Pasadena-Farmers-Market/130261473671096', '', '', '', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', '3/31/2014 11:23:28 PM', 'Closed-off public street', 'Y', 'Y', 'N', 'Y', 'Y', '03/14/2013 to 11/07/2013', 'Thu: 4:00 PM-8:00 PM;', '11/14/2013 to 03/13/2014', 'Thu: 4:00 PM-7:00 PM;', '', '', '', '']

It’s as if each farmers market is an object with properties named after the header fields. But they’re not, really; they are just lists, which are pretty clunky to search. For example, to find whether or not a market m sells seafood, you would look at the value of m[21]. But that’s silly. How do we know it is in the 22nd position? Because "Seafood" is in the 22nd position of the header. That is, if header is the list from the first row, then header.index('Seafood') == 21. So to find out whether market m sells seafood:

m[header.index('Seafood')]
Exercise: Read up on the index method.
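
Here is one way the pieces might fit together. This is only a sketch: the four-column CSV is a shortened, made-up version of the real file, io.StringIO stands in for the opened file, and next(reader) is used to pull off the first row as the header.

```python
import csv
import io

# A shortened, hypothetical version of the real file.
sample = (
    'FMID,MarketName,State,Seafood\n'
    '1001461,South Pasadena Farmers Market,California,Y\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)    # the first row holds the column names
markets = list(reader)   # every remaining row is one market

m = markets[0]
position = header.index('Seafood')   # where the Seafood column lives
sells_seafood = m[position]

# index raises ValueError when the value is not in the list.
try:
    header.index('Pizza')
    has_pizza_column = True
except ValueError:
    has_pizza_column = False

print(position, sells_seafood, has_pizza_column)
```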

It works, but it is painful. What if instead of lists for each market, we made them dictionaries? It would suck up a lot more memory because the keys are repeated for every row, but it makes our code a lot more readable. Tradeoffs are everywhere in programming. Let’s go for it. We just have to change one thing:

import csv

with open('farmers_markets.csv', 'r') as file:
    for market in csv.DictReader(file):
        print(market)

Try it out. Note that the SouthPas market is no longer just a list but is now a more readable dictionary:

{'FMID': '1001461', 'MarketName': 'South Pasadena Farmers Market', 'street': 'Meridian Ave. and El Centro Street', 'city': 'South Pasadena', 'County': 'Los Angeles', 'State': 'California', 'zip': '91030', 'x': '-118.1572818', 'y': '34.1147385', 'Website': 'http://www.southpasadenafarmersmarket.org', 'Facebook': 'https://www.facebook.com/pages/South-Pasadena-Farmers-Market/130261473671096', 'Twitter': '', 'Youtube': '', 'OtherMedia': '', 'Organic': 'Y', 'Tofu': 'N', 'Bakedgoods': 'Y', 'Cheese': 'Y', 'Crafts': 'N', 'Flowers': 'Y', 'Eggs': 'Y', 'Seafood': 'Y', 'Herbs': 'Y', 'Vegetables': 'Y', 'Honey': 'Y', 'Jams': 'Y', 'Maple': 'N', 'Meat': 'Y', 'Nursery': 'N', 'Nuts': 'Y', 'Plants': 'N', 'Poultry': 'Y', 'Prepared': 'Y', 'Soap': 'Y', 'Trees': 'Y', 'Wine': 'N', 'Coffee': 'N', 'Beans': 'Y', 'Fruits': 'Y', 'Grains': 'N', 'Juices': 'Y', 'Mushrooms': 'Y', 'PetFood': 'N', 'WildHarvested': 'N', 'updateTime': '3/31/2014 11:23:28 PM', 'Location': 'Closed-off public street', 'Credit': 'Y', 'WIC': 'Y', 'WICcash': 'N', 'SFMNP': 'Y', 'SNAP': 'Y', 'Season1Date': '03/14/2013 to 11/07/2013', 'Season1Time': 'Thu: 4:00 PM-8:00 PM;', 'Season2Date': '11/14/2013 to 03/13/2014', 'Season2Time': 'Thu: 4:00 PM-7:00 PM;', 'Season3Date': '', 'Season3Time': '', 'Season4Date': '', 'Season4Time': ''}
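
Look closely at that dictionary: every value is a string, including the coordinates and the Y/N flags. When you need real numbers or booleans, you have to convert them yourself. The helper below, to_typed, is a hypothetical name invented for this sketch, and it converts only three fields to keep things short.

```python
def to_typed(market):
    """Return a copy of a market dictionary with a few fields converted."""
    typed = dict(market)
    # An empty string means the value is missing, so use None instead.
    typed['x'] = float(market['x']) if market['x'] else None
    typed['y'] = float(market['y']) if market['y'] else None
    # Turn the 'Y'/'N' flag into a real boolean.
    typed['Seafood'] = market['Seafood'] == 'Y'
    return typed


# A cut-down version of the South Pasadena row.
sample = {
    'MarketName': 'South Pasadena Farmers Market',
    'x': '-118.1572818',
    'y': '34.1147385',
    'Seafood': 'Y',
}
typed = to_typed(sample)
print(typed['x'], typed['Seafood'])
```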

Shouldn’t we have a Market class?

I suppose, if you were a purist, you might say that dictionary objects aren’t ideal for representing objects like this. But it turns out this is fine in data science type applications, for reasons that an AI chatbot might find for you or hallucinate for you. But seriously, it’s good to be pragmatic, so for this lab, dictionaries it is!

Querying

Looking at our data, you might start thinking of so many possible questions, such as:

Exercise: What other questions come to mind?

Let’s answer these questions in just plain ol’ Python. First, rather than printing each line, we’ll change our code to gather everything up into a list of market dictionaries:

import csv

with open('farmers_markets.csv', 'r') as file:
    markets = []
    for row in csv.DictReader(file):
        markets.append(row)

Oh wait wait wait wait. How about this way? It does exactly the same thing as the previous code.

import csv

with open('farmers_markets.csv', 'r') as file:
    markets = [row for row in csv.DictReader(file)]

That’s right, people. We have to start getting used to comprehensions! (We saw comprehensions in an earlier lab but did not study them in depth. It’s time to learn them for reals and make them part of your natural programming toolbox.) It is important to know what exactly the comprehensions are doing, so that’s why we presented the long, readable, completely understandable for-loops right next to the concise comprehensions.
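
A comprehension can also filter and transform as it goes. Here is a side-by-side sketch using three made-up markets (the data and variable names are invented for illustration). Both versions build the same list.

```python
# Three hypothetical markets, just enough to show the pattern.
markets = [
    {'MarketName': 'Harbor Market', 'State': 'Maine'},
    {'MarketName': 'Valley Market', 'State': 'California'},
    {'MarketName': 'Sunset Market', 'State': 'California'},
]

# The long way: loop, test, transform, append.
names_loop = []
for m in markets:
    if m['State'] == 'California':
        names_loop.append(m['MarketName'].upper())

# The comprehension: transform first, then the loop, then the test.
names_comprehension = [
    m['MarketName'].upper()
    for m in markets
    if m['State'] == 'California'
]

print(names_loop == names_comprehension)
```

Reading tip: the part before for says what goes into the list, and the part after if says which items qualify.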

Let’s build some code to answer some of the questions above. Ask questions during the code-alongs if you need clarification, or ask a chatbot for explanations. Also delight in all the practice you are getting with comprehensions! You see comprehensions a lot in...job interviews!

def california_market_names():
    return [m['MarketName'] for m in markets if m['State'] == 'California']

def alaska_market_names():
    return [m['MarketName'] for m in markets if m['State'] == 'Alaska']

Ooh how about this:

def market_names_in_state(state):
    return [m['MarketName'] for m in markets if m['State'] == state]

More filtering:

def markets_selling_nuts():
    return [m for m in markets if m['Nuts'] == 'Y']


def markets_in_maine_selling_nuts_but_not_seafood():
    return [
        m
        for m in markets
        if m['State'] == 'Maine' and m['Nuts'] == 'Y' and m['Seafood'] == 'N'
    ]


def markets_with_you_tube_west_of_100():
    return [
        (m['MarketName'], m['street'], m['city'], m['zip'])
        for m in markets
        if float(m['x']) < -100 and m['Youtube'] != ''
    ]
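
One thing to watch for in that last function: float('') raises a ValueError, so if any row has a blank x value, the whole query crashes. Here is a more defensive sketch. The helper is_west_of and the function name safe_markets_with_you_tube_west_of_100 are hypothetical, the sample rows are made up, and the list of markets is passed in as a parameter so the example can run on its own.

```python
def is_west_of(market, limit):
    """Return True if the market's longitude is a number less than limit."""
    try:
        return float(market['x']) < limit
    except ValueError:
        # Blank or malformed longitude: treat it as not matching.
        return False


def safe_markets_with_you_tube_west_of_100(markets):
    return [
        (m['MarketName'], m['street'], m['city'], m['zip'])
        for m in markets
        if is_west_of(m, -100) and m['Youtube'] != ''
    ]


# Hypothetical rows, including one with no coordinates.
sample_markets = [
    {'MarketName': 'West Market', 'street': '1 Main St', 'city': 'Reno',
     'zip': '89501', 'x': '-119.8', 'Youtube': 'https://example.com/west'},
    {'MarketName': 'Mystery Market', 'street': '', 'city': '',
     'zip': '', 'x': '', 'Youtube': 'https://example.com/mystery'},
    {'MarketName': 'East Market', 'street': '9 Elm St', 'city': 'Boston',
     'zip': '02108', 'x': '-71.0', 'Youtube': 'https://example.com/east'},
]

print(safe_markets_with_you_tube_west_of_100(sample_markets))
```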

Now to get the counts by state, we need to do some planning. Some thinking. What we want to end up with is a dictionary in which the keys are the states and the values are the counts of markets in those states. It is a super common operation in data analysis, database theory, data science, etc. It’s a thing that has been part of computer science since long before people ever came up with the name “data science” as some kind of separate field of study.

Let’s learn how to group and count. We’ll look first at a direct approach, then a more powerful approach.

Okay, first pass:

def market_counts_by_state():
    state_counts = {}
    for m in markets:
        state = m['State']
        state_counts[state] = state_counts.get(state, 0) + 1
    return state_counts

The expression state_counts.get(state, 0) means get the value in the state_counts dictionary for the given state, or return 0 if the state is not found. If you instead just wrote state_counts[state], it would raise a KeyError if the state was not found. Make sure to always use .get() when accessing dictionary keys that may not exist!
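
Here is a tiny sketch of the difference, using a made-up dictionary. It also shows defaultdict from the collections module, which is another common way to count without worrying about missing keys. Treat it as one option among several, not as the way this lab requires.

```python
from collections import defaultdict

state_counts = {'Maine': 2}

found = state_counts.get('Maine', 0)     # key exists, so we get its value
not_found = state_counts.get('Ohio', 0)  # key is missing, so we get the default

# Square brackets raise KeyError for a missing key.
try:
    state_counts['Ohio']
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Note that get does not add the missing key to the dictionary.
still_missing = 'Ohio' not in state_counts

# Alternative: a defaultdict(int) starts every new key at 0.
counts = defaultdict(int)
for state in ['Maine', 'Ohio', 'Maine']:
    counts[state] += 1

print(found, not_found, raised_key_error, still_missing, dict(counts))
```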

Now to glow up a little. Since grouping and counting are so common, wouldn’t Python come with some kind of powerful built-in mechanism for this? Of course it does! Here’s how it is done:

from collections import Counter

def market_counts_by_state():
    return dict(Counter(m['State'] for m in markets))

Woah, right?
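
A Counter can do more than just count. The sketch below, which uses a short made-up list of states, shows the most_common method for ranking, and the fact that asking a Counter for a missing key gives 0 rather than an error.

```python
from collections import Counter

# A short, hypothetical list of state names.
states = ['California', 'Maine', 'California', 'Alaska', 'California', 'Maine']

counts = Counter(states)

# The two most frequent states, as (state, count) pairs.
top_two = counts.most_common(2)

# A missing key gives 0 instead of raising KeyError.
hawaii = counts['Hawaii']

print(top_two, hawaii)
```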

Histograms!

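We installed matplotlib in the last lab and used it to create visualizations. Now, let’s use it to create a histogram of the number of farmers markets by state. (Strictly speaking this is a bar chart, since states are categories rather than numeric ranges, but we’ll keep calling it a histogram.) Let’s jump right to the code: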

def plot_market_histogram():
    state_counts = market_counts_by_state()
    plt.bar(state_counts.keys(), state_counts.values())
    plt.xticks(rotation=90)
    plt.xlabel('State')
    plt.ylabel('Number of Farmers Markets')
    plt.title('Farmers Markets by State')
    plt.tight_layout()
    plt.show()

To make this work, you will need to add

import matplotlib.pyplot as plt

to the top of your file. Add code to call the function (you know where, right?) and check out the result! If it’s not working, and you’ve checked that your code is correct, make sure you are in your virtual environment.
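
The bars appear in whatever order the dictionary holds the states. If you would rather see them from most markets to fewest, you could sort before plotting. The helper sorted_bars below is a hypothetical name for this sketch, and the counts are made up. It only prepares the data, so it runs without Matplotlib.

```python
def sorted_bars(state_counts):
    """Return (labels, heights) ordered from most markets to fewest."""
    ordered = sorted(
        state_counts.items(),
        key=lambda pair: pair[1],   # sort by the count, not the state name
        reverse=True,
    )
    labels = [state for state, _ in ordered]
    heights = [count for _, count in ordered]
    return labels, heights


# Hypothetical counts, not taken from the real data set.
labels, heights = sorted_bars({'Maine': 2, 'California': 5, 'Alaska': 1})
print(labels, heights)

# Inside plot_market_histogram, these could replace the keys and values:
#     plt.bar(labels, heights)
```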

Pandas

Python does have a lot of powerful built-ins for processing data (some of which you will explore in the challenges), which is why it is popular in data science. But there’s a super popular third-party library called Pandas that makes working with structured data even easier! And it comes with even more features. Here is just a brief taste of what it can do.

But first, as you may have guessed:

  pip install pandas

For this lab, we’ll have two separate programs. So leave your markets.py file as is. This will be your file for practicing with Python built-ins for data science. Make a new file called markets_pandas.py starting with this code:

import pandas as pd
import matplotlib.pyplot as plt

markets = pd.read_csv('farmers_markets.csv')
state_counts = markets['State'].value_counts()

state_counts.plot(kind='bar')
plt.xticks(rotation=90)
plt.xlabel('State')
plt.ylabel('Number of Farmers Markets')
plt.title('Farmers Markets by State')
plt.tight_layout()
plt.show()

We’ll talk about Pandas a bit in class, but not much, as our focus is on programming and not on specific libraries like Pandas. You’ll see a lot more Pandas in the future. You’ll learn that the object we got from reading in the CSV file is a data frame, a powerful data structure for working with tabular data. You’ll also learn that the .value_counts() method is a convenient way to get the frequency of each unique value in a column, and that its result can be plotted directly with Matplotlib by calling .plot(). Pandas makes working with data pretty easy. Make sure, though, as time allows, to practice using it, and dive into documentation and tutorials. The teaching staff will be happy to guide you, too.
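
For comparison with the comprehension-based queries in markets.py, here is a rough sketch of how a filter might look in Pandas. The three-row CSV is made up, io.StringIO stands in for the real file, and the variable names are just for this example. The idea is to build a column of True/False values (often called a mask) and use it to pick rows.

```python
import io

import pandas as pd

# A tiny, hypothetical stand-in for farmers_markets.csv.
sample = """MarketName,State,Nuts,Seafood
Harbor Market,Maine,Y,N
Dockside Market,Maine,Y,Y
Valley Market,California,Y,N
"""

markets = pd.read_csv(io.StringIO(sample))

# Each comparison gives a column of booleans; & combines them row by row.
mask = (
    (markets['State'] == 'Maine')
    & (markets['Nuts'] == 'Y')
    & (markets['Seafood'] == 'N')
)

# Keep only the matching rows and pull out one column as a plain list.
names = markets.loc[mask, 'MarketName'].tolist()

# Counts per state, converted to an ordinary dictionary.
state_counts = markets['State'].value_counts().to_dict()

print(names, state_counts)
```

Note the parentheses around each comparison: they are needed because & binds more tightly than ==.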

Challenges

Now it’s your turn. Here are some ideas for you to extend the activities above:

Further Study

Python is known for making it easy to, so:

Summary

We’ve covered:

  • Reading from files
  • The CSV format
  • The index method
  • csv.DictReader
  • Querying CSV data
  • Histograms with Matplotlib
  • The Pandas library

Recall Practice

Here are some questions useful for your spaced repetition learning. Many of the answers are not found on this page. Some will have popped up in lecture. Others will require you to do your own research.

  1. What are four ways to get input into a Python program?
    input(), sys.argv, reading from files, hardcoded data
  2. What is the CSV format?
    A compact, language-independent way to represent tabular data as plain text.
  3. What does CSV stand for?
    Comma-Separated Values
  4. What statement “wraps” the code to read files?
    with open(filename) as f:
  5. When a file is opened as part of the with-statement, what happens to the file when the block is exited?
    The file is automatically closed.
  6. What are the two ways in the csv module to read a CSV file?
    csv.reader and csv.DictReader
  7. What are the primary advantages and disadvantages of using csv.DictReader?
    Advantages: allows access to columns by name, improves code readability. Disadvantages: slightly slower than csv.reader, uses more memory.
  8. What list comprehension expression is used to gather up all the entries in a CSV file into a list of dictionaries?
    [row for row in csv.DictReader(open(filename))]
  9. What is the simplest way to gather up items and their frequencies?
    Counter(items) from the collections module.
  10. How do you plot a histogram with matplotlib?
    plt.bar(keys, values)
  11. What is the pandas library?
    A data manipulation and analysis library for Python.