Using Python to Access Web Data: Week 2 Assignment

Regular Expressions (Chapter 11)

1. Which of the following regular expressions would extract 'uct.ac.za' from this string using re.findall?

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

  • @(\S+)
  • F.+:
  • @\S+
  • ..@\S+..
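
To compare the four candidate patterns, each can be run against the sample line with re.findall. This is a minimal sketch; the names `line`, `patterns`, and `results` are illustrative and not part of the quiz.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# The four candidate patterns from the question
patterns = [r"@(\S+)", r"F.+:", r"@\S+", r"..@\S+.."]

# Map each pattern to the list that re.findall returns for the sample line
results = {p: re.findall(p, line) for p in patterns}

for pattern, found in results.items():
    print(pattern, "->", found)
```

When a pattern contains parentheses, re.findall returns only the parenthesized part of each match, which is why the first pattern leaves out the @ sign while the third keeps it.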

2. Which of the following is the way we match the "start of a line" in a regular expression?

  • ^ (Answer)
  • str.startswith()
  • \linestart
  • String.startsWith()
  • variable[0:1]

3. What would the following mean in a regular expression? [a-z0-9]

  • Match an entire line as long as it is lowercase letters or digits
  • Match any number of lowercase letters followed by any number of digits
  • Match anything but a lowercase letter or digit
  • Match a lowercase letter or a digit
  • Match any text that is surrounded by square braces
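
A bracketed set such as [a-z0-9] stands for exactly one character. The short sketch below illustrates this on a made-up string (`text` is an assumption, not part of the quiz), and also shows the negated form with a leading ^ inside the brackets.

```python
import re

text = "Ab3-z!"

# Each match is a single character drawn from a-z or 0-9
single = re.findall("[a-z0-9]", text)
print(single)   # ['b', '3', 'z']

# A leading ^ inside the brackets negates the set
negated = re.findall("[^a-z0-9]", text)
print(negated)  # ['A', '-', '!']
```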

4. What is the type of the return value of the re.findall() method?

  • A string
  • A boolean
  • A single character
  • A list of strings
  • An integer

5. What is the "wild card" character in a regular expression (i.e., the character that matches any character)?

  • ^
  • *
  • . (Answer)
  • $
  • +
  • ?

6. What is the difference between the "+" and "*" characters in regular expressions?

  • The “+” matches at least one character and the “*” matches zero or more characters
  • The “+” matches upper case characters and the “*” matches lowercase characters
  • The “+” matches the beginning of a line and the “*” matches the end of a line
  • The “+” matches the actual plus character and the “*” matches any character
  • The “+” indicates “start of extraction” and the “*” indicates the “end of extraction”
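
One way to see the difference between "+" and "*" is to test both against an empty string. The sketch below uses re.fullmatch, which succeeds only if the whole string matches; the variable names are illustrative.

```python
import re

# "+" requires at least one digit, so an empty string does not match
plus_on_empty = re.fullmatch("[0-9]+", "")

# "*" accepts zero digits, so an empty string does match
star_on_empty = re.fullmatch("[0-9]*", "")

# Both accept a run of one or more digits
plus_on_digits = re.fullmatch("[0-9]+", "2008")
star_on_digits = re.fullmatch("[0-9]*", "2008")

print(plus_on_empty is None)       # True
print(star_on_empty is not None)   # True
print(plus_on_digits.group(), star_on_digits.group())  # 2008 2008
```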

7. What does the "[0-9]+" match in a regular expression?

  • Any number of digits at the beginning of a line
  • Any mathematical expression
  • Zero or more digits
  • One or more digits
  • Several digits followed by a plus sign

8. What does the following Python sequence print out?

import re

x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

  • From:
  • ^F.+:
  • ['From: Using the :']
  • ['From:']
  • :
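
Running the sequence from question 8 shows the greedy behaviour of ".+": it extends to the last colon on the line. As a sketch for comparison, the same pattern with a "?" added after the "+" stops at the first colon. The names `greedy` and `non_greedy` are illustrative.

```python
import re

x = 'From: Using the : character'

# Greedy: ".+" grabs as much as possible, so the match ends at the last colon
greedy = re.findall('^F.+:', x)
print(greedy)      # ['From: Using the :']

# Non-greedy: ".+?" grabs as little as possible, so the match ends at the first colon
non_greedy = re.findall('^F.+?:', x)
print(non_greedy)  # ['From:']
```

In both cases re.findall returns a list of strings, even when there is only one match.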

9. What character do you add to the "+" or "*" to indicate that the match is to be done in a non-greedy manner?

  • ? (Answer)
  • $
  • **
  • ++
  • \g
  • ^

10. Given the following line of text:

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

What would the regular expression '\S+?@\S+' match?

  • \@\
  • From
  • marquard@uct
  • d@uct.ac.za
  • stephen.marquard@uct.ac.za
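
The pattern in question 10 can be checked directly. In this sketch the non-greedy "\S+?" still has to begin at the leftmost position where a match is possible, so it cannot skip characters at the start of the address. The names `line` and `matches` are illustrative.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# "\S+?" expands one non-blank character at a time until it reaches "@";
# the match still starts at the first non-blank run that leads up to "@"
matches = re.findall(r'\S+?@\S+', line)
print(matches)  # ['stephen.marquard@uct.ac.za']
```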

Extracting Data With Regular Expressions

Finding Numbers in a Haystack

In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.

Data Files

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

  • Sample data: http://py4e-data.dr-chuck.net/regex_sum_42.txt (There are 90 values with a sum=445833)
  • Actual data: http://py4e-data.dr-chuck.net/regex_sum_1913240.txt (There are 72 values and the sum ends with 641)

These links open in a new window. Make sure to save the file into the same folder as you will be writing your Python program. Note: Each student will have a distinct data file for the assignment - so only use your own data file for analysis.

Data Format

The file contains much of the text from the introduction of the textbook except that random numbers are inserted throughout the text. Here is a sample of the output you might see:

Why should you learn to write programs? 7746 12 1929 8827 Writing programs (or programming) is a very creative 7 and rewarding activity. You can write programs for many reasons, ranging from making your living to solving 8837 a difficult data analysis problem to having fun to helping 128 someone else solve a problem. This book assumes that everyone needs to know how to program ...

The sum for the sample text above is 27486. The numbers can appear anywhere in the line. There can be any number of numbers in each line (including none).

Handling The Data

The basic outline of this problem is to read the file, look for integers using the re.findall(), looking for a regular expression of '[0-9]+' and then converting the extracted strings to integers and summing up the integers.
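
The outline under Handling The Data can be checked against the sample text quoted in the assignment, whose stated sum is 27486. The sketch below works line by line; the line breaks in `sample_lines` are an assumption made for illustration, and the variable names are not part of the assignment.

```python
import re

# Sample text from the assignment; the line breaks are assumed for illustration
sample_lines = [
    "Why should you learn to write programs? 7746",
    "12 1929 8827",
    "Writing programs (or programming) is a very creative",
    "7 and rewarding activity. You can write programs for",
    "many reasons, ranging from making your living to solving",
    "8837 a difficult data analysis problem to having fun to helping 128",
    "someone else solve a problem. This book assumes that",
    "everyone needs to know how to program ...",
]

total = 0
for line in sample_lines:
    # findall returns a list of digit strings, or an empty list if the line has none
    for number in re.findall('[0-9]+', line):
        total += int(number)

print(total)  # 27486
```

Lines without numbers contribute nothing because re.findall returns an empty list for them, so no special case is needed.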
Turn in Assignment

Enter the sum from the actual data and your Python code below:

Sum: (ends with 641)

Python code:

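import re
import urllib.request

# Data file for this assignment (each student has a distinct file)
url = "http://py4e-data.dr-chuck.net/regex_sum_1913240.txt"
response = urllib.request.urlopen(url)
data = response.read().decode()

# Extract every run of digits as a string
numbers = re.findall('[0-9]+', data)

# Convert the strings to integers and add them up
total = sum(int(number) for number in numbers)

print("Sum:", total)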
