12/01/2104
At 15:56:41, AsianAvenger wrote:
FreedomFighter69, JudgeDoom, SneakyPete, Jason440953, you’re nothing but a pack of racists. Let the light of righteousness shine upon your squalid little viper’s nest!
At 16:40:17, BigTom44 wrote:
Well, here we frickin’ go.
At 16:58:42, FreedomFighter69 wrote:
Racist? For killing Hitler? WTF?
At 17:12:52, SaucyAussie wrote:
AsianAvenger, you’re not rehashing that whole Nagasaki issue again, are you? We just got everyone calmed down from last time.
At 17:22:37, LadyJustice wrote:
I’m with SaucyAussie. AsianAvenger, you’re making even less sense than usual. What gives?
At 18:56:09, AsianAvenger wrote:
What gives is everyone’s repeated insistence on a course of action which, even if successful, would only save a few million Europeans. It would be no more trouble to travel to Fuyuanshui, China, in 1814 and kill Hong Xiuquan, thus preventing the Taiping Rebellion of the mid–nineteenth century and saving fifty million lives in the process. But, hey, what are fifty million yellow devils more or less, right, guys? We’ve got Poles and Frenchmen to worry about.
International Association of Time Travelers: Members’ Forum
Subforum: Europe – Twentieth Century – Second World War, Page 263
It’s probably obvious from the last few entries here that I’m attempting to remake myself as a Python programmer. It isn’t a terribly hard task; in fact, the language in and of itself is quite fun to learn and use. Some non-obvious links I’ve found to be of use:
David Goodger’s Style Guide is great. At this point I don’t feel like I’m writing idiomatic Python, though I feel I’ve mostly mastered the syntax, so grasping common idioms is key right now.
urllib2 - The Missing Manual mostly lives up to its own hype in helping to explain a few of the weirder parts of urllib and urllib2, the libraries for retrieving remote HTML and XML pages from the web.
And finally this page has several nice examples of dealing with dates and stuff.
Totally Irrelevant Update: While clicking around David Goodger’s site I discovered in his resume this line:
JOHN ABBOTT COLLEGE CEGEP, Ste. Anne de Bellevue, Québec, Canada.
Pure and Applied Sciences.
He didn’t put any years on there, so I’m going to assume we were actually classmates ;-).
Another Totally Irrelevant Update: I love this sentence, seen in another Python style guide (about comments):
“When writing English, Strunk and White apply.” Though since they’re referring to a book, rather than to all the pronouncements of Strunk and White, shouldn’t it be “applies”?
Like I can’t think of a good simile… Anyway, after seeing a Python toolkit for the digg API, and more importantly a comment about how a generator-based toolkit would be really useful, I’ve started to build the latter. And it’s true: generators make fetching multiple pages from the API so effortless that the real concern is making sure you don’t accidentally suck down way more data than you need and get blocked by the people who monitor digg API usage. I’ve currently got a nice class to retrieve stories; I’m going to do the rest of the datatypes and add some bells and whistles before I release it. But I thought I would just show how nice the code that calls the API is.
import datetime
import time

# Digg is the class from my toolkit (not released yet)
d = Digg()
s = d.stories()
i = 0
try:
    for j in range(40):
        story = s.next()
        i += 1
        print i, story
except StopIteration:
    pass
print 'Read %d out of %d items' % (i, d.total)

# using keyword arguments
start = int(time.mktime(datetime.datetime(2008, 4, 30, 0, 0, 0).timetuple()))
stop = int(time.mktime(datetime.datetime(2008, 4, 30, 1, 0, 0).timetuple()))
# endpoint is /stories/ + what's specified below
s = d.stories(endpoint='popular', kwargs={'min_submit_date': start, 'max_submit_date': stop})
# count the items actually read (the loop index alone would be off by one)
i = 0
try:
    for j in range(10):
        story = s.next()
        i += 1
        print i, story
except StopIteration:
    pass
print 'Read %d out of %d items' % (i, d.total)
These are actually two examples: the first retrieves from the /stories endpoint, and the second from the /stories/popular endpoint using two keyword arguments (min_submit_date and max_submit_date). s.next() gets the data and hides all the complexity of getting data from the API page by page. If instead of “for j in range(40):” you had “while (1):”, then that little bit of code would proceed to suck down the 382,682 stories currently available at the /stories endpoint. And that’s cool…
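To make the page-by-page idea concrete, here is a minimal sketch of how a generator can hide paging. It is written in current Python 3 and is not the toolkit's actual code: fetch_page(), FAKE_STORIES and PAGE_SIZE are hypothetical stand-ins for the remote API, so nothing here touches the network.

```python
import itertools

# Hypothetical stand-in for the remote API: 95 fake stories, 25 per page.
FAKE_STORIES = ["story-%d" % n for n in range(95)]
PAGE_SIZE = 25


def fetch_page(offset, count=PAGE_SIZE):
    """Pretend to call the API; return (total available, one page of items)."""
    return len(FAKE_STORIES), FAKE_STORIES[offset:offset + count]


def stories():
    """Yield stories one at a time, fetching a new page only when needed."""
    offset = 0
    while True:
        total, page = fetch_page(offset)
        if not page:
            return  # no more data: iteration ends cleanly
        for story in page:
            yield story
        offset += len(page)


# islice caps how much is pulled, so a careless loop cannot drain the API.
first_forty = list(itertools.islice(stories(), 40))
print(len(first_forty), first_forty[-1])
```

Capping the iteration with itertools.islice plays the same role as the “for j in range(40):” loop: the generator only fetches as many pages as the caller actually consumes.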
Disclaimer
My pollution photo continues to get a lot of play, featured today on a nowpublic story. The story is about how Canada ranks second last in a study of “consumer habits” related to the environment. It made me smile when I discovered that the U.S. was last, because there’s that whole Canadian inferiority thing which would probably (if this were a sports story) result in lots of lusty cheering because we beat the goddamn Americans and that’s all that counts!
I loved this book. In 1993 it clearly set out an agenda (the importance of citations) that foreshadowed the rise of search engines. Just suddenly popped into my head tonight - it was written by a senior reference librarian at the Library of Congress.
Probably dated now, but influential without most people knowing about it.
After that last post, I’d like to point out that trying to post code to WordPress (not wordpress.com) is a fiasco. I tried various combinations of pre and code tags, either inserting the opening and closing tags at the right place or selecting the text and clicking on the code button (not using the WYSIWYG editor, because I have already abandoned it over just this issue). It kept adding an extra end tag or (seemingly the best I could get) it generated the monstrosity in the title.
For what it’s worth.
Here’s the code for those images. Having something of a scientific background, I’m inclined to think that I’m stepping ahead of those other graphers I mentioned by exposing my methodology (ha!) to public scrutiny. Open source is really just “peer review” in disguise. The program spits out data to stdout, which can be piped to a file, and creates the 3 plots as PNGs. As the comments say, it needs a little bit more work to handle the DST switch properly. BeautifulSoup is probably overkill since I only parse one attribute out of one tag. The program takes a couple of hours to run, mostly because of the need to be polite: it queries the digg servers for each of 672 hourly intervals, with a 15-second delay after each one.
from BeautifulSoup import BeautifulStoneSoup
from rpy import *
import os
import urllib
import datetime
import time
import sys
# please change
digg_appkey= urllib.quote('http://example.com/example', '')
class AppURLopener(urllib.FancyURLopener):
    # userAgent string (please change)
    version = 'FillYourBots'
# San Francisco is the center of the universe
os.environ['TZ'] = 'US/Pacific'
time.tzset()
urllib._urlopener = AppURLopener()
endpoint_upcoming = 'http://digg.com/tools/services?endPoint=/stories&type=xml'
endpoint_promoted = 'http://digg.com/tools/services?endPoint=/stories/popular&type=xml'
# it's currently May, so we're dealing with DST for the next few months.
# this will need updating before the times switch
# 4 weeks (30 days ago to 2 days ago)
start_date = datetime.datetime.now() - datetime.timedelta(days=30)
# Zero out minutes and seconds
start_date = datetime.datetime(start_date.year, start_date.month, start_date.day,
start_date.hour, 0, 0)
# 2 days in the past to allow for promotion
end_date = datetime.datetime.now() - datetime.timedelta(days=2)
hourly_totals = [0] * 24
hourly_prom_totals = [0] * 24
# build each table separately: a chained "a = b = [...]" would make both
# names point at the same list and mix upcoming and promoted counts
dayhour_totals = [[0] * 24 for i in range(7)]
dayhour_prom_totals = [[0] * 24 for i in range(7)]
day_totals = [0] * 7
day_prom_totals = [0] * 7
while start_date < end_date:
    interval_start = time.mktime(start_date.timetuple())
    interval_end = interval_start + 3600
    url = '%s&appkey=%s&min_submit_date=%d&max_submit_date=%d' % (
        endpoint_upcoming, digg_appkey, interval_start, interval_end)
    # fetch each URL only once per interval
    d = BeautifulStoneSoup(urllib.urlopen(url).read())
    promoted_url = '%s&appkey=%s&min_submit_date=%d&max_submit_date=%d' % (
        endpoint_promoted, digg_appkey, interval_start, interval_end)
    dp = BeautifulStoneSoup(urllib.urlopen(promoted_url).read())
    # the only thing parsed is the "total" attribute of the <stories> tag
    total = int(d.stories['total'])
    promoted = int(dp.stories['total'])
    data_line = (start_date.strftime('%Y/%m/%d %H:%M'),
                 total,
                 promoted,
                 float(promoted)/total*100,
                 )
    sys.stderr.write('Date: %s Total Upcoming: %d Total Promoted: %d %12.10f\n' % data_line)
    print data_line
    # need hour and weekday in PST/PDT
    hour = start_date.hour
    # isoweekday - 1 means 0=Mon, 6=Sun
    weekday = start_date.isoweekday()-1
    print weekday, hour, interval_start
    hourly_totals[hour] += total
    hourly_prom_totals[hour] += promoted
    dayhour_totals[weekday][hour] += total
    dayhour_prom_totals[weekday][hour] += promoted
    day_totals[weekday] += total
    day_prom_totals[weekday] += promoted
    start_date = start_date + datetime.timedelta(hours=1)
    # be polite
    time.sleep(15)
print 'Hourly Totals:'
for i in range(24):
    print i, hourly_prom_totals[i], hourly_totals[i], \
        float(hourly_prom_totals[i])/hourly_totals[i]*100
# do some plotting
hourly_outfile = 'hours.png'
x = range(24)
# iterate over the hour indices, not over the counts themselves
y = [float(hourly_prom_totals[p])/hourly_totals[p]*100 for p in x]
r.bitmap(hourly_outfile, res=200)
xlabels = [ "%d" % (i,) for i in x ]
ylabels = [0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]
r.barplot(y, xlab="Hour", ylab="POP (%)", names_arg=xlabels, ylim=(0, 1.5),
    main="Digg.com POP% By Hour (Pacific Time)")
print 'Daily Totals:'
for i in range(7):
    print i, day_totals[i], day_prom_totals[i], float(day_prom_totals[i])/day_totals[i]*100
x = range(7)
# iterate over the weekday indices, not over the counts themselves
y = [float(day_prom_totals[p])/day_totals[p]*100 for p in x]
print day_totals
print day_prom_totals
r.bitmap('days.png', res=200)
xlabels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
r.barplot(y, xlab="Day", ylab="POP (%)", names_arg=xlabels, ylim=(0, 1.5),
    main="Digg.com POP% By Day (Pacific Time)")
print 'Day-hour Totals:'
y = []
for i in range(7):
    for j in range(24):
        print i, j, dayhour_prom_totals[i][j], dayhour_totals[i][j], \
            float(dayhour_prom_totals[i][j])/dayhour_totals[i][j]*100
        y.append(float(dayhour_prom_totals[i][j])/dayhour_totals[i][j]*100)
r.bitmap('dayhours.png', res=200)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
xlabels = ['%s %2d:00' % (days[i], j) for i in range(7) for j in range(24)]
r.barplot(y, xlab="Day, Hour", ylab="POP (%)", names_arg=xlabels, ylim=(0, 2.5),
    main="Digg.com POP% By Day/Hour (Pacific Time)")
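For anyone who wants to poke at the bucketing step without hammering the digg servers, here is a minimal sketch in current Python 3 that runs on made-up numbers. The sample counts and the pop() helper are assumptions for illustration only, not output from the program above.

```python
import datetime

# Made-up sample: (start of the submission hour, total submitted, total promoted).
samples = [
    (datetime.datetime(2008, 4, 28, 9, 0), 400, 4),    # a Monday
    (datetime.datetime(2008, 4, 28, 10, 0), 500, 10),  # same Monday
    (datetime.datetime(2008, 5, 3, 13, 0), 200, 6),    # a Saturday
]

# Build each table separately; "a = b = [0] * 7" would bind both names
# to the same list, so both counts would pile into one table.
day_totals = [0] * 7
day_prom_totals = [0] * 7
dayhour_totals = [[0] * 24 for _ in range(7)]
dayhour_prom_totals = [[0] * 24 for _ in range(7)]

for when, submitted, promoted in samples:
    weekday = when.weekday()  # 0 = Monday ... 6 = Sunday
    day_totals[weekday] += submitted
    day_prom_totals[weekday] += promoted
    dayhour_totals[weekday][when.hour] += submitted
    dayhour_prom_totals[weekday][when.hour] += promoted


def pop(promoted, submitted):
    """Probability of promotion as a percentage; 0.0 for an empty bucket."""
    return 100.0 * promoted / submitted if submitted else 0.0


for name, index in (("Mon", 0), ("Sat", 5)):
    print(name, day_totals[index], day_prom_totals[index],
          round(pop(day_prom_totals[index], day_totals[index]), 2))
```

Guarding against an empty bucket in pop() is a design choice of this sketch; whether an hour with zero submissions ever occurs in the real data is not something the post tells us.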
It seems there is a real thirst for data about digg.com and “social news” in general. Last week I read a post about the best time to submit a story to social news sites that seemed somewhat flawed. For starters it didn’t account for volume of submissions (or at least didn’t seem to). Furthermore it draws the conclusion that because Thursday is the best day to post, and between 10am and 2pm PST is the best time to post, therefore Thursday between 10am and 2pm is the best time of all to post. And I’ve taken far too many standardized tests to not see the logical fallacy in that.
Another stats guy has also popped up at statsbot.com, and apparently among other things, he has a large digg dataset from somewhere. Despite my role in the creation of digg I unfortunately don’t have such a thing. Oh, and he’s 17 years old and in India. Anyway… turns out the digg api is great for answering the question of what is the best time to post.
I collected some data on total submissions and promotions based on the date and time the item was submitted over the period April 4-May 4 (one of the calls in the digg api that is required is limited to 30 days).
The best time to post to digg? According to the data from the last 30 days, it’s Saturday between 1 and 2 pm PST. Sorry to ruin your weekends ;-).
This also illustrates the logical fallacy I was talking about - the best day to post is Sunday, and the best time to post is 9-10pm PST, overall.
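The fallacy is easy to reproduce with toy numbers. The sketch below (current Python 3) uses counts that are entirely made up, chosen only so that the best day, the best hour and the best day-and-hour slot all disagree; they are not the real digg figures.

```python
# Made-up counts: (day, hour) -> (stories promoted, stories submitted).
counts = {
    ("Sat", "13:00"): (5, 100),
    ("Sat", "21:00"): (10, 1000),
    ("Sun", "13:00"): (20, 1000),
    ("Sun", "21:00"): (40, 1000),
}


def pop(promoted, submitted):
    """Probability of promotion as a percentage."""
    return 100.0 * promoted / submitted


def marginal(index):
    """Pool the counts along one axis (0 = day, 1 = hour); return POP per label."""
    pooled = {}
    for key, (promoted, submitted) in counts.items():
        prom, sub = pooled.get(key[index], (0, 0))
        pooled[key[index]] = (prom + promoted, sub + submitted)
    return {label: pop(*pair) for label, pair in pooled.items()}


by_day = marginal(0)
by_hour = marginal(1)
best_day = max(by_day, key=by_day.get)
best_hour = max(by_hour, key=by_hour.get)
best_slot = max(counts, key=lambda slot: pop(*counts[slot]))

print("best day:", best_day)
print("best hour:", best_hour)
print("best slot:", best_slot)
```

Here the pooled figures favour Sunday and 21:00, yet the single best slot is Saturday at 13:00, because a low-volume slot with a high promotion rate barely moves the pooled totals.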
Here be graphs (click for larger version). They refer to POP a lot - that’s my acronym for Probability of Promotion - the likelihood of any story being promoted to the digg front page:
I’ll be posting the code to do this shortly, since it was a nice exercise in using Python, R, rpy, and the digg API. And probably other stuff, starting with making those graphs prettier. That 17 year old in India gets me down :-(.
Also, because San Francisco is the known center of the universe, I’ve done all this using PST/PDT.
The usual disclaimer.
In a world where small software companies seem to have an expected life of less than ten years, it’s kind of neat to discover that a company still makes a product I used back in the eighties - Grapher by Golden Software - “Technologically Advanced Graphics and Mapping solutions since 1983.”
Techmeme isn’t going to be usable for the next few days, as blogiarrhea takes over after the big MS-Yahoo story. It almost resembles a message board right now, with a bunch of highly paid pundits posting their equivalent of “First!” and trying to spin 250 words out of “No Deal!”