Jiaaro

…on life, the universe, and everything

Safari Web Content using all my CPU and RAM

This post is a bit of a departure from my normal style, but it took me hours of searching to finally nail this down and resolve the issue.

I’m using OS X 10.9.1 on a Mid-2010 MacBook Pro, Safari is version 7.0.1.

Activity Monitor

Edit: I’ve also experienced this issue now on OS X 10.8.5 on a Mid-2010 iMac, in Safari 6.1 (pretty much the same version of Safari).

The Problem

At some point a few days ago I noticed that my CPU fan was blasting for what seemed like no reason at all. So I did what any rational person would do. I opened Activity Monitor.

Safari Web Content was using 100% CPU over 1GB of RAM and was “Not Responding”.

I tried killing the little bugger, but to no avail. Every time I went back to using Safari the problem resurfaced.

I found a very helpful, but unfortunately also very long thread on Apple’s Support Community.

Turns out this is a side effect of things going badly with Top Sites.

Solutions

You can disable Top Sites altogether, or hunt down the offending site:

  1. Go to “Safari” > “Preferences” > “General” and change new windows and new tabs to open with a blank page (or anything besides Top Sites).

  2. Try to figure out which of your top sites is causing the issue. I suspect that if there is one with no preview (a black background with a grey Safari icon overlaid instead), it is probably the culprit. [update]: I have confirmed this to be the case.

Safari Top Sites

Another theory I have about which sites to remove from Top Sites is ones that use a lot of JavaScript or Flash.

Safari has to run and render all code in the website in order to generate the preview. The more code it has to run, the more potential for problems.

Anyway, I hope this helps some poor person with this problem :)

edit: I’ve just discovered that if you hover over the “Safari Web Content” item in Activity Monitor, the hover text will show the URL of the page being rendered!

edit 2: The trick mentioned in the previous edit is nice, but doesn’t work on the Safari Web Content process that hangs, only the ones spawned for browser tabs. Doh!

Machine Learning for Humans: K Nearest-Neighbor

Machine Learning in Action - cover

I’ve been reading Peter Harrington’s “Machine Learning in Action,” and it’s packed with useful stuff! However, while providing a large number of ML (machine learning) algorithms and sufficient example code to learn how they work, the book is a bit dry.

So I’ve decided to make my contribution to democratizing ML by posting simple explanations of these algorithms.

Why Python?

Pure Python isn’t the most (computationally) efficient way to implement these algorithms, but that isn’t the purpose here. The goal is to help humans understand how these algorithms work. Python is great for that. That’s why the book uses Python as well.

But Harrington takes the alternate route of using the (very powerful) numpy library from the get-go, which is more performant but much less clear, at the reader’s expense.

Well that’s crap; let’s start learning!

What is KNN (K nearest neighbor) good for?

This is a good question to answer up front. Why are we doing this in the first place?

KNN is a “classifier”, which is a type of algorithm that (you guessed it) classifies things.

Let’s put it in more concrete terms: We want to teach the computer to answer the question, “What kind of fruit is this?”

You’re the owner of an orchard, and you’re tired of paying workers to sort your fruits on the assembly line. The job is boring, the workers hate it, and you already measure the weight and color of every fruit on the line anyway. It should be simple enough to have a machine do it.

You have a set of already classified (categorized, tagged, etc.) information, and you want to figure out where new data (fruits) fits into your classification automatically. i.e., Is it an Apple or a Banana?

Here’s some fruit the workers logged before they got shit-canned:


     --------------------------------------------
    |  weight (g)  |   color  ||  Type of fruit  |
    |==============|==========||=================|
    |  303         |  3       ||  Banana         |
    |  370         |  1       ||  Apple          |
    |  298         |  3       ||  Banana         |
    |  277         |  3       ||  Banana         |
    |  377         |  4       ||  Apple          |
    |  299         |  3       ||  Banana         |
    |  382         |  1       ||  Apple          |
    |  374         |  4       ||  Apple          |
    |  303         |  4       ||  Banana         |
    |  309         |  3       ||  Banana         |
    |  359         |  1       ||  Apple          |
    |  366         |  1       ||  Apple          |
    |  311         |  3       ||  Banana         |
    |  302         |  3       ||  Banana         |
    |  373         |  4       ||  Apple          |
    |  305         |  3       ||  Banana         |
    |  371         |  3       ||  Apple          |
     --------------------------------------------

Notice they assigned numbers to the colors. That’s useful because we need to do math with these values (turning non-numerical stuff into numbers is known as discretizing). The colors are in color-wheel order, so similar colors get closer numbers than dissimilar ones.

Here’s the color key from the foreman’s clipboard:


          red         1
          orange      2
          yellow      3
          green       4
          blue        5
          purple      6
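
If you wanted that key in code, it might look something like this (COLOR_CODES is just a name I made up for this sketch):

# hypothetical lookup table for discretizing colors
COLOR_CODES = {
    "red":    1,
    "orange": 2,
    "yellow": 3,
    "green":  4,
    "blue":   5,
    "purple": 6,
}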

So our data set has some apples which are red, green, and yellow, and a bunch of bananas which are all yellow except one that is green.

It’s 9 AM.

A loud bell rings.

The conveyor belt starts turning, and fruit starts flowing in from outside.

But the factory is empty. All our factory workers are home learning to maintain fruit classifying robots and re-reading the primitive accumulation of wealth.

…and the first fruit rolls onto the classification machine.

What is this thing?


    Weight:  373g
    Color:   1 (red)

We better write some software to handle that fruit before it rots!

How do we decide whether this unknown fruit is an apple or banana?

The K-nearest-neighbor approach is to calculate the distance between our unknown fruit and each of the known fruits and assume the “k” closest fruits are probably the same type of fruit.

Sort of like graphing all the fruits and drawing a circle around the “?”. Whatever is closest to it is probably the same kind of fruit.


                   Graph of Fruits
             |
             |            
         380 |   AA      AA       
    weight   |   A?      A  A
         330 |       A    B B
             |           BB BB     
         280 |            
             |__________________________
                 1   2   3   4   5   6              
                        color

        note: the question mark is the “unknown fruit”

Math is delicious

Before we can start with the KNN algorithm, we need to do a little math review. Remember good old Pythagoras? a² + b² = c², right? If you’re comfortable with this, just skip to the next section.

If you don’t remember: this is the formula for calculating the hypotenuse (i.e., the diagonal side) of a right triangle.


         J
        | \
        |  \    
     7  |   \   ⟵  this side == (5**2 + 7**2) **0.5
        |    \   
         -----
           5    K

You could also say “longitude” instead of “height” and “latitude” instead of “width”, and say you’re calculating the “distance” from point “J” to point “K”.

That part is crucial.

Well, the real world isn’t 2D, it’s 3D, but I have great news! You can do this in 3D too. So now we can calculate distances in a 3D space the same way:

    a**2 + b**2 + c**2 == d**2

or

    d ==  (a**2 + b**2 + c**2) **0.5

aside: raising something to the .5 power (i.e., **0.5) is the same as taking the square root.

With KNN you can actually have as many dimensions as you want, but to keep it simple, we’ll just use 2.

OK, back to KNN

So this is what a function would look like that tells us the distance from our unknown fruit to one of the known fruits in our dataset:

def distance(fruit1, fruit2):
    """
    The args are iterables of the values in the table. 
    for example the args should look something like this:
    
    #         weight,  color
    fruit1 = [303,     3]  # Banana from the data set
    fruit2 = [373,     1]  # the unclassified fruit
    """
    
    # first let's get the distance of each parameter
    a = fruit1[0] - fruit2[0]
    b = fruit1[1] - fruit2[1]
    
    # the distance from point A (fruit1) to point B (fruit2)
    c = (a**2 + b**2) **0.5
    
    return c
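
As a quick sanity check, plugging in the two fruits from the docstring gives a distance of about 70, which matches the banana’s row in the distance table further down:

# the banana vs the unclassified fruit
distance([303, 3], [373, 1])  # == (70**2 + 2**2) **0.5 ≈ 70.03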

Here are the Python representations of the stuff we’ve discussed so far:

# the unknown fruit from above
unknown_fruit = [373, 1]

# This is arbitrarily chosen for this example. Generally
# you need to play with this magic number to find what works
# best for your case.
k = 3

# here's the dataset as a python list…
dataset = [
  # weight, color, type
  [303, 3, "banana"],
  [370, 1, "apple"],
  [298, 3, "banana"],
  [277, 3, "banana"],
  [377, 4, "apple"],
  [299, 3, "banana"],
  [382, 1, "apple"],
  [374, 4, "apple"],
  [303, 4, "banana"],
  [309, 3, "banana"],
  [359, 1, "apple"],
  [366, 1, "apple"],
  [311, 3, "banana"],
  [302, 3, "banana"],
  [373, 4, "apple"],
  [305, 3, "banana"],
  [371, 3, "apple"],
]

…and with that said, let’s sort the dataset using this function…

# using the distance() function from above, sort
# the data set by smallest distances on top
sorted_dataset = sorted(dataset, key=lambda fruit: distance(fruit, unknown_fruit))

Here is the table of distances from our unknown fruit to the known fruits in the data set.


     ---------------------------------------------------------
    |  weight (g)  |   color  ||  Type of fruit  |  distance  |
    |==============|==========||=================|============|
    |  371         |  3       ||  Apple          |  2.8       |
    |  370         |  1       ||  Apple          |  3.0       |
    |  373         |  4       ||  Apple          |  3.0       |
     \/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
    |  374         |  4       ||  Apple          |  3.2       |
    |  377         |  4       ||  Apple          |  5.0       |
    |  366         |  1       ||  Apple          |  7.0       |
    |  382         |  1       ||  Apple          |  9.0       |
    |  359         |  1       ||  Apple          |  14.0      |
    |  311         |  3       ||  Banana         |  62.0      |
    |  309         |  3       ||  Banana         |  64.0      |
    |  305         |  3       ||  Banana         |  68.0      |
    |  303         |  3       ||  Banana         |  70.0      |
    |  303         |  4       ||  Banana         |  70.1      |
    |  302         |  3       ||  Banana         |  71.0      |
    |  299         |  3       ||  Banana         |  74.0      |
    |  298         |  3       ||  Banana         |  75.0      |
    |  277         |  3       ||  Banana         |  96.0      |
     ---------------------------------------------------------

At this point, which classification the unknown fruit belongs to is determined by taking a vote of the “k” nearest neighbors – so if “k” is 3, then we take the top 3 fruits by distance and select whichever is most common.

# from the python std library
from collections import Counter

# take only the first K items
top_k = sorted_dataset[:k]

class_counts = Counter(fruit for (weight, color, fruit) in top_k)

# class_counts now looks like this:
# Counter({"apple": 3})
   
# get the class with the most votes
classification = max(class_counts, key=lambda cls: class_counts[cls])

# There you have it!
classification == "apple"
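
Incidentally, Counter also has a built-in most_common() method that does the same job as the max() call above, if you prefer:

# most_common(1) returns a list like [("apple", 3)],
# so grab the class name out of the first (and only) pair
classification = class_counts.most_common(1)[0][0]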

In this case we see that the top 3 are all “Apple”, so we conclude this unknown fruit must be an apple.

You can expand this to more than two features, though. You can actually use that distance formula from earlier with as many dimensions as you want.

Let’s try it with 4:

    e == (a**2 + b**2 + c**2 + d**2) **0.5

So if we’d done this using more characteristics of the fruits than just weight and color (like the number of seeds in the fruit, for instance), the distance calculation (we have 3 factors now) would have just been:

#         weight, color, seeds
fruit1 = [303,    3,     1]  # Banana from the data set
fruit2 = [297,    1,     4]  # unknown (but it's an apple)

a = fruit1[0] - fruit2[0]
b = fruit1[1] - fruit2[1]
c = fruit1[2] - fruit2[2]

distance = (a**2 + b**2 + c**2) **0.5

You: This is repetitive

True. This code is designed to make it easy to understand… in real life, you should use numpy (or similar) for performance reasons anyway (ML is very computationally expensive).
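
For the curious, here’s roughly what that might look like; this is just a sketch, not the book’s code. np.linalg.norm computes the euclidean distance in as many dimensions as you give it:

import numpy as np

def distance(fruit1, fruit2):
    # works for 2, 3, or 300 features
    return np.linalg.norm(np.array(fruit1) - np.array(fruit2))

# or, if you'd rather stay in pure python:
def distance2(fruit1, fruit2):
    return sum((a - b)**2 for a, b in zip(fruit1, fruit2)) ** 0.5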

You: What if one factor is more important than the others?

That’s a really good point. Maybe the number of seeds is much more important than the color of the fruit (it is), but color is still an important differentiator among fruits with the same number of seeds?

Neutralizing the effects of different units

Right now our weight values are much bigger than our color values, which we’ve discretized to single-digit numbers.

That means weight is causing much bigger changes in distance between fruits than color is.

What are we going to do about that?

Well, what if we measure all our inputs on a scale of 0 - 1.0?

That’s Normalization Kyle!

In short, we’re going to take the biggest value in the dataset and the smallest value in the dataset and put all the other numbers on a scale of 0.0 - 1.0 from smallest to biggest.

Biggest fruit:   382g
Smallest fruit:  277g
Range:           382 - 277 = 105g

def normalize_weight(weight):
    # convert the units to a float so python's wonky
    # division doesn't break anything later on
    weight = float(weight)

    # first subtract 277 (the smallest weight)
    # so that the smallest fruit becomes 0.0
    x = weight - 277

    # now the biggest fruit is 105 (382-277), but
    # we want the biggest fruit to become 1.0 so 
    # let's divide!
    x = x / 105

    return x

Obviously you wouldn’t hard-code those numbers (largest/smallest weight) in real life. Again, just for clarity.

You can do the same approach with the colors, number of seeds, etc.

If you’re going to normalize your dataset, you have to normalize all the columns. Otherwise you’re not doing what you think you are.
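
Here’s a minimal sketch of what that might look like for our fruit rows (normalize_dataset is a made-up name; it assumes the class label is the last item in each row):

def normalize_dataset(dataset):
    # split each row into numeric features and the class label
    features = [row[:-1] for row in dataset]
    labels = [row[-1] for row in dataset]

    # find the min and max of every column
    mins = [min(col) for col in zip(*features)]
    maxes = [max(col) for col in zip(*features)]

    # rescale every column to 0.0 - 1.0
    normalized = []
    for row, label in zip(features, labels):
        scaled = [
            (value - lo) / float(hi - lo)
            for value, lo, hi in zip(row, mins, maxes)
        ]
        normalized.append(scaled + [label])

    return normalized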

Wait, but I thought we mainly cared about seeds?

Right. So once you’ve done this, let’s say you want number of seeds to be twice as important as weight and color to be half as important as weight.

You’d just multiply the “number of seeds” value for every fruit in your dataset by 2.0, and multiply the “color” value of every fruit in your dataset by 0.5. Try calculating the distances now.
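
In code, that might be as simple as this sketch (the FEATURE_WEIGHTS name and values are just for illustration):

# importance multipliers for [weight, color, seeds]
FEATURE_WEIGHTS = [1.0, 0.5, 2.0]

def apply_weights(fruit):
    # multiply each (already normalized) feature by its importance
    return [value * w for value, w in zip(fruit, FEATURE_WEIGHTS)]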

These are magic numbers that (like the K value) need to be tested and tweaked to see what will work best for you.

HOMEWORK:

I’m going to leave it as an exercise to the reader to apply these ideas (dataset follows):

  1. Write a distance function that will accept 3 columns of data instead of 2

  2. Normalize the Color, Weight, and # of Seeds columns of the dataset

  3. Apply weights to the columns:

    • Color is least important: give it a weight of 0.5
    • Weight is a good differentiator: give it a weight of 1.0
    • # of Seeds is most important: give it a weight of 2.0
  4. Classify these 3 unknown fruits (UFs) using your classifier

    • UF1: [color: green, weight: 301g, seeds: 1]
    • UF2: [color: yellow, weight: 346g, seeds: 4]
    • UF3: [color: red, weight: 290g, seeds: 2]

dataset:


     -------------------------------------------------------
    |  weight (g)  |  color  |  # seeds  ||  Type of fruit  |
    |==============|=========|===========||=================|
    |  303         |  3      |  1        ||  Banana         |
    |  370         |  1      |  2        ||  Apple          |
    |  298         |  3      |  1        ||  Banana         |
    |  277         |  3      |  1        ||  Banana         |
    |  377         |  4      |  2        ||  Apple          |
    |  299         |  3      |  1        ||  Banana         |
    |  382         |  1      |  2        ||  Apple          |
    |  374         |  4      |  6        ||  Apple          |
    |  303         |  4      |  1        ||  Banana         |
    |  309         |  3      |  1        ||  Banana         |
    |  359         |  1      |  2        ||  Apple          |
    |  366         |  1      |  4        ||  Apple          |
    |  311         |  3      |  1        ||  Banana         |
    |  302         |  3      |  1        ||  Banana         |
    |  373         |  4      |  4        ||  Apple          |
    |  305         |  3      |  1        ||  Banana         |
    |  371         |  3      |  6        ||  Apple          |
     -------------------------------------------------------

as python:

dataset = [
  # weight, color, # seeds, type
  [303, 3, 1, "banana"],
  [370, 1, 2, "apple"],
  [298, 3, 1, "banana"],
  [277, 3, 1, "banana"],
  [377, 4, 2, "apple"],
  [299, 3, 1, "banana"],
  [382, 1, 2, "apple"],
  [374, 4, 6, "apple"],
  [303, 4, 1, "banana"],
  [309, 3, 1, "banana"],
  [359, 1, 2, "apple"],
  [366, 1, 4, "apple"],
  [311, 3, 1, "banana"],
  [302, 3, 1, "banana"],
  [373, 4, 4, "apple"],
  [305, 3, 1, "banana"],
  [371, 3, 6, "apple"],
]

After you’ve classified the 3 Unknown Fruits, consider which columns you could remove without losing any accuracy. It’s often the case that simpler classifiers are better, and the facets of your data may not be as related as you originally thought!


This is my first crack at this type of tutorial, so please give me feedback, and/or corrections! (email: blog@jiaaro.com )

The myth of pervasive Internet & why “offline mode” is the best free marketing you could ask for

I'm sitting on a train, making my 45-minute, 3-mile commute.

And by train, I mean: tiny aluminum can filled to capacity with iPhones and their owners.

The alarming speed and the sheer mass of concrete above our heads aren't getting any attention from these nerds.

Because they're too busy looking at iPhones.

But wait a minute – not their own iPhones. A lot of them are looking at somebody else's. In fact, I'd say in a given subway ride at least half of them will glance at their neighbor's display. You know you've done it. Moving, flashing lights are hard to ignore.

Eavesdropping. iVesdropping? Heh. I love a good pun.

Let's talk about pervasive Internet. That idea mobile developers keep spouting about how we have Internet access "everywhere" thanks to our iThings. What shit.

I spend 90 minutes a day using an iPhone with no Internet. That's very possibly the majority of my phone usage. 5 days a week.

And I'm not the only one.

Most of the iVesdropping I see is people watching somebody else play a game.

I think it's because games are immersive and the device owner is least likely to look up and trigger that awkward moment where you both realize just how long you've been snooping.

Speculations aside, this is not going away. And I know I've searched the App Store on more than one occasion for an app I saw in that sardine can.

Guess which apps I never see down there: Words With Friends, SongPop, Facebook, Twitter, Buffer.

All those social ones that demand network access.

But Mail works, and so do Podcasts, Reeder, and Letterpress (sort of).

And I know that not everyone lives in the city. But cities are cultural centers. Getting big in New York or San Francisco can catapult an app into the charts, and the visibility of being in the charts can make or break your sales numbers.


Dear app developers, I'm begging you. Please make your apps work offline. At the very least, make sure they don't crash when launched without Internet (I'm looking at you, Zynga).

It's in your best interest.

The Good Idea Lottery

Have you ever tried to sit down and come up with a good idea?

It's really hard.

After long, painful hours you still have a blank sheet of paper. Not a single good idea written down. Sound familiar?

I think we've all been there. If you've ever been a student you know exactly what I mean.

Blank notepad

The craziest part is that during those excruciatingly long minutes, you probably weren't even thinking of ideas the whole time. An idea comes into your mind, you consider it, decide it's not the one, and then you try to think of something else.

But then you get distracted.

You'd be appalled at how much time distractions take out of the process when you do “brainstorming” this way.

I've been in this situation many, many times, and I finally realized something that totally changed the way I approach creative thinking:

Coming up with 1 good idea is actually harder than coming up with 10 good ideas.

It sounds crazy, but I'm about to convince you it's true ;)

Being prolific when you're brainstorming is absolutely key to finding good ideas.

In part, because as humans, we're not very good at focusing our thoughts when the rest of our body is idle. But also because we're notoriously bad at identifying which ideas are good, and which ones aren't.

The trick is to accept failure from the start.

Most of your ideas will suck, but that's fine. Just write them all down. Every. Single. Stinking. Idea.

Don't stop until you have at least 15 (and hopefully 30, 50, or 100). Every idea you write down is a ticket in the Good-Idea Lottery.

Obviously this exercise leaves you with a list of dozens of ideas which you now have to vet, but (and here's a third reason to do it my way):

Vetting ideas and coming up with ideas are different frames of mind.

When you're coming up with ideas you want to be open minded, creative, and optimistic. When you're vetting ideas you want to be analytical, realistic, and tactical. Trying to do both is hard – yet another fault of our species – we suck at multi-tasking.

So how do you vet all these ideas?

Well, I thought about this for a while and the answer is really: "it depends" as it always seems to be (nod to startups for the rest of us).

But in most cases (all?) you are well served by soliciting opinions from others.

So do that.

If it's artwork, show it to people with good taste.

If it's a business idea, show it to some would-be customers.

If it's a blog post… well if it's a blog post quit loafing and just post it ;) Your loyal readers are step 1 in vetting the content. When something seems to resonate with them, *that's* when you go all-out and try to promote it.

I started doing this sort of thing a lot.

At first I was just keeping hand written lists of people's feedback. Then it was excel spreadsheets.

Business ideas, marketing plans, songs, resume designs… all kinds of stuff.

I was really just collecting anecdotes – I'd ask 3 people here, 5 people there – sometimes that's good enough. It's certainly a good place to start.

As I continued my education, I learned about statistical significance and other fun things, and the urge to make these decisions based on (more) data resurged.

I was an economics student, and a closet hacker.

There are a lot of ways to gather data, but I didn't use any of them. I did what any naïve hacker would: I built myself a tool that took all these unwashed ideas and ranked them.

And I called it the Whicher. (This may sound familiar if you follow me on Twitter, Facebook, etc.)

Essentially, you put in a question, and 20 images and it shows people 2 at a time until they've gone through all of them.

You're probably wondering at this point, "why not just ask people to rank the ideas instead of this convoluted tournament thing?"

That was my first plan.

Let's rewind. It's story time…

My band was making a CD. There were 6 of us which meant lots of arguments about how much violin should be in the chorus of song X, and whether or not it was a good idea to cover songs our friends had written.

One day we're all sitting around in the "Recording Studio" (i.e., my parents' formal living room) trying to pick which 10 of our 30ish songs deserved to be included in the album.

We took ourselves very seriously. Don't laugh, this is important.

So Matt (drummer) proposes that we all make a top 10 list and then we'll all compare. 10 minutes of scribbling, crossing out, eraser dust, new sheets of paper, and general kindergarten activity ensues.

When Matt read my list, his reaction was, "Really? You like Song A more than Song B?" And my answer was, "No." I didn't.

That's the funny thing about rankings. You can get circular logic: Song A beats Song B, Song B beats Song C, but Song C beats Song A.

In other words, lots of rankings aren't necessarily transitive. A brief aside: Phil Haack wrote a fascinating article about this in regards to political elections.

Our (again, suboptimal) human brains end up performing the ranking process as a series of X vs Y choices anyway.

“Do I like this song more than that one, and less than the one above it? Perfect. I'll just slip it in there.”

And it occurred to me,

why don't we just simplify the whole thing to just A vs B choices?

My hacker brain took off from there.

In fact it's not just simpler, but now we can get useful data even if the person doesn't make a choice about every song.

And besides, who are we to choose? We should ask THE FANS.

So here we are… in the present day. I didn't actually build the tool because I was too busy economics-ing or whatever.

I think we voted on which songs to put on the album. Yay, irony!

But I did eventually build this thing.

I'm still working on it, but you can check out theWhicher.com.

Also, there's a little more history to the story, which you can read on the Whicher's About Page.

But really, seriously, in conclusion: Stop doing it the hard way

Don't let your single-minded, reptilian, teenaged brain-thing hold you back.

Please consider brainstorming without any fear of failure; in fact, do it for me. Plan on coming up with swaths of bad ideas.

It's a good idea.

And you'll have plenty of time to vet them afterward ;)

You need to support mobile, even if you don't support mobile

Your application emails people when important things happen, right? 

If it doesn't it should. People like to know about important things (and it's the best marketing you could ask for).

Most people check email on their phones – among other places – and the best email is actionable.

…so they log in to take care of whatever you emailed about.

On their phone.

You probably have more mobile traffic than you think. And if you don't, you probably don't use email very effectively.

tl;dr: You need to support mobile, even if you don't.