Sunday, 8 September 2013

Naive Bayes Classifier, Programmatically identify a composer given a symphony

I have been assigned my first homework in machine learning: implementing a Naive Bayes classifier. The classifier is trained on music from two composers and then asked to attribute a given sample of music to one of them. Specifically, it is given 34 examples of music from Beethoven and 94 examples from Bach, and is then asked to classify 4 more pieces as one composer or the other.
The music samples are broken up into melodies (called grams in this problem), and each sample of music is a list of indices of the melodies that form the piece. So there are three files I am working with. One holds the melodies, in order from 1 to numberOfGrams, and is called combined4grams; the other two are rows of lists of melodies. For Bach there is a 94-row file (named bachWtc4grams), and each row is a list of indices that correspond to melodies in the combined4grams file. For Beethoven the file is called beethovenSymphony4grams.
My goal is to finish implementing the functions in the code given in the problem, using my knowledge of Naive Bayes classifiers. There are three steps to this process: read the data, train a model from the data, and predict which class a musical piece belongs to with our trained model. We are then asked to identify melodies with mutual information greater than 1.6E-4.
For step two, the model fitting, we are told to use the following algorithm, which is implemented in the function fitModel(). For the third step, prediction, we are given the following formula, which is implemented in the function predict().
Below is the code I was given. We were not told which functions, if any, are complete. After reading the readIn() function I have to think step one is complete; it's basically a given, since we're just reading in the data. For the last two steps, though, I need help figuring out what is missing from the code by first identifying which steps of the algorithms are actually taken. I will post the code fully verbatim and then discuss each function individually, hopefully asking questions that can lead me to an answer.
Verbatim NBC() class:
import java.io.*;
import java.util.*;

public class NBC
{
    static final int numberOfGrams = 24644;            // a gram is a melody
    static final int bachNumbers = 94;                 // Bach's Preludes and Fugues
    static final int beethovenNumbers = 34;            // Beethoven's symphony movements
    static final int bachNumberOfTokens = 36145;       // Bach data size
    static final int beethovenNumberOfTokens = 40641;  // Beethoven data size

    String[] grams = new String[numberOfGrams];        // coded melodies
    int[] bachStarts = new int[bachNumbers + 1];       // pointers to bachTokens
    int[] beethovenStarts = new int[beethovenNumbers + 1]; // pointers to beethovenTokens
    short[] bachTokens = new short[bachNumberOfTokens];    // Bach data
    short[] beethovenTokens = new short[beethovenNumberOfTokens]; // Beethoven data
    double bachPi = 0; double beethovenPi = 0;         // prior class probs learned with algorithm 3.1
    double[] bachTheta = new double[numberOfGrams];    // Bernoulli parameters learned 3.1
    double[] beethovenTheta = new double[numberOfGrams];
    double[] mi = new double[numberOfGrams];           // mutual information for feature selection

    void readin()
    {
        Scanner in = null;
        try
        {
            in = new Scanner(new File("F:\\_Downloads\\combined4grams.txt"));
        }
        catch (FileNotFoundException e)
        {
            System.err.println("combined4grams not found");
            System.exit(1);
        }
        for (int i = 0; i < numberOfGrams; i++)
            grams[i] = in.nextLine();
        in.close();
        try
        {
            in = new Scanner(new File("F:\\_Downloads\\bachWtc4grams.txt"));
        }
        catch (FileNotFoundException e)
        {
            System.err.println("bach not found");
            System.exit(1);
        }
        int m = 0;
        for (int i = 0; i < bachNumbers; i++)
        {
            bachStarts[i] = m;
            String[] terms = in.nextLine().split(" ");
            for (String s: terms)
                bachTokens[m++] = Short.parseShort(s);
        }
        in.close();
        bachStarts[bachNumbers] = m;
        try
        {
            in = new Scanner(new File("F:\\_Downloads\\beethovenSymphony4grams.txt"));
        }
        catch (FileNotFoundException e)
        {
            System.err.println("beethoven not found");
            System.exit(1);
        }
        m = 0;
        for (int i = 0; i < beethovenNumbers; i++)
        {
            beethovenStarts[i] = m;
            String[] terms = in.nextLine().split(" ");
            for (String s: terms)
                beethovenTokens[m++] = Short.parseShort(s);
        }
        in.close();
        beethovenStarts[beethovenNumbers] = m;
    }

    void fitModel()
    {   // follow algorithm 3.1
        double total = bachNumbers + beethovenNumbers;
        bachPi = bachNumbers / total;
        beethovenPi = beethovenNumbers / total;
        int[] freq = new int[numberOfGrams];
        for (int i = 0; i < numberOfGrams; i++)
            freq[i] = 0;
        for (int i = 0; i < bachNumberOfTokens; i++)
            freq[bachTokens[i]]++;
        for (int i = 0; i < numberOfGrams; i++)
            bachTheta[i] = freq[i] * 1.0 / bachNumbers;
        for (int i = 0; i < numberOfGrams; i++)
            freq[i] = 0;
        for (int i = 0; i < beethovenNumberOfTokens; i++)
            freq[beethovenTokens[i]]++;
        for (int i = 0; i < numberOfGrams; i++)
            beethovenTheta[i] = freq[i] * 1.0 / beethovenNumbers;
    }

    void predict()
    {   // Algorithm 3.2
        Scanner in = null;
        try
        {
            in = new Scanner(new File("F:\\_Downloads\\bbtest.txt")); // four test files
        }
        catch (FileNotFoundException e)
        {
            System.err.println("bbtest not found");
            System.exit(1);
        }
        for (int i = 0; i < 4; i++)
        {
            String[] terms = in.nextLine().split(" ");
            int len = terms.length;
            boolean[] query = new boolean[numberOfGrams]; // xij == 1?
            for (int j = 0; j < numberOfGrams; j++)
                query[j] = false;
            for (int j = 0; j < len; j++)
                query[Integer.parseInt(terms[j])] = true;
            double bachLike = Math.log(bachPi); // L = log pi
            double beethovenLike = Math.log(beethovenPi);
            for (int j = 0; j < numberOfGrams; j++)
            {
                if (query[j])
                {   // line 5 of Algorithm 3.2
                    if (bachTheta[j] > 0)
                        bachLike += Math.log(bachTheta[j]);
                    else
                        bachLike -= 10.0;
                    if (beethovenTheta[j] > 0)
                        beethovenLike += Math.log(beethovenTheta[j]);
                    else
                        beethovenLike -= 10.0;
                }
                else
                {
                    if (bachTheta[j] < 1)
                        bachLike += Math.log(1.0 - bachTheta[j]);
                    else
                        bachLike -= 10.0;
                    if (beethovenTheta[j] < 1)
                        beethovenLike += Math.log(1.0 - beethovenTheta[j]);
                    else
                        beethovenLike -= 10.0;
                }
            }
            // what to print?
        }
        in.close();
    }

    void MI()
    {   // formula (3.76)
        for (int i = 0; i < numberOfGrams; i++)
        {
            if (bachTheta[i] == 0 || beethovenTheta[i] == 0)
                mi[i] = 0;
            else
            {
                double combinedTheta = bachPi * bachTheta[i] + beethovenPi * beethovenTheta[i];
                mi[i] = bachTheta[i] * bachPi * Math.log(bachTheta[i] / combinedTheta)
                      + (1 - bachTheta[i]) * bachPi * Math.log((1 - bachTheta[i]) / (1 - combinedTheta))
                      + beethovenTheta[i] * beethovenPi * Math.log(beethovenTheta[i] / combinedTheta)
                      + (1 - beethovenTheta[i]) * beethovenPi * Math.log((1 - beethovenTheta[i]) / (1 - combinedTheta));
                if (mi[i] > 1.6E-4)
                    System.out.println(grams[i] + " " + mi[i]);
            }
        }
    }

    public static void main(String[] args)
    {
        NBC nbc = new NBC();
        nbc.readin();
        nbc.fitModel();
        nbc.predict();
    }
}
So let's look at the fitModel() function. After reading the book and reviewing the lecture slides, I think the goal of fitModel() is to find bachPi, beethovenPi, bachTheta[], and beethovenTheta[]. I am having a hard time understanding what these four variables represent in the problem, though.
bachPi and beethovenPi, I think, represent the bias in our training data, because we have more samples of one class than the other. The log() of these is the initial value of Lic in the prediction algorithm. They are basically an offset used in defining how Bach-like or Beethoven-like a piece of music might be in the prediction function; the offset represents the bias in our model when choosing the likeness.
These are calculated as "pi hat c" equals "N c over N" in the 8th line of the model-fitting algorithm, where, looking at the given code, "N c" is the number of sample pieces of music per composer and "N" is the total number of sample pieces over all composers.
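As a sanity check on my reading of that line, here is a quick sketch that just redoes the arithmetic from fitModel() with the class counts from the assignment (nothing new, only the numbers plugged in):

```java
public class PiCheck {
    public static void main(String[] args) {
        int bachNumbers = 94;        // N_bach: Bach training pieces
        int beethovenNumbers = 34;   // N_beethoven: Beethoven training pieces
        double total = bachNumbers + beethovenNumbers; // N = 128
        double bachPi = bachNumbers / total;           // 94/128
        double beethovenPi = beethovenNumbers / total; // 34/128
        System.out.println(bachPi);      // 0.734375
        System.out.println(beethovenPi); // 0.265625
    }
}
```

So before looking at a single melody, the model already leans about 73% toward Bach, purely because Bach dominates the training set.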
The bachTheta[] and beethovenTheta[] arrays, I think, represent the individual biases per melody per composer. So each index of the theta arrays corresponds to a specific melody (again, called a gram), and the value at each index represents the frequency of that melody within the class.
Let me know if you can give me a more intuitive feeling for what the pi and theta values represent in this problem.
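One thing I am not sure about: the given fitModel() counts every token occurrence of a gram and divides by the number of pieces, while the comment calls the thetas "Bernoulli parameters", and a strictly Bernoulli theta would be the fraction of pieces that contain the gram at all. A toy sketch of that per-piece version, with made-up data (the piece lists here are hypothetical, not from the assignment files):

```java
import java.util.Arrays;

public class ThetaIntuition {
    // Bernoulli-style estimate: fraction of pieces that CONTAIN gram i,
    // counting each gram at most once per piece.
    static double[] docFrequencyTheta(int[][] pieces, int numberOfGrams) {
        double[] theta = new double[numberOfGrams];
        for (int[] piece : pieces) {
            boolean[] seen = new boolean[numberOfGrams];
            for (int g : piece)
                seen[g] = true;              // presence, not repetition
            for (int i = 0; i < numberOfGrams; i++)
                if (seen[i]) theta[i]++;
        }
        for (int i = 0; i < numberOfGrams; i++)
            theta[i] /= pieces.length;
        return theta;
    }

    public static void main(String[] args) {
        // Toy data: 3 "pieces", each a list of gram indices.
        int[][] pieces = { {0, 1, 1}, {1, 2}, {2, 2, 2} };
        double[] theta = docFrequencyTheta(pieces, 3);
        // Gram 1 occurs 3 times as a token but in only 2 of the 3 pieces,
        // so its Bernoulli theta is 2/3 (the token-count version would give 3/3).
        System.out.println(Arrays.toString(theta));
    }
}
```

Under the token-count version a frequent gram can even push theta above 1, which would break the log(1 - theta) term in predict(), so I suspect the per-piece reading is the intended one, but I would welcome confirmation.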
Now for the prediction function. Here we want to look at a given piece of music, melody by melody, or as the problem puts it, gram by gram. We have two values for each piece of music:
double bachLike = Math.log(bachPi);
double beethovenLike = Math.log(beethovenPi);
These are an overall measure of how likely it is that a piece belongs to each of the two classes. For each melody in the piece of music we update the likelihood for each class per the second algorithm. Each likeliness value starts out offset by the class's bias: the pi values from the model fitting.
So from looking at the code and the algorithm, it looks like the code is missing the last two lines of the prediction algorithm. I think they involve first calculating the probability per class and then finding the class with the maximum probability to arrive at the final classification. The last line, finding the maximum probability, makes sense, but calculating the probability per class is confusing me: mainly how to write the logsumexp() function, what its argument (L subscript "i comma colon") is and represents, and why we are doing this.
I am also missing a piece of information regarding why we take the log() of each theta, which I think links to the second-to-last line I'm missing. I am pretty sure it has to do with avoiding numerical underflow; the book says that is what the logsumexp function is for, but I am unsure why.
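From the book's underflow remark, my current guess is that logsumexp computes log(sum over c of exp(Lc)) by first factoring out the largest Lc, so the exponentials stay in floating-point range. A hedged sketch of what I think the function looks like (the signature is my own guess, not from the assignment):

```java
public class LogSumExp {
    // Stable log(exp(l[0]) + exp(l[1]) + ...): subtracting the max makes
    // every exponent <= 0 (no overflow), and the largest term becomes
    // exp(0) = 1, so the sum can never underflow to zero.
    static double logsumexp(double[] l) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : l)
            if (v > max) max = v;
        double sum = 0.0;
        for (double v : l)
            sum += Math.exp(v - max);
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        // Two very negative log-likelihoods, like bachLike/beethovenLike
        // end up after summing ~25k log terms:
        double[] l = { -1000.0, -1001.0 };
        // The naive Math.log(Math.exp(-1000) + Math.exp(-1001)) underflows
        // to Math.log(0) = -Infinity; the shifted version stays finite.
        System.out.println(logsumexp(l)); // about -999.687
    }
}
```

This is also, I think, why everything is kept as a sum of logs in the first place: a product of thousands of probabilities each well below 1 would underflow to 0.0 long before the loop finishes.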
So overall my questions are:
Is my understanding of the model fitting and the variables it produces correct? Could you offer a better explanation if mine is off? Also, I use words like bias and offset; what are the proper names for these?
Are we taking logs to keep the problem from numerical underflow?
What are the bachLike and beethovenLike variables called? Are they measured in some unit?
What is the second-to-last line in the prediction algorithm, and how does it go from the likelihood to the probability? (And for clarification, does exp(x) mean e to the power of x?)
I am looking for clarification and feedback on whether I am approaching and understanding this problem correctly.
Update:
I added the following code to the prediction routine as a setup for the last two lines I presume are missing. The variable li represents the argument to logsumexp, which I am still not sure about from the algorithm.
double li = 0.0;
double beethovenPic = Math.exp(beethovenLike - logsumexp(li)); // logsumexp() currently returns 0.0
double bachPic = Math.exp(bachLike - logsumexp(li));
double maxPic = Math.max(beethovenPic, bachPic);
System.out.println("Song " + (i+1) + " Probabilities \n Bach:" + Double.toString(bachPic)
    + "\n Beethoven:" + Double.toString(beethovenPic)
    + "\n Winner:" + Double.toString(maxPic));
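If my reading of the notation is right, "L subscript i comma colon" is the whole row of log-likelihoods for piece i, i.e. both bachLike and beethovenLike together, not a single scalar like my li above. A hedged sketch of how the tail of predict() might then look, with stand-in numbers for the two log-likelihoods (the values here are invented, just to make it runnable):

```java
public class PredictTail {
    // Stable log(exp(a) + exp(b)) via the usual max-shift trick.
    static double logsumexp(double[] l) {
        double max = Math.max(l[0], l[1]);
        return max + Math.log(Math.exp(l[0] - max) + Math.exp(l[1] - max));
    }

    public static void main(String[] args) {
        // Stand-in log-likelihoods for one test piece (hypothetical numbers):
        double bachLike = -250.0;
        double beethovenLike = -260.0;

        // The row L_{i,:} for this piece is both values at once.
        double[] li = { bachLike, beethovenLike };
        double norm = logsumexp(li);

        // p(c | x) = exp(L_ic - logsumexp(L_{i,:})); the two posteriors sum to 1.
        double bachProb = Math.exp(bachLike - norm);
        double beethovenProb = Math.exp(beethovenLike - norm);

        System.out.println("Bach: " + bachProb + "  Beethoven: " + beethovenProb);
        System.out.println(bachProb > beethovenProb ? "Winner: Bach" : "Winner: Beethoven");
    }
}
```

Under this reading, subtracting logsumexp is just dividing by the normalizing constant in log space, which would answer my question about how the likelihood becomes a probability, and yes, exp(x) there means e to the power of x.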
