Population of Northeast Ohio

Let's look at some data on the population of northeast Ohio (NEO).

This is what the population distribution in NEO looks like:

In contrast, this is what the overall US population distribution looks like:

NEO is older than the other two big metros in Ohio, Cincinnati and Columbus:

This is what Dallas looks like:

The average ages in Cleveland, Cincinnati, and Columbus are 41.1, 37.9, and 35.7, respectively. For comparison, the average age in Dallas is 34.2: Cleveland's population is about 7 years older than Dallas's.

The fertility rates of Cleveland, Cincinnati, and Columbus are 4.8%, 5.2%, and 5.4%, respectively. For comparison, the fertility rate of Dallas is 5.4%.

Hence, NEO has fewer women of childbearing age (15 to 50) than the national average, and its fertility rate is below the national average. The result? Fewer babies.
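The 7-year gap quoted above is just the difference of two averages; a quick sanity check (the numbers are the ones given in the text):

```python
# Average ages from the text above.
avg_age = {"Cleveland": 41.1, "Cincinnati": 37.9, "Columbus": 35.7, "Dallas": 34.2}

gap = avg_age["Cleveland"] - avg_age["Dallas"]
print(round(gap, 1))   # roughly 7 years
```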

Bowtie index-building problem with U's in sequences

I am having a weird problem with the Bowtie index builder.

I have two entries in my FASTA file:
forrest@narnia:/bioinfo/out-house/miRBase$ cat 157.fa

After building the index, I use bowtie-inspect to check the index.
forrest@narnia:/bioinfo/out-house/miRBase$ bowtie-build 157.fa  157 -q
forrest@narnia:/bioinfo/out-house/miRBase$ bowtie-inspect -s 157
Colorspace 0
SA-Sample 1 in 32
FTab-Chars 10
Sequence-1 ath-miR157a-5p-U 18
Sequence-2 ath-miR157a-5p_T 21

Strangely, the length of ath-miR157a-5p-U becomes 18 instead of 21: its 3 U's are missing.

Even more strangely, not all U's in all sequences are ignored. The problem happens for some sequences but not others.
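A common workaround, assuming the root cause is RNA letters being fed to a DNA-oriented index builder, is to convert U's to T's before running bowtie-build. A minimal sketch, using a made-up record rather than the real miRBase entry:

```python
def rna_fasta_to_dna(fasta_text):
    """Replace U/u with T/t on sequence lines; header lines (starting '>') are untouched."""
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            out_lines.append(line)  # keep headers as-is
        else:
            out_lines.append(line.replace("U", "T").replace("u", "t"))
    return "\n".join(out_lines)

# Toy record (not the real miRBase sequence):
fasta = ">toy-miR-5p\nUUGACAGAAGAUAGAGAGCAC"
print(rna_fasta_to_dna(fasta))
```

Since T and U are interchangeable for alignment purposes, the converted index should behave identically for matching.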

Named Entity Recognition using SpaCy in 5 minutes

Recently, I have been looking at SpaCy, a startup and an NLP toolkit. Its speed is fabulous. Today, I gave its NER a try. Just a few lines (in IPython):

In [1]: import spacy.en
In [2]: parser = spacy.en.English()
In [12]: ParsedSentence = parser(u"alphabet is a new startup specializing in eating their own words on leaving china to fight for information freedom")

In [13]: print ParsedSentence.ents
()

In [14]: ParsedSentence = parser(u"Alphabet is a new startup specializing in eating their own words on leaving China to fight for information freedom")

In [15]: print ParsedSentence.ents
(Alphabet, China)

In [16]: for Entity in ParsedSentence.ents:
   ....:     print Entity.label, Entity.label_, ' '.join(t.orth_ for t in Entity)
349 ORG Alphabet
350 GPE China

I used only default settings. Apparently, the NER of SpaCy is very sensitive to the case of words.

Why is it hard to learn Amazon AWS?

Amazon AWS is awesome. And it is not really difficult to learn, except for the part about reading the documentation and learning the terminology.

If you want the answer to one thing, AWS may give it to you in more than one place. For example, if you wanna know how to SSH into your instance, here are two places in the same user guide:




I wish the learning experience could be easier.
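For the record, the answer both places converge on is a single ssh command. A sketch, where the key file name and hostname are placeholders for your own values:

```shell
# Placeholders: the key file and public DNS name below are hypothetical.
KEY_FILE="my-key.pem"
PUBLIC_DNS="ec2-203-0-113-25.compute-1.amazonaws.com"
EC2_USER="ec2-user"   # typically "ubuntu" on Ubuntu AMIs

# The key must not be world-readable, or ssh refuses it:
#   chmod 400 my-key.pem
CMD="ssh -i $KEY_FILE $EC2_USER@$PUBLIC_DNS"
echo "$CMD"
```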

Epilepsy 101 for Computational People

Disclaimer: This document should not be used as the basis for any medical or legal decisions. I wrote it up because I am tired of explaining basic epilepsy terms to my colleagues (informaticians, not neurologists) again and again.

  1. Epilepsy is a neurological disorder, while a seizure is a characteristic symptom of epilepsy. Jerks are one kind of seizure.
  2. In 1981, the ILAE published a seizure classification that is still widely used today:
    1. Focal Seizures: seizures limited to, or originating from, one hemisphere of the brain
      1. simple partial seizures (simple: no impairment of consciousness),
      2. complex partial seizures (complex: with impairment of consciousness), and
      3. partial seizures evolving into secondarily generalized seizures, which account for about 1/3 of partial seizures
    2. (Primary) Generalized Seizures: seizures involving the whole brain
      1. absence seizures (kinda like a blackout), formerly called petit mal
      2. tonic-clonic seizures (kinda like a whole-body jerk), formerly called grand mal
      3. many other generalized seizures
    3. Unclassified epileptic seizures
  3. In 1989, the ILAE published an epilepsy classification that is still widely used today:
    1. Focal Epilepsies
      1. symptomatic (cause known) or cryptogenic (cause unknown): e.g., temporal lobe epilepsy (TLE)
      2. idiopathic (genetic causes): e.g., benign childhood epilepsy
    2. Generalized Epilepsies
      1. symptomatic (cause known) or cryptogenic (cause unknown): e.g., West syndrome, Lennox-Gastaut syndrome
      2. idiopathic: e.g., childhood absence epilepsy, juvenile absence epilepsy
    3. Epilepsies undetermined whether focal or generalized
  4. When a seizure is occurring, we say the subject is in the ictal state; otherwise, the interictal state. Because seizure duration (a few seconds to minutes) is much shorter than the interictal period, interictal EEG is much more accessible.
  5. EEG is the gold standard for epilepsy diagnosis. However, neurologists rely heavily on either ictal EEG or interictal epileptiform discharges (IEDs, such as sharp waves and spikes). IEDs are the distinctive EEG patterns of epilepsy seen when subjects are not having a seizure. Because seizures and IEDs happen unpredictably and sporadically, the way to catch them is long-term EEG recording, which can last from hours to days. Such a tedious procedure is costly and inconvenient.
  6. Typical signal processing and machine learning problems related to epileptic EEG:
    1. coarse epilepsy diagnosis: distinguishing epileptic interictal EEG (with or without IEDs) from non-epileptic EEG
    2. fine epilepsy/seizure diagnosis: identifying the type of seizure and/or epilepsy based on ictal or interictal EEG
    3. seizure detection: detecting seizure activities from epileptic subjects' EEG
    4. seizure prediction: predicting the onsets of seizure activities from epileptic subjects' EEG
    5. focus localization: locating the epileptogenic zone (for focal seizures only)
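As a toy illustration of problem 3 (seizure detection), here is a minimal sketch that thresholds the line-length feature over fixed windows of a synthetic signal. The signal, the feature choice, and the threshold are all made up for illustration and have nothing to do with clinical practice:

```python
import math
import random

def line_length(window):
    # Line length: sum of absolute first differences -- a cheap, classic EEG feature.
    return sum(abs(b - a) for a, b in zip(window, window[1:]))

# Synthetic "EEG": unit-variance noise with a high-amplitude rhythmic burst
# injected in the middle to stand in for an ictal (seizure) segment.
random.seed(0)
signal = [random.gauss(0.0, 1.0) for _ in range(300)]
for i in range(120, 180):
    signal[i] += 8.0 * math.sin(i)   # the injected "seizure"

win = 50
threshold = 100.0                    # hypothetical, tuned for this toy signal
flags = [line_length(signal[s:s + win]) > threshold
         for s in range(0, len(signal) - win + 1, win)]
print(flags)                         # windows overlapping the burst get flagged
```

Real detectors combine many features (spectral power bands, wavelets, nonlinear measures) and a trained classifier, but the window-feature-threshold skeleton is the same.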

1. For a complete list of seizure types and epilepsy types, see Tables 9-1 and 9-2 on pages 121-122 of this guideline from NICE (UK): http://www.nice.org.uk/guidance/cg137/resources/cg137-epilepsy-full-guideline3, which are reprinted from the 1981 and 1989 ILAE classifications.