## Saturday, April 15, 2017

### Finding the function that is its own derivative

I saw this "derivation" some years ago but now I can't find it.  If anyone knows the source please let me know.

Let's say we want to find a function $f(x)$ for which
$$\frac{d}{dx} f(x) = f^\prime (x) = f(x)$$
Well, how about $f(x) = x$?
$$f(x) = x \implies f^\prime (x) = 1$$
No, because $f^\prime (x) \ne x$. OK, how about $f(x) = 1+x$?
$$f(x) = 1+x \implies f^\prime (x) = 1$$
That get's me a little closer.  I get the "$1$" back, but I need something whose derivative will give me back the $x$:
$$f(x) = 1+x+\tfrac{1}{2}x^2 \implies f^\prime (x) = 1 + x$$
Now I need something whose derivative will me back the $\tfrac{1}{2}x^2$:
$$f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2$$
OK, so if I use infinitely many terms then this function fits the bill
$$f(x) = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots \implies f^\prime (x) = 1 + x + \tfrac{1}{2}x^2 + \cdots \tfrac{1}{(n-1)!}x^{n-1} \cdots$$
Can I write $f(x)$ in some compact form?  Well, $f(0) = 1$ and $f(1)$ will give me some number, which I'll call $e$.
$$f(1) = 1+1+\tfrac{1}{2} + \tfrac{1}{2 \cdot 3} + \cdots = e$$
You only need a few terms before $e$ has converged to the first two decimal places (2.717).

For $x$ = 2 you need a few more terms to get the same precision, but the number (7.382) is roughly 2.717 x 2.717, an agreement that quickly gets better with a few more terms
$$f(2) = 1+2+\tfrac{1}{2}4 + \tfrac{1}{2 \cdot 3}8 + \cdots = e^2$$
So
$$f(x) = e^x = 1+x+\tfrac{1}{2}x^2 + \tfrac{1}{2 \cdot 3}x^3 + \cdots \tfrac{1}{n!}x^n \cdots$$
and
$$\frac{d}{dx} e^x = e^x$$

### Where does the ln come from in S = k ln(W) ? - Take 2

Some years ago I wrote this post. Now I want to come at it from a different angle.

A closed system spontaneously goes towards the state with maximum multiplicity, $W$. For a system with $N$ molecules with energies $\varepsilon_1, \varepsilon_2, \ldots$ we therefore want to find the values of $N_i$ that maximises $W(N_1, N_2, \ldots)$.

This is easier to do for $\ln W$ than $W$, which is fine because $W$ is a maximum when $\ln W$ is a maximum, and this happens when
$$\frac{N_i}{N}= p_i = \frac{e^{-\beta \varepsilon_i}}{q}$$
$\beta$ can be found by comparison to experiment.  For example, the energy of an ideal monatomic gas with $N_A$ particles is
$$U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2\beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{kT}$$
where $R$ is determined by measuring the temperature increase due to adding a known amount of energy to the system.

So far we have used $\ln W$ instead of $W$ for convenience, but is there something special about $\ln W$? Yes, the change in $\ln W$ has can be expressed quite simply
$$d \ln W = \beta \sum_i \varepsilon_i dN_i = N \beta \sum_i \varepsilon_i dp_i = \frac{dU}{kT}$$
So change in internal energy $U$ due to a redistribution of molecules among energy levels is equal to the change in $\ln W$ (as opposed to $W$) multiplied by $kT$
$$dU = Td\left( k \ln W \right) = TdS$$
The final question is whether there is something special about $\ln = \log_e$ as opposed to say $\log_a$ where $a \ne e$? Well, $\log_a W$ is a maximum when
$$\frac{N_i}{N}= p_i = \frac{a^{-\beta \varepsilon_i}}{q}$$
There is an extra term in the derivation but that cancels out in the end.  So no change there.

What about $\beta$?  There are two changes.  The previous derivation of $U^{\mathrm{Trans}}$ relied on this relation (I'll drop the "Trans" label for the moment)
$$\varepsilon_i e^{-\beta \varepsilon_i} = - \frac{d }{d \beta} e^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q} \frac{dq}{d\beta}$$
which now becomes
$$\varepsilon_i a^{-\beta \varepsilon_i} = - \frac{1}{\ln(a)} \frac{d }{d \beta} a^{-\beta \varepsilon_i} \implies \langle \varepsilon \rangle = - \frac{1}{q \ln(a)} \frac{dq}{d\beta}$$
While $q$ has an extra $\ln (a)$ term, the derivative wrt $\beta$ is still the same and
$$U^{\mathrm{Trans}} = N_A \langle \varepsilon^{\mathrm{Trans}} \rangle = N_A\sum_i p_i \varepsilon_i = \frac{3N_A}{2 \ln (a) \beta} = \tfrac{3}{2}RT \implies \beta = \frac{1}{\ln (a) kT}$$
So, $S = k \log_e W$ is special in the sense that the proportionality constant $k$ is the experimentally measured ideal gas constant divided by Avogadro's number.  In any other base we have to write either $S = k \ln (a) \log_a W$ where $k = R/N_A$ or $S = k^\prime \log_a W$ where $k^\prime = \ln(a)R/N_A$

Clearly, $S = k \log_e W$ is the most natural choice, and this is because $e$ is the (only) value of $a$ for which
$$\frac{d}{dx} a^x = a^x$$
In fact that is one way to define what $e$ actually is.

## Tuesday, February 28, 2017

### Biotechnology and drug-design: My latest paper explained without the jargon

Our latest paper has just appeared in the open access journal Chemical Science. It's ultimately related to biotechnology and drug design so first some background.

Background
Most biotechnology and medicine involves proteins in some way. Many diseases involve mutations that alter the function of proteins, most drugs are molecules that bind to proteins and inhibit their actions, and a large part of industrial biotechnology involves making new types of proteins such as enzymes. Like with everything else, it is easier to deal with proteins of you know what they look like but protein structure determination can be very difficult and we don't know what many important proteins actually look like.

The most popular way of determining protein structure is a technique called x-ray crystallography where you basically take an x-ray of a crystal made from the protein. Unfortunately, it can be very difficult or impossible to grow crystals of some proteins and if you can't get the protein to crystallise you can't use x-ray crystallography to find the structure.  The other main way for determining protein structure is a technique called NMR spectroscopy where you basically take an MRI of a solution containing the protein. The advantage is that there is no need for crystallisation, but the disadvantage it that it is difficult to extract enough information from the "NMR-MRI" go get a good structure.

The "NMR-MRI" of a protein actually provides a unique fingerprint of each protein so in principle all one has to do is generate a lot of possible structures of a protein, compute the NMR fingerprint for each, and compare to the measured fingerprint. The structure with the best fingerprint match should be the correct protein structure.  The questions are how to best generate the structure and how to best predict the NMR fingerprint using the structure.

The New Study
In 2015 we published a new method for predicting NMR fingerprints and in the paper that just got published we combined it with a method for generating a lot of protein structures. We started with known x-ray structures and generated millions of relatively small variations of the structure and found the structure with the best match.  We started from a known structure to answer the question: what is the best match we can hope for? The answer is: not perfect but good enough.  Now that we know this the next step will be to start with a structure we know is wrong and see if the program can find the right structure.  Also, our NMR fingerprint method does not generate fingerprints for all parts of the protein so we need to improve the model as well.

## Monday, February 13, 2017

### What are the most important reactions in drug synthesis and stability?

Synthesis
1. Aromatic electrophilic substitution of heteroaromatics
2. Suzuki coupling of heteroaryl halides
3. Diels-Alder reaction (of what)?
4. Michael reaction (of what)?
5. Friedel-Crafts alkylation (of what)?
6. Nazarov cyclization reaction (of what)?
7. ...

Stability (Probably solvolysis and oxidative degradation, but can we be more specific)
1. Autooxidation of C-H bonds?
2. Ester hydrolysis?
3. ..

## Sunday, February 5, 2017

### Preprints and the speed of publishing

When I talk about preprints with colleagues some of them say "Oh, what's the rush?" or "Publishing is so fast these days. Why, my last paper was online 6 weeks after submission don't you know" and then go on to clean their pipe with a thoughtful smile or get up to stoke the coal fire.

I'm currently writing a couple of proposals and as I was updating the reference list on one of them when I noticed that two preprints I cited apparently still haven't been "published" by a journal.  One of them first appeared on arXiv on October 7th and the other October 27th.

Both papers are important to the proposal in the sense that they changed my thinking on what is possible and I think the proposal would be less ambitious if I hadn't read them. So I am very happy these authors chose to deposit them as preprints.  I wish more people would do this and I wish fewer journals/journal editors would stand in the way of them doing so.

## Saturday, January 28, 2017

### Drug design: My latest paper explained without the jargon

Our latest paper has just appeared in the Journal of Physical Chemistry A.  If you don't have access to this journal you can find a free version of an earlier draft here. It's ultimately related to making better drugs so first some background.

Background
Designing new drugs currently involves a lot of trial-and-error, so you have to pay a lot of smart scientists a lot of money for a long time to design new drugs - a cost that is ultimately passed on to you and I as consumers.  There are many, many reasons why drug design is so difficult. One of them is that we often don't know fundamental properties of drug-candidates such as the charge of the molecule at a given pH. Obviously, it is hard to figure out whether or how a drug-candidate interact with the body if you don't even know whether it is postive, negative or neutral.

It is not too difficult to measure the charge at a given pH, but modern day drug design involves the screening of hundreds of thousands of molecules and it is simply not feasible to measure them all. Besides, you have to make the molecules to do the measurement, which may be a waster of time if it turn out to have the wrong charge. There are several computer programs that can predict the charge at a given pH very quickly but they have been known to fail quite badly from time to time.  The main problem it that these programs rely on a database of experimental data and if the molecule of interest doesn't resemble anything in the database this approach will fail.

Last year we developed a "new" method for predicting the charge of a molecule that relies less on experimental data but it fast enough to be of practical use in drug design. We showed that the basic approach works reasonably well for small prototypical molecules and we even tested one drug-like molecule where one of the commercial programs fail and show that our new method performs better (but not great).

The New Study
We test the method on 48 drug molecules and show that it works reasonably well.  It is not quite as accurate as the methods that rely on experimental data, but this is probably because many of the molecules we test are in the databases that the programs use.  But we felt we had to test these molecules first because they are some of the first molecules other users will try to test the method. The next step is to test the method on molecules where some of the existing methods perform poorly. We also have to think about how best to make this method available to researchers who are acutually doing the drug design.

## Saturday, January 21, 2017

### Prediction of the Regioselectivity of Electrophilic Aromatic Substitution Reactions of Heteroaromatic Systems Using Semi-Empirical Quantum Chemical Methods

Art Winter tweeted this paper by Morten Jørgensen and co-workers last year and I decided to see if semi-empirical methods could help here.  The paper uses Chemdraws chemical shift predictor to predict where a bromine atom will be added to a heteroaromatic molecules using electrophilic aromatic substitution reactions. They tested this on 132 different compounds and achieved an 80% success rate, which is very good.

Googling a bit let me to this paper by Wang and Streitwieser where they show a correlation between the rate of electrophilic aromatic substitution reactions and the lowest proton affinity of the protonated species.  This suggests that the protonated carbon with the lowest proton affinity (or pKa if solvent is included) should be the reacting carbon.  So I tested this using semiempirical QM methods for these 132 compounds.  When I say "I" I should say that +Jimmy Charnley Kromann ran many of the calculations and Monika Kruszyk provided most of the structures as Chemdraw files, which I could convert to SMILES strings using OpenBabel. These are preliminary results and may contain errors.

The reactions for the 132 compounds are not all run in the same solvent, so I first tested gas phase, chloroform (i.e. dielectric 4.8) and DMF (dimethylformamide, dielectric 37) using PM3 and COSMO in MOPAC. I chose PM3/COSMO because that gave the best results in a previous pKa study. The most representative choice of solvent seems to be chloroform, where PM3/COSMO predicted the correct bromination site in 95% of the cases, i.e. it fails for 7 cases. Gas phase and DMF fails for 14 and 8 cases, so it's important to include solvent, but the value of the dielectric constant is not all that important.  Using chloroform as a solvent, I then tested AM1,  PM6, PM6-DH+, PM7 and DFTB3/SMD (using GAMESS for the last one), which resulted in 12, 12, 12, 9, and 13 wrong predictions. One of the compounds includes an Si atom, which the DFTB3 parameter set I used couldn't handle so the 13 wrong predictions is out of 131 compounds.  Anyway, PM3/COSMO/chloroform works best.

In some cases the lowest pKa value is quite close to some of the other pKa values, so I took an approach similar to that of Jørgensen and co-workers: if the correct bromination site is included in the set of atoms with pKa values within 0.74 pH units (corresponding to 1 kcal/mol at room temperature) then I counted it as correct.  For PM3/COSMO/chloroform this occurred 10 times. In 9 cases the set included 2 atoms and in 1 case, 3 atoms.  In one of the 9 cases (15) there are only two possible bromination sites, so this case is not a successful prediction and PM3/COSMO/chloroform actually gets 8 wrong. However, in all other cases there are more possibilities than those predicted. Furthermore, in all but 2 of thes 10 cases the atom with the lowest pKa is the "correct" atom.

Bromination, or more generally, halogenation is often a first step towards adding an aryl group, usually using a Suzuki reaction.  Often there is more than one halogen of the same type so there is also interest in predicting where the aryl group will go.  I tried the PM3/COSMO/chloroform approach on the six molecules in this paper by Houk and co-workers. Computing pKa's of the halogenated carbon atoms let to correct predictions in 4 of the 6 cases, while computing proton affinities of the carbon atoms in the non-halogenated parent compounds let to correct predictions in 2 of the 6 cases. The former approach seems promising but needs to be tested on a much larger set of molecules.

Next step is to write this up and get the set-up and analysis code in such a shape that we can distribute it. I've also started thinking about how to make the approach more generally available and usable for non-experts. A grant proposal is also in the works, so if we're successful that should definitely be possible to achieve.