Alan R. Reich, Ph.D.
Forensic Acoustics Consultant
Office__________________Cell:___________
E-Mail_________________
=============================================
May 9, 2013
Richard Mantei
Assistant State Attorney
220 East Bay Street
Jacksonville, Fl. 32202
__________Office
__________Cell
Dear Mr. Mantei:
May this letter serve as a partial summary of my ongoing aural and
digital acoustical examination of two 911 recordings re: State of Florida
v. George Zimmerman. The supplied recordings were represented as
unredacted digital copies of original digital audio recordings. You
requested that I process and analyze two 911 Dispatch recordings,
hereafter referred to as CALL1 and CALL3. Immediately after receiving
them, I archived the zip-extracted files on magnetic and laser media.
In addition, several other digital recordings were supplied as possible
sources of voice exemplars for George Zimmerman and Trayvon Martin. They
are described briefly in a subsequent section of this summary.
Technical Considerations Regarding the 911 Recordings
The moderate-fidelity 911 recordings presumably were the stereo output
of a 24-hour, digital-audio recording system. The sampling rate of the
911 recordings was only 8,000 samples/sec, compared to the 44,100
samples/sec associated with audio CD quality. The frequency
bandwidth of CALL1 and CALL3 thus was estimated to be only 40 Hz to 4,000
Hz, compared to an audio CD bandwidth of 10 Hz to 22,050 Hz. However,
this high-frequency insensitivity is not particularly troublesome in the
present investigative context, since telephone systems are designed to
be relatively unresponsive to frequencies above 3,500 Hz.
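The upper bandwidth limits quoted above (4,000 Hz and 22,050 Hz) are half of
the respective sampling rates, i.e., the Nyquist limit. A minimal Python
sketch of that arithmetic follows. It is offered only as an illustration;
the function name is hypothetical, and the calculation says nothing about
the 40 Hz and 10 Hz lower limits, which depend on the recording hardware.

```python
# Illustrative only: the highest frequency a sampled signal can represent
# is half its sampling rate (the Nyquist limit).

def nyquist_limit_hz(sample_rate_hz: float) -> float:
    """Return the upper frequency bound for a given sampling rate."""
    return sample_rate_hz / 2.0

call_sample_rate_hz = 8_000    # sampling rate quoted for CALL1 and CALL3
cd_sample_rate_hz = 44_100     # sampling rate quoted for audio CD

call_limit_hz = nyquist_limit_hz(call_sample_rate_hz)
cd_limit_hz = nyquist_limit_hz(cd_sample_rate_hz)

print(f"911 recordings: up to {call_limit_hz:,.0f} Hz")
print(f"Audio CD:       up to {cd_limit_hz:,.0f} Hz")
```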
Audio CD and 911 data-logging recordings both have 16-bit amplitude
resolution, which divides the vertical amplitude scale of the digital
signal into 2^16 = 65,536 amplitude gradations. Although 8-bit amplitude
resolution is attractive for situations requiring small data files, its
vertical scale has only 2^8 = 256 amplitude gradations. The 911-Dispatch
System's 16-bit resolution was critical to the success of this
investigation, in which the recorded signals had a very wide dynamic
range (from very distant speech to softly whispered speech to a single
loud gunshot to several heart-poundingly loud screams).
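The gradation counts above can be reproduced directly, and the practical
difference between 8-bit and 16-bit resolution can be expressed as an ideal
dynamic range of roughly 6 dB per bit. The Python sketch below is
illustrative only. The function names are hypothetical, and the dynamic-range
figure is the textbook ideal for linear quantization, not a measurement of
the 911 logging system.

```python
import math

def quantization_levels(bits: int) -> int:
    """Number of discrete amplitude gradations for a given bit depth."""
    return 2 ** bits

def ideal_dynamic_range_db(bits: int) -> float:
    """Ratio of full scale to one quantization step, in decibels.

    Textbook idealization: 20 * log10(2 ** bits), about 6.02 dB per bit.
    """
    return 20.0 * math.log10(2 ** bits)

for bits in (8, 16):
    levels = quantization_levels(bits)
    range_db = ideal_dynamic_range_db(bits)
    print(f"{bits:2d}-bit: {levels:,} gradations, about {range_db:.1f} dB")
```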
General Structure and Scope of the Present Investigation
In this summary, I will try to: (a) answer general questions regarding
the nature, usefulness, and scope of the materials on the CALL1 and CALL3
recordings, (b) provide some illustrative examples of the approach that I
took to analyze selected words and phrases, (c) discuss the complexities
and obstacles that one encounters when trying to decode highly distorted,
emotionally driven, overlapping speech, and (d) provide an analytic
framework for arriving at trustworthy and perceptually stable
transcriptions and demo recordings of the most difficult-to-understand
speech on the CALL1 and CALL3 wave files.
Nature, Usefulness, and Scope of the Material on the CALL1 and CALL3
Wave Files
CALL1 represents the digital audio record of George Zimmerman's 911 call
to report his seeing a young male whom he thought was acting
suspiciously. The two speakers are Mr. Zimmerman and a male 911
Dispatcher. The fidelity of CALL1 is reasonably good but the recording
has a number of puzzling acoustic anomalies. There are numerous
instances of "nonconforming speech" on CALL1, e.g., whispered speech,
pitch breaks, garbled or unintelligible speech, vocal impressions,
tremulous speech, and very rough voice quality. The observed behaviors
were outside the customary speech modes of both the dispatcher and Mr.
Zimmerman.
Those nonconforming segments indicate that Mr. Zimmerman frequently
shifts or switches voice modes or speaking styles. In his first utterance
on CALL1, he says "or...um...the best...address I can give you is
one-eleven Retreat View Circle." During the four-second utterance, he
shifts from whispered voice to customary voice to detective impression
back to customary voice. At 97 seconds, the voiced but tremulous "These
assholes, they always get away." is preceded by a whispered "Dear God"
and followed by a whispered "but not on me."
Mr. Zimmerman's speech patterns periodically show measurable effects of
psychological stress (e.g., vocal tremor, pitch breaks, rapid speech).
This latter finding is not to be construed necessarily as negative, since
perpetrator pursuits by enforcement officers typically are accompanied
by increased levels of adrenaline and excitatory neurochemicals. In any
case, Mr. Zimmerman's vocal-mode switching behaviors need to be examined
in greater detail and correlated with relevant physical and behavioral
events on both recordings.
CALL3 principally represents the digital audio record of an unidentified
woman caller, a female 911 Dispatcher, and two males involved in a very
loud but somewhat distant confrontation just outside the woman caller's
home. One of the male speakers appears to be George Zimmerman, whose
idiosyncratic "voice-mode switching" behaviors, vocal impressions,
whispering, and tremulous voice are present on both CALL1 and CALL3.
For example, approximately one second after the start of CALL3, Mr.
Zimmerman makes a seemingly religious proclamation, "These shall be."
His speech is characterized by the low pitch and exaggerated pitch
contour reminiscent of an evangelical preacher or carnival barker. The
statement is challenging for the untrained listener to detect as it
occurs simultaneously with Trayvon Martin's loud, high-pitched,
distressed, and tremulous "I'm begging you." and the 911 Dispatcher's
"Nine-one-one." Many of Mr. Zimmerman's "side-bar" utterances are
subject to such multiple-talker masking effects and to low signal levels.
The other male speaker was identified tentatively as Trayvon Martin from
the audio track of a digital video file present on Mr. Martin's cell
phone. His voice is younger and he generates much of what some observers
have called screams. If a scream is defined in operational terms as
speech with a very high pitch and loudness level, then my findings would
support that conclusion. The two males are engaged in a loud,
purposeful, mostly "turn-taking" linguistic dialogue. The speech
associated with the confrontation is often quite difficult to
understand, but is amenable to individualized digital enhancement and
computer-aided transcription, using an interactive, segment-by-segment
approach.
Example of the Analytic and Scientific Approach
It is often helpful in scientific investigations to begin at the end and
work backwards, slogging through the inevitably complex details to
arrive at a more complete understanding of multifaceted physical or
behavioral events. Thus, my investigation began by addressing questions
about the last "scream," the very high-pitched, very loud production of
a single monosyllabic word on the CALL3 wave file.
Speech and Hearing Scientists often characterize speech as a "series of
rapid, complex, overlapping movements that have been made audible." The
"final cry" on the CALL3 recording is the result of very high-effort
speech movements, but, regrettably, the large distance between the
highly distressed talker and the microphone of the 911 caller's phone
markedly attenuates or reduces the speech's amplitude.
Consequently, the resulting sound pressure level of the final male
pre-gunshot utterance is 30.4 decibels (dB) below the Woman Caller's
"Yes." When the amplitude level of the final word before the shot was
digitally gained or amplified by a factor of ten, the word appears to be
"stop," not "help," as previously perceived by some listeners.
Perceptually, the two monosyllabic words are quite similar and easily
confused, especially within the context of a high-effort production.
Nonetheless, digital spectrographic examination of the word's component
frequencies supports a "stop" transcription. On CALL3, the first
Formant or Resonant Frequency of the /a/ vowel in /stap/ is 870 Hz, about
10% above the adult male average. This value is highly appropriate for a
17-year-old male who likely still had 10% more growth remaining before
reaching his "adult-male" vocal-tract length, diameter, and tonicity.
The resonant frequency position (largely related to oral, nasal, and
pharyngeal anatomy), the fundamental frequency location (a physical
measure of pitch related principally to laryngeal anatomy), and glottal
source spectrum (voice quality resulting from the complex, rapid
vocal-fold valving of exhaled lung air) suggest that the speaker had not
completed his hormonally driven anatomical and physiological transition
into adult-male voice production. In addition, the acoustic voice data
are consistent with the audio/video samples extracted from Mr. Martin's
cell phone. They are inconsistent with audio/video samples from Mr.
Zimmerman's crime-simulation video recording and from an audio recording
of a telephone conversation with his wife during his incarceration.
Taken together, the above scientific observations of the recorded
pre-gunshot word allowed me to conclude tentatively that the word was
produced by the younger of the two male speakers, Trayvon Martin. The
scientific data may also explain why some witnesses have characterized
the final utterance as a "boy crying." Of course, the fact that the
speaker of the final word was rendered silent by the weapon's discharge
and George Zimmerman was not, also suggests the identity of the "boy"
who was crying.
To illustrate my analytic approach to these acoustic data, I am
attaching air pressure-versus-time waveforms and corresponding
frequency-versus-time spectrograms (KAY Pentax Multi-Speech) of the
interval that includes and closely surrounds the word "stop." These
acoustical plots and a corresponding wave file comprise the raw speech
interval, followed by the fully processed and enhanced version. The
word "stop" on the raw interval is very soft on the wave demo, very low
in amplitude on the time waveform, and lacking complexity on the
spectrogram.
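The level figures quoted in this section (a 30.4 dB difference and a
factor-of-ten amplification) can be related to one another with the
standard decibel conversions. The Python sketch below is illustrative only.
It assumes that "factor of ten" refers to signal amplitude rather than
power, which the text does not state explicitly, and the function names are
hypothetical.

```python
import math

def db_to_amplitude_ratio(level_db: float) -> float:
    """Convert a level difference in dB to a linear amplitude ratio."""
    return 10.0 ** (level_db / 20.0)

def amplitude_ratio_to_db(ratio: float) -> float:
    """Convert a linear amplitude ratio to a level difference in dB."""
    return 20.0 * math.log10(ratio)

level_gap_db = 30.4   # final utterance relative to the Woman Caller's "Yes"
gain_factor = 10.0    # assumption: an amplitude (not power) gain factor

gap_as_ratio = db_to_amplitude_ratio(level_gap_db)  # about 33 to 1
gain_db = amplitude_ratio_to_db(gain_factor)        # 20 dB
remaining_gap_db = level_gap_db - gain_db           # about 10.4 dB

print(f"30.4 dB corresponds to an amplitude ratio of about {gap_as_ratio:.1f}:1")
print(f"A x10 amplitude gain is {gain_db:.1f} dB")
print(f"Gap remaining after that gain: about {remaining_gap_db:.1f} dB")
```

Under that assumption, a tenfold amplitude gain recovers 20 dB of the
30.4 dB difference.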
Feasibility of Using Global Enhancement Strategies on CALL1 and CALL3
To explore the feasibility of finding a less time-consuming approach to
analyzing CALL1 and CALL3, numerous global digital-enhancement
algorithms (SONY Sound Forge Pro) were applied to the Microsoft Windows
WAV files, with varying degrees of success. Global enhancement
strategies are designed to improve the overall fidelity of a noisy,
distorted, and/or unbalanced recording. In the
present investigation, the enhanced signals often were rendered somewhat
less noisy but the speech intelligibility was compromised or unchanged
rather than improved.
Thank you for allowing me to consult on this interesting case. If you
have questions or need further information, please feel free to call or
write.
DECLARATION
I declare under penalty of perjury under the laws of the State of New
Jersey that the foregoing is true and correct. Dated at Oakland, New
Jersey on May 9, 2013.
/signature/
____________________
Alan R. Reich, Ph. D.
Forensic Acoustics Consultant